{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "# Application of Python to biostatistics\n",
    "\n",
    "1. General biostatistics\n",
    "\n",
    "    1. Loading data\n",
    "    2. Representation of data (population)\n",
    "    3. Sampling\n",
    "    4. Representation of data (sample)\n",
    "    5. Statistical tests\n",
    "    \n",
    "2. Quantitative genetics\n",
    "\n",
    "    1. Marker association\n",
    "    2. Logarithm of odd ratio\n",
    "    3. Allelic association\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "### 1. General biostatistics\n",
    "\n",
    "In this practical, we are going to perform basic operations of statistics using Python: loading data from a file, doing plots, calculating statistical metrics such as mean and perform statistical tests. We will later do statistics in the context of statistical genetics. \n",
    "\n",
    "** Genetics of hypertension **\n",
    "\n",
    "In this exercise, we are going to study data related to the genetics basis of hypertension in rats. Hypertension is medical condition that involves sustained increase of blood pressure following stress and excessive life style. Our biological question is to know if we can find a genetic basis for hypertension. In section 1, we will perform some basic statistic operations, while section 2 will deal with studying the association of hypertension with genetic markers itself. \n",
    "\n",
    "#### A. Loading data\n",
    "\n",
    "##### Files required\n",
    "In your jupyter folder, copy the following files: \n",
    "- phenos.txt\n",
    "- geno_chr1.txt\n",
    "- map_chr1.txt\n",
    "\n",
    "Command **with** is interesting because it allows to execute a command and continue to run the program even if this command was produce an error. It is useful when you open a file because if the file you want to open is not found, the program will continue to run. \n",
    "\n",
    "```python\n",
    "import csv\n",
    "with open('phenos.txt', 'r') as f:\n",
    "    reader = csv.reader(f,delimiter='\\t')\n",
    "    phenos = list(reader)\n",
    "with open('geno_chr1.txt', 'r') as f:\n",
    "    reader = csv.reader(f,delimiter='\\t')\n",
    "    genos = list(reader)\n",
    "```\n",
    "##### First look at the data:\n",
    "\n",
    "It is usually a good idea to have a look at the first items of the lists or array that you create to be sure that things happened the way they should. Other operations that help to know if the data are loaded correctly is to compute the length of the data and a summary that gives an idea of the distribution of the data (including mean, quartiles, and median). The best of all is to do plots. \n",
    "\n",
    "Here, we are going to load two tables: one containing the phenotypes (phenos.txt), which are the values blood pressure measure and the gender of the rats involved (all males), and one containing the genotypes (geno_chr1.txt). Labels such as D1mit13 are the identifiers of the different markers. '1' and '2' are the two alleles found at the genetic marker position. \n",
    "\n",
    "```python\n",
    "print('phenotypes:')\n",
    "print(phenos[0:10])\n",
    "print(len(genos))\n",
    "\n",
    "print('genotypes:')\n",
    "print(genos[0:10])\n",
    "print(len(genos))\n",
    "```\n",
    "##### Questions:\n",
    "What are the dimensions of the two datasets? How many columns and rows in the corresponding tables? \n",
    "How many markers are there in the table? Write a command to display their names. \n",
    "\n",
    "```python\n",
    "print('dimensions of genotype table:')\n",
    "print([len(genos),len(genos[0])])\n",
    "print('dimensions of phenotype table:')\n",
    "print([len(phenos),len(phenos[0])])\n",
    "print('number of markers:')\n",
    "print(len(genos[0]))\n",
    "print(genos[0])\n",
    "```\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "source": [
    "#### B. Exploration of the data\n",
    "\n",
    "##### Importing libraries\n",
    "\n",
    "numpy : for operations on numbers. scipy : will help perform certain computations such as statistical tests. \n",
    "matplotlib and seaborn: plotting data\n",
    "\n",
    "```python\n",
    "%matplotlib inline\n",
    "import numpy as np\n",
    "from scipy import stats, integrate\n",
    "import matplotlib.pyplot as plt\n",
    "import seaborn as sns\n",
    "```\n",
    "##### Computation of the mean  of the population\n",
    "\n",
    "We first transform the data so we have a simple list of floating numbers and we apply the mean function of the library statistics:\n",
    "\n",
    "```python\n",
    "bp = [] #creates a list where we will place the values of blood pressure\n",
    "for k in range(1, 251):\n",
    "    tmp = float(phenos[k][0])\n",
    "    bp.append(tmp)\n",
    "\n",
    "import statistics\n",
    "\n",
    "mn = statistics.mean(bp)\n",
    "print(mn)\n",
    "\n",
    "```\n",
    "\n",
    "##### Question\n",
    "Create a function **my_mean** to compute the mean. Apply the function to your dataset (variable **bp**). Use **return** command to return the result of the computation. \n",
    "\n",
    "```python\n",
    "def my_mean(list):\n",
    "    sumi = 0\n",
    "    for i in range(0,len(list)):\n",
    "        sumi = list[i] + sumi\n",
    "    return sumi/len(list)\n",
    "\n",
    "print(my_mean([1,2,3]))\n",
    "\n",
    "```\n",
    "\n",
    "\n",
    "\n",
    "##### Density distribution\n",
    "\n",
    "Here we are going to plot the density distribution of the values of blood pressure. The curve is based on a histogram of the data so for any value or range of values, this plot gives you the amount of data that can be found in the dataset. \n",
    "\n",
    "```python\n",
    "sns.set(color_codes=True)\n",
    "sns.distplot(phenos[1:250][0], kde=False, rug=True)\n",
    "\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "source": [
    "\n",
    "##### Computation of the standard deviation of the population\n",
    "\n",
    "```python\n",
    "sd = statistics.pstdev(bp)\n",
    "print(sd)\n",
    "```\n",
    "\n",
    "##### Questions\n",
    "Create a function **my_sd** to calculate the standard deviation of the population. Apply the function to your dataset (variable **bp**). Use function **my_mean** to compute the mean. \n",
    "\n",
    "```python\n",
    "def my_sd(list):\n",
    "    sumi = 0\n",
    "    mn = my_mean(list)\n",
    "    for i in range(0,len(list)):\n",
    "        sumi = (list[i] - mn)**2 + sumi\n",
    "    return (sumi/len(list))**(1/2)\n",
    "    #return math.sqrt(sumi/len(list))\n",
    "\n",
    "print(my_sd([1,2,3]))\n",
    "print(statistics.pstdev([1,2,3]))\n",
    "```\n",
    "\n",
    "\n",
    "\n",
    "##### Density distribution\n",
    "\n",
    "Here we are going to plot the density distribution of the values of blood pressure. The curve is based on a histogram of the data so for any value or range of values, this plot gives you the amount of data that can be found in the dataset. \n",
    "We also plot a normal law fitting the data. \n",
    "\n",
    "```python\n",
    "sns.set(color_codes=True)\n",
    "f, axes = plt.subplots(1, sharex=True, sharey=True)\n",
    "sns.distplot(bp, bins=20,kde=False, rug=True)\n",
    "axes.axvline(x=mn, color='k')\n",
    "\n",
    "#another way to plot:\n",
    "f, axes = plt.subplots(1, sharex=True, sharey=True)\n",
    "sns.distplot(bp, bins=20,hist=False, rug=True);\n",
    "#trying to fit to a normal law or a gamma law:\n",
    "axes.axvline(x=mn, color='k')\n",
    "sns.distplot(bp, kde=False, fit=stats.norm);\n",
    "```\n",
    "\n",
    "##### Question: \n",
    "How do you think this normal law fitting to the data is generated?  \n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "source": [
    "#### C. Sampling\n",
    "\n",
    "Here we are going to generate random samples of our population of rats by using library **random**. \n",
    "\n",
    "```python\n",
    "import random\n",
    "s1 = random.sample(bp[1:251],10)\n",
    "print(s1)\n",
    "\n",
    "```\n",
    "\n",
    "\n",
    "```python\n",
    "\n",
    "import random\n",
    "s1 = random.sample(bp[1:251],8) \n",
    "s2 = random.sample(bp[1:251],60)\n",
    "\n",
    "f, axes = plt.subplots(1, sharex=True, sharey=True)\n",
    "sns.distplot(bp,kde=False, fit=stats.norm);\n",
    "axes.axvline(x=mn, color='b')\n",
    "sns.distplot(s1,kde=False, fit=stats.norm);\n",
    "mn1 = statistics.mean(s1)\n",
    "axes.axvline(x=mn1, color='g')\n",
    "\n",
    "f, axes = plt.subplots(1, sharex=True, sharey=True)\n",
    "sns.distplot(bp,kde=False, fit=stats.norm);\n",
    "axes.axvline(x=mn, color='b')\n",
    "sns.distplot(s2, kde=False, fit=stats.norm);\n",
    "mn2 = statistics.mean(s2)\n",
    "axes.axvline(x=mn2, color='g')\n",
    "\n",
    "#sns.distplot(s2,  kde=False, fit=stats.norm);\n",
    "\n",
    "s3 = random.sample(bp[1:251],8)\n",
    "s4 = random.sample(bp[1:251],8)\n",
    "s5 = random.sample(bp[1:251],8)\n",
    "f, axes = plt.subplots(1, sharex=True, sharey=True)\n",
    "sns.distplot(bp,hist=False, kde=False, fit=stats.norm);\n",
    "axes.axvline(x=mn, color='b')\n",
    "sns.distplot(bp, kde=False, fit=stats.norm);\n",
    "sns.distplot(s3,hist=False, kde=False, fit=stats.norm);\n",
    "sns.distplot(s4,hist=False, kde=False, fit=stats.norm);\n",
    "sns.distplot(s5,hist=False, kde=False, fit=stats.norm);\n",
    "\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true,
    "deletable": true,
    "editable": true
   },
   "source": [
    " #### D. Statistical tests\n",
    " We are going to test whether other samples (s1, s2 etc.) are significantly different from the mean of the population (bp). \n",
    " \n",
    " ##### Question\n",
    " What is your null hypothesis H0? What would a p-value above 0.05 tell you? What would a p-value below 0.05 tell you? \n",
    "Do the tests below and give your conclusion each time. Was the conclusion different. Try to draw samples with very low amount of individuals (samples of size 3 for instance) and samples with high number of individuals (100 for instance) and see if the conclusion changes with samples size.  \n",
    "\n",
    " ```python\n",
    " import scipy\n",
    "print(scipy.stats.ttest_ind(bp, s1))\n",
    "print(scipy.stats.ttest_ind(bp, s2))\n",
    "print(scipy.stats.ttest_ind(bp, s3))\n",
    "print(scipy.stats.ttest_ind(bp, s4))\n",
    "print(scipy.stats.ttest_ind(bp, s5))\n",
    "```\n",
    "##### Question\n",
    "Create a function to calculate the statistics t of the t-test and check that the value is identical to the one given by the **stats.ttest_ind** function. \n",
    "\n",
    "```python\n",
    "import math\n",
    "def my_ttest(list1,list2):\n",
    "    n1 = len(list1)\n",
    "    n2 = len(list2)\n",
    "    sp = math.sqrt( ((n1-1)*(my_sd(list1)**2) + (n2-1)*(my_sd(list2)**2)) / (n1 + n2 -2) )\n",
    "    t = (my_mean(list1) - my_mean(list2)) / (sp * math.sqrt((1/n1) + (1/n2)))\n",
    "    return t\n",
    "\n",
    "print(my_ttest(bp, s1))\n",
    "print(scipy.stats.ttest_ind(bp, s1))\n",
    "```\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "source": [
    "### 2. Quantitative genetics\n",
    "\n",
    "#### A. Marker association\n",
    "\n",
    "##### Question\n",
    "\n",
    "Create two lists **geno3** and **geno5**, one containing the data for marker D1mit156 and one containing the data for marker D1mit19. Each list should be of length 250. \n",
    "\n",
    "```python\n",
    "print(genos[0])\n",
    "geno3 = []\n",
    "geno5 = []\n",
    "for k in range(1, 251):\n",
    "    tmp = genos[k][2] # using a list of lists, ones needs to select each k item for the column considered (here, column 3 - which is item number 2 as there is also a column 0 - containing marker D1mit126\n",
    "    geno3.append(tmp)\n",
    "    tmp = genos[k][4]\n",
    "    geno5.append(tmp)\n",
    "print(genos[3])\n",
    "print(genos[5])\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "source": [
    "We first want to represent the data using simple 1D plots:\n",
    "\n",
    "```python\n",
    "print(genos[0])\n",
    "plt.plot(geno5, bp, marker='o', linestyle='') # first marker against blood pressure level\n",
    "```\n",
    "\n",
    "We then calculate means and perform a t-test between the distribution of blood pressure of two alleles of the same marker. This will tell us if the genotype is significantly associated with blood pressure. \n",
    "\n",
    "```python\n",
    "# mean of genotype 1\n",
    "mn1 = []\n",
    "mn2 = []\n",
    "#print(geno5)\n",
    "for k in range(0, 250):\n",
    "    if geno5[k] == '1' :\n",
    "       mn1.append(bp[k])  \n",
    "    if geno5[k] == '2' :\n",
    "       mn2.append(bp[k])  \n",
    "print(statistics.mean(mn1))\n",
    "print(statistics.mean(mn2))\n",
    "\n",
    "# t.test\n",
    "scipy.stats.ttest_ind(mn1, mn2)\n",
    "\n",
    "```\n",
    "##### Questions\n",
    "what is your H0? \n",
    "Do you find a relationship between genotype and phenotype for this marker? \n",
    "\n",
    "Create a function **marker_assoc** to perform the test for all markers and report only the ones which were significant. \n",
    "\n",
    "```python\n",
    "def marker_assoc(list):\n",
    "    statm = []\n",
    "    for i in range(0, 22):\n",
    "        p1 = []\n",
    "        p2 = []\n",
    "        for j in range(1, 249):\n",
    "            if genos[j][i] == '1' :\n",
    "                p1.append(bp[j])  \n",
    "            if genos[j][i] == '2' :\n",
    "                p2.append(bp[j])  \n",
    "        tt = scipy.stats.ttest_ind(p1, p2)\n",
    "        statm.append(tt)\n",
    "    return statm\n",
    "\n",
    "print(marker_assoc(genos))\n",
    "\n",
    "```\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "#### B. Logarithm of odd ratio (LOD score)\n",
    "\n",
    "```python\n",
    "# calculate recombination fraction (\"theta\") for D1mit156 and D1mit19.\n",
    "with open('map_chr1.txt', 'r') as f:\n",
    "    reader = csv.reader(f,delimiter='\\t')\n",
    "    map = list(reader)\n",
    "\n",
    "import math\n",
    "import numpy\n",
    "\n",
    "# Calculation of Haldane's recombination fraction (\"theta\"): P(recombination) = P(X odd) = (1/2) (1- exp(-2d)).\n",
    "theta = 0.5*(1-math.exp(-2*(float(map[4][0])-float(map[2][0]))))\n",
    "\n",
    "```\n",
    "##### Question\n",
    "What is the expected proportion of each combination of genotype (marker1:1-marker2:1; m1:1-m2:2; m1:2-m2:1; m1:2-m2:2) following the breeding of rats if there was no recombination? Compute the number of possible recombinant genotypes that destabilize these proportions\n",
    "\n",
    "m1:1-m2:1 25%\n",
    "m1:1-m2:2 25%\n",
    "m1:2-m2:1 25%\n",
    "m1:2-m2:2 25%\n",
    "\n",
    "```python\n",
    "m11m21=0\n",
    "m12m21=0\n",
    "m12m22=0\n",
    "m11m22=0\n",
    "for k in range(0, 250):\n",
    "    if geno5[k] == '1' and geno3[k] == '1':\n",
    "       m11m21=m11m21+1\n",
    "    if geno5[k] == '2' and geno3[k] == '2':\n",
    "       m12m22=m12m22+1\n",
    "    if geno5[k] == '1' and geno3[k] == '2':\n",
    "       m11m22=m11m22+1\n",
    "    if geno5[k] == '2' and geno3[k] == '1':\n",
    "       m12m21=m12m21+1\n",
    "\n",
    "print(m11m21)\n",
    "print(m12m21)\n",
    "print(m12m22)\n",
    "print(m11m22)\n",
    "\n",
    "#111\n",
    "#5\n",
    "#117\n",
    "#16\n",
    "# with 5 the minimal number, we can decide that a 5+5+5+5=20 are non-recombinant (NR) and 106+112+11 are recombinant (R). \n",
    "NR = 20\n",
    "R = 229\n",
    "\n",
    "# Computation of LOD score: \n",
    "lod = numpy.log10( ((1-theta)**NR * theta**R) / (0.5**(NR+R)) )\n",
    "print(lod)\n",
    "\n",
    "```\n",
    "##### Question\n",
    "Create a loop to do the computation for all the pairs n,n+1, eventually using a function **LOD_score**. Classify the results according to value of LOD score. What is the best value you get? \n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "source": [
    "#### C. Allelic association\n",
    "\n",
    "Here, we need to divide the data into several categories in order to build the tables of observed and expected values. The first dimension will be the genotype, with values '1' or '2'. Second dimension will be the phenotype, i.e. the blood presure. To achieve the table, we will divide the data on blood pressure into three categories: value lower than 95, equal or greater to 95 and lower than 105, and finally the values equal or greater than 105. \n",
    "\n",
    "```python\n",
    "import math\n",
    "from scipy.stats import chisquare\n",
    "\n",
    "# observed distribution\n",
    "\n",
    "mat = [[0, 0,0], [0, 0, 0]]\n",
    "print(mat[0][0])\n",
    "cnt = 0\n",
    "for k in range(0, 250):\n",
    "    if bp[k] < 95:\n",
    "        if geno3[k] == '1':\n",
    "            mat[0][0] = mat[0][0] + 1\n",
    "        if geno3[k] == '2':\n",
    "            mat[1][0] = mat[1][0] + 1\n",
    "    if (bp[k] > 95 or  bp[k] == 95) and bp[k] < 105 :\n",
    "        if geno3[k] == '1':\n",
    "            mat[0][1] = mat[0][1] + 1\n",
    "        if geno3[k] == '2':\n",
    "            mat[1][1] = mat[1][1] + 1\n",
    "    if bp[k] > 105 or bp[k] == 105 :\n",
    "        if geno3[k] == '1':\n",
    "            mat[0][2] = mat[0][2] + 1\n",
    "        if geno3[k] == '2':\n",
    "            mat[1][2] = mat[1][2] + 1\n",
    "#    cnt = cnt + 1\n",
    "\n",
    "cnt = mat[0][0] + mat[1][0] + mat[0][1] + mat[1][1] + mat[0][2] + mat[1][2]        \n",
    "\n",
    "print(mat)\n",
    "print(cnt)\n",
    "```\n",
    "##### Question\n",
    "\n",
    "Build the matrix **matE** of expected values. For this, first define two list: one with the values of the sum of columns, and one with the values of the sum of rows. Then build the matrix of expected values: for each value i,j, multiply the sum of the line by the sum of column and divide by sum of all items, i.e. 250. \n",
    "\n",
    "```python\n",
    "\n",
    "# expected distribution\n",
    "# matrix with total of the lines and total of the columns\n",
    "sumcol = [mat[0][0] + mat[1][0], mat[0][1] + mat[1][1], mat[0][2] + mat[1][2]]\n",
    "sumrow = [mat[0][0] + mat[0][1] + mat[0][2], mat[1][0] + mat[1][1] + mat[1][2]]\n",
    "#print(sum(sumcol))\n",
    "matE = [[sumrow[0] * sumcol[0] / sum(sumcol),\n",
    "         sumrow[0] * sumcol[1] / sum(sumcol),\n",
    "         sumrow[0] * sumcol[2] / sum(sumcol)],\n",
    "         [sumrow[1] * sumcol[0] / sum(sumcol),\n",
    "         sumrow[1] * sumcol[1] / sum(sumcol),\n",
    "         sumrow[1] * sumcol[2] / sum(sumcol)]]\n",
    "print(matE)\n",
    "```\n",
    "##### Pearson Chi-Square test\n",
    "```python\n",
    "# validation with the imbed chisquare function of the math library:\n",
    "chisquare(mat,matE, axis=None)\n",
    "#help(chisquare)\n",
    "```\n",
    "\n",
    "##### Questions\n",
    "What do the outputs of the test mean? What do you conclude regarding the possible association of the two genes you are considering with the phenotype? \n",
    "Create a function to compute the Chi-square statistics X2. Compare to the value given by the chisquare function used previously.\n",
    "\n",
    "```python \n",
    "# def chisqu(table):\n",
    "X2 = 0\n",
    "print(len(mat))\n",
    "for i in range(0,len(mat)):\n",
    "    for j in range(0,len(mat[i])):\n",
    "        X2 = X2 + ((mat[i][j] - matE[i][j])**2)/ matE[i][j]\n",
    "print(X2)\n",
    "```\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true,
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.0"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}