{ "cells": [ { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "data": { "application/javascript": [ "IPython.notebook.set_autosave_interval(120000)" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "Autosaving every 120 seconds\n" ] } ], "source": [ "%autosave 120\n", "import numpy as np\n", "import pandas as pd\n", "%matplotlib notebook\n", "import matplotlib.pyplot as plt\n", "from IPython.display import display, Markdown, clear_output\n", "import ipywidgets as widgets" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Bayesian reasoning\n", "\n", "Through these exercices, we will see how we can make probabilistic statements using the Bayes' rule. These examples will allow us to work on the concepts of prior, likelihood and posterior probabilities." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Exercise 1 (Bayesian Data Analysis, Gelman et al, Chapter 1)\n", "\n", "Humans male have one X-chromosome and one Y-chromosome, wheras females have two X-chromosomes, each chromosome being inherited from one parent. Hemophilia is a disease that exhibits X-chromosome-linked recessive inheritance, meaning that a male who inherits the gene that causes the disease on the X-chromosome is affected, while a female carrying the gene on only one of her two X-chromosomes is not affected.\n", "\n", "Let's consider a woman who has an affected brother and an unaffected father. This implies that her mother carries the hemophilias gene with one \"good\" and one \"bad\" gene. Let's consider the random variable $\\theta$ describing the state of the woman (carrier or not carrier). \n", "\n", "**1) Give the prior distribution of $\\theta$.**\n", "\n", "The unknown quantity of interest, the state of the woman, has two values. Either she's a carrier ($\\theta = 1$) or she's not ($\\theta = 0$). Based on the information provided so far, the **prior distribution** for $\\theta$ can be expressed as Pr($\\theta=0$)=Pr($\\theta=1$)= $\\frac{1}{2}$.\n", "\n", "**2)** We are told that the woman has two sons, neither of whom is affected. We consider the random variable $y_i = 1, 0$ which denotes if the son number i is affected or not. The outcomes of the two sons are exchangeable, and conditional on the unknown $\\theta$ are indepedent. We'll denote the data $(y_1, y_2)$ as $y$.\n", "\n", "**Relying on this information, derive the posterior probability of the woman being affected.**\n", "\n", "This problem amounts at inferring the quantity Pr($\\theta$=1|y).\n", "\n", "\n", "\\begin{align}\n", "Pr(y_1 = 0, y_2 = 0 | \\theta=0) & = Pr(y_1 = 0 | \\theta = 0)Pr(y_2 = 0 | \\theta = 0) = 1*1= 1 \\\\\n", "Pr(y_1 = 0, y_2 = 0 | \\theta=1) & = Pr(y_1 = 0 | \\theta = 1)Pr(y_2 = 0 | \\theta = 1) = 0.5*0.5= 0.25\n", "\\end{align}\n", "\n", "\n", "We use Bayes' rule to combine the information given by the data and the prior probability in order to update our knowledge on the state of the woman. We are interested in the probability of the woman to be a carrier. \n", "\n", "\n", "\\begin{align}\n", "Pr(\\theta=1 | y) & = \\frac{Pr(y|\\theta=1)Pr(\\theta=1)}{Pr(y)} \\\\\n", "& = \\frac{Pr(y|\\theta=1)Pr(\\theta=1)}{Pr(y | \\theta = 0)Pr(\\theta = 0) + Pr(y | \\theta = 1)Pr(\\theta = 1)} \\\\\n", "& = \\frac{0.25*0.5}{1*0.5 + 0.25*0.5} \\\\\n", "& = 0.20\n", "\\end{align}\n", "\n", "\n", "Intuitively, we could imagine that if a woman has two unaffected sons, it is less probable that she's a carrier of the gene. Bayes' rule allowed us here to formalize this idea, and also quantify the correction to apply to our prior knowledge.\n", "\n", "**3)** Let's suppose that the woman has a third son who is also unaffected. \n", "\n", "**What is the new posterior probability $Pr(\\theta=1|y_1, y_2, y_3)$ ?**\n", "\n", "\n", "\n", "\\begin{align}\n", "Pr(\\theta=1|y_1, y_2, y_3) & = \\frac{Pr(y_1, y_2, y_3, \\theta=1)}{Pr(y_1,y_2,y_3)} \\\\\n", "& = \\frac{Pr(y_3| \\theta=1, y_1, y_2)Pr(\\theta=1|y_1, y_2)Pr(y_1, y_2)}{Pr(y_3|y_1, y_2)Pr(y_1,y_2)} \\\\\n", "& = \\frac{Pr(y_3| \\theta=1)Pr(\\theta=1|y_1, y_2)}{Pr(y_3 | y_1, y_2)} \\\\ \n", "& = \\frac{Pr(y_3| \\theta=1)Pr(\\theta=1|y_1, y_2)}{Pr(y_3 | \\theta = 0, y_1, y_2)Pr(\\theta = 0 | y_1, y_2) + Pr(y_3 | \\theta = 1, y_1, y_2)Pr(\\theta = 1 | y_1, y_2)} \\\\\n", "& = \\frac{0.5*0.20}{1*0.8 + 0.5*0.2} \\\\\n", "& = 0.111\n", "\\end{align}\n", "\n", "\n", "Through this example, we saw that one of the advantages of Bayesian reasoning is that it allows to perform analysis on sequential data. In this case, we didn't need to redo all the calculation, but we rather used the posterior distribution as our new prior." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Exercise 2 (Bayesian Data Analysis, Gelman et al, Chapter 1)\n", "\n", "Approximately 1/125 of all births are fraternal twins and 1/300 of all births are identical twins. Elvis Presley had a twin brother (who died at birth). \n", "\n", "**What is the probability that Elvis was an identical twin ? We will approximate the probability of a boy or a girl birth as 1/2.**\n", "\n", "In this exercise, we are interested in the quantity Pr(Identical twin | Twin brothers). We can use the fact that Elvis had a twin brother, in order to update our prior information. Deriving Bayes' rule we can write:\n", "\n", "\n", "\\begin{align}\n", "Pr(\\textrm{Identical twin | Twin brothers}) = \\frac{Pr(\\textrm{Twin brothers | Identical Twin})Pr(\\textrm{Identical Twin})}{Pr(\\textrm{Twin brothers})}\n", "\\end{align}\n", "\n", "\n", "We have:\n", "\n", "\n", "\\begin{align}\n", "Pr(\\textrm{Twin brothers}) & = Pr(\\textrm{Twin brothers, Identical Twins}) + Pr(\\textrm{Twin brothers, Fraternal Twins}) \\\\\n", "& = Pr(\\textrm{Twin brothers | Identical Twins})Pr(\\textrm{Identical Twins}) + Pr(\\textrm{Twin brothers | Fraternal Twins})Pr(\\textrm{Fraternal Twins}) \\\\\n", "& = 0.5*\\frac{1}{300} + 0.25*\\frac{1}{125} \\\\\n", "& = 0.0037\n", "\\end{align}\n", "\n", "\n", "So we obtain:\n", "\n", "\n", "\\begin{align}\n", "Pr(\\textrm{Twin brothers}) = \\frac{0.5*\\frac{1}{300}}{0.0037} = 0.45\n", "\\end{align}\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Exercise 3 (Bayesian Data Analysis, Gelman et al, Chapter 2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "An early study on *placenta previa*, a condition of pregnancy, found that on a sample of 980 births, 437 were females. We also know that the proportion of female births in the general population is of 0.485. We will denote by $\\theta$ the probability of a female birth when the mother is suffering from *placenta previa*. We will assume a prior distribution p($\\theta$) = Beta($\\alpha$, $\\beta$). \n", "\n", "**1) Write the data likelihood.**\n", "\n", "Let denote by y the random variable describing the number of female births in the observed population. We assume these births to be independent and it is natural to consider a binomial distribution for y|$\\theta$.\n", "Thus the data likelihood is given by: p(y=k|$\\theta$) = $\\binom{n}{k}\\theta^{k}(1-\\theta)^{n-k}$, with n=980 and k=437.\n", "\n", "**2) Give the posterior probability of the number of births $\\theta$ (up to a constant).**\n", "\n", "We want to infer the quantity p($\\theta$|y). By making use of Bayes' rule we have that:\n", "\n", "\n", "\\begin{align}\n", "p(\\theta|y) & \\sim p(y|\\theta)p(\\theta) \\\\\n", "& \\sim \\theta^{\\alpha+k-1}(1-\\theta)^{n+\\beta-k-1} \\\\\n", "& \\sim Beta(\\alpha+k, n+\\beta-k)\n", "\\end{align}\n", "\n", "\n", "**3) How much evidence this data provide for the claim that the proportion of female births is below 0.485, the proportion of females in the general population ? You'll be summarizinfg information about the posterior distribution using statistics such as the median or posterior intervals.**" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " alpha beta prior mean post median \\\n", "0 1 1 0.500000 0.446185 \n", "1 1 5 0.166667 0.445334 \n", "2 1 10 0.090909 0.440611 \n", "3 1 50 0.019608 0.424442 \n", "4 5 1 0.833333 0.449191 \n", "5 5 5 0.500000 0.446806 \n", "6 5 10 0.333333 0.442896 \n", "7 5 50 0.090909 0.427012 \n", "8 10 1 0.909091 0.449772 \n", "9 10 5 0.666667 0.447829 \n", "10 10 10 0.500000 0.446870 \n", "11 10 50 0.166667 0.429480 \n", "12 50 1 0.980392 0.471718 \n", "13 50 5 0.909091 0.469706 \n", "14 50 10 0.833333 0.469860 \n", "15 50 50 0.500000 0.451621 \n", "\n", " 95% post CI \n", "0 [0.42057100019342697, 0.471042786834759] \n", "1 [0.41905288913682137, 0.4701515101661711] \n", "2 [0.4168034968523966, 0.46926431026855464] \n", "3 [0.40171781944971713, 0.45068385153723556] \n", "4 [0.426289197693183, 0.47286154945429826] \n", "5 [0.41995640334248435, 0.4700031722562804] \n", "6 [0.41769056823097406, 0.46910300795109927] \n", "7 [0.4008635970448561, 0.45035951241392264] \n", "8 [0.42608148808601887, 0.4771542125581683] \n", "9 [0.42371381294176236, 0.4755202390678374] \n", "10 [0.4216987472222824, 0.4728514463987078] \n", "11 [0.4033121085566903, 0.4538394876399772] \n", "12 [0.44664951656259694, 0.49571423536934833] \n", "13 [0.4461949137027948, 0.49362043574420744] \n", "14 [0.44641410132157283, 0.4953501020897601] \n", "15 [0.4278379251616576, 0.48027590543541343] \n" ] } ], "source": [ "n =980\n", "y = 437\n", "\n", "alpha_prior = [1,5,10,50]\n", "beta_prior = [1,5,10,50]\n", "pairs = [[ai, bi] for ai in alpha_prior for bi in beta_prior]\n", "\n", "results = []\n", "\n", "for a,b in pairs:\n", " alpha = y + a\n", " beta = n + b - y\n", " sample_posterior = np.random.beta(alpha, beta, size=500) \n", " sample_stats = np.quantile(sample_posterior, [0.05, 0.5, 0.95])\n", " results.append([a,b, a/(a+b), sample_stats, [sample_stats, sample_stats]])\n", "\n", "print(pd.DataFrame(results, columns=['alpha', 'beta', 'prior mean', 'post median','95% post CI']))\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The first thing to observe is that the posterior is not really sensitive to the prior when $\\alpha$ and $\\beta$ are small. In these cases, we observe that the 95% posterior interval never contains the value 0.485. When $\\alpha$ and (or) $\\beta$ become large enough the prior starts to affect the posterior inference and shifts the posterior interval towards the prior distribution. However, it is interesting to notice that for $\\alpha$ = 50 and $\\beta$ = 50, we have a prior mean of 0.5, close to the expectation we could have about the ratio of boys and girls births without having any knowledge. And eventhough the prior impacts the posterior inference we see that the value 0.485 is not contained in the posterior interval, supporting the fact that the *placenta previa* condition affects the ratio of boys/girls births." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Probability assignment (Bayesian Data Analysis, Gelman et al, Chapter 1)\n", "\n", "This exercise aims at showing how probabilities can be assigned starting from a set of subjective assessments.\n", "We will see how this can be done by first relying only on observed data. Then we will see how we can build a simple parametric model based on this empirical evidence." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "football_dataset_path = '/user/cabinade/home/Desktop/Bayesian_learning_course/bayesian-learning-git/Exercises/football_dataset.txt'\n", "data = pd.read_csv(football_dataset_path, index_col=False, header=0, sep=\",\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Football experts provide a *point spread* for every football game as a measure of the difference in ability between two teams. For instance, team A might be a 4-point favorite to defeat team B. This means that $p(team \\ A \\ wins \\ by \\ more \\ than \\ 4 \\ points) = \\frac{1}{2}$. The football dataset provides the point spread and actual game outcome for professional football games played between 1981 and 1984." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true, "scrolled": false }, "outputs": [], "source": [ "outcome = np.array(data['favorite'] - data['underdog']) \n", "point_spread = np.array(data['spread'])\n", "\n", "plt.figure(figsize=(6,4))\n", "plt.scatter(point_spread + 0.2*np.random.rand((point_spread.shape)) - 0.1,\n", " outcome + 0.4*np.random.rand((outcome.shape)) - 0.2, s=3)\n", "plt.xlabel('Point spread')\n", "plt.ylabel('Outcome')\n", "plt.title('Outcome VS Point spread')\n", "plt.show()\n", "print('Number of games in dataset = ' + str(len(outcome)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Assigning probabilities based on observed frequencies\n", "\n", "It is of interest to assign probabilities to particular events. A first and natural approach can be to rely on the data that's been gathered to obtain empirical estimates.\n", "\n", "**1) Compute:**\n", "\n", "- **P1 = Pr(Favorite wins)**\n", "- **P2 = Pr(Favorite wins | point spread = 3.5)**\n", "- **P3 = Pr(Favorite wins by more than the point spread)**\n", "- **P4 = Pr(Favorite wins by more than the point spread | point spread = 3.5)**\n", "\n", "We will consider a tied game as one-half win and one-half loss. We will also ignore games without any favorite (point spread = 0)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "n_favorite_wins = 0\n", "for out, ps in zip(outcome, point_spread):\n", " if ps != 0 and out > 0 :\n", " n_favorite_wins+=1\n", " elif ps != 0 and out==0:\n", " n_favorite_wins+=0.5\n", "\n", "#np.count_nonzero(point_spread): Counts only game in which there is a favorite\n", "\n", "P1 = n_favorite_wins / np.count_nonzero(point_spread)\n", "\n", "print('P1 = ' + str(P1))" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "i=0\n", "j=0\n", "\n", "for out, ps in zip(outcome, point_spread):\n", " if ps==3.5 and out>0:\n", " i+=1\n", " if ps==3.5:\n", " j+=1\n", " \n", "P2 = i / j\n", "print('P2 = ' + str(P2))\n", "\n", "i=0\n", "j=0\n", "\n", "for out, ps in zip(outcome, point_spread):\n", " if ps>0 and out>ps:\n", " i+=1\n", " if ps!=0:\n", " j+=1\n", "\n", "P3 = i / j \n", "print('P3 = ' + str(P3))\n", "\n", "i=0\n", "j=0\n", "\n", "for out, ps in zip(outcome, point_spread):\n", " if ps==3.5 and out>ps:\n", " i+=1\n", " if ps==3.5:\n", " j+=1\n", " \n", "P4 = i / j\n", "print('P4 = ' + str(P4))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Most of these probabilities seem reasonable as they are based on the knowledge of football fans.\n", "\n", "**2) Compute the following probabilities and comment the results: **\n", "\n", "- **P5 = Pr(Favorite wins | point spread = 8.5)**\n", "- **P6 = Pr(Favorite wins | point spread = 9)**" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "i=0\n", "j=0\n", "\n", "for out, ps in zip(outcome, point_spread):\n", " if ps==8.5 and out>0:\n", " i+=1\n", " if ps==8.5:\n", " j+=1\n", " \n", "P5 = i / j\n", "print('Number of wins of 8.5-point favorite: ' + str(i))\n", "print('Number of games with 8.5-point favorite: ' +str(j))\n", "print('P5 = ' + str(P5))\n", "\n", "i=0\n", "j=0\n", "\n", "for out, ps in zip(outcome, point_spread):\n", " if ps==9 and out>0:\n", " i+=1\n", " if ps==9:\n", " j+=1\n", " \n", "P6 = i / j\n", "print('Number of wins of 9-point favorite: ' + str(i))\n", "print('Number of games with 9-point favorite: ' +str(j))\n", "print('P6 = ' + str(P6))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "These probabilities tell us that a 8.5-point favorite is more likely to win than a 9-point favorite, which is counterintuitive. This is due to the small sample size of games with 8.5-point favorites. This limitation in probabilities assignements based on observed frequencies motivates the use of parametric models." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## A parametric model for the difference between outcome and point spread\n", "\n", "The graph belows shows the difference between a game outcome and the point spread, plotted against the point spread.\n", "Let's denote by y the outcome of a game and x its point spread." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "y = np.array(data['favorite'] - data['underdog'])\n", "x = np.array(data['spread'])\n", "z = y - x\n", "plt.figure(figsize=(6,4))\n", "plt.scatter(x + 0.2*np.random.rand((x.shape)) - 0.1, \n", " z + 0.4*np.random.rand((z.shape)) - 0.2, s=3)\n", "\n", "plt.xlabel('x')\n", "plt.ylabel('z = y - x')\n", "plt.title('z vs x')\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**3) Plot the histogram of z, and the approximated Gaussian distribution of z|x.**" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true, "scrolled": false }, "outputs": [], "source": [ "from scipy.stats import norm\n", "sample_mean = np.mean(z)\n", "sample_std_dev = np.std(z)\n", "print(\"Sample mean = \" + str(sample_mean))\n", "print(\"Sample std dev = \" + str(sample_std_dev))\n", "dist = norm(loc=sample_mean, scale=sample_std_dev)\n", "#plotting histogram and approximated Gaussian\n", "plt.figure()\n", "n, bins, patches = plt.hist(z, 50, density=True, facecolor='g', alpha=0.75)\n", "plt.scatter(z, dist.pdf(z), color='black')\n", "\n", "plt.xlabel('z = y - x')\n", "plt.ylabel('Probability')\n", "plt.grid(True)\n", "\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "The plot of z vs x suggests that it may be reasonable to model the distribution of z as independent of x. The histogram above shows the empirical distribution of z with a fitted normal distribution plotted over it. This graph suggests that it may be acceptable to approximate the distribution of the random variable z as a Normal distribution of mean $\\mu = 0.22$ and standard deviation of $\\sigma = 13.7$. For the rest of this exercise, we will assume z to follow a Gaussian distribution with these sample mean and standard deviation, and to be independent of x. So we can write: $z|x \\sim \\mathcal{N}(\\mu, \\sigma^{2})$. \n", "
\n", "
\n", "This model is not perfect as it does not exactly fit the data, and describes continuous-valued quantites while game scores or point-spreads or discrete. However, such a model provides an approximation that can be used to assign probabilities to events. Indeed:\n", "\n", "P(y > 0 | x) = P(z+x > 0 | x) = P(z > -x | x) = 1 - P(z < -x | x)\n", "\n", "As z follows a Gaussian distribution, we can easily compute its Cumulative Distribution Function (CDF).\n", "\n", "**4) Making use of the approximated distribution of z|x, compute the following probabilities:**\n", "\n", "- **P7 = Pr(Favorite wins | point spread = 3.5)**\n", "- **P8 = Pr(Favorite wins | point spread = 8.5)**\n", "- **P9 = Pr(Favorite wins | point spread = 9)**\n", "\n", "We derive the calculation for P7:\n", "\n", "$$\\textrm{P7} = \\textrm{Pr(Favorite wins | point spread} = 3.5) = \\textrm{Pr}(y > 0| x=3.5) = \\textrm{Pr}(z+x>0|x=3.5) = \\textrm{Pr}(z>-x | x=3.5) = 1-\\textrm{Pr}(z<-x|x=3.5).$$" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "P7 = 1 - dist.cdf(-3.5)\n", "P8 = 1 - dist.cdf(-8.5)\n", "P9 = 1 - dist.cdf(-9)\n", "\n", "print(\"P7 = \" + str(P7))\n", "print(\"P8 = \" + str(P8))\n", "print(\"P9 = \" + str(P9))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We see that the probability for a 3.5-point favorite agrees with the empirical value given earlier,\n", "while the probabilities for 8.5- and 9-point favorites make more sense than the empirical values derived before." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Posterior inference (Bayesian Data Analysis, Gelman et al, Chapter 2)\n", "\n", "This exercise illustrates how to do posterior inference using standard probability distributions introduced in the class.\n", "\n", "Suppose you have a Beta(4,4) prior distribution on the probability $\\theta$ that a coin will yield a \"head\" when spun in a specified manner. The coin is independently spun ten times, and \"heads\" appear fewer than 3 times. You don't know how many heads were seen, but only that their number is less than 3. We will denote by y the random variable giving the number of heads obtained after the 10 throws.\n", "\n", "**1) Write the prior probability distribution of $\\theta$ and the conditional y|$\\theta$.**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "$\\theta$: Probability that a coin will yield a \"head\", $p(\\theta) \\propto \\theta^{3}(1-\\theta)^{3}.$\n", "\n", "y: Random variable giving the number of heads over the n=10 independent throws, $p(y=k|\\theta) = \\binom{n}{k}\\theta^{k}(1-\\theta)^{n-k}.$\n", "\n", "**2) Write the data likelihood.**\n", "\n", "\n", "\\begin{align}\n", "p(y<3|\\theta) = p(y=0|\\theta) + p(y=1|\\theta) + p(y=2|\\theta) = (1-\\theta)^{10} + 10\\theta(1-\\theta)^{9} + 45\\theta^{2}(1-\\theta)^{8}\n", "\\end{align}\n", "\n", "\n", "**3) Calculate the posterior density of $\\theta$ (up to a constant).**\n", "\n", "We want to update our knowledge on the probability $\\theta$ to get a \"head\" knowing that over 10 throws there were less than 3 heads. In other words, we want to infer the quantity: $p(\\theta|y<3)$. By simply making use of the Bayes' rule we can write $p(\\theta|y<3) \\propto p(y<3|\\theta)p(\\theta)$, which gives:\n", "\n", "\n", "\\begin{align}\n", "p(\\theta|y<3) = \\theta^{3}(1-\\theta)^{13} + 10\\theta^{4}(1-\\theta)^{12} + 45\\theta^{5}(1-\\theta)^{11}\n", "\\end{align}\n", "\n", "\n", "**4) Plot the posterior distribution of $\\theta$ (up to a constant).**" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "theta = np.linspace(0,1,100)\n", "posterior_theta = theta**3*(1-theta)**13 + 10*theta**4*(1- theta)**12 + 45*theta**5*(1 - theta)**11\n", "\n", "plt.figure()\n", "plt.plot(theta, posterior_theta)\n", "plt.xlabel(\"theta\")\n", "plt.title(\"Posterior distribution of theta (up to a constant)\")\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Predictive prior distribution (Bayesian Data Analysis, Gelman et al, Chapter 2)\n", "\n", "In this exercise, we show how we can incorporate all the information we have about the parameters of an experiment, in order to derive a predictive prior over the results of this experiement." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let y be the number of 6's in 1000 independent rolls of a particular real die, which may be unfair. Let $\\theta$ be the probability that the die lands on 6. We assume the following prior distribution for $\\theta$:\n", "\n", "\n", "\\begin{align}\n", "Pr(\\theta = \\frac{1}{12}) & = 0.25 \\\\\n", "Pr(\\theta = \\frac{1}{6}) & = 0.5 \\\\\n", "Pr(\\theta = \\frac{1}{4}) & = 0.25.\n", "\\end{align}\n", "\n", "\n", "**1) Using the normal approximation, give the conditional distribution p(y|$\\theta$).**\n", "\n", "In order to approximate p(y|$\\theta$) by a normal distribution we need to estimate its mean and variance.\n", "We know that the experiment consists in 1000 independent rolls of the dice, so we could model p(y|$\\theta$) by a binomial distribution. This distribution has a mean $\\mu = n*\\theta$, where n is the number of trials and a variance $\\sigma^{2} = n*\\theta(1-\\theta)$. Therefore we can compute estimates of the mean and variance of y for the different values of $\\theta$. In the case of $\\theta$ = 0.25 for instance, we would have $\\mu = 1000 * 0.25 = 250$, and $\\sigma^{2} = 1000 * 0.25 * 0.75 = 188$. So we obtain, p(y|$\\theta$=0.25) $\\sim \\mathcal{N}(250, 188)$.\n", "\n", "**2) Give an approximate prior distribution for p(y) and plot it.**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Using the law of total probability, the prior distribution p(y) can be written:\n", "\n", "\n", "\\begin{align}\n", "p(y) = \\sum_i p(y, \\theta = p_i) = \\sum_i p(y|\\theta = p_i)p(\\theta = p_i)\n", "\\end{align}\n", "\n", "\n", "Making use of the previous result we can compute approximate distributions for the different values of $\\theta$.\n", "The approximate prior predictive distribution for y is:\n", "\n", "\n", "\\begin{align}\n", "p(y) = \\frac{1}{4}\\mathcal{N}(83 ; 76) + \\frac{1}{2}\\mathcal{N}(167 ; 139) + \\frac{1}{4}\\mathcal{N}(250 ; 188)\n", "\\end{align}\n", "" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "y = np.arange(50,300)\n", "pred_prior_y = 0.25*norm.pdf(y, 83, np.sqrt(76)) + 0.5*norm.pdf(y, 167, np.sqrt(139)) + 0.25*norm.pdf(y, 250, np.sqrt(188))\n", "\n", "plt.figure()\n", "plt.plot(y, pred_prior_y)\n", "plt.xlabel('y')\n", "plt.ylabel('density')\n", "plt.title('Approximate predictive prior distribution')\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**3) Give approximate 5%, 25%, 50%, 75%, 95% points for the distribution of y.**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The prior predicitve distribution of y is a sum of three Gaussian distributions. Moreover, we can see on the plot above that there is almost no overlap between them. So we can consider that 1/4 of the distribution is in the first hump, 1/2 in the second one, and 1/4 in the last one. We introduce the following probability distributions: $y_1 \\sim \\mathcal{N}(\\mu = 83, \\sigma^2 = 76)$, $y_2 \\sim \\mathcal{N}(\\mu = 167, \\sigma^2 = 139)$, $y_3 \\sim \\mathcal{N}(\\mu = 250,\\sigma^2 = 188)$.\n", "\n", "We also recall that, if we consider a random variable X with a CDF $F_X$, the $\\alpha$% points for the distribution of X is defined by:\n", "\n", "$$P(X < x) = \\alpha \\iff F_X(x) = \\alpha \\iff x = F_{X}^{-1}(\\alpha).$$\n", "\n", "Relying on this, we observe that the 5% points for y are completely contained in the first hump. However this hump represents only 1/4 of the data. Therefore, the 5% points for y are given by the 20% points of $y_1$." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "dist1 = norm(83, np.sqrt(76))\n", "five_percent_points = dist1.ppf(0.20)\n", "print(\"5% points = \" + str(np.rint(five_percent_points)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We know that the first hump contains 25% percent of the distribution of y. Therefore, the 25% points are in between the first and second humps, which would be approximately 120 on the graph. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The second hump contains 1/2 of the distribution of y so the 50% points is given by its mean 167. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The last hump contains 1/4 of the distribution, so the 75% points is in between the second and third hump, which is approximately 207." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " Finally the last hump contains the last 25% of the distribution. Before the hump we have 75% of the distribution of y, and in order to reach 95% we will need the 20% remaining which can be completely determined by y_3. However this distribution represents only 1/4 of y so we will need to determine its 80% points. " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "dist3 = norm(250, np.sqrt(188))\n", "ninety_five_percent_points = dist3.ppf(0.80)\n", "print(\"95% points = \" + str(np.rint(ninety_five_percent_points)))" ] } ], "metadata": { "kernelspec": { "display_name": "Python3.6", "language": "python", "name": "python3.6" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.9" }, "widgets": { "application/vnd.jupyter.widget-state+json": { "state": {}, "version_major": 1, "version_minor": 0 } } }, "nbformat": 4, "nbformat_minor": 2 }