{ "cells": [ { "cell_type": "code", "execution_count": 1, "metadata": { "tags": [ "remove_input" ] }, "outputs": [], "source": [ "path_data = '../../data/'\n", "\n", "import numpy as np\n", "import pandas as pd\n", "\n", "%matplotlib inline\n", "import matplotlib.pyplot as plt\n", "plt.style.use('fivethirtyeight')\n", "\n", "import warnings\n", "warnings.filterwarnings('ignore')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Confidence Intervals\n", "We have developed a method for estimating a parameter by using random sampling and the bootstrap. Our method produces an interval of estimates, to account for chance variability in the random sample. By providing an interval of estimates instead of just one estimate, we give ourselves some wiggle room.\n", "\n", "In the previous example we saw that our process of estimation produced a good interval about 95% of the time, a \"good\" interval being one that contains the parameter. We say that we are *95% confident* that the process results in a good interval. Our interval of estimates is called a *95% confidence interval* for the parameter, and 95% is called the *confidence level* of the interval.\n", "\n", "The situation in the previous example was a bit unusual. Because we happened to know the value of the parameter, we were able to check whether an interval was good or not so good, and this in turn helped us to see that our process of estimation captured the parameter about 95 out of every 100 times we used it.\n", "\n", "But usually, data scientists don't know the value of the parameter. That is the reason they want to estimate it in the first place. In such situations, they provide an interval of estimates for the unknown parameter by using methods like the one we have developed. Because of statistical theory and demonstrations like the one we have seen, data scientists can be confident that their process of generating the interval results in a good interval a known percent of the time." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Confidence Interval for a Population Median: Bootstrap Percentile Method ###\n", "\n", "We will now use the bootstrap method to estimate an unknown population median. The data come from a sample of newborns in a large hospital system; we will treat it as if it were a simple random sample though the sampling was done in multiple stages. [Stat Labs](https://www.stat.berkeley.edu/~statlabs/) by Deborah Nolan and Terry Speed has details about a larger dataset from which this set is drawn. \n", "\n", "The table `baby` contains the following variables for mother-baby pairs: the baby's birth weight in ounces, the number of gestational days, the mother's age in completed years, the mother's height in inches, pregnancy weight in pounds, and whether or not the mother smoked during pregnancy." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "baby = pd.read_csv(path_data + 'baby.csv')" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
|  | Birth Weight | Gestational Days | Maternal Age | Maternal Height | Maternal Pregnancy Weight | Maternal Smoker |
|---|---|---|---|---|---|---|
| 0 | 120 | 284 | 27 | 62 | 100 | False |
| 1 | 113 | 282 | 33 | 64 | 135 | False |
| 2 | 128 | 279 | 28 | 64 | 115 | True |
| 3 | 108 | 282 | 23 | 67 | 125 | True |
| 4 | 136 | 286 | 25 | 62 | 93 | False |
| ... | ... | ... | ... | ... | ... | ... |
| 1169 | 113 | 275 | 27 | 60 | 100 | False |
| 1170 | 128 | 265 | 24 | 67 | 120 | False |
| 1171 | 130 | 291 | 30 | 65 | 150 | True |
| 1172 | 125 | 281 | 21 | 65 | 110 | False |
| 1173 | 117 | 297 | 38 | 65 | 129 | False |

1174 rows × 6 columns
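
The bootstrap percentile method named in this section's heading can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the notebook's own code: `bootstrap_median_ci` is a hypothetical helper, and the data frame is a synthetic stand-in rather than the contents of `baby.csv`.

```python
import numpy as np
import pandas as pd

def bootstrap_median_ci(values, level=95, repetitions=2000, seed=0):
    """Hypothetical helper: percentile-method interval for a median."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values)
    n = len(values)
    medians = np.empty(repetitions)
    for i in range(repetitions):
        # Resample with replacement, same size as the original sample
        resample = rng.choice(values, size=n, replace=True)
        medians[i] = np.median(resample)
    # Keep the middle `level` percent of the bootstrapped medians
    tail = (100 - level) / 2
    left, right = np.percentile(medians, [tail, 100 - tail])
    return left, right

# Synthetic stand-in for a sample; these are NOT the real baby.csv values
rng = np.random.default_rng(42)
sample = pd.DataFrame({'Birth Weight': rng.normal(120, 18, size=500).round()})

left, right = bootstrap_median_ci(sample['Birth Weight'])
print('Approximate 95% interval for the median:', left, 'to', right)
```

Changing `level` to 80 or 99 narrows or widens the interval, which is the trade-off between confidence level and precision.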
|  | Birth Weight | Gestational Days | Ratio BW/GD |
|---|---|---|---|
| 0 | 120 | 284 | 0.422535 |
| 1 | 113 | 282 | 0.400709 |
| 2 | 128 | 279 | 0.458781 |
| 3 | 108 | 282 | 0.382979 |
| 4 | 136 | 286 | 0.475524 |
| ... | ... | ... | ... |
| 1169 | 113 | 275 | 0.410909 |
| 1170 | 128 | 265 | 0.483019 |
| 1171 | 130 | 291 | 0.446735 |
| 1172 | 125 | 281 | 0.444840 |
| 1173 | 117 | 297 | 0.393939 |

1174 rows × 3 columns
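
The `Ratio BW/GD` column in the table above is consistent with dividing birth weight by gestational days. The code that produced it is not shown here, so the following is only a sketch of one way to compute it. It uses the first five displayed rows, and `baby_head` and `ratios` are illustrative names.

```python
import pandas as pd

# First five displayed rows of the two relevant columns
baby_head = pd.DataFrame({
    'Birth Weight': [120, 113, 128, 108, 136],
    'Gestational Days': [284, 282, 279, 282, 286],
})

# Copy the two columns, then add the ratio of birth weight to gestational days
ratios = baby_head[['Birth Weight', 'Gestational Days']].copy()
ratios['Ratio BW/GD'] = ratios['Birth Weight'] / ratios['Gestational Days']
print(ratios)
```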
|  | Birth Weight | Gestational Days | Ratio BW/GD |
|---|---|---|---|
| 238 | 116 | 148 | 0.783784 |
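
A single row like the one above could be obtained by sorting on the ratio and keeping the top entry. The sketch below assumes that row 238 is shown because it has the largest ratio, which the surrounding text does not confirm. The small `rows` table reuses three displayed rows purely for illustration.

```python
import pandas as pd

# Three rows taken from the displayed tables, keeping their original labels
rows = pd.DataFrame(
    {'Birth Weight': [120, 113, 116], 'Gestational Days': [284, 282, 148]},
    index=[0, 1, 238],
)
rows['Ratio BW/GD'] = rows['Birth Weight'] / rows['Gestational Days']

# Sort by ratio, largest first, and keep only the top row
largest = rows.sort_values('Ratio BW/GD', ascending=False).head(1)
print(largest)
```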