{
"cells": [
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"tags": [
"remove_input"
]
},
"outputs": [],
"source": [
"path_data = '../../../../data/'\n",
"\n",
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"%matplotlib inline\n",
"import matplotlib.pyplot as plt\n",
"plt.style.use('fivethirtyeight')\n",
"\n",
"import warnings\n",
"warnings.filterwarnings('ignore')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Empirical Distributions\n",
"\n",
"In data science, the word \"empirical\" means \"observed\". Empirical distributions are distributions of observed data, such as data in random samples.\n",
"\n",
"In this section we will generate data and see what the empirical distribution looks like. \n",
"\n",
"Our setting is a simple experiment: rolling a die multiple times and keeping track of which face appears. The table `die` contains the numbers of spots on the faces of a die. All the numbers appear exactly once, as we are assuming that the die is fair."
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" Face\n",
"0 1\n",
"1 2\n",
"2 3\n",
"3 4\n",
"4 5\n",
"5 6"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"die = pd.DataFrame({'Face':np.arange(1, 7, 1)})\n",
"die"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## A Probability Distribution\n",
"\n",
"The histogram below helps us visualize the fact that every face appears with probability 1/6. We say that the histogram shows the *distribution* of probabilities over all the possible faces. Since all the bars represent the same percent chance, the distribution is called *uniform on the integers 1 through 6.*"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[Figure: probability histogram of the faces 1 through 6; x-axis 'Face', y-axis 'Percent per unit']"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# Bin edges 0.5, 1.5, ..., 6.5 center each bar over an integer face.\n",
"die_bins = np.arange(0.5, 6.6, 1)\n",
"\n",
"fig, ax1 = plt.subplots()\n",
"\n",
"# density=True scales the bars so that their total area is 1.\n",
"ax1.hist(die, bins=die_bins, density=True, alpha=0.8, ec='white')\n",
"\n",
"# Relabel the vertical axis in percent instead of proportion.\n",
"y_vals = ax1.get_yticks()\n",
"ax1.set_yticklabels(['{:g}'.format(x * 100) for x in y_vals])\n",
"\n",
"plt.ylabel('Percent per unit')\n",
"plt.xlabel('Face')\n",
"\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Variables whose successive values are separated by the same fixed amount, such as the values on rolls of a die (successive values separated by 1), fall into a class of variables that are called *discrete*. The histogram above is called a *discrete* histogram. Its bins are specified by the array `die_bins` and ensure that each bar is centered over the corresponding integer value. \n",
"\n",
"It is important to remember that the die can't show 1.3 spots, or 5.2 spots – it always shows an integer number of spots. But our visualization spreads the probability of each value over the area of a bar. While this might seem a bit arbitrary at this stage of the course, it will become important later when we overlay smooth curves over discrete histograms.\n",
"\n",
"Before going further, let's make sure that the numbers on the axes make sense. The probability of each face is 1/6, which is 16.67% when rounded to two decimal places. The width of each bin is 1 unit. So the height of each bar is 16.67% per unit. This agrees with the horizontal and vertical scales of the graph."
]
},
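The arithmetic above can be checked numerically. The sketch below is an illustration and not part of the original notebook: it assumes `np.histogram` with `density=True` and the same bin edges as `die_bins`, and the names `faces`, `heights`, and `widths` are introduced here only for the check.

```python
import numpy as np

# The six faces of a fair die, each appearing exactly once.
faces = np.arange(1, 7)

# Same bin edges as die_bins: 0.5, 1.5, ..., 6.5.
bins = np.arange(0.5, 6.6, 1)

# With density=True, each height is (proportion in bin) / (bin width).
heights, edges = np.histogram(faces, bins=bins, density=True)
widths = np.diff(edges)

print(np.round(heights * 100, 2))   # bar heights, in percent per unit
print((heights * widths).sum())     # total area of all the bars
```

Because every bin has width 1, the height of each bar and the probability it represents are the same number, and the areas of the six bars add up to 1.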
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Empirical Distributions\n",
"The distribution above consists of the theoretical probability of each face. It is not based on data. It can be studied and understood without any dice being rolled.\n",
"\n",
"*Empirical distributions,* on the other hand, are distributions of observed data. They can be visualized by *empirical histograms*. \n",
"\n",
"Let us get some data by simulating rolls of a die. This can be done by sampling at random with replacement from the integers 1 through 6. We have used `np.random.choice` for such simulations before. But now we will introduce a DataFrame method for doing this. This will make it possible for us to keep working with the `die` table when we visualize the results.\n",
"\n",
"The DataFrame method is called `sample`. It draws at random from the rows of a table. Its first argument is the sample size, and it returns a table consisting of the rows that were selected. By default the sample is drawn without replacement, so we pass the optional argument `replace=True` to draw with replacement, as rolling a die requires.\n",
"\n",
"Here are the results of 10 rolls of a die."
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" Face\n",
"2 3\n",
"0 1\n",
"3 4\n",
"2 3\n",
"2 3\n",
"3 4\n",
"3 4\n",
"4 5\n",
"5 6\n",
"0 1"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"die.sample(10, replace=True)"
]
},
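As a hedged aside, the sketch below contrasts the two sampling modes of the pandas `sample` method. The `random_state` argument and the seed value 0 are assumptions added here so that the draws are repeatable; the cell above does not set a seed, which is why its output changes from run to run.

```python
import numpy as np
import pandas as pd

die = pd.DataFrame({'Face': np.arange(1, 7, 1)})

# Rolling a die: draw with replacement, so faces can repeat.
# random_state=0 is an arbitrary seed chosen only for repeatability.
rolls = die.sample(10, replace=True, random_state=0)

# Without replacement, each row is drawn at most once,
# so a sample of size 6 is just a shuffle of the six faces.
shuffled = die.sample(6, replace=False, random_state=0)

print(rolls['Face'].to_numpy())
print(sorted(shuffled['Face']))

# Asking for more rows than the table has, without replacement, is an error.
try:
    die.sample(10, replace=False)
except ValueError as err:
    print('ValueError:', err)
```

The last call shows why `replace=True` is needed here: ten rolls cannot be drawn from six rows unless rows are allowed to repeat.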
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can use the same method to simulate as many rolls as we like, and then draw empirical histograms of the results. Because we are going to do this repeatedly, we define a function `empirical_hist_die` that takes the sample size as its argument, rolls a die that many times, and then draws a histogram of the observed results."
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [],
"source": [
"def empirical_hist_die(n):\n",
"    \"\"\"Roll a fair die n times and draw the empirical histogram of the faces.\"\"\"\n",
"    fig, ax1 = plt.subplots()\n",
"\n",
"    # Sample n rows of die with replacement: one row per simulated roll.\n",
"    ax1.hist(die.sample(n, replace=True), bins=die_bins, density=True, alpha=0.8, ec='white')\n",
"\n",
"    # Relabel the vertical axis in percent instead of proportion.\n",
"    y_vals = ax1.get_yticks()\n",
"    ax1.set_yticklabels(['{:g}'.format(x * 100) for x in y_vals])\n",
"\n",
"    plt.ylabel('Percent per unit')\n",
"    plt.xlabel('Face')\n",
"\n",
"    plt.show()\n"
]
},
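Since every bin has width 1, the bar heights that `empirical_hist_die` draws are just the observed proportions of each face. A minimal sketch of that calculation follows; the helper name `empirical_proportions` and its `seed` parameter are hypothetical and introduced here only for illustration.

```python
import numpy as np
import pandas as pd

die = pd.DataFrame({'Face': np.arange(1, 7, 1)})

def empirical_proportions(n, seed=None):
    """Roll the die n times and return the proportion of rolls showing each face."""
    rolls = die.sample(n, replace=True, random_state=seed)
    counts = rolls['Face'].value_counts()
    # Include faces that never appeared, with a count of 0.
    counts = counts.reindex(die['Face'].to_numpy(), fill_value=0)
    return counts / n

# The seeds are arbitrary; they only make the printed numbers repeatable.
for n in (10, 100, 1000):
    props = empirical_proportions(n, seed=n)
    print(n, np.round(props.to_numpy(), 3))
```

Comparing these proportions with 1/6 gives a numerical counterpart to comparing an empirical histogram with the probability histogram.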
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Empirical Histograms\n",
"\n",
"Here is an empirical histogram of 10 rolls. It doesn't look very much like the probability histogram above. Run the cell a few times to see how it varies."
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "\n",
"text/plain": [
"