{
"cells": [
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"tags": [
"remove_input"
]
},
"outputs": [],
"source": [
"path_data = '../../../../data/'\n",
"\n",
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"%matplotlib inline\n",
"import matplotlib.pyplot as plt\n",
"plt.style.use('fivethirtyeight')\n",
"\n",
"import warnings\n",
"warnings.filterwarnings('ignore')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Variability ###\n",
"The mean tells us where a histogram balances. But in almost every histogram we have seen, the values spread out on both sides of the mean. How far from the mean can they be? To answer this question, we will develop a measure of **variability about the mean**.\n",
"\n",
"We will start by describing how to calculate the measure. Then we will see why it is a good measure to calculate."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### The Rough Size of Deviations from Average ###\n",
"For simplicity, we will begin our calculations in the context of a simple array `any_numbers` consisting of just four values. As you will see, our method will extend easily to any other array of values."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"any_numbers = np.array([1, 2, 2, 10])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The goal is to measure roughly how far off the numbers are from their average. To do this, we first need the average: "
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"3.75"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Step 1. The average.\n",
"\n",
"mean = np.mean(any_numbers)\n",
"mean"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Next, let's find out how far each value is from the mean. These are called the *deviations from the average*. A \"deviation from average\" is just a value minus the average. The table `calculation_steps` displays the results."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" Value Deviation from Average\n",
"0 1 -2.75\n",
"1 2 -1.75\n",
"2 2 -1.75\n",
"3 10 6.25"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Step 2. The deviations from average.\n",
"\n",
"deviations = any_numbers - mean\n",
"calculation_steps = pd.DataFrame(\n",
" {'Value':any_numbers,\n",
" 'Deviation from Average':deviations}\n",
" )\n",
"calculation_steps"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Some of the deviations are negative; those correspond to values that are below average. Positive deviations correspond to above-average values.\n",
"\n",
"To calculate roughly how big the deviations are, it is natural to compute the mean of the deviations. But something interesting happens when all the deviations are added together:"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0.0"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"np.sum(deviations)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The positive deviations exactly cancel out the negative ones. This is true of all lists of numbers, no matter what the histogram of the list looks like: **the sum of the deviations from average is zero.** \n",
"\n",
"Since the sum of the deviations is 0, the mean of the deviations will be 0 as well:"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0.0"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"np.mean(deviations)"
]
},
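{
"cell_type": "markdown",
"metadata": {},
"source": [
"A short derivation shows why this cancellation is not special to `any_numbers`. The notation is introduced here only for illustration: $x_1, \\ldots, x_n$ are the values on the list and $\\bar{x}$ is their average, so that $\\sum_{i=1}^n x_i = n\\bar{x}$.\n",
"\n",
"$$\n",
"\\sum_{i=1}^n (x_i - \\bar{x}) ~=~ \\sum_{i=1}^n x_i ~-~ n\\bar{x} ~=~ n\\bar{x} - n\\bar{x} ~=~ 0\n",
"$$"
]
},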
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Because of this, the mean of the deviations is not a useful measure of the size of the deviations. What we really want to know is roughly how big the deviations are, regardless of whether they are positive or negative. So we need a way to eliminate the signs of the deviations.\n",
"\n",
"There are two time-honored ways of losing signs: the **absolute value**, and the **square**. It turns out that taking the square constructs a measure with extremely powerful properties, some of which we will study in this course.\n",
"\n",
"So let's eliminate the signs by squaring all the deviations. Then we will take the mean of the squares:"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" Value Deviation from Average Squared Deviations from Average\n",
"0 1 -2.75 7.5625\n",
"1 2 -1.75 3.0625\n",
"2 2 -1.75 3.0625\n",
"3 10 6.25 39.0625"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Step 3. The squared deviations from average\n",
"\n",
"squared_deviations = deviations ** 2\n",
"calculation_steps['Squared Deviations from Average'] = squared_deviations\n",
"calculation_steps"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"13.1875"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Step 4. Variance = the mean squared deviation from average\n",
"\n",
"variance = np.mean(squared_deviations)\n",
"variance"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Variance:** The mean squared deviation calculated above is called the *variance* of the values. \n",
"\n",
"While the variance does give us an idea of spread, it is not on the same scale as the original variable as its units are the square of the original. This makes interpretation very difficult. \n",
"\n",
"So we return to the original scale by taking the positive square root of the variance:"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"3.6314597615834874"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Step 5.\n",
"# Standard Deviation: root mean squared deviation from average\n",
"# Steps of calculation: 5 4 3 2 1\n",
"\n",
"sd = variance ** 0.5\n",
"sd"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Standard Deviation ###\n",
"\n",
"The quantity that we have just computed is called the *standard deviation* of the list, and is abbreviated as SD. It measures roughly how far the numbers on the list are from their average.\n",
"\n",
"**Definition.** The SD of a list is defined as the *root mean square of deviations from average*. That's a mouthful. But read it from right to left and you have the sequence of steps in the calculation.\n",
"\n",
"**Computation.** The five steps described above result in the SD. You can also use the function ``np.std`` to compute the SD of values in an array:"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"3.6314597615834874"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"np.std(any_numbers)"
]
},
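{
"cell_type": "markdown",
"metadata": {},
"source": [
"The five steps can also be collected into a single function. The sketch below is only an illustration: the name `sd_by_steps` is hypothetical and not part of NumPy. It follows the steps in the order they were carried out above and compares the result with `np.std`.\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"def sd_by_steps(values):\n",
"    # Hypothetical helper: follows the five steps in order.\n",
"    values = np.asarray(values, dtype=float)\n",
"    mean = np.mean(values)                  # Step 1: the average\n",
"    deviations = values - mean              # Step 2: deviations from average\n",
"    squared_deviations = deviations ** 2    # Step 3: squared deviations\n",
"    variance = np.mean(squared_deviations)  # Step 4: variance\n",
"    return variance ** 0.5                  # Step 5: square root of the variance\n",
"\n",
"any_numbers = np.array([1, 2, 2, 10])\n",
"print(sd_by_steps(any_numbers), np.std(any_numbers))\n",
"```\n",
"\n",
"The two printed values should agree, since `np.std` with its default arguments carries out the same calculation."
]
},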
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Working with the SD ###\n",
"\n",
"To see what we can learn from the SD, let's move to a more interesting dataset than `any_numbers`. The table `nba13` contains data on the players in the National Basketball Association (NBA) in 2013. For each player, the table records the position at which the player usually played, his height in inches, his weight in pounds, and his age in years."
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" Name Position Height Weight Age in 2013\n",
"0 DeQuan Jones Guard 80 221 23\n",
"1 Darius Miller Guard 80 235 23\n",
"2 Trevor Ariza Guard 80 210 28\n",
"3 James Jones Guard 80 215 32\n",
"4 Wesley Johnson Guard 79 215 26\n",
".. ... ... ... ... ...\n",
"500 Joel Anthony Center 81 245 31\n",
"501 Bismack Biyombo Center 81 229 21\n",
"502 Luis Scola Center 81 245 33\n",
"503 Lavoy Allen Center 81 225 24\n",
"504 Boris Diaw Center 80 235 31\n",
"\n",
"[505 rows x 5 columns]"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"nba13 = pd.read_csv(path_data + 'nba2013.csv')\n",
"nba13"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here is a histogram of the players' heights."
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "\n",
"text/plain": [
"
"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"unit = ''\n",
"\n",
"fig, ax = plt.subplots(figsize=(8,5))\n",
"\n",
"ax.hist(nba13['Height'], bins=np.arange(68, 88, 1), density=True, color='blue', alpha=0.8, ec='white', zorder=5)\n",
"\n",
"y_vals = ax.get_yticks()\n",
"\n",
"y_label = 'Percent per ' + (unit if unit else 'unit')\n",
"\n",
"x_label = 'Height'\n",
"\n",
"ax.set_yticklabels(['{:g}'.format(x * 100) for x in y_vals])\n",
"\n",
"plt.ylabel(y_label)\n",
"\n",
"plt.xlabel(x_label)\n",
"\n",
"plt.title('');\n",
"\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"It is no surprise that NBA players are tall! Their average height is just over 79 inches (6'7\"), about 10 inches taller than the average height of men in the United States."
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"79.06534653465347"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"mean_height = np.mean(nba13['Height'])\n",
"mean_height"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"About how far off are the players' heights from the average? This is measured by the SD of the heights, which is about 3.45 inches."
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"3.450597183027555"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"sd_height = np.std(nba13['Height'])\n",
"sd_height"
]
},
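{
"cell_type": "markdown",
"metadata": {},
"source": [
"One detail is worth keeping in mind when computing an SD from a table column. By default, `np.std` divides by the number of values (its `ddof` argument defaults to 0), which matches the definition in this section. The pandas method `Series.std` defaults to `ddof=1` and divides by one less than the number of values, so it returns a slightly larger number. The sketch below uses a small made-up series of heights, not the `nba13` data, to show the difference.\n",
"\n",
"```python\n",
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"# Made-up heights, for illustration only (not the nba13 data).\n",
"heights = pd.Series([80, 80, 79, 81, 87, 69])\n",
"\n",
"sd_numpy = np.std(heights)             # divides by n, as defined in this section\n",
"sd_pandas_default = heights.std()      # divides by n - 1 (ddof=1)\n",
"sd_pandas_ddof0 = heights.std(ddof=0)  # divides by n, agrees with np.std\n",
"\n",
"print(sd_numpy, sd_pandas_default, sd_pandas_ddof0)\n",
"```\n",
"\n",
"For a list of several hundred values the two versions are close, but they are not identical, so it helps to be explicit about which one is intended."
]
},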
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The towering center Hasheem Thabeet of the Oklahoma City Thunder was the tallest player at a height of 87 inches."
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" Name Position Height Weight Age in 2013\n",
"413 Hasheem Thabeet Center 87 263 26\n",
"414 Roy Hibbert Center 86 278 26\n",
"415 Alex Len Center 85 255 20"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"nba13.sort_values(by=['Height'], ascending=False).head(3)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Thabeet was about 8 inches above the average height."
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"7.934653465346528"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"87 - mean_height"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"That's a deviation from average, and it is about 2.3 times the standard deviation:"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"2.299501519439792"
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"(87 - mean_height)/sd_height"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In other words, the height of the tallest player was about 2.3 SDs above average.\n",
"\n",
"At 69 inches tall, Isaiah Thomas was one of the two shortest NBA players in 2013. His height was about 2.9 SDs below average."
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" Name Position Height Weight Age in 2013\n",
"201 Nate Robinson Guard 69 180 29\n",
"200 Isaiah Thomas Guard 69 185 24\n",
"199 Phil Pressey Guard 71 175 22"
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"nba13.sort_values(by=['Height']).head(3)"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"-2.916986828877584"
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"(69 - mean_height)/sd_height"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"What we have observed is that the tallest and shortest players were both just a few SDs away from the average height. This is an example of why the SD is a useful measure of spread. No matter what the shape of the histogram, the average and the SD together tell you a lot about where the histogram is situated on the number line."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### First main reason for measuring spread by the SD\n",
"\n",
"**Informal statement.** In all numerical data sets, the bulk of the entries are within the range \"average $\\pm$ a few SDs\".\n",
"\n",
"For now, resist the desire to know exactly what fuzzy words like \"bulk\" and \"few\" mean. We will make them precise later in this section. Let's just examine the statement in the context of some more examples."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We have already seen that *all* of the heights of the NBA players were in the range \"average $\\pm$ 3 SDs\". \n",
"\n",
"What about the ages? Here is a histogram of the distribution, along with the mean and SD of the ages."
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "\n",
"text/plain": [
"
"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"unit = ''\n",
"\n",
"fig, ax = plt.subplots(figsize=(8,5))\n",
"\n",
"ax.hist(nba13['Age in 2013'], bins=np.arange(15, 45, 1), density=True, color='blue', alpha=0.8, ec='white', zorder=5)\n",
"\n",
"y_vals = ax.get_yticks()\n",
"\n",
"y_label = 'Percent per ' + (unit if unit else 'unit')\n",
"\n",
"x_label = 'Age in 2013'\n",
"\n",
"ax.set_yticklabels(['{:g}'.format(x * 100) for x in y_vals])\n",
"\n",
"plt.ylabel(y_label)\n",
"\n",
"plt.xlabel(x_label)\n",
"\n",
"plt.title('');\n",
"\n",
"plt.show()"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(26.19009900990099, 4.321200441720307)"
]
},
"execution_count": 21,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"ages = nba13['Age in 2013']\n",
"mean_age = np.mean(ages)\n",
"sd_age = np.std(ages)\n",
"mean_age, sd_age"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The average age was just over 26 years, and the SD was about 4.3 years.\n",
"\n",
"How far off were the ages from the average? Just as we did with the heights, let's look at the two extreme values of the ages.\n",
"\n",
"Juwan Howard was the oldest player, at 40. "
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" Name Position Height Weight Age in 2013\n",
"294 Juwan Howard Forward 81 250 40\n",
"172 Derek Fisher Guard 73 210 39\n",
"466 Marcus Camby Center 83 235 39"
]
},
"execution_count": 22,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"nba13.sort_values(by=['Age in 2013'], ascending=False).head(3)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Howard's age was about 3.2 SDs above average."
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"3.1958482778922357"
]
},
"execution_count": 23,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"(40 - mean_age)/sd_age"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The youngest was 15-year-old Jarvis Varnado, who won the NBA Championship that year with the Miami Heat. His age was about 2.6 SDs below average."
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" Name Position Height Weight Age in 2013\n",
"293 Jarvis Varnado Forward 81 230 15\n",
"296 Giannis Antetokounmpo Forward 81 205 18\n",
"277 Livio Jean-Charles Forward 81 217 19"
]
},
"execution_count": 24,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"nba13.sort_values(by=['Age in 2013']).head(3)"
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"-2.589581103867081"
]
},
"execution_count": 25,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"(15 - mean_age)/sd_age"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"What we have observed for the heights and ages is true in great generality. For *all* lists, the bulk of the entries are no more than 2 or 3 SDs away from the average. "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Chebychev's Bounds ###\n",
"The Russian mathematician [Pafnuty Chebychev](https://en.wikipedia.org/wiki/Pafnuty_Chebyshev) (1821-1894) proved a result that makes our rough statements precise.\n",
"\n",
"**For all lists, and all positive numbers $z$, the proportion of entries that are in the range\n",
"\"average $\\pm z$ SDs\" is at least $1 - \\frac{1}{z^2}$.**\n",
"\n",
"It is important to note that the result gives a bound, not an exact value or an approximation.\n",
"\n",
"What makes the result powerful is that it is true for all lists – all distributions, no matter how irregular. \n",
"\n",
"Specifically, it says that for every list:\n",
"\n",
"- the proportion in the range \"average $\\pm$ 2 SDs\" is **at least 1 - 1/4 = 0.75**\n",
"\n",
"- the proportion in the range \"average $\\pm$ 3 SDs\" is **at least 1 - 1/9 $\\approx$ 0.89**\n",
"\n",
"- the proportion in the range \"average $\\pm$ 4.5 SDs\" is **at least 1 - 1/$\\boldsymbol{4.5^2}$ $\\approx$ 0.95**\n",
"\n",
"As we noted above, Chebychev's result gives a lower bound, not an exact answer or an approximation. For example, the percent of entries in the range \"average $\\pm ~2$ SDs\" might be quite a bit larger than 75%. But it cannot be smaller."
]
},
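{
"cell_type": "markdown",
"metadata": {},
"source": [
"The bounds listed above are easy to tabulate. The sketch below is only an illustration: the function names `chebychev_bound` and `z_for_proportion` are hypothetical. The second function inverts the bound, solving $1 - 1/z^2 = p$ for $z$, to find how many SDs are needed to guarantee a given proportion $p$.\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"def chebychev_bound(z):\n",
"    # Lower bound on the proportion within 'average plus or minus z SDs'.\n",
"    # Only informative for z > 1; for z <= 1 the bound is at most 0.\n",
"    return 1 - 1 / z ** 2\n",
"\n",
"def z_for_proportion(p):\n",
"    # Smallest z for which the bound guarantees at least proportion p (0 <= p < 1).\n",
"    return 1 / np.sqrt(1 - p)\n",
"\n",
"for z in [2, 3, 4.5]:\n",
"    print(z, round(chebychev_bound(z), 3))\n",
"\n",
"print(round(z_for_proportion(0.95), 2))\n",
"```\n",
"\n",
"Solving for $z$ with $p = 0.95$ gives about 4.47, which is why the range \"average $\\pm$ 4.5 SDs\" appears in the list above."
]
},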
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Standard units\n",
"\n",
"In the calculations above, the quantity $z$ measures *standard units*, the number of standard deviations above average.\n",
"\n",
"Some values of standard units are negative, corresponding to original values that are below average. Other values of standard units are positive. But no matter what the distribution of the list looks like, Chebychev's bounds imply that standard units will typically be in the (-5, 5) range.\n",
"\n",
"To convert a value to standard units, first find how far it is from average, and then compare that deviation with the standard deviation.\n",
"$$\n",
"z ~=~ \\frac{\\mbox{value }-\\mbox{ average}}{\\mbox{SD}}\n",
"$$\n",
"\n",
"As we will see, standard units are frequently used in data analysis. So it is useful to define a function that converts an array of numbers to standard units."
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {},
"outputs": [],
"source": [
"def standard_units(numbers_array):\n",
" \"Convert any array of numbers to standard units.\"\n",
" return (numbers_array - np.mean(numbers_array))/np.std(numbers_array) "
]
},
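{
"cell_type": "markdown",
"metadata": {},
"source": [
"A useful check on any conversion to standard units is that the converted values have an average of 0 and an SD of 1, up to rounding error. The sketch below repeats the definition of `standard_units` so that it runs on its own, and applies it to the four values of `any_numbers` from earlier in the section.\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"def standard_units(numbers_array):\n",
"    'Convert any array of numbers to standard units.'\n",
"    return (numbers_array - np.mean(numbers_array)) / np.std(numbers_array)\n",
"\n",
"any_numbers = np.array([1, 2, 2, 10])\n",
"z = standard_units(any_numbers)\n",
"\n",
"print(z)                     # the four values in standard units\n",
"print(np.mean(z), np.std(z))  # average and SD of the converted values\n",
"```\n",
"\n",
"The average is 0 because the deviations from average sum to zero, and the SD is 1 because every deviation has been divided by the SD."
]
},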
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Example ###\n",
"As we saw in an earlier section, the table `united` contains a column `Delay` consisting of the departure delay times, in minutes, of thousands of United Airlines flights in the summer of 2015. We will create a new column called `Delay (Standard Units)` by applying the function `standard_units` to the column of delay times. This allows us to see all the delay times in minutes as well as their corresponding values in standard units."
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" Date Flight Number Destination Delay Delay (Standard Units)\n",
"0 6/1/15 73 HNL 257 6.087655\n",
"1 6/1/15 217 EWR 28 0.287279\n",
"2 6/1/15 237 STL -3 -0.497924\n",
"3 6/1/15 250 SAN 0 -0.421937\n",
"4 6/1/15 267 PHL 64 1.199129\n",
"... ... ... ... ... ...\n",
"13820 8/31/15 1978 LAS -4 -0.523254\n",
"13821 8/31/15 1993 IAD 8 -0.219304\n",
"13822 8/31/15 1994 ORD 3 -0.345950\n",
"13823 8/31/15 2000 PHX -1 -0.447266\n",
"13824 8/31/15 2013 EWR -2 -0.472595\n",
"\n",
"[13825 rows x 5 columns]"
]
},
"execution_count": 27,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"united = pd.read_csv(path_data + 'united_summer2015.csv')\n",
"\n",
"united['Delay (Standard Units)'] = standard_units(united['Delay'])\n",
"\n",
"united"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The standard units that we can see are consistent with what we expect based on Chebychev's bounds. Most are of quite small size; only one is above 6.\n",
"\n",
"But something rather alarming happens when we sort the delay times from highest to lowest. The standard units that we can see are extremely high!"
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" Date Flight Number Destination Delay Delay (Standard Units)\n",
"3140 6/21/15 1964 SEA 580 14.268971\n",
"3154 6/22/15 300 HNL 537 13.179818\n",
"3069 6/21/15 1149 IAD 508 12.445272\n",
"2888 6/20/15 353 ORD 505 12.369285\n",
"12627 8/23/15 1589 ORD 458 11.178815\n",
"... ... ... ... ... ...\n",
"13568 8/30/15 602 SAN -13 -0.751216\n",
"12503 8/22/15 1723 KOA -14 -0.776545\n",
"2900 6/20/15 464 PDX -15 -0.801874\n",
"12565 8/23/15 587 PDX -16 -0.827203\n",
"788 6/6/15 525 IAD -16 -0.827203\n",
"\n",
"[13825 rows x 5 columns]"
]
},
"execution_count": 28,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"united.sort_values(by=['Delay'], ascending=False)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"What this shows is that it is possible for data to be many SDs above average (and for flights to be delayed by almost 10 hours). The highest value of delay is more than 14 in standard units. \n",
"\n",
"However, the proportion of these extreme values is small, and Chebychev's bounds still hold true. For example, let us calculate the percent of delay times that are in the range \"average $\\pm$ 3 SDs\". This is the same as the percent of times for which the standard units are in the range (-3, 3). That is about 98%, as computed below, consistent with Chebychev's bound of \"at least 89%\". "
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0.9790235081374322"
]
},
"execution_count": 29,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"within_3_sd = united[(united['Delay (Standard Units)'] >= -3) & (united['Delay (Standard Units)'] <= 3)]\n",
"len(within_3_sd)/len(united)"
]
},
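{
"cell_type": "markdown",
"metadata": {},
"source": [
"The same check can be written as a small reusable function. The sketch below is only an illustration: the name `proportion_within` is hypothetical, and the delays are made-up numbers, not the `united` data. They are deliberately skewed, with one extreme value, to show that the observed proportion stays at or above Chebychev's bound.\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"def proportion_within(values, z):\n",
"    # Hypothetical helper: proportion of values within z SDs of the average.\n",
"    values = np.asarray(values, dtype=float)\n",
"    su = (values - np.mean(values)) / np.std(values)\n",
"    # The mean of a boolean array is the proportion of True entries.\n",
"    return np.mean(np.abs(su) <= z)\n",
"\n",
"# Made-up, heavily skewed delays in minutes (not the united data).\n",
"delays = np.array([-5, -3, -2, 0, 0, 1, 2, 4, 8, 15, 30, 600])\n",
"\n",
"for z in [2, 3]:\n",
"    # Observed proportion, followed by Chebychev's lower bound.\n",
"    print(z, proportion_within(delays, z), 1 - 1 / z ** 2)\n",
"```\n",
"\n",
"In each printed row the observed proportion should be at least as large as the bound next to it."
]
},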
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The histogram of delay times is shown below, with the horizontal axis in standard units. By the table above, the right-hand tail continues all the way out to $z=14.27$ standard units (580 minutes). The area of the histogram outside the range $z=-3$ to $z=3$ is about 2%, put together in tiny little bits that are mostly invisible in the histogram."
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "\n",
"text/plain": [
"