{ "cells": [ { "cell_type": "code", "execution_count": 1, "metadata": { "tags": [ "remove_input" ] }, "outputs": [], "source": [ "path_data = '../../data/'\n", "\n", "import numpy as np\n", "import pandas as pd\n", "import pylab as pl\n", "\n", "%matplotlib inline\n", "import matplotlib.pyplot as plt\n", "plt.style.use('fivethirtyeight')\n", "\n", "import warnings\n", "warnings.filterwarnings('ignore')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Properties of the Mean\n", "\n", "In this course, we have used the words \"average\" and \"mean\" interchangeably, and will continue to do so. The definition of the mean will be familiar to you from your school days or even earlier.\n", "\n", "**Definition.** The *average* or *mean* of a collection of numbers is the sum of all the elements of the collection, divided by the number of elements in the collection.\n", "\n", "The functions `np.average` and `np.mean` return the mean of an array.\n", "\n", "[Numpy average](https://numpy.org/doc/stable/reference/generated/numpy.average.html)\n", "\n", "[Numpy mean](https://numpy.org/doc/stable/reference/generated/numpy.mean.html)" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "not_symmetric = np.array([2, 3, 3, 9])" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "4.25" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "np.average(not_symmetric)" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "4.25" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "np.mean(not_symmetric)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Basic Properties\n", "\n", "The definition and the example above point to some properties of the mean.\n", "\n", "- It need not be an element of the collection.\n", "- It need not be an 
integer even if all the elements of the collection are integers.\n", "- It is somewhere between the smallest and largest values in the collection.\n", "- It need not be halfway between the two extremes; it is not in general true that half the elements in a collection are above the mean.\n", "- If the collection consists of values of a variable measured in specified units, then the mean has the same units too.\n", "\n", "We will now study some other properties that are helpful in understanding the mean and its relation to other statistics." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### The Mean is a \"Smoother\"\n", "\n", "You can think of taking the mean as an \"equalizing\" or \"smoothing\" operation. For example, imagine the entries in `not_symmetric` above as the dollars in the pockets of four different people. To get the mean, you first put all of the money into one big pot and then divide it evenly among the four people. They had started out with different amounts of money in their pockets (\\$2, \\$3, \\$3, and \\$9), but now each person has \\$4.25, the mean amount." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Proportions are Means\n", "If a collection consists only of ones and zeroes, then the sum of the collection is the number of ones in it, and the mean of the collection is the proportion of ones." 
] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "3" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "zero_one = np.array([1, 1, 1, 0])\n", "sum(zero_one)" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.75" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "np.mean(zero_one)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can replace 1 by the Boolean `True` and 0 by `False`:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.75" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "np.mean(np.array([True, True, True, False]))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Because proportions are a special case of means, results about random sample means apply to random sample proportions as well." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### The Mean and the Histogram\n", "The mean of the collection {2, 3, 3, 9} is 4.25, which is not the \"halfway point\" of the data. 
So then what does the mean measure?\n", "\n", "To see this, notice that the mean can be calculated in different ways.\n", "\n", "$$\\begin{align*}\n", "\\mbox{mean} ~ &=~ 4.25 \\\\ \\\\\n", "&=~ \\frac{2 + 3 + 3 + 9}{4} \\\\ \\\\\n", "&=~ 2 \\cdot \\frac{1}{4} ~~ + ~~ 3 \\cdot \\frac{1}{4} ~~ + ~~ 3 \\cdot \\frac{1}{4} ~~ + ~~ 9 \\cdot \\frac{1}{4} \\\\ \\\\\n", "&=~ 2 \\cdot \\frac{1}{4} ~~ + ~~ 3 \\cdot \\frac{2}{4} ~~ + ~~ 9 \\cdot \\frac{1}{4} \\\\ \\\\\n", "&=~ 2 \\cdot 0.25 ~~ + ~~ 3 \\cdot 0.5 ~~ + ~~ 9 \\cdot 0.25\n", "\\end{align*}$$\n", "\n", "The last expression is an example of a general fact: when we calculate the mean, each distinct value in the collection is *weighted* by the proportion of times it appears in the collection.\n", "\n", "This has an important consequence. The mean of a collection depends only on the distinct values and their proportions, not on the number of elements in the collection. In other words, the mean of a collection depends only on the distribution of values in the collection.\n", "\n", "Therefore, **if two collections have the same distribution, then they have the same mean.**\n", "\n", "For example, here is another collection that has the same distribution as `not_symmetric` and hence the same mean." ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([2, 3, 3, 9])" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "not_symmetric" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "4.25" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "same_distribution = np.array([2, 2, 3, 3, 3, 3, 9, 9])\n", "np.mean(same_distribution)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The mean is a physical attribute of the histogram of the distribution. 
Here is the histogram of the distribution of `not_symmetric` or, equivalently, the distribution of `same_distribution`." ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "tags": [ "remove_input" ] }, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "t1 = pd.DataFrame({'not symmetric':not_symmetric})\n", "\n", "unit = ''\n", "\n", "fig, ax = plt.subplots(figsize=(8,5))\n", "\n", "ax.hist(t1, density=True, color='blue', alpha=0.8, ec='white', zorder=5)\n", "\n", "y_vals = ax.get_yticks()\n", "\n", "y_label = 'Percent per ' + (unit if unit else 'unit')\n", "\n", "x_label = 'not symmetric'\n", "\n", "ax.set_yticklabels(['{:g}'.format(x * 100) for x in y_vals])\n", "\n", "plt.ylabel(y_label)\n", "\n", "plt.xlabel(x_label)\n", "\n", "plt.title('');\n", "\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Imagine the histogram as a figure made out of cardboard attached to a wire that runs along the horizontal axis, and imagine the bars as weights attached at the values 2, 3, and 9. Suppose you try to balance this figure on a point on the wire. If the point is near 2, the figure will tip over to the right. If the point is near 9, the figure will tip over to the left. Somewhere in between is the point where the figure will balance; that point is 4.25, the mean.\n", "\n", "**The mean is the *center of gravity* or balance point of the histogram.**\n", "\n", "To understand why that is, it helps to know some physics. The center of gravity is calculated exactly as we calculated the mean, by using the distinct values weighted by their proportions.\n", "\n", "Because the mean is a balance point, it is sometimes displayed as a *fulcrum* or triangle at the base of the histogram." ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "tags": [ "remove_input" ] }, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "mean_ns = np.mean(not_symmetric)\n", "\n", "unit = ''\n", "\n", "fig, ax = plt.subplots(figsize=(8,5))\n", "\n", "ax.hist(t1, bins=np.arange(1.5, 9.6, 1), density=True, color='blue', alpha=0.8, ec='white', zorder=5)\n", "\n", "ax.plot([1.5, 9.5], [0, 0], color='grey', lw=8, zorder=10)\n", "\n", "ax.scatter(mean_ns, -0.009, marker='^', color='darkblue', s=60, zorder=15).set_clip_on(False)\n", "\n", "y_vals = ax.get_yticks()\n", "\n", "y_label = 'Percent per ' + (unit if unit else 'unit')\n", "\n", "x_label = 'not symmetric'\n", "\n", "ax.set_yticklabels(['{:g}'.format(x * 100) for x in y_vals])\n", "\n", "plt.ylim(-0.05, 0.5)\n", "\n", "plt.ylabel(y_label)\n", "\n", "plt.xlabel(x_label)\n", "\n", "plt.title('');\n", "\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### The Mean and the Median\n", "If a student's score on a test is below average, does that imply that the student is in the bottom half of the class on that test?\n", "\n", "Happily for the student, the answer is, \"Not necessarily.\" The reason has to do with the relation between the average, which is the balance point of the histogram, and the median, which is the \"halfway point\" of the data.\n", "\n", "The relationship is easy to see in a simple example. Here is a histogram of the collection {2, 3, 3, 4} which is in the array `symmetric`. The distribution is symmetric about 3. The mean and the median are both equal to 3." ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "symmetric = np.array([2, 3, 3, 4])" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "tags": [ "remove_input" ] }, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "t2 = pd.DataFrame({'symmetric':symmetric})\n", "\n", "mean_s = np.mean(symmetric)\n", "\n", "unit = ''\n", "\n", "fig, ax = plt.subplots(figsize=(7,5))\n", "\n", "ax.hist(t2, bins=np.arange(1.5, 4.6, 1), density=True, color='blue', alpha=0.8, ec='white', zorder=5)\n", "\n", "ax.scatter(mean_s, -0.009, marker='^', color='darkblue', s=60, zorder=15).set_clip_on(False)\n", "\n", "y_vals = ax.get_yticks()\n", "\n", "y_label = 'Percent per ' + (unit if unit else 'unit')\n", "\n", "x_label = 'symmetric'\n", "\n", "ax.set_yticklabels(['{:g}'.format(x * 100) for x in y_vals])\n", "\n", "plt.xlim(1, 10)\n", "\n", "plt.ylim(-0.05, 0.5)\n", "\n", "plt.ylabel(y_label)\n", "\n", "plt.xlabel(x_label)\n", "\n", "plt.title('');\n", "\n", "plt.show()" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "3.0" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "np.mean(symmetric)" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "3" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "np.percentile(symmetric, 50, interpolation='nearest')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In general, **for symmetric distributions, the mean and the median are equal.**\n", "\n", "What if the distribution is not symmetric? Let's compare `symmetric` and `not_symmetric`." ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "tags": [ "remove_input" ] }, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "t3 = pd.DataFrame({'not_symmetric':not_symmetric})\n", "\n", "mean_s = np.mean(symmetric)\n", "\n", "unit = ''\n", "\n", "fig, ax = plt.subplots(figsize=(7,5))\n", "\n", "ax.hist(t2, bins=np.arange(1.5, 9.6, 1), density=True, \n", " color=('darkblue'), \n", " label='symmetric', \n", " alpha=0.8, \n", " ec='white', \n", " zorder=5)\n", "\n", "ax.hist(t3, bins=np.arange(1.5, 9.6, 1), density=True, \n", " color=('gold'), \n", " label='not_symmetric', \n", " alpha=0.8, ec='white', \n", " zorder=5)\n", "\n", "ax.legend()\n", "\n", "ax.scatter(mean_s, -0.009, marker='^', color='darkblue', \n", " s=60, \n", " zorder=15).set_clip_on(False)\n", "\n", "ax.scatter(mean_ns, -0.009, marker='^', color='gold', \n", " s=60, \n", " zorder=15).set_clip_on(False)\n", "\n", "y_vals = ax.get_yticks()\n", "\n", "y_label = 'Percent per ' + (unit if unit else 'unit')\n", "\n", "x_label = ''\n", "\n", "ax.set_yticklabels(['{:g}'.format(x * 100) for x in y_vals])\n", "\n", "plt.ylim(-0.05, 0.5)\n", "\n", "plt.ylabel(y_label)\n", "\n", "ax.legend(bbox_to_anchor=(1.04,1), loc=\"upper left\")\n", "\n", "plt.xlabel(x_label)\n", "\n", "plt.title('');\n", "\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The blue histogram represents the original `symmetric` distribution. The gold histogram of `not_symmetric` starts out the same as the blue at the left end, but its rightmost bar has slid over to the value 9. The darker-gold part is where the two histograms overlap.\n", "\n", "The median and mean of the blue distribution are both equal to 3. The median of the gold distribution is also equal to 3, though the right half is distributed differently from the left. \n", "\n", "But the mean of the gold distribution is not 3: the gold histogram would not balance at 3. The balance point has shifted to the right, to 4.25." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the gold distribution, 3 out of 4 entries (75%) are below average. The student with a below-average score can therefore take heart. He or she might be in the majority of the class." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In general, **if the histogram has a tail on one side (the formal term is \"skewed\"), then the mean is pulled away from the median in the direction of the tail.**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Example\n", "The table `sf2015` contains salary and benefits data for San Francisco City employees in 2015. As before, we will restrict our analysis to those who had the equivalent of at least half-time employment for the year." ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [], "source": [ "sf2015 = pd.read_csv(path_data + 'san_francisco_2015.csv')\n", "\n", "sf2015 = sf2015[sf2015['Salaries'] > 10000]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As we saw earlier, the highest compensation was above \\$600,000 but the vast majority of employees had compensations below \\$300,000." ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "unit = ''\n", "\n", "fig, ax = plt.subplots(figsize=(7,5))\n", "\n", "ax.hist(sf2015['Total Compensation'], bins=np.arange(10000, 700000, 25000), density=True, color=('darkblue'),\n", " alpha=0.8, ec='white', zorder=5)\n", "\n", "y_vals = ax.get_yticks()\n", "\n", "y_label = 'Percent per ' + (unit if unit else 'unit')\n", "\n", "x_label = 'Total Compensation'\n", "\n", "ax.set_yticklabels(['{:g}'.format(x * 100) for x in y_vals])\n", "\n", "plt.ylabel(y_label)\n", "\n", "plt.xlabel(x_label)\n", "\n", "plt.xticks(rotation=90)\n", "\n", "plt.title('');\n", "\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This histogram is skewed to the right; it has a right-hand tail. \n", "\n", "The mean gets pulled away from the median in the direction of the tail. So we expect the mean compensation to be larger than the median, and that is indeed the case." ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "110305.79" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "compensation = sf2015['Total Compensation']\n", "np.percentile(compensation, 50)" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "114725.98411824208" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "np.mean(compensation)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Distributions of incomes of large populations tend to be right skewed. When the bulk of a population has middle to low incomes, but a very small proportion has very high incomes, the histogram has a long, thin tail to the right. \n", "\n", "The mean income is affected by this tail: the farther the tail stretches to the right, the larger the mean becomes. But the median is not affected by values at the extremes of the distribution. 
That is why economists often summarize income distributions by the median instead of the mean." ] } ], "metadata": { "anaconda-cloud": {}, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.5" } }, "nbformat": 4, "nbformat_minor": 1 }