{
"cells": [
{
"cell_type": "code",
"execution_count": 62,
"metadata": {
"tags": [
"remove_input"
]
},
"outputs": [],
"source": [
"path_data = '../../data/'\n",
"\n",
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"%matplotlib inline\n",
"import matplotlib.pyplot as plt\n",
"plt.style.use('fivethirtyeight')\n",
"\n",
"import warnings\n",
"warnings.filterwarnings('ignore')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Applying a Function to a Column\n",
"\n",
"We have seen many examples of creating new columns of tables by applying functions to existing columns or to other arrays. All of those functions took arrays as their arguments. But frequently we will want to convert the entries in a column by a function that doesn't take an array as its argument. For example, it might take just one number as its argument, as in the function `cut_off_at_100` defined below."
]
},
{
"cell_type": "code",
"execution_count": 63,
"metadata": {},
"outputs": [],
"source": [
"def cut_off_at_100(x):\n",
" \"\"\"The smaller of x and 100\"\"\"\n",
" return min(x, 100)"
]
},
{
"cell_type": "code",
"execution_count": 64,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"17"
]
},
"execution_count": 64,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"cut_off_at_100(17)"
]
},
{
"cell_type": "code",
"execution_count": 65,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"100"
]
},
"execution_count": 65,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"cut_off_at_100(117)"
]
},
{
"cell_type": "code",
"execution_count": 66,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"100"
]
},
"execution_count": 66,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"cut_off_at_100(100)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The function `cut_off_at_100` simply returns its argument if the argument is less than or equal to 100. But if the argument is greater than 100, it returns 100.\n",
"\n",
"In our earlier examples using Census data, we saw that the variable `AGE` had a value 100 that meant \"100 years old or older\". Cutting off ages at 100 in this manner is exactly what `cut_off_at_100` does.\n",
"\n",
"To use this function on many ages at once, we will have to be able to *refer* to the function itself, without actually calling it. Analogously, we might show a cake recipe to a chef and ask her to use it to bake 6 cakes. In that scenario, we are not using the recipe to bake any cakes ourselves; our role is merely to refer the chef to the recipe. Similarly, we can ask a table to call `cut_off_at_100` on 6 different numbers in a column."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"First, we create the table `ages` with a column for people and one for their ages. For example, person `C` is 52 years old."
]
},
{
"cell_type": "code",
"execution_count": 67,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" Age\n",
"Person \n",
"A 17\n",
"B 117\n",
"C 52\n",
"D 100\n",
"E 6\n",
"F 101"
]
},
"execution_count": 67,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"ages = pd.DataFrame(\n",
" {'Person':np.array(['A', 'B', 'C', 'D', 'E', 'F']),\n",
" 'Age':np.array([17, 117, 52, 100, 6, 101])}\n",
")\n",
"ages = ages.set_index('Person')\n",
"ages"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## `map` \n",
"\n",
"To cut off each of the ages at 100, we will use a new method. The [`map`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.map.html) method of a pandas Series calls a function on each element of a column, forming a new Series of return values. To indicate which function to call, just name it (without quotation marks or parentheses)."
]
},
{
"cell_type": "code",
"execution_count": 68,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Person\n",
"A 17\n",
"B 100\n",
"C 52\n",
"D 100\n",
"E 6\n",
"F 100\n",
"Name: Age, dtype: int64"
]
},
"execution_count": 68,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"ages['Age'].map(cut_off_at_100)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here we have used `map` to call the function `cut_off_at_100` on each value in the `Age` column of the table `ages`. The output is a Series of the corresponding return values. For example, 17 stayed 17, 117 became 100, 52 stayed 52, and so on.\n",
"\n",
"This Series, which has the same length and index as the original `Age` column of the `ages` table, can be used as the values in a new column called `Cut Off Age` alongside the existing `Age` column (`Person` has been set as the index)."
]
},
{
"cell_type": "code",
"execution_count": 69,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" Age Cut Off Age\n",
"Person \n",
"A 17 17\n",
"B 117 100\n",
"C 52 52\n",
"D 100 100\n",
"E 6 6\n",
"F 101 100"
]
},
"execution_count": 69,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"ages['Cut Off Age'] = ages['Age'].map(cut_off_at_100)\n",
"ages"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Functions as Values\n",
"We've seen that Python has many kinds of values. For example, `6` is a number value, `\"cake\"` is a text value, `pd.DataFrame()` is an empty table, and `ages` is a name for a table value (since we defined it above).\n",
"\n",
"In Python, every function, including `cut_off_at_100`, is also a value. It helps to think about recipes again. A recipe for cake is a real thing, distinct from cakes or ingredients, and you can give it a name like \"Ani's cake recipe.\" When we defined `cut_off_at_100` with a `def` statement, we actually did two separate things: we created a function that cuts off numbers at 100, and we gave it the name `cut_off_at_100`.\n",
"\n",
"We can refer to any function by writing its name, without the parentheses or arguments necessary to actually call it. We did this when we called `map` above. When we write a function's name by itself as the last line in a cell, Python produces a text representation of the function, just as it would display a number or a string value."
]
},
{
"cell_type": "code",
"execution_count": 70,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<function __main__.cut_off_at_100(x)>"
]
},
"execution_count": 70,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"cut_off_at_100"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Notice that we did not write `\"cut_off_at_100\"` with quotes (which is just a piece of text), or `cut_off_at_100()` (which is a function call, and an invalid one at that). We simply wrote `cut_off_at_100` to refer to the function.\n",
"\n",
"Just like we can define new names for other values, we can define new names for functions. For example, suppose we want to refer to our function as `cut_off` instead of `cut_off_at_100`. We can just write this:"
]
},
{
"cell_type": "code",
"execution_count": 71,
"metadata": {},
"outputs": [],
"source": [
"cut_off = cut_off_at_100"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now `cut_off` is a name for a function. It's the same function as `cut_off_at_100`, so the printed value is exactly the same."
]
},
{
"cell_type": "code",
"execution_count": 72,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<function __main__.cut_off_at_100(x)>"
]
},
"execution_count": 72,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"cut_off"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let us see another application of `map`."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Example: Prediction\n",
"Data Science is often used to make predictions about the future. If we are trying to predict an outcome for a particular individual – for example, how they will respond to a treatment, or whether they will buy a product – it is natural to base the prediction on the outcomes of other similar individuals.\n",
"\n",
"Charles Darwin's cousin [Sir Francis Galton](https://en.wikipedia.org/wiki/Francis_Galton) was a pioneer in using this idea to make predictions based on numerical data. He studied how physical characteristics are passed down from one generation to the next.\n",
"\n",
"The data below are Galton's carefully collected measurements on the heights of parents and their adult children. Each row corresponds to one adult child. The variables are a numerical code for the family, the heights (in inches) of the father and mother, a \"midparent height\" which is a weighted average [[1]](#footnotes) of the height of the two parents, the number of children in the family, as well as the child's birth rank (1 = oldest), gender, and height."
]
},
{
"cell_type": "code",
"execution_count": 73,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" family father mother midparentHeight children childNum gender \\\n",
"0 1 78.5 67.0 75.43 4 1 male \n",
"1 1 78.5 67.0 75.43 4 2 female \n",
"2 1 78.5 67.0 75.43 4 3 female \n",
"3 1 78.5 67.0 75.43 4 4 female \n",
"4 2 75.5 66.5 73.66 4 1 male \n",
".. ... ... ... ... ... ... ... \n",
"929 203 62.0 66.0 66.64 3 1 male \n",
"930 203 62.0 66.0 66.64 3 2 female \n",
"931 203 62.0 66.0 66.64 3 3 female \n",
"932 204 62.5 63.0 65.27 2 1 male \n",
"933 204 62.5 63.0 65.27 2 2 female \n",
"\n",
" childHeight \n",
"0 73.2 \n",
"1 69.2 \n",
"2 69.0 \n",
"3 69.0 \n",
"4 73.5 \n",
".. ... \n",
"929 64.0 \n",
"930 62.0 \n",
"931 61.0 \n",
"932 66.5 \n",
"933 57.0 \n",
"\n",
"[934 rows x 8 columns]"
]
},
"execution_count": 73,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"galton = pd.read_csv(path_data + 'galton.csv')\n",
"\n",
"galton"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A primary reason for collecting the data was to be able to predict the adult height of a child born to parents similar to those in the dataset. Let us try to do this, using midparent height as the variable on which to base our prediction. Thus midparent height is our *predictor* variable.\n",
"\n",
"The table `heights` consists of just the midparent heights and child's heights. The scatter plot of the two variables shows a positive association, as we would expect for these variables."
]
},
{
"cell_type": "code",
"execution_count": 74,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[Figure: scatter plot of Child (vertical axis) against MidParent (horizontal axis)]"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
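"# Build `heights` from the two relevant columns of `galton`\n",
"heights = pd.DataFrame({'MidParent': galton['midparentHeight'], 'Child': galton['childHeight']})\n",
"heights.plot.scatter('MidParent', 'Child')\n",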
"\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now suppose Galton encountered a new couple, similar to those in his dataset, and wondered how tall their child would be. What would be a good way for him to go about predicting the child's height, given that the midparent height was, say, 68 inches?\n",
"\n",
"One reasonable approach would be to base the prediction on all the points that correspond to a midparent height of around 68 inches. The prediction equals the average child's height calculated from those points alone.\n",
"\n",
"Let's pretend we are Galton and execute this plan. For now we will just make a reasonable definition of what \"around 68 inches\" means, and work with that. Later in the course we will examine the consequences of such choices.\n",
"\n",
"We will take \"close\" to mean \"within half an inch\". The figure below shows all the points corresponding to a midparent height between 67.5 inches and 68.5 inches. These are all the points in the strip between the red lines. Each of these points corresponds to one child; our prediction of the height of the new couple's child is the average height of all the children in the strip. That's represented by the gold dot.\n",
"\n",
"Ignore the code, and just focus on understanding the mental process of arriving at that gold dot."
]
},
{
"cell_type": "code",
"execution_count": 76,
"metadata": {},
"outputs": [
{
"data": {
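"text/plain": [
"[Figure: scatter plot of Child against MidParent, with red vertical lines at 67.5 and 68.5 and a gold dot at (68, 66.24)]"
]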
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"heights.plot.scatter('MidParent','Child')\n",
"_ = plt.plot([67.5, 67.5], [50, 85], color='red', lw=2)\n",
"_ = plt.plot([68.5, 68.5], [50, 85], color='red', lw=2)\n",
"_ = plt.scatter(68, 66.24, color='gold', s=40)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To calculate exactly where the gold dot should be, we first need to identify all the points in the strip. These correspond to the rows where `MidParent` is between 67.5 inches and 68.5 inches."
]
},
{
"cell_type": "code",
"execution_count": 77,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" MidParent Child\n",
"233 68.44 62.0\n",
"396 67.94 71.2\n",
"397 67.94 67.0\n",
"516 68.33 62.5\n",
"517 68.23 73.0\n",
".. ... ...\n",
"885 67.60 69.0\n",
"886 67.60 68.0\n",
"887 67.60 67.7\n",
"888 67.60 64.5\n",
"889 67.60 60.5\n",
"\n",
"[131 rows x 2 columns]"
]
},
"execution_count": 77,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"close_to_68 = heights[(heights['MidParent'] >= 67.5) & (heights['MidParent'] <= 68.5)]\n",
"close_to_68"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The predicted height of a child who has a midparent height of 68 inches is the average height of the children in these rows. That's 66.24 inches."
]
},
{
"cell_type": "code",
"execution_count": 78,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"66.24045801526718"
]
},
"execution_count": 78,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"close_to_68['Child'].mean()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We now have a way to predict the height of a child given any value of the midparent height near those in our dataset. We can define a function `predict_child` that does this. The body of the function consists of the code in the two cells above, apart from choices of names."
]
},
{
"cell_type": "code",
"execution_count": 79,
"metadata": {},
"outputs": [],
"source": [
"def predict_child(mpht):\n",
" \"\"\"Predict the height of a child whose parents have a midparent height of mpht.\n",
" \n",
" The prediction is the average height of the children whose midparent height is\n",
" in the range mpht plus or minus 0.5.\n",
" \"\"\"\n",
" \n",
" close_points = heights[(heights['MidParent'] >= (mpht-0.5)) & (heights['MidParent'] <= (mpht + 0.5))]\n",
" return close_points['Child'].mean() "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Given a midparent height of 68 inches, the function `predict_child` returns the same prediction (66.24 inches) as we got earlier. The advantage of defining the function is that we can easily change the value of the predictor and get a new prediction."
]
},
{
"cell_type": "code",
"execution_count": 80,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"66.24045801526718"
]
},
"execution_count": 80,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"predict_child(68)"
]
},
{
"cell_type": "code",
"execution_count": 81,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"70.41578947368421"
]
},
"execution_count": 81,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"predict_child(74)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"How good are these predictions? We can get a sense of this by comparing the predictions with the data that we already have. To do this, we first apply the function `predict_child` to the column of `MidParent` heights, and collect the results in a new column called `Prediction`.\n",
"\n",
"Here we use the Series method `map`, which works on a single column. To apply a function to every element of an entire table, pandas provides the DataFrame method `applymap` (renamed `DataFrame.map` in pandas 2.1)."
]
},
{
"cell_type": "code",
"execution_count": 82,
"metadata": {},
"outputs": [],
"source": [
"# Apply predict_child to all the midparent heights.\n",
"# Work on a copy so that the original `heights` table is left unchanged.\n",
"heights_with_predictions = heights.copy()\n",
"\n",
"heights_with_predictions['Prediction'] = heights['MidParent'].map(predict_child)"
]
},
{
"cell_type": "code",
"execution_count": 83,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"