{ "cells": [ { "cell_type": "code", "execution_count": 1, "metadata": { "tags": [ "remove_input" ] }, "outputs": [], "source": [ "path_data = '../../data/'\n", "\n", "import numpy as np\n", "import pandas as pd\n", "from scipy import stats\n", "\n", "%matplotlib inline\n", "import matplotlib.pyplot as plt\n", "plt.style.use('fivethirtyeight')\n", "\n", "import warnings\n", "warnings.filterwarnings('ignore')" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "# HIDDEN \n", "\n", "galton = pd.read_csv(path_data + 'galton.csv')\n", "heights = galton[['midparentHeight', 'childHeight']]\n", "\n", "heights = heights.rename(columns={'midparentHeight':'MidParent', 'childHeight':'Child'})\n", "\n", "hybrid = pd.read_csv(path_data + 'hybrid.csv')" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "tags": [ "remove_input" ] }, "outputs": [], "source": [ "def standard_units(x):\n", " return (x - np.mean(x))/np.std(x)\n", "\n", "def correlation(table, x, y):\n", " x_in_standard_units = standard_units(table[x])\n", " y_in_standard_units = standard_units(table[y])\n", " return np.mean(x_in_standard_units * y_in_standard_units)\n", "\n", "def slope(table, x, y):\n", " r = correlation(table, x, y)\n", " return r * np.std(table[y])/np.std(table[x])\n", "\n", "def intercept(table, x, y):\n", " a = slope(table, x, y)\n", " return np.mean(table[y]) - a * np.mean(table[x])\n", "\n", "def fit(table, x, y):\n", " a = slope(table, x, y)\n", " b = intercept(table, x, y)\n", " return a * table[x] + b" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Visual Diagnostics\n", "Suppose a data scientist has decided to use linear regression to estimate values of one variable (called the response variable) based on another variable (called the predictor). To see how well this method of estimation performs, the data scientist must measure how far off the estimates are from the actual values. 
These differences are called *residuals*.\n", "\n", "$$\n", "\\mbox{residual} ~=~ \\mbox{observed value} ~-~ \\mbox{regression estimate}\n", "$$\n", "\n", "A residual is what's left over – the residue – after estimation. \n", "\n", "Residuals are the vertical distances of the points from the regression line. There is one residual for each point in the scatter plot. The residual is the difference between the observed value of $y$ and the fitted value of $y$, so for the point $(x, y)$,\n", "\n", "$$\n", "\\mbox{residual} ~~ = ~~ y ~-~\n", "\\mbox{fitted value of }y\n", "~~ = ~~ y ~-~\n", "\\mbox{height of regression line at }x\n", "$$\n", "\n", "The function `residual` calculates the residuals. The calculation assumes all the relevant functions we have already defined: `standard_units`, `correlation`, `slope`, `intercept`, and `fit`." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "def residual(table, x, y):\n", " return table[y] - fit(table, x, y)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Continuing our example of using Galton's data to estimate the heights of adult children (the response) based on the midparent height (the predictor), let us calculate the fitted values and the residuals." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n",
"| | MidParent | Child | Fitted Value | Residual |\n",
"|---|---|---|---|---|\n",
"| 0 | 75.43 | 73.2 | 70.712373 | 2.487627 |\n",
"| 1 | 75.43 | 69.2 | 70.712373 | -1.512373 |\n",
"| 2 | 75.43 | 69.0 | 70.712373 | -1.712373 |\n",
"| 3 | 75.43 | 69.0 | 70.712373 | -1.712373 |\n",
"| 4 | 73.66 | 73.5 | 69.584244 | 3.915756 |\n",
"| ... | ... | ... | ... | ... |\n",
"| 929 | 66.64 | 64.0 | 65.109971 | -1.109971 |\n",
"| 930 | 66.64 | 62.0 | 65.109971 | -3.109971 |\n",
"| 931 | 66.64 | 61.0 | 65.109971 | -4.109971 |\n",
"| 932 | 65.27 | 66.5 | 64.236786 | 2.263214 |\n",
"| 933 | 65.27 | 57.0 | 64.236786 | -7.236786 |\n",
"\n",
"934 rows × 4 columns
\n",
"\n",
"| | Length | Age |\n",
"|---|---|---|\n",
"| 0 | 1.80 | 1.0 |\n",
"| 1 | 1.85 | 1.5 |\n",
"| 2 | 1.87 | 1.5 |\n",
"| 3 | 1.77 | 1.5 |\n",
"| 4 | 2.02 | 2.5 |\n",
"| 5 | 2.27 | 4.0 |\n",
"| 6 | 2.15 | 5.0 |\n",
"| 7 | 2.26 | 5.0 |\n",
"| 8 | 2.35 | 7.0 |\n",
"| 9 | 2.47 | 8.0 |\n", "