{ "cells": [ { "cell_type": "code", "execution_count": 1, "metadata": { "tags": [ "remove_input" ] }, "outputs": [], "source": [ "path_data = '../../data/'\n", "\n", "import numpy as np\n", "import pandas as pd\n", "\n", "%matplotlib inline\n", "import matplotlib.pyplot as plt\n", "plt.style.use('fivethirtyeight')\n", "\n", "import functools\n", "\n", "import warnings\n", "warnings.filterwarnings('ignore')" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "from scipy import optimize\n", "\n", "def minimize(f, start=None, smooth=False, log=None, array=False, **vargs):\n", " \"\"\"Minimize a function f of one or more arguments.\n", " Args:\n", " f: A function that takes numbers and returns a number\n", " start: A starting value or list of starting values\n", " smooth: Whether to assume that f is smooth and use first-order info\n", " log: Logging function called on the result of optimization (e.g. print)\n", " vargs: Other named arguments passed to scipy.optimize.minimize\n", " Returns either:\n", " (a) the minimizing argument of a one-argument function\n", " (b) an array of minimizing arguments of a multi-argument function\n", " \"\"\"\n", " if start is None:\n", " assert not array, \"Please pass starting values explicitly when array=True\"\n", " arg_count = f.__code__.co_argcount\n", " assert arg_count > 0, \"Please pass starting values explicitly for variadic functions\"\n", " start = [0] * arg_count\n", " if not hasattr(start, '__len__'):\n", " start = [start]\n", "\n", " if array:\n", " objective = f\n", " else:\n", " @functools.wraps(f)\n", " def objective(args):\n", " return f(*args)\n", "\n", " if not smooth and 'method' not in vargs:\n", " vargs['method'] = 'Powell'\n", " result = optimize.minimize(objective, start, **vargs)\n", " if log is not None:\n", " log(result)\n", " if len(start) == 1:\n", " return result.x.item(0)\n", " else:\n", " return result.x" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "tags": [ "remove_input" ] }, "outputs": [], "source": [ "\n", "def standard_units(any_numbers):\n", " \"Convert any array of numbers to standard units.\"\n", " return (any_numbers - np.mean(any_numbers))/np.std(any_numbers) \n", "\n", "def correlation(t, x, y):\n", " return np.mean(standard_units(t[x])*standard_units(t[y]))\n", "\n", "def slope(table, x, y):\n", " r = correlation(table, x, y)\n", " return r * np.std(table[y]/np.std(table[x]))\n", "\n", "def intercept(table, x, y):\n", " a = slope(table, x, y)\n", " return np.mean(table[y]) - a * np.mean(table[x])\n", "\n", "def fit(table, x, y):\n", " \"\"\"Return the height of the regression line at each x value.\"\"\"\n", " a = slope(table, x, y)\n", " b = intercept(table, x, y)\n", " return a * table[x] + b" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "# Least Squares Regression\n", "In an earlier section, we developed formulas for the slope and intercept of the regression line through a *football shaped* scatter diagram. It turns out that the slope and intercept of the least squares line have the same formulas as those we developed, *regardless of the shape of the scatter plot*.\n", "\n", "We saw this in the example about Little Women, but let's confirm it in an example where the scatter plot clearly isn't football shaped. For the data, we are once again indebted to the rich [data archive of Prof. Larry Winner](http://www.stat.ufl.edu/~winner/datasets.html) of the University of Florida. A [2013 study](http://digitalcommons.wku.edu/ijes/vol6/iss2/10/) in the International Journal of Exercise Science studied collegiate shot put athletes and examined the relation between strength and shot put distance. The population consists of 28 female collegiate athletes. Strength was measured by the the biggest amount (in kilograms) that the athlete lifted in the \"1RM power clean\" in the pre-season. The distance (in meters) was the athlete's personal best." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "shotput = pd.read_csv(path_data + 'shotput.csv')" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", " | Weight Lifted | \n", "Shot Put Distance | \n", "
---|---|---|
0 | \n", "37.5 | \n", "6.4 | \n", "
1 | \n", "51.5 | \n", "10.2 | \n", "
2 | \n", "61.3 | \n", "12.4 | \n", "
3 | \n", "61.3 | \n", "13.0 | \n", "
4 | \n", "63.6 | \n", "13.2 | \n", "
5 | \n", "66.1 | \n", "13.0 | \n", "
6 | \n", "70.0 | \n", "12.7 | \n", "
7 | \n", "92.7 | \n", "13.9 | \n", "
8 | \n", "90.5 | \n", "15.5 | \n", "
9 | \n", "90.5 | \n", "15.8 | \n", "