Least_Squares_Regression.html 20 KB

123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100101102103104105106107108109110111112113114115116117118119120121122123124125126127128129130131132133134135136137138139140141142143144145146147148149150151152153154155156157158159160161162163164165166167168169170171172173174175176177178179180181182183184185186187188189190191192193194195196197198199200201202203204205206207208209210211212213214215216217218219220221222223224225226227228229230231232233234235236237238239240241242243244245246247248249250251252253254255256257258259260261262263264265266267268269270271272273274275276277278279280281282283284285286287288289290291292293294295296297298299300301302303304305306307308309310311312313314315316317318319320321322323324325326327328329330331332333334335336337338339340341342343344345346347348349350351352353354355356357358359360361362363364365366367368369370371372373374375376377378379380381382383384385386387388389390391392393394395396397398399400401402403404405406407408409410411412413414415416417418419420421422423424425426427428429430431432433434435436437438439440441442443444445446447448449450451452453454455456457458459460461462463464465466467468469470471472473474475476477478479480481482483484485486487488489490491492493494495496497498499500501502503504505506507508509510511512513514515516517518519520521522523524525526527528529530531532533534535536537538539540541542543544545546547548549550551552553554555556557558559560561562563564565566567568569570571572573574575576577578579580
  1. ---
  2. redirect_from:
  3. - "/chapters/15/4/least-squares-regression"
  4. interact_link: content/chapters/15/4/Least_Squares_Regression.ipynb
  5. kernel_name: python3
  6. has_widgets: false
  7. title: |-
  8. Least Squares Regression
  9. prev_page:
  10. url: /chapters/15/3/Method_of_Least_Squares.html
  11. title: |-
  12. The Method of Least Squares
  13. next_page:
  14. url: /chapters/15/5/Visual_Diagnostics.html
  15. title: |-
  16. Visual Diagnostics
  17. comment: "***PROGRAMMATICALLY GENERATED, DO NOT EDIT. SEE ORIGINAL FILES IN /content***"
  18. ---
  19. <div class="jb_cell tag_remove_input">
  20. <div class="cell border-box-sizing code_cell rendered">
  21. </div>
  22. </div>
  23. <div class="jb_cell tag_remove_input">
  24. <div class="cell border-box-sizing code_cell rendered">
  25. </div>
  26. </div>
  27. <div class="jb_cell">
  28. <div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
  29. <div class="text_cell_render border-box-sizing rendered_html">
  30. <h3 id="Least-Squares-Regression">Least Squares Regression<a class="anchor-link" href="#Least-Squares-Regression"> </a></h3><p>In an earlier section, we developed formulas for the slope and intercept of the regression line through a <em>football shaped</em> scatter diagram. It turns out that the slope and intercept of the least squares line have the same formulas as those we developed, <em>regardless of the shape of the scatter plot</em>.</p>
  31. <p>We saw this in the example about Little Women, but let's confirm it in an example where the scatter plot clearly isn't football shaped. For the data, we are once again indebted to the rich <a href="http://www.stat.ufl.edu/~winner/datasets.html">data archive of Prof. Larry Winner</a> of the University of Florida. A <a href="http://digitalcommons.wku.edu/ijes/vol6/iss2/10/">2013 study</a> in the International Journal of Exercise Science studied collegiate shot put athletes and examined the relation between strength and shot put distance. The population consists of 28 female collegiate athletes. Strength was measured by the the biggest amount (in kilograms) that the athlete lifted in the "1RM power clean" in the pre-season. The distance (in meters) was the athlete's personal best.</p>
  32. </div>
  33. </div>
  34. </div>
  35. </div>
  36. <div class="jb_cell">
  37. <div class="cell border-box-sizing code_cell rendered">
  38. <div class="input">
  39. <div class="inner_cell">
  40. <div class="input_area">
  41. <div class=" highlight hl-ipython3"><pre><span></span><span class="n">shotput</span> <span class="o">=</span> <span class="n">Table</span><span class="o">.</span><span class="n">read_table</span><span class="p">(</span><span class="n">path_data</span> <span class="o">+</span> <span class="s1">&#39;shotput.csv&#39;</span><span class="p">)</span>
  42. </pre></div>
  43. </div>
  44. </div>
  45. </div>
  46. </div>
  47. </div>
  48. <div class="jb_cell">
  49. <div class="cell border-box-sizing code_cell rendered">
  50. <div class="input">
  51. <div class="inner_cell">
  52. <div class="input_area">
  53. <div class=" highlight hl-ipython3"><pre><span></span><span class="n">shotput</span>
  54. </pre></div>
  55. </div>
  56. </div>
  57. </div>
  58. <div class="output_wrapper">
  59. <div class="output">
  60. <div class="jb_output_wrapper }}">
  61. <div class="output_area">
  62. <div class="output_html rendered_html output_subarea output_execute_result">
  63. <table border="1" class="dataframe">
  64. <thead>
  65. <tr>
  66. <th>Weight Lifted</th> <th>Shot Put Distance</th>
  67. </tr>
  68. </thead>
  69. <tbody>
  70. <tr>
  71. <td>37.5 </td> <td>6.4 </td>
  72. </tr>
  73. <tr>
  74. <td>51.5 </td> <td>10.2 </td>
  75. </tr>
  76. <tr>
  77. <td>61.3 </td> <td>12.4 </td>
  78. </tr>
  79. <tr>
  80. <td>61.3 </td> <td>13 </td>
  81. </tr>
  82. <tr>
  83. <td>63.6 </td> <td>13.2 </td>
  84. </tr>
  85. <tr>
  86. <td>66.1 </td> <td>13 </td>
  87. </tr>
  88. <tr>
  89. <td>70 </td> <td>12.7 </td>
  90. </tr>
  91. <tr>
  92. <td>92.7 </td> <td>13.9 </td>
  93. </tr>
  94. <tr>
  95. <td>90.5 </td> <td>15.5 </td>
  96. </tr>
  97. <tr>
  98. <td>90.5 </td> <td>15.8 </td>
  99. </tr>
  100. </tbody>
  101. </table>
  102. <p>... (18 rows omitted)</p>
  103. </div>
  104. </div>
  105. </div>
  106. </div>
  107. </div>
  108. </div>
  109. </div>
  110. <div class="jb_cell">
  111. <div class="cell border-box-sizing code_cell rendered">
  112. <div class="input">
  113. <div class="inner_cell">
  114. <div class="input_area">
  115. <div class=" highlight hl-ipython3"><pre><span></span><span class="n">shotput</span><span class="o">.</span><span class="n">scatter</span><span class="p">(</span><span class="s1">&#39;Weight Lifted&#39;</span><span class="p">)</span>
  116. </pre></div>
  117. </div>
  118. </div>
  119. </div>
  120. <div class="output_wrapper">
  121. <div class="output">
  122. <div class="jb_output_wrapper }}">
  123. <div class="output_area">
  124. <div class="output_png output_subarea ">
  125. <img src="../../../images/chapters/15/4/Least_Squares_Regression_5_0.png"
  126. >
  127. </div>
  128. </div>
  129. </div>
  130. </div>
  131. </div>
  132. </div>
  133. </div>
  134. <div class="jb_cell">
  135. <div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
  136. <div class="text_cell_render border-box-sizing rendered_html">
  137. <p>That's not a football shaped scatter plot. In fact, it seems to have a slight non-linear component. But if we insist on using a straight line to make our predictions, there is still one best straight line among all straight lines.</p>
  138. <p>Our formulas for the slope and intercept of the regression line, derived for football shaped scatter plots, give the following values.</p>
  139. </div>
  140. </div>
  141. </div>
  142. </div>
  143. <div class="jb_cell">
  144. <div class="cell border-box-sizing code_cell rendered">
  145. <div class="input">
  146. <div class="inner_cell">
  147. <div class="input_area">
  148. <div class=" highlight hl-ipython3"><pre><span></span><span class="n">slope</span><span class="p">(</span><span class="n">shotput</span><span class="p">,</span> <span class="s1">&#39;Weight Lifted&#39;</span><span class="p">,</span> <span class="s1">&#39;Shot Put Distance&#39;</span><span class="p">)</span>
  149. </pre></div>
  150. </div>
  151. </div>
  152. </div>
  153. <div class="output_wrapper">
  154. <div class="output">
  155. <div class="jb_output_wrapper }}">
  156. <div class="output_area">
  157. <div class="output_text output_subarea output_execute_result">
  158. <pre>0.09834382159781997</pre>
  159. </div>
  160. </div>
  161. </div>
  162. </div>
  163. </div>
  164. </div>
  165. </div>
  166. <div class="jb_cell">
  167. <div class="cell border-box-sizing code_cell rendered">
  168. <div class="input">
  169. <div class="inner_cell">
  170. <div class="input_area">
  171. <div class=" highlight hl-ipython3"><pre><span></span><span class="n">intercept</span><span class="p">(</span><span class="n">shotput</span><span class="p">,</span> <span class="s1">&#39;Weight Lifted&#39;</span><span class="p">,</span> <span class="s1">&#39;Shot Put Distance&#39;</span><span class="p">)</span>
  172. </pre></div>
  173. </div>
  174. </div>
  175. </div>
  176. <div class="output_wrapper">
  177. <div class="output">
  178. <div class="jb_output_wrapper }}">
  179. <div class="output_area">
  180. <div class="output_text output_subarea output_execute_result">
  181. <pre>5.959629098373952</pre>
  182. </div>
  183. </div>
  184. </div>
  185. </div>
  186. </div>
  187. </div>
  188. </div>
  189. <div class="jb_cell">
  190. <div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
  191. <div class="text_cell_render border-box-sizing rendered_html">
  192. <p>Does it still make sense to use these formulas even though the scatter plot isn't football shaped? We can answer this by finding the slope and intercept of the line that minimizes the mse.</p>
  193. <p>We will define the function <code>shotput_linear_mse</code> to take an arbirtary slope and intercept as arguments and return the corresponding mse. Then <code>minimize</code> applied to <code>shotput_linear_mse</code> will return the best slope and intercept.</p>
  194. </div>
  195. </div>
  196. </div>
  197. </div>
  198. <div class="jb_cell">
  199. <div class="cell border-box-sizing code_cell rendered">
  200. <div class="input">
  201. <div class="inner_cell">
  202. <div class="input_area">
  203. <div class=" highlight hl-ipython3"><pre><span></span><span class="k">def</span> <span class="nf">shotput_linear_mse</span><span class="p">(</span><span class="n">any_slope</span><span class="p">,</span> <span class="n">any_intercept</span><span class="p">):</span>
  204. <span class="n">x</span> <span class="o">=</span> <span class="n">shotput</span><span class="o">.</span><span class="n">column</span><span class="p">(</span><span class="s1">&#39;Weight Lifted&#39;</span><span class="p">)</span>
  205. <span class="n">y</span> <span class="o">=</span> <span class="n">shotput</span><span class="o">.</span><span class="n">column</span><span class="p">(</span><span class="s1">&#39;Shot Put Distance&#39;</span><span class="p">)</span>
  206. <span class="n">fitted</span> <span class="o">=</span> <span class="n">any_slope</span><span class="o">*</span><span class="n">x</span> <span class="o">+</span> <span class="n">any_intercept</span>
  207. <span class="k">return</span> <span class="n">np</span><span class="o">.</span><span class="n">mean</span><span class="p">((</span><span class="n">y</span> <span class="o">-</span> <span class="n">fitted</span><span class="p">)</span> <span class="o">**</span> <span class="mi">2</span><span class="p">)</span>
  208. </pre></div>
  209. </div>
  210. </div>
  211. </div>
  212. </div>
  213. </div>
  214. <div class="jb_cell">
  215. <div class="cell border-box-sizing code_cell rendered">
  216. <div class="input">
  217. <div class="inner_cell">
  218. <div class="input_area">
  219. <div class=" highlight hl-ipython3"><pre><span></span><span class="n">minimize</span><span class="p">(</span><span class="n">shotput_linear_mse</span><span class="p">)</span>
  220. </pre></div>
  221. </div>
  222. </div>
  223. </div>
  224. <div class="output_wrapper">
  225. <div class="output">
  226. <div class="jb_output_wrapper }}">
  227. <div class="output_area">
  228. <div class="output_text output_subarea output_execute_result">
  229. <pre>array([0.09834382, 5.95962911])</pre>
  230. </div>
  231. </div>
  232. </div>
  233. </div>
  234. </div>
  235. </div>
  236. </div>
  237. <div class="jb_cell">
  238. <div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
  239. <div class="text_cell_render border-box-sizing rendered_html">
  240. <p>These values are the same as those we got by using our formulas. To summarize:</p>
  241. <p><strong>No matter what the shape of the scatter plot, there is a unique line that minimizes the mean squared error of estimation. It is called the regression line, and its slope and intercept are given by</strong></p>
  242. $$
  243. \mathbf{\mbox{slope of the regression line}} ~=~ r \cdot
  244. \frac{\mbox{SD of }y}{\mbox{SD of }x}
  245. $$$$
  246. \mathbf{\mbox{intercept of the regression line}} ~=~
  247. \mbox{average of }y ~-~ \mbox{slope} \cdot \mbox{average of }x
  248. $$
  249. </div>
  250. </div>
  251. </div>
  252. </div>
  253. <div class="jb_cell">
  254. <div class="cell border-box-sizing code_cell rendered">
  255. <div class="input">
  256. <div class="inner_cell">
  257. <div class="input_area">
  258. <div class=" highlight hl-ipython3"><pre><span></span><span class="n">fitted</span> <span class="o">=</span> <span class="n">fit</span><span class="p">(</span><span class="n">shotput</span><span class="p">,</span> <span class="s1">&#39;Weight Lifted&#39;</span><span class="p">,</span> <span class="s1">&#39;Shot Put Distance&#39;</span><span class="p">)</span>
  259. <span class="n">shotput</span><span class="o">.</span><span class="n">with_column</span><span class="p">(</span><span class="s1">&#39;Best Straight Line&#39;</span><span class="p">,</span> <span class="n">fitted</span><span class="p">)</span><span class="o">.</span><span class="n">scatter</span><span class="p">(</span><span class="s1">&#39;Weight Lifted&#39;</span><span class="p">)</span>
  260. </pre></div>
  261. </div>
  262. </div>
  263. </div>
  264. <div class="output_wrapper">
  265. <div class="output">
  266. <div class="jb_output_wrapper }}">
  267. <div class="output_area">
  268. <div class="output_png output_subarea ">
  269. <img src="../../../images/chapters/15/4/Least_Squares_Regression_13_0.png"
  270. >
  271. </div>
  272. </div>
  273. </div>
  274. </div>
  275. </div>
  276. </div>
  277. </div>
  278. <div class="jb_cell">
  279. <div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
  280. <div class="text_cell_render border-box-sizing rendered_html">
  281. <h3 id="Nonlinear-Regression">Nonlinear Regression<a class="anchor-link" href="#Nonlinear-Regression"> </a></h3><p>The graph above reinforces our earlier observation that the scatter plot is a bit curved. So it is better to fit a curve than a straight line. The <a href="http://digitalcommons.wku.edu/ijes/vol6/iss2/10/">study</a> postulated a quadratic relation between the weight lifted and the shot put distance. So let's use quadratic functions as our predictors and see if we can find the best one.</p>
  282. <p>We have to find the best quadratic function among all quadratic functions, instead of the best straight line among all straight lines. The method of least squares allows us to do this.</p>
  283. <p>The mathematics of this minimization is complicated and not easy to see just by examining the scatter plot. But numerical minimization is just as easy as it was with linear predictors! We can get the best quadratic predictor by once again using <code>minimize</code>. Let's see how this works.</p>
  284. </div>
  285. </div>
  286. </div>
  287. </div>
  288. <div class="jb_cell">
  289. <div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
  290. <div class="text_cell_render border-box-sizing rendered_html">
  291. <p>Recall that a quadratic function has the form</p>
  292. $$
  293. f(x) ~=~ ax^2 + bx + c
  294. $$<p>for constants $a$, $b$, and $c$.</p>
  295. <p>To find the best quadratic function to predict distance based on weight lifted, using the criterion of least squares, we will first write a function that takes the three constants as its arguments, calculates the fitted values by using the quadratic function above, and then returns the mean squared error.</p>
  296. <p>The function is called <code>shotput_quadratic_mse</code>. Notice that the definition is analogous to that of <code>lw_mse</code>, except that the fitted values are based on a quadratic function instead of linear.</p>
  297. </div>
  298. </div>
  299. </div>
  300. </div>
  301. <div class="jb_cell">
  302. <div class="cell border-box-sizing code_cell rendered">
  303. <div class="input">
  304. <div class="inner_cell">
  305. <div class="input_area">
  306. <div class=" highlight hl-ipython3"><pre><span></span><span class="k">def</span> <span class="nf">shotput_quadratic_mse</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">,</span> <span class="n">c</span><span class="p">):</span>
  307. <span class="n">x</span> <span class="o">=</span> <span class="n">shotput</span><span class="o">.</span><span class="n">column</span><span class="p">(</span><span class="s1">&#39;Weight Lifted&#39;</span><span class="p">)</span>
  308. <span class="n">y</span> <span class="o">=</span> <span class="n">shotput</span><span class="o">.</span><span class="n">column</span><span class="p">(</span><span class="s1">&#39;Shot Put Distance&#39;</span><span class="p">)</span>
  309. <span class="n">fitted</span> <span class="o">=</span> <span class="n">a</span><span class="o">*</span><span class="p">(</span><span class="n">x</span><span class="o">**</span><span class="mi">2</span><span class="p">)</span> <span class="o">+</span> <span class="n">b</span><span class="o">*</span><span class="n">x</span> <span class="o">+</span> <span class="n">c</span>
  310. <span class="k">return</span> <span class="n">np</span><span class="o">.</span><span class="n">mean</span><span class="p">((</span><span class="n">y</span> <span class="o">-</span> <span class="n">fitted</span><span class="p">)</span> <span class="o">**</span> <span class="mi">2</span><span class="p">)</span>
  311. </pre></div>
  312. </div>
  313. </div>
  314. </div>
  315. </div>
  316. </div>
  317. <div class="jb_cell">
  318. <div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
  319. <div class="text_cell_render border-box-sizing rendered_html">
  320. <p>We can now use <code>minimize</code> just as before to find the constants that minimize the mean squared error.</p>
  321. </div>
  322. </div>
  323. </div>
  324. </div>
  325. <div class="jb_cell">
  326. <div class="cell border-box-sizing code_cell rendered">
  327. <div class="input">
  328. <div class="inner_cell">
  329. <div class="input_area">
  330. <div class=" highlight hl-ipython3"><pre><span></span><span class="n">best</span> <span class="o">=</span> <span class="n">minimize</span><span class="p">(</span><span class="n">shotput_quadratic_mse</span><span class="p">)</span>
  331. <span class="n">best</span>
  332. </pre></div>
  333. </div>
  334. </div>
  335. </div>
  336. <div class="output_wrapper">
  337. <div class="output">
  338. <div class="jb_output_wrapper }}">
  339. <div class="output_area">
  340. <div class="output_text output_subarea output_execute_result">
  341. <pre>array([-1.04004838e-03, 2.82708045e-01, -1.53182115e+00])</pre>
  342. </div>
  343. </div>
  344. </div>
  345. </div>
  346. </div>
  347. </div>
  348. </div>
  349. <div class="jb_cell">
  350. <div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
  351. <div class="text_cell_render border-box-sizing rendered_html">
  352. <p>Our prediction of the shot put distance for an athlete who lifts $x$ kilograms is about
  353. $$
  354. -0.00104x^2 ~+~ 0.2827x - 1.5318
  355. $$
  356. meters. For example, if the athlete can lift 100 kilograms, the predicted distance is 16.33 meters. On the scatter plot, that's near the center of a vertical strip around 100 kilograms.</p>
  357. </div>
  358. </div>
  359. </div>
  360. </div>
  361. <div class="jb_cell">
  362. <div class="cell border-box-sizing code_cell rendered">
  363. <div class="input">
  364. <div class="inner_cell">
  365. <div class="input_area">
  366. <div class=" highlight hl-ipython3"><pre><span></span><span class="p">(</span><span class="o">-</span><span class="mf">0.00104</span><span class="p">)</span><span class="o">*</span><span class="p">(</span><span class="mi">100</span><span class="o">**</span><span class="mi">2</span><span class="p">)</span> <span class="o">+</span> <span class="mf">0.2827</span><span class="o">*</span><span class="mi">100</span> <span class="o">-</span> <span class="mf">1.5318</span>
  367. </pre></div>
  368. </div>
  369. </div>
  370. </div>
  371. <div class="output_wrapper">
  372. <div class="output">
  373. <div class="jb_output_wrapper }}">
  374. <div class="output_area">
  375. <div class="output_text output_subarea output_execute_result">
  376. <pre>16.3382</pre>
  377. </div>
  378. </div>
  379. </div>
  380. </div>
  381. </div>
  382. </div>
  383. </div>
  384. <div class="jb_cell">
  385. <div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
  386. <div class="text_cell_render border-box-sizing rendered_html">
  387. <p>Here are the predictions for all the values of <code>Weight Lifted</code>. You can see that they go through the center of the scatter plot, to a rough approximation.</p>
  388. </div>
  389. </div>
  390. </div>
  391. </div>
  392. <div class="jb_cell">
  393. <div class="cell border-box-sizing code_cell rendered">
  394. <div class="input">
  395. <div class="inner_cell">
  396. <div class="input_area">
  397. <div class=" highlight hl-ipython3"><pre><span></span><span class="n">x</span> <span class="o">=</span> <span class="n">shotput</span><span class="o">.</span><span class="n">column</span><span class="p">(</span><span class="mi">0</span><span class="p">)</span>
  398. <span class="n">shotput_fit</span> <span class="o">=</span> <span class="n">best</span><span class="o">.</span><span class="n">item</span><span class="p">(</span><span class="mi">0</span><span class="p">)</span><span class="o">*</span><span class="p">(</span><span class="n">x</span><span class="o">**</span><span class="mi">2</span><span class="p">)</span> <span class="o">+</span> <span class="n">best</span><span class="o">.</span><span class="n">item</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span><span class="o">*</span><span class="n">x</span> <span class="o">+</span> <span class="n">best</span><span class="o">.</span><span class="n">item</span><span class="p">(</span><span class="mi">2</span><span class="p">)</span>
  399. </pre></div>
  400. </div>
  401. </div>
  402. </div>
  403. </div>
  404. </div>
  405. <div class="jb_cell">
  406. <div class="cell border-box-sizing code_cell rendered">
  407. <div class="input">
  408. <div class="inner_cell">
  409. <div class="input_area">
  410. <div class=" highlight hl-ipython3"><pre><span></span><span class="n">shotput</span><span class="o">.</span><span class="n">with_column</span><span class="p">(</span><span class="s1">&#39;Best Quadratic Curve&#39;</span><span class="p">,</span> <span class="n">shotput_fit</span><span class="p">)</span><span class="o">.</span><span class="n">scatter</span><span class="p">(</span><span class="mi">0</span><span class="p">)</span>
  411. </pre></div>
  412. </div>
  413. </div>
  414. </div>
  415. <div class="output_wrapper">
  416. <div class="output">
  417. <div class="jb_output_wrapper }}">
  418. <div class="output_area">
  419. <div class="output_png output_subarea ">
  420. <img src="../../../images/chapters/15/4/Least_Squares_Regression_23_0.png"
  421. >
  422. </div>
  423. </div>
  424. </div>
  425. </div>
  426. </div>
  427. </div>
  428. </div>