---
redirect_from:
- "/chapters/17/6/multiple-regression"
interact_link: content/chapters/17/6/Multiple_Regression.ipynb
kernel_name: python3
has_widgets: false
title: |-
  Multiple Regression
prev_page:
  url: /chapters/17/5/Accuracy_of_the_Classifier.html
  title: |-
    The Accuracy of the Classifier
next_page:
  url: /chapters/18/Updating_Predictions.html
  title: |-
    Updating Predictions
comment: "***PROGRAMMATICALLY GENERATED, DO NOT EDIT. SEE ORIGINAL FILES IN /content***"
---
<div class="jb_cell tag_remove_input">
<div class="cell border-box-sizing code_cell rendered">
</div>
</div>
<div class="jb_cell tag_remove_input">
<div class="cell border-box-sizing code_cell rendered">
</div>
</div>
<div class="jb_cell">
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>Now that we have explored ways to use multiple attributes to predict a categorical variable, let us return to predicting a quantitative variable. Predicting a numerical quantity is called regression, and a commonly used method that uses multiple attributes for regression is called <em>multiple linear regression</em>.</p>
<h2 id="Home-Prices">Home Prices<a class="anchor-link" href="#Home-Prices"> </a></h2><p>The following dataset of house prices and attributes was collected over several years for the city of Ames, Iowa. A <a href="http://ww2.amstat.org/publications/jse/v19n3/decock.pdf">description of the dataset appears online</a>. We will focus on only a subset of the columns and will try to predict the sale price column from the others.</p>
</div>
</div>
</div>
</div>
<div class="jb_cell">
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="n">all_sales</span> <span class="o">=</span> <span class="n">Table</span><span class="o">.</span><span class="n">read_table</span><span class="p">(</span><span class="n">path_data</span> <span class="o">+</span> <span class="s1">&#39;house.csv&#39;</span><span class="p">)</span>
<span class="n">sales</span> <span class="o">=</span> <span class="n">all_sales</span><span class="o">.</span><span class="n">where</span><span class="p">(</span><span class="s1">&#39;Bldg Type&#39;</span><span class="p">,</span> <span class="s1">&#39;1Fam&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">where</span><span class="p">(</span><span class="s1">&#39;Sale Condition&#39;</span><span class="p">,</span> <span class="s1">&#39;Normal&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">select</span><span class="p">(</span>
    <span class="s1">&#39;SalePrice&#39;</span><span class="p">,</span> <span class="s1">&#39;1st Flr SF&#39;</span><span class="p">,</span> <span class="s1">&#39;2nd Flr SF&#39;</span><span class="p">,</span>
    <span class="s1">&#39;Total Bsmt SF&#39;</span><span class="p">,</span> <span class="s1">&#39;Garage Area&#39;</span><span class="p">,</span>
    <span class="s1">&#39;Wood Deck SF&#39;</span><span class="p">,</span> <span class="s1">&#39;Open Porch SF&#39;</span><span class="p">,</span> <span class="s1">&#39;Lot Area&#39;</span><span class="p">,</span>
    <span class="s1">&#39;Year Built&#39;</span><span class="p">,</span> <span class="s1">&#39;Yr Sold&#39;</span><span class="p">)</span>
<span class="n">sales</span><span class="o">.</span><span class="n">sort</span><span class="p">(</span><span class="s1">&#39;SalePrice&#39;</span><span class="p">)</span>
</pre></div>
</div>
</div>
</div>
<div class="output_wrapper">
<div class="output">
<div class="jb_output_wrapper">
<div class="output_area">
<div class="output_html rendered_html output_subarea output_execute_result">
<table border="1" class="dataframe">
<thead>
<tr>
<th>SalePrice</th> <th>1st Flr SF</th> <th>2nd Flr SF</th> <th>Total Bsmt SF</th> <th>Garage Area</th> <th>Wood Deck SF</th> <th>Open Porch SF</th> <th>Lot Area</th> <th>Year Built</th> <th>Yr Sold</th>
</tr>
</thead>
<tbody>
<tr>
<td>35000 </td> <td>498 </td> <td>0 </td> <td>498 </td> <td>216 </td> <td>0 </td> <td>0 </td> <td>8088 </td> <td>1922 </td> <td>2006 </td>
</tr>
<tr>
<td>39300 </td> <td>334 </td> <td>0 </td> <td>0 </td> <td>0 </td> <td>0 </td> <td>0 </td> <td>5000 </td> <td>1946 </td> <td>2007 </td>
</tr>
<tr>
<td>40000 </td> <td>649 </td> <td>668 </td> <td>649 </td> <td>250 </td> <td>0 </td> <td>54 </td> <td>8500 </td> <td>1920 </td> <td>2008 </td>
</tr>
<tr>
<td>45000 </td> <td>612 </td> <td>0 </td> <td>0 </td> <td>308 </td> <td>0 </td> <td>0 </td> <td>5925 </td> <td>1940 </td> <td>2009 </td>
</tr>
<tr>
<td>52000 </td> <td>729 </td> <td>0 </td> <td>270 </td> <td>0 </td> <td>0 </td> <td>0 </td> <td>4130 </td> <td>1935 </td> <td>2008 </td>
</tr>
<tr>
<td>52500 </td> <td>693 </td> <td>0 </td> <td>693 </td> <td>0 </td> <td>0 </td> <td>20 </td> <td>4118 </td> <td>1941 </td> <td>2006 </td>
</tr>
<tr>
<td>55000 </td> <td>723 </td> <td>363 </td> <td>723 </td> <td>400 </td> <td>0 </td> <td>24 </td> <td>11340 </td> <td>1920 </td> <td>2008 </td>
</tr>
<tr>
<td>55000 </td> <td>796 </td> <td>0 </td> <td>796 </td> <td>0 </td> <td>0 </td> <td>0 </td> <td>3636 </td> <td>1922 </td> <td>2008 </td>
</tr>
<tr>
<td>57625 </td> <td>810 </td> <td>0 </td> <td>0 </td> <td>280 </td> <td>119 </td> <td>24 </td> <td>21780 </td> <td>1910 </td> <td>2009 </td>
</tr>
<tr>
<td>58500 </td> <td>864 </td> <td>0 </td> <td>864 </td> <td>200 </td> <td>0 </td> <td>0 </td> <td>8212 </td> <td>1914 </td> <td>2010 </td>
</tr>
</tbody>
</table>
<p>... (1992 rows omitted)</p>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
<div class="jb_cell">
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>A histogram of sale prices shows a large amount of variability and a distribution that is clearly not normal. A long right tail contains a few houses that sold for very high prices. The short left tail contains no houses that sold for less than $35,000.</p>
</div>
</div>
</div>
</div>
<div class="jb_cell">
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="n">sales</span><span class="o">.</span><span class="n">hist</span><span class="p">(</span><span class="s1">&#39;SalePrice&#39;</span><span class="p">,</span> <span class="n">bins</span><span class="o">=</span><span class="mi">32</span><span class="p">,</span> <span class="n">unit</span><span class="o">=</span><span class="s1">&#39;$&#39;</span><span class="p">)</span>
</pre></div>
</div>
</div>
</div>
<div class="output_wrapper">
<div class="output">
<div class="jb_output_wrapper">
<div class="output_area">
<div class="output_png output_subarea ">
<img src="../../../images/chapters/17/6/Multiple_Regression_5_1.png"
>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
<div class="jb_cell">
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h4 id="Correlation">Correlation<a class="anchor-link" href="#Correlation"> </a></h4><p>No single attribute is sufficient to predict the sale price. For example, the area of the first floor, measured in square feet, correlates with the sale price but explains only some of its variability.</p>
</div>
</div>
</div>
</div>
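<p>The <code>correlation</code> function used below is the one defined in earlier chapters: the mean of the products of the two variables measured in standard units. A minimal sketch over plain arrays (the version used here takes a table and two column labels, but the computation is the same):</p>

```python
import numpy as np

def standard_units(x):
    # Convert an array to standard units: mean 0, standard deviation 1.
    return (x - np.mean(x)) / np.std(x)

def correlation_arrays(x, y):
    # The correlation coefficient r is the mean of the
    # products of the standard units of the two arrays.
    return np.mean(standard_units(x) * standard_units(y))
```

<p>For two arrays related by an exactly linear rule with positive slope, this returns 1; for a negative slope, -1.</p>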
<div class="jb_cell">
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="n">sales</span><span class="o">.</span><span class="n">scatter</span><span class="p">(</span><span class="s1">&#39;1st Flr SF&#39;</span><span class="p">,</span> <span class="s1">&#39;SalePrice&#39;</span><span class="p">)</span>
</pre></div>
</div>
</div>
</div>
<div class="output_wrapper">
<div class="output">
<div class="jb_output_wrapper">
<div class="output_area">
<div class="output_png output_subarea ">
<img src="../../../images/chapters/17/6/Multiple_Regression_7_0.png"
>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
<div class="jb_cell">
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="n">correlation</span><span class="p">(</span><span class="n">sales</span><span class="p">,</span> <span class="s1">&#39;SalePrice&#39;</span><span class="p">,</span> <span class="s1">&#39;1st Flr SF&#39;</span><span class="p">)</span>
</pre></div>
</div>
</div>
</div>
<div class="output_wrapper">
<div class="output">
<div class="jb_output_wrapper">
<div class="output_area">
<div class="output_text output_subarea output_execute_result">
<pre>0.6424662541030225</pre>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
<div class="jb_cell">
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>In fact, no individual attribute has a correlation with the sale price above 0.7 (except the sale price itself, whose correlation with itself is 1).</p>
</div>
</div>
</div>
</div>
<div class="jb_cell">
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="k">for</span> <span class="n">label</span> <span class="ow">in</span> <span class="n">sales</span><span class="o">.</span><span class="n">labels</span><span class="p">:</span>
    <span class="nb">print</span><span class="p">(</span><span class="s1">&#39;Correlation of&#39;</span><span class="p">,</span> <span class="n">label</span><span class="p">,</span> <span class="s1">&#39;and SalePrice:</span><span class="se">\t</span><span class="s1">&#39;</span><span class="p">,</span> <span class="n">correlation</span><span class="p">(</span><span class="n">sales</span><span class="p">,</span> <span class="n">label</span><span class="p">,</span> <span class="s1">&#39;SalePrice&#39;</span><span class="p">))</span>
</pre></div>
</div>
</div>
</div>
<div class="output_wrapper">
<div class="output">
<div class="jb_output_wrapper">
<div class="output_area">
<div class="output_subarea output_stream output_stdout output_text">
<pre>Correlation of SalePrice and SalePrice: 1.0
Correlation of 1st Flr SF and SalePrice: 0.6424662541030225
Correlation of 2nd Flr SF and SalePrice: 0.3575218942800824
Correlation of Total Bsmt SF and SalePrice: 0.652978626757169
Correlation of Garage Area and SalePrice: 0.6385944852520443
Correlation of Wood Deck SF and SalePrice: 0.3526986661950492
Correlation of Open Porch SF and SalePrice: 0.3369094170263733
Correlation of Lot Area and SalePrice: 0.2908234551157694
Correlation of Year Built and SalePrice: 0.5651647537135916
Correlation of Yr Sold and SalePrice: 0.02594857908072111
</pre>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
<div class="jb_cell">
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>However, combining attributes can provide higher correlation. In particular, if we sum the first-floor and second-floor areas, the result has a higher correlation with the sale price than any single attribute alone.</p>
</div>
</div>
</div>
</div>
<div class="jb_cell">
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="n">both_floors</span> <span class="o">=</span> <span class="n">sales</span><span class="o">.</span><span class="n">column</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span> <span class="o">+</span> <span class="n">sales</span><span class="o">.</span><span class="n">column</span><span class="p">(</span><span class="mi">2</span><span class="p">)</span>
<span class="n">correlation</span><span class="p">(</span><span class="n">sales</span><span class="o">.</span><span class="n">with_column</span><span class="p">(</span><span class="s1">&#39;Both Floors&#39;</span><span class="p">,</span> <span class="n">both_floors</span><span class="p">),</span> <span class="s1">&#39;SalePrice&#39;</span><span class="p">,</span> <span class="s1">&#39;Both Floors&#39;</span><span class="p">)</span>
</pre></div>
</div>
</div>
</div>
<div class="output_wrapper">
<div class="output">
<div class="jb_output_wrapper">
<div class="output_area">
<div class="output_text output_subarea output_execute_result">
<pre>0.7821920556134877</pre>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
<div class="jb_cell">
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>This higher correlation indicates that we should try to use more than one attribute to predict the sale price. In a dataset with multiple observed attributes and a single numerical value to be predicted (the sale price in this case), multiple linear regression can be an effective technique.</p>
<h2 id="Multiple-Linear-Regression">Multiple Linear Regression<a class="anchor-link" href="#Multiple-Linear-Regression"> </a></h2><p>In multiple linear regression, a numerical output is predicted from numerical input attributes by multiplying each attribute value by a different slope, then summing the results. In this example, the slope for <code>1st Flr SF</code> represents the number of dollars per square foot of first-floor area to use in our prediction.</p>
<p>Before we begin prediction, we split our data randomly into a training and test set of equal size.</p>
</div>
</div>
</div>
</div>
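<p>The prediction rule just described is a dot product: each attribute value times its slope, summed. A tiny numeric sketch with made-up slopes and attribute values (these numbers are purely illustrative, not the fitted Ames values):</p>

```python
import numpy as np

# Hypothetical slopes (dollars per square foot) for three attributes,
# and one made-up row of attribute values, for illustration only.
slopes = np.array([80.0, 75.0, 50.0])          # 1st Flr SF, 2nd Flr SF, Total Bsmt SF
attributes = np.array([1000.0, 500.0, 800.0])  # square feet

# Multiply each attribute by its slope and sum: a dot product.
predicted_price = np.sum(slopes * attributes)  # same as slopes @ attributes
```

<p>Here the prediction is 80*1000 + 75*500 + 50*800 = 157500 dollars.</p>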
<div class="jb_cell">
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="n">train</span><span class="p">,</span> <span class="n">test</span> <span class="o">=</span> <span class="n">sales</span><span class="o">.</span><span class="n">split</span><span class="p">(</span><span class="mi">1001</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="n">train</span><span class="o">.</span><span class="n">num_rows</span><span class="p">,</span> <span class="s1">&#39;training and&#39;</span><span class="p">,</span> <span class="n">test</span><span class="o">.</span><span class="n">num_rows</span><span class="p">,</span> <span class="s1">&#39;test instances.&#39;</span><span class="p">)</span>
</pre></div>
</div>
</div>
</div>
<div class="output_wrapper">
<div class="output">
<div class="jb_output_wrapper">
<div class="output_area">
<div class="output_subarea output_stream output_stdout output_text">
<pre>1001 training and 1001 test instances.
</pre>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
<div class="jb_cell">
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>In multiple regression, the slopes form an array that has one slope value for each attribute in an example. Predicting the sale price involves multiplying each attribute by its slope and summing the results.</p>
</div>
</div>
</div>
</div>
<div class="jb_cell">
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="k">def</span> <span class="nf">predict</span><span class="p">(</span><span class="n">slopes</span><span class="p">,</span> <span class="n">row</span><span class="p">):</span>
    <span class="k">return</span> <span class="nb">sum</span><span class="p">(</span><span class="n">slopes</span> <span class="o">*</span> <span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">(</span><span class="n">row</span><span class="p">))</span>

<span class="n">example_row</span> <span class="o">=</span> <span class="n">test</span><span class="o">.</span><span class="n">drop</span><span class="p">(</span><span class="s1">&#39;SalePrice&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">row</span><span class="p">(</span><span class="mi">0</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="s1">&#39;Predicting sale price for:&#39;</span><span class="p">,</span> <span class="n">example_row</span><span class="p">)</span>
<span class="n">example_slopes</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">normal</span><span class="p">(</span><span class="mi">10</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="nb">len</span><span class="p">(</span><span class="n">example_row</span><span class="p">))</span>
<span class="nb">print</span><span class="p">(</span><span class="s1">&#39;Using slopes:&#39;</span><span class="p">,</span> <span class="n">example_slopes</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="s1">&#39;Result:&#39;</span><span class="p">,</span> <span class="n">predict</span><span class="p">(</span><span class="n">example_slopes</span><span class="p">,</span> <span class="n">example_row</span><span class="p">))</span>
</pre></div>
</div>
</div>
</div>
<div class="output_wrapper">
<div class="output">
<div class="jb_output_wrapper">
<div class="output_area">
<div class="output_subarea output_stream output_stdout output_text">
<pre>Predicting sale price for: Row(1st Flr SF=707, 2nd Flr SF=707, Total Bsmt SF=707.0, Garage Area=403.0, Wood Deck SF=100, Open Porch SF=35, Lot Area=7750, Year Built=2002, Yr Sold=2008)
Using slopes: [ 9.70697704  8.68451487  9.48574052 11.65887763  9.76283493  7.75180442
 10.26963618 12.39555854  9.93561073]
Result: 150011.62264018963
</pre>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
<div class="jb_cell">
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>The result is an estimated sale price, which can be compared to the actual sale price to assess whether the slopes provide accurate predictions. Since the <code>example_slopes</code> above were chosen at random, we should not expect them to provide accurate predictions at all.</p>
</div>
</div>
</div>
</div>
<div class="jb_cell">
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="nb">print</span><span class="p">(</span><span class="s1">&#39;Actual sale price:&#39;</span><span class="p">,</span> <span class="n">test</span><span class="o">.</span><span class="n">column</span><span class="p">(</span><span class="s1">&#39;SalePrice&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">item</span><span class="p">(</span><span class="mi">0</span><span class="p">))</span>
<span class="nb">print</span><span class="p">(</span><span class="s1">&#39;Predicted sale price using random slopes:&#39;</span><span class="p">,</span> <span class="n">predict</span><span class="p">(</span><span class="n">example_slopes</span><span class="p">,</span> <span class="n">example_row</span><span class="p">))</span>
</pre></div>
</div>
</div>
</div>
<div class="output_wrapper">
<div class="output">
<div class="jb_output_wrapper">
<div class="output_area">
<div class="output_subarea output_stream output_stdout output_text">
<pre>Actual sale price: 176000
Predicted sale price using random slopes: 150011.62264018963
</pre>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
<div class="jb_cell">
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h4 id="Least-Squares-Regression">Least Squares Regression<a class="anchor-link" href="#Least-Squares-Regression"> </a></h4><p>The next step in performing multiple regression is to define the least squares objective. We perform the prediction for each row in the training set, and then compute the root mean squared error (RMSE) of the predictions relative to the actual prices.</p>
</div>
</div>
</div>
</div>
<div class="jb_cell">
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="n">train_prices</span> <span class="o">=</span> <span class="n">train</span><span class="o">.</span><span class="n">column</span><span class="p">(</span><span class="mi">0</span><span class="p">)</span>
<span class="n">train_attributes</span> <span class="o">=</span> <span class="n">train</span><span class="o">.</span><span class="n">drop</span><span class="p">(</span><span class="mi">0</span><span class="p">)</span>

<span class="k">def</span> <span class="nf">rmse</span><span class="p">(</span><span class="n">slopes</span><span class="p">,</span> <span class="n">attributes</span><span class="p">,</span> <span class="n">prices</span><span class="p">):</span>
    <span class="n">errors</span> <span class="o">=</span> <span class="p">[]</span>
    <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="n">np</span><span class="o">.</span><span class="n">arange</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">prices</span><span class="p">)):</span>
        <span class="n">predicted</span> <span class="o">=</span> <span class="n">predict</span><span class="p">(</span><span class="n">slopes</span><span class="p">,</span> <span class="n">attributes</span><span class="o">.</span><span class="n">row</span><span class="p">(</span><span class="n">i</span><span class="p">))</span>
        <span class="n">actual</span> <span class="o">=</span> <span class="n">prices</span><span class="o">.</span><span class="n">item</span><span class="p">(</span><span class="n">i</span><span class="p">)</span>
        <span class="n">errors</span><span class="o">.</span><span class="n">append</span><span class="p">((</span><span class="n">predicted</span> <span class="o">-</span> <span class="n">actual</span><span class="p">)</span> <span class="o">**</span> <span class="mi">2</span><span class="p">)</span>
    <span class="k">return</span> <span class="n">np</span><span class="o">.</span><span class="n">mean</span><span class="p">(</span><span class="n">errors</span><span class="p">)</span> <span class="o">**</span> <span class="mf">0.5</span>

<span class="k">def</span> <span class="nf">rmse_train</span><span class="p">(</span><span class="n">slopes</span><span class="p">):</span>
    <span class="k">return</span> <span class="n">rmse</span><span class="p">(</span><span class="n">slopes</span><span class="p">,</span> <span class="n">train_attributes</span><span class="p">,</span> <span class="n">train_prices</span><span class="p">)</span>

<span class="nb">print</span><span class="p">(</span><span class="s1">&#39;RMSE of all training examples using random slopes:&#39;</span><span class="p">,</span> <span class="n">rmse_train</span><span class="p">(</span><span class="n">example_slopes</span><span class="p">))</span>
</pre></div>
</div>
</div>
</div>
<div class="output_wrapper">
<div class="output">
<div class="jb_output_wrapper">
<div class="output_area">
<div class="output_subarea output_stream output_stdout output_text">
<pre>RMSE of all training examples using random slopes: 103585.76518182222
</pre>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
<div class="jb_cell">
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>Finally, we use the <code>minimize</code> function to find the slopes with the lowest RMSE. Since the function we want to minimize, <code>rmse_train</code>, takes an array instead of a number, we must pass the <code>array=True</code> argument to <code>minimize</code>. When this argument is used, <code>minimize</code> also requires an initial guess of the slopes so that it knows the dimension of the input array. In addition, to speed up optimization, we indicate that <code>rmse_train</code> is a smooth function by passing the <code>smooth=True</code> argument. Computation of the best slopes may take several minutes.</p>
</div>
</div>
</div>
</div>
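<p>Because RMSE is an increasing function of the mean squared error, the slopes that minimize RMSE are exactly the ordinary least-squares slopes. On a tiny made-up dataset (not the Ames data), we can sketch what the numerical search converges to by computing that least-squares answer directly with NumPy:</p>

```python
import numpy as np

# Made-up stand-ins for train_attributes and train_prices,
# chosen so the prices fit a linear rule exactly.
attrs = np.array([[1.0, 2.0],
                  [2.0, 1.0],
                  [3.0, 4.0],
                  [4.0, 3.0]])
prices = np.array([210.0, 240.0, 510.0, 540.0])

def rmse_of(slopes):
    # Root mean squared error of the linear predictions attrs @ slopes.
    return np.mean((attrs @ slopes - prices) ** 2) ** 0.5

# The slopes that minimize rmse_of are the least-squares solution;
# a numerical minimizer searches for this same answer.
best, *_ = np.linalg.lstsq(attrs, prices, rcond=None)
```

<p>On this toy data the best slopes are 90 and 60, and they fit the prices exactly, so the minimum RMSE is 0; on real data like the Ames sales, the minimum RMSE stays well above 0.</p>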
<div class="jb_cell">
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="n">best_slopes</span> <span class="o">=</span> <span class="n">minimize</span><span class="p">(</span><span class="n">rmse_train</span><span class="p">,</span> <span class="n">start</span><span class="o">=</span><span class="n">example_slopes</span><span class="p">,</span> <span class="n">smooth</span><span class="o">=</span><span class="kc">True</span><span class="p">,</span> <span class="n">array</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="s1">&#39;The best slopes for the training set:&#39;</span><span class="p">)</span>
<span class="n">Table</span><span class="p">(</span><span class="n">train_attributes</span><span class="o">.</span><span class="n">labels</span><span class="p">)</span><span class="o">.</span><span class="n">with_row</span><span class="p">(</span><span class="nb">list</span><span class="p">(</span><span class="n">best_slopes</span><span class="p">))</span><span class="o">.</span><span class="n">show</span><span class="p">()</span>
<span class="nb">print</span><span class="p">(</span><span class="s1">&#39;RMSE of all training examples using the best slopes:&#39;</span><span class="p">,</span> <span class="n">rmse_train</span><span class="p">(</span><span class="n">best_slopes</span><span class="p">))</span>
</pre></div>
</div>
</div>
</div>
<div class="output_wrapper">
<div class="output">
<div class="jb_output_wrapper">
<div class="output_area">
<div class="output_subarea output_stream output_stdout output_text">
<pre>The best slopes for the training set:
</pre>
</div>
</div>
</div>
<div class="jb_output_wrapper">
<div class="output_area">
<div class="output_html rendered_html output_subarea ">
<table border="1" class="dataframe">
<thead>
<tr>
<th>1st Flr SF</th> <th>2nd Flr SF</th> <th>Total Bsmt SF</th> <th>Garage Area</th> <th>Wood Deck SF</th> <th>Open Porch SF</th> <th>Lot Area</th> <th>Year Built</th> <th>Yr Sold</th>
</tr>
</thead>
<tbody>
<tr>
<td>78.7701 </td> <td>75.9304 </td> <td>49.6108 </td> <td>42.9615 </td> <td>38.8186 </td> <td>13.2336 </td> <td>0.328059</td> <td>510.312 </td> <td>-508.186</td>
</tr>
</tbody>
</table>
</div>
</div>
</div>
<div class="jb_output_wrapper">
  477. <div class="output_area">
  478. <div class="output_subarea output_stream output_stdout output_text">
  479. <pre>RMSE of all training examples using the best slopes: 32283.50513136445
  480. </pre>
  481. </div>
  482. </div>
  483. </div>
  484. </div>
  485. </div>
  486. </div>
  487. </div>
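Because the prediction is linear in the slopes, minimizing RMSE is an ordinary least-squares problem, so the result of a numerical search can be cross-checked against a closed-form solution. A minimal sketch with hypothetical toy data (the names `attributes` and `true_slopes` are illustrative, not from the text), using numpy's `lstsq`:

```python
import numpy as np

# Hypothetical toy data: 5 examples, 2 attributes, prices generated
# from known slopes so we can verify the recovered values.
attributes = np.array([[1.0, 2.0],
                       [2.0, 1.0],
                       [3.0, 4.0],
                       [4.0, 3.0],
                       [5.0, 5.0]])
true_slopes = np.array([10.0, 20.0])
prices = attributes @ true_slopes

# Minimizing RMSE over slopes is a least-squares problem, so
# np.linalg.lstsq finds the same minimizer in closed form.
best, *_ = np.linalg.lstsq(attributes, prices, rcond=None)

rmse = np.sqrt(np.mean((prices - attributes @ best) ** 2))
```

On real data the system is not exactly consistent, so the minimum RMSE is positive rather than zero, but the recovered slopes are still the least-squares minimizer.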
  488. <div class="jb_cell">
  489. <div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
  490. <div class="text_cell_render border-box-sizing rendered_html">
<h4 id="Interpreting-Multiple-Regression">Interpreting Multiple Regression<a class="anchor-link" href="#Interpreting-Multiple-Regression"> </a></h4><p>Let's interpret these results. The best slopes give us a method for estimating the price of a house from its attributes. A square foot of area on the first floor is worth about \$79 (the first slope), while one on the second floor is worth about \$76 (the second slope). The final negative value describes the market: prices in later years were lower on average.</p>
  492. <p>The RMSE of around \$30,000 means that our best linear prediction of the sale price based on all of the attributes is off by around \$30,000 on the training set, on average. We find a similar error when predicting prices on the test set, which indicates that our prediction method will generalize to other samples from the same population.</p>
  493. </div>
  494. </div>
  495. </div>
  496. </div>
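As a reminder of what the RMSE measures, here is a small worked example with hypothetical prices (the numbers are illustrative, not from the data set): each error is squared, the squares are averaged, and the square root brings the result back to dollars.

```python
import numpy as np

# Hypothetical actual and predicted sale prices (illustrative values)
actual = np.array([200000, 300000, 250000])
predicted = np.array([190000, 320000, 245000])

# Errors are 10000, -20000, 5000; squaring makes them positive,
# averaging combines them, and the root restores dollar units.
rmse = np.sqrt(np.mean((actual - predicted) ** 2))  # about 13229
```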
  497. <div class="jb_cell">
  498. <div class="cell border-box-sizing code_cell rendered">
  499. <div class="input">
  500. <div class="inner_cell">
  501. <div class="input_area">
  502. <div class=" highlight hl-ipython3"><pre><span></span><span class="n">test_prices</span> <span class="o">=</span> <span class="n">test</span><span class="o">.</span><span class="n">column</span><span class="p">(</span><span class="mi">0</span><span class="p">)</span>
  503. <span class="n">test_attributes</span> <span class="o">=</span> <span class="n">test</span><span class="o">.</span><span class="n">drop</span><span class="p">(</span><span class="mi">0</span><span class="p">)</span>
  504. <span class="k">def</span> <span class="nf">rmse_test</span><span class="p">(</span><span class="n">slopes</span><span class="p">):</span>
  505. <span class="k">return</span> <span class="n">rmse</span><span class="p">(</span><span class="n">slopes</span><span class="p">,</span> <span class="n">test_attributes</span><span class="p">,</span> <span class="n">test_prices</span><span class="p">)</span>
  506. <span class="n">rmse_linear</span> <span class="o">=</span> <span class="n">rmse_test</span><span class="p">(</span><span class="n">best_slopes</span><span class="p">)</span>
  507. <span class="nb">print</span><span class="p">(</span><span class="s1">&#39;Test set RMSE for multiple linear regression:&#39;</span><span class="p">,</span> <span class="n">rmse_linear</span><span class="p">)</span>
  508. </pre></div>
  509. </div>
  510. </div>
  511. </div>
  512. <div class="output_wrapper">
  513. <div class="output">
  514. <div class="jb_output_wrapper }}">
  515. <div class="output_area">
  516. <div class="output_subarea output_stream output_stdout output_text">
  517. <pre>Test set RMSE for multiple linear regression: 29898.407434368237
  518. </pre>
  519. </div>
  520. </div>
  521. </div>
  522. </div>
  523. </div>
  524. </div>
  525. </div>
  526. <div class="jb_cell">
  527. <div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
  528. <div class="text_cell_render border-box-sizing rendered_html">
<p>If the predictions were perfect, then all points in a scatter plot of predicted versus actual values would fall on a straight line with slope 1. We see that most dots fall near that line, but there is some error in the predictions.</p>
  530. </div>
  531. </div>
  532. </div>
  533. </div>
  534. <div class="jb_cell">
  535. <div class="cell border-box-sizing code_cell rendered">
  536. <div class="input">
  537. <div class="inner_cell">
  538. <div class="input_area">
  539. <div class=" highlight hl-ipython3"><pre><span></span><span class="k">def</span> <span class="nf">fit</span><span class="p">(</span><span class="n">row</span><span class="p">):</span>
  540. <span class="k">return</span> <span class="nb">sum</span><span class="p">(</span><span class="n">best_slopes</span> <span class="o">*</span> <span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">(</span><span class="n">row</span><span class="p">))</span>
  541. <span class="n">test</span><span class="o">.</span><span class="n">with_column</span><span class="p">(</span><span class="s1">&#39;Fitted&#39;</span><span class="p">,</span> <span class="n">test</span><span class="o">.</span><span class="n">drop</span><span class="p">(</span><span class="mi">0</span><span class="p">)</span><span class="o">.</span><span class="n">apply</span><span class="p">(</span><span class="n">fit</span><span class="p">))</span><span class="o">.</span><span class="n">scatter</span><span class="p">(</span><span class="s1">&#39;Fitted&#39;</span><span class="p">,</span> <span class="mi">0</span><span class="p">)</span>
  542. <span class="n">plots</span><span class="o">.</span><span class="n">plot</span><span class="p">([</span><span class="mi">0</span><span class="p">,</span> <span class="mf">5e5</span><span class="p">],</span> <span class="p">[</span><span class="mi">0</span><span class="p">,</span> <span class="mf">5e5</span><span class="p">]);</span>
  543. </pre></div>
  544. </div>
  545. </div>
  546. </div>
  547. <div class="output_wrapper">
  548. <div class="output">
  549. <div class="jb_output_wrapper }}">
  550. <div class="output_area">
  551. <div class="output_png output_subarea ">
  552. <img src="../../../images/chapters/17/6/Multiple_Regression_26_0.png"
  553. >
  554. </div>
  555. </div>
  556. </div>
  557. </div>
  558. </div>
  559. </div>
  560. </div>
  561. <div class="jb_cell">
  562. <div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
  563. <div class="text_cell_render border-box-sizing rendered_html">
  564. <p>A residual plot for multiple regression typically compares the errors (residuals) to the actual values of the predicted variable. We see in the residual plot below that we have systematically underestimated the value of expensive houses, shown by the many positive residual values on the right side of the graph.</p>
  565. </div>
  566. </div>
  567. </div>
  568. </div>
  569. <div class="jb_cell">
  570. <div class="cell border-box-sizing code_cell rendered">
  571. <div class="input">
  572. <div class="inner_cell">
  573. <div class="input_area">
  574. <div class=" highlight hl-ipython3"><pre><span></span><span class="n">test</span><span class="o">.</span><span class="n">with_column</span><span class="p">(</span><span class="s1">&#39;Residual&#39;</span><span class="p">,</span> <span class="n">test_prices</span><span class="o">-</span><span class="n">test</span><span class="o">.</span><span class="n">drop</span><span class="p">(</span><span class="mi">0</span><span class="p">)</span><span class="o">.</span><span class="n">apply</span><span class="p">(</span><span class="n">fit</span><span class="p">))</span><span class="o">.</span><span class="n">scatter</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="s1">&#39;Residual&#39;</span><span class="p">)</span>
  575. <span class="n">plots</span><span class="o">.</span><span class="n">plot</span><span class="p">([</span><span class="mi">0</span><span class="p">,</span> <span class="mf">7e5</span><span class="p">],</span> <span class="p">[</span><span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">]);</span>
  576. </pre></div>
  577. </div>
  578. </div>
  579. </div>
  580. <div class="output_wrapper">
  581. <div class="output">
  582. <div class="jb_output_wrapper }}">
  583. <div class="output_area">
  584. <div class="output_png output_subarea ">
  585. <img src="../../../images/chapters/17/6/Multiple_Regression_28_0.png"
  586. >
  587. </div>
  588. </div>
  589. </div>
  590. </div>
  591. </div>
  592. </div>
  593. </div>
  594. <div class="jb_cell">
  595. <div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
  596. <div class="text_cell_render border-box-sizing rendered_html">
  597. <p>As with simple linear regression, interpreting the result of a predictor is at least as important as making predictions. There are many lessons about interpreting multiple regression that are not included in this textbook. A natural next step after completing this text would be to study linear modeling and regression in further depth.</p>
  598. </div>
  599. </div>
  600. </div>
  601. </div>
  602. <div class="jb_cell">
  603. <div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
  604. <div class="text_cell_render border-box-sizing rendered_html">
  605. <h2 id="Nearest-Neighbors-for-Regression">Nearest Neighbors for Regression<a class="anchor-link" href="#Nearest-Neighbors-for-Regression"> </a></h2><p>Another approach to predicting the sale price of a house is to use the price of similar houses. This <em>nearest neighbor</em> approach is very similar to our classifier. To speed up computation, we will only use the attributes that had the highest correlation with the sale price in our original analysis.</p>
  606. </div>
  607. </div>
  608. </div>
  609. </div>
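The attribute selection mentioned here was based on correlation with the sale price. A minimal sketch of that idea with hypothetical toy arrays (the names and values are illustrative), using `np.corrcoef`:

```python
import numpy as np

# Hypothetical data: prices alongside a strongly and a weakly
# correlated attribute (illustrative values only).
prices = np.array([100., 150., 200., 250., 300.])
floor_sf = np.array([1000., 1500., 2000., 2500., 3000.])  # tracks price exactly
porch_sf = np.array([3., 1., 4., 1., 5.])                 # little relation to price

r_floor = np.corrcoef(floor_sf, prices)[0, 1]
r_porch = np.corrcoef(porch_sf, prices)[0, 1]
# An attribute like floor_sf, with |r| near 1, would be kept for the
# nearest-neighbor model; one like porch_sf, with small |r|, dropped.
```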
  610. <div class="jb_cell">
  611. <div class="cell border-box-sizing code_cell rendered">
  612. <div class="input">
  613. <div class="inner_cell">
  614. <div class="input_area">
  615. <div class=" highlight hl-ipython3"><pre><span></span><span class="n">train_nn</span> <span class="o">=</span> <span class="n">train</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="mi">4</span><span class="p">,</span> <span class="mi">8</span><span class="p">)</span>
  616. <span class="n">test_nn</span> <span class="o">=</span> <span class="n">test</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="mi">4</span><span class="p">,</span> <span class="mi">8</span><span class="p">)</span>
  617. <span class="n">train_nn</span><span class="o">.</span><span class="n">show</span><span class="p">(</span><span class="mi">3</span><span class="p">)</span>
  618. </pre></div>
  619. </div>
  620. </div>
  621. </div>
  622. <div class="output_wrapper">
  623. <div class="output">
  624. <div class="jb_output_wrapper }}">
  625. <div class="output_area">
  626. <div class="output_html rendered_html output_subarea ">
  627. <table border="1" class="dataframe">
  628. <thead>
  629. <tr>
  630. <th>SalePrice</th> <th>1st Flr SF</th> <th>2nd Flr SF</th> <th>Total Bsmt SF</th> <th>Garage Area</th> <th>Year Built</th>
  631. </tr>
  632. </thead>
  633. <tbody>
  634. <tr>
  635. <td>67500 </td> <td>1012 </td> <td>0 </td> <td>816 </td> <td>429 </td> <td>1920 </td>
  636. </tr>
  637. <tr>
  638. <td>116000 </td> <td>734 </td> <td>384 </td> <td>648 </td> <td>440 </td> <td>1920 </td>
  639. </tr>
  640. <tr>
  641. <td>228500 </td> <td>1689 </td> <td>0 </td> <td>1680 </td> <td>432 </td> <td>1991 </td>
  642. </tr>
  643. </tbody>
  644. </table>
  645. <p>... (998 rows omitted)</p>
  646. </div>
  647. </div>
  648. </div>
  649. </div>
  650. </div>
  651. </div>
  652. </div>
  653. <div class="jb_cell">
  654. <div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
  655. <div class="text_cell_render border-box-sizing rendered_html">
<p>The computation of closest neighbors is identical to that of a nearest-neighbor classifier. In this case, we exclude the <code>'SalePrice'</code> column rather than the <code>'Class'</code> column from the distance computation. The five nearest neighbors of the first test row are shown below.</p>
  657. </div>
  658. </div>
  659. </div>
  660. </div>
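As a concrete check, the Euclidean distance between the first two training rows displayed above (with SalePrice excluded) can be computed directly; this is the same computation the `distance` function below performs:

```python
import numpy as np

# Attribute values of the first two training rows shown above,
# with the SalePrice column excluded.
row1 = np.array([1012, 0, 816, 429, 1920])
row2 = np.array([734, 384, 648, 440, 1920])

# Euclidean distance: square the differences, sum, take the root.
dist = np.sqrt(np.sum((row1 - row2) ** 2))  # about 503.08
```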
  661. <div class="jb_cell">
  662. <div class="cell border-box-sizing code_cell rendered">
  663. <div class="input">
  664. <div class="inner_cell">
  665. <div class="input_area">
  666. <div class=" highlight hl-ipython3"><pre><span></span><span class="k">def</span> <span class="nf">distance</span><span class="p">(</span><span class="n">pt1</span><span class="p">,</span> <span class="n">pt2</span><span class="p">):</span>
  667. <span class="sd">&quot;&quot;&quot;The distance between two points, represented as arrays.&quot;&quot;&quot;</span>
  668. <span class="k">return</span> <span class="n">np</span><span class="o">.</span><span class="n">sqrt</span><span class="p">(</span><span class="nb">sum</span><span class="p">((</span><span class="n">pt1</span> <span class="o">-</span> <span class="n">pt2</span><span class="p">)</span> <span class="o">**</span> <span class="mi">2</span><span class="p">))</span>
  669. <span class="k">def</span> <span class="nf">row_distance</span><span class="p">(</span><span class="n">row1</span><span class="p">,</span> <span class="n">row2</span><span class="p">):</span>
  670. <span class="sd">&quot;&quot;&quot;The distance between two rows of a table.&quot;&quot;&quot;</span>
  671. <span class="k">return</span> <span class="n">distance</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">(</span><span class="n">row1</span><span class="p">),</span> <span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">(</span><span class="n">row2</span><span class="p">))</span>
  672. <span class="k">def</span> <span class="nf">distances</span><span class="p">(</span><span class="n">training</span><span class="p">,</span> <span class="n">example</span><span class="p">,</span> <span class="n">output</span><span class="p">):</span>
  673. <span class="sd">&quot;&quot;&quot;Compute the distance from example for each row in training.&quot;&quot;&quot;</span>
  674. <span class="n">dists</span> <span class="o">=</span> <span class="p">[]</span>
  675. <span class="n">attributes</span> <span class="o">=</span> <span class="n">training</span><span class="o">.</span><span class="n">drop</span><span class="p">(</span><span class="n">output</span><span class="p">)</span>
  676. <span class="k">for</span> <span class="n">row</span> <span class="ow">in</span> <span class="n">attributes</span><span class="o">.</span><span class="n">rows</span><span class="p">:</span>
  677. <span class="n">dists</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">row_distance</span><span class="p">(</span><span class="n">row</span><span class="p">,</span> <span class="n">example</span><span class="p">))</span>
  678. <span class="k">return</span> <span class="n">training</span><span class="o">.</span><span class="n">with_column</span><span class="p">(</span><span class="s1">&#39;Distance&#39;</span><span class="p">,</span> <span class="n">dists</span><span class="p">)</span>
  679. <span class="k">def</span> <span class="nf">closest</span><span class="p">(</span><span class="n">training</span><span class="p">,</span> <span class="n">example</span><span class="p">,</span> <span class="n">k</span><span class="p">,</span> <span class="n">output</span><span class="p">):</span>
  680. <span class="sd">&quot;&quot;&quot;Return a table of the k closest neighbors to example.&quot;&quot;&quot;</span>
  681. <span class="k">return</span> <span class="n">distances</span><span class="p">(</span><span class="n">training</span><span class="p">,</span> <span class="n">example</span><span class="p">,</span> <span class="n">output</span><span class="p">)</span><span class="o">.</span><span class="n">sort</span><span class="p">(</span><span class="s1">&#39;Distance&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">take</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">arange</span><span class="p">(</span><span class="n">k</span><span class="p">))</span>
  682. <span class="n">example_nn_row</span> <span class="o">=</span> <span class="n">test_nn</span><span class="o">.</span><span class="n">drop</span><span class="p">(</span><span class="mi">0</span><span class="p">)</span><span class="o">.</span><span class="n">row</span><span class="p">(</span><span class="mi">0</span><span class="p">)</span>
  683. <span class="n">closest</span><span class="p">(</span><span class="n">train_nn</span><span class="p">,</span> <span class="n">example_nn_row</span><span class="p">,</span> <span class="mi">5</span><span class="p">,</span> <span class="s1">&#39;SalePrice&#39;</span><span class="p">)</span>
  684. </pre></div>
  685. </div>
  686. </div>
  687. </div>
  688. <div class="output_wrapper">
  689. <div class="output">
  690. <div class="jb_output_wrapper }}">
  691. <div class="output_area">
  692. <div class="output_html rendered_html output_subarea output_execute_result">
  693. <table border="1" class="dataframe">
  694. <thead>
  695. <tr>
  696. <th>SalePrice</th> <th>1st Flr SF</th> <th>2nd Flr SF</th> <th>Total Bsmt SF</th> <th>Garage Area</th> <th>Year Built</th> <th>Distance</th>
  697. </tr>
  698. </thead>
  699. <tbody>
  700. <tr>
  701. <td>175000 </td> <td>729 </td> <td>717 </td> <td>729 </td> <td>406 </td> <td>1996 </td> <td>33.3617 </td>
  702. </tr>
  703. <tr>
  704. <td>176000 </td> <td>728 </td> <td>728 </td> <td>728 </td> <td>400 </td> <td>2005 </td> <td>36.6197 </td>
  705. </tr>
  706. <tr>
  707. <td>189000 </td> <td>728 </td> <td>728 </td> <td>728 </td> <td>410 </td> <td>2005 </td> <td>37.1618 </td>
  708. </tr>
  709. <tr>
  710. <td>159500 </td> <td>698 </td> <td>728 </td> <td>690 </td> <td>440 </td> <td>1977 </td> <td>52.9623 </td>
  711. </tr>
  712. <tr>
  713. <td>174000 </td> <td>742 </td> <td>742 </td> <td>742 </td> <td>390 </td> <td>2005 </td> <td>62.0725 </td>
  714. </tr>
  715. </tbody>
  716. </table>
  717. </div>
  718. </div>
  719. </div>
  720. </div>
  721. </div>
  722. </div>
  723. </div>
  724. <div class="jb_cell">
  725. <div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
  726. <div class="text_cell_render border-box-sizing rendered_html">
  727. <p>One simple method for predicting the price is to average the prices of the nearest neighbors.</p>
  728. </div>
  729. </div>
  730. </div>
  731. </div>
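The prediction is simply the mean of the neighbor prices; using the five SalePrice values from the table above, the arithmetic can be verified directly:

```python
import numpy as np

# SalePrice values of the five nearest neighbors from the table above
neighbor_prices = np.array([175000, 176000, 189000, 159500, 174000])

# The predicted price is just their average
prediction = np.average(neighbor_prices)  # 174700.0
```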
  732. <div class="jb_cell">
  733. <div class="cell border-box-sizing code_cell rendered">
  734. <div class="input">
  735. <div class="inner_cell">
  736. <div class="input_area">
  737. <div class=" highlight hl-ipython3"><pre><span></span><span class="k">def</span> <span class="nf">predict_nn</span><span class="p">(</span><span class="n">example</span><span class="p">):</span>
<span class="sd">&quot;&quot;&quot;Return the average price of the 5 nearest neighbors.&quot;&quot;&quot;</span>
  739. <span class="k">return</span> <span class="n">np</span><span class="o">.</span><span class="n">average</span><span class="p">(</span><span class="n">closest</span><span class="p">(</span><span class="n">train_nn</span><span class="p">,</span> <span class="n">example</span><span class="p">,</span> <span class="mi">5</span><span class="p">,</span> <span class="s1">&#39;SalePrice&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">column</span><span class="p">(</span><span class="s1">&#39;SalePrice&#39;</span><span class="p">))</span>
  740. <span class="n">predict_nn</span><span class="p">(</span><span class="n">example_nn_row</span><span class="p">)</span>
  741. </pre></div>
  742. </div>
  743. </div>
  744. </div>
  745. <div class="output_wrapper">
  746. <div class="output">
  747. <div class="jb_output_wrapper }}">
  748. <div class="output_area">
  749. <div class="output_text output_subarea output_execute_result">
  750. <pre>174700.0</pre>
  751. </div>
  752. </div>
  753. </div>
  754. </div>
  755. </div>
  756. </div>
  757. </div>
  758. <div class="jb_cell">
  759. <div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
  760. <div class="text_cell_render border-box-sizing rendered_html">
  761. <p>Finally, we can inspect whether our prediction is close to the true sale price for our one test example. Looks reasonable!</p>
  762. </div>
  763. </div>
  764. </div>
  765. </div>
  766. <div class="jb_cell">
  767. <div class="cell border-box-sizing code_cell rendered">
  768. <div class="input">
  769. <div class="inner_cell">
  770. <div class="input_area">
  771. <div class=" highlight hl-ipython3"><pre><span></span><span class="nb">print</span><span class="p">(</span><span class="s1">&#39;Actual sale price:&#39;</span><span class="p">,</span> <span class="n">test_nn</span><span class="o">.</span><span class="n">column</span><span class="p">(</span><span class="s1">&#39;SalePrice&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">item</span><span class="p">(</span><span class="mi">0</span><span class="p">))</span>
  772. <span class="nb">print</span><span class="p">(</span><span class="s1">&#39;Predicted sale price using nearest neighbors:&#39;</span><span class="p">,</span> <span class="n">predict_nn</span><span class="p">(</span><span class="n">example_nn_row</span><span class="p">))</span>
  773. </pre></div>
  774. </div>
  775. </div>
  776. </div>
  777. <div class="output_wrapper">
  778. <div class="output">
  779. <div class="jb_output_wrapper }}">
  780. <div class="output_area">
  781. <div class="output_subarea output_stream output_stdout output_text">
  782. <pre>Actual sale price: 176000
  783. Predicted sale price using nearest neighbors: 174700.0
  784. </pre>
  785. </div>
  786. </div>
  787. </div>
  788. </div>
  789. </div>
  790. </div>
  791. </div>
  792. <div class="jb_cell">
  793. <div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
  794. <div class="text_cell_render border-box-sizing rendered_html">
  795. <h4 id="Evaluation">Evaluation<a class="anchor-link" href="#Evaluation"> </a></h4><p>To evaluate the performance of this approach for the whole test set, we apply <code>predict_nn</code> to each test example, then compute the root mean squared error of the predictions. Computation of the predictions may take several minutes.</p>
  796. </div>
  797. </div>
  798. </div>
  799. </div>
  800. <div class="jb_cell">
  801. <div class="cell border-box-sizing code_cell rendered">
  802. <div class="input">
  803. <div class="inner_cell">
  804. <div class="input_area">
  805. <div class=" highlight hl-ipython3"><pre><span></span><span class="n">nn_test_predictions</span> <span class="o">=</span> <span class="n">test_nn</span><span class="o">.</span><span class="n">drop</span><span class="p">(</span><span class="s1">&#39;SalePrice&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">apply</span><span class="p">(</span><span class="n">predict_nn</span><span class="p">)</span>
  806. <span class="n">rmse_nn</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">mean</span><span class="p">((</span><span class="n">test_prices</span> <span class="o">-</span> <span class="n">nn_test_predictions</span><span class="p">)</span> <span class="o">**</span> <span class="mi">2</span><span class="p">)</span> <span class="o">**</span> <span class="mf">0.5</span>
  807. <span class="nb">print</span><span class="p">(</span><span class="s1">&#39;Test set RMSE for multiple linear regression: &#39;</span><span class="p">,</span> <span class="n">rmse_linear</span><span class="p">)</span>
  808. <span class="nb">print</span><span class="p">(</span><span class="s1">&#39;Test set RMSE for nearest neighbor regression:&#39;</span><span class="p">,</span> <span class="n">rmse_nn</span><span class="p">)</span>
  809. </pre></div>
  810. </div>
  811. </div>
  812. </div>
  813. <div class="output_wrapper">
  814. <div class="output">
  815. <div class="jb_output_wrapper }}">
  816. <div class="output_area">
  817. <div class="output_subarea output_stream output_stdout output_text">
  818. <pre>Test set RMSE for multiple linear regression: 29898.407434368237
  819. Test set RMSE for nearest neighbor regression: 33424.833033298106
  820. </pre>
  821. </div>
  822. </div>
  823. </div>
  824. </div>
  825. </div>
  826. </div>
  827. </div>
  828. <div class="jb_cell">
  829. <div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
  830. <div class="text_cell_render border-box-sizing rendered_html">
<p>For these data, the errors of the two techniques are quite similar! For a different data set, one technique might outperform the other. By computing the RMSE of both techniques on the same data, we can compare the methods fairly. One note of caution: the difference in performance might not be due to the techniques at all; it might instead reflect random variation from sampling the training and test sets in the first place.</p>
  832. <p>Finally, we can draw a residual plot for these predictions. We still underestimate the prices of the most expensive houses, but the bias does not appear to be as systematic. However, fewer residuals are very close to zero, indicating that fewer prices were predicted with very high accuracy.</p>
  833. </div>
  834. </div>
  835. </div>
  836. </div>
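The caution about sampling variation can be made concrete: with synthetic data, repeating the random train/test split and refitting gives a different test RMSE each time, even though the prediction method is fixed. A hedged sketch (all names and data are illustrative, not from the chapter):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: price = 100 * attribute + noise (sd 20000)
n = 200
attribute = rng.uniform(500, 2000, n)
price = 100 * attribute + rng.normal(0, 20000, n)

rmses = []
for _ in range(10):
    # Random train/test split, as in the chapter
    shuffled = rng.permutation(n)
    train_idx, test_idx = shuffled[:150], shuffled[150:]
    # Fit a single slope (no intercept) by least squares on the training set
    slope = np.sum(attribute[train_idx] * price[train_idx]) / np.sum(attribute[train_idx] ** 2)
    errors = price[test_idx] - slope * attribute[test_idx]
    rmses.append(np.sqrt(np.mean(errors ** 2)))

# The test RMSE varies from split to split even though the method is fixed
spread = max(rmses) - min(rmses)
```

A difference between two techniques smaller than this split-to-split spread should not be taken as strong evidence that one technique is better.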
  837. <div class="jb_cell">
  838. <div class="cell border-box-sizing code_cell rendered">
  839. <div class="input">
  840. <div class="inner_cell">
  841. <div class="input_area">
  842. <div class=" highlight hl-ipython3"><pre><span></span><span class="n">test</span><span class="o">.</span><span class="n">with_column</span><span class="p">(</span><span class="s1">&#39;Residual&#39;</span><span class="p">,</span> <span class="n">test_prices</span><span class="o">-</span><span class="n">nn_test_predictions</span><span class="p">)</span><span class="o">.</span><span class="n">scatter</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="s1">&#39;Residual&#39;</span><span class="p">)</span>
  843. <span class="n">plots</span><span class="o">.</span><span class="n">plot</span><span class="p">([</span><span class="mi">0</span><span class="p">,</span> <span class="mf">7e5</span><span class="p">],</span> <span class="p">[</span><span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">]);</span>
  844. </pre></div>
  845. </div>
  846. </div>
  847. </div>
  848. <div class="output_wrapper">
  849. <div class="output">
  850. <div class="jb_output_wrapper }}">
  851. <div class="output_area">
  852. <div class="output_png output_subarea ">
  853. <img src="../../../images/chapters/17/6/Multiple_Regression_41_0.png"
  854. >
  855. </div>
  856. </div>
  857. </div>
  858. </div>
  859. </div>
  860. </div>
  861. </div>