  1. ---
  2. redirect_from:
  3. - "/chapters/15/prediction"
  4. interact_link: content/chapters/15/Prediction.ipynb
  5. kernel_name: Python [Root]
  6. has_widgets: false
  7. title: |-
  8.   Prediction
  9. prev_page:
  10.   url: /chapters/14/6/Choosing_a_Sample_Size.html
  11.   title: |-
  12.     Choosing a Sample Size
  13. next_page:
  14.   url: /chapters/15/1/Correlation.html
  15.   title: |-
  16.     Correlation
  17. comment: "***PROGRAMMATICALLY GENERATED, DO NOT EDIT. SEE ORIGINAL FILES IN /content***"
  18. ---
  19. <div class="jb_cell tag_remove_input">
  20. <div class="cell border-box-sizing code_cell rendered">
  21. </div>
  22. </div>
  23. <div class="jb_cell">
  24. <div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
  25. <div class="text_cell_render border-box-sizing rendered_html">
  26. <h3 id="Prediction">Prediction<a class="anchor-link" href="#Prediction"> </a></h3><p>An important aspect of data science is to find out what data can tell us about the future. What do data about climate and pollution say about temperatures a few decades from now? Based on a person's internet profile, which websites are likely to interest them? How can a patient's medical history be used to judge how well he or she will respond to a treatment?</p>
  27. <p>To answer such questions, data scientists have developed methods for making <em>predictions</em>. In this chapter we will study one of the most commonly used ways of predicting the value of one variable based on the value of another.</p>
  28. </div>
  29. </div>
  30. </div>
  31. </div>
  32. <div class="jb_cell">
  33. <div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
  34. <div class="text_cell_render border-box-sizing rendered_html">
  35. <p>The foundations of the method were laid by <a href="https://en.wikipedia.org/wiki/Francis_Galton">Sir Francis Galton</a>. As we saw in Section 7.1, Galton studied how physical characteristics are passed down from one generation to the next. Among his best-known work is the prediction of the heights of adults based on the heights of their parents. We have already studied the dataset that Galton collected for this purpose. The table <code>heights</code> contains his data on the midparent height and child's height (all in inches) for a population of 934 adult "children".</p>
  36. </div>
  37. </div>
  38. </div>
  39. </div>
  40. <div class="jb_cell">
  41. <div class="cell border-box-sizing code_cell rendered">
  42. <div class="input">
  43. <div class="inner_cell">
  44. <div class="input_area">
  45. <div class=" highlight hl-ipython3"><pre><span></span><span class="c1"># Galton&#39;s data on heights of parents and their adult children</span>
  46. <span class="n">galton</span> <span class="o">=</span> <span class="n">Table</span><span class="o">.</span><span class="n">read_table</span><span class="p">(</span><span class="n">path_data</span> <span class="o">+</span> <span class="s1">&#39;galton.csv&#39;</span><span class="p">)</span>
  47. <span class="n">heights</span> <span class="o">=</span> <span class="n">Table</span><span class="p">()</span><span class="o">.</span><span class="n">with_columns</span><span class="p">(</span>
  48.     <span class="s1">&#39;MidParent&#39;</span><span class="p">,</span> <span class="n">galton</span><span class="o">.</span><span class="n">column</span><span class="p">(</span><span class="s1">&#39;midparentHeight&#39;</span><span class="p">),</span>
  49.     <span class="s1">&#39;Child&#39;</span><span class="p">,</span> <span class="n">galton</span><span class="o">.</span><span class="n">column</span><span class="p">(</span><span class="s1">&#39;childHeight&#39;</span><span class="p">)</span>
  50. <span class="p">)</span>
  51. </pre></div>
  52. </div>
  53. </div>
  54. </div>
  55. </div>
  56. </div>
  57. <div class="jb_cell">
  58. <div class="cell border-box-sizing code_cell rendered">
  59. <div class="input">
  60. <div class="inner_cell">
  61. <div class="input_area">
  62. <div class=" highlight hl-ipython3"><pre><span></span><span class="n">heights</span>
  63. </pre></div>
  64. </div>
  65. </div>
  66. </div>
  67. <div class="output_wrapper">
  68. <div class="output">
  69. <div class="jb_output_wrapper }}">
  70. <div class="output_area">
  71. <div class="output_html rendered_html output_subarea output_execute_result">
  72. <table border="1" class="dataframe">
  73. <thead>
  74. <tr>
  75. <th>MidParent</th> <th>Child</th>
  76. </tr>
  77. </thead>
  78. <tbody>
  79. <tr>
  80. <td>75.43 </td> <td>73.2 </td>
  81. </tr>
  82. <tr>
  83. <td>75.43 </td> <td>69.2 </td>
  84. </tr>
  85. <tr>
  86. <td>75.43 </td> <td>69 </td>
  87. </tr>
  88. <tr>
  89. <td>75.43 </td> <td>69 </td>
  90. </tr>
  91. <tr>
  92. <td>73.66 </td> <td>73.5 </td>
  93. </tr>
  94. <tr>
  95. <td>73.66 </td> <td>72.5 </td>
  96. </tr>
  97. <tr>
  98. <td>73.66 </td> <td>65.5 </td>
  99. </tr>
  100. <tr>
  101. <td>73.66 </td> <td>65.5 </td>
  102. </tr>
  103. <tr>
  104. <td>72.06 </td> <td>71 </td>
  105. </tr>
  106. <tr>
  107. <td>72.06 </td> <td>68 </td>
  108. </tr>
  109. </tbody>
  110. </table>
  111. <p>... (924 rows omitted)</p>
  112. </div>
  113. </div>
  114. </div>
  115. </div>
  116. </div>
  117. </div>
  118. </div>
  119. <div class="jb_cell">
  120. <div class="cell border-box-sizing code_cell rendered">
  121. <div class="input">
  122. <div class="inner_cell">
  123. <div class="input_area">
  124. <div class=" highlight hl-ipython3"><pre><span></span><span class="n">heights</span><span class="o">.</span><span class="n">scatter</span><span class="p">(</span><span class="s1">&#39;MidParent&#39;</span><span class="p">)</span>
  125. </pre></div>
  126. </div>
  127. </div>
  128. </div>
  129. <div class="output_wrapper">
  130. <div class="output">
  131. <div class="jb_output_wrapper }}">
  132. <div class="output_area">
  133. <div class="output_png output_subarea ">
  134. <img src="../../images/chapters/15/Prediction_5_0.png"
  135. >
  136. </div>
  137. </div>
  138. </div>
  139. </div>
  140. </div>
  141. </div>
  142. </div>
  143. <div class="jb_cell">
  144. <div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
  145. <div class="text_cell_render border-box-sizing rendered_html">
  146. <p>The primary reason for collecting the data was to be able to predict the adult height of a child born to parents similar to those in the dataset. We made these predictions in Section 7.1, after noticing the positive association between the two variables.</p>
  147. <p>Our approach was to base the prediction on all the points whose midparent height is close to the midparent height of the new person. To do this, we wrote a function called <code>predict_child</code>, which takes a midparent height as its argument and returns the average height of all the children who had midparent heights within half an inch of the argument.</p>
  148. </div>
  149. </div>
  150. </div>
  151. </div>
  152. <div class="jb_cell">
  153. <div class="cell border-box-sizing code_cell rendered">
  154. <div class="input">
  155. <div class="inner_cell">
  156. <div class="input_area">
  157. <div class=" highlight hl-ipython3"><pre><span></span><span class="k">def</span> <span class="nf">predict_child</span><span class="p">(</span><span class="n">mpht</span><span class="p">):</span>
  158.     <span class="sd">&quot;&quot;&quot;Return a prediction of the height of a child </span>
  159. <span class="sd">    whose parents have a midparent height of mpht.</span>
  160. <span class="sd"> </span>
  161. <span class="sd">    The prediction is the average height of the children </span>
  162. <span class="sd">    whose midparent height is in the range mpht plus or minus 0.5 inches.</span>
  163. <span class="sd">    &quot;&quot;&quot;</span>
  164.     <span class="n">close_points</span> <span class="o">=</span> <span class="n">heights</span><span class="o">.</span><span class="n">where</span><span class="p">(</span><span class="s1">&#39;MidParent&#39;</span><span class="p">,</span> <span class="n">are</span><span class="o">.</span><span class="n">between</span><span class="p">(</span><span class="n">mpht</span> <span class="o">-</span> <span class="mf">0.5</span><span class="p">,</span> <span class="n">mpht</span> <span class="o">+</span> <span class="mf">0.5</span><span class="p">))</span>
  165.     <span class="k">return</span> <span class="n">close_points</span><span class="o">.</span><span class="n">column</span><span class="p">(</span><span class="s1">&#39;Child&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">mean</span><span class="p">()</span>
  166. </pre></div>
  167. </div>
  168. </div>
  169. </div>
  170. </div>
  171. </div>
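<div class="jb_cell">
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>As a rough, self-contained sketch of the same idea without the <code>datascience</code> library, the code below uses NumPy on only the ten rows displayed in the table above, so its results will not match predictions made from the full 934-row table. The name <code>predict_child_np</code> and the <code>half_width</code> parameter are illustrative and not part of the original code. The half-open interval assumes that <code>are.between</code> includes its lower bound and excludes its upper bound.</p>
</div>
</div>
</div>
</div>

```python
import numpy as np

# The ten (MidParent, Child) rows displayed in the table above.
# The full table has 934 rows; this subset is for illustration only.
midparent = np.array([75.43, 75.43, 75.43, 75.43,
                      73.66, 73.66, 73.66, 73.66,
                      72.06, 72.06])
child = np.array([73.2, 69.2, 69.0, 69.0,
                  73.5, 72.5, 65.5, 65.5,
                  71.0, 68.0])


def predict_child_np(mpht, half_width=0.5):
    """Average child height among rows whose midparent height
    lies in [mpht - half_width, mpht + half_width).

    half_width is a hypothetical parameter added for illustration.
    """
    # Boolean mask selecting the vertical strip of points around mpht
    close = (midparent >= mpht - half_width) & (midparent < mpht + half_width)
    return child[close].mean()


# Each call averages the child heights in one vertical strip
for mpht in (75.43, 73.66, 72.06):
    print(mpht, predict_child_np(mpht))
```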
  172. <div class="jb_cell">
  173. <div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
  174. <div class="text_cell_render border-box-sizing rendered_html">
  175. <p>We applied the function to the column of <code>MidParent</code> heights and visualized the results.</p>
  176. </div>
  177. </div>
  178. </div>
  179. </div>
  180. <div class="jb_cell">
  181. <div class="cell border-box-sizing code_cell rendered">
  182. <div class="input">
  183. <div class="inner_cell">
  184. <div class="input_area">
  185. <div class=" highlight hl-ipython3"><pre><span></span><span class="c1"># Apply predict_child to all the midparent heights</span>
  186. <span class="n">heights_with_predictions</span> <span class="o">=</span> <span class="n">heights</span><span class="o">.</span><span class="n">with_column</span><span class="p">(</span>
  187.     <span class="s1">&#39;Prediction&#39;</span><span class="p">,</span> <span class="n">heights</span><span class="o">.</span><span class="n">apply</span><span class="p">(</span><span class="n">predict_child</span><span class="p">,</span> <span class="s1">&#39;MidParent&#39;</span><span class="p">)</span>
  188. <span class="p">)</span>
  189. </pre></div>
  190. </div>
  191. </div>
  192. </div>
  193. </div>
  194. </div>
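<div class="jb_cell">
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>Computing one prediction per row can also be sketched with NumPy broadcasting. This is not how <code>Table.apply</code> works internally; it is only an alternative way to arrive at the same kind of column. The arrays hold the ten rows displayed earlier rather than the full dataset, and the variable names are illustrative assumptions.</p>
</div>
</div>
</div>
</div>

```python
import numpy as np

# The ten (MidParent, Child) rows displayed earlier; illustration only.
midparent = np.array([75.43, 75.43, 75.43, 75.43,
                      73.66, 73.66, 73.66, 73.66,
                      72.06, 72.06])
child = np.array([73.2, 69.2, 69.0, 69.0,
                  73.5, 72.5, 65.5, 65.5,
                  71.0, 68.0])

# diff[i, j] = midparent[j] - midparent[i]
diff = midparent[None, :] - midparent[:, None]

# close[i, j] is True when row j lies in the strip around row i,
# using the half-open interval [midparent[i] - 0.5, midparent[i] + 0.5)
close = (diff >= -0.5) & (diff < 0.5)

# Average the child heights in each strip. Every row is close to itself,
# so each strip contains at least one point and the division is safe.
predictions = (close * child).sum(axis=1) / close.sum(axis=1)

print(np.column_stack([midparent, child, predictions]))
```

<p>The pairwise comparison builds an array with one entry per pair of rows, which is manageable for a table of this size but grows quadratically with the number of rows.</p>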
  195. <div class="jb_cell">
  196. <div class="cell border-box-sizing code_cell rendered">
  197. <div class="input">
  198. <div class="inner_cell">
  199. <div class="input_area">
  200. <div class=" highlight hl-ipython3"><pre><span></span><span class="c1"># Draw the original scatter plot along with the predicted values</span>
  201. <span class="n">heights_with_predictions</span><span class="o">.</span><span class="n">scatter</span><span class="p">(</span><span class="s1">&#39;MidParent&#39;</span><span class="p">)</span>
  202. </pre></div>
  203. </div>
  204. </div>
  205. </div>
  206. <div class="output_wrapper">
  207. <div class="output">
  208. <div class="jb_output_wrapper }}">
  209. <div class="output_area">
  210. <div class="output_png output_subarea ">
  211. <img src="../../images/chapters/15/Prediction_10_0.png"
  212. >
  213. </div>
  214. </div>
  215. </div>
  216. </div>
  217. </div>
  218. </div>
  219. </div>
  220. <div class="jb_cell">
  221. <div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
  222. <div class="text_cell_render border-box-sizing rendered_html">
  223. <p>The prediction at a given midparent height lies roughly at the center of the vertical strip of points at that height. This method of prediction is called <em>regression.</em> Later in this chapter we will see where this term came from. We will also see whether we can avoid the arbitrary choice of defining "closeness" as "within 0.5 inches". But first we will develop a measure that can be used in many settings to decide how good one variable will be as a predictor of another.</p>
  224. </div>
  225. </div>
  226. </div>
  227. </div>