  1. ---
  2. redirect_from:
  3. - "/chapters/15/1/correlation"
  4. interact_link: content/chapters/15/1/Correlation.ipynb
  5. kernel_name: python3
  6. has_widgets: false
  7. title: |-
  8. Correlation
  9. prev_page:
  10. url: /chapters/15/Prediction.html
  11. title: |-
  12. Prediction
  13. next_page:
  14. url: /chapters/15/2/Regression_Line.html
  15. title: |-
  16. The Regression Line
  17. comment: "***PROGRAMMATICALLY GENERATED, DO NOT EDIT. SEE ORIGINAL FILES IN /content***"
  18. ---
  19. <div class="jb_cell tag_remove_input">
  20. <div class="cell border-box-sizing code_cell rendered">
  21. </div>
  22. </div>
  23. <div class="jb_cell tag_remove_input">
  24. <div class="cell border-box-sizing code_cell rendered">
  25. </div>
  26. </div>
  27. <div class="jb_cell">
  28. <div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
  29. <div class="text_cell_render border-box-sizing rendered_html">
  30. <h3 id="Correlation">Correlation<a class="anchor-link" href="#Correlation"> </a></h3><p>In this section we will develop a measure of how tightly clustered a scatter diagram is about a straight line. Formally, this is called measuring <em>linear association</em>.</p>
  31. </div>
  32. </div>
  33. </div>
  34. </div>
  35. <div class="jb_cell">
  36. <div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
  37. <div class="text_cell_render border-box-sizing rendered_html">
  38. <p>The table <code>hybrid</code> contains data on hybrid passenger cars sold in the United States from 1997 to 2013. The data were adapted from the online data archive of <a href="http://www.stat.ufl.edu/%7Ewinner/">Prof. Larry Winner</a> of the University of Florida. The columns:</p>
  39. <ul>
  40. <li><code>vehicle</code>: model of the car</li>
  41. <li><code>year</code>: year of manufacture</li>
  42. <li><code>msrp</code>: manufacturer's suggested retail price in 2013 dollars</li>
  43. <li><code>acceleration</code>: acceleration rate in km per hour per second</li>
  44. <li><code>mpg</code>: fuel economy in miles per gallon</li>
  45. <li><code>class</code>: the model's class.</li>
  46. </ul>
  47. </div>
  48. </div>
  49. </div>
  50. </div>
  51. <div class="jb_cell">
  52. <div class="cell border-box-sizing code_cell rendered">
  53. <div class="input">
  54. <div class="inner_cell">
  55. <div class="input_area">
  56. <div class=" highlight hl-ipython3"><pre><span></span><span class="n">hybrid</span> <span class="o">=</span> <span class="n">Table</span><span class="o">.</span><span class="n">read_table</span><span class="p">(</span><span class="n">path_data</span> <span class="o">+</span> <span class="s1">&#39;hybrid.csv&#39;</span><span class="p">)</span>
  57. </pre></div>
  58. </div>
  59. </div>
  60. </div>
  61. </div>
  62. </div>
  63. <div class="jb_cell">
  64. <div class="cell border-box-sizing code_cell rendered">
  65. <div class="input">
  66. <div class="inner_cell">
  67. <div class="input_area">
  68. <div class=" highlight hl-ipython3"><pre><span></span><span class="n">hybrid</span>
  69. </pre></div>
  70. </div>
  71. </div>
  72. </div>
  73. <div class="output_wrapper">
  74. <div class="output">
  75. <div class="jb_output_wrapper }}">
  76. <div class="output_area">
  77. <div class="output_html rendered_html output_subarea output_execute_result">
  78. <table border="1" class="dataframe">
  79. <thead>
  80. <tr>
  81. <th>vehicle</th> <th>year</th> <th>msrp</th> <th>acceleration</th> <th>mpg</th> <th>class</th>
  82. </tr>
  83. </thead>
  84. <tbody>
  85. <tr>
  86. <td>Prius (1st Gen)</td> <td>1997</td> <td>24509.7</td> <td>7.46 </td> <td>41.26</td> <td>Compact </td>
  87. </tr>
  88. <tr>
  89. <td>Tino </td> <td>2000</td> <td>35355 </td> <td>8.2 </td> <td>54.1 </td> <td>Compact </td>
  90. </tr>
  91. <tr>
  92. <td>Prius (2nd Gen)</td> <td>2000</td> <td>26832.2</td> <td>7.97 </td> <td>45.23</td> <td>Compact </td>
  93. </tr>
  94. <tr>
  95. <td>Insight </td> <td>2000</td> <td>18936.4</td> <td>9.52 </td> <td>53 </td> <td>Two Seater</td>
  96. </tr>
  97. <tr>
  98. <td>Civic (1st Gen)</td> <td>2001</td> <td>25833.4</td> <td>7.04 </td> <td>47.04</td> <td>Compact </td>
  99. </tr>
  100. <tr>
  101. <td>Insight </td> <td>2001</td> <td>19036.7</td> <td>9.52 </td> <td>53 </td> <td>Two Seater</td>
  102. </tr>
  103. <tr>
  104. <td>Insight </td> <td>2002</td> <td>19137 </td> <td>9.71 </td> <td>53 </td> <td>Two Seater</td>
  105. </tr>
  106. <tr>
  107. <td>Alphard </td> <td>2003</td> <td>38084.8</td> <td>8.33 </td> <td>40.46</td> <td>Minivan </td>
  108. </tr>
  109. <tr>
  110. <td>Insight </td> <td>2003</td> <td>19137 </td> <td>9.52 </td> <td>53 </td> <td>Two Seater</td>
  111. </tr>
  112. <tr>
  113. <td>Civic </td> <td>2003</td> <td>14071.9</td> <td>8.62 </td> <td>41 </td> <td>Compact </td>
  114. </tr>
  115. </tbody>
  116. </table>
  117. <p>... (143 rows omitted)</p>
  118. </div>
  119. </div>
  120. </div>
  121. </div>
  122. </div>
  123. </div>
  124. </div>
  125. <div class="jb_cell">
  126. <div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
  127. <div class="text_cell_render border-box-sizing rendered_html">
  128. <p>The graph below is a scatter plot of <code>msrp</code> <em>versus</em> <code>acceleration</code>. That means <code>msrp</code> is plotted on the vertical axis and <code>acceleration</code> on the horizontal.</p>
  129. </div>
  130. </div>
  131. </div>
  132. </div>
  133. <div class="jb_cell">
  134. <div class="cell border-box-sizing code_cell rendered">
  135. <div class="input">
  136. <div class="inner_cell">
  137. <div class="input_area">
  138. <div class=" highlight hl-ipython3"><pre><span></span><span class="n">hybrid</span><span class="o">.</span><span class="n">scatter</span><span class="p">(</span><span class="s1">&#39;acceleration&#39;</span><span class="p">,</span> <span class="s1">&#39;msrp&#39;</span><span class="p">)</span>
  139. </pre></div>
  140. </div>
  141. </div>
  142. </div>
  143. <div class="output_wrapper">
  144. <div class="output">
  145. <div class="jb_output_wrapper }}">
  146. <div class="output_area">
  147. <div class="output_png output_subarea ">
  148. <img src="../../../images/chapters/15/1/Correlation_7_0.png"
  149. >
  150. </div>
  151. </div>
  152. </div>
  153. </div>
  154. </div>
  155. </div>
  156. </div>
  157. <div class="jb_cell">
  158. <div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
  159. <div class="text_cell_render border-box-sizing rendered_html">
  160. <p>Notice the positive association. The scatter of points is sloping upwards, indicating that cars with greater acceleration tended to cost more, on average; conversely, the cars that cost more tended to have greater acceleration on average.</p>
  161. <p>The scatter diagram of MSRP versus mileage shows a negative association. Hybrid cars with higher mileage tended to cost less, on average. This seems surprising until you consider that cars that accelerate fast tend to be less fuel efficient and have lower mileage. As the previous scatter plot showed, those were also the cars that tended to cost more.</p>
  162. </div>
  163. </div>
  164. </div>
  165. </div>
  166. <div class="jb_cell">
  167. <div class="cell border-box-sizing code_cell rendered">
  168. <div class="input">
  169. <div class="inner_cell">
  170. <div class="input_area">
  171. <div class=" highlight hl-ipython3"><pre><span></span><span class="n">hybrid</span><span class="o">.</span><span class="n">scatter</span><span class="p">(</span><span class="s1">&#39;mpg&#39;</span><span class="p">,</span> <span class="s1">&#39;msrp&#39;</span><span class="p">)</span>
  172. </pre></div>
  173. </div>
  174. </div>
  175. </div>
  176. <div class="output_wrapper">
  177. <div class="output">
  178. <div class="jb_output_wrapper }}">
  179. <div class="output_area">
  180. <div class="output_png output_subarea ">
  181. <img src="../../../images/chapters/15/1/Correlation_9_0.png"
  182. >
  183. </div>
  184. </div>
  185. </div>
  186. </div>
  187. </div>
  188. </div>
  189. </div>
  190. <div class="jb_cell">
  191. <div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
  192. <div class="text_cell_render border-box-sizing rendered_html">
  193. <p>Along with the negative association, the scatter diagram of price versus efficiency shows a non-linear relation between the two variables. The points appear to be clustered around a curve, not around a straight line.</p>
  194. <p>If we restrict the data just to the SUV class, however, the association between price and efficiency is still negative but the relation appears to be more linear. The relation between the price and acceleration of SUVs also shows a linear trend, but with a positive slope.</p>
  195. </div>
  196. </div>
  197. </div>
  198. </div>
  199. <div class="jb_cell">
  200. <div class="cell border-box-sizing code_cell rendered">
  201. <div class="input">
  202. <div class="inner_cell">
  203. <div class="input_area">
  204. <div class=" highlight hl-ipython3"><pre><span></span><span class="n">suv</span> <span class="o">=</span> <span class="n">hybrid</span><span class="o">.</span><span class="n">where</span><span class="p">(</span><span class="s1">&#39;class&#39;</span><span class="p">,</span> <span class="s1">&#39;SUV&#39;</span><span class="p">)</span>
  205. <span class="n">suv</span><span class="o">.</span><span class="n">scatter</span><span class="p">(</span><span class="s1">&#39;mpg&#39;</span><span class="p">,</span> <span class="s1">&#39;msrp&#39;</span><span class="p">)</span>
  206. </pre></div>
  207. </div>
  208. </div>
  209. </div>
  210. <div class="output_wrapper">
  211. <div class="output">
  212. <div class="jb_output_wrapper }}">
  213. <div class="output_area">
  214. <div class="output_png output_subarea ">
  215. <img src="../../../images/chapters/15/1/Correlation_11_0.png"
  216. >
  217. </div>
  218. </div>
  219. </div>
  220. </div>
  221. </div>
  222. </div>
  223. </div>
  224. <div class="jb_cell">
  225. <div class="cell border-box-sizing code_cell rendered">
  226. <div class="input">
  227. <div class="inner_cell">
  228. <div class="input_area">
  229. <div class=" highlight hl-ipython3"><pre><span></span><span class="n">suv</span><span class="o">.</span><span class="n">scatter</span><span class="p">(</span><span class="s1">&#39;acceleration&#39;</span><span class="p">,</span> <span class="s1">&#39;msrp&#39;</span><span class="p">)</span>
  230. </pre></div>
  231. </div>
  232. </div>
  233. </div>
  234. <div class="output_wrapper">
  235. <div class="output">
  236. <div class="jb_output_wrapper }}">
  237. <div class="output_area">
  238. <div class="output_png output_subarea ">
  239. <img src="../../../images/chapters/15/1/Correlation_12_0.png"
  240. >
  241. </div>
  242. </div>
  243. </div>
  244. </div>
  245. </div>
  246. </div>
  247. </div>
  248. <div class="jb_cell">
  249. <div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
  250. <div class="text_cell_render border-box-sizing rendered_html">
  251. <p>You will have noticed that we can derive useful information from the general orientation and shape of a scatter diagram even without paying attention to the units in which the variables were measured.</p>
  252. <p>Indeed, we could plot all the variables in standard units and the plots would look the same. This gives us a way to compare the degree of linearity in two scatter diagrams.</p>
  253. <p>Recall that in an earlier section we defined the function <code>standard_units</code> to convert an array of numbers to standard units.</p>
  254. </div>
  255. </div>
  256. </div>
  257. </div>
  258. <div class="jb_cell">
  259. <div class="cell border-box-sizing code_cell rendered">
  260. <div class="input">
  261. <div class="inner_cell">
  262. <div class="input_area">
  263. <div class=" highlight hl-ipython3"><pre><span></span><span class="k">def</span> <span class="nf">standard_units</span><span class="p">(</span><span class="n">any_numbers</span><span class="p">):</span>
  264.     <span class="s2">&quot;Convert any array of numbers to standard units.&quot;</span>
  265.     <span class="k">return</span> <span class="p">(</span><span class="n">any_numbers</span> <span class="o">-</span> <span class="n">np</span><span class="o">.</span><span class="n">mean</span><span class="p">(</span><span class="n">any_numbers</span><span class="p">))</span><span class="o">/</span><span class="n">np</span><span class="o">.</span><span class="n">std</span><span class="p">(</span><span class="n">any_numbers</span><span class="p">)</span>
  266. </pre></div>
  267. </div>
  268. </div>
  269. </div>
  270. </div>
  271. </div>
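<div class="jb_cell">
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>As a quick check of what the conversion does, the sketch below applies <code>standard_units</code> to a small made-up array (the values and the rescaling constants are illustrative assumptions, not part of the <code>hybrid</code> data). It suggests two things: the converted values have mean 0 and SD 1, and changing the units of measurement by a positive scale factor and a shift leaves the standard units unchanged.</p>
</div>
</div>
</div>
</div>

```python
import numpy as np

def standard_units(any_numbers):
    """Convert any array of numbers to standard units."""
    return (any_numbers - np.mean(any_numbers)) / np.std(any_numbers)

# Hypothetical sample values, chosen only for illustration
values = np.array([2.0, 3.0, 1.0, 5.0, 2.0, 7.0])
su = standard_units(values)

# In standard units the mean is 0 and the SD is 1 (up to rounding error)
print(np.mean(su), np.std(su))

# A change of units (positive scale factor plus a shift) does not
# change the values in standard units
rescaled = standard_units(1.609 * values + 10)
print(np.allclose(su, rescaled))
```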
  272. <div class="jb_cell">
  273. <div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
  274. <div class="text_cell_render border-box-sizing rendered_html">
  275. <p>We can use this function to re-draw the two scatter diagrams for SUVs, with all the variables measured in standard units.</p>
  276. </div>
  277. </div>
  278. </div>
  279. </div>
  280. <div class="jb_cell">
  281. <div class="cell border-box-sizing code_cell rendered">
  282. <div class="input">
  283. <div class="inner_cell">
  284. <div class="input_area">
  285. <div class=" highlight hl-ipython3"><pre><span></span><span class="n">Table</span><span class="p">()</span><span class="o">.</span><span class="n">with_columns</span><span class="p">(</span>
  286. <span class="s1">&#39;mpg (standard units)&#39;</span><span class="p">,</span> <span class="n">standard_units</span><span class="p">(</span><span class="n">suv</span><span class="o">.</span><span class="n">column</span><span class="p">(</span><span class="s1">&#39;mpg&#39;</span><span class="p">)),</span>
  287. <span class="s1">&#39;msrp (standard units)&#39;</span><span class="p">,</span> <span class="n">standard_units</span><span class="p">(</span><span class="n">suv</span><span class="o">.</span><span class="n">column</span><span class="p">(</span><span class="s1">&#39;msrp&#39;</span><span class="p">))</span>
  288. <span class="p">)</span><span class="o">.</span><span class="n">scatter</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">)</span>
  289. <span class="n">plots</span><span class="o">.</span><span class="n">xlim</span><span class="p">(</span><span class="o">-</span><span class="mi">3</span><span class="p">,</span> <span class="mi">3</span><span class="p">)</span>
  290. <span class="n">plots</span><span class="o">.</span><span class="n">ylim</span><span class="p">(</span><span class="o">-</span><span class="mi">3</span><span class="p">,</span> <span class="mi">3</span><span class="p">);</span>
  291. </pre></div>
  292. </div>
  293. </div>
  294. </div>
  295. <div class="output_wrapper">
  296. <div class="output">
  297. <div class="jb_output_wrapper }}">
  298. <div class="output_area">
  299. <div class="output_png output_subarea ">
  300. <img src="../../../images/chapters/15/1/Correlation_16_0.png"
  301. >
  302. </div>
  303. </div>
  304. </div>
  305. </div>
  306. </div>
  307. </div>
  308. </div>
  309. <div class="jb_cell">
  310. <div class="cell border-box-sizing code_cell rendered">
  311. <div class="input">
  312. <div class="inner_cell">
  313. <div class="input_area">
  314. <div class=" highlight hl-ipython3"><pre><span></span><span class="n">Table</span><span class="p">()</span><span class="o">.</span><span class="n">with_columns</span><span class="p">(</span>
  315. <span class="s1">&#39;acceleration (standard units)&#39;</span><span class="p">,</span> <span class="n">standard_units</span><span class="p">(</span><span class="n">suv</span><span class="o">.</span><span class="n">column</span><span class="p">(</span><span class="s1">&#39;acceleration&#39;</span><span class="p">)),</span>
  316. <span class="s1">&#39;msrp (standard units)&#39;</span><span class="p">,</span> <span class="n">standard_units</span><span class="p">(</span><span class="n">suv</span><span class="o">.</span><span class="n">column</span><span class="p">(</span><span class="s1">&#39;msrp&#39;</span><span class="p">))</span>
  317. <span class="p">)</span><span class="o">.</span><span class="n">scatter</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">)</span>
  318. <span class="n">plots</span><span class="o">.</span><span class="n">xlim</span><span class="p">(</span><span class="o">-</span><span class="mi">3</span><span class="p">,</span> <span class="mi">3</span><span class="p">)</span>
  319. <span class="n">plots</span><span class="o">.</span><span class="n">ylim</span><span class="p">(</span><span class="o">-</span><span class="mi">3</span><span class="p">,</span> <span class="mi">3</span><span class="p">);</span>
  320. </pre></div>
  321. </div>
  322. </div>
  323. </div>
  324. <div class="output_wrapper">
  325. <div class="output">
  326. <div class="jb_output_wrapper }}">
  327. <div class="output_area">
  328. <div class="output_png output_subarea ">
  329. <img src="../../../images/chapters/15/1/Correlation_17_0.png"
  330. >
  331. </div>
  332. </div>
  333. </div>
  334. </div>
  335. </div>
  336. </div>
  337. </div>
  338. <div class="jb_cell">
  339. <div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
  340. <div class="text_cell_render border-box-sizing rendered_html">
  341. <p>The associations that we see in these figures are the same as those we saw before. Also, because the two scatter diagrams are now drawn on exactly the same scale, we can see that the linear relation in the second diagram is a little fuzzier than in the first.</p>
  342. <p>We will now define a measure that uses standard units to quantify the kinds of association that we have seen.</p>
  343. </div>
  344. </div>
  345. </div>
  346. </div>
  347. <div class="jb_cell">
  348. <div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
  349. <div class="text_cell_render border-box-sizing rendered_html">
  350. <h3 id="The-correlation-coefficient">The correlation coefficient<a class="anchor-link" href="#The-correlation-coefficient"> </a></h3><p>The <em>correlation coefficient</em> measures the strength of the linear relationship between two variables. Graphically, it measures how clustered the scatter diagram is around a straight line.</p>
  351. <p>The term <em>correlation coefficient</em> isn't easy to say, so it is usually shortened to <em>correlation</em> and denoted by $r$.</p>
  352. <p>Here are some mathematical facts about $r$ that we will just observe by simulation.</p>
  353. <ul>
  354. <li>The correlation coefficient $r$ is a number between $-1$ and $1$.</li>
  355. <li>$r$ measures the extent to which the scatter plot clusters around a straight line.</li>
  356. <li>$r = 1$ if the scatter diagram is a perfect straight line sloping upwards, and $r = -1$ if the scatter diagram is a perfect straight line sloping downwards.</li>
  357. </ul>
  358. </div>
  359. </div>
  360. </div>
  361. </div>
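<div class="jb_cell">
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>These facts can be checked numerically on small examples. The sketch below uses NumPy's <code>np.corrcoef</code>, which returns a matrix of correlations, rather than any function defined in this text. The lines, the noise level, and the random seed are arbitrary choices made for illustration.</p>
</div>
</div>
</div>
</div>

```python
import numpy as np

x = np.arange(1, 11)

# Points exactly on a line sloping upwards: r should be 1
r_up = np.corrcoef(x, 3 * x + 2)[0, 1]

# Points exactly on a line sloping downwards: r should be -1
r_down = np.corrcoef(x, -0.5 * x + 7)[0, 1]

# Adding random noise pulls the points off the line, so r falls below 1
rng = np.random.default_rng(0)
r_noisy = np.corrcoef(x, 3 * x + 2 + rng.normal(0, 5, size=10))[0, 1]

print(r_up, r_down, r_noisy)
```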
  362. <div class="jb_cell">
  363. <div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
  364. <div class="text_cell_render border-box-sizing rendered_html">
  365. <p>The function <code>r_scatter</code> takes a value of $r$ as its argument and simulates a scatter plot with a correlation very close to $r$. Because of randomness in the simulation, the correlation is not expected to be exactly equal to $r$.</p>
  366. <p>Call <code>r_scatter</code> a few times, with different values of $r$ as the argument, and see how the scatter plot changes.</p>
  367. <p>When $r=1$ the scatter plot is perfectly linear and slopes upward. When $r=-1$, the scatter plot is perfectly linear and slopes downward. When $r=0$, the scatter plot is a formless cloud around the horizontal axis, and the variables are said to be <em>uncorrelated</em>.</p>
  368. </div>
  369. </div>
  370. </div>
  371. </div>
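<div class="jb_cell">
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>The definition of <code>r_scatter</code> is not shown in this section. The sketch below is one common way to simulate points with a correlation close to a chosen $r$. It is an assumption about how such a simulation could work, not necessarily how <code>r_scatter</code> is implemented. The helper name <code>simulate_with_correlation</code> and its parameters are hypothetical, and the plotting step is omitted.</p>
</div>
</div>
</div>
</div>

```python
import numpy as np

def simulate_with_correlation(r, n=1000, seed=None):
    """Hypothetical helper: simulate n points whose correlation is close to r."""
    rng = np.random.default_rng(seed)
    x = rng.normal(0, 1, n)   # horizontal coordinates
    z = rng.normal(0, 1, n)   # noise drawn independently of x
    # Mix x and the noise so that y has the desired correlation with x
    y = r * x + np.sqrt(1 - r**2) * z
    return x, y

x, y = simulate_with_correlation(0.9, seed=42)
observed_r = np.corrcoef(x, y)[0, 1]
print(observed_r)  # close to 0.9, but not exactly equal because of randomness
```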
  372. <div class="jb_cell">
  373. <div class="cell border-box-sizing code_cell rendered">
  374. <div class="input">
  375. <div class="inner_cell">
  376. <div class="input_area">
  377. <div class=" highlight hl-ipython3"><pre><span></span><span class="n">r_scatter</span><span class="p">(</span><span class="mf">0.9</span><span class="p">)</span>
  378. </pre></div>
  379. </div>
  380. </div>
  381. </div>
  382. <div class="output_wrapper">
  383. <div class="output">
  384. <div class="jb_output_wrapper }}">
  385. <div class="output_area">
  386. <div class="output_png output_subarea ">
  387. <img src="../../../images/chapters/15/1/Correlation_21_0.png"
  388. >
  389. </div>
  390. </div>
  391. </div>
  392. </div>
  393. </div>
  394. </div>
  395. </div>
  396. <div class="jb_cell">
  397. <div class="cell border-box-sizing code_cell rendered">
  398. <div class="input">
  399. <div class="inner_cell">
  400. <div class="input_area">
  401. <div class=" highlight hl-ipython3"><pre><span></span><span class="n">r_scatter</span><span class="p">(</span><span class="mf">0.25</span><span class="p">)</span>
  402. </pre></div>
  403. </div>
  404. </div>
  405. </div>
  406. <div class="output_wrapper">
  407. <div class="output">
  408. <div class="jb_output_wrapper }}">
  409. <div class="output_area">
  410. <div class="output_png output_subarea ">
  411. <img src="../../../images/chapters/15/1/Correlation_22_0.png"
  412. >
  413. </div>
  414. </div>
  415. </div>
  416. </div>
  417. </div>
  418. </div>
  419. </div>
  420. <div class="jb_cell">
  421. <div class="cell border-box-sizing code_cell rendered">
  422. <div class="input">
  423. <div class="inner_cell">
  424. <div class="input_area">
  425. <div class=" highlight hl-ipython3"><pre><span></span><span class="n">r_scatter</span><span class="p">(</span><span class="mi">0</span><span class="p">)</span>
  426. </pre></div>
  427. </div>
  428. </div>
  429. </div>
  430. <div class="output_wrapper">
  431. <div class="output">
  432. <div class="jb_output_wrapper }}">
  433. <div class="output_area">
  434. <div class="output_png output_subarea ">
  435. <img src="../../../images/chapters/15/1/Correlation_23_0.png"
  436. >
  437. </div>
  438. </div>
  439. </div>
  440. </div>
  441. </div>
  442. </div>
  443. </div>
  444. <div class="jb_cell">
  445. <div class="cell border-box-sizing code_cell rendered">
  446. <div class="input">
  447. <div class="inner_cell">
  448. <div class="input_area">
  449. <div class=" highlight hl-ipython3"><pre><span></span><span class="n">r_scatter</span><span class="p">(</span><span class="o">-</span><span class="mf">0.55</span><span class="p">)</span>
  450. </pre></div>
  451. </div>
  452. </div>
  453. </div>
  454. <div class="output_wrapper">
  455. <div class="output">
  456. <div class="jb_output_wrapper }}">
  457. <div class="output_area">
  458. <div class="output_png output_subarea ">
  459. <img src="../../../images/chapters/15/1/Correlation_24_0.png"
  460. >
  461. </div>
  462. </div>
  463. </div>
  464. </div>
  465. </div>
  466. </div>
  467. </div>
  468. <div class="jb_cell">
  469. <div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
  470. <div class="text_cell_render border-box-sizing rendered_html">
  471. <h3 id="Calculating-$r$">Calculating $r$<a class="anchor-link" href="#Calculating-$r$"> </a></h3><p>The formula for $r$ is not apparent from our observations so far. It has a mathematical basis that is outside the scope of this class. However, as you will see, the calculation is straightforward and helps us understand several of the properties of $r$.</p>
  472. <p><strong>Formula for $r$</strong>:</p>
  473. <p><strong>$r$ is the average of the products of the two variables, when both variables are measured in standard units.</strong></p>
  474. <p>Here are the steps in the calculation. We will apply the steps to a simple table of values of $x$ and $y$.</p>
  475. </div>
  476. </div>
  477. </div>
  478. </div>
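<div class="jb_cell">
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>The verbal formula can also be written in symbols. The notation here is introduced only for illustration: suppose the data are $n$ pairs $(x_i, y_i)$, with means $\bar{x}$ and $\bar{y}$ and standard deviations $\sigma_x$ and $\sigma_y$, where each SD is computed with divisor $n$, as <code>np.std</code> does by default. Then</p>
<p>$$r ~=~ \frac{1}{n} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{\sigma_x} \right) \left( \frac{y_i - \bar{y}}{\sigma_y} \right)$$</p>
<p>Each factor in parentheses is a value in standard units, so the sum divided by $n$ is the average of the products of the two variables in standard units.</p>
</div>
</div>
</div>
</div>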
  479. <div class="jb_cell">
  480. <div class="cell border-box-sizing code_cell rendered">
  481. <div class="input">
  482. <div class="inner_cell">
  483. <div class="input_area">
  484. <div class=" highlight hl-ipython3"><pre><span></span><span class="n">x</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">arange</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">7</span><span class="p">,</span> <span class="mi">1</span><span class="p">)</span>
  485. <span class="n">y</span> <span class="o">=</span> <span class="n">make_array</span><span class="p">(</span><span class="mi">2</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">5</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">7</span><span class="p">)</span>
  486. <span class="n">t</span> <span class="o">=</span> <span class="n">Table</span><span class="p">()</span><span class="o">.</span><span class="n">with_columns</span><span class="p">(</span>
  487. <span class="s1">&#39;x&#39;</span><span class="p">,</span> <span class="n">x</span><span class="p">,</span>
  488. <span class="s1">&#39;y&#39;</span><span class="p">,</span> <span class="n">y</span>
  489. <span class="p">)</span>
  490. <span class="n">t</span>
  491. </pre></div>
  492. </div>
  493. </div>
  494. </div>
  495. <div class="output_wrapper">
  496. <div class="output">
  497. <div class="jb_output_wrapper }}">
  498. <div class="output_area">
  499. <div class="output_html rendered_html output_subarea output_execute_result">
  500. <table border="1" class="dataframe">
  501. <thead>
  502. <tr>
  503. <th>x</th> <th>y</th>
  504. </tr>
  505. </thead>
  506. <tbody>
  507. <tr>
  508. <td>1 </td> <td>2 </td>
  509. </tr>
  510. <tr>
  511. <td>2 </td> <td>3 </td>
  512. </tr>
  513. <tr>
  514. <td>3 </td> <td>1 </td>
  515. </tr>
  516. <tr>
  517. <td>4 </td> <td>5 </td>
  518. </tr>
  519. <tr>
  520. <td>5 </td> <td>2 </td>
  521. </tr>
  522. <tr>
  523. <td>6 </td> <td>7 </td>
  524. </tr>
  525. </tbody>
  526. </table>
  527. </div>
  528. </div>
  529. </div>
  530. </div>
  531. </div>
  532. </div>
  533. </div>
  534. <div class="jb_cell">
  535. <div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
  536. <div class="text_cell_render border-box-sizing rendered_html">
  537. <p>Based on the scatter diagram below, we expect that $r$ will be positive but not equal to 1.</p>
  538. </div>
  539. </div>
  540. </div>
  541. </div>
  542. <div class="jb_cell">
  543. <div class="cell border-box-sizing code_cell rendered">
  544. <div class="input">
  545. <div class="inner_cell">
  546. <div class="input_area">
  547. <div class=" highlight hl-ipython3"><pre><span></span><span class="n">t</span><span class="o">.</span><span class="n">scatter</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="n">s</span><span class="o">=</span><span class="mi">30</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="s1">&#39;red&#39;</span><span class="p">)</span>
  548. </pre></div>
  549. </div>
  550. </div>
  551. </div>
  552. <div class="output_wrapper">
  553. <div class="output">
  554. <div class="jb_output_wrapper }}">
  555. <div class="output_area">
  556. <div class="output_png output_subarea ">
  557. <img src="../../../images/chapters/15/1/Correlation_28_0.png"
  558. >
  559. </div>
  560. </div>
  561. </div>
  562. </div>
  563. </div>
  564. </div>
  565. </div>
  566. <div class="jb_cell">
  567. <div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
  568. <div class="text_cell_render border-box-sizing rendered_html">
  569. <p><strong>Step 1.</strong> Convert each variable to standard units.</p>
  570. </div>
  571. </div>
  572. </div>
  573. </div>
  574. <div class="jb_cell">
  575. <div class="cell border-box-sizing code_cell rendered">
  576. <div class="input">
  577. <div class="inner_cell">
  578. <div class="input_area">
  579. <div class=" highlight hl-ipython3"><pre><span></span><span class="n">t_su</span> <span class="o">=</span> <span class="n">t</span><span class="o">.</span><span class="n">with_columns</span><span class="p">(</span>
  580. <span class="s1">&#39;x (standard units)&#39;</span><span class="p">,</span> <span class="n">standard_units</span><span class="p">(</span><span class="n">x</span><span class="p">),</span>
  581. <span class="s1">&#39;y (standard units)&#39;</span><span class="p">,</span> <span class="n">standard_units</span><span class="p">(</span><span class="n">y</span><span class="p">)</span>
  582. <span class="p">)</span>
  583. <span class="n">t_su</span>
  584. </pre></div>
  585. </div>
  586. </div>
  587. </div>
  588. <div class="output_wrapper">
  589. <div class="output">
  590. <div class="jb_output_wrapper }}">
  591. <div class="output_area">
  592. <div class="output_html rendered_html output_subarea output_execute_result">
  593. <table border="1" class="dataframe">
  594. <thead>
  595. <tr>
  596. <th>x</th> <th>y</th> <th>x (standard units)</th> <th>y (standard units)</th>
  597. </tr>
  598. </thead>
  599. <tbody>
  600. <tr>
  601. <td>1 </td> <td>2 </td> <td>-1.46385 </td> <td>-0.648886 </td>
  602. </tr>
  603. <tr>
  604. <td>2 </td> <td>3 </td> <td>-0.87831 </td> <td>-0.162221 </td>
  605. </tr>
  606. <tr>
  607. <td>3 </td> <td>1 </td> <td>-0.29277 </td> <td>-1.13555 </td>
  608. </tr>
  609. <tr>
  610. <td>4 </td> <td>5 </td> <td>0.29277 </td> <td>0.811107 </td>
  611. </tr>
  612. <tr>
  613. <td>5 </td> <td>2 </td> <td>0.87831 </td> <td>-0.648886 </td>
  614. </tr>
  615. <tr>
  616. <td>6 </td> <td>7 </td> <td>1.46385 </td> <td>1.78444 </td>
  617. </tr>
  618. </tbody>
  619. </table>
  620. </div>
  621. </div>
  622. </div>
  623. </div>
  624. </div>
  625. </div>
  626. </div>
  627. <div class="jb_cell">
  628. <div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
  629. <div class="text_cell_render border-box-sizing rendered_html">
  630. <p><strong>Step 2.</strong> Multiply each pair of standard units.</p>
  631. </div>
  632. </div>
  633. </div>
  634. </div>
  635. <div class="jb_cell">
  636. <div class="cell border-box-sizing code_cell rendered">
  637. <div class="input">
  638. <div class="inner_cell">
  639. <div class="input_area">
  640. <div class=" highlight hl-ipython3"><pre><span></span><span class="n">t_product</span> <span class="o">=</span> <span class="n">t_su</span><span class="o">.</span><span class="n">with_column</span><span class="p">(</span><span class="s1">&#39;product of standard units&#39;</span><span class="p">,</span> <span class="n">t_su</span><span class="o">.</span><span class="n">column</span><span class="p">(</span><span class="mi">2</span><span class="p">)</span> <span class="o">*</span> <span class="n">t_su</span><span class="o">.</span><span class="n">column</span><span class="p">(</span><span class="mi">3</span><span class="p">))</span>
  641. <span class="n">t_product</span>
  642. </pre></div>
  643. </div>
  644. </div>
  645. </div>
  646. <div class="output_wrapper">
  647. <div class="output">
  648. <div class="jb_output_wrapper }}">
  649. <div class="output_area">
  650. <div class="output_html rendered_html output_subarea output_execute_result">
  651. <table border="1" class="dataframe">
  652. <thead>
  653. <tr>
  654. <th>x</th> <th>y</th> <th>x (standard units)</th> <th>y (standard units)</th> <th>product of standard units</th>
  655. </tr>
  656. </thead>
  657. <tbody>
  658. <tr>
  659. <td>1 </td> <td>2 </td> <td>-1.46385 </td> <td>-0.648886 </td> <td>0.949871 </td>
  660. </tr>
  661. <tr>
  662. <td>2 </td> <td>3 </td> <td>-0.87831 </td> <td>-0.162221 </td> <td>0.142481 </td>
  663. </tr>
  664. <tr>
  665. <td>3 </td> <td>1 </td> <td>-0.29277 </td> <td>-1.13555 </td> <td>0.332455 </td>
  666. </tr>
  667. <tr>
  668. <td>4 </td> <td>5 </td> <td>0.29277 </td> <td>0.811107 </td> <td>0.237468 </td>
  669. </tr>
  670. <tr>
  671. <td>5 </td> <td>2 </td> <td>0.87831 </td> <td>-0.648886 </td> <td>-0.569923 </td>
  672. </tr>
  673. <tr>
  674. <td>6 </td> <td>7 </td> <td>1.46385 </td> <td>1.78444 </td> <td>2.61215 </td>
  675. </tr>
  676. </tbody>
  677. </table>
  678. </div>
  679. </div>
  680. </div>
  681. </div>
  682. </div>
  683. </div>
  684. </div>
  685. <div class="jb_cell">
  686. <div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
  687. <div class="text_cell_render border-box-sizing rendered_html">
  688. <p><strong>Step 3.</strong> $r$ is the average of the products computed in Step 2.</p>
  689. </div>
  690. </div>
  691. </div>
  692. </div>
  693. <div class="jb_cell">
  694. <div class="cell border-box-sizing code_cell rendered">
  695. <div class="input">
  696. <div class="inner_cell">
  697. <div class="input_area">
  698. <div class=" highlight hl-ipython3"><pre><span></span><span class="c1"># r is the average of the products of standard units</span>
<span class="n">r</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">mean</span><span class="p">(</span><span class="n">t_product</span><span class="o">.</span><span class="n">column</span><span class="p">(</span><span class="mi">4</span><span class="p">))</span>
<span class="n">r</span>
</pre></div>
  702. </div>
  703. </div>
  704. </div>
  705. <div class="output_wrapper">
  706. <div class="output">
  707. <div class="jb_output_wrapper }}">
  708. <div class="output_area">
  709. <div class="output_text output_subarea output_execute_result">
  710. <pre>0.6174163971897709</pre>
  711. </div>
  712. </div>
  713. </div>
  714. </div>
  715. </div>
  716. </div>
  717. </div>
  718. <div class="jb_cell">
  719. <div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
  720. <div class="text_cell_render border-box-sizing rendered_html">
  721. <p>As expected, $r$ is positive but not equal to 1.</p>
  722. </div>
  723. </div>
  724. </div>
  725. </div>
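<p>The three steps can also be traced outside the <code>Table</code> machinery. The following is a minimal sketch in plain NumPy. It assumes the six $(x, y)$ pairs shown in the table above, and it assumes that standard units are based on the population SD (the default of <code>np.std</code>), which is consistent with the standard-unit values displayed there. The helper <code>standard_units</code> is redefined here only so that the block runs on its own.</p>

```python
import numpy as np

# The six (x, y) pairs from the table above
x = np.array([1, 2, 3, 4, 5, 6])
y = np.array([2, 3, 1, 5, 2, 7])

def standard_units(values):
    """Convert an array to standard units: (value - mean) / SD.

    Assumption: SD is the population SD, which is np.std's default.
    """
    return (values - np.mean(values)) / np.std(values)

# Step 1: convert each variable to standard units
x_su = standard_units(x)
y_su = standard_units(y)

# Step 2: multiply each pair of standard units
products = x_su * y_su

# Step 3: r is the average of the products
r = np.mean(products)
print(r)
```

<p>Under those assumptions, the printed value should agree with the $r$ computed in Step 3.</p>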
  726. <div class="jb_cell">
  727. <div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
  728. <div class="text_cell_render border-box-sizing rendered_html">
  729. <h3 id="Properties-of-$r$">Properties of $r$<a class="anchor-link" href="#Properties-of-$r$"> </a></h3><p>The calculation shows that:</p>
  730. <ul>
  731. <li>$r$ is a pure number. It has no units. This is because $r$ is based on standard units.</li>
  732. <li>$r$ is unaffected by changing the units on either axis. This too is because $r$ is based on standard units.</li>
<li>$r$ is unaffected by switching the axes. Algebraically, this is because the product of standard units does not depend on which variable is called $x$ and which $y$. Geometrically, switching axes reflects the scatter plot about the line $y=x$, which changes neither the amount of clustering nor the sign of the association.</li>
  734. </ul>
  735. </div>
  736. </div>
  737. </div>
  738. </div>
  739. <div class="jb_cell">
  740. <div class="cell border-box-sizing code_cell rendered">
  741. <div class="input">
  742. <div class="inner_cell">
  743. <div class="input_area">
  744. <div class=" highlight hl-ipython3"><pre><span></span><span class="n">t</span><span class="o">.</span><span class="n">scatter</span><span class="p">(</span><span class="s1">&#39;y&#39;</span><span class="p">,</span> <span class="s1">&#39;x&#39;</span><span class="p">,</span> <span class="n">s</span><span class="o">=</span><span class="mi">30</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="s1">&#39;red&#39;</span><span class="p">)</span>
  745. </pre></div>
  746. </div>
  747. </div>
  748. </div>
  749. <div class="output_wrapper">
  750. <div class="output">
  751. <div class="jb_output_wrapper }}">
  752. <div class="output_area">
  753. <div class="output_png output_subarea ">
  754. <img src="../../../images/chapters/15/1/Correlation_37_0.png"
  755. >
  756. </div>
  757. </div>
  758. </div>
  759. </div>
  760. </div>
  761. </div>
  762. </div>
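<p>These properties can be checked numerically. The sketch below reuses the six $(x, y)$ pairs from the example and is only an illustration: the helper <code>r_of</code> and the particular change of units (multiply by 1.8 and add 32) are hypothetical choices, not part of the text. The change of units is assumed to have a positive scale factor; a negative factor would flip the sign of $r$.</p>

```python
import numpy as np

def standard_units(values):
    """(value - mean) / SD, using the population SD (np.std default)."""
    return (values - np.mean(values)) / np.std(values)

def r_of(a, b):
    """Mean of the products of two arrays in standard units."""
    return np.mean(standard_units(a) * standard_units(b))

x = np.array([1, 2, 3, 4, 5, 6])
y = np.array([2, 3, 1, 5, 2, 7])

r_original = r_of(x, y)

# Change the units on the x-axis (hypothetical conversion, positive scale factor)
r_rescaled = r_of(1.8 * x + 32, y)

# Switch the roles of the two variables
r_switched = r_of(y, x)

# Both comparisons should report True, up to floating-point rounding
print(np.isclose(r_original, r_rescaled), np.isclose(r_original, r_switched))
```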
  763. <div class="jb_cell">
  764. <div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
  765. <div class="text_cell_render border-box-sizing rendered_html">
<h3 id="The-correlation-function">The <code>correlation</code> function<a class="anchor-link" href="#The-correlation-function"> </a></h3><p>We will be calculating correlations repeatedly, so it will help to define a function that performs all the steps described above. Let's define a function <code>correlation</code> that takes a table and the labels of two columns in the table. The function returns $r$, the mean of the products of those column values in standard units.</p>
  767. </div>
  768. </div>
  769. </div>
  770. </div>
  771. <div class="jb_cell">
  772. <div class="cell border-box-sizing code_cell rendered">
  773. <div class="input">
  774. <div class="inner_cell">
  775. <div class="input_area">
  776. <div class=" highlight hl-ipython3"><pre><span></span><span class="k">def</span> <span class="nf">correlation</span><span class="p">(</span><span class="n">t</span><span class="p">,</span> <span class="n">x</span><span class="p">,</span> <span class="n">y</span><span class="p">):</span>
    <span class="sd">&quot;&quot;&quot;Return r: the mean of the products of columns x and y of table t, in standard units.&quot;&quot;&quot;</span>
    <span class="k">return</span> <span class="n">np</span><span class="o">.</span><span class="n">mean</span><span class="p">(</span><span class="n">standard_units</span><span class="p">(</span><span class="n">t</span><span class="o">.</span><span class="n">column</span><span class="p">(</span><span class="n">x</span><span class="p">))</span><span class="o">*</span><span class="n">standard_units</span><span class="p">(</span><span class="n">t</span><span class="o">.</span><span class="n">column</span><span class="p">(</span><span class="n">y</span><span class="p">)))</span>
</pre></div>
  779. </div>
  780. </div>
  781. </div>
  782. </div>
  783. </div>
  784. <div class="jb_cell">
  785. <div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
  786. <div class="text_cell_render border-box-sizing rendered_html">
<p>Let's call the function on the <code>x</code> and <code>y</code> columns of <code>t</code>. The function returns the same value for the correlation between $x$ and $y$ as we got by applying the formula for $r$ directly.</p>
  788. </div>
  789. </div>
  790. </div>
  791. </div>
  792. <div class="jb_cell">
  793. <div class="cell border-box-sizing code_cell rendered">
  794. <div class="input">
  795. <div class="inner_cell">
  796. <div class="input_area">
  797. <div class=" highlight hl-ipython3"><pre><span></span><span class="n">correlation</span><span class="p">(</span><span class="n">t</span><span class="p">,</span> <span class="s1">&#39;x&#39;</span><span class="p">,</span> <span class="s1">&#39;y&#39;</span><span class="p">)</span>
  798. </pre></div>
  799. </div>
  800. </div>
  801. </div>
  802. <div class="output_wrapper">
  803. <div class="output">
  804. <div class="jb_output_wrapper }}">
  805. <div class="output_area">
  806. <div class="output_text output_subarea output_execute_result">
  807. <pre>0.6174163971897709</pre>
  808. </div>
  809. </div>
  810. </div>
  811. </div>
  812. </div>
  813. </div>
  814. </div>
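<p>As an independent cross-check, the same quantity can be compared with NumPy's built-in <code>np.corrcoef</code>. This is a sketch under assumptions: <code>correlation_arrays</code> is a hypothetical array-based analogue of <code>correlation</code>, written here so the block is self-contained, and the data are the six $(x, y)$ pairs from the example.</p>

```python
import numpy as np

def correlation_arrays(x_values, y_values):
    """Hypothetical array-based analogue of correlation(t, x, y)."""
    x_su = (x_values - np.mean(x_values)) / np.std(x_values)
    y_su = (y_values - np.mean(y_values)) / np.std(y_values)
    return np.mean(x_su * y_su)

x = np.array([1, 2, 3, 4, 5, 6])
y = np.array([2, 3, 1, 5, 2, 7])

r_manual = correlation_arrays(x, y)

# np.corrcoef returns a 2x2 correlation matrix; the off-diagonal entry is r
r_numpy = np.corrcoef(x, y)[0, 1]

print(r_manual, r_numpy)
```

<p>The two values should agree up to floating-point rounding, because the choice of SD divisor cancels in the ratio that defines $r$.</p>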
  815. <div class="jb_cell">
  816. <div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
  817. <div class="text_cell_render border-box-sizing rendered_html">
  818. <p>As we noticed, the order in which the variables are specified doesn't matter.</p>
  819. </div>
  820. </div>
  821. </div>
  822. </div>
  823. <div class="jb_cell">
  824. <div class="cell border-box-sizing code_cell rendered">
  825. <div class="input">
  826. <div class="inner_cell">
  827. <div class="input_area">
  828. <div class=" highlight hl-ipython3"><pre><span></span><span class="n">correlation</span><span class="p">(</span><span class="n">t</span><span class="p">,</span> <span class="s1">&#39;y&#39;</span><span class="p">,</span> <span class="s1">&#39;x&#39;</span><span class="p">)</span>
  829. </pre></div>
  830. </div>
  831. </div>
  832. </div>
  833. <div class="output_wrapper">
  834. <div class="output">
  835. <div class="jb_output_wrapper }}">
  836. <div class="output_area">
  837. <div class="output_text output_subarea output_execute_result">
  838. <pre>0.6174163971897709</pre>
  839. </div>
  840. </div>
  841. </div>
  842. </div>
  843. </div>
  844. </div>
  845. </div>
  846. <div class="jb_cell">
  847. <div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
  848. <div class="text_cell_render border-box-sizing rendered_html">
  849. <p>Calling <code>correlation</code> on columns of the table <code>suv</code> gives us the correlation between price and mileage as well as the correlation between price and acceleration.</p>
  850. </div>
  851. </div>
  852. </div>
  853. </div>
  854. <div class="jb_cell">
  855. <div class="cell border-box-sizing code_cell rendered">
  856. <div class="input">
  857. <div class="inner_cell">
  858. <div class="input_area">
  859. <div class=" highlight hl-ipython3"><pre><span></span><span class="n">correlation</span><span class="p">(</span><span class="n">suv</span><span class="p">,</span> <span class="s1">&#39;mpg&#39;</span><span class="p">,</span> <span class="s1">&#39;msrp&#39;</span><span class="p">)</span>
  860. </pre></div>
  861. </div>
  862. </div>
  863. </div>
  864. <div class="output_wrapper">
  865. <div class="output">
  866. <div class="jb_output_wrapper }}">
  867. <div class="output_area">
  868. <div class="output_text output_subarea output_execute_result">
  869. <pre>-0.6667143635709919</pre>
  870. </div>
  871. </div>
  872. </div>
  873. </div>
  874. </div>
  875. </div>
  876. </div>
  877. <div class="jb_cell">
  878. <div class="cell border-box-sizing code_cell rendered">
  879. <div class="input">
  880. <div class="inner_cell">
  881. <div class="input_area">
  882. <div class=" highlight hl-ipython3"><pre><span></span><span class="n">correlation</span><span class="p">(</span><span class="n">suv</span><span class="p">,</span> <span class="s1">&#39;acceleration&#39;</span><span class="p">,</span> <span class="s1">&#39;msrp&#39;</span><span class="p">)</span>
  883. </pre></div>
  884. </div>
  885. </div>
  886. </div>
  887. <div class="output_wrapper">
  888. <div class="output">
  889. <div class="jb_output_wrapper }}">
  890. <div class="output_area">
  891. <div class="output_text output_subarea output_execute_result">
  892. <pre>0.48699799279959155</pre>
  893. </div>
  894. </div>
  895. </div>
  896. </div>
  897. </div>
  898. </div>
  899. </div>
  900. <div class="jb_cell">
  901. <div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
  902. <div class="text_cell_render border-box-sizing rendered_html">
  903. <p>These values confirm what we had observed:</p>
  904. <ul>
  905. <li>There is a negative association between price and efficiency, whereas the association between price and acceleration is positive.</li>
<li>The linear relation between price and acceleration is a little weaker (correlation about 0.5) than that between price and mileage (correlation about -0.67). The strength of a linear relation is judged by the size of $r$, not its sign.</li>
  907. </ul>
  908. </div>
  909. </div>
  910. </div>
  911. </div>
  912. <div class="jb_cell">
  913. <div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
  914. <div class="text_cell_render border-box-sizing rendered_html">
  915. <p>Correlation is a simple and powerful concept, but it is sometimes misused. Before using $r$, it is important to be aware of what correlation does and does not measure.</p>
  916. </div>
  917. </div>
  918. </div>
  919. </div>
  920. <div class="jb_cell">
  921. <div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
  922. <div class="text_cell_render border-box-sizing rendered_html">
  923. <h3 id="Association-is-not-Causation">Association is not Causation<a class="anchor-link" href="#Association-is-not-Causation"> </a></h3><p>Correlation only measures association. Correlation does not imply causation. Though the correlation between the weight and the math ability of children in a school district may be positive, that does not mean that doing math makes children heavier or that putting on weight improves the children's math skills. Age is a confounding variable: older children are both heavier and better at math than younger children, on average.</p>
  924. </div>
  925. </div>
  926. </div>
  927. </div>
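<p>The role of a confounding variable can be illustrated with a small simulation. Everything in the sketch below is made up for illustration: the data are simulated, not real, and the sample size, coefficients, and noise levels are arbitrary assumptions. Age drives both weight and math score, and neither of those two variables affects the other.</p>

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible

# Simulated (not real) children: age influences both weight and math score
n = 2000
age = rng.uniform(6, 12, n)                     # years
weight = 20 + 3 * age + rng.normal(0, 3, n)     # arbitrary made-up units
math_score = 10 * age + rng.normal(0, 10, n)    # arbitrary made-up scale

def corr(a, b):
    """Correlation coefficient of two arrays."""
    return np.corrcoef(a, b)[0, 1]

# Correlation across all simulated children
r_all = corr(weight, math_score)

# Restrict to children of nearly the same age, holding the confounder roughly fixed
same_age = (age >= 9) & (age < 9.5)
r_same_age = corr(weight[same_age], math_score[same_age])

print(round(r_all, 2), round(r_same_age, 2))
```

<p>In this simulated setting the overall correlation is clearly positive, while among children of nearly the same age it is close to 0, which is what one would expect if age accounts for the association.</p>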
  928. <div class="jb_cell">
  929. <div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
  930. <div class="text_cell_render border-box-sizing rendered_html">
  931. <h3 id="Correlation-Measures-Linear-Association">Correlation Measures <em>Linear</em> Association<a class="anchor-link" href="#Correlation-Measures-Linear-Association"> </a></h3><p>Correlation measures only one kind of association – linear. Variables that have strong non-linear association might have very low correlation. Here is an example of variables that have a perfect quadratic relation $y = x^2$ but have correlation equal to 0.</p>
  932. </div>
  933. </div>
  934. </div>
  935. </div>
  936. <div class="jb_cell">
  937. <div class="cell border-box-sizing code_cell rendered">
  938. <div class="input">
  939. <div class="inner_cell">
  940. <div class="input_area">
  941. <div class=" highlight hl-ipython3"><pre><span></span><span class="n">new_x</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">arange</span><span class="p">(</span><span class="o">-</span><span class="mi">4</span><span class="p">,</span> <span class="mf">4.1</span><span class="p">,</span> <span class="mf">0.5</span><span class="p">)</span>
<span class="n">nonlinear</span> <span class="o">=</span> <span class="n">Table</span><span class="p">()</span><span class="o">.</span><span class="n">with_columns</span><span class="p">(</span>
    <span class="s1">&#39;x&#39;</span><span class="p">,</span> <span class="n">new_x</span><span class="p">,</span>
    <span class="s1">&#39;y&#39;</span><span class="p">,</span> <span class="n">new_x</span><span class="o">**</span><span class="mi">2</span>
<span class="p">)</span>
<span class="n">nonlinear</span><span class="o">.</span><span class="n">scatter</span><span class="p">(</span><span class="s1">&#39;x&#39;</span><span class="p">,</span> <span class="s1">&#39;y&#39;</span><span class="p">,</span> <span class="n">s</span><span class="o">=</span><span class="mi">30</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="s1">&#39;r&#39;</span><span class="p">)</span>
</pre></div>
  948. </div>
  949. </div>
  950. </div>
  951. <div class="output_wrapper">
  952. <div class="output">
  953. <div class="jb_output_wrapper }}">
  954. <div class="output_area">
  955. <div class="output_png output_subarea ">
  956. <img src="../../../images/chapters/15/1/Correlation_51_0.png"
  957. >
  958. </div>
  959. </div>
  960. </div>
  961. </div>
  962. </div>
  963. </div>
  964. </div>
  965. <div class="jb_cell">
  966. <div class="cell border-box-sizing code_cell rendered">
  967. <div class="input">
  968. <div class="inner_cell">
  969. <div class="input_area">
  970. <div class=" highlight hl-ipython3"><pre><span></span><span class="n">correlation</span><span class="p">(</span><span class="n">nonlinear</span><span class="p">,</span> <span class="s1">&#39;x&#39;</span><span class="p">,</span> <span class="s1">&#39;y&#39;</span><span class="p">)</span>
  971. </pre></div>
  972. </div>
  973. </div>
  974. </div>
  975. <div class="output_wrapper">
  976. <div class="output">
  977. <div class="jb_output_wrapper }}">
  978. <div class="output_area">
  979. <div class="output_text output_subarea output_execute_result">
  980. <pre>0.0</pre>
  981. </div>
  982. </div>
  983. </div>
  984. </div>
  985. </div>
  986. </div>
  987. </div>
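<p>A short derivation, offered as a sketch, shows why $r$ comes out to exactly 0 in this example. The values of $x$ are symmetric about 0, so their mean is $\bar{x} = 0$ and $\sum_i x_i = 0$; the cubes $x_i^3$ cancel in pairs for the same reason. Writing $\bar{y}$ for the mean of $y = x^2$ (the bar notation is introduced here for illustration), the sum of the products of deviations, which equals $r$ up to a positive constant, is</p>

$$
\sum_i (x_i - \bar{x})(y_i - \bar{y}) = \sum_i x_i\,(x_i^2 - \bar{y}) = \sum_i x_i^3 - \bar{y}\sum_i x_i = 0 - 0 = 0 .
$$

<p>The positive and negative products of standard units balance exactly, even though $y$ is completely determined by $x$.</p>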
  988. <div class="jb_cell">
  989. <div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
  990. <div class="text_cell_render border-box-sizing rendered_html">
  991. <h3 id="Correlation-is-Affected-by-Outliers">Correlation is Affected by Outliers<a class="anchor-link" href="#Correlation-is-Affected-by-Outliers"> </a></h3><p>Outliers can have a big effect on correlation. Here is an example where a scatter plot for which $r$ is equal to 1 is turned into a plot for which $r$ is equal to 0, by the addition of just one outlying point.</p>
  992. </div>
  993. </div>
  994. </div>
  995. </div>
  996. <div class="jb_cell">
  997. <div class="cell border-box-sizing code_cell rendered">
  998. <div class="input">
  999. <div class="inner_cell">
  1000. <div class="input_area">
  1001. <div class=" highlight hl-ipython3"><pre><span></span><span class="n">line</span> <span class="o">=</span> <span class="n">Table</span><span class="p">()</span><span class="o">.</span><span class="n">with_columns</span><span class="p">(</span>
    <span class="s1">&#39;x&#39;</span><span class="p">,</span> <span class="n">make_array</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="mi">4</span><span class="p">),</span>
    <span class="s1">&#39;y&#39;</span><span class="p">,</span> <span class="n">make_array</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="mi">4</span><span class="p">)</span>
<span class="p">)</span>
<span class="n">line</span><span class="o">.</span><span class="n">scatter</span><span class="p">(</span><span class="s1">&#39;x&#39;</span><span class="p">,</span> <span class="s1">&#39;y&#39;</span><span class="p">,</span> <span class="n">s</span><span class="o">=</span><span class="mi">30</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="s1">&#39;r&#39;</span><span class="p">)</span>
</pre></div>
  1007. </div>
  1008. </div>
  1009. </div>
  1010. <div class="output_wrapper">
  1011. <div class="output">
  1012. <div class="jb_output_wrapper }}">
  1013. <div class="output_area">
  1014. <div class="output_png output_subarea ">
  1015. <img src="../../../images/chapters/15/1/Correlation_54_0.png"
  1016. >
  1017. </div>
  1018. </div>
  1019. </div>
  1020. </div>
  1021. </div>
  1022. </div>
  1023. </div>
  1024. <div class="jb_cell">
  1025. <div class="cell border-box-sizing code_cell rendered">
  1026. <div class="input">
  1027. <div class="inner_cell">
  1028. <div class="input_area">
  1029. <div class=" highlight hl-ipython3"><pre><span></span><span class="n">correlation</span><span class="p">(</span><span class="n">line</span><span class="p">,</span> <span class="s1">&#39;x&#39;</span><span class="p">,</span> <span class="s1">&#39;y&#39;</span><span class="p">)</span>
  1030. </pre></div>
  1031. </div>
  1032. </div>
  1033. </div>
  1034. <div class="output_wrapper">
  1035. <div class="output">
  1036. <div class="jb_output_wrapper }}">
  1037. <div class="output_area">
  1038. <div class="output_text output_subarea output_execute_result">
  1039. <pre>1.0</pre>
  1040. </div>
  1041. </div>
  1042. </div>
  1043. </div>
  1044. </div>
  1045. </div>
  1046. </div>
  1047. <div class="jb_cell">
  1048. <div class="cell border-box-sizing code_cell rendered">
  1049. <div class="input">
  1050. <div class="inner_cell">
  1051. <div class="input_area">
  1052. <div class=" highlight hl-ipython3"><pre><span></span><span class="n">outlier</span> <span class="o">=</span> <span class="n">Table</span><span class="p">()</span><span class="o">.</span><span class="n">with_columns</span><span class="p">(</span>
    <span class="s1">&#39;x&#39;</span><span class="p">,</span> <span class="n">make_array</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="mi">4</span><span class="p">,</span> <span class="mi">5</span><span class="p">),</span>
    <span class="s1">&#39;y&#39;</span><span class="p">,</span> <span class="n">make_array</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="mi">4</span><span class="p">,</span> <span class="mi">0</span><span class="p">)</span>
<span class="p">)</span>
<span class="n">outlier</span><span class="o">.</span><span class="n">scatter</span><span class="p">(</span><span class="s1">&#39;x&#39;</span><span class="p">,</span> <span class="s1">&#39;y&#39;</span><span class="p">,</span> <span class="n">s</span><span class="o">=</span><span class="mi">30</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="s1">&#39;r&#39;</span><span class="p">)</span>
</pre></div>
  1058. </div>
  1059. </div>
  1060. </div>
  1061. <div class="output_wrapper">
  1062. <div class="output">
  1063. <div class="jb_output_wrapper }}">
  1064. <div class="output_area">
  1065. <div class="output_png output_subarea ">
  1066. <img src="../../../images/chapters/15/1/Correlation_56_0.png"
  1067. >
  1068. </div>
  1069. </div>
  1070. </div>
  1071. </div>
  1072. </div>
  1073. </div>
  1074. </div>
  1075. <div class="jb_cell">
  1076. <div class="cell border-box-sizing code_cell rendered">
  1077. <div class="input">
  1078. <div class="inner_cell">
  1079. <div class="input_area">
  1080. <div class=" highlight hl-ipython3"><pre><span></span><span class="n">correlation</span><span class="p">(</span><span class="n">outlier</span><span class="p">,</span> <span class="s1">&#39;x&#39;</span><span class="p">,</span> <span class="s1">&#39;y&#39;</span><span class="p">)</span>
  1081. </pre></div>
  1082. </div>
  1083. </div>
  1084. </div>
  1085. <div class="output_wrapper">
  1086. <div class="output">
  1087. <div class="jb_output_wrapper }}">
  1088. <div class="output_area">
  1089. <div class="output_text output_subarea output_execute_result">
  1090. <pre>0.0</pre>
  1091. </div>
  1092. </div>
  1093. </div>
  1094. </div>
  1095. </div>
  1096. </div>
  1097. </div>
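<p>To see how a single point can cancel a perfect correlation, it helps to look at the individual products of deviations from the mean. The sketch below uses plain NumPy arrays with the same values as the <code>line</code> and <code>outlier</code> tables; the helper <code>r_of</code> is a hypothetical stand-in for <code>correlation</code>, defined here so the block runs on its own.</p>

```python
import numpy as np

def r_of(a, b):
    """Mean of the products of two arrays in standard units."""
    a_su = (a - np.mean(a)) / np.std(a)
    b_su = (b - np.mean(b)) / np.std(b)
    return np.mean(a_su * b_su)

line_x = np.array([1, 2, 3, 4])
line_y = np.array([1, 2, 3, 4])

outlier_x = np.array([1, 2, 3, 4, 5])
outlier_y = np.array([1, 2, 3, 4, 0])

# Products of deviations from the means; their sum has the same sign as r
dev_products = (outlier_x - outlier_x.mean()) * (outlier_y - outlier_y.mean())

# The first four products sum to 4; the outlying point (5, 0) contributes -4
print(dev_products)
print(r_of(line_x, line_y), r_of(outlier_x, outlier_y))
```

<p>The outlier's product is large and negative because the point is far from both means, in opposite directions, and that is enough to offset the other four points.</p>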
  1098. <div class="jb_cell">
  1099. <div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
  1100. <div class="text_cell_render border-box-sizing rendered_html">
  1101. <h3 id="Ecological-Correlations-Should-be-Interpreted-with-Care">Ecological Correlations Should be Interpreted with Care<a class="anchor-link" href="#Ecological-Correlations-Should-be-Interpreted-with-Care"> </a></h3><p>Correlations based on aggregated data can be misleading. As an example, here are data on the Critical Reading and Math SAT scores in 2014. There is one point for each of the 50 states and one for Washington, D.C. The column <code>Participation Rate</code> contains the percent of high school seniors who took the test. The next three columns show the average score in the state on each portion of the test, and the final column is the average of the total scores on the test.</p>
  1102. </div>
  1103. </div>
  1104. </div>
  1105. </div>
  1106. <div class="jb_cell">
  1107. <div class="cell border-box-sizing code_cell rendered">
  1108. <div class="input">
  1109. <div class="inner_cell">
  1110. <div class="input_area">
  1111. <div class=" highlight hl-ipython3"><pre><span></span><span class="n">sat2014</span> <span class="o">=</span> <span class="n">Table</span><span class="o">.</span><span class="n">read_table</span><span class="p">(</span><span class="n">path_data</span> <span class="o">+</span> <span class="s1">&#39;sat2014.csv&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">sort</span><span class="p">(</span><span class="s1">&#39;State&#39;</span><span class="p">)</span>
<span class="n">sat2014</span>
</pre></div>
  1114. </div>
  1115. </div>
  1116. </div>
  1117. <div class="output_wrapper">
  1118. <div class="output">
  1119. <div class="jb_output_wrapper }}">
  1120. <div class="output_area">
  1121. <div class="output_html rendered_html output_subarea output_execute_result">
  1122. <table border="1" class="dataframe">
  1123. <thead>
  1124. <tr>
  1125. <th>State</th> <th>Participation Rate</th> <th>Critical Reading</th> <th>Math</th> <th>Writing</th> <th>Combined</th>
  1126. </tr>
  1127. </thead>
  1128. <tbody>
  1129. <tr>
  1130. <td>Alabama </td> <td>6.7 </td> <td>547 </td> <td>538 </td> <td>532 </td> <td>1617 </td>
  1131. </tr>
  1132. <tr>
  1133. <td>Alaska </td> <td>54.2 </td> <td>507 </td> <td>503 </td> <td>475 </td> <td>1485 </td>
  1134. </tr>
  1135. <tr>
  1136. <td>Arizona </td> <td>36.4 </td> <td>522 </td> <td>525 </td> <td>500 </td> <td>1547 </td>
  1137. </tr>
  1138. <tr>
  1139. <td>Arkansas </td> <td>4.2 </td> <td>573 </td> <td>571 </td> <td>554 </td> <td>1698 </td>
  1140. </tr>
  1141. <tr>
  1142. <td>California </td> <td>60.3 </td> <td>498 </td> <td>510 </td> <td>496 </td> <td>1504 </td>
  1143. </tr>
  1144. <tr>
  1145. <td>Colorado </td> <td>14.3 </td> <td>582 </td> <td>586 </td> <td>567 </td> <td>1735 </td>
  1146. </tr>
  1147. <tr>
  1148. <td>Connecticut </td> <td>88.4 </td> <td>507 </td> <td>510 </td> <td>508 </td> <td>1525 </td>
  1149. </tr>
  1150. <tr>
  1151. <td>Delaware </td> <td>100 </td> <td>456 </td> <td>459 </td> <td>444 </td> <td>1359 </td>
  1152. </tr>
  1153. <tr>
  1154. <td>District of Columbia</td> <td>100 </td> <td>440 </td> <td>438 </td> <td>431 </td> <td>1309 </td>
  1155. </tr>
  1156. <tr>
  1157. <td>Florida </td> <td>72.2 </td> <td>491 </td> <td>485 </td> <td>472 </td> <td>1448 </td>
  1158. </tr>
  1159. </tbody>
  1160. </table>
  1161. <p>... (41 rows omitted)</p>
  1162. </div>
  1163. </div>
  1164. </div>
  1165. </div>
  1166. </div>
  1167. </div>
  1168. </div>
  1169. <div class="jb_cell">
  1170. <div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
  1171. <div class="text_cell_render border-box-sizing rendered_html">
  1172. <p>The scatter diagram of Math scores versus Critical Reading scores is very tightly clustered around a straight line; the correlation is close to 0.985.</p>
  1173. </div>
  1174. </div>
  1175. </div>
  1176. </div>
  1177. <div class="jb_cell">
  1178. <div class="cell border-box-sizing code_cell rendered">
  1179. <div class="input">
  1180. <div class="inner_cell">
  1181. <div class="input_area">
  1182. <div class=" highlight hl-ipython3"><pre><span></span><span class="n">sat2014</span><span class="o">.</span><span class="n">scatter</span><span class="p">(</span><span class="s1">&#39;Critical Reading&#39;</span><span class="p">,</span> <span class="s1">&#39;Math&#39;</span><span class="p">)</span>
  1183. </pre></div>
  1184. </div>
  1185. </div>
  1186. </div>
  1187. <div class="output_wrapper">
  1188. <div class="output">
  1189. <div class="jb_output_wrapper }}">
  1190. <div class="output_area">
  1191. <div class="output_png output_subarea ">
  1192. <img src="../../../images/chapters/15/1/Correlation_61_0.png"
  1193. >
  1194. </div>
  1195. </div>
  1196. </div>
  1197. </div>
  1198. </div>
  1199. </div>
  1200. </div>
  1201. <div class="jb_cell">
  1202. <div class="cell border-box-sizing code_cell rendered">
  1203. <div class="input">
  1204. <div class="inner_cell">
  1205. <div class="input_area">
  1206. <div class=" highlight hl-ipython3"><pre><span></span><span class="n">correlation</span><span class="p">(</span><span class="n">sat2014</span><span class="p">,</span> <span class="s1">&#39;Critical Reading&#39;</span><span class="p">,</span> <span class="s1">&#39;Math&#39;</span><span class="p">)</span>
  1207. </pre></div>
  1208. </div>
  1209. </div>
  1210. </div>
  1211. <div class="output_wrapper">
  1212. <div class="output">
  1213. <div class="jb_output_wrapper }}">
  1214. <div class="output_area">
  1215. <div class="output_text output_subarea output_execute_result">
  1216. <pre>0.9847558411067434</pre>
  1217. </div>
  1218. </div>
  1219. </div>
  1220. </div>
  1221. </div>
  1222. </div>
  1223. </div>
  1224. <div class="jb_cell">
  1225. <div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
  1226. <div class="text_cell_render border-box-sizing rendered_html">
  1227. <p>That's an extremely high correlation. But it's important to note that this does not reflect the strength of the relation between the Math and Critical Reading scores of <em>students</em>.</p>
<p>The data consist of average scores in each state. But states don't take tests – students do. The data in the table have been created by lumping all the students in each state into a single point at the average values of the two variables in that state. But not all students in the state will be at that point, as students vary in their performance. If you plot a point for each student instead of just one for each state, there will be a cloud of points around each point in the figure above. The overall picture will be fuzzier. The correlation between the Math and Critical Reading scores of the students will be <em>lower</em> than the value calculated based on state averages.</p>
  1229. <p>Correlations based on aggregates and averages are called <em>ecological correlations</em> and are frequently reported. As we have just seen, they must be interpreted with care.</p>
  1230. </div>
  1231. </div>
  1232. </div>
  1233. </div>
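<p>The effect of aggregation can be illustrated with a simulation. The sketch below does not use the SAT data: the groups, students, and all numerical settings are simulated and hypothetical, chosen only to mimic the situation described above, in which each group average hides a cloud of individual points.</p>

```python
import numpy as np

rng = np.random.default_rng(1)  # fixed seed so the sketch is reproducible

# Simulated (not real) data: 50 hypothetical groups of 200 students each
n_groups, n_students = 50, 200
group_level = rng.normal(500, 40, n_groups)   # each group's typical score

# Each student has a shared ability component plus subject-specific variation
shape = (n_groups, n_students)
ability = rng.normal(0, 70, shape)
reading = group_level[:, None] + ability + rng.normal(0, 70, shape)
math_score = group_level[:, None] + ability + rng.normal(0, 70, shape)

def corr(a, b):
    """Correlation coefficient of two arrays."""
    return np.corrcoef(a, b)[0, 1]

# One point per student
r_students = corr(reading.ravel(), math_score.ravel())

# One point per group: the ecological correlation of the group averages
r_groups = corr(reading.mean(axis=1), math_score.mean(axis=1))

print(round(r_students, 2), round(r_groups, 2))
```

<p>In this simulated setting the correlation of the group averages is much closer to 1 than the correlation among individual students, because averaging removes most of the student-to-student variation.</p>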
  1234. <div class="jb_cell">
  1235. <div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
  1236. <div class="text_cell_render border-box-sizing rendered_html">
  1237. <h3 id="Serious-or-tongue-in-cheek?">Serious or tongue-in-cheek?<a class="anchor-link" href="#Serious-or-tongue-in-cheek?"> </a></h3><p>In 2012, a <a href="http://www.biostat.jhsph.edu/courses/bio621/misc/Chocolate%20consumption%20cognitive%20function%20and%20nobel%20laurates%20%28NEJM%29.pdf">paper</a> in the respected New England Journal of Medicine examined the relation between chocolate consumption and Nobel Prizes in a group of countries. The <a href="http://blogs.scientificamerican.com/the-curious-wavefunction/chocolate-consumption-and-nobel-prizes-a-bizarre-juxtaposition-if-there-ever-was-one/">Scientific American</a> responded seriously whereas
  1238. <a href="http://www.reuters.com/article/2012/10/10/us-eat-chocolate-win-the-nobel-prize-idUSBRE8991MS20121010#vFdfFkbPVlilSjsB.97">others</a> were more relaxed. You are welcome to make your own decision! The following graph, provided in the paper, should motivate you to go and take a look.</p>
  1239. </div>
  1240. </div>
  1241. </div>
  1242. </div>
  1243. <div class="jb_cell tag_remove_input">
  1244. <div class="cell border-box-sizing code_cell rendered">
  1245. <div class="output_wrapper">
  1246. <div class="output">
  1247. <div class="jb_output_wrapper }}">
  1248. <div class="output_area">
  1249. <div class="output_png output_subarea output_execute_result">
  1250. <img src="../../../images/chapters/15/1/Correlation_65_0.png"
  1251. >
  1252. </div>
  1253. </div>
  1254. </div>
  1255. </div>
  1256. </div>
  1257. </div>
  1258. </div>