  1. ---
  2. redirect_from:
  3. - "/chapters/10/sampling-and-empirical-distributions"
  4. interact_link: content/chapters/10/Sampling_and_Empirical_Distributions.ipynb
  5. kernel_name: Python [Root]
  6. has_widgets: false
  7. title: |-
  8. Sampling and Empirical Distributions
  9. prev_page:
  10. url: /chapters/09/5/Finding_Probabilities.html
  11. title: |-
  12. Finding Probabilities
  13. next_page:
  14. url: /chapters/10/1/Empirical_Distributions.html
  15. title: |-
  16. Empirical Distributions
  17. comment: "***PROGRAMMATICALLY GENERATED, DO NOT EDIT. SEE ORIGINAL FILES IN /content***"
  18. ---
  19. <div class="jb_cell tag_remove_input">
  20. <div class="cell border-box-sizing code_cell rendered">
  21. </div>
  22. </div>
  23. <div class="jb_cell">
  24. <div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
  25. <div class="text_cell_render border-box-sizing rendered_html">
  26. <h3 id="Sampling-and-Empirical-Distributions">Sampling and Empirical Distributions<a class="anchor-link" href="#Sampling-and-Empirical-Distributions"> </a></h3><p>An important part of data science consists of drawing conclusions based on the data in random samples. To interpret their results correctly, data scientists first have to understand exactly what random samples are.</p>
  27. <p>In this chapter we will take a more careful look at sampling, with special attention to the properties of large random samples.</p>
  28. <p>Let's start by drawing some samples. Our examples are based on the <code><a href="imdb.csv">top_movies.csv</a></code> data set.</p>
  29. </div>
  30. </div>
  31. </div>
  32. </div>
  33. <div class="jb_cell">
  34. <div class="cell border-box-sizing code_cell rendered">
  35. <div class="input">
  36. <div class="inner_cell">
  37. <div class="input_area">
  38. <div class=" highlight hl-ipython3"><pre><span></span><span class="n">top1</span> <span class="o">=</span> <span class="n">Table</span><span class="o">.</span><span class="n">read_table</span><span class="p">(</span><span class="n">path_data</span> <span class="o">+</span> <span class="s1">&#39;top_movies.csv&#39;</span><span class="p">)</span>
  39. <span class="n">top2</span> <span class="o">=</span> <span class="n">top1</span><span class="o">.</span><span class="n">with_column</span><span class="p">(</span><span class="s1">&#39;Row Index&#39;</span><span class="p">,</span> <span class="n">np</span><span class="o">.</span><span class="n">arange</span><span class="p">(</span><span class="n">top1</span><span class="o">.</span><span class="n">num_rows</span><span class="p">))</span>
  40. <span class="n">top</span> <span class="o">=</span> <span class="n">top2</span><span class="o">.</span><span class="n">move_to_start</span><span class="p">(</span><span class="s1">&#39;Row Index&#39;</span><span class="p">)</span>
  41. <span class="n">top</span><span class="o">.</span><span class="n">set_format</span><span class="p">(</span><span class="n">make_array</span><span class="p">(</span><span class="mi">3</span><span class="p">,</span> <span class="mi">4</span><span class="p">),</span> <span class="n">NumberFormatter</span><span class="p">)</span>
  42. </pre></div>
  43. </div>
  44. </div>
  45. </div>
  46. <div class="output_wrapper">
  47. <div class="output">
  48. <div class="jb_output_wrapper }}">
  49. <div class="output_area">
  50. <div class="output_html rendered_html output_subarea output_execute_result">
  51. <table border="1" class="dataframe">
  52. <thead>
  53. <tr>
  54. <th>Row Index</th> <th>Title</th> <th>Studio</th> <th>Gross</th> <th>Gross (Adjusted)</th> <th>Year</th>
  55. </tr>
  56. </thead>
  57. <tbody>
  58. <tr>
  59. <td>0 </td> <td>Star Wars: The Force Awakens </td> <td>Buena Vista (Disney)</td> <td>906,723,418</td> <td>906,723,400 </td> <td>2015</td>
  60. </tr>
  61. <tr>
  62. <td>1 </td> <td>Avatar </td> <td>Fox </td> <td>760,507,625</td> <td>846,120,800 </td> <td>2009</td>
  63. </tr>
  64. <tr>
  65. <td>2 </td> <td>Titanic </td> <td>Paramount </td> <td>658,672,302</td> <td>1,178,627,900 </td> <td>1997</td>
  66. </tr>
  67. <tr>
  68. <td>3 </td> <td>Jurassic World </td> <td>Universal </td> <td>652,270,625</td> <td>687,728,000 </td> <td>2015</td>
  69. </tr>
  70. <tr>
  71. <td>4 </td> <td>Marvel's The Avengers </td> <td>Buena Vista (Disney)</td> <td>623,357,910</td> <td>668,866,600 </td> <td>2012</td>
  72. </tr>
  73. <tr>
  74. <td>5 </td> <td>The Dark Knight </td> <td>Warner Bros. </td> <td>534,858,444</td> <td>647,761,600 </td> <td>2008</td>
  75. </tr>
  76. <tr>
  77. <td>6 </td> <td>Star Wars: Episode I - The Phantom Menace</td> <td>Fox </td> <td>474,544,677</td> <td>785,715,000 </td> <td>1999</td>
  78. </tr>
  79. <tr>
  80. <td>7 </td> <td>Star Wars </td> <td>Fox </td> <td>460,998,007</td> <td>1,549,640,500 </td> <td>1977</td>
  81. </tr>
  82. <tr>
  83. <td>8 </td> <td>Avengers: Age of Ultron </td> <td>Buena Vista (Disney)</td> <td>459,005,868</td> <td>465,684,200 </td> <td>2015</td>
  84. </tr>
  85. <tr>
  86. <td>9 </td> <td>The Dark Knight Rises </td> <td>Warner Bros. </td> <td>448,139,099</td> <td>500,961,700 </td> <td>2012</td>
  87. </tr>
  88. </tbody>
  89. </table>
  90. <p>... (190 rows omitted)</p>
  91. </div>
  92. </div>
  93. </div>
  94. </div>
  95. </div>
  96. </div>
  97. </div>
  98. <div class="jb_cell">
  99. <div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
  100. <div class="text_cell_render border-box-sizing rendered_html">
  101. <h3 id="Sampling-Rows-of-a-Table">Sampling Rows of a Table<a class="anchor-link" href="#Sampling-Rows-of-a-Table"> </a></h3><p>Each row of a data table represents an individual; in <code>top</code>, each individual is a movie. Sampling individuals can thus be achieved by sampling the rows of a table.</p>
  102. <p>The contents of a row are the values of different variables measured on the same individual. So the contents of the sampled rows form samples of values of each of the variables.</p>
  103. </div>
  104. </div>
  105. </div>
  106. </div>
  107. <div class="jb_cell">
  108. <div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
  109. <div class="text_cell_render border-box-sizing rendered_html">
  110. <h3 id="Deterministic-Samples">Deterministic Samples<a class="anchor-link" href="#Deterministic-Samples"> </a></h3><p>When you simply specify which elements of a set you want to choose, without any chances involved, you create a <em>deterministic sample</em>.</p>
  111. <p>You have done this many times, for example by using <code>take</code>:</p>
  112. </div>
  113. </div>
  114. </div>
  115. </div>
  116. <div class="jb_cell">
  117. <div class="cell border-box-sizing code_cell rendered">
  118. <div class="input">
  119. <div class="inner_cell">
  120. <div class="input_area">
  121. <div class=" highlight hl-ipython3"><pre><span></span><span class="n">top</span><span class="o">.</span><span class="n">take</span><span class="p">(</span><span class="n">make_array</span><span class="p">(</span><span class="mi">3</span><span class="p">,</span> <span class="mi">18</span><span class="p">,</span> <span class="mi">100</span><span class="p">))</span>
  122. </pre></div>
  123. </div>
  124. </div>
  125. </div>
  126. <div class="output_wrapper">
  127. <div class="output">
  128. <div class="jb_output_wrapper }}">
  129. <div class="output_area">
  130. <div class="output_html rendered_html output_subarea output_execute_result">
  131. <table border="1" class="dataframe">
  132. <thead>
  133. <tr>
  134. <th>Row Index</th> <th>Title</th> <th>Studio</th> <th>Gross</th> <th>Gross (Adjusted)</th> <th>Year</th>
  135. </tr>
  136. </thead>
  137. <tbody>
  138. <tr>
  139. <td>3 </td> <td>Jurassic World </td> <td>Universal</td> <td>652,270,625</td> <td>687,728,000 </td> <td>2015</td>
  140. </tr>
  141. <tr>
  142. <td>18 </td> <td>Spider-Man </td> <td>Sony </td> <td>403,706,375</td> <td>604,517,300 </td> <td>2002</td>
  143. </tr>
  144. <tr>
  145. <td>100 </td> <td>Gone with the Wind</td> <td>MGM </td> <td>198,676,459</td> <td>1,757,788,200 </td> <td>1939</td>
  146. </tr>
  147. </tbody>
  148. </table>
  149. </div>
  150. </div>
  151. </div>
  152. </div>
  153. </div>
  154. </div>
  155. </div>
  156. <div class="jb_cell">
  157. <div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
  158. <div class="text_cell_render border-box-sizing rendered_html">
  159. <p>You have also used <code>where</code>:</p>
  160. </div>
  161. </div>
  162. </div>
  163. </div>
  164. <div class="jb_cell">
  165. <div class="cell border-box-sizing code_cell rendered">
  166. <div class="input">
  167. <div class="inner_cell">
  168. <div class="input_area">
  169. <div class=" highlight hl-ipython3"><pre><span></span><span class="n">top</span><span class="o">.</span><span class="n">where</span><span class="p">(</span><span class="s1">&#39;Title&#39;</span><span class="p">,</span> <span class="n">are</span><span class="o">.</span><span class="n">containing</span><span class="p">(</span><span class="s1">&#39;Harry Potter&#39;</span><span class="p">))</span>
  170. </pre></div>
  171. </div>
  172. </div>
  173. </div>
  174. <div class="output_wrapper">
  175. <div class="output">
  176. <div class="jb_output_wrapper }}">
  177. <div class="output_area">
  178. <div class="output_html rendered_html output_subarea output_execute_result">
  179. <table border="1" class="dataframe">
  180. <thead>
  181. <tr>
  182. <th>Row Index</th> <th>Title</th> <th>Studio</th> <th>Gross</th> <th>Gross (Adjusted)</th> <th>Year</th>
  183. </tr>
  184. </thead>
  185. <tbody>
  186. <tr>
  187. <td>22 </td> <td>Harry Potter and the Deathly Hallows Part 2</td> <td>Warner Bros.</td> <td>381,011,219</td> <td>417,512,200 </td> <td>2011</td>
  188. </tr>
  189. <tr>
  190. <td>43 </td> <td>Harry Potter and the Sorcerer's Stone </td> <td>Warner Bros.</td> <td>317,575,550</td> <td>486,442,900 </td> <td>2001</td>
  191. </tr>
  192. <tr>
  193. <td>54 </td> <td>Harry Potter and the Half-Blood Prince </td> <td>Warner Bros.</td> <td>301,959,197</td> <td>352,098,800 </td> <td>2009</td>
  194. </tr>
  195. <tr>
  196. <td>59 </td> <td>Harry Potter and the Order of the Phoenix </td> <td>Warner Bros.</td> <td>292,004,738</td> <td>369,250,200 </td> <td>2007</td>
  197. </tr>
  198. <tr>
  199. <td>62 </td> <td>Harry Potter and the Goblet of Fire </td> <td>Warner Bros.</td> <td>290,013,036</td> <td>393,024,800 </td> <td>2005</td>
  200. </tr>
  201. <tr>
  202. <td>69 </td> <td>Harry Potter and the Chamber of Secrets </td> <td>Warner Bros.</td> <td>261,988,482</td> <td>390,768,100 </td> <td>2002</td>
  203. </tr>
  204. <tr>
  205. <td>76 </td> <td>Harry Potter and the Prisoner of Azkaban </td> <td>Warner Bros.</td> <td>249,541,069</td> <td>349,598,600 </td> <td>2004</td>
  206. </tr>
  207. </tbody>
  208. </table>
  209. </div>
  210. </div>
  211. </div>
  212. </div>
  213. </div>
  214. </div>
  215. </div>
  216. <div class="jb_cell">
  217. <div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
  218. <div class="text_cell_render border-box-sizing rendered_html">
  219. <p>While these are samples, they are not random samples. They don't involve chance.</p>
  220. </div>
  221. </div>
  222. </div>
  223. </div>
  224. <div class="jb_cell">
  225. <div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
  226. <div class="text_cell_render border-box-sizing rendered_html">
  227. <h2 id="Probability-Samples">Probability Samples<a class="anchor-link" href="#Probability-Samples"> </a></h2>
  228. </div>
  229. </div>
  230. </div>
  231. </div>
  232. <div class="jb_cell">
  233. <div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
  234. <div class="text_cell_render border-box-sizing rendered_html">
  235. <p>For describing random samples, some terminology will be helpful.</p>
  236. <p>A <em>population</em> is the set of all elements from which a sample will be drawn.</p>
  237. <p>A <em>probability sample</em> is one for which it is possible to calculate, before the sample is drawn, the chance with which any subset of elements will enter the sample.</p>
  238. <p>In a probability sample, all elements need not have the same chance of being chosen.</p>
  239. <h3 id="A-Random-Sampling-Scheme">A Random Sampling Scheme<a class="anchor-link" href="#A-Random-Sampling-Scheme"> </a></h3><p>For example, suppose you choose two people from a population that consists of three people A, B, and C, according to the following scheme:</p>
  240. <ul>
  241. <li>Person A is chosen with probability 1.</li>
  242. <li>One of Persons B or C is chosen according to the toss of a coin: if the coin lands heads, you choose B, and if it lands tails you choose C.</li>
  243. </ul>
  244. <p>This is a probability sample of size 2. Here are the chances of entry for all non-empty subsets:</p>
  245. <pre><code>A: 1
  246. B: 1/2
  247. C: 1/2
  248. AB: 1/2
  249. AC: 1/2
  250. BC: 0
  251. ABC: 0
  252. </code></pre>
  253. <p>Person A has a higher chance of being selected than Persons B or C; indeed, Person A is certain to be selected. Since these differences are known and quantified, they can be taken into account when working with the sample.</p>
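<p>The chances listed above can be checked by enumerating the two possible outcomes of the coin toss. The following is a minimal sketch using only the Python standard library; the names <code>samples</code>, <code>chance_of_entry</code>, and <code>chances</code> are illustrative and not part of the course library.</p>

```python
from fractions import Fraction
from itertools import combinations

# Each possible sample under the scheme, with its probability:
# A is always chosen; a fair coin picks B (heads) or C (tails).
samples = {
    frozenset("AB"): Fraction(1, 2),  # coin lands heads
    frozenset("AC"): Fraction(1, 2),  # coin lands tails
}

def chance_of_entry(subset):
    """Chance that every element of `subset` is in the sample."""
    wanted = frozenset(subset)
    return sum((p for s, p in samples.items() if wanted <= s), Fraction(0))

population = "ABC"
chances = {}
for size in range(1, len(population) + 1):
    for subset in combinations(population, size):
        chances["".join(subset)] = chance_of_entry(subset)

for name, p in chances.items():
    print(f"{name}: {p}")
```

<p>Working with exact fractions rather than simulation keeps the result free of random variation, which is feasible here because the scheme has only two possible outcomes.</p>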
  254. </div>
  255. </div>
  256. </div>
  257. </div>
  258. <div class="jb_cell">
  259. <div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
  260. <div class="text_cell_render border-box-sizing rendered_html">
  261. <h3 id="A-Systematic-Sample">A Systematic Sample<a class="anchor-link" href="#A-Systematic-Sample"> </a></h3><p>Imagine all the elements of the population listed in a sequence. One method of sampling starts by choosing a random position early in the list, and then evenly spaced positions after that. The sample consists of the elements in those positions. Such a sample is called a <em>systematic sample</em>.</p>
  262. <p>Here we will choose a systematic sample of the rows of <code>top</code>. We will start by picking one of the first 10 rows at random, and then we will pick every 10th row after that.</p>
  263. </div>
  264. </div>
  265. </div>
  266. </div>
  267. <div class="jb_cell">
  268. <div class="cell border-box-sizing code_cell rendered">
  269. <div class="input">
  270. <div class="inner_cell">
  271. <div class="input_area">
  272. <div class=" highlight hl-ipython3"><pre><span></span><span class="sd">&quot;&quot;&quot;Choose a random start among rows 0 through 9;</span>
  273. <span class="sd">then take every 10th row.&quot;&quot;&quot;</span>
  274. <span class="n">start</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">choice</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">arange</span><span class="p">(</span><span class="mi">10</span><span class="p">))</span>
  275. <span class="n">top</span><span class="o">.</span><span class="n">take</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">arange</span><span class="p">(</span><span class="n">start</span><span class="p">,</span> <span class="n">top</span><span class="o">.</span><span class="n">num_rows</span><span class="p">,</span> <span class="mi">10</span><span class="p">))</span>
  276. </pre></div>
  277. </div>
  278. </div>
  279. </div>
  280. <div class="output_wrapper">
  281. <div class="output">
  282. <div class="jb_output_wrapper }}">
  283. <div class="output_area">
  284. <div class="output_html rendered_html output_subarea output_execute_result">
  285. <table border="1" class="dataframe">
  286. <thead>
  287. <tr>
  288. <th>Row Index</th> <th>Title</th> <th>Studio</th> <th>Gross</th> <th>Gross (Adjusted)</th> <th>Year</th>
  289. </tr>
  290. </thead>
  291. <tbody>
  292. <tr>
  293. <td>2 </td> <td>Titanic </td> <td>Paramount </td> <td>658,672,302</td> <td>1,178,627,900 </td> <td>1997</td>
  294. </tr>
  295. <tr>
  296. <td>12 </td> <td>The Hunger Games: Catching Fire </td> <td>Lionsgate </td> <td>424,668,047</td> <td>444,697,400 </td> <td>2013</td>
  297. </tr>
  298. <tr>
  299. <td>22 </td> <td>Harry Potter and the Deathly Hallows Part 2</td> <td>Warner Bros.</td> <td>381,011,219</td> <td>417,512,200 </td> <td>2011</td>
  300. </tr>
  301. <tr>
  302. <td>32 </td> <td>American Sniper </td> <td>Warner Bros.</td> <td>350,126,372</td> <td>374,796,000 </td> <td>2014</td>
  303. </tr>
  304. <tr>
  305. <td>42 </td> <td>Iron Man </td> <td>Paramount </td> <td>318,412,101</td> <td>385,808,100 </td> <td>2008</td>
  306. </tr>
  307. <tr>
  308. <td>52 </td> <td>Skyfall </td> <td>Sony </td> <td>304,360,277</td> <td>329,225,400 </td> <td>2012</td>
  309. </tr>
  310. <tr>
  311. <td>62 </td> <td>Harry Potter and the Goblet of Fire </td> <td>Warner Bros.</td> <td>290,013,036</td> <td>393,024,800 </td> <td>2005</td>
  312. </tr>
  313. <tr>
  314. <td>72 </td> <td>Jaws </td> <td>Universal </td> <td>260,000,000</td> <td>1,114,285,700 </td> <td>1975</td>
  315. </tr>
  316. <tr>
  317. <td>82 </td> <td>Twister </td> <td>Warner Bros.</td> <td>241,721,524</td> <td>475,786,700 </td> <td>1996</td>
  318. </tr>
  319. <tr>
  320. <td>92 </td> <td>Ghost </td> <td>Paramount </td> <td>217,631,306</td> <td>447,747,400 </td> <td>1990</td>
  321. </tr>
  322. </tbody>
  323. </table>
  324. <p>... (10 rows omitted)</p>
  325. </div>
  326. </div>
  327. </div>
  328. </div>
  329. </div>
  330. </div>
  331. </div>
  332. <div class="jb_cell">
  333. <div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
  334. <div class="text_cell_render border-box-sizing rendered_html">
  335. <p>Run the cell a few times to see how the output varies.</p>
  336. <p>This systematic sample is a probability sample. In this scheme, all rows have chance $1/10$ of being chosen. For example, Row 23 is chosen if and only if Row 3 is chosen, and the chance of that is $1/10$.</p>
  337. <p>But not all subsets have the same chance of being chosen. Because the selected rows are evenly spaced, most subsets of rows have no chance of being chosen. The only subsets that are possible are those that consist of rows all separated by multiples of 10. Any of those subsets is selected with chance $1/10$. Other subsets, like the subset containing the first 11 rows of the table, are selected with chance 0.</p>
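<p>These claims can be verified by listing every sample the scheme can produce. The sketch below assumes the table has 200 rows (the 10 displayed plus the 190 omitted) and works with row indices only, so it does not need the table itself; <code>possible_samples</code> and <code>chance_all_chosen</code> are illustrative names.</p>

```python
from fractions import Fraction

num_rows = 200  # assumed table size: 10 rows shown + 190 omitted
step = 10

# One possible sample per starting row; each start is equally likely.
possible_samples = [set(range(start, num_rows, step)) for start in range(step)]
p_each = Fraction(1, len(possible_samples))

def chance_all_chosen(rows):
    """Chance that every row in `rows` lands in the systematic sample."""
    rows = set(rows)
    return sum((p_each for s in possible_samples if rows <= s), Fraction(0))

# Collect the distinct chances of entry across all individual rows.
row_chances = {chance_all_chosen([r]) for r in range(num_rows)}
print(row_chances)

# Rows 3 and 23 are always chosen together.
print(chance_all_chosen([3, 23]))

# The first 11 rows can never all be chosen.
print(chance_all_chosen(range(11)))
```

<p>The set <code>row_chances</code> contains a single value because every row belongs to exactly one of the ten equally likely samples.</p>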
  338. </div>
  339. </div>
  340. </div>
  341. </div>
  342. <div class="jb_cell">
  343. <div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
  344. <div class="text_cell_render border-box-sizing rendered_html">
  345. <h3 id="Random-Samples-Drawn-With-or-Without-Replacement">Random Samples Drawn With or Without Replacement<a class="anchor-link" href="#Random-Samples-Drawn-With-or-Without-Replacement"> </a></h3><p>In this course, we will mostly deal with the two most straightforward methods of sampling.</p>
  346. <p>The first is random sampling with replacement, which (as we have seen earlier) is the default behavior of <code>np.random.choice</code> when it samples from an array.</p>
  347. <p>The other, called a "simple random sample", is a sample drawn at random <em>without</em> replacement. Sampled individuals are not replaced in the population before the next individual is drawn. This is the kind of sampling that happens when you deal a hand from a deck of cards, for example.</p>
  348. <p>In this chapter, we will use simulation to study the behavior of large samples drawn at random with or without replacement.</p>
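<p>The difference between the two methods can be seen on a small scale. The sketch below uses Python's standard <code>random</code> module as a stand-in for <code>np.random.choice</code>, and the numbered <code>deck</code> is a hypothetical population chosen only for illustration.</p>

```python
import random

deck = list(range(52))  # hypothetical population: cards numbered 0 through 51

# With replacement: every draw is made from the full deck, so repeats can occur.
with_replacement = random.choices(deck, k=5)

# Without replacement (a simple random sample): no card can appear twice.
hand = random.sample(deck, 5)

print("with replacement:   ", with_replacement)
print("without replacement:", hand)
```

<p>With NumPy, the same distinction is controlled by the <code>replace</code> argument of <code>np.random.choice</code>: it defaults to <code>True</code>, and passing <code>replace=False</code> draws without replacement.</p>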
  349. </div>
  350. </div>
  351. </div>
  352. </div>
  353. <div class="jb_cell">
  354. <div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
  355. <div class="text_cell_render border-box-sizing rendered_html">
  356. <p>Drawing a random sample requires care and precision. It is not haphazard, even though that is a colloquial meaning of the word "random". If you stand at a street corner and take as your sample the first ten people who pass by, you might think you're sampling at random because you didn't choose who walked by. But it's not a random sample – it's a <em>sample of convenience</em>. You didn't know ahead of time the probability of each person entering the sample; perhaps you hadn't even specified exactly who was in the population.</p>
  357. </div>
  358. </div>
  359. </div>
  360. </div>