  1. ---
  2. redirect_from:
  3. - "/chapters/10/2/sampling-from-a-population"
  4. interact_link: content/chapters/10/2/Sampling_from_a_Population.ipynb
  5. kernel_name: python3
  6. has_widgets: false
  7. title: |-
  8. Sampling from a Population
  9. prev_page:
  10. url: /chapters/10/1/Empirical_Distributions.html
  11. title: |-
  12. Empirical Distributions
  13. next_page:
  14. url: /chapters/10/3/Empirical_Distribution_of_a_Statistic.html
  15. title: |-
  16. Empirical Distribution of a Statistic
  17. comment: "***PROGRAMMATICALLY GENERATED, DO NOT EDIT. SEE ORIGINAL FILES IN /content***"
  18. ---
  19. <div class="jb_cell tag_remove_input">
  20. <div class="cell border-box-sizing code_cell rendered">
  21. </div>
  22. </div>
  23. <div class="jb_cell">
  24. <div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
  25. <div class="text_cell_render border-box-sizing rendered_html">
  26. <h3 id="Sampling-from-a-Population">Sampling from a Population<a class="anchor-link" href="#Sampling-from-a-Population"> </a></h3><p>The law of averages also holds when the random sample is drawn from individuals in a large population.</p>
  27. <p>As an example, we will study a population of flight delay times. The table <code>united</code> contains data for United Airlines domestic flights departing from San Francisco in the summer of 2015. The data are made publicly available by the <a href="http://www.transtats.bts.gov/Fields.asp?Table_ID=293">Bureau of Transportation Statistics</a> in the United States Department of Transportation.</p>
  28. <p>There are 13,825 rows, each corresponding to a flight. The columns are the date of the flight, the flight number, the destination airport code, and the departure delay time in minutes. Some delay times are negative; those flights left early.</p>
  29. </div>
  30. </div>
  31. </div>
  32. </div>
  33. <div class="jb_cell">
  34. <div class="cell border-box-sizing code_cell rendered">
  35. <div class="input">
  36. <div class="inner_cell">
  37. <div class="input_area">
  38. <div class=" highlight hl-ipython3"><pre><span></span><span class="n">united</span> <span class="o">=</span> <span class="n">Table</span><span class="o">.</span><span class="n">read_table</span><span class="p">(</span><span class="n">path_data</span> <span class="o">+</span> <span class="s1">&#39;united_summer2015.csv&#39;</span><span class="p">)</span>
  39. <span class="n">united</span>
  40. </pre></div>
  41. </div>
  42. </div>
  43. </div>
  44. <div class="output_wrapper">
  45. <div class="output">
  46. <div class="jb_output_wrapper }}">
  47. <div class="output_area">
  48. <div class="output_html rendered_html output_subarea output_execute_result">
  49. <table border="1" class="dataframe">
  50. <thead>
  51. <tr>
  52. <th>Date</th> <th>Flight Number</th> <th>Destination</th> <th>Delay</th>
  53. </tr>
  54. </thead>
  55. <tbody>
  56. <tr>
  57. <td>6/1/15</td> <td>73 </td> <td>HNL </td> <td>257 </td>
  58. </tr>
  59. <tr>
  60. <td>6/1/15</td> <td>217 </td> <td>EWR </td> <td>28 </td>
  61. </tr>
  62. <tr>
  63. <td>6/1/15</td> <td>237 </td> <td>STL </td> <td>-3 </td>
  64. </tr>
  65. <tr>
  66. <td>6/1/15</td> <td>250 </td> <td>SAN </td> <td>0 </td>
  67. </tr>
  68. <tr>
  69. <td>6/1/15</td> <td>267 </td> <td>PHL </td> <td>64 </td>
  70. </tr>
  71. <tr>
  72. <td>6/1/15</td> <td>273 </td> <td>SEA </td> <td>-6 </td>
  73. </tr>
  74. <tr>
  75. <td>6/1/15</td> <td>278 </td> <td>SEA </td> <td>-8 </td>
  76. </tr>
  77. <tr>
  78. <td>6/1/15</td> <td>292 </td> <td>EWR </td> <td>12 </td>
  79. </tr>
  80. <tr>
  81. <td>6/1/15</td> <td>300 </td> <td>HNL </td> <td>20 </td>
  82. </tr>
  83. <tr>
  84. <td>6/1/15</td> <td>317 </td> <td>IND </td> <td>-10 </td>
  85. </tr>
  86. </tbody>
  87. </table>
  88. <p>... (13815 rows omitted)</p>
  89. </div>
  90. </div>
  91. </div>
  92. </div>
  93. </div>
  94. </div>
  95. </div>
  96. <div class="jb_cell">
  97. <div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
  98. <div class="text_cell_render border-box-sizing rendered_html">
  99. <p>One flight departed 16 minutes early, and one was 580 minutes late. The other delay times were almost all between -10 minutes and 200 minutes, as the histogram below shows.</p>
  100. </div>
  101. </div>
  102. </div>
  103. </div>
  104. <div class="jb_cell">
  105. <div class="cell border-box-sizing code_cell rendered">
  106. <div class="input">
  107. <div class="inner_cell">
  108. <div class="input_area">
  109. <div class=" highlight hl-ipython3"><pre><span></span><span class="n">united</span><span class="o">.</span><span class="n">column</span><span class="p">(</span><span class="s1">&#39;Delay&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">min</span><span class="p">()</span>
  110. </pre></div>
  111. </div>
  112. </div>
  113. </div>
  114. <div class="output_wrapper">
  115. <div class="output">
  116. <div class="jb_output_wrapper }}">
  117. <div class="output_area">
  118. <div class="output_text output_subarea output_execute_result">
  119. <pre>-16</pre>
  120. </div>
  121. </div>
  122. </div>
  123. </div>
  124. </div>
  125. </div>
  126. </div>
  127. <div class="jb_cell">
  128. <div class="cell border-box-sizing code_cell rendered">
  129. <div class="input">
  130. <div class="inner_cell">
  131. <div class="input_area">
  132. <div class=" highlight hl-ipython3"><pre><span></span><span class="n">united</span><span class="o">.</span><span class="n">column</span><span class="p">(</span><span class="s1">&#39;Delay&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">max</span><span class="p">()</span>
  133. </pre></div>
  134. </div>
  135. </div>
  136. </div>
  137. <div class="output_wrapper">
  138. <div class="output">
  139. <div class="jb_output_wrapper }}">
  140. <div class="output_area">
  141. <div class="output_text output_subarea output_execute_result">
  142. <pre>580</pre>
  143. </div>
  144. </div>
  145. </div>
  146. </div>
  147. </div>
  148. </div>
  149. </div>
  150. <div class="jb_cell">
  151. <div class="cell border-box-sizing code_cell rendered">
  152. <div class="input">
  153. <div class="inner_cell">
  154. <div class="input_area">
  155. <div class=" highlight hl-ipython3"><pre><span></span><span class="n">delay_bins</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">arange</span><span class="p">(</span><span class="o">-</span><span class="mi">20</span><span class="p">,</span> <span class="mi">301</span><span class="p">,</span> <span class="mi">10</span><span class="p">),</span> <span class="mi">600</span><span class="p">)</span>
  156. <span class="n">united</span><span class="o">.</span><span class="n">hist</span><span class="p">(</span><span class="s1">&#39;Delay&#39;</span><span class="p">,</span> <span class="n">bins</span> <span class="o">=</span> <span class="n">delay_bins</span><span class="p">,</span> <span class="n">unit</span> <span class="o">=</span> <span class="s1">&#39;minute&#39;</span><span class="p">)</span>
  157. </pre></div>
  158. </div>
  159. </div>
  160. </div>
  161. <div class="output_wrapper">
  162. <div class="output">
  163. <div class="jb_output_wrapper }}">
  164. <div class="output_area">
  165. <div class="output_png output_subarea ">
  166. <img src="../../../images/chapters/10/2/Sampling_from_a_Population_6_0.png"
  167. >
  168. </div>
  169. </div>
  170. </div>
  171. </div>
  172. </div>
  173. </div>
  174. </div>
  175. <div class="jb_cell">
  176. <div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
  177. <div class="text_cell_render border-box-sizing rendered_html">
  178. <p>For the purposes of this section, it is enough to zoom in on the bulk of the data and ignore the 0.8% of flights that had delays of more than 200 minutes. This restriction is just for visual convenience; the table still retains all the data.</p>
  179. </div>
  180. </div>
  181. </div>
  182. </div>
  183. <div class="jb_cell">
  184. <div class="cell border-box-sizing code_cell rendered">
  185. <div class="input">
  186. <div class="inner_cell">
  187. <div class="input_area">
  188. <div class=" highlight hl-ipython3"><pre><span></span><span class="n">united</span><span class="o">.</span><span class="n">where</span><span class="p">(</span><span class="s1">&#39;Delay&#39;</span><span class="p">,</span> <span class="n">are</span><span class="o">.</span><span class="n">above</span><span class="p">(</span><span class="mi">200</span><span class="p">))</span><span class="o">.</span><span class="n">num_rows</span><span class="o">/</span><span class="n">united</span><span class="o">.</span><span class="n">num_rows</span>
  189. </pre></div>
  190. </div>
  191. </div>
  192. </div>
  193. <div class="output_wrapper">
  194. <div class="output">
  195. <div class="jb_output_wrapper }}">
  196. <div class="output_area">
  197. <div class="output_text output_subarea output_execute_result">
  198. <pre>0.008390596745027125</pre>
  199. </div>
  200. </div>
  201. </div>
  202. </div>
  203. </div>
  204. </div>
  205. </div>
  206. <div class="jb_cell">
  207. <div class="cell border-box-sizing code_cell rendered">
  208. <div class="input">
  209. <div class="inner_cell">
  210. <div class="input_area">
  211. <div class=" highlight hl-ipython3"><pre><span></span><span class="n">delay_bins</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">arange</span><span class="p">(</span><span class="o">-</span><span class="mi">20</span><span class="p">,</span> <span class="mi">201</span><span class="p">,</span> <span class="mi">10</span><span class="p">)</span>
  212. <span class="n">united</span><span class="o">.</span><span class="n">hist</span><span class="p">(</span><span class="s1">&#39;Delay&#39;</span><span class="p">,</span> <span class="n">bins</span> <span class="o">=</span> <span class="n">delay_bins</span><span class="p">,</span> <span class="n">unit</span> <span class="o">=</span> <span class="s1">&#39;minute&#39;</span><span class="p">)</span>
  213. </pre></div>
  214. </div>
  215. </div>
  216. </div>
  217. <div class="output_wrapper">
  218. <div class="output">
  219. <div class="jb_output_wrapper }}">
  220. <div class="output_area">
  221. <div class="output_png output_subarea ">
  222. <img src="../../../images/chapters/10/2/Sampling_from_a_Population_9_0.png"
  223. >
  224. </div>
  225. </div>
  226. </div>
  227. </div>
  228. </div>
  229. </div>
  230. </div>
  231. <div class="jb_cell">
  232. <div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
  233. <div class="text_cell_render border-box-sizing rendered_html">
  234. <p>The height of the [0, 10) bar is just under 3% per minute, which means that just under 30% of the flights had delays between 0 and 10 minutes. That is confirmed by counting rows:</p>
  235. </div>
  236. </div>
  237. </div>
  238. </div>
  239. <div class="jb_cell">
  240. <div class="cell border-box-sizing code_cell rendered">
  241. <div class="input">
  242. <div class="inner_cell">
  243. <div class="input_area">
  244. <div class=" highlight hl-ipython3"><pre><span></span><span class="n">united</span><span class="o">.</span><span class="n">where</span><span class="p">(</span><span class="s1">&#39;Delay&#39;</span><span class="p">,</span> <span class="n">are</span><span class="o">.</span><span class="n">between</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">10</span><span class="p">))</span><span class="o">.</span><span class="n">num_rows</span><span class="o">/</span><span class="n">united</span><span class="o">.</span><span class="n">num_rows</span>
  245. </pre></div>
  246. </div>
  247. </div>
  248. </div>
  249. <div class="output_wrapper">
  250. <div class="output">
  251. <div class="jb_output_wrapper }}">
  252. <div class="output_area">
  253. <div class="output_text output_subarea output_execute_result">
  254. <pre>0.2935985533453888</pre>
  255. </div>
  256. </div>
  257. </div>
  258. </div>
  259. </div>
  260. </div>
  261. </div>
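<div class="jb_cell">
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>The link between bar height and percent can be checked on a small scale. The following is a minimal sketch, not part of the original analysis: it uses a made-up array <code>toy_delays</code> (hypothetical values, not the <code>united</code> data) and NumPy's <code>np.histogram</code> with <code>density=True</code>, which returns the proportion per unit of the horizontal axis. The names <code>toy_bins</code>, <code>heights</code>, and <code>percent_in_bin</code> are chosen here for illustration.</p>

```python
import numpy as np

# Hypothetical delay values in minutes, for illustration only (not the united data).
toy_delays = np.array([-5, 0, 3, 7, 9, 12, 15, 28, 64, 120])
toy_bins = np.arange(-20, 201, 10)     # same bin edges as delay_bins

# density=True gives the proportion of values per minute in each bin.
density, edges = np.histogram(toy_delays, bins=toy_bins, density=True)
heights = 100 * density                # bar heights in percent per minute
widths = np.diff(edges)                # every bin here is 10 minutes wide

# Area of a bar = height * width = percent of values that fall in that bin.
percent_in_bin = heights * widths

# Compare the [0, 10) bar with a direct count of values in that interval.
zero_to_ten = np.flatnonzero(edges[:-1] == 0)[0]
direct_percent = 100 * np.mean((toy_delays >= 0) & (toy_delays < 10))

print(heights[zero_to_ten], percent_in_bin[zero_to_ten], direct_percent)
```

<p>The area of the bar and the direct count agree, which is the same check that the row count performs for the flight data.</p>
</div>
</div>
</div>
</div>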
  262. <div class="jb_cell">
  263. <div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
  264. <div class="text_cell_render border-box-sizing rendered_html">
  265. <h3 id="Empirical-Distribution-of-the-Sample">Empirical Distribution of the Sample<a class="anchor-link" href="#Empirical-Distribution-of-the-Sample"> </a></h3><p>Let us now think of the 13,825 flights as a population, and draw random samples from it with replacement. It is helpful to package our code into a function. The function <code>empirical_hist_delay</code> takes the sample size as its argument and draws an empirical histogram of the results.</p>
  266. </div>
  267. </div>
  268. </div>
  269. </div>
  270. <div class="jb_cell">
  271. <div class="cell border-box-sizing code_cell rendered">
  272. <div class="input">
  273. <div class="inner_cell">
  274. <div class="input_area">
  275. <div class=" highlight hl-ipython3"><pre><span></span><span class="k">def</span> <span class="nf">empirical_hist_delay</span><span class="p">(</span><span class="n">n</span><span class="p">):</span>
  276.     <span class="n">united</span><span class="o">.</span><span class="n">sample</span><span class="p">(</span><span class="n">n</span><span class="p">)</span><span class="o">.</span><span class="n">hist</span><span class="p">(</span><span class="s1">&#39;Delay&#39;</span><span class="p">,</span> <span class="n">bins</span> <span class="o">=</span> <span class="n">delay_bins</span><span class="p">,</span> <span class="n">unit</span> <span class="o">=</span> <span class="s1">&#39;minute&#39;</span><span class="p">)</span>
  277. </pre></div>
  278. </div>
  279. </div>
  280. </div>
  281. </div>
  282. </div>
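<div class="jb_cell">
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>The call <code>united.sample(n)</code> relies on the default behavior of <code>sample</code> in the <code>datascience</code> library, which draws rows with replacement. The sketch below imitates that step with NumPy alone, drawing row indices rather than rows. The helper <code>draw_row_indices</code> and the seed are assumptions made for illustration; this is not how the library is implemented.</p>

```python
import numpy as np

def draw_row_indices(num_rows, n, rng):
    """Pick n row indices uniformly at random, with replacement."""
    return rng.choice(num_rows, size=n, replace=True)

rng = np.random.default_rng(0)     # arbitrary fixed seed so the sketch is repeatable
num_rows = 13825                   # number of flights in the population

indices = draw_row_indices(num_rows, 1000, rng)

# With replacement, the same row can be drawn more than once,
# so a sample may even be larger than the population itself.
big_sample = draw_row_indices(num_rows, 2 * num_rows, rng)
num_distinct = np.unique(big_sample).size

print(indices[:5], big_sample.size, num_distinct)
```

<p>Because each draw is made from the full population, the draws do not affect one another, which is the setting in which the law of averages was stated.</p>
</div>
</div>
</div>
</div>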
  283. <div class="jb_cell">
  284. <div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
  285. <div class="text_cell_render border-box-sizing rendered_html">
  286. <p>As we saw with the dice, as the sample size increases, the empirical histogram of the sample more closely resembles the histogram of the population. Compare these histograms to the population histogram above.</p>
  287. </div>
  288. </div>
  289. </div>
  290. </div>
  291. <div class="jb_cell">
  292. <div class="cell border-box-sizing code_cell rendered">
  293. <div class="input">
  294. <div class="inner_cell">
  295. <div class="input_area">
  296. <div class=" highlight hl-ipython3"><pre><span></span><span class="n">empirical_hist_delay</span><span class="p">(</span><span class="mi">10</span><span class="p">)</span>
  297. </pre></div>
  298. </div>
  299. </div>
  300. </div>
  301. <div class="output_wrapper">
  302. <div class="output">
  303. <div class="jb_output_wrapper }}">
  304. <div class="output_area">
  305. <div class="output_png output_subarea ">
  306. <img src="../../../images/chapters/10/2/Sampling_from_a_Population_15_0.png"
  307. >
  308. </div>
  309. </div>
  310. </div>
  311. </div>
  312. </div>
  313. </div>
  314. </div>
  315. <div class="jb_cell">
  316. <div class="cell border-box-sizing code_cell rendered">
  317. <div class="input">
  318. <div class="inner_cell">
  319. <div class="input_area">
  320. <div class=" highlight hl-ipython3"><pre><span></span><span class="n">empirical_hist_delay</span><span class="p">(</span><span class="mi">100</span><span class="p">)</span>
  321. </pre></div>
  322. </div>
  323. </div>
  324. </div>
  325. <div class="output_wrapper">
  326. <div class="output">
  327. <div class="jb_output_wrapper }}">
  328. <div class="output_area">
  329. <div class="output_png output_subarea ">
  330. <img src="../../../images/chapters/10/2/Sampling_from_a_Population_16_0.png"
  331. >
  332. </div>
  333. </div>
  334. </div>
  335. </div>
  336. </div>
  337. </div>
  338. </div>
  339. <div class="jb_cell">
  340. <div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
  341. <div class="text_cell_render border-box-sizing rendered_html">
  342. <p>The most consistently visible discrepancies are among the values that are rare in the population. In our example, those values are in the right-hand tail of the distribution. But as the sample size increases, even those values begin to appear in the sample in roughly the correct proportions.</p>
  343. </div>
  344. </div>
  345. </div>
  346. </div>
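<div class="jb_cell">
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>A short calculation suggests why rare values are the last to show up. If a group makes up a proportion <code>p</code> of the population, the chance that <code>n</code> draws with replacement miss it entirely is <code>(1 - p) ** n</code>. The sketch below applies this to the flights delayed by more than 200 minutes, using the proportion computed earlier in this section; the function name <code>chance_of_missing</code> is hypothetical.</p>

```python
# Proportion of flights delayed more than 200 minutes, as computed earlier in this section.
p_rare = 0.008390596745027125

def chance_of_missing(p, n):
    """Chance that n draws with replacement contain no value from a group of proportion p."""
    return (1 - p) ** n

for n in [10, 100, 1000]:
    print(n, chance_of_missing(p_rare, n))
```

<p>The chance of seeing no such flight at all is roughly 0.92 for a sample of 10, roughly 0.43 for a sample of 100, and roughly 0.0002 for a sample of 1000.</p>
</div>
</div>
</div>
</div>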
  347. <div class="jb_cell">
  348. <div class="cell border-box-sizing code_cell rendered">
  349. <div class="input">
  350. <div class="inner_cell">
  351. <div class="input_area">
  352. <div class=" highlight hl-ipython3"><pre><span></span><span class="n">empirical_hist_delay</span><span class="p">(</span><span class="mi">1000</span><span class="p">)</span>
  353. </pre></div>
  354. </div>
  355. </div>
  356. </div>
  357. <div class="output_wrapper">
  358. <div class="output">
  359. <div class="jb_output_wrapper }}">
  360. <div class="output_area">
  361. <div class="output_png output_subarea ">
  362. <img src="../../../images/chapters/10/2/Sampling_from_a_Population_18_0.png"
  363. >
  364. </div>
  365. </div>
  366. </div>
  367. </div>
  368. </div>
  369. </div>
  370. </div>
  371. <div class="jb_cell">
  372. <div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
  373. <div class="text_cell_render border-box-sizing rendered_html">
  374. <h3 id="Convergence-of-the-Empirical-Histogram-of-the-Sample">Convergence of the Empirical Histogram of the Sample<a class="anchor-link" href="#Convergence-of-the-Empirical-Histogram-of-the-Sample"> </a></h3><p>What we have observed in this section can be summarized as follows:</p>
  375. <p>For a large random sample, the empirical histogram of the sample resembles the histogram of the population, with high probability.</p>
  376. <p>This justifies the use of large random samples in statistical inference. The idea is that since a large random sample is likely to resemble the population from which it is drawn, quantities computed from the values in the sample are likely to be close to the corresponding quantities in the population.</p>
  377. </div>
  378. </div>
  379. </div>
  380. </div>
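<div class="jb_cell">
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>One way to make "resembles" concrete is to measure how far apart the two histograms are. The sketch below is an illustration under stated assumptions rather than a method used in this section: it builds a synthetic, right-skewed stand-in population (not the <code>united</code> data), and measures the total variation distance between the binned sample and the binned population, that is, half the sum of the absolute differences in bin proportions. The names <code>bin_proportions</code> and <code>histogram_distance</code> are hypothetical.</p>

```python
import numpy as np

rng = np.random.default_rng(0)         # arbitrary fixed seed

# Synthetic right-skewed stand-in population on [-20, 200]; NOT the united data.
population = rng.beta(1, 6, size=13825) * 220 - 20
bins = np.arange(-20, 201, 10)

def bin_proportions(values, bins):
    """Proportion of the values that fall in each bin."""
    counts, _ = np.histogram(values, bins=bins)
    return counts / len(values)

def histogram_distance(sample, population, bins):
    """Total variation distance between the two binned distributions."""
    difference = bin_proportions(sample, bins) - bin_proportions(population, bins)
    return 0.5 * np.abs(difference).sum()

sample_sizes = [10, 100, 1000, 10000]
distances = [
    histogram_distance(rng.choice(population, size=n, replace=True), population, bins)
    for n in sample_sizes
]

for n, d in zip(sample_sizes, distances):
    print(n, round(d, 3))
```

<p>A distance of 0 means the two histograms are identical and a distance of 1 means they do not overlap at all. The exact numbers depend on the seed and on the synthetic population, but the distance for the largest sample should be far smaller than the distance for the smallest one.</p>
</div>
</div>
</div>
</div>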