  1. ---
  2. redirect_from:
  3. - "/chapters/10/1/empirical-distributions"
  4. interact_link: content/chapters/10/1/Empirical_Distributions.ipynb
  5. kernel_name: python3
  6. has_widgets: false
  7. title: |-
  8.   Empirical Distributions
  9. prev_page:
  10.   url: /chapters/10/Sampling_and_Empirical_Distributions.html
  11.   title: |-
  12.     Sampling and Empirical Distributions
  13. next_page:
  14.   url: /chapters/10/2/Sampling_from_a_Population.html
  15.   title: |-
  16.     Sampling from a Population
  17. comment: "***PROGRAMMATICALLY GENERATED, DO NOT EDIT. SEE ORIGINAL FILES IN /content***"
  18. ---
  19. <div class="jb_cell tag_remove_input">
  20. <div class="cell border-box-sizing code_cell rendered">
  21. </div>
  22. </div>
  23. <div class="jb_cell">
  24. <div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
  25. <div class="text_cell_render border-box-sizing rendered_html">
  26. <h3 id="Empirical-Distributions">Empirical Distributions<a class="anchor-link" href="#Empirical-Distributions"> </a></h3><p>In data science, the word "empirical" means "observed". Empirical distributions are distributions of observed data, such as data in random samples.</p>
  27. <p>In this section we will generate data and see what the empirical distribution looks like.</p>
  28. <p>Our setting is a simple experiment: rolling a die multiple times and keeping track of which face appears. The table <code>die</code> contains the numbers of spots on the faces of a die. All the numbers appear exactly once, as we are assuming that the die is fair.</p>
  29. </div>
  30. </div>
  31. </div>
  32. </div>
  33. <div class="jb_cell">
  34. <div class="cell border-box-sizing code_cell rendered">
  35. <div class="input">
  36. <div class="inner_cell">
  37. <div class="input_area">
  38. <div class=" highlight hl-ipython3"><pre><span></span><span class="n">die</span> <span class="o">=</span> <span class="n">Table</span><span class="p">()</span><span class="o">.</span><span class="n">with_column</span><span class="p">(</span><span class="s1">&#39;Face&#39;</span><span class="p">,</span> <span class="n">np</span><span class="o">.</span><span class="n">arange</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">7</span><span class="p">,</span> <span class="mi">1</span><span class="p">))</span>
  39. <span class="n">die</span>
  40. </pre></div>
  41. </div>
  42. </div>
  43. </div>
  44. <div class="output_wrapper">
  45. <div class="output">
  46. <div class="jb_output_wrapper }}">
  47. <div class="output_area">
  48. <div class="output_html rendered_html output_subarea output_execute_result">
  49. <table border="1" class="dataframe">
  50. <thead>
  51. <tr>
  52. <th>Face</th>
  53. </tr>
  54. </thead>
  55. <tbody>
  56. <tr>
  57. <td>1 </td>
  58. </tr>
  59. <tr>
  60. <td>2 </td>
  61. </tr>
  62. <tr>
  63. <td>3 </td>
  64. </tr>
  65. <tr>
  66. <td>4 </td>
  67. </tr>
  68. <tr>
  69. <td>5 </td>
  70. </tr>
  71. <tr>
  72. <td>6 </td>
  73. </tr>
  74. </tbody>
  75. </table>
  76. </div>
  77. </div>
  78. </div>
  79. </div>
  80. </div>
  81. </div>
  82. </div>
  83. <div class="jb_cell">
  84. <div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
  85. <div class="text_cell_render border-box-sizing rendered_html">
  86. <h3 id="A-Probability-Distribution">A Probability Distribution<a class="anchor-link" href="#A-Probability-Distribution"> </a></h3><p>The histogram below helps us visualize the fact that every face appears with probability 1/6. We say that the histogram shows the <em>distribution</em> of probabilities over all the possible faces. Since all the bars represent the same percent chance, the distribution is called <em>uniform on the integers 1 through 6.</em></p>
  87. </div>
  88. </div>
  89. </div>
  90. </div>
  91. <div class="jb_cell">
  92. <div class="cell border-box-sizing code_cell rendered">
  93. <div class="input">
  94. <div class="inner_cell">
  95. <div class="input_area">
  96. <div class=" highlight hl-ipython3"><pre><span></span><span class="n">die_bins</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">arange</span><span class="p">(</span><span class="mf">0.5</span><span class="p">,</span> <span class="mf">6.6</span><span class="p">,</span> <span class="mi">1</span><span class="p">)</span>
  97. <span class="n">die</span><span class="o">.</span><span class="n">hist</span><span class="p">(</span><span class="n">bins</span> <span class="o">=</span> <span class="n">die_bins</span><span class="p">)</span>
  98. </pre></div>
  99. </div>
  100. </div>
  101. </div>
  102. <div class="output_wrapper">
  103. <div class="output">
  104. <div class="jb_output_wrapper }}">
  105. <div class="output_area">
  106. <div class="output_png output_subarea ">
  107. <img src="../../../images/chapters/10/1/Empirical_Distributions_4_0.png"
  108. >
  109. </div>
  110. </div>
  111. </div>
  112. </div>
  113. </div>
  114. </div>
  115. </div>
  116. <div class="jb_cell">
  117. <div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
  118. <div class="text_cell_render border-box-sizing rendered_html">
  119. <p>Variables whose successive values are separated by the same fixed amount, such as the values on rolls of a die (successive values separated by 1), fall into a class of variables that are called <em>discrete</em>. The histogram above is called a <em>discrete</em> histogram. Its bins are specified by the array <code>die_bins</code>, whose edges run from 0.5 to 6.5 in steps of 1, so each bar is centered over the corresponding integer value.</p>
  120. <p>It is important to remember that the die can't show 1.3 spots, or 5.2 spots – it always shows an integer number of spots. But our visualization spreads the probability of each value over the area of a bar. While this might seem a bit arbitrary at this stage of the course, it will become important later when we overlay smooth curves over discrete histograms.</p>
  121. <p>Before going further, let's make sure that the numbers on the axes make sense. The probability of each face is 1/6, which is 16.67% when rounded to two decimal places. The width of each bin is 1 unit. So the height of each bar is 16.67% per unit. This agrees with the horizontal and vertical scales of the graph.</p>
  122. </div>
  123. </div>
  124. </div>
  125. </div>
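<div class="jb_cell">
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>The arithmetic in the paragraph above can be checked with a few lines of Python. This is only a sketch: the names <code>faces</code>, <code>bin_edges</code>, and <code>bar_height_percent_per_unit</code> are illustrative and not part of the original notebook, and the code uses the standard library instead of the <code>Table</code> methods so that it runs on its own.</p>

```python
from fractions import Fraction

# Assumption: six faces and unit-width bins centered on each integer,
# mirroring the edges 0.5, 1.5, ..., 6.5 stored in die_bins.
faces = [1, 2, 3, 4, 5, 6]
bin_edges = [0.5 + k for k in range(7)]

# A fair die gives every face the same probability, 1/6.
probability = Fraction(1, len(faces))


def bar_height_percent_per_unit(prob, left, right):
    """Height of a histogram bar: percent in the bin divided by the bin width."""
    width = right - left
    return 100 * float(prob) / width


heights = [
    bar_height_percent_per_unit(probability, bin_edges[i], bin_edges[i + 1])
    for i in range(len(faces))
]

for face, height in zip(faces, heights):
    print(f"face {face}: {height:.2f}% per unit")
```

<p>Because every bin has width 1, the height of each bar (in percent per unit) and its area (in percent) are the same number, and the six areas add up to 100%.</p>
</div>
</div>
</div>
</div>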
  126. <div class="jb_cell">
  127. <div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
  128. <div class="text_cell_render border-box-sizing rendered_html">
  129. <h3 id="Empirical-Distributions">Empirical Distributions<a class="anchor-link" href="#Empirical-Distributions"> </a></h3><p>The distribution above consists of the theoretical probability of each face. It is not based on data. It can be studied and understood without any dice being rolled.</p>
  130. <p><em>Empirical distributions,</em> on the other hand, are distributions of observed data. They can be visualized by <em>empirical histograms</em>.</p>
  131. <p>Let us get some data by simulating rolls of a die. This can be done by sampling at random with replacement from the integers 1 through 6. We have used <code>np.random.choice</code> for such simulations before. But now we will introduce a Table method for doing this. This will make it possible for us to use our familiar Table methods for visualization.</p>
  132. <p>The Table method is called <code>sample</code>. It draws at random with replacement from the rows of a table. Its argument is the sample size, and it returns a table consisting of the rows that were selected. An optional argument <code>with_replacement=False</code> specifies that the sample should be drawn without replacement, but that does not apply to rolling a die.</p>
  133. <p>Here are the results of 10 rolls of a die.</p>
  134. </div>
  135. </div>
  136. </div>
  137. </div>
  138. <div class="jb_cell">
  139. <div class="cell border-box-sizing code_cell rendered">
  140. <div class="input">
  141. <div class="inner_cell">
  142. <div class="input_area">
  143. <div class=" highlight hl-ipython3"><pre><span></span><span class="n">die</span><span class="o">.</span><span class="n">sample</span><span class="p">(</span><span class="mi">10</span><span class="p">)</span>
  144. </pre></div>
  145. </div>
  146. </div>
  147. </div>
  148. <div class="output_wrapper">
  149. <div class="output">
  150. <div class="jb_output_wrapper }}">
  151. <div class="output_area">
  152. <div class="output_html rendered_html output_subarea output_execute_result">
  153. <table border="1" class="dataframe">
  154. <thead>
  155. <tr>
  156. <th>Face</th>
  157. </tr>
  158. </thead>
  159. <tbody>
  160. <tr>
  161. <td>1 </td>
  162. </tr>
  163. <tr>
  164. <td>2 </td>
  165. </tr>
  166. <tr>
  167. <td>3 </td>
  168. </tr>
  169. <tr>
  170. <td>4 </td>
  171. </tr>
  172. <tr>
  173. <td>4 </td>
  174. </tr>
  175. <tr>
  176. <td>1 </td>
  177. </tr>
  178. <tr>
  179. <td>1 </td>
  180. </tr>
  181. <tr>
  182. <td>2 </td>
  183. </tr>
  184. <tr>
  185. <td>6 </td>
  186. </tr>
  187. <tr>
  188. <td>2 </td>
  189. </tr>
  190. </tbody>
  191. </table>
  192. </div>
  193. </div>
  194. </div>
  195. </div>
  196. </div>
  197. </div>
  198. </div>
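<div class="jb_cell">
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>The difference between drawing with and without replacement, mentioned above, can be illustrated outside the <code>Table</code> class. The sketch below uses Python's standard <code>random</code> module as a stand-in for <code>sample</code>. It is not how <code>sample</code> is implemented, and the seed and variable names are assumptions made only for this illustration.</p>

```python
import random

faces = [1, 2, 3, 4, 5, 6]
rng = random.Random(0)  # hypothetical fixed seed, used only for reproducibility

# With replacement: every draw is made from all six faces,
# so faces can repeat and the sample may be larger than the population.
rolls = rng.choices(faces, k=10)

# Without replacement: each face can be drawn at most once,
# so the sample size cannot exceed six.
all_faces_once = rng.sample(faces, k=6)

print("with replacement:   ", rolls)
print("without replacement:", all_faces_once)
```

<p>Ten rolls of a six-sided die must contain at least one repeated face, which is why rolling a die corresponds to sampling with replacement.</p>
</div>
</div>
</div>
</div>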
  199. <div class="jb_cell">
  200. <div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
  201. <div class="text_cell_render border-box-sizing rendered_html">
  202. <p>We can use the same method to simulate as many rolls as we like, and then draw empirical histograms of the results. Because we are going to do this repeatedly, we define a function <code>empirical_hist_die</code> that takes the sample size as its argument, rolls a die that many times, and then draws a histogram of the observed results.</p>
  203. </div>
  204. </div>
  205. </div>
  206. </div>
  207. <div class="jb_cell">
  208. <div class="cell border-box-sizing code_cell rendered">
  209. <div class="input">
  210. <div class="inner_cell">
  211. <div class="input_area">
  212. <div class=" highlight hl-ipython3"><pre><span></span><span class="k">def</span> <span class="nf">empirical_hist_die</span><span class="p">(</span><span class="n">n</span><span class="p">):</span>
  213.     <span class="n">die</span><span class="o">.</span><span class="n">sample</span><span class="p">(</span><span class="n">n</span><span class="p">)</span><span class="o">.</span><span class="n">hist</span><span class="p">(</span><span class="n">bins</span> <span class="o">=</span> <span class="n">die_bins</span><span class="p">)</span>
  214. </pre></div>
  215. </div>
  216. </div>
  217. </div>
  218. </div>
  219. </div>
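<div class="jb_cell">
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>As a numerical companion to the function above, the following sketch tabulates the percent of rolls that land on each face, which is what the bars of an empirical histogram with unit-width bins display. The function name <code>empirical_percents</code> and the seed are hypothetical, and the standard library is used in place of the <code>Table</code> methods.</p>

```python
import random
from collections import Counter


def empirical_percents(n, rng):
    """Roll a fair six-sided die n times; return the percent of rolls per face."""
    rolls = rng.choices(range(1, 7), k=n)
    counts = Counter(rolls)
    # Faces that never appeared get a count of 0 from the Counter.
    return {face: 100 * counts[face] / n for face in range(1, 7)}


rng = random.Random(42)  # hypothetical seed, only so the run can be repeated
percents = empirical_percents(10, rng)

for face, pct in percents.items():
    print(f"face {face}: {pct:.1f}%")
```

<p>With only 10 rolls, every percent is a multiple of 10, so no bar can equal the 16.67% of the probability histogram.</p>
</div>
</div>
</div>
</div>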
  220. <div class="jb_cell">
  221. <div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
  222. <div class="text_cell_render border-box-sizing rendered_html">
  223. <h3 id="Empirical-Histograms">Empirical Histograms<a class="anchor-link" href="#Empirical-Histograms"> </a></h3><p>Here is an empirical histogram of 10 rolls. It doesn't look very much like the probability histogram above. Run the cell a few times to see how it varies.</p>
  224. </div>
  225. </div>
  226. </div>
  227. </div>
  228. <div class="jb_cell">
  229. <div class="cell border-box-sizing code_cell rendered">
  230. <div class="input">
  231. <div class="inner_cell">
  232. <div class="input_area">
  233. <div class=" highlight hl-ipython3"><pre><span></span><span class="n">empirical_hist_die</span><span class="p">(</span><span class="mi">10</span><span class="p">)</span>
  234. </pre></div>
  235. </div>
  236. </div>
  237. </div>
  238. <div class="output_wrapper">
  239. <div class="output">
  240. <div class="jb_output_wrapper }}">
  241. <div class="output_area">
  242. <div class="output_png output_subarea ">
  243. <img src="../../../images/chapters/10/1/Empirical_Distributions_11_0.png"
  244. >
  245. </div>
  246. </div>
  247. </div>
  248. </div>
  249. </div>
  250. </div>
  251. </div>
  252. <div class="jb_cell">
  253. <div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
  254. <div class="text_cell_render border-box-sizing rendered_html">
  255. <p>When the sample size increases, the empirical histogram begins to look more like the histogram of theoretical probabilities.</p>
  256. </div>
  257. </div>
  258. </div>
  259. </div>
  260. <div class="jb_cell">
  261. <div class="cell border-box-sizing code_cell rendered">
  262. <div class="input">
  263. <div class="inner_cell">
  264. <div class="input_area">
  265. <div class=" highlight hl-ipython3"><pre><span></span><span class="n">empirical_hist_die</span><span class="p">(</span><span class="mi">100</span><span class="p">)</span>
  266. </pre></div>
  267. </div>
  268. </div>
  269. </div>
  270. <div class="output_wrapper">
  271. <div class="output">
  272. <div class="jb_output_wrapper }}">
  273. <div class="output_area">
  274. <div class="output_png output_subarea ">
  275. <img src="../../../images/chapters/10/1/Empirical_Distributions_13_0.png"
  276. >
  277. </div>
  278. </div>
  279. </div>
  280. </div>
  281. </div>
  282. </div>
  283. </div>
  284. <div class="jb_cell">
  285. <div class="cell border-box-sizing code_cell rendered">
  286. <div class="input">
  287. <div class="inner_cell">
  288. <div class="input_area">
  289. <div class=" highlight hl-ipython3"><pre><span></span><span class="n">empirical_hist_die</span><span class="p">(</span><span class="mi">1000</span><span class="p">)</span>
  290. </pre></div>
  291. </div>
  292. </div>
  293. </div>
  294. <div class="output_wrapper">
  295. <div class="output">
  296. <div class="jb_output_wrapper }}">
  297. <div class="output_area">
  298. <div class="output_png output_subarea ">
  299. <img src="../../../images/chapters/10/1/Empirical_Distributions_14_0.png"
  300. >
  301. </div>
  302. </div>
  303. </div>
  304. </div>
  305. </div>
  306. </div>
  307. </div>
  308. <div class="jb_cell">
  309. <div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
  310. <div class="text_cell_render border-box-sizing rendered_html">
  311. <p>As we increase the number of rolls in the simulation, the area of each bar gets closer to 16.67%, which is the area of each bar in the probability histogram.</p>
  312. <h3 id="The-Law-of-Averages">The Law of Averages<a class="anchor-link" href="#The-Law-of-Averages"> </a></h3><p>What we have observed above is an instance of a general rule.</p>
  313. <p>If a chance experiment is repeated independently and under identical conditions, then, in the long run, the proportion of times that an event occurs gets closer and closer to the theoretical probability of the event.</p>
  314. <p>For example, in the long run, the proportion of times the face with four spots appears gets closer and closer to 1/6.</p>
  315. <p>Here "independently and under identical conditions" means that every repetition is performed in the same way regardless of the results of all the other repetitions.</p>
  316. </div>
  317. </div>
  318. </div>
  319. </div>
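<div class="jb_cell">
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>The example of the face with four spots can be explored with a short simulation. This is a sketch under stated assumptions: the function name <code>proportion_of_fours</code>, the seed, and the list of sample sizes are chosen only for illustration, and the standard <code>random</code> module stands in for the <code>Table</code> method <code>sample</code>.</p>

```python
import random


def proportion_of_fours(n, rng):
    """Roll a fair six-sided die n times; return the proportion of rolls showing 4."""
    rolls = rng.choices(range(1, 7), k=n)
    return rolls.count(4) / n


rng = random.Random(2024)  # hypothetical seed, only so the run can be repeated

for n in [10, 100, 1000, 10000, 100000]:
    p = proportion_of_fours(n, rng)
    print(f"{n:>6} rolls: proportion = {p:.4f}, distance from 1/6 = {abs(p - 1 / 6):.4f}")
```

<p>The distance from 1/6 tends to shrink as the number of rolls grows, although it need not decrease at every step in a single run.</p>
</div>
</div>
</div>
</div>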