Classifying_by_One_Variable.html 31 KB

123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100101102103104105106107108109110111112113114115116117118119120121122123124125126127128129130131132133134135136137138139140141142143144145146147148149150151152153154155156157158159160161162163164165166167168169170171172173174175176177178179180181182183184185186187188189190191192193194195196197198199200201202203204205206207208209210211212213214215216217218219220221222223224225226227228229230231232233234235236237238239240241242243244245246247248249250251252253254255256257258259260261262263264265266267268269270271272273274275276277278279280281282283284285286287288289290291292293294295296297298299300301302303304305306307308309310311312313314315316317318319320321322323324325326327328329330331332333334335336337338339340341342343344345346347348349350351352353354355356357358359360361362363364365366367368369370371372373374375376377378379380381382383384385386387388389390391392393394395396397398399400401402403404405406407408409410411412413414415416417418419420421422423424425426427428429430431432433434435436437438439440441442443444445446447448449450451452453454455456457458459460461462463464465466467468469470471472473474475476477478479480481482483484485486487488489490491492493494495496497498499500501502503504505506507508509510511512513514515516517518519520521522523524525526527528529530531532533534535536537538539540541542543544545546547548549550551552553554555556557558559560561562563564565566567568569570571572573574575576577578579580581582583584585586587588589590591592593594595596597598599600601602603604605606607608609610611612613614615616617618619620621622623624625626627628629630631632633634635636637638639640641642643644645646647648649650651652653654655656657658659660661662663664665666667668669670671672673674675676677678679680681682683684685686687688689690691692693694695696697698699700701702703704705706707708709710711712713714715716717718719720721722723724725726727728729730731732733734735736737738739740741742743744745746747748749750751752753754755756757758759760761762763764765766767768769770771772773774775776777778779780781782783784785786787788789790791792793794795796797798799800801802803804805806807808809810811812813814815816817818819820821822823824825826827828829830831832833834835836837838839840841842843844845846847848849850851852853854855856857858859860861862863864865866867868869870871872873874875876877878879880881882883884885886887888889890891892893894895896897898899900901902903904905906907908909910911912913914915916917918919920921922923924925926927928929930931932933934935936937938939940941942943
  1. ---
  2. redirect_from:
  3. - "/chapters/08/2/classifying-by-one-variable"
  4. interact_link: content/chapters/08/2/Classifying_by_One_Variable.ipynb
  5. kernel_name: python3
  6. has_widgets: false
  7. title: |-
  8. Classifying by One Variable
  9. prev_page:
  10. url: /chapters/08/1/Applying_a_Function_to_a_Column.html
  11. title: |-
  12. Applying Functions to Columns
  13. next_page:
  14. url: /chapters/08/3/Cross-Classifying_by_More_than_One_Variable.html
  15. title: |-
  16. Cross-Classifying
  17. comment: "***PROGRAMMATICALLY GENERATED, DO NOT EDIT. SEE ORIGINAL FILES IN /content***"
  18. ---
  19. <div class="jb_cell tag_remove_input">
  20. <div class="cell border-box-sizing code_cell rendered">
  21. </div>
  22. </div>
  23. <div class="jb_cell">
  24. <div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
  25. <div class="text_cell_render border-box-sizing rendered_html">
  26. <h3 id="Classifying-by-One-Variable">Classifying by One Variable<a class="anchor-link" href="#Classifying-by-One-Variable"> </a></h3><p>Data scientists often need to classify individuals into groups according to shared features, and then identify some characteristics of the groups. For example, in the example using Galton's data on heights, we saw that it was useful to classify families according to the parents' midparent heights, and then find the average height of the children in each group.</p>
  27. <p>This section is about classifying individuals into categories that are not numerical. We begin by recalling the basic use of <code>group</code>.</p>
  28. </div>
  29. </div>
  30. </div>
  31. </div>
  32. <div class="jb_cell">
  33. <div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
  34. <div class="text_cell_render border-box-sizing rendered_html">
  35. <h3 id="Counting-the-Number-in-Each-Category">Counting the Number in Each Category<a class="anchor-link" href="#Counting-the-Number-in-Each-Category"> </a></h3><p>The <code>group</code> method with a single argument counts the number of rows for each category in a column. The result contains one row per unique value in the grouped column.</p>
  36. <p>Here is a small table of data on ice cream cones. The <code>group</code> method can be used to list the distinct flavors and provide the counts of each flavor.</p>
  37. </div>
  38. </div>
  39. </div>
  40. </div>
  41. <div class="jb_cell">
  42. <div class="cell border-box-sizing code_cell rendered">
  43. <div class="input">
  44. <div class="inner_cell">
  45. <div class="input_area">
  46. <div class=" highlight hl-ipython3"><pre><span></span><span class="n">cones</span> <span class="o">=</span> <span class="n">Table</span><span class="p">()</span><span class="o">.</span><span class="n">with_columns</span><span class="p">(</span>
  47. <span class="s1">&#39;Flavor&#39;</span><span class="p">,</span> <span class="n">make_array</span><span class="p">(</span><span class="s1">&#39;strawberry&#39;</span><span class="p">,</span> <span class="s1">&#39;chocolate&#39;</span><span class="p">,</span> <span class="s1">&#39;chocolate&#39;</span><span class="p">,</span> <span class="s1">&#39;strawberry&#39;</span><span class="p">,</span> <span class="s1">&#39;chocolate&#39;</span><span class="p">),</span>
  48. <span class="s1">&#39;Price&#39;</span><span class="p">,</span> <span class="n">make_array</span><span class="p">(</span><span class="mf">3.55</span><span class="p">,</span> <span class="mf">4.75</span><span class="p">,</span> <span class="mf">6.55</span><span class="p">,</span> <span class="mf">5.25</span><span class="p">,</span> <span class="mf">5.25</span><span class="p">)</span>
  49. <span class="p">)</span>
  50. <span class="n">cones</span>
  51. </pre></div>
  52. </div>
  53. </div>
  54. </div>
  55. <div class="output_wrapper">
  56. <div class="output">
  57. <div class="jb_output_wrapper }}">
  58. <div class="output_area">
  59. <div class="output_html rendered_html output_subarea output_execute_result">
  60. <table border="1" class="dataframe">
  61. <thead>
  62. <tr>
  63. <th>Flavor</th> <th>Price</th>
  64. </tr>
  65. </thead>
  66. <tbody>
  67. <tr>
  68. <td>strawberry</td> <td>3.55 </td>
  69. </tr>
  70. <tr>
  71. <td>chocolate </td> <td>4.75 </td>
  72. </tr>
  73. <tr>
  74. <td>chocolate </td> <td>6.55 </td>
  75. </tr>
  76. <tr>
  77. <td>strawberry</td> <td>5.25 </td>
  78. </tr>
  79. <tr>
  80. <td>chocolate </td> <td>5.25 </td>
  81. </tr>
  82. </tbody>
  83. </table>
  84. </div>
  85. </div>
  86. </div>
  87. </div>
  88. </div>
  89. </div>
  90. </div>
  91. <div class="jb_cell">
  92. <div class="cell border-box-sizing code_cell rendered">
  93. <div class="input">
  94. <div class="inner_cell">
  95. <div class="input_area">
  96. <div class=" highlight hl-ipython3"><pre><span></span><span class="n">cones</span><span class="o">.</span><span class="n">group</span><span class="p">(</span><span class="s1">&#39;Flavor&#39;</span><span class="p">)</span>
  97. </pre></div>
  98. </div>
  99. </div>
  100. </div>
  101. <div class="output_wrapper">
  102. <div class="output">
  103. <div class="jb_output_wrapper }}">
  104. <div class="output_area">
  105. <div class="output_html rendered_html output_subarea output_execute_result">
  106. <table border="1" class="dataframe">
  107. <thead>
  108. <tr>
  109. <th>Flavor</th> <th>count</th>
  110. </tr>
  111. </thead>
  112. <tbody>
  113. <tr>
  114. <td>chocolate </td> <td>3 </td>
  115. </tr>
  116. <tr>
  117. <td>strawberry</td> <td>2 </td>
  118. </tr>
  119. </tbody>
  120. </table>
  121. </div>
  122. </div>
  123. </div>
  124. </div>
  125. </div>
  126. </div>
  127. </div>
  128. <div class="jb_cell">
  129. <div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
  130. <div class="text_cell_render border-box-sizing rendered_html">
  131. <p>There are two distinct categories, chocolate and strawberry. The call to <code>group</code> creates a table of counts in each category. The column is called <code>count</code> by default, and contains the number of rows in each category.</p>
  132. <p>Notice that this can all be worked out from just the <code>Flavor</code> column. The <code>Price</code> column has not been used.</p>
  133. <p>But what if we wanted the total price of the cones of each different flavor? That's where the second argument of <code>group</code> comes in.</p>
  134. </div>
  135. </div>
  136. </div>
  137. </div>
  138. <div class="jb_cell">
  139. <div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
  140. <div class="text_cell_render border-box-sizing rendered_html">
  141. <h3 id="Finding-a-Characteristic-of-Each-Category">Finding a Characteristic of Each Category<a class="anchor-link" href="#Finding-a-Characteristic-of-Each-Category"> </a></h3><p>The optional second argument of <code>group</code> names the function that will be used to aggregate values in other columns for all of those rows. For instance, <code>sum</code> will sum up the prices in all rows that match each category. This result also contains one row per unique value in the grouped column, but it has the same number of columns as the original table.</p>
  142. <p>To find the total price of each flavor, we call <code>group</code> again, with <code>Flavor</code> as its first argument as before. But this time there is a second argument: the function name <code>sum</code>.</p>
  143. </div>
  144. </div>
  145. </div>
  146. </div>
  147. <div class="jb_cell">
  148. <div class="cell border-box-sizing code_cell rendered">
  149. <div class="input">
  150. <div class="inner_cell">
  151. <div class="input_area">
  152. <div class=" highlight hl-ipython3"><pre><span></span><span class="n">cones</span><span class="o">.</span><span class="n">group</span><span class="p">(</span><span class="s1">&#39;Flavor&#39;</span><span class="p">,</span> <span class="nb">sum</span><span class="p">)</span>
  153. </pre></div>
  154. </div>
  155. </div>
  156. </div>
  157. <div class="output_wrapper">
  158. <div class="output">
  159. <div class="jb_output_wrapper }}">
  160. <div class="output_area">
  161. <div class="output_html rendered_html output_subarea output_execute_result">
  162. <table border="1" class="dataframe">
  163. <thead>
  164. <tr>
  165. <th>Flavor</th> <th>Price sum</th>
  166. </tr>
  167. </thead>
  168. <tbody>
  169. <tr>
  170. <td>chocolate </td> <td>16.55 </td>
  171. </tr>
  172. <tr>
  173. <td>strawberry</td> <td>8.8 </td>
  174. </tr>
  175. </tbody>
  176. </table>
  177. </div>
  178. </div>
  179. </div>
  180. </div>
  181. </div>
  182. </div>
  183. </div>
  184. <div class="jb_cell">
  185. <div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
  186. <div class="text_cell_render border-box-sizing rendered_html">
  187. <p>To create this new table, <code>group</code> has calculated the sum of the <code>Price</code> entries in all the rows corresponding to each distinct flavor. The prices in the three <code>chocolate</code> rows add up to $\$16.55$ (you can assume that price is being measured in dollars). The prices in the two `strawberry` rows have a total of $\$8.80$.</p>
  188. <p>The label of the newly created "sum" column is <code>Price sum</code>, which is created by taking the label of the column being summed, and appending the word <code>sum</code>.</p>
  189. <p>Because <code>group</code> finds the <code>sum</code> of all columns other than the one with the categories, there is no need to specify that it has to <code>sum</code> the prices.</p>
  190. <p>To see in more detail what <code>group</code> is doing, notice that you could have figured out the total prices yourself, not only by mental arithmetic but also using code. For example, to find the total price of all the chocolate cones, you could start by creating a new table consisting of only the chocolate cones, and then accessing the column of prices:</p>
  191. </div>
  192. </div>
  193. </div>
  194. </div>
  195. <div class="jb_cell">
  196. <div class="cell border-box-sizing code_cell rendered">
  197. <div class="input">
  198. <div class="inner_cell">
  199. <div class="input_area">
  200. <div class=" highlight hl-ipython3"><pre><span></span><span class="n">cones</span><span class="o">.</span><span class="n">where</span><span class="p">(</span><span class="s1">&#39;Flavor&#39;</span><span class="p">,</span> <span class="n">are</span><span class="o">.</span><span class="n">equal_to</span><span class="p">(</span><span class="s1">&#39;chocolate&#39;</span><span class="p">))</span><span class="o">.</span><span class="n">column</span><span class="p">(</span><span class="s1">&#39;Price&#39;</span><span class="p">)</span>
  201. </pre></div>
  202. </div>
  203. </div>
  204. </div>
  205. <div class="output_wrapper">
  206. <div class="output">
  207. <div class="jb_output_wrapper }}">
  208. <div class="output_area">
  209. <div class="output_text output_subarea output_execute_result">
  210. <pre>array([4.75, 6.55, 5.25])</pre>
  211. </div>
  212. </div>
  213. </div>
  214. </div>
  215. </div>
  216. </div>
  217. </div>
  218. <div class="jb_cell">
  219. <div class="cell border-box-sizing code_cell rendered">
  220. <div class="input">
  221. <div class="inner_cell">
  222. <div class="input_area">
  223. <div class=" highlight hl-ipython3"><pre><span></span><span class="nb">sum</span><span class="p">(</span><span class="n">cones</span><span class="o">.</span><span class="n">where</span><span class="p">(</span><span class="s1">&#39;Flavor&#39;</span><span class="p">,</span> <span class="n">are</span><span class="o">.</span><span class="n">equal_to</span><span class="p">(</span><span class="s1">&#39;chocolate&#39;</span><span class="p">))</span><span class="o">.</span><span class="n">column</span><span class="p">(</span><span class="s1">&#39;Price&#39;</span><span class="p">))</span>
  224. </pre></div>
  225. </div>
  226. </div>
  227. </div>
  228. <div class="output_wrapper">
  229. <div class="output">
  230. <div class="jb_output_wrapper }}">
  231. <div class="output_area">
  232. <div class="output_text output_subarea output_execute_result">
  233. <pre>16.55</pre>
  234. </div>
  235. </div>
  236. </div>
  237. </div>
  238. </div>
  239. </div>
  240. </div>
  241. <div class="jb_cell">
  242. <div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
  243. <div class="text_cell_render border-box-sizing rendered_html">
  244. <p>This is what <code>group</code> is doing for each distinct value in <code>Flavor</code>.</p>
  245. </div>
  246. </div>
  247. </div>
  248. </div>
  249. <div class="jb_cell">
  250. <div class="cell border-box-sizing code_cell rendered">
  251. <div class="input">
  252. <div class="inner_cell">
  253. <div class="input_area">
  254. <div class=" highlight hl-ipython3"><pre><span></span><span class="c1"># For each distinct value in `Flavor, access all the rows</span>
  255. <span class="c1"># and create an array of `Price`</span>
  256. <span class="n">cones_choc</span> <span class="o">=</span> <span class="n">cones</span><span class="o">.</span><span class="n">where</span><span class="p">(</span><span class="s1">&#39;Flavor&#39;</span><span class="p">,</span> <span class="n">are</span><span class="o">.</span><span class="n">equal_to</span><span class="p">(</span><span class="s1">&#39;chocolate&#39;</span><span class="p">))</span><span class="o">.</span><span class="n">column</span><span class="p">(</span><span class="s1">&#39;Price&#39;</span><span class="p">)</span>
  257. <span class="n">cones_strawb</span> <span class="o">=</span> <span class="n">cones</span><span class="o">.</span><span class="n">where</span><span class="p">(</span><span class="s1">&#39;Flavor&#39;</span><span class="p">,</span> <span class="n">are</span><span class="o">.</span><span class="n">equal_to</span><span class="p">(</span><span class="s1">&#39;strawberry&#39;</span><span class="p">))</span><span class="o">.</span><span class="n">column</span><span class="p">(</span><span class="s1">&#39;Price&#39;</span><span class="p">)</span>
  258. <span class="c1"># Display the arrays in a table</span>
  259. <span class="n">grouped_cones</span> <span class="o">=</span> <span class="n">Table</span><span class="p">()</span><span class="o">.</span><span class="n">with_columns</span><span class="p">(</span>
  260. <span class="s1">&#39;Flavor&#39;</span><span class="p">,</span> <span class="n">make_array</span><span class="p">(</span><span class="s1">&#39;chocolate&#39;</span><span class="p">,</span> <span class="s1">&#39;strawberry&#39;</span><span class="p">),</span>
  261. <span class="s1">&#39;Array of All the Prices&#39;</span><span class="p">,</span> <span class="n">make_array</span><span class="p">(</span><span class="n">cones_choc</span><span class="p">,</span> <span class="n">cones_strawb</span><span class="p">)</span>
  262. <span class="p">)</span>
  263. <span class="c1"># Append a column with the sum of the `Price` values in each array</span>
  264. <span class="n">price_totals</span> <span class="o">=</span> <span class="n">grouped_cones</span><span class="o">.</span><span class="n">with_column</span><span class="p">(</span>
  265. <span class="s1">&#39;Sum of the Array&#39;</span><span class="p">,</span> <span class="n">make_array</span><span class="p">(</span><span class="nb">sum</span><span class="p">(</span><span class="n">cones_choc</span><span class="p">),</span> <span class="nb">sum</span><span class="p">(</span><span class="n">cones_strawb</span><span class="p">))</span>
  266. <span class="p">)</span>
  267. <span class="n">price_totals</span>
  268. </pre></div>
  269. </div>
  270. </div>
  271. </div>
  272. <div class="output_wrapper">
  273. <div class="output">
  274. <div class="jb_output_wrapper }}">
  275. <div class="output_area">
  276. <div class="output_html rendered_html output_subarea output_execute_result">
  277. <table border="1" class="dataframe">
  278. <thead>
  279. <tr>
  280. <th>Flavor</th> <th>Array of All the Prices</th> <th>Sum of the Array</th>
  281. </tr>
  282. </thead>
  283. <tbody>
  284. <tr>
  285. <td>chocolate </td> <td>[4.75 6.55 5.25] </td> <td>16.55 </td>
  286. </tr>
  287. <tr>
  288. <td>strawberry</td> <td>[3.55 5.25] </td> <td>8.8 </td>
  289. </tr>
  290. </tbody>
  291. </table>
  292. </div>
  293. </div>
  294. </div>
  295. </div>
  296. </div>
  297. </div>
  298. </div>
  299. <div class="jb_cell">
  300. <div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
  301. <div class="text_cell_render border-box-sizing rendered_html">
  302. <p>You can replace <code>sum</code> by any other functions that work on arrays. For example, you could use <code>max</code> to find the largest price in each category:</p>
  303. </div>
  304. </div>
  305. </div>
  306. </div>
  307. <div class="jb_cell">
  308. <div class="cell border-box-sizing code_cell rendered">
  309. <div class="input">
  310. <div class="inner_cell">
  311. <div class="input_area">
  312. <div class=" highlight hl-ipython3"><pre><span></span><span class="n">cones</span><span class="o">.</span><span class="n">group</span><span class="p">(</span><span class="s1">&#39;Flavor&#39;</span><span class="p">,</span> <span class="nb">max</span><span class="p">)</span>
  313. </pre></div>
  314. </div>
  315. </div>
  316. </div>
  317. <div class="output_wrapper">
  318. <div class="output">
  319. <div class="jb_output_wrapper }}">
  320. <div class="output_area">
  321. <div class="output_html rendered_html output_subarea output_execute_result">
  322. <table border="1" class="dataframe">
  323. <thead>
  324. <tr>
  325. <th>Flavor</th> <th>Price max</th>
  326. </tr>
  327. </thead>
  328. <tbody>
  329. <tr>
  330. <td>chocolate </td> <td>6.55 </td>
  331. </tr>
  332. <tr>
  333. <td>strawberry</td> <td>5.25 </td>
  334. </tr>
  335. </tbody>
  336. </table>
  337. </div>
  338. </div>
  339. </div>
  340. </div>
  341. </div>
  342. </div>
  343. </div>
  344. <div class="jb_cell">
  345. <div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
  346. <div class="text_cell_render border-box-sizing rendered_html">
  347. <p>Once again, <code>group</code> creates arrays of the prices in each <code>Flavor</code> category. But now it finds the <code>max</code> of each array:</p>
  348. </div>
  349. </div>
  350. </div>
  351. </div>
  352. <div class="jb_cell">
  353. <div class="cell border-box-sizing code_cell rendered">
  354. <div class="input">
  355. <div class="inner_cell">
  356. <div class="input_area">
  357. <div class=" highlight hl-ipython3"><pre><span></span><span class="n">price_maxes</span> <span class="o">=</span> <span class="n">grouped_cones</span><span class="o">.</span><span class="n">with_column</span><span class="p">(</span>
  358. <span class="s1">&#39;Max of the Array&#39;</span><span class="p">,</span> <span class="n">make_array</span><span class="p">(</span><span class="nb">max</span><span class="p">(</span><span class="n">cones_choc</span><span class="p">),</span> <span class="nb">max</span><span class="p">(</span><span class="n">cones_strawb</span><span class="p">))</span>
  359. <span class="p">)</span>
  360. <span class="n">price_maxes</span>
  361. </pre></div>
  362. </div>
  363. </div>
  364. </div>
  365. <div class="output_wrapper">
  366. <div class="output">
  367. <div class="jb_output_wrapper }}">
  368. <div class="output_area">
  369. <div class="output_html rendered_html output_subarea output_execute_result">
  370. <table border="1" class="dataframe">
  371. <thead>
  372. <tr>
  373. <th>Flavor</th> <th>Array of All the Prices</th> <th>Max of the Array</th>
  374. </tr>
  375. </thead>
  376. <tbody>
  377. <tr>
  378. <td>chocolate </td> <td>[4.75 6.55 5.25] </td> <td>6.55 </td>
  379. </tr>
  380. <tr>
  381. <td>strawberry</td> <td>[3.55 5.25] </td> <td>5.25 </td>
  382. </tr>
  383. </tbody>
  384. </table>
  385. </div>
  386. </div>
  387. </div>
  388. </div>
  389. </div>
  390. </div>
  391. </div>
  392. <div class="jb_cell">
  393. <div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
  394. <div class="text_cell_render border-box-sizing rendered_html">
  395. <p>Indeed, the original call to <code>group</code> with just one argument has the same effect as using <code>len</code> as the function and then cleaning up the table.</p>
  396. </div>
  397. </div>
  398. </div>
  399. </div>
  400. <div class="jb_cell">
  401. <div class="cell border-box-sizing code_cell rendered">
  402. <div class="input">
  403. <div class="inner_cell">
  404. <div class="input_area">
  405. <div class=" highlight hl-ipython3"><pre><span></span><span class="n">lengths</span> <span class="o">=</span> <span class="n">grouped_cones</span><span class="o">.</span><span class="n">with_column</span><span class="p">(</span>
  406. <span class="s1">&#39;Length of the Array&#39;</span><span class="p">,</span> <span class="n">make_array</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">cones_choc</span><span class="p">),</span> <span class="nb">len</span><span class="p">(</span><span class="n">cones_strawb</span><span class="p">))</span>
  407. <span class="p">)</span>
  408. <span class="n">lengths</span>
  409. </pre></div>
  410. </div>
  411. </div>
  412. </div>
  413. <div class="output_wrapper">
  414. <div class="output">
  415. <div class="jb_output_wrapper }}">
  416. <div class="output_area">
  417. <div class="output_html rendered_html output_subarea output_execute_result">
  418. <table border="1" class="dataframe">
  419. <thead>
  420. <tr>
  421. <th>Flavor</th> <th>Array of All the Prices</th> <th>Length of the Array</th>
  422. </tr>
  423. </thead>
  424. <tbody>
  425. <tr>
  426. <td>chocolate </td> <td>[4.75 6.55 5.25] </td> <td>3 </td>
  427. </tr>
  428. <tr>
  429. <td>strawberry</td> <td>[3.55 5.25] </td> <td>2 </td>
  430. </tr>
  431. </tbody>
  432. </table>
  433. </div>
  434. </div>
  435. </div>
  436. </div>
  437. </div>
  438. </div>
  439. </div>
  440. <div class="jb_cell">
  441. <div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
  442. <div class="text_cell_render border-box-sizing rendered_html">
  443. <h3 id="Example:-NBA-Salaries">Example: NBA Salaries<a class="anchor-link" href="#Example:-NBA-Salaries"> </a></h3><p>The table <code>nba</code> contains data on the 2015-2016 players in the National Basketball Association. We have examined these data earlier. Recall that salaries are measured in millions of dollars.</p>
  444. </div>
  445. </div>
  446. </div>
  447. </div>
  448. <div class="jb_cell">
  449. <div class="cell border-box-sizing code_cell rendered">
  450. <div class="input">
  451. <div class="inner_cell">
  452. <div class="input_area">
  453. <div class=" highlight hl-ipython3"><pre><span></span><span class="n">nba1</span> <span class="o">=</span> <span class="n">Table</span><span class="o">.</span><span class="n">read_table</span><span class="p">(</span><span class="n">path_data</span> <span class="o">+</span> <span class="s1">&#39;nba_salaries.csv&#39;</span><span class="p">)</span>
  454. <span class="n">nba</span> <span class="o">=</span> <span class="n">nba1</span><span class="o">.</span><span class="n">relabeled</span><span class="p">(</span><span class="s2">&quot;&#39;15-&#39;16 SALARY&quot;</span><span class="p">,</span> <span class="s1">&#39;SALARY&#39;</span><span class="p">)</span>
  455. <span class="n">nba</span>
  456. </pre></div>
  457. </div>
  458. </div>
  459. </div>
  460. <div class="output_wrapper">
  461. <div class="output">
  462. <div class="jb_output_wrapper }}">
  463. <div class="output_area">
  464. <div class="output_html rendered_html output_subarea output_execute_result">
  465. <table border="1" class="dataframe">
  466. <thead>
  467. <tr>
  468. <th>PLAYER</th> <th>POSITION</th> <th>TEAM</th> <th>SALARY</th>
  469. </tr>
  470. </thead>
  471. <tbody>
  472. <tr>
  473. <td>Paul Millsap </td> <td>PF </td> <td>Atlanta Hawks</td> <td>18.6717</td>
  474. </tr>
  475. <tr>
  476. <td>Al Horford </td> <td>C </td> <td>Atlanta Hawks</td> <td>12 </td>
  477. </tr>
  478. <tr>
  479. <td>Tiago Splitter </td> <td>C </td> <td>Atlanta Hawks</td> <td>9.75625</td>
  480. </tr>
  481. <tr>
  482. <td>Jeff Teague </td> <td>PG </td> <td>Atlanta Hawks</td> <td>8 </td>
  483. </tr>
  484. <tr>
  485. <td>Kyle Korver </td> <td>SG </td> <td>Atlanta Hawks</td> <td>5.74648</td>
  486. </tr>
  487. <tr>
  488. <td>Thabo Sefolosha </td> <td>SF </td> <td>Atlanta Hawks</td> <td>4 </td>
  489. </tr>
  490. <tr>
  491. <td>Mike Scott </td> <td>PF </td> <td>Atlanta Hawks</td> <td>3.33333</td>
  492. </tr>
  493. <tr>
  494. <td>Kent Bazemore </td> <td>SF </td> <td>Atlanta Hawks</td> <td>2 </td>
  495. </tr>
  496. <tr>
  497. <td>Dennis Schroder </td> <td>PG </td> <td>Atlanta Hawks</td> <td>1.7634 </td>
  498. </tr>
  499. <tr>
  500. <td>Tim Hardaway Jr.</td> <td>SG </td> <td>Atlanta Hawks</td> <td>1.30452</td>
  501. </tr>
  502. </tbody>
  503. </table>
  504. <p>... (407 rows omitted)</p>
  505. </div>
  506. </div>
  507. </div>
  508. </div>
  509. </div>
  510. </div>
  511. </div>
  512. <div class="jb_cell">
  513. <div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
  514. <div class="text_cell_render border-box-sizing rendered_html">
  515. <p><strong>1.</strong> How much money did each team pay for its players' salaries?</p>
  516. <p>The only columns involved are <code>TEAM</code> and <code>SALARY</code>. We have to <code>group</code> the rows by <code>TEAM</code> and then <code>sum</code> the salaries of the groups.</p>
  517. </div>
  518. </div>
  519. </div>
  520. </div>
  521. <div class="jb_cell">
  522. <div class="cell border-box-sizing code_cell rendered">
  523. <div class="input">
  524. <div class="inner_cell">
  525. <div class="input_area">
  526. <div class=" highlight hl-ipython3"><pre><span></span><span class="n">teams_and_money</span> <span class="o">=</span> <span class="n">nba</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="s1">&#39;TEAM&#39;</span><span class="p">,</span> <span class="s1">&#39;SALARY&#39;</span><span class="p">)</span>
  527. <span class="n">teams_and_money</span><span class="o">.</span><span class="n">group</span><span class="p">(</span><span class="s1">&#39;TEAM&#39;</span><span class="p">,</span> <span class="nb">sum</span><span class="p">)</span>
  528. </pre></div>
  529. </div>
  530. </div>
  531. </div>
  532. <div class="output_wrapper">
  533. <div class="output">
  534. <div class="jb_output_wrapper }}">
  535. <div class="output_area">
  536. <div class="output_html rendered_html output_subarea output_execute_result">
  537. <table border="1" class="dataframe">
  538. <thead>
  539. <tr>
  540. <th>TEAM</th> <th>SALARY sum</th>
  541. </tr>
  542. </thead>
  543. <tbody>
  544. <tr>
  545. <td>Atlanta Hawks </td> <td>69.5731 </td>
  546. </tr>
  547. <tr>
  548. <td>Boston Celtics </td> <td>50.2855 </td>
  549. </tr>
  550. <tr>
  551. <td>Brooklyn Nets </td> <td>57.307 </td>
  552. </tr>
  553. <tr>
  554. <td>Charlotte Hornets </td> <td>84.1024 </td>
  555. </tr>
  556. <tr>
  557. <td>Chicago Bulls </td> <td>78.8209 </td>
  558. </tr>
  559. <tr>
  560. <td>Cleveland Cavaliers </td> <td>102.312 </td>
  561. </tr>
  562. <tr>
  563. <td>Dallas Mavericks </td> <td>65.7626 </td>
  564. </tr>
  565. <tr>
  566. <td>Denver Nuggets </td> <td>62.4294 </td>
  567. </tr>
  568. <tr>
  569. <td>Detroit Pistons </td> <td>42.2118 </td>
  570. </tr>
  571. <tr>
  572. <td>Golden State Warriors</td> <td>94.0851 </td>
  573. </tr>
  574. </tbody>
  575. </table>
  576. <p>... (20 rows omitted)</p>
  577. </div>
  578. </div>
  579. </div>
  580. </div>
  581. </div>
  582. </div>
  583. </div>
  584. <div class="jb_cell">
  585. <div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
  586. <div class="text_cell_render border-box-sizing rendered_html">
  587. <p><strong>2.</strong> How many NBA players were there in each of the five positions?</p>
  588. <p>We have to classify by <code>POSITION</code>, and count. This can be done with just one argument to group:</p>
  589. </div>
  590. </div>
  591. </div>
  592. </div>
  593. <div class="jb_cell">
  594. <div class="cell border-box-sizing code_cell rendered">
  595. <div class="input">
  596. <div class="inner_cell">
  597. <div class="input_area">
  598. <div class=" highlight hl-ipython3"><pre><span></span><span class="n">nba</span><span class="o">.</span><span class="n">group</span><span class="p">(</span><span class="s1">&#39;POSITION&#39;</span><span class="p">)</span>
  599. </pre></div>
  600. </div>
  601. </div>
  602. </div>
  603. <div class="output_wrapper">
  604. <div class="output">
  605. <div class="jb_output_wrapper }}">
  606. <div class="output_area">
  607. <div class="output_html rendered_html output_subarea output_execute_result">
  608. <table border="1" class="dataframe">
  609. <thead>
  610. <tr>
  611. <th>POSITION</th> <th>count</th>
  612. </tr>
  613. </thead>
  614. <tbody>
  615. <tr>
  616. <td>C </td> <td>69 </td>
  617. </tr>
  618. <tr>
  619. <td>PF </td> <td>85 </td>
  620. </tr>
  621. <tr>
  622. <td>PG </td> <td>85 </td>
  623. </tr>
  624. <tr>
  625. <td>SF </td> <td>82 </td>
  626. </tr>
  627. <tr>
  628. <td>SG </td> <td>96 </td>
  629. </tr>
  630. </tbody>
  631. </table>
  632. </div>
  633. </div>
  634. </div>
  635. </div>
  636. </div>
  637. </div>
  638. </div>
  639. <div class="jb_cell">
  640. <div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
  641. <div class="text_cell_render border-box-sizing rendered_html">
  642. <p><strong>3.</strong> What was the average salary of the players at each of the five positions?</p>
  643. <p>This time, we have to group by <code>POSITION</code> and take the mean of the salaries. For clarity, we will work with a table of just the positions and the salaries.</p>
  644. </div>
  645. </div>
  646. </div>
  647. </div>
  648. <div class="jb_cell">
  649. <div class="cell border-box-sizing code_cell rendered">
  650. <div class="input">
  651. <div class="inner_cell">
  652. <div class="input_area">
  653. <div class=" highlight hl-ipython3"><pre><span></span><span class="n">positions_and_money</span> <span class="o">=</span> <span class="n">nba</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="s1">&#39;POSITION&#39;</span><span class="p">,</span> <span class="s1">&#39;SALARY&#39;</span><span class="p">)</span>
  654. <span class="n">positions_and_money</span><span class="o">.</span><span class="n">group</span><span class="p">(</span><span class="s1">&#39;POSITION&#39;</span><span class="p">,</span> <span class="n">np</span><span class="o">.</span><span class="n">mean</span><span class="p">)</span>
  655. </pre></div>
  656. </div>
  657. </div>
  658. </div>
  659. <div class="output_wrapper">
  660. <div class="output">
  661. <div class="jb_output_wrapper }}">
  662. <div class="output_area">
  663. <div class="output_html rendered_html output_subarea output_execute_result">
  664. <table border="1" class="dataframe">
  665. <thead>
  666. <tr>
  667. <th>POSITION</th> <th>SALARY mean</th>
  668. </tr>
  669. </thead>
  670. <tbody>
  671. <tr>
  672. <td>C </td> <td>6.08291 </td>
  673. </tr>
  674. <tr>
  675. <td>PF </td> <td>4.95134 </td>
  676. </tr>
  677. <tr>
  678. <td>PG </td> <td>5.16549 </td>
  679. </tr>
  680. <tr>
  681. <td>SF </td> <td>5.53267 </td>
  682. </tr>
  683. <tr>
  684. <td>SG </td> <td>3.9882 </td>
  685. </tr>
  686. </tbody>
  687. </table>
  688. </div>
  689. </div>
  690. </div>
  691. </div>
  692. </div>
  693. </div>
  694. </div>
  695. <div class="jb_cell">
  696. <div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
  697. <div class="text_cell_render border-box-sizing rendered_html">
  698. <p>Center was the most highly paid position, at an average of over 6 million dollars.</p>
  699. <p>If we had not selected the two columns as our first step, <code>group</code> would not attempt to "average" the categorical columns in <code>nba</code>. (It is impossible to average two strings like "Atlanta Hawks" and "Boston Celtics".) It performs arithmetic only on numerical columns and leaves the rest blank.</p>
  700. </div>
  701. </div>
  702. </div>
  703. </div>
  704. <div class="jb_cell">
  705. <div class="cell border-box-sizing code_cell rendered">
  706. <div class="input">
  707. <div class="inner_cell">
  708. <div class="input_area">
  709. <div class=" highlight hl-ipython3"><pre><span></span><span class="n">nba</span><span class="o">.</span><span class="n">group</span><span class="p">(</span><span class="s1">&#39;POSITION&#39;</span><span class="p">,</span> <span class="n">np</span><span class="o">.</span><span class="n">mean</span><span class="p">)</span>
  710. </pre></div>
  711. </div>
  712. </div>
  713. </div>
  714. <div class="output_wrapper">
  715. <div class="output">
  716. <div class="jb_output_wrapper }}">
  717. <div class="output_area">
  718. <div class="output_html rendered_html output_subarea output_execute_result">
  719. <table border="1" class="dataframe">
  720. <thead>
  721. <tr>
  722. <th>POSITION</th> <th>PLAYER mean</th> <th>TEAM mean</th> <th>SALARY mean</th>
  723. </tr>
  724. </thead>
  725. <tbody>
  726. <tr>
  727. <td>C </td> <td> </td> <td> </td> <td>6.08291 </td>
  728. </tr>
  729. <tr>
  730. <td>PF </td> <td> </td> <td> </td> <td>4.95134 </td>
  731. </tr>
  732. <tr>
  733. <td>PG </td> <td> </td> <td> </td> <td>5.16549 </td>
  734. </tr>
  735. <tr>
  736. <td>SF </td> <td> </td> <td> </td> <td>5.53267 </td>
  737. </tr>
  738. <tr>
  739. <td>SG </td> <td> </td> <td> </td> <td>3.9882 </td>
  740. </tr>
  741. </tbody>
  742. </table>
  743. </div>
  744. </div>
  745. </div>
  746. </div>
  747. </div>
  748. </div>
  749. </div>