+ "In this example, we will explore statistics for two classic novels: *The Adventures of Huckleberry Finn* by Mark Twain, and *Little Women* by Louisa May Alcott. The text of any book can be read by a computer at great speed. Books published before 1923 are currently in the *public domain*, meaning that everyone has the right to copy or use the text in any way. [Project Gutenberg](http://www.gutenberg.org/) is a website that publishes public domain books online. Using Python, we can load the text of these books directly from the web.\n",
+ "\n",
+ "This example is meant to illustrate some of the broad themes of this text. Don't worry if the details of the program don't yet make sense. Instead, focus on interpreting the images generated below. Later sections of the text will describe most of the features of the Python programming language used below.\n",
+ "\n",
+ "First, we read the text of both books into lists of chapters, called `huck_finn_chapters` and `little_women_chapters`. In Python, a name cannot contain any spaces, and so we will often use an underscore `_` to stand in for a space. The `=` in the lines below gives a name on the left to the result of some computation described on the right. A *uniform resource locator* or *URL* is an address on the Internet for some content; in this case, the text of a book. The `#` symbol starts a comment, which is ignored by the computer but helpful for people reading the code."
+ "While a computer cannot understand the text of a book, it can provide us with some insight into the structure of the text. The name `huck_finn_chapters` is currently bound to a list of all the chapters in the book. We can place them into a table to see how each chapter begins."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 9,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "<div>\n",
+ "<style scoped>\n",
+ " .dataframe tbody tr th:only-of-type {\n",
+ " vertical-align: middle;\n",
+ " }\n",
+ "\n",
+ " .dataframe tbody tr th {\n",
+ " vertical-align: top;\n",
+ " }\n",
+ "\n",
+ " .dataframe thead th {\n",
+ " text-align: right;\n",
+ " }\n",
+ "</style>\n",
+ "<table border=\"1\" class=\"dataframe\">\n",
+ " <thead>\n",
+ " <tr style=\"text-align: right;\">\n",
+ " <th></th>\n",
+ " <th>Chapters</th>\n",
+ " </tr>\n",
+ " </thead>\n",
+ " <tbody>\n",
+ " <tr>\n",
+ " <th>0</th>\n",
+ " <td>I. YOU don't know about me without you have re...</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>1</th>\n",
+ " <td>II. WE went tiptoeing along a path amongst the...</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>2</th>\n",
+ " <td>III. WELL, I got a good going-over in the morn...</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>3</th>\n",
+ " <td>IV. WELL, three or four months run along, and ...</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>4</th>\n",
+ " <td>V. I had shut the door to. Then I turned aroun...</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>5</th>\n",
+ " <td>VI. WELL, pretty soon the old man was up and a...</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>6</th>\n",
+ " <td>VII. \"GIT up! What you 'bout?\" I opened my eye...</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>7</th>\n",
+ " <td>VIII. THE sun was up so high when I waked that...</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>8</th>\n",
+ " <td>IX. I wanted to go and look at a place right a...</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>9</th>\n",
+ " <td>X. AFTER breakfast I wanted to talk about the ...</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>10</th>\n",
+ " <td>XI. \"COME in,\" says the woman, and I did. She ...</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>11</th>\n",
+ " <td>XII. IT must a been close on to one o'clock wh...</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>12</th>\n",
+ " <td>XIII. WELL, I catched my breath and most faint...</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>13</th>\n",
+ " <td>XIV. BY and by, when we got up, we turned over...</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>14</th>\n",
+ " <td>XV. WE judged that three nights more would fet...</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>15</th>\n",
+ " <td>XVI. WE slept most all day, and started out at...</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>16</th>\n",
+ " <td>XVII. IN about a minute somebody spoke out of ...</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>17</th>\n",
+ " <td>XVIII. COL. Grangerford was a gentleman, you s...</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>18</th>\n",
+ " <td>XIX. TWO or three days and nights went by; I r...</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>19</th>\n",
+ " <td>XX. THEY asked us considerable many questions;...</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>20</th>\n",
+ " <td>XXI. IT was after sun-up now, but we went righ...</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>21</th>\n",
+ " <td>XXII. THEY swarmed up towards Sherburn's house...</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>22</th>\n",
+ " <td>XXIII. WELL, all day him and the king was hard...</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>23</th>\n",
+ " <td>XXIV. NEXT day, towards night, we laid up unde...</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>24</th>\n",
+ " <td>XXV. THE news was all over town in two minutes...</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>25</th>\n",
+ " <td>XXVI. WELL, when they was all gone the king he...</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>26</th>\n",
+ " <td>XXVII. I crept to their doors and listened; th...</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>27</th>\n",
+ " <td>XXVIII. BY and by it was getting-up time. So I...</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>28</th>\n",
+ " <td>XXIX. THEY was fetching a very nice-looking ol...</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>29</th>\n",
+ " <td>XXX. WHEN they got aboard the king went for me...</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>30</th>\n",
+ " <td>XXXI. WE dasn't stop again at any town for day...</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>31</th>\n",
+ " <td>XXXII. WHEN I got there it was all still and S...</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>32</th>\n",
+ " <td>XXXIII. SO I started for town in the wagon, an...</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>33</th>\n",
+ " <td>XXXIV. WE stopped talking, and got to thinking...</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>34</th>\n",
+ " <td>XXXV. IT would be most an hour yet till breakf...</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>35</th>\n",
+ " <td>XXXVI. AS soon as we reckoned everybody was as...</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>36</th>\n",
+ " <td>XXXVII. THAT was all fixed. So then we went aw...</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>37</th>\n",
+ " <td>XXXVIII. MAKING them pens was a distressid tou...</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>38</th>\n",
+ " <td>XXXIX. IN the morning we went up to the villag...</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>39</th>\n",
+ " <td>XL. WE was feeling pretty good after breakfast...</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>40</th>\n",
+ " <td>XLI. THE doctor was an old man; a very nice, k...</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>41</th>\n",
+ " <td>XLII. THE old man was uptown again before brea...</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>42</th>\n",
+ " <td>THE LAST THE first time I catched Tom private ...</td>\n",
+ " </tr>\n",
+ " </tbody>\n",
+ "</table>\n",
+ "</div>"
+ ],
+ "text/plain": [
+ " Chapters\n",
+ "0 I. YOU don't know about me without you have re...\n",
+ "1 II. WE went tiptoeing along a path amongst the...\n",
+ "2 III. WELL, I got a good going-over in the morn...\n",
+ "3 IV. WELL, three or four months run along, and ...\n",
+ "4 V. I had shut the door to. Then I turned aroun...\n",
+ "5 VI. WELL, pretty soon the old man was up and a...\n",
+ "6 VII. \"GIT up! What you 'bout?\" I opened my eye...\n",
+ "7 VIII. THE sun was up so high when I waked that...\n",
+ "8 IX. I wanted to go and look at a place right a...\n",
+ "9 X. AFTER breakfast I wanted to talk about the ...\n",
+ "10 XI. \"COME in,\" says the woman, and I did. She ...\n",
+ "11 XII. IT must a been close on to one o'clock wh...\n",
+ "12 XIII. WELL, I catched my breath and most faint...\n",
+ "13 XIV. BY and by, when we got up, we turned over...\n",
+ "14 XV. WE judged that three nights more would fet...\n",
+ "15 XVI. WE slept most all day, and started out at...\n",
+ "16 XVII. IN about a minute somebody spoke out of ...\n",
+ "17 XVIII. COL. Grangerford was a gentleman, you s...\n",
+ "18 XIX. TWO or three days and nights went by; I r...\n",
+ "19 XX. THEY asked us considerable many questions;...\n",
+ "20 XXI. IT was after sun-up now, but we went righ...\n",
+ "21 XXII. THEY swarmed up towards Sherburn's house...\n",
+ "22 XXIII. WELL, all day him and the king was hard...\n",
+ "23 XXIV. NEXT day, towards night, we laid up unde...\n",
+ "24 XXV. THE news was all over town in two minutes...\n",
+ "25 XXVI. WELL, when they was all gone the king he...\n",
+ "26 XXVII. I crept to their doors and listened; th...\n",
+ "27 XXVIII. BY and by it was getting-up time. So I...\n",
+ "28 XXIX. THEY was fetching a very nice-looking ol...\n",
+ "29 XXX. WHEN they got aboard the king went for me...\n",
+ "30 XXXI. WE dasn't stop again at any town for day...\n",
+ "31 XXXII. WHEN I got there it was all still and S...\n",
+ "32 XXXIII. SO I started for town in the wagon, an...\n",
+ "33 XXXIV. WE stopped talking, and got to thinking...\n",
+ "34 XXXV. IT would be most an hour yet till breakf...\n",
+ "35 XXXVI. AS soon as we reckoned everybody was as...\n",
+ "36 XXXVII. THAT was all fixed. So then we went aw...\n",
+ "37 XXXVIII. MAKING them pens was a distressid tou...\n",
+ "38 XXXIX. IN the morning we went up to the villag...\n",
+ "39 XL. WE was feeling pretty good after breakfast...\n",
+ "40 XLI. THE doctor was an old man; a very nice, k...\n",
+ "41 XLII. THE old man was uptown again before brea...\n",
+ "42 THE LAST THE first time I catched Tom private ..."
+ ]
+ },
+ "execution_count": 9,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "# Display the chapters of Huckleberry Finn in a dataframe.\n",
+ "\n",
+ "pd.DataFrame({'Chapters':huck_finn_chapters})"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Each chapter begins with a chapter number in Roman numerals, followed by the first sentence of the chapter. Project Gutenberg has printed the first word of each chapter in upper case. "
+Data Science is about drawing useful conclusions from large and diverse data
+sets through exploration, prediction, and inference. Exploration involves
+identifying patterns in information. Prediction involves using information
+we know to make informed guesses about values we wish we knew. Inference
+involves quantifying our degree of certainty: will the patterns that we found in our data also appear in new observations? How accurate are our predictions? Our primary
+tools for exploration are visualizations and descriptive statistics, for
+prediction are machine learning and optimization, and for inference are
+statistical tests and models.
+
+Statistics is a central component of data science because statistics
+studies how to make robust conclusions based on incomplete information. Computing
+is a central component because programming allows us to apply analysis
+techniques to the large and diverse data sets that arise in real-world
+applications: not just numbers, but text, images, videos, and sensor readings.
+Data science is all of these things, but it is more than the sum of its parts
+because of the applications. Through understanding a particular domain, data
+scientists learn to ask appropriate questions about their data and correctly
+interpret the answers provided by our inferential and computational tools.
+# Observation and Visualization: John Snow and the Broad Street Pump
+One of the most powerful examples of astute observation eventually leading to the
+establishment of causality dates back more than 150 years. To get your mind into
+the right timeframe, try to imagine London in the 1850s. It was the world’s
+wealthiest city, but many of its people were desperately poor. Charles Dickens,
+then at the height of his fame, was writing about their plight. Disease was rife
+in the poorer parts of the city, and cholera was among the most feared. It was
+not yet known that germs cause disease; the leading theory was that “miasmas”
+were the main culprit. Miasmas manifested themselves as bad smells, and were
+thought to be invisible poisonous particles arising out of decaying matter.
+Parts of London did smell very bad, especially in hot weather. To protect
+themselves against infection, those who could afford to held sweet-smelling
+things to their noses.
+
+For several years, a doctor by the name of John Snow had been following the
+devastating waves of cholera that hit England from time to time. The disease
+arrived suddenly and was almost immediately deadly: people died within a day or
+two of contracting it, hundreds could die in a week, and the total death toll in
+a single wave could reach tens of thousands. Snow was skeptical of the miasma
+theory. He had noticed that while entire households were wiped out by cholera,
+the people in neighboring houses sometimes remained completely unaffected. As
+they were breathing the same air—and miasmas—as their neighbors, there was no
+compelling association between bad smells and the incidence of cholera.
+
+Snow had also noticed that the onset of the disease almost always involved
+vomiting and diarrhea. He therefore believed that the infection was carried by
+something people ate or drank, not by the air that they breathed. His prime
+suspect was water contaminated by sewage.
+
+At the end of August 1854, cholera struck in the overcrowded Soho district of
+London. As the deaths mounted, Snow recorded them diligently, using a method
+that went on to become standard in the study of how diseases spread: *he drew a
+map*. On a street map of the district, he recorded the location of each death.
+
+Here is Snow’s original map. Each black bar represents one death. When there are multiple deaths at the same address, the bars corresponding to those deaths are stacked on top of each other. The black
+discs mark the locations of water pumps. The map displays a striking
+revelation—the deaths are roughly clustered around the Broad Street pump.
+
+
+Snow studied his map carefully and investigated the apparent anomalies. All of
+them implicated the Broad Street pump. For example:
+- There were deaths in houses that were nearer the Rupert Street pump than the
+ Broad Street pump. Though the Rupert Street pump was closer as the crow flies,
+ it was less convenient to get to because of dead ends and the layout of the
+ streets. The residents in those houses used the Broad Street pump instead.
+- There were no deaths in two blocks just east of the pump. That was the
+ location of the Lion Brewery, where the workers drank what they brewed. If
+ they wanted water, the brewery had its own well.
+- There were scattered deaths in houses several blocks away from the Broad
+ Street pump. Those were children who drank from the Broad Street pump on their
+ way to school. The pump’s water was known to be cool and refreshing.
+
+The final piece of evidence in support of Snow’s theory was provided by two
+isolated deaths in the leafy and genteel Hampstead area, quite far from Soho.
+Snow was puzzled by these until he learned that the deceased were Mrs. Susannah
+Eley, who had once lived in Broad Street, and her niece. Mrs. Eley had water
+from the Broad Street pump delivered to her in Hampstead every day. She liked
+its taste.
+
+Later it was discovered that a cesspit that was just a few feet away from the
+well of the Broad Street pump had been leaking into the well. Thus the pump’s
+water was contaminated by sewage from the houses of cholera victims.
+
+Snow used his map to convince local authorities to remove the handle of the
+Broad Street pump. Though the cholera epidemic was already on the wane when he
+did so, it is possible that the disabling of the pump prevented many deaths from
+future waves of the disease.
+
+The removal of the Broad Street pump handle has become the stuff of legend. At
+the Centers for Disease Control and Prevention (CDC) in Atlanta, when scientists look for
+simple answers to questions about epidemics, they sometimes ask each other,
+“Where is the handle to this pump?”
+
+Snow’s map is one of the earliest and most powerful uses of data visualization.
+Disease maps of various kinds are now a standard tool for tracking epidemics.
+
+**Towards Causality**
+
+Though the map gave Snow a strong indication that the cleanliness of the water
+supply was the key to controlling cholera, he was still a long way from a
+convincing scientific argument that contaminated water was causing the spread of
+the disease. To make a more compelling case, he had to use the method of
+*comparison*.
+
+Scientists use comparison to identify an association between a treatment and an
+outcome. They compare the outcomes of a group of individuals who got the
+treatment (the *treatment group*) to the outcomes of a group who did not (the
+*control group*). For example, researchers today might compare the average
+murder rate in states that have the death penalty with the average murder rate
+in states that don’t.
+
+If the results are different, that is evidence for an association. To determine
+In the language developed earlier in the section, you can think of the people in
+the S&V houses as the treatment group, and those in the Lambeth houses as the
+control group. A crucial element in Snow’s analysis was that the people in the
+two groups were comparable to each other, apart from the treatment.
+
+In order to establish whether it was the water supply that was causing cholera,
+Snow had to compare two groups that were similar to each other in all but one
+aspect—their water supply. Only then would he be able to ascribe the differences
+in their outcomes to the water supply. If the two groups had been different in
+some other way as well, it would have been difficult to point the finger at the
+water supply as the source of the disease. For example, if the treatment group
+consisted of factory workers and the control group did not, then differences
+between the outcomes in the two groups could have been due to the water supply,
+or to factory work, or both. The final picture would have been much fuzzier.
+
+Snow’s brilliance lay in identifying two groups that would make his comparison
+clear. He had set out to establish a causal relation between contaminated water
+and cholera infection, and to a great extent he succeeded, even though the
+miasmatists ignored and even ridiculed him. Of course, Snow did not understand
+the detailed mechanism by which humans contract cholera. That discovery was made
+in 1883, when the German scientist Robert Koch isolated the *Vibrio cholerae*,
+the bacterium that enters the human small intestine and causes cholera.
+
+In fact the *Vibrio cholerae* had been identified in 1854 by Filippo Pacini in
+Italy, just about when Snow was analyzing his data in London. Because of the
+dominance of the miasmatists in Italy, Pacini’s discovery languished unknown.
+But by the end of the 1800s, the miasma brigade was in retreat. Subsequent
+history has vindicated Pacini and John Snow. Snow’s methods led to the
+development of the field of *epidemiology*, which is the study of the spread of
+diseases.
+
+**Confounding**
+
+Let us now return to more modern times, armed with an important lesson that we
+have learned along the way:
+
+**In an observational study, if the treatment and control groups differ in ways
+other than the treatment, it is difficult to make conclusions about causality.**
+
+An underlying difference between the two groups (other than the treatment) is
+called a *confounding factor*, because it might confound you (that is, mess you
+up) when you try to reach a conclusion.
+
+**Example: Coffee and lung cancer.** Studies in the 1960s showed that coffee
+drinkers had higher rates of lung cancer than those who did not drink coffee.
+Because of this, some people identified coffee as a cause of lung cancer. But
+coffee does not cause lung cancer. The analysis contained a confounding factor—smoking. In those days, coffee drinkers were also likely to have been smokers,
+and smoking does cause lung cancer. Coffee drinking was associated with lung
+cancer, but it did not cause the disease.
+
+Confounding factors are common in observational studies. Good studies take great
+care to reduce confounding and to account for its effects.
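
The coffee example can be made concrete with a small sketch. The counts below are invented purely for illustration (they are not data from any study), and the names `groups`, `select`, and `cancer_rate` are hypothetical. In this made-up population the lung cancer rate depends only on smoking, yet coffee drinkers still show a higher overall rate because more of them smoke.

```python
# Hypothetical counts, invented for illustration only (not real study data).
# Each row: (drinks_coffee, smokes, number_of_people, lung_cancer_cases)
groups = [
    (True,  True,  800, 80),  # coffee drinkers who smoke
    (True,  False, 200,  2),  # coffee drinkers who do not smoke
    (False, True,  200, 20),  # non-coffee-drinkers who smoke
    (False, False, 800,  8),  # non-coffee-drinkers who do not smoke
]

def select(coffee=None, smokes=None):
    """Return the rows matching the given coffee/smoking status (None = any)."""
    return [
        row for row in groups
        if (coffee is None or row[0] == coffee)
        and (smokes is None or row[1] == smokes)
    ]

def cancer_rate(rows):
    """Proportion of people with lung cancer among the given rows."""
    people = sum(row[2] for row in rows)
    cases = sum(row[3] for row in rows)
    return cases / people

# Overall comparison: coffee looks associated with lung cancer.
coffee_rate = cancer_rate(select(coffee=True))
no_coffee_rate = cancer_rate(select(coffee=False))

# Comparison within each smoking group: the association disappears.
smoker_rates = (cancer_rate(select(True, True)), cancer_rate(select(False, True)))
nonsmoker_rates = (cancer_rate(select(True, False)), cancer_rate(select(False, False)))

print("coffee vs. no coffee:", coffee_rate, no_coffee_rate)
print("among smokers:", smoker_rates)
print("among non-smokers:", nonsmoker_rates)
```

Comparing rates separately for smokers and non-smokers is one simple way to account for a suspected confounding factor. It only helps when the confounder has been identified and measured.
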
+In the terminology that we have developed, John Snow conducted an
+observational study, not a randomized experiment. But he called his study a
+“grand experiment” because, as he wrote, “No fewer than three hundred thousand
+people … were divided into two groups without their choice, and in most cases,
+without their knowledge …”
+
+Studies such as Snow’s are sometimes called “natural experiments.” However, true
+randomization does not simply mean that the treatment and control groups are
+selected “without their choice.”
+
+The method of randomization can be as simple as tossing a coin. It may also be
+quite a bit more complex. But every method of randomization consists of a
+sequence of carefully defined steps that allow chances to be specified
+mathematically. This has two important consequences.
+
+1. It allows us to account—mathematically—for the possibility that randomization
+ produces treatment and control groups that are quite different from each
+ other.
+
+2. It allows us to make precise mathematical statements about differences
+ between the treatment and control groups. This in turn helps us make
+ justifiable conclusions about whether the treatment has any effect.
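
As a minimal sketch of the coin-tossing idea, the code below assigns each individual to the treatment or control group with an equal chance. The function names, the seed, and the 20 numbered participants are all assumptions made for illustration. Because the procedure is fully specified, a chance such as "everyone lands in the same group" can be written down exactly.

```python
import random

def assign_by_coin_toss(individuals, seed=None):
    """Assign each individual to treatment or control by a fair coin toss."""
    rng = random.Random(seed)  # a seed makes the illustration repeatable
    treatment, control = [], []
    for person in individuals:
        if rng.random() < 0.5:   # "heads": treatment
            treatment.append(person)
        else:                    # "tails": control
            control.append(person)
    return treatment, control

def chance_all_same_group(n):
    """Chance that n fair coin tosses put everyone in one group.

    All heads has chance 0.5 ** n, and so does all tails.
    """
    return 2 * 0.5 ** n

individuals = list(range(1, 21))  # 20 hypothetical participants
treatment, control = assign_by_coin_toss(individuals, seed=0)

print("treatment group:", treatment)
print("control group:", control)
print("chance all 20 share a group:", chance_all_same_group(20))
```

The groups produced this way need not be the same size. Real experiments often use other randomization schemes, for example ones that fix the group sizes in advance, but the chances involved can still be specified mathematically.
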
+
+
+In this course, you will learn how to conduct and analyze your own randomized
+experiments. That will involve more detail than has been presented in this
+chapter. For now, just focus on the main idea: to try to establish causality,
+run a randomized controlled experiment if possible. If you are conducting an
+observational study, you might be able to establish association but it will be harder to establish causation. Be extremely careful about confounding factors before making
+conclusions about causality based on an observational study.
+
+**Terminology**
+
+* observational study
+* treatment
+* outcome
+* association
+* causal association
+* causality
+* comparison
+* treatment group
+* control group
+* epidemiology
+* confounding
+* randomization
+* randomized controlled experiment
+* randomized controlled trial (RCT)
+* blind
+* placebo
+
+**Fun facts**
+
+1. John Snow is sometimes called the father of epidemiology, but he was an
+ anesthesiologist by profession. One of his patients was Queen Victoria, who
+ was an early recipient of anesthetics during childbirth.
+
+2. Florence Nightingale, the originator of modern nursing practices and famous
+ for her work in the Crimean War, was a die-hard miasmatist. She had no time
+ for theories about contagion and germs, and was not one for mincing her
+ words. “There is no end to the absurdities connected with this doctrine,” she
+ said. “Suffice it to say that in the ordinary sense of the word, there is no
+ proof such as would be admitted in any scientific enquiry that there is any
+ such thing as contagion.”
+
+3. A later RCT established that the conditions on which PROGRESA insisted—children
+ going to school, preventive health care—were not necessary to
+ achieve increased enrollment. Just the financial boost of the welfare
+ payments was sufficient.
+
+
+**Good reads**
+
+[*The Strange Case of the Broad Street Pump: John Snow and the Mystery of
+Cholera*](http://www.ucpress.edu/book.php?isbn=9780520250499) by Sandra Hempel,
+published by our own University of California Press, reads like a whodunit. It
+was one of the main sources for this section's account of John Snow and his
+work. A word of warning: some of the contents of the book are stomach-churning.
+
+[*Poor Economics*](http://www.pooreconomics.com), the best seller by Abhijit Banerjee and Esther Duflo of MIT, is an accessible and lively account of ways to
+fight global poverty. It includes numerous examples of RCTs, including the
+*"These problems are, and will probably ever remain, among the inscrutable
+secrets of nature. They belong to a class of questions radically inaccessible to
+the human intelligence."* —The Times of London, September 1849, on how cholera
+is contracted and spread
+
+Does the death penalty have a deterrent effect? Is chocolate good for you? What
+causes breast cancer?
+
+All of these questions attempt to assign a cause to an effect. A careful
+examination of data can help shed light on questions like these. In this section
+you will learn some of the fundamental concepts involved in establishing
+causality.
+
+Observation is a key to good science. An *observational study* is one in which
+scientists make conclusions based on data that they have observed but had no
+hand in generating. In data science, many such studies involve observations on a
+group of individuals, a factor of interest called a *treatment*, and an
+*outcome* measured on each individual.
+
+It is easiest to think of the individuals as people. In a study of whether
+chocolate is good for the health, the individuals would indeed be people, the
+treatment would be eating chocolate, and the outcome might be a measure of heart disease. But individuals in observational studies need not be people. In a
+study of whether the death penalty has a deterrent effect, the individuals could
+be the 50 states of the union. A state law allowing the death penalty would be
+the treatment, and an outcome could be the state’s murder rate.
+
+The fundamental question is whether the treatment has an effect on the outcome.
+Any relation between the treatment and the outcome is called an *association*.
+If the treatment causes the outcome to occur, then the association is *causal*.
+*Causality* is at the heart of all three questions posed at the start of this
+section. For example, one of the questions was whether chocolate directly causes
+improvements in health, not just whether there is a relation between
+chocolate and health.
+
+The establishment of causality often takes place in two stages. First, an
+association is observed. Next, a more careful analysis leads to a decision about
+ "Programming languages are much simpler than human languages. Nonetheless, there are some rules of grammar to learn in any language, and that is where we will begin. In this text, we will use the [Python](https://www.python.org/) programming language. Learning the grammar rules is essential, and the same rules used in the most basic programs are also central to more sophisticated programs.\n",
+ "\n",
+ "Programs are made up of *expressions*, which describe to the computer how to combine pieces of data. For example, a multiplication expression consists of a `*` symbol between two numerical expressions. Expressions, such as `3 * 4`, are *evaluated* by the computer. The value (the result of *evaluation*) of the last expression in each cell, `12` in this case, is displayed below the cell."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 1,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "12"
+ ]
+ },
+ "execution_count": 1,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "3 * 4"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "The grammar rules of a programming language are rigid. In Python, the `*` symbol cannot appear twice in a row with a space in between. The computer will not try to interpret an expression that differs from its prescribed expression structures. Instead, it will show a `SyntaxError`. The *syntax* of a language is its set of grammar rules, and a `SyntaxError` indicates that an expression structure doesn't match any of the rules of the language."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 2,
+ "metadata": {
+ "tags": [
+ "raises-exception"
+ ]
+ },
+ "outputs": [
+ {
+ "ename": "SyntaxError",
+ "evalue": "invalid syntax (<ipython-input-2-012ea60b41dd>, line 1)",
+ "Small changes to an expression can change its meaning entirely. Below, the space between the `*`'s has been removed. Because `**` appears between two numerical expressions, the expression is a well-formed *exponentiation* expression (the first number raised to the power of the second: 3 times 3 times 3 times 3). The symbols `*` and `**` are called *operators*, and the values they combine are called *operands*."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 3,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "81"
+ ]
+ },
+ "execution_count": 3,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "3 ** 4"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "**Common Operators.** Data science often involves combining numerical values, and the set of operators in a programming language is designed so that expressions can be used to express any sort of arithmetic. In Python, the following operators are essential.\n",
+ "\n",
+ "| Expression Type | Operator | Example | Value |\n",
+ "Python expressions obey the same familiar rules of *precedence* as in algebra: multiplication and division occur before addition and subtraction. Parentheses can be used to group together smaller expressions within a larger expression."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 4,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "17.555555555555557"
+ ]
+ },
+ "execution_count": 4,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "1 + 2 * 3 * 4 * 5 / 6 ** 3 + 7 + 8 - 9 + 10"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 5,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "2017.0"
+ ]
+ },
+ "execution_count": 5,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "1 + 2 * (3 * 4 * 5 / 6) ** 3 + 7 + 8 - 9 + 10"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "This chapter introduces many types of expressions. Learning to program involves trying out everything you learn in combination, investigating the behavior of the computer. What happens if you divide by zero? What happens if you divide twice in a row? You don't always need to ask an expert (or the Internet); many of these details can be discovered by trying them out yourself. "
+ "Names are given to values in Python using an *assignment* statement. In an assignment, a name is followed by `=`, which is followed by any expression. The value of the expression to the right of `=` is *assigned* to the name. Once a name has a value assigned to it, the value will be substituted for that name in future expressions."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 1,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "30"
+ ]
+ },
+ "execution_count": 1,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "a = 10\n",
+ "b = 20\n",
+ "a + b"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "A previously assigned name can be used in the expression to the right of `=`. "
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 2,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "0.5"
+ ]
+ },
+ "execution_count": 2,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "quarter = 1/4\n",
+ "half = 2 * quarter\n",
+ "half"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "However, only the current value of an expression is assigned to a name. If that value changes later, names that were defined in terms of that value will not change automatically."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 3,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "0.5"
+ ]
+ },
+ "execution_count": 3,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "quarter = 4\n",
+ "half"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Names must start with a letter or an underscore, and can contain letters, numbers, and underscores. A name cannot contain a space; instead, it is common to use an underscore character `_` to replace each space. Names are only as useful as you make them; it's up to the programmer to choose names that are easy to interpret. Typically, more meaningful names can be invented than `a` and `b`. For example, to describe the sales tax on a $5 purchase in Berkeley, CA, the following names clarify the meaning of the various quantities involved."
+ "The relationship between two measurements of the same quantity taken at different times is often expressed as a *growth rate*. For example, the United States federal government [employed](http://www.bls.gov/opub/mlr/2013/article/industry-employment-and-output-projections-to-2022-1.htm) 2,766,000 people in 2002 and 2,814,000 people in 2012. To compute a growth rate, we must first decide which value to treat as the `initial` amount. For values over time, the earlier value is a natural choice. Then, we divide the difference between the `changed` and `initial` amounts by the `initial` amount."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 1,
+ "metadata": {
+ "scrolled": true
+ },
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "0.01735357917570499"
+ ]
+ },
+ "execution_count": 1,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "initial = 2766000\n",
+ "changed = 2814000\n",
+ "(changed - initial) / initial"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "It is also typical to subtract one from the ratio of the two measurements, which yields the same value."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 2,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "0.017353579175704903"
+ ]
+ },
+ "execution_count": 2,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "(changed/initial) - 1"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "This value is the growth rate over 10 years. A useful property of growth rates is that they don't change even if the values are expressed in different units. So, for example, we can express the same relationship between thousands of people in 2002 and 2012."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 3,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "0.017353579175704903"
+ ]
+ },
+ "execution_count": 3,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "initial = 2766\n",
+ "changed = 2814\n",
+ "(changed/initial) - 1"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "In 10 years, the number of employees of the US Federal Government has increased by only 1.74%. In that time, the total expenditures of the US Federal Government increased from \\$2.37 trillion in 2002 to \\$3.38 trillion in 2012."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 4,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "0.4261603375527425"
+ ]
+ },
+ "execution_count": 4,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "initial = 2.37\n",
+ "changed = 3.38\n",
+ "(changed/initial) - 1"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "A 42.6% increase in the federal budget is much larger than the 1.74% increase in federal employees. In fact, the number of federal employees has grown much more slowly than the population of the United States, which increased 9.21% in the same time period from 287.6 million people in 2002 to 314.1 million in 2012."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 5,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "0.09214186369958277"
+ ]
+ },
+ "execution_count": 5,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "initial = 287.6\n",
+ "changed = 314.1\n",
+ "(changed/initial) - 1"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "A growth rate can be negative, representing a decrease in some value. For example, the number of manufacturing jobs in the US decreased from 15.3 million in 2002 to 11.9 million in 2012, a -22.2% growth rate."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 6,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "-0.2222222222222222"
+ ]
+ },
+ "execution_count": 6,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "initial = 15.3\n",
+ "changed = 11.9\n",
+ "(changed/initial) - 1"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "An annual growth rate is a growth rate of some quantity over a single year. An annual growth rate of 0.035, accumulated each year for 10 years, gives a much larger ten-year growth rate of 0.41 (or 41%)."
+ "Likewise, a ten-year growth rate can be used to compute an equivalent annual growth rate. Below, `t` is the number of years that have passed between measurements. The following computes the annual growth rate of federal expenditures over the last 10 years."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 9,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "0.03613617208346853"
+ ]
+ },
+ "execution_count": 9,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "initial = 2.37\n",
+ "changed = 3.38\n",
+ "t = 10\n",
+ "(changed/initial) ** (1/t) - 1"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "The total growth over 10 years is equivalent to a 3.6% increase each year."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "In summary, a growth rate `g` is used to describe the relative size of an `initial` amount and a `changed` amount after some amount of time `t`. To compute $changed$, apply the growth rate `g` repeatedly, `t` times using exponentiation.\n",
+ "\n",
+ "`initial * (1 + g) ** t`\n",
+ "\n",
+ "To compute `g`, raise the total growth to the power of `1/t` and subtract one.\n",
+ "The relationship between two measurements of the same quantity taken at different times is often expressed as a *growth rate*. For example, the United States federal government [employed](http://www.bls.gov/opub/mlr/2013/article/industry-employment-and-output-projections-to-2022-1.htm) 2,766,000 people in 2002 and 2,814,000 people in 2012. To compute a growth rate, we must first decide which value to treat as the `initial` amount. For values over time, the earlier value is a natural choice. Then, we divide the difference between the `changed` and `initial` amount by the `initial` amount."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 1,
+ "metadata": {
+ "scrolled": true
+ },
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "0.01735357917570499"
+ ]
+ },
+ "execution_count": 1,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "initial = 2766000\n",
+ "changed = 2814000\n",
+ "(changed - initial) / initial"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "It is also typical to subtract one from the ratio of the two measurements, which yields the same value."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 2,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "0.017353579175704903"
+ ]
+ },
+ "execution_count": 2,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "(changed/initial) - 1"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "This value is the growth rate over 10 years. A useful property of growth rates is that they don't change even if the values are expressed in different units. So, for example, we can express the same relationship between thousands of people in 2002 and 2012."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 3,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "0.017353579175704903"
+ ]
+ },
+ "execution_count": 3,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "initial = 2766\n",
+ "changed = 2814\n",
+ "(changed/initial) - 1"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "In 10 years, the number of employees of the US Federal Government has increased by only 1.74%. In that time, the total expenditures of the US Federal Government increased from \\$2.37 trillion to \\$3.38 trillion in 2012."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 4,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "0.4261603375527425"
+ ]
+ },
+ "execution_count": 4,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "initial = 2.37\n",
+ "changed = 3.38\n",
+ "(changed/initial) - 1"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "A 42.6% increase in the federal budget is much larger than the 1.74% increase in federal employees. In fact, the number of federal employees has grown much more slowly than the population of the United States, which increased 9.21% in the same time period from 287.6 million people in 2002 to 314.1 million in 2012."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 5,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "0.09214186369958277"
+ ]
+ },
+ "execution_count": 5,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "initial = 287.6\n",
+ "changed = 314.1\n",
+ "(changed/initial) - 1"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "A growth rate can be negative, representing a decrease in some value. For example, the number of manufacturing jobs in the US decreased from 15.3 million in 2002 to 11.9 million in 2012, a -22.2% growth rate."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 6,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "-0.2222222222222222"
+ ]
+ },
+ "execution_count": 6,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "initial = 15.3\n",
+ "changed = 11.9\n",
+ "(changed/initial) - 1"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "An annual growth rate is a growth rate of some quantity over a single year. An annual growth rate of 0.035, accumulated each year for 10 years, gives a much larger ten-year growth rate of 0.41 (or 41%)."
+ "Likewise, a ten-year growth rate can be used to compute an equivalent annual growth rate. Below, `t` is the number of years that have passed between measurements. The following computes the annual growth rate of federal expenditures over the last 10 years."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 9,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "0.03613617208346853"
+ ]
+ },
+ "execution_count": 9,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "initial = 2.37\n",
+ "changed = 3.38\n",
+ "t = 10\n",
+ "(changed/initial) ** (1/t) - 1"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "The total growth over 10 years is equivalent to a 3.6% increase each year."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "In summary, a growth rate `g` is used to describe the relative size of an `initial` amount and a `changed` amount after some amount of time `t`. To compute $changed$, apply the growth rate `g` repeatedly, `t` times using exponentiation.\n",
+ "\n",
+ "`initial * (1 + g) ** t`\n",
+ "\n",
+ "To compute `g`, raise the total growth to the power of `1/t` and subtract one.\n",
+ "Names are given to values in Python using an *assignment* statement. In an assignment, a name is followed by `=`, which is followed by any expression. The value of the expression to the right of `=` is *assigned* to the name. Once a name has a value assigned to it, the value will be substituted for that name in future expressions."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 1,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "30"
+ ]
+ },
+ "execution_count": 1,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "a = 10\n",
+ "b = 20\n",
+ "a + b"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "A previously assigned name can be used in the expression to the right of `=`. "
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 2,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "0.5"
+ ]
+ },
+ "execution_count": 2,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "quarter = 1/4\n",
+ "half = 2 * quarter\n",
+ "half"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "However, only the current value of an expression is assigned to a name. If that value changes later, names that were defined in terms of that value will not change automatically."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 3,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "0.5"
+ ]
+ },
+ "execution_count": 3,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "quarter = 4\n",
+ "half"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Names must start with a letter, but can contain both letters and numbers. A name cannot contain a space; instead, it is common to use an underscore character `_` to replace each space. Names are only as useful as you make them; it's up to the programmer to choose names that are easy to interpret. Typically, more meaningful names can be invented than `a` and `b`. For example, to describe the sales tax on a $5 purchase in Berkeley, CA, the following names clarify the meaning of the various quantities involved."
+ "*Call expressions* invoke functions, which are named operations. The name of the function appears first, followed by expressions in parentheses. "
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 1,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "12"
+ ]
+ },
+ "execution_count": 1,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "abs(-12)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 2,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "4"
+ ]
+ },
+ "execution_count": 2,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "round(5 - 1.3)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 3,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "5"
+ ]
+ },
+ "execution_count": 3,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "max(2, 2 + 3, 4)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "In this last example, the `max` function is *called* on three *arguments*: 2, 5, and 4. The value of each expression within parentheses is passed to the function, and the function *returns* the final value of the full call expression. The `max` function can take any number of arguments and returns the maximum.\n",
+ "\n",
+ "A few functions are available by default, such as `abs` and `round`, but most functions that are built into the Python language are stored in a collection of functions called a *module*. An *import statement* is used to provide access to a module, such as `math` or `operator`."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 4,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "3.0"
+ ]
+ },
+ "execution_count": 4,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "import math\n",
+ "import operator\n",
+ "math.sqrt(operator.add(4, 5))"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "An equivalent expression could be expressed using the `+` and `**` operators instead."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 5,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "3.0"
+ ]
+ },
+ "execution_count": 5,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "(4 + 5) ** 0.5"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Operators and call expressions can be used together in an expression. The *percent difference* between two values is used to compare values for which neither one is obviously `initial` or `changed`. For example, in 2014 Florida farms produced 2.72 billion eggs while Iowa farms produced 16.25 billion eggs (http://quickstats.nass.usda.gov/). The percent difference is 100 times the absolute value of the difference between the values, divided by their average. In this case, the difference is larger than the average, and so the percent difference is greater than 100."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 6,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "142.6462836056932"
+ ]
+ },
+ "execution_count": 6,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "florida = 2.72\n",
+ "iowa = 16.25\n",
+ "100*abs(florida-iowa)/((florida+iowa)/2)"
+ ]
+ },
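+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "The percent difference suits this comparison because it does not depend on which value is listed first, whereas a growth rate does. A minimal sketch, reusing the egg production figures above; the names `average`, `florida_first`, `iowa_first`, and the two `growth_` names are introduced here for illustration:\n",
+ "\n",

```python
florida = 2.72   # billions of eggs, 2014
iowa = 16.25     # billions of eggs, 2014

# Hypothetical helper names, used only in this sketch.
average = (florida + iowa) / 2

# Percent difference: swapping the two values gives the same result.
florida_first = 100 * abs(florida - iowa) / average
iowa_first = 100 * abs(iowa - florida) / average

# Growth rate: the result depends on which value is treated as initial.
growth_florida_to_iowa = iowa / florida - 1
growth_iowa_to_florida = florida / iowa - 1

print(florida_first == iowa_first)
print(round(growth_florida_to_iowa, 2), round(growth_iowa_to_florida, 2))
```

+ "\n",
+ "The two percent differences agree, while the two growth rates differ in both sign and size, which is why a growth rate needs a clear choice of `initial` value."
+ ]
+ },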
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Learning how different functions behave is an important part of learning a programming language. A Jupyter notebook can assist in remembering the names and effects of different functions. When editing a code cell, press the *tab* key after typing the beginning of a name to bring up a list of ways to complete that name. For example, press *tab* after `math.` to see all of the functions available in the `math` module. Typing will narrow down the list of options. To learn more about a function, place a `?` after its name. For example, typing `math.log?` will bring up a description of the `log` function in the `math` module."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 7,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "math.log?"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ " log(x[, base])\n",
+ "\n",
+ " Return the logarithm of x to the given base.\n",
+ " If the base not specified, returns the natural logarithm (base e) of x."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "The square brackets in the example call indicate that an argument is optional. That is, `log` can be called with either one or two arguments."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 8,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "4.0"
+ ]
+ },
+ "execution_count": 8,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "math.log(16, 2)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 9,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "4.0"
+ ]
+ },
+ "execution_count": 9,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "math.log(16)/math.log(2)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "The list of [Python's built-in functions](https://docs.python.org/3/library/functions.html) is quite long and includes many functions that are never needed in data science applications. The list of [mathematical functions in the `math` module](https://docs.python.org/3/library/math.html) is similarly long. This text will introduce the most important functions in context, rather than expecting the reader to memorize or understand these lists."
+ "*Call expressions* invoke functions, which are named operations. The name of the function appears first, followed by expressions in parentheses. "
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 1,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "12"
+ ]
+ },
+ "execution_count": 1,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "abs(-12)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 2,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "4"
+ ]
+ },
+ "execution_count": 2,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "round(5 - 1.3)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 3,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "5"
+ ]
+ },
+ "execution_count": 3,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "max(2, 2 + 3, 4)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "In this last example, the `max` function is *called* on three *arguments*: 2, 5, and 4. The value of each expression within parentheses is passed to the function, and the function *returns* the final value of the full call expression. The `max` function can take any number of arguments and returns the maximum.\n",
+ "\n",
+ "A few functions are available by default, such as `abs` and `round`, but most functions that are built into the Python language are stored in a collection of functions called a *module*. An *import statement* is used to provide access to a module, such as `math` or `operator`."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 4,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "3.0"
+ ]
+ },
+ "execution_count": 4,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "import math\n",
+ "import operator\n",
+ "math.sqrt(operator.add(4, 5))"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "An equivalent expression could be expressed using the `+` and `**` operators instead."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 5,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "3.0"
+ ]
+ },
+ "execution_count": 5,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "(4 + 5) ** 0.5"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Operators and call expressions can be used together in an expression. The *percent difference* between two values is used to compare values for which neither one is obviously `initial` or `changed`. For example, in 2014 Florida farms produced 2.72 billion eggs while Iowa farms produced 16.25 billion eggs (http://quickstats.nass.usda.gov/). The percent difference is 100 times the absolute value of the difference between the values, divided by their average. In this case, the difference is larger than the average, and so the percent difference is greater than 100."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 6,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "142.6462836056932"
+ ]
+ },
+ "execution_count": 6,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "florida = 2.72\n",
+ "iowa = 16.25\n",
+ "100*abs(florida-iowa)/((florida+iowa)/2)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Learning how different functions behave is an important part of learning a programming language. A Jupyter notebook can assist in remembering the names and effects of different functions. When editing a code cell, press the *tab* key after typing the beginning of a name to bring up a list of ways to complete that name. For example, press *tab* after `math.` to see all of the functions available in the `math` module. Typing will narrow down the list of options. To learn more about a function, place a `?` after its name. For example, typing `math.log?` will bring up a description of the `log` function in the `math` module."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 7,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "math.log?"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ " log(x[, base])\n",
+ "\n",
+ " Return the logarithm of x to the given base.\n",
+ " If the base not specified, returns the natural logarithm (base e) of x."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "The square brackets in the example call indicate that an argument is optional. That is, `log` can be called with either one or two arguments."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 8,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "4.0"
+ ]
+ },
+ "execution_count": 8,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "math.log(16, 2)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 9,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "4.0"
+ ]
+ },
+ "execution_count": 9,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "math.log(16)/math.log(2)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "The list of [Python's built-in functions](https://docs.python.org/3/library/functions.html) is quite long and includes many functions that are never needed in data science applications. The list of [mathematical functions in the `math` module](https://docs.python.org/3/library/math.html) is similarly long. This text will introduce the most important functions in context, rather than expecting the reader to memorize or understand these lists."
+ "We can now apply Python to analyze data. We will work with data stored in DataFrame structures.\n",
+ "\n",
+ "A DataFrames (df) is a fundamental way of representing data sets. A df can be viewed in two ways:\n",
+ "* a sequence of named columns that each describe a single attribute of all entries in a data set, or\n",
+ "* a sequence of rows that each contain all information about a single individual in a data set.\n",
+ "\n",
+ "We will study dfs in great detail in the next several chapters. For now, we will just introduce a few methods without going into technical details. \n",
+ "\n",
+ "The df `cones` has been imported for us; later we will see how, but here we will just work with it. First, let's take a look at it."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 2,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "<div>\n",
+ "<style scoped>\n",
+ " .dataframe tbody tr th:only-of-type {\n",
+ " vertical-align: middle;\n",
+ " }\n",
+ "\n",
+ " .dataframe tbody tr th {\n",
+ " vertical-align: top;\n",
+ " }\n",
+ "\n",
+ " .dataframe thead th {\n",
+ " text-align: right;\n",
+ " }\n",
+ "</style>\n",
+ "<table border=\"1\" class=\"dataframe\">\n",
+ " <thead>\n",
+ " <tr style=\"text-align: right;\">\n",
+ " <th></th>\n",
+ " <th>Flavor</th>\n",
+ " <th>Color</th>\n",
+ " <th>Price</th>\n",
+ " </tr>\n",
+ " </thead>\n",
+ " <tbody>\n",
+ " <tr>\n",
+ " <th>0</th>\n",
+ " <td>strawberry</td>\n",
+ " <td>pink</td>\n",
+ " <td>3.55</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>1</th>\n",
+ " <td>chocolate</td>\n",
+ " <td>light brown</td>\n",
+ " <td>4.75</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>2</th>\n",
+ " <td>chocolate</td>\n",
+ " <td>dark brown</td>\n",
+ " <td>5.25</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>3</th>\n",
+ " <td>strawberry</td>\n",
+ " <td>pink</td>\n",
+ " <td>5.25</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>4</th>\n",
+ " <td>chocolate</td>\n",
+ " <td>dark brown</td>\n",
+ " <td>5.25</td>\n",
+ " </tr>\n",
+ " </tbody>\n",
+ "</table>\n",
+ "</div>"
+ ],
+ "text/plain": [
+ " Flavor Color Price\n",
+ "0 strawberry pink 3.55\n",
+ "1 chocolate light brown 4.75\n",
+ "2 chocolate dark brown 5.25\n",
+ "3 strawberry pink 5.25\n",
+ "4 chocolate dark brown 5.25"
+ ]
+ },
+ "execution_count": 2,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "cones.head()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "The DataFrame has six rows. Each row corresponds to one ice cream cone. The ice cream cones are the *individuals*.\n",
+ "\n",
+ "Each cone has three attributes: flavor, color, and price. Each column contains the data on one of these attributes, and so all the entries of any single column are of the same kind. Each column has a label. We will refer to columns by their labels.\n",
+ "\n",
+ "A df method is just like a function, but it must operate on a df. So the call looks like\n",
+ "\n",
+ "`name_of_DataFrame.method(arguments)`\n",
+ "\n",
+ "For example, if you want to see just the first two rows of a df, you can use the df method `head`."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 3,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "<div>\n",
+ "<style scoped>\n",
+ " .dataframe tbody tr th:only-of-type {\n",
+ " vertical-align: middle;\n",
+ " }\n",
+ "\n",
+ " .dataframe tbody tr th {\n",
+ " vertical-align: top;\n",
+ " }\n",
+ "\n",
+ " .dataframe thead th {\n",
+ " text-align: right;\n",
+ " }\n",
+ "</style>\n",
+ "<table border=\"1\" class=\"dataframe\">\n",
+ " <thead>\n",
+ " <tr style=\"text-align: right;\">\n",
+ " <th></th>\n",
+ " <th>Flavor</th>\n",
+ " <th>Color</th>\n",
+ " <th>Price</th>\n",
+ " </tr>\n",
+ " </thead>\n",
+ " <tbody>\n",
+ " <tr>\n",
+ " <th>0</th>\n",
+ " <td>strawberry</td>\n",
+ " <td>pink</td>\n",
+ " <td>3.55</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>1</th>\n",
+ " <td>chocolate</td>\n",
+ " <td>light brown</td>\n",
+ " <td>4.75</td>\n",
+ " </tr>\n",
+ " </tbody>\n",
+ "</table>\n",
+ "</div>"
+ ],
+ "text/plain": [
+ " Flavor Color Price\n",
+ "0 strawberry pink 3.55\n",
+ "1 chocolate light brown 4.75"
+ ]
+ },
+ "execution_count": 3,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "cones.head(2)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "You can replace 2 by any number of rows. If you ask for more than six, you will only get six, because `cones` only has six rows."
+ ]
+ },
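+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "The row limit can be tried on a small stand-in table. A minimal sketch, assuming pandas is installed and imported as `pd`; the name `cones_sketch` is introduced here for illustration, and its six rows are copied from the `cones` table displayed in this section (how the original `cones` was loaded is not shown here):\n",
+ "\n",

```python
import pandas as pd

# Hypothetical stand-in for `cones`, rebuilt by hand from the displayed rows.
cones_sketch = pd.DataFrame({
    'Flavor': ['strawberry', 'chocolate', 'chocolate',
               'strawberry', 'chocolate', 'bubblegum'],
    'Color': ['pink', 'light brown', 'dark brown',
              'pink', 'dark brown', 'pink'],
    'Price': [3.55, 4.75, 5.25, 5.25, 5.25, 4.75],
})

# Asking for more rows than the table holds returns every row.
print(len(cones_sketch.head(2)))
print(len(cones_sketch.head(10)))
```

+ "\n",
+ "The first call returns two rows and the second returns all six, since there is nothing further to show."
+ ]
+ },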
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Choosing Sets of Columns\n",
+ "The method `select` creates a new table consisting of only the specified columns.\n",
+ "We can state which columns we want to view by using dot '.' notation (not he same as in maths) or hard brackets with quotes. Note that an index is automatically generated, this is a fundamental aspect of the DataFrame as the index allows us to 'locate' members of the DataFrame."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 4,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "0 strawberry\n",
+ "1 chocolate\n",
+ "2 chocolate\n",
+ "3 strawberry\n",
+ "4 chocolate\n",
+ "5 bubblegum\n",
+ "Name: Flavor, dtype: object"
+ ]
+ },
+ "execution_count": 4,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "# single square brackets\n",
+ "\n",
+ "cones['Flavor']\n",
+ "\n",
+ "# uncomment (remove the hash mark) the line below to view the 'type()' of the output\n",
+ "\n",
+ "#type(cones['Flavor'])"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 5,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "<div>\n",
+ "<style scoped>\n",
+ " .dataframe tbody tr th:only-of-type {\n",
+ " vertical-align: middle;\n",
+ " }\n",
+ "\n",
+ " .dataframe tbody tr th {\n",
+ " vertical-align: top;\n",
+ " }\n",
+ "\n",
+ " .dataframe thead th {\n",
+ " text-align: right;\n",
+ " }\n",
+ "</style>\n",
+ "<table border=\"1\" class=\"dataframe\">\n",
+ " <thead>\n",
+ " <tr style=\"text-align: right;\">\n",
+ " <th></th>\n",
+ " <th>Flavor</th>\n",
+ " </tr>\n",
+ " </thead>\n",
+ " <tbody>\n",
+ " <tr>\n",
+ " <th>0</th>\n",
+ " <td>strawberry</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>1</th>\n",
+ " <td>chocolate</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>2</th>\n",
+ " <td>chocolate</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>3</th>\n",
+ " <td>strawberry</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>4</th>\n",
+ " <td>chocolate</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>5</th>\n",
+ " <td>bubblegum</td>\n",
+ " </tr>\n",
+ " </tbody>\n",
+ "</table>\n",
+ "</div>"
+ ],
+ "text/plain": [
+ " Flavor\n",
+ "0 strawberry\n",
+ "1 chocolate\n",
+ "2 chocolate\n",
+ "3 strawberry\n",
+ "4 chocolate\n",
+ "5 bubblegum"
+ ]
+ },
+ "execution_count": 5,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "# double square brackets\n",
+ "\n",
+ "cones[['Flavor']]\n",
+ "\n",
+ "# uncomment the line below to view the 'type()' of the output\n",
+ "\n",
+ "# type(cones[['Flavor']])"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 6,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "0 strawberry\n",
+ "1 chocolate\n",
+ "2 chocolate\n",
+ "3 strawberry\n",
+ "4 chocolate\n",
+ "5 bubblegum\n",
+ "Name: Flavor, dtype: object"
+ ]
+ },
+ "execution_count": 6,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "cones.Flavor"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "This leaves the original table unchanged."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 7,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "<div>\n",
+ "<style scoped>\n",
+ " .dataframe tbody tr th:only-of-type {\n",
+ " vertical-align: middle;\n",
+ " }\n",
+ "\n",
+ " .dataframe tbody tr th {\n",
+ " vertical-align: top;\n",
+ " }\n",
+ "\n",
+ " .dataframe thead th {\n",
+ " text-align: right;\n",
+ " }\n",
+ "</style>\n",
+ "<table border=\"1\" class=\"dataframe\">\n",
+ " <thead>\n",
+ " <tr style=\"text-align: right;\">\n",
+ " <th></th>\n",
+ " <th>Flavor</th>\n",
+ " <th>Color</th>\n",
+ " <th>Price</th>\n",
+ " </tr>\n",
+ " </thead>\n",
+ " <tbody>\n",
+ " <tr>\n",
+ " <th>0</th>\n",
+ " <td>strawberry</td>\n",
+ " <td>pink</td>\n",
+ " <td>3.55</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>1</th>\n",
+ " <td>chocolate</td>\n",
+ " <td>light brown</td>\n",
+ " <td>4.75</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>2</th>\n",
+ " <td>chocolate</td>\n",
+ " <td>dark brown</td>\n",
+ " <td>5.25</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>3</th>\n",
+ " <td>strawberry</td>\n",
+ " <td>pink</td>\n",
+ " <td>5.25</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>4</th>\n",
+ " <td>chocolate</td>\n",
+ " <td>dark brown</td>\n",
+ " <td>5.25</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>5</th>\n",
+ " <td>bubblegum</td>\n",
+ " <td>pink</td>\n",
+ " <td>4.75</td>\n",
+ " </tr>\n",
+ " </tbody>\n",
+ "</table>\n",
+ "</div>"
+ ],
+ "text/plain": [
+ " Flavor Color Price\n",
+ "0 strawberry pink 3.55\n",
+ "1 chocolate light brown 4.75\n",
+ "2 chocolate dark brown 5.25\n",
+ "3 strawberry pink 5.25\n",
+ "4 chocolate dark brown 5.25\n",
+ "5 bubblegum pink 4.75"
+ ]
+ },
+ "execution_count": 7,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "cones"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "You can select more than one column, by separating the column labels by commas. When you wish to view more than one column the 'hard brackets' must be used twice."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 8,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "<div>\n",
+ "<style scoped>\n",
+ " .dataframe tbody tr th:only-of-type {\n",
+ " vertical-align: middle;\n",
+ " }\n",
+ "\n",
+ " .dataframe tbody tr th {\n",
+ " vertical-align: top;\n",
+ " }\n",
+ "\n",
+ " .dataframe thead th {\n",
+ " text-align: right;\n",
+ " }\n",
+ "</style>\n",
+ "<table border=\"1\" class=\"dataframe\">\n",
+ " <thead>\n",
+ " <tr style=\"text-align: right;\">\n",
+ " <th></th>\n",
+ " <th>Flavor</th>\n",
+ " <th>Price</th>\n",
+ " </tr>\n",
+ " </thead>\n",
+ " <tbody>\n",
+ " <tr>\n",
+ " <th>0</th>\n",
+ " <td>strawberry</td>\n",
+ " <td>3.55</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>1</th>\n",
+ " <td>chocolate</td>\n",
+ " <td>4.75</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>2</th>\n",
+ " <td>chocolate</td>\n",
+ " <td>5.25</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>3</th>\n",
+ " <td>strawberry</td>\n",
+ " <td>5.25</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>4</th>\n",
+ " <td>chocolate</td>\n",
+ " <td>5.25</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>5</th>\n",
+ " <td>bubblegum</td>\n",
+ " <td>4.75</td>\n",
+ " </tr>\n",
+ " </tbody>\n",
+ "</table>\n",
+ "</div>"
+ ],
+ "text/plain": [
+ " Flavor Price\n",
+ "0 strawberry 3.55\n",
+ "1 chocolate 4.75\n",
+ "2 chocolate 5.25\n",
+ "3 strawberry 5.25\n",
+ "4 chocolate 5.25\n",
+ "5 bubblegum 4.75"
+ ]
+ },
+ "execution_count": 8,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "cones[['Flavor', 'Price']]"
+ ]
+ },
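+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "The difference between one and two pairs of square brackets shows up in the type of the result. A minimal sketch, assuming pandas is installed and imported as `pd`; `cones_sketch` is a hypothetical two-row stand-in for `cones`, built from the first two rows displayed above:\n",
+ "\n",

```python
import pandas as pd

# Hypothetical stand-in for `cones`, using the first two displayed rows.
cones_sketch = pd.DataFrame({
    'Flavor': ['strawberry', 'chocolate'],
    'Color': ['pink', 'light brown'],
    'Price': [3.55, 4.75],
})

one_pair = cones_sketch['Flavor']               # a single column
two_pairs = cones_sketch[['Flavor', 'Price']]   # a table of chosen columns

print(type(one_pair).__name__)
print(type(two_pairs).__name__)
print(list(cones_sketch.columns))
```

+ "\n",
+ "One pair of brackets returns a pandas `Series`; two pairs return a `DataFrame`, because the inner pair forms a list of column labels. In both cases the original table keeps all three of its columns."
+ ]
+ },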
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "You can also *drop* columns you don't want. The table above can be created by dropping the `Color` column."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 9,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "<div>\n",
+ "<style scoped>\n",
+ " .dataframe tbody tr th:only-of-type {\n",
+ " vertical-align: middle;\n",
+ " }\n",
+ "\n",
+ " .dataframe tbody tr th {\n",
+ " vertical-align: top;\n",
+ " }\n",
+ "\n",
+ " .dataframe thead th {\n",
+ " text-align: right;\n",
+ " }\n",
+ "</style>\n",
+ "<table border=\"1\" class=\"dataframe\">\n",
+ " <thead>\n",
+ " <tr style=\"text-align: right;\">\n",
+ " <th></th>\n",
+ " <th>Flavor</th>\n",
+ " <th>Price</th>\n",
+ " </tr>\n",
+ " </thead>\n",
+ " <tbody>\n",
+ " <tr>\n",
+ " <th>0</th>\n",
+ " <td>strawberry</td>\n",
+ " <td>3.55</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>1</th>\n",
+ " <td>chocolate</td>\n",
+ " <td>4.75</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>2</th>\n",
+ " <td>chocolate</td>\n",
+ " <td>5.25</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>3</th>\n",
+ " <td>strawberry</td>\n",
+ " <td>5.25</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>4</th>\n",
+ " <td>chocolate</td>\n",
+ " <td>5.25</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>5</th>\n",
+ " <td>bubblegum</td>\n",
+ " <td>4.75</td>\n",
+ " </tr>\n",
+ " </tbody>\n",
+ "</table>\n",
+ "</div>"
+ ],
+ "text/plain": [
+ " Flavor Price\n",
+ "0 strawberry 3.55\n",
+ "1 chocolate 4.75\n",
+ "2 chocolate 5.25\n",
+ "3 strawberry 5.25\n",
+ "4 chocolate 5.25\n",
+ "5 bubblegum 4.75"
+ ]
+ },
+ "execution_count": 9,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "cones.drop(columns=['Color'])"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "You can name this new table and look at it again by just typing its name."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 10,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "<div>\n",
+ "<style scoped>\n",
+ " .dataframe tbody tr th:only-of-type {\n",
+ " vertical-align: middle;\n",
+ " }\n",
+ "\n",
+ " .dataframe tbody tr th {\n",
+ " vertical-align: top;\n",
+ " }\n",
+ "\n",
+ " .dataframe thead th {\n",
+ " text-align: right;\n",
+ " }\n",
+ "</style>\n",
+ "<table border=\"1\" class=\"dataframe\">\n",
+ " <thead>\n",
+ " <tr style=\"text-align: right;\">\n",
+ " <th></th>\n",
+ " <th>Flavor</th>\n",
+ " <th>Price</th>\n",
+ " </tr>\n",
+ " </thead>\n",
+ " <tbody>\n",
+ " <tr>\n",
+ " <th>0</th>\n",
+ " <td>strawberry</td>\n",
+ " <td>3.55</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>1</th>\n",
+ " <td>chocolate</td>\n",
+ " <td>4.75</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>2</th>\n",
+ " <td>chocolate</td>\n",
+ " <td>5.25</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>3</th>\n",
+ " <td>strawberry</td>\n",
+ " <td>5.25</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>4</th>\n",
+ " <td>chocolate</td>\n",
+ " <td>5.25</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>5</th>\n",
+ " <td>bubblegum</td>\n",
+ " <td>4.75</td>\n",
+ " </tr>\n",
+ " </tbody>\n",
+ "</table>\n",
+ "</div>"
+ ],
+ "text/plain": [
+ " Flavor Price\n",
+ "0 strawberry 3.55\n",
+ "1 chocolate 4.75\n",
+ "2 chocolate 5.25\n",
+ "3 strawberry 5.25\n",
+ "4 chocolate 5.25\n",
+ "5 bubblegum 4.75"
+ ]
+ },
+ "execution_count": 10,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "no_colors = cones.drop(columns=['Color'])\n",
+ "\n",
+ "no_colors"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Like selecting columns using hard brackets or dot notation, the `drop` method creates a smaller table and leaves the original table unchanged. In order to explore your data, you can create any number of smaller tables by choosing or dropping columns. It will do no harm to your original data table."
+ ]
+ },
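+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "As a minimal sketch of this point, the code below builds a small table by hand and checks that it still has all three columns after selecting and dropping. The name `cones_copy` and the hand-typed rows are assumptions made for illustration; they stand in for the imported `cones` table.\n",
+ "\n",
+ "```python\n",
+ "import pandas as pd\n",
+ "\n",
+ "# Hypothetical stand-in for the imported `cones` table\n",
+ "cones_copy = pd.DataFrame({\n",
+ "    'Flavor': ['strawberry', 'chocolate', 'chocolate'],\n",
+ "    'Color': ['pink', 'light brown', 'dark brown'],\n",
+ "    'Price': [3.55, 4.75, 5.25],\n",
+ "})\n",
+ "\n",
+ "selected = cones_copy[['Flavor', 'Price']]    # choose two columns\n",
+ "dropped = cones_copy.drop(columns=['Color'])  # drop one column\n",
+ "\n",
+ "# Both smaller tables hold the same two columns\n",
+ "print(list(selected.columns), list(dropped.columns))\n",
+ "\n",
+ "# The original table still has all three columns\n",
+ "print(list(cones_copy.columns))\n",
+ "```\n",
+ "\n",
+ "Because neither result was assigned back to `cones_copy`, the original keeps its `Color` column."
+ ]
+ },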
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Sorting Rows"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "The `sort_values` method creates a new table by arranging the rows of the original table in ascending order of the values in the specified column. For example, `cones.sort_values('Price')` arranges the `cones` table in ascending order of the price of the cones.\n",
+ "To sort in descending order, you can use an *optional* argument to `sort_values`. As the name implies, optional arguments don't have to be used, but they can be used if you want to change the default behavior of a method. \n",
+ "\n",
+ "By default, `sort_values` sorts in increasing order of the values in the specified column. To sort in decreasing order, use the optional argument `ascending=False`; the default value for `ascending` is `True`."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 12,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "<div>\n",
+ "<style scoped>\n",
+ " .dataframe tbody tr th:only-of-type {\n",
+ " vertical-align: middle;\n",
+ " }\n",
+ "\n",
+ " .dataframe tbody tr th {\n",
+ " vertical-align: top;\n",
+ " }\n",
+ "\n",
+ " .dataframe thead th {\n",
+ " text-align: right;\n",
+ " }\n",
+ "</style>\n",
+ "<table border=\"1\" class=\"dataframe\">\n",
+ " <thead>\n",
+ " <tr style=\"text-align: right;\">\n",
+ " <th></th>\n",
+ " <th>Flavor</th>\n",
+ " <th>Color</th>\n",
+ " <th>Price</th>\n",
+ " </tr>\n",
+ " </thead>\n",
+ " <tbody>\n",
+ " <tr>\n",
+ " <th>2</th>\n",
+ " <td>chocolate</td>\n",
+ " <td>dark brown</td>\n",
+ " <td>5.25</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>3</th>\n",
+ " <td>strawberry</td>\n",
+ " <td>pink</td>\n",
+ " <td>5.25</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>4</th>\n",
+ " <td>chocolate</td>\n",
+ " <td>dark brown</td>\n",
+ " <td>5.25</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>1</th>\n",
+ " <td>chocolate</td>\n",
+ " <td>light brown</td>\n",
+ " <td>4.75</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>5</th>\n",
+ " <td>bubblegum</td>\n",
+ " <td>pink</td>\n",
+ " <td>4.75</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>0</th>\n",
+ " <td>strawberry</td>\n",
+ " <td>pink</td>\n",
+ " <td>3.55</td>\n",
+ " </tr>\n",
+ " </tbody>\n",
+ "</table>\n",
+ "</div>"
+ ],
+ "text/plain": [
+ " Flavor Color Price\n",
+ "2 chocolate dark brown 5.25\n",
+ "3 strawberry pink 5.25\n",
+ "4 chocolate dark brown 5.25\n",
+ "1 chocolate light brown 4.75\n",
+ "5 bubblegum pink 4.75\n",
+ "0 strawberry pink 3.55"
+ ]
+ },
+ "execution_count": 12,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "cones.sort_values('Price', ascending=False)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "As with selecting and dropping columns, the `sort_values` method leaves the original table unchanged."
+ ]
+ },
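+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "The following sketch shows both sort directions side by side and confirms that the original row order survives. The table `prices` and its three rows are hypothetical, chosen only to keep the example short.\n",
+ "\n",
+ "```python\n",
+ "import pandas as pd\n",
+ "\n",
+ "# Hypothetical table, deliberately not in price order\n",
+ "prices = pd.DataFrame({\n",
+ "    'Flavor': ['strawberry', 'bubblegum', 'chocolate'],\n",
+ "    'Price': [5.25, 3.55, 4.75],\n",
+ "})\n",
+ "\n",
+ "cheapest_first = prices.sort_values('Price')                   # ascending=True is the default\n",
+ "priciest_first = prices.sort_values('Price', ascending=False)  # reverse the order\n",
+ "\n",
+ "print(list(cheapest_first['Flavor']))\n",
+ "print(list(priciest_first['Flavor']))\n",
+ "\n",
+ "# The original table keeps the order in which it was created\n",
+ "print(list(prices['Flavor']))\n",
+ "```\n",
+ "\n",
+ "Note that each sorted table keeps the original row labels, which is why the index column in the output above is no longer in numerical order."
+ ]
+ },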
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### Selecting Rows that Satisfy a Condition\n",
+ "To create a new DataFrame (in the database world this would be a 'view') consisting only of the rows that satisfy a given condition, we use the 'exactly equal to' operator `==`. In this section we will work with a very simple condition, which is that the value in a specified column must be exactly equal to a value that we also specify. Thus the `==` comparison has two operands.\n",
+ "\n",
+ "The code in the cell below creates a df consisting only of the rows corresponding to chocolate cones."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 13,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "<div>\n",
+ "<style scoped>\n",
+ " .dataframe tbody tr th:only-of-type {\n",
+ " vertical-align: middle;\n",
+ " }\n",
+ "\n",
+ " .dataframe tbody tr th {\n",
+ " vertical-align: top;\n",
+ " }\n",
+ "\n",
+ " .dataframe thead th {\n",
+ " text-align: right;\n",
+ " }\n",
+ "</style>\n",
+ "<table border=\"1\" class=\"dataframe\">\n",
+ " <thead>\n",
+ " <tr style=\"text-align: right;\">\n",
+ " <th></th>\n",
+ " <th>Flavor</th>\n",
+ " <th>Color</th>\n",
+ " <th>Price</th>\n",
+ " </tr>\n",
+ " </thead>\n",
+ " <tbody>\n",
+ " <tr>\n",
+ " <th>1</th>\n",
+ " <td>chocolate</td>\n",
+ " <td>light brown</td>\n",
+ " <td>4.75</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>2</th>\n",
+ " <td>chocolate</td>\n",
+ " <td>dark brown</td>\n",
+ " <td>5.25</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>4</th>\n",
+ " <td>chocolate</td>\n",
+ " <td>dark brown</td>\n",
+ " <td>5.25</td>\n",
+ " </tr>\n",
+ " </tbody>\n",
+ "</table>\n",
+ "</div>"
+ ],
+ "text/plain": [
+ " Flavor Color Price\n",
+ "1 chocolate light brown 4.75\n",
+ "2 chocolate dark brown 5.25\n",
+ "4 chocolate dark brown 5.25"
+ ]
+ },
+ "execution_count": 13,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "cones[cones['Flavor']=='chocolate']"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "The two operands are the column, selected by its label, and the value we are looking for in that column. The `==` operator can also be used when the condition that the rows must satisfy is more complicated. In those situations the expression will be a little more complicated as well.\n",
+ "\n",
+ "It is important to provide the value exactly. For example, if we specify `Chocolate` instead of `chocolate`, then the comparison correctly finds no rows where the flavor is `Chocolate`."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 14,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "<div>\n",
+ "<style scoped>\n",
+ " .dataframe tbody tr th:only-of-type {\n",
+ " vertical-align: middle;\n",
+ " }\n",
+ "\n",
+ " .dataframe tbody tr th {\n",
+ " vertical-align: top;\n",
+ " }\n",
+ "\n",
+ " .dataframe thead th {\n",
+ " text-align: right;\n",
+ " }\n",
+ "</style>\n",
+ "<table border=\"1\" class=\"dataframe\">\n",
+ " <thead>\n",
+ " <tr style=\"text-align: right;\">\n",
+ " <th></th>\n",
+ " <th>Flavor</th>\n",
+ " <th>Color</th>\n",
+ " <th>Price</th>\n",
+ " </tr>\n",
+ " </thead>\n",
+ " <tbody>\n",
+ " </tbody>\n",
+ "</table>\n",
+ "</div>"
+ ],
+ "text/plain": [
+ "Empty DataFrame\n",
+ "Columns: [Flavor, Color, Price]\n",
+ "Index: []"
+ ]
+ },
+ "execution_count": 14,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "cones[cones['Flavor'] == 'Chocolate']"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Like all the other table operations in this section, selecting rows with `==` leaves the original table unchanged."
+ ]
+ },
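+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "It can help to look at the two steps of such a selection separately. In the sketch below, the comparison first produces a column of `True`/`False` values, and that column is then used inside the brackets to keep only the matching rows. The table `cones_copy` is a hand-typed, hypothetical stand-in for `cones`, and combining two conditions with `&` is shown as one possible form of a more complicated condition.\n",
+ "\n",
+ "```python\n",
+ "import pandas as pd\n",
+ "\n",
+ "# Hypothetical stand-in for the imported `cones` table\n",
+ "cones_copy = pd.DataFrame({\n",
+ "    'Flavor': ['strawberry', 'chocolate', 'chocolate'],\n",
+ "    'Color': ['pink', 'light brown', 'dark brown'],\n",
+ "    'Price': [3.55, 4.75, 5.25],\n",
+ "})\n",
+ "\n",
+ "# Step 1: the comparison gives one True/False value per row\n",
+ "is_chocolate = cones_copy['Flavor'] == 'chocolate'\n",
+ "print(is_chocolate)\n",
+ "\n",
+ "# Step 2: the True/False column picks out the rows to keep\n",
+ "print(cones_copy[is_chocolate])\n",
+ "\n",
+ "# Two conditions joined with &; each one needs its own parentheses\n",
+ "pricey_chocolate = cones_copy[(cones_copy['Flavor'] == 'chocolate') & (cones_copy['Price'] > 5)]\n",
+ "print(pricey_chocolate)\n",
+ "```\n",
+ "\n",
+ "Naming the intermediate result, here `is_chocolate`, is optional; it is done only to make the two steps visible."
+ ]
+ },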
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### Example: Salaries in the NBA"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "\"The NBA is the highest paying professional sports league in the world,\" [reported CNN](http://edition.cnn.com/2015/12/04/sport/gallery/highest-paid-nba-players/) in March 2016. The table `nba` contains the [salaries of all National Basketball Association players](https://www.statcrunch.com/app/index.php?dataid=1843341) in 2015-2016.\n",
+ "\n",
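+ "Each row represents one player. The columns are:\n",
+ "\n",
+ "| **Column Label** | Description |\n",
+ "|------------------|-------------|\n",
+ "| `PLAYER` | Player's name |\n",
+ "| `POSITION` | Player's position on team |\n",
+ "| `TEAM` | Team name |\n",
+ "| `SALARY` | Player's salary in 2015-2016, in millions of dollars |\n",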
+ " \n",
+ "The code for the positions is PG (Point Guard), SG (Shooting Guard), PF (Power Forward), SF (Small Forward), and C (Center). But what follows doesn't involve details about how basketball is played.\n",
+ "\n",
+ "The first row shows that Paul Millsap, Power Forward for the Atlanta Hawks, had a salary of almost $\\$18.7$ million in 2015-2016."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 15,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "<div>\n",
+ "<style scoped>\n",
+ " .dataframe tbody tr th:only-of-type {\n",
+ " vertical-align: middle;\n",
+ " }\n",
+ "\n",
+ " .dataframe tbody tr th {\n",
+ " vertical-align: top;\n",
+ " }\n",
+ "\n",
+ " .dataframe thead th {\n",
+ " text-align: right;\n",
+ " }\n",
+ "</style>\n",
+ "<table border=\"1\" class=\"dataframe\">\n",
+ " <thead>\n",
+ " <tr style=\"text-align: right;\">\n",
+ " <th></th>\n",
+ " <th>PLAYER</th>\n",
+ " <th>POSITION</th>\n",
+ " <th>TEAM</th>\n",
+ " <th>SALARY</th>\n",
+ " </tr>\n",
+ " </thead>\n",
+ " <tbody>\n",
+ " <tr>\n",
+ " <th>0</th>\n",
+ " <td>Paul Millsap</td>\n",
+ " <td>PF</td>\n",
+ " <td>Atlanta Hawks</td>\n",
+ " <td>18.671659</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>1</th>\n",
+ " <td>Al Horford</td>\n",
+ " <td>C</td>\n",
+ " <td>Atlanta Hawks</td>\n",
+ " <td>12.000000</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>2</th>\n",
+ " <td>Tiago Splitter</td>\n",
+ " <td>C</td>\n",
+ " <td>Atlanta Hawks</td>\n",
+ " <td>9.756250</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>3</th>\n",
+ " <td>Jeff Teague</td>\n",
+ " <td>PG</td>\n",
+ " <td>Atlanta Hawks</td>\n",
+ " <td>8.000000</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>4</th>\n",
+ " <td>Kyle Korver</td>\n",
+ " <td>SG</td>\n",
+ " <td>Atlanta Hawks</td>\n",
+ " <td>5.746479</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>...</th>\n",
+ " <td>...</td>\n",
+ " <td>...</td>\n",
+ " <td>...</td>\n",
+ " <td>...</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>412</th>\n",
+ " <td>Gary Neal</td>\n",
+ " <td>PG</td>\n",
+ " <td>Washington Wizards</td>\n",
+ " <td>2.139000</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>413</th>\n",
+ " <td>DeJuan Blair</td>\n",
+ " <td>C</td>\n",
+ " <td>Washington Wizards</td>\n",
+ " <td>2.000000</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>414</th>\n",
+ " <td>Kelly Oubre Jr.</td>\n",
+ " <td>SF</td>\n",
+ " <td>Washington Wizards</td>\n",
+ " <td>1.920240</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>415</th>\n",
+ " <td>Garrett Temple</td>\n",
+ " <td>SG</td>\n",
+ " <td>Washington Wizards</td>\n",
+ " <td>1.100602</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>416</th>\n",
+ " <td>Jarell Eddie</td>\n",
+ " <td>SG</td>\n",
+ " <td>Washington Wizards</td>\n",
+ " <td>0.561716</td>\n",
+ " </tr>\n",
+ " </tbody>\n",
+ "</table>\n",
+ "<p>417 rows × 4 columns</p>\n",
+ "</div>"
+ ],
+ "text/plain": [
+ " PLAYER POSITION TEAM SALARY\n",
+ "0 Paul Millsap PF Atlanta Hawks 18.671659\n",
+ "1 Al Horford C Atlanta Hawks 12.000000\n",
+ "2 Tiago Splitter C Atlanta Hawks 9.756250\n",
+ "3 Jeff Teague PG Atlanta Hawks 8.000000\n",
+ "4 Kyle Korver SG Atlanta Hawks 5.746479\n",
+ ".. ... ... ... ...\n",
+ "412 Gary Neal PG Washington Wizards 2.139000\n",
+ "413 DeJuan Blair C Washington Wizards 2.000000\n",
+ "414 Kelly Oubre Jr. SF Washington Wizards 1.920240\n",
+ "415 Garrett Temple SG Washington Wizards 1.100602\n",
+ "416 Jarell Eddie SG Washington Wizards 0.561716\n",
+ "\n",
+ "[417 rows x 4 columns]"
+ ]
+ },
+ "execution_count": 15,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "nba"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Fans of Stephen Curry can find his row by comparing the `PLAYER` column to his name with `==`."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 16,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "<div>\n",
+ "<style scoped>\n",
+ " .dataframe tbody tr th:only-of-type {\n",
+ " vertical-align: middle;\n",
+ " }\n",
+ "\n",
+ " .dataframe tbody tr th {\n",
+ " vertical-align: top;\n",
+ " }\n",
+ "\n",
+ " .dataframe thead th {\n",
+ " text-align: right;\n",
+ " }\n",
+ "</style>\n",
+ "<table border=\"1\" class=\"dataframe\">\n",
+ " <thead>\n",
+ " <tr style=\"text-align: right;\">\n",
+ " <th></th>\n",
+ " <th>PLAYER</th>\n",
+ " <th>POSITION</th>\n",
+ " <th>TEAM</th>\n",
+ " <th>SALARY</th>\n",
+ " </tr>\n",
+ " </thead>\n",
+ " <tbody>\n",
+ " <tr>\n",
+ " <th>121</th>\n",
+ " <td>Stephen Curry</td>\n",
+ " <td>PG</td>\n",
+ " <td>Golden State Warriors</td>\n",
+ " <td>11.370786</td>\n",
+ " </tr>\n",
+ " </tbody>\n",
+ "</table>\n",
+ "</div>"
+ ],
+ "text/plain": [
+ " PLAYER POSITION TEAM SALARY\n",
+ "121 Stephen Curry PG Golden State Warriors 11.370786"
+ ]
+ },
+ "execution_count": 16,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "nba[nba['PLAYER'] == 'Stephen Curry']"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "We can also create a new table called `warriors` consisting of just the data for the Golden State Warriors."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 17,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "<div>\n",
+ "<style scoped>\n",
+ " .dataframe tbody tr th:only-of-type {\n",
+ " vertical-align: middle;\n",
+ " }\n",
+ "\n",
+ " .dataframe tbody tr th {\n",
+ " vertical-align: top;\n",
+ " }\n",
+ "\n",
+ " .dataframe thead th {\n",
+ " text-align: right;\n",
+ " }\n",
+ "</style>\n",
+ "<table border=\"1\" class=\"dataframe\">\n",
+ " <thead>\n",
+ " <tr style=\"text-align: right;\">\n",
+ " <th></th>\n",
+ " <th>PLAYER</th>\n",
+ " <th>POSITION</th>\n",
+ " <th>TEAM</th>\n",
+ " <th>SALARY</th>\n",
+ " </tr>\n",
+ " </thead>\n",
+ " <tbody>\n",
+ " <tr>\n",
+ " <th>117</th>\n",
+ " <td>Klay Thompson</td>\n",
+ " <td>SG</td>\n",
+ " <td>Golden State Warriors</td>\n",
+ " <td>15.501000</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>118</th>\n",
+ " <td>Draymond Green</td>\n",
+ " <td>PF</td>\n",
+ " <td>Golden State Warriors</td>\n",
+ " <td>14.260870</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>119</th>\n",
+ " <td>Andrew Bogut</td>\n",
+ " <td>C</td>\n",
+ " <td>Golden State Warriors</td>\n",
+ " <td>13.800000</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>120</th>\n",
+ " <td>Andre Iguodala</td>\n",
+ " <td>SF</td>\n",
+ " <td>Golden State Warriors</td>\n",
+ " <td>11.710456</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>121</th>\n",
+ " <td>Stephen Curry</td>\n",
+ " <td>PG</td>\n",
+ " <td>Golden State Warriors</td>\n",
+ " <td>11.370786</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>122</th>\n",
+ " <td>Jason Thompson</td>\n",
+ " <td>PF</td>\n",
+ " <td>Golden State Warriors</td>\n",
+ " <td>7.008475</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>123</th>\n",
+ " <td>Shaun Livingston</td>\n",
+ " <td>PG</td>\n",
+ " <td>Golden State Warriors</td>\n",
+ " <td>5.543725</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>124</th>\n",
+ " <td>Harrison Barnes</td>\n",
+ " <td>SF</td>\n",
+ " <td>Golden State Warriors</td>\n",
+ " <td>3.873398</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>125</th>\n",
+ " <td>Marreese Speights</td>\n",
+ " <td>C</td>\n",
+ " <td>Golden State Warriors</td>\n",
+ " <td>3.815000</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>126</th>\n",
+ " <td>Leandro Barbosa</td>\n",
+ " <td>SG</td>\n",
+ " <td>Golden State Warriors</td>\n",
+ " <td>2.500000</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>127</th>\n",
+ " <td>Festus Ezeli</td>\n",
+ " <td>C</td>\n",
+ " <td>Golden State Warriors</td>\n",
+ " <td>2.008748</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>128</th>\n",
+ " <td>Brandon Rush</td>\n",
+ " <td>SF</td>\n",
+ " <td>Golden State Warriors</td>\n",
+ " <td>1.270964</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>129</th>\n",
+ " <td>Kevon Looney</td>\n",
+ " <td>SF</td>\n",
+ " <td>Golden State Warriors</td>\n",
+ " <td>1.131960</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>130</th>\n",
+ " <td>Anderson Varejao</td>\n",
+ " <td>PF</td>\n",
+ " <td>Golden State Warriors</td>\n",
+ " <td>0.289755</td>\n",
+ " </tr>\n",
+ " </tbody>\n",
+ "</table>\n",
+ "</div>"
+ ],
+ "text/plain": [
+ " PLAYER POSITION TEAM SALARY\n",
+ "117 Klay Thompson SG Golden State Warriors 15.501000\n",
+ "118 Draymond Green PF Golden State Warriors 14.260870\n",
+ "119 Andrew Bogut C Golden State Warriors 13.800000\n",
+ "120 Andre Iguodala SF Golden State Warriors 11.710456\n",
+ "121 Stephen Curry PG Golden State Warriors 11.370786\n",
+ "122 Jason Thompson PF Golden State Warriors 7.008475\n",
+ "123 Shaun Livingston PG Golden State Warriors 5.543725\n",
+ "124 Harrison Barnes SF Golden State Warriors 3.873398\n",
+ "125 Marreese Speights C Golden State Warriors 3.815000\n",
+ "126 Leandro Barbosa SG Golden State Warriors 2.500000\n",
+ "127 Festus Ezeli C Golden State Warriors 2.008748\n",
+ "128 Brandon Rush SF Golden State Warriors 1.270964\n",
+ "129 Kevon Looney SF Golden State Warriors 1.131960\n",
+ "130 Anderson Varejao PF Golden State Warriors 0.289755"
+ ]
+ },
+ "execution_count": 17,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "warriors = nba[nba['TEAM'] =='Golden State Warriors']\n",
+ "warriors"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
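+ "By default, pandas displays a short table in full and abbreviates a long one to its first and last five rows, as in the `nba` output above. You can use `head(n)` to display only the first `n` rows (five if `n` is omitted). To display the table, type the name of the DataFrame; `warriors` is short enough to be shown in full."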
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 18,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "<div>\n",
+ "<style scoped>\n",
+ " .dataframe tbody tr th:only-of-type {\n",
+ " vertical-align: middle;\n",
+ " }\n",
+ "\n",
+ " .dataframe tbody tr th {\n",
+ " vertical-align: top;\n",
+ " }\n",
+ "\n",
+ " .dataframe thead th {\n",
+ " text-align: right;\n",
+ " }\n",
+ "</style>\n",
+ "<table border=\"1\" class=\"dataframe\">\n",
+ " <thead>\n",
+ " <tr style=\"text-align: right;\">\n",
+ " <th></th>\n",
+ " <th>PLAYER</th>\n",
+ " <th>POSITION</th>\n",
+ " <th>TEAM</th>\n",
+ " <th>SALARY</th>\n",
+ " </tr>\n",
+ " </thead>\n",
+ " <tbody>\n",
+ " <tr>\n",
+ " <th>117</th>\n",
+ " <td>Klay Thompson</td>\n",
+ " <td>SG</td>\n",
+ " <td>Golden State Warriors</td>\n",
+ " <td>15.501000</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>118</th>\n",
+ " <td>Draymond Green</td>\n",
+ " <td>PF</td>\n",
+ " <td>Golden State Warriors</td>\n",
+ " <td>14.260870</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>119</th>\n",
+ " <td>Andrew Bogut</td>\n",
+ " <td>C</td>\n",
+ " <td>Golden State Warriors</td>\n",
+ " <td>13.800000</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>120</th>\n",
+ " <td>Andre Iguodala</td>\n",
+ " <td>SF</td>\n",
+ " <td>Golden State Warriors</td>\n",
+ " <td>11.710456</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>121</th>\n",
+ " <td>Stephen Curry</td>\n",
+ " <td>PG</td>\n",
+ " <td>Golden State Warriors</td>\n",
+ " <td>11.370786</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>122</th>\n",
+ " <td>Jason Thompson</td>\n",
+ " <td>PF</td>\n",
+ " <td>Golden State Warriors</td>\n",
+ " <td>7.008475</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>123</th>\n",
+ " <td>Shaun Livingston</td>\n",
+ " <td>PG</td>\n",
+ " <td>Golden State Warriors</td>\n",
+ " <td>5.543725</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>124</th>\n",
+ " <td>Harrison Barnes</td>\n",
+ " <td>SF</td>\n",
+ " <td>Golden State Warriors</td>\n",
+ " <td>3.873398</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>125</th>\n",
+ " <td>Marreese Speights</td>\n",
+ " <td>C</td>\n",
+ " <td>Golden State Warriors</td>\n",
+ " <td>3.815000</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>126</th>\n",
+ " <td>Leandro Barbosa</td>\n",
+ " <td>SG</td>\n",
+ " <td>Golden State Warriors</td>\n",
+ " <td>2.500000</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>127</th>\n",
+ " <td>Festus Ezeli</td>\n",
+ " <td>C</td>\n",
+ " <td>Golden State Warriors</td>\n",
+ " <td>2.008748</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>128</th>\n",
+ " <td>Brandon Rush</td>\n",
+ " <td>SF</td>\n",
+ " <td>Golden State Warriors</td>\n",
+ " <td>1.270964</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>129</th>\n",
+ " <td>Kevon Looney</td>\n",
+ " <td>SF</td>\n",
+ " <td>Golden State Warriors</td>\n",
+ " <td>1.131960</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>130</th>\n",
+ " <td>Anderson Varejao</td>\n",
+ " <td>PF</td>\n",
+ " <td>Golden State Warriors</td>\n",
+ " <td>0.289755</td>\n",
+ " </tr>\n",
+ " </tbody>\n",
+ "</table>\n",
+ "</div>"
+ ],
+ "text/plain": [
+ " PLAYER POSITION TEAM SALARY\n",
+ "117 Klay Thompson SG Golden State Warriors 15.501000\n",
+ "118 Draymond Green PF Golden State Warriors 14.260870\n",
+ "119 Andrew Bogut C Golden State Warriors 13.800000\n",
+ "120 Andre Iguodala SF Golden State Warriors 11.710456\n",
+ "121 Stephen Curry PG Golden State Warriors 11.370786\n",
+ "122 Jason Thompson PF Golden State Warriors 7.008475\n",
+ "123 Shaun Livingston PG Golden State Warriors 5.543725\n",
+ "124 Harrison Barnes SF Golden State Warriors 3.873398\n",
+ "125 Marreese Speights C Golden State Warriors 3.815000\n",
+ "126 Leandro Barbosa SG Golden State Warriors 2.500000\n",
+ "127 Festus Ezeli C Golden State Warriors 2.008748\n",
+ "128 Brandon Rush SF Golden State Warriors 1.270964\n",
+ "129 Kevon Looney SF Golden State Warriors 1.131960\n",
+ "130 Anderson Varejao PF Golden State Warriors 0.289755"
+ ]
+ },
+ "execution_count": 18,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "warriors"
+ ]
+ },
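+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "A quick sketch of `head` on a made-up table may be useful here. The table `numbers` is hypothetical and exists only so that there are more rows than `head` returns.\n",
+ "\n",
+ "```python\n",
+ "import pandas as pd\n",
+ "\n",
+ "# Hypothetical table with 20 rows\n",
+ "numbers = pd.DataFrame({'n': range(1, 21)})\n",
+ "\n",
+ "first_rows = numbers.head()    # no argument: the first five rows\n",
+ "first_three = numbers.head(3)  # an argument sets how many rows to keep\n",
+ "\n",
+ "print(len(numbers), len(first_rows), len(first_three))\n",
+ "print(first_three)\n",
+ "```\n",
+ "\n",
+ "Like the other operations in this section, `head` returns a new, smaller table and does not change `numbers`."
+ ]
+ },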
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "The `nba` table is sorted in alphabetical order of the team names. To see how the players were paid in 2015-2016, it is useful to sort the data by salary. Remember that by default, the sorting is in increasing order."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 19,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "<div>\n",
+ "<style scoped>\n",
+ " .dataframe tbody tr th:only-of-type {\n",
+ " vertical-align: middle;\n",
+ " }\n",
+ "\n",
+ " .dataframe tbody tr th {\n",
+ " vertical-align: top;\n",
+ " }\n",
+ "\n",
+ " .dataframe thead th {\n",
+ " text-align: right;\n",
+ " }\n",
+ "</style>\n",
+ "<table border=\"1\" class=\"dataframe\">\n",
+ " <thead>\n",
+ " <tr style=\"text-align: right;\">\n",
+ " <th></th>\n",
+ " <th>PLAYER</th>\n",
+ " <th>POSITION</th>\n",
+ " <th>TEAM</th>\n",
+ " <th>SALARY</th>\n",
+ " </tr>\n",
+ " </thead>\n",
+ " <tbody>\n",
+ " <tr>\n",
+ " <th>267</th>\n",
+ " <td>Thanasis Antetokounmpo</td>\n",
+ " <td>SF</td>\n",
+ " <td>New York Knicks</td>\n",
+ " <td>0.030888</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>327</th>\n",
+ " <td>Cory Jefferson</td>\n",
+ " <td>PF</td>\n",
+ " <td>Phoenix Suns</td>\n",
+ " <td>0.049709</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>326</th>\n",
+ " <td>Jordan McRae</td>\n",
+ " <td>SG</td>\n",
+ " <td>Phoenix Suns</td>\n",
+ " <td>0.049709</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>324</th>\n",
+ " <td>Orlando Johnson</td>\n",
+ " <td>SG</td>\n",
+ " <td>Phoenix Suns</td>\n",
+ " <td>0.055722</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>325</th>\n",
+ " <td>Phil Pressey</td>\n",
+ " <td>PG</td>\n",
+ " <td>Phoenix Suns</td>\n",
+ " <td>0.055722</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>...</th>\n",
+ " <td>...</td>\n",
+ " <td>...</td>\n",
+ " <td>...</td>\n",
+ " <td>...</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>131</th>\n",
+ " <td>Dwight Howard</td>\n",
+ " <td>C</td>\n",
+ " <td>Houston Rockets</td>\n",
+ " <td>22.359364</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>255</th>\n",
+ " <td>Carmelo Anthony</td>\n",
+ " <td>SF</td>\n",
+ " <td>New York Knicks</td>\n",
+ " <td>22.875000</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>72</th>\n",
+ " <td>LeBron James</td>\n",
+ " <td>SF</td>\n",
+ " <td>Cleveland Cavaliers</td>\n",
+ " <td>22.970500</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>29</th>\n",
+ " <td>Joe Johnson</td>\n",
+ " <td>SF</td>\n",
+ " <td>Brooklyn Nets</td>\n",
+ " <td>24.894863</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>169</th>\n",
+ " <td>Kobe Bryant</td>\n",
+ " <td>SF</td>\n",
+ " <td>Los Angeles Lakers</td>\n",
+ " <td>25.000000</td>\n",
+ " </tr>\n",
+ " </tbody>\n",
+ "</table>\n",
+ "<p>417 rows × 4 columns</p>\n",
+ "</div>"
+ ],
+ "text/plain": [
+ " PLAYER POSITION TEAM SALARY\n",
+ "267 Thanasis Antetokounmpo SF New York Knicks 0.030888\n",
+ "327 Cory Jefferson PF Phoenix Suns 0.049709\n",
+ "326 Jordan McRae SG Phoenix Suns 0.049709\n",
+ "324 Orlando Johnson SG Phoenix Suns 0.055722\n",
+ "325 Phil Pressey PG Phoenix Suns 0.055722\n",
+ ".. ... ... ... ...\n",
+ "131 Dwight Howard C Houston Rockets 22.359364\n",
+ "255 Carmelo Anthony SF New York Knicks 22.875000\n",
+ "72 LeBron James SF Cleveland Cavaliers 22.970500\n",
+ "29 Joe Johnson SF Brooklyn Nets 24.894863\n",
+ "169 Kobe Bryant SF Los Angeles Lakers 25.000000\n",
+ "\n",
+ "[417 rows x 4 columns]"
+ ]
+ },
+ "execution_count": 19,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "nba.sort_values('SALARY')"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "These figures are somewhat difficult to compare as some of these players changed teams during the season and received salaries from more than one team; only the salary from the last team appears in the table. \n",
+ "\n",
+ "The CNN report is about the other end of the salary scale – the players who are among the highest paid in the world. To identify these players we can sort in descending order of salary and look at the top few rows."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 20,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "<div>\n",
+ "<style scoped>\n",
+ " .dataframe tbody tr th:only-of-type {\n",
+ " vertical-align: middle;\n",
+ " }\n",
+ "\n",
+ " .dataframe tbody tr th {\n",
+ " vertical-align: top;\n",
+ " }\n",
+ "\n",
+ " .dataframe thead th {\n",
+ " text-align: right;\n",
+ " }\n",
+ "</style>\n",
+ "<table border=\"1\" class=\"dataframe\">\n",
+ " <thead>\n",
+ " <tr style=\"text-align: right;\">\n",
+ " <th></th>\n",
+ " <th>PLAYER</th>\n",
+ " <th>POSITION</th>\n",
+ " <th>TEAM</th>\n",
+ " <th>SALARY</th>\n",
+ " </tr>\n",
+ " </thead>\n",
+ " <tbody>\n",
+ " <tr>\n",
+ " <th>169</th>\n",
+ " <td>Kobe Bryant</td>\n",
+ " <td>SF</td>\n",
+ " <td>Los Angeles Lakers</td>\n",
+ " <td>25.000000</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>29</th>\n",
+ " <td>Joe Johnson</td>\n",
+ " <td>SF</td>\n",
+ " <td>Brooklyn Nets</td>\n",
+ " <td>24.894863</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>72</th>\n",
+ " <td>LeBron James</td>\n",
+ " <td>SF</td>\n",
+ " <td>Cleveland Cavaliers</td>\n",
+ " <td>22.970500</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>255</th>\n",
+ " <td>Carmelo Anthony</td>\n",
+ " <td>SF</td>\n",
+ " <td>New York Knicks</td>\n",
+ " <td>22.875000</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>131</th>\n",
+ " <td>Dwight Howard</td>\n",
+ " <td>C</td>\n",
+ " <td>Houston Rockets</td>\n",
+ " <td>22.359364</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>...</th>\n",
+ " <td>...</td>\n",
+ " <td>...</td>\n",
+ " <td>...</td>\n",
+ " <td>...</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>200</th>\n",
+ " <td>Elliot Williams</td>\n",
+ " <td>SG</td>\n",
+ " <td>Memphis Grizzlies</td>\n",
+ " <td>0.055722</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>324</th>\n",
+ " <td>Orlando Johnson</td>\n",
+ " <td>SG</td>\n",
+ " <td>Phoenix Suns</td>\n",
+ " <td>0.055722</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>327</th>\n",
+ " <td>Cory Jefferson</td>\n",
+ " <td>PF</td>\n",
+ " <td>Phoenix Suns</td>\n",
+ " <td>0.049709</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>326</th>\n",
+ " <td>Jordan McRae</td>\n",
+ " <td>SG</td>\n",
+ " <td>Phoenix Suns</td>\n",
+ " <td>0.049709</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>267</th>\n",
+ " <td>Thanasis Antetokounmpo</td>\n",
+ " <td>SF</td>\n",
+ " <td>New York Knicks</td>\n",
+ " <td>0.030888</td>\n",
+ " </tr>\n",
+ " </tbody>\n",
+ "</table>\n",
+ "<p>417 rows × 4 columns</p>\n",
+ "</div>"
+ ],
+ "text/plain": [
+ " PLAYER POSITION TEAM SALARY\n",
+ "169 Kobe Bryant SF Los Angeles Lakers 25.000000\n",
+ "29 Joe Johnson SF Brooklyn Nets 24.894863\n",
+ "72 LeBron James SF Cleveland Cavaliers 22.970500\n",
+ "255 Carmelo Anthony SF New York Knicks 22.875000\n",
+ "131 Dwight Howard C Houston Rockets 22.359364\n",
+ ".. ... ... ... ...\n",
+ "200 Elliot Williams SG Memphis Grizzlies 0.055722\n",
+ "324 Orlando Johnson SG Phoenix Suns 0.055722\n",
+ "327 Cory Jefferson PF Phoenix Suns 0.049709\n",
+ "326 Jordan McRae SG Phoenix Suns 0.049709\n",
+ "267 Thanasis Antetokounmpo SF New York Knicks 0.030888\n",
+ "\n",
+ "[417 rows x 4 columns]"
+ ]
+ },
+ "execution_count": 20,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "nba.sort_values('SALARY', ascending=False)"
+ ]
+ },
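+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "To look at only the top few rows, the sort can be followed by `head`. The sketch below does this on a four-row table named `sample`, which is an assumption made for illustration: its rows are copied from the outputs above so that the example runs without the full `nba` table.\n",
+ "\n",
+ "```python\n",
+ "import pandas as pd\n",
+ "\n",
+ "# Four rows copied from the outputs above; the real table has 417 rows\n",
+ "sample = pd.DataFrame({\n",
+ "    'PLAYER': ['Paul Millsap', 'Kobe Bryant', 'Stephen Curry', 'LeBron James'],\n",
+ "    'SALARY': [18.671659, 25.000000, 11.370786, 22.970500],\n",
+ "})\n",
+ "\n",
+ "# Sort from highest to lowest salary, then keep the first two rows\n",
+ "top_two = sample.sort_values('SALARY', ascending=False).head(2)\n",
+ "print(top_two)\n",
+ "```\n",
+ "\n",
+ "With the full table, the same pattern would be `nba.sort_values('SALARY', ascending=False).head(5)`."
+ ]
+ },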
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Kobe Bryant, since retired, was the highest earning NBA player in 2015-2016."
+ "We can now apply Python to analyze data. We will work with data stored in DataFrame structures.\n",
+ "\n",
+ "A DataFrame (df) is a fundamental way of representing data sets. A df can be viewed in two ways:\n",
+ "* a sequence of named columns that each describe a single attribute of all entries in a data set, or\n",
+ "* a sequence of rows that each contain all information about a single individual in a data set.\n",
+ "\n",
+ "We will study dfs in great detail in the next several chapters. For now, we will just introduce a few methods without going into technical details. \n",
+ "\n",
+ "The df `cones` has been imported for us; later we will see how, but here we will just work with it. First, let's take a look at it."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 2,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "<div>\n",
+ "<style scoped>\n",
+ " .dataframe tbody tr th:only-of-type {\n",
+ " vertical-align: middle;\n",
+ " }\n",
+ "\n",
+ " .dataframe tbody tr th {\n",
+ " vertical-align: top;\n",
+ " }\n",
+ "\n",
+ " .dataframe thead th {\n",
+ " text-align: right;\n",
+ " }\n",
+ "</style>\n",
+ "<table border=\"1\" class=\"dataframe\">\n",
+ " <thead>\n",
+ " <tr style=\"text-align: right;\">\n",
+ " <th></th>\n",
+ " <th>Flavor</th>\n",
+ " <th>Color</th>\n",
+ " <th>Price</th>\n",
+ " </tr>\n",
+ " </thead>\n",
+ " <tbody>\n",
+ " <tr>\n",
+ " <th>0</th>\n",
+ " <td>strawberry</td>\n",
+ " <td>pink</td>\n",
+ " <td>3.55</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>1</th>\n",
+ " <td>chocolate</td>\n",
+ " <td>light brown</td>\n",
+ " <td>4.75</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>2</th>\n",
+ " <td>chocolate</td>\n",
+ " <td>dark brown</td>\n",
+ " <td>5.25</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>3</th>\n",
+ " <td>strawberry</td>\n",
+ " <td>pink</td>\n",
+ " <td>5.25</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>4</th>\n",
+ " <td>chocolate</td>\n",
+ " <td>dark brown</td>\n",
+ " <td>5.25</td>\n",
+ " </tr>\n",
+ " </tbody>\n",
+ "</table>\n",
+ "</div>"
+ ],
+ "text/plain": [
+ " Flavor Color Price\n",
+ "0 strawberry pink 3.55\n",
+ "1 chocolate light brown 4.75\n",
+ "2 chocolate dark brown 5.25\n",
+ "3 strawberry pink 5.25\n",
+ "4 chocolate dark brown 5.25"
+ ]
+ },
+ "execution_count": 2,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "cones.head()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "The DataFrame has six rows in total; `head()` displays only the first five. Each row corresponds to one ice cream cone. The ice cream cones are the *individuals*.\n",
+ "\n",
+ "Each cone has three attributes: flavor, color, and price. Each column contains the data on one of these attributes, and so all the entries of any single column are of the same kind. Each column has a label. We will refer to columns by their labels.\n",
+ "\n",
+ "A df method is just like a function, but it must operate on a df. So the call looks like\n",
+ "\n",
+ "`name_of_DataFrame.method(arguments)`\n",
+ "\n",
+ "For example, if you want to see just the first two rows of a df, you can use the df method `head`."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 3,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "<div>\n",
+ "<style scoped>\n",
+ " .dataframe tbody tr th:only-of-type {\n",
+ " vertical-align: middle;\n",
+ " }\n",
+ "\n",
+ " .dataframe tbody tr th {\n",
+ " vertical-align: top;\n",
+ " }\n",
+ "\n",
+ " .dataframe thead th {\n",
+ " text-align: right;\n",
+ " }\n",
+ "</style>\n",
+ "<table border=\"1\" class=\"dataframe\">\n",
+ " <thead>\n",
+ " <tr style=\"text-align: right;\">\n",
+ " <th></th>\n",
+ " <th>Flavor</th>\n",
+ " <th>Color</th>\n",
+ " <th>Price</th>\n",
+ " </tr>\n",
+ " </thead>\n",
+ " <tbody>\n",
+ " <tr>\n",
+ " <th>0</th>\n",
+ " <td>strawberry</td>\n",
+ " <td>pink</td>\n",
+ " <td>3.55</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>1</th>\n",
+ " <td>chocolate</td>\n",
+ " <td>light brown</td>\n",
+ " <td>4.75</td>\n",
+ " </tr>\n",
+ " </tbody>\n",
+ "</table>\n",
+ "</div>"
+ ],
+ "text/plain": [
+ " Flavor Color Price\n",
+ "0 strawberry pink 3.55\n",
+ "1 chocolate light brown 4.75"
+ ]
+ },
+ "execution_count": 3,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "cones.head(2)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "You can replace 2 by any number of rows. If you ask for more than six, you will only get six, because `cones` only has six rows."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Choosing Sets of Columns\n",
+ "Selecting columns creates a new table consisting of only the specified columns.\n",
+ "We can state which columns we want to view by using dot `.` notation (not the same as the dot in mathematics) or square brackets (also called hard brackets) with the column label in quotes. Note that an index is generated automatically. The index is a fundamental aspect of the DataFrame because it allows us to locate individual rows."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 4,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "0 strawberry\n",
+ "1 chocolate\n",
+ "2 chocolate\n",
+ "3 strawberry\n",
+ "4 chocolate\n",
+ "5 bubblegum\n",
+ "Name: Flavor, dtype: object"
+ ]
+ },
+ "execution_count": 4,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "# single square brackets\n",
+ "\n",
+ "cones['Flavor']\n",
+ "\n",
+ "# uncomment (remove the hash mark) the line below to view the 'type()' of the output\n",
+ "\n",
+ "#type(cones['Flavor'])"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 5,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "<div>\n",
+ "<style scoped>\n",
+ " .dataframe tbody tr th:only-of-type {\n",
+ " vertical-align: middle;\n",
+ " }\n",
+ "\n",
+ " .dataframe tbody tr th {\n",
+ " vertical-align: top;\n",
+ " }\n",
+ "\n",
+ " .dataframe thead th {\n",
+ " text-align: right;\n",
+ " }\n",
+ "</style>\n",
+ "<table border=\"1\" class=\"dataframe\">\n",
+ " <thead>\n",
+ " <tr style=\"text-align: right;\">\n",
+ " <th></th>\n",
+ " <th>Flavor</th>\n",
+ " </tr>\n",
+ " </thead>\n",
+ " <tbody>\n",
+ " <tr>\n",
+ " <th>0</th>\n",
+ " <td>strawberry</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>1</th>\n",
+ " <td>chocolate</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>2</th>\n",
+ " <td>chocolate</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>3</th>\n",
+ " <td>strawberry</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>4</th>\n",
+ " <td>chocolate</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>5</th>\n",
+ " <td>bubblegum</td>\n",
+ " </tr>\n",
+ " </tbody>\n",
+ "</table>\n",
+ "</div>"
+ ],
+ "text/plain": [
+ " Flavor\n",
+ "0 strawberry\n",
+ "1 chocolate\n",
+ "2 chocolate\n",
+ "3 strawberry\n",
+ "4 chocolate\n",
+ "5 bubblegum"
+ ]
+ },
+ "execution_count": 5,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "# double square brackets\n",
+ "\n",
+ "cones[['Flavor']]\n",
+ "\n",
+ "# uncomment the line below to view the 'type()' of the output\n",
+ "\n",
+ "# type(cones[['Flavor']])"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 6,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "0 strawberry\n",
+ "1 chocolate\n",
+ "2 chocolate\n",
+ "3 strawberry\n",
+ "4 chocolate\n",
+ "5 bubblegum\n",
+ "Name: Flavor, dtype: object"
+ ]
+ },
+ "execution_count": 6,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "cones.Flavor"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "This leaves the original table unchanged."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 7,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "<div>\n",
+ "<style scoped>\n",
+ " .dataframe tbody tr th:only-of-type {\n",
+ " vertical-align: middle;\n",
+ " }\n",
+ "\n",
+ " .dataframe tbody tr th {\n",
+ " vertical-align: top;\n",
+ " }\n",
+ "\n",
+ " .dataframe thead th {\n",
+ " text-align: right;\n",
+ " }\n",
+ "</style>\n",
+ "<table border=\"1\" class=\"dataframe\">\n",
+ " <thead>\n",
+ " <tr style=\"text-align: right;\">\n",
+ " <th></th>\n",
+ " <th>Flavor</th>\n",
+ " <th>Color</th>\n",
+ " <th>Price</th>\n",
+ " </tr>\n",
+ " </thead>\n",
+ " <tbody>\n",
+ " <tr>\n",
+ " <th>0</th>\n",
+ " <td>strawberry</td>\n",
+ " <td>pink</td>\n",
+ " <td>3.55</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>1</th>\n",
+ " <td>chocolate</td>\n",
+ " <td>light brown</td>\n",
+ " <td>4.75</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>2</th>\n",
+ " <td>chocolate</td>\n",
+ " <td>dark brown</td>\n",
+ " <td>5.25</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>3</th>\n",
+ " <td>strawberry</td>\n",
+ " <td>pink</td>\n",
+ " <td>5.25</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>4</th>\n",
+ " <td>chocolate</td>\n",
+ " <td>dark brown</td>\n",
+ " <td>5.25</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>5</th>\n",
+ " <td>bubblegum</td>\n",
+ " <td>pink</td>\n",
+ " <td>4.75</td>\n",
+ " </tr>\n",
+ " </tbody>\n",
+ "</table>\n",
+ "</div>"
+ ],
+ "text/plain": [
+ " Flavor Color Price\n",
+ "0 strawberry pink 3.55\n",
+ "1 chocolate light brown 4.75\n",
+ "2 chocolate dark brown 5.25\n",
+ "3 strawberry pink 5.25\n",
+ "4 chocolate dark brown 5.25\n",
+ "5 bubblegum pink 4.75"
+ ]
+ },
+ "execution_count": 7,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "cones"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "You can select more than one column by separating the column labels with commas. To view more than one column, the 'hard brackets' must be used twice: the inner pair makes a list of column labels, and the outer pair selects those columns from the DataFrame."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 8,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "<div>\n",
+ "<style scoped>\n",
+ " .dataframe tbody tr th:only-of-type {\n",
+ " vertical-align: middle;\n",
+ " }\n",
+ "\n",
+ " .dataframe tbody tr th {\n",
+ " vertical-align: top;\n",
+ " }\n",
+ "\n",
+ " .dataframe thead th {\n",
+ " text-align: right;\n",
+ " }\n",
+ "</style>\n",
+ "<table border=\"1\" class=\"dataframe\">\n",
+ " <thead>\n",
+ " <tr style=\"text-align: right;\">\n",
+ " <th></th>\n",
+ " <th>Flavor</th>\n",
+ " <th>Price</th>\n",
+ " </tr>\n",
+ " </thead>\n",
+ " <tbody>\n",
+ " <tr>\n",
+ " <th>0</th>\n",
+ " <td>strawberry</td>\n",
+ " <td>3.55</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>1</th>\n",
+ " <td>chocolate</td>\n",
+ " <td>4.75</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>2</th>\n",
+ " <td>chocolate</td>\n",
+ " <td>5.25</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>3</th>\n",
+ " <td>strawberry</td>\n",
+ " <td>5.25</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>4</th>\n",
+ " <td>chocolate</td>\n",
+ " <td>5.25</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>5</th>\n",
+ " <td>bubblegum</td>\n",
+ " <td>4.75</td>\n",
+ " </tr>\n",
+ " </tbody>\n",
+ "</table>\n",
+ "</div>"
+ ],
+ "text/plain": [
+ " Flavor Price\n",
+ "0 strawberry 3.55\n",
+ "1 chocolate 4.75\n",
+ "2 chocolate 5.25\n",
+ "3 strawberry 5.25\n",
+ "4 chocolate 5.25\n",
+ "5 bubblegum 4.75"
+ ]
+ },
+ "execution_count": 8,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "cones[['Flavor', 'Price']]"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "You can also *drop* columns you don't want. The table above can be created by dropping the `Color` column."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 9,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "<div>\n",
+ "<style scoped>\n",
+ " .dataframe tbody tr th:only-of-type {\n",
+ " vertical-align: middle;\n",
+ " }\n",
+ "\n",
+ " .dataframe tbody tr th {\n",
+ " vertical-align: top;\n",
+ " }\n",
+ "\n",
+ " .dataframe thead th {\n",
+ " text-align: right;\n",
+ " }\n",
+ "</style>\n",
+ "<table border=\"1\" class=\"dataframe\">\n",
+ " <thead>\n",
+ " <tr style=\"text-align: right;\">\n",
+ " <th></th>\n",
+ " <th>Flavor</th>\n",
+ " <th>Price</th>\n",
+ " </tr>\n",
+ " </thead>\n",
+ " <tbody>\n",
+ " <tr>\n",
+ " <th>0</th>\n",
+ " <td>strawberry</td>\n",
+ " <td>3.55</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>1</th>\n",
+ " <td>chocolate</td>\n",
+ " <td>4.75</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>2</th>\n",
+ " <td>chocolate</td>\n",
+ " <td>5.25</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>3</th>\n",
+ " <td>strawberry</td>\n",
+ " <td>5.25</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>4</th>\n",
+ " <td>chocolate</td>\n",
+ " <td>5.25</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>5</th>\n",
+ " <td>bubblegum</td>\n",
+ " <td>4.75</td>\n",
+ " </tr>\n",
+ " </tbody>\n",
+ "</table>\n",
+ "</div>"
+ ],
+ "text/plain": [
+ " Flavor Price\n",
+ "0 strawberry 3.55\n",
+ "1 chocolate 4.75\n",
+ "2 chocolate 5.25\n",
+ "3 strawberry 5.25\n",
+ "4 chocolate 5.25\n",
+ "5 bubblegum 4.75"
+ ]
+ },
+ "execution_count": 9,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "cones.drop(columns=['Color'])"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "You can name this new table and look at it again by just typing its name."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 10,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "<div>\n",
+ "<style scoped>\n",
+ " .dataframe tbody tr th:only-of-type {\n",
+ " vertical-align: middle;\n",
+ " }\n",
+ "\n",
+ " .dataframe tbody tr th {\n",
+ " vertical-align: top;\n",
+ " }\n",
+ "\n",
+ " .dataframe thead th {\n",
+ " text-align: right;\n",
+ " }\n",
+ "</style>\n",
+ "<table border=\"1\" class=\"dataframe\">\n",
+ " <thead>\n",
+ " <tr style=\"text-align: right;\">\n",
+ " <th></th>\n",
+ " <th>Flavor</th>\n",
+ " <th>Price</th>\n",
+ " </tr>\n",
+ " </thead>\n",
+ " <tbody>\n",
+ " <tr>\n",
+ " <th>0</th>\n",
+ " <td>strawberry</td>\n",
+ " <td>3.55</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>1</th>\n",
+ " <td>chocolate</td>\n",
+ " <td>4.75</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>2</th>\n",
+ " <td>chocolate</td>\n",
+ " <td>5.25</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>3</th>\n",
+ " <td>strawberry</td>\n",
+ " <td>5.25</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>4</th>\n",
+ " <td>chocolate</td>\n",
+ " <td>5.25</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>5</th>\n",
+ " <td>bubblegum</td>\n",
+ " <td>4.75</td>\n",
+ " </tr>\n",
+ " </tbody>\n",
+ "</table>\n",
+ "</div>"
+ ],
+ "text/plain": [
+ " Flavor Price\n",
+ "0 strawberry 3.55\n",
+ "1 chocolate 4.75\n",
+ "2 chocolate 5.25\n",
+ "3 strawberry 5.25\n",
+ "4 chocolate 5.25\n",
+ "5 bubblegum 4.75"
+ ]
+ },
+ "execution_count": 10,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "no_colors = cones.drop(columns=['Color'])\n",
+ "\n",
+ "no_colors"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Like selecting columns using hard brackets or dot notation, the `drop` method creates a smaller table and leaves the original table unchanged. In order to explore your data, you can create any number of smaller tables by choosing or dropping columns. It will do no harm to your original data table."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Sorting Rows"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "The `sort_values` method creates a new table by arranging the rows of the original table in ascending order of the values in the specified column. For example, `cones.sort_values('Price')` arranges the rows of `cones` in ascending order of the price of the cones.\n",
+ "To sort in descending order, you can use an *optional* argument to `sort_values`. As the name implies, optional arguments don't have to be used, but they can be used if you want to change the default behavior of a method.\n",
+ "\n",
+ "By default, `sort_values` sorts in increasing order of the values in the specified column. To sort in decreasing order, use the optional argument `ascending=False`; the default value for `ascending` is `True`."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 12,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "<div>\n",
+ "<style scoped>\n",
+ " .dataframe tbody tr th:only-of-type {\n",
+ " vertical-align: middle;\n",
+ " }\n",
+ "\n",
+ " .dataframe tbody tr th {\n",
+ " vertical-align: top;\n",
+ " }\n",
+ "\n",
+ " .dataframe thead th {\n",
+ " text-align: right;\n",
+ " }\n",
+ "</style>\n",
+ "<table border=\"1\" class=\"dataframe\">\n",
+ " <thead>\n",
+ " <tr style=\"text-align: right;\">\n",
+ " <th></th>\n",
+ " <th>Flavor</th>\n",
+ " <th>Color</th>\n",
+ " <th>Price</th>\n",
+ " </tr>\n",
+ " </thead>\n",
+ " <tbody>\n",
+ " <tr>\n",
+ " <th>2</th>\n",
+ " <td>chocolate</td>\n",
+ " <td>dark brown</td>\n",
+ " <td>5.25</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>3</th>\n",
+ " <td>strawberry</td>\n",
+ " <td>pink</td>\n",
+ " <td>5.25</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>4</th>\n",
+ " <td>chocolate</td>\n",
+ " <td>dark brown</td>\n",
+ " <td>5.25</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>1</th>\n",
+ " <td>chocolate</td>\n",
+ " <td>light brown</td>\n",
+ " <td>4.75</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>5</th>\n",
+ " <td>bubblegum</td>\n",
+ " <td>pink</td>\n",
+ " <td>4.75</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>0</th>\n",
+ " <td>strawberry</td>\n",
+ " <td>pink</td>\n",
+ " <td>3.55</td>\n",
+ " </tr>\n",
+ " </tbody>\n",
+ "</table>\n",
+ "</div>"
+ ],
+ "text/plain": [
+ " Flavor Color Price\n",
+ "2 chocolate dark brown 5.25\n",
+ "3 strawberry pink 5.25\n",
+ "4 chocolate dark brown 5.25\n",
+ "1 chocolate light brown 4.75\n",
+ "5 bubblegum pink 4.75\n",
+ "0 strawberry pink 3.55"
+ ]
+ },
+ "execution_count": 12,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "cones.sort_values('Price', ascending=False)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "As with selecting and dropping columns, the `sort_values` method leaves the original table unchanged."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### Selecting Rows that Satisfy a Condition\n",
+ "To create a new DataFrame (in the database world this would be a 'view') consisting only of the rows that satisfy a given condition, we use the 'exactly equal to' operator `==`. In this section we will work with a very simple condition, which is that the value in a specified column must be exactly equal to a value that we also specify. Thus the `==` comparison has two operands: the column and the value.\n",
+ "\n",
+ "The code in the cell below creates a df consisting only of the rows corresponding to chocolate cones."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 13,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "<div>\n",
+ "<style scoped>\n",
+ " .dataframe tbody tr th:only-of-type {\n",
+ " vertical-align: middle;\n",
+ " }\n",
+ "\n",
+ " .dataframe tbody tr th {\n",
+ " vertical-align: top;\n",
+ " }\n",
+ "\n",
+ " .dataframe thead th {\n",
+ " text-align: right;\n",
+ " }\n",
+ "</style>\n",
+ "<table border=\"1\" class=\"dataframe\">\n",
+ " <thead>\n",
+ " <tr style=\"text-align: right;\">\n",
+ " <th></th>\n",
+ " <th>Flavor</th>\n",
+ " <th>Color</th>\n",
+ " <th>Price</th>\n",
+ " </tr>\n",
+ " </thead>\n",
+ " <tbody>\n",
+ " <tr>\n",
+ " <th>1</th>\n",
+ " <td>chocolate</td>\n",
+ " <td>light brown</td>\n",
+ " <td>4.75</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>2</th>\n",
+ " <td>chocolate</td>\n",
+ " <td>dark brown</td>\n",
+ " <td>5.25</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>4</th>\n",
+ " <td>chocolate</td>\n",
+ " <td>dark brown</td>\n",
+ " <td>5.25</td>\n",
+ " </tr>\n",
+ " </tbody>\n",
+ "</table>\n",
+ "</div>"
+ ],
+ "text/plain": [
+ " Flavor Color Price\n",
+ "1 chocolate light brown 4.75\n",
+ "2 chocolate dark brown 5.25\n",
+ "4 chocolate dark brown 5.25"
+ ]
+ },
+ "execution_count": 13,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "cones[cones['Flavor']=='chocolate']"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "The two operands are the column, chosen by its label, and the value we are looking for in that column. The comparison produces one `True` or `False` per row, and placing it inside the square brackets keeps only the rows marked `True`. The `==` operator can also be used when the condition that the rows must satisfy is more complicated. In those situations the expression will be a little more complicated as well.\n",
+ "\n",
+ "It is important to provide the value exactly. For example, if we specify `Chocolate` instead of `chocolate`, then the comparison correctly finds no rows where the flavor is `Chocolate`."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 14,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "<div>\n",
+ "<style scoped>\n",
+ " .dataframe tbody tr th:only-of-type {\n",
+ " vertical-align: middle;\n",
+ " }\n",
+ "\n",
+ " .dataframe tbody tr th {\n",
+ " vertical-align: top;\n",
+ " }\n",
+ "\n",
+ " .dataframe thead th {\n",
+ " text-align: right;\n",
+ " }\n",
+ "</style>\n",
+ "<table border=\"1\" class=\"dataframe\">\n",
+ " <thead>\n",
+ " <tr style=\"text-align: right;\">\n",
+ " <th></th>\n",
+ " <th>Flavor</th>\n",
+ " <th>Color</th>\n",
+ " <th>Price</th>\n",
+ " </tr>\n",
+ " </thead>\n",
+ " <tbody>\n",
+ " </tbody>\n",
+ "</table>\n",
+ "</div>"
+ ],
+ "text/plain": [
+ "Empty DataFrame\n",
+ "Columns: [Flavor, Color, Price]\n",
+ "Index: []"
+ ]
+ },
+ "execution_count": 14,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "cones[cones['Flavor'] == 'Chocolate']"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Like all the other operations in this section, filtering with `==` leaves the original table unchanged."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### Example: Salaries in the NBA"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "\"The NBA is the highest paying professional sports league in the world,\" [reported CNN](http://edition.cnn.com/2015/12/04/sport/gallery/highest-paid-nba-players/) in March 2016. The table `nba` contains the [salaries of all National Basketball Association players](https://www.statcrunch.com/app/index.php?dataid=1843341) in 2015-2016.\n",
+ "\n",
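+ "Each row represents one player. The columns are:\n",
+ "\n",
+ "| **Column Label** | Description |\n",
+ "|------------------|-------------|\n",
+ "| `PLAYER` | Player's name |\n",
+ "| `POSITION` | Player's position on team |\n",
+ "| `TEAM` | Team name |\n",
+ "| `SALARY` | Player's salary in 2015-2016, in millions of dollars |\n",
+ "\n",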
+ "The code for the positions is PG (Point Guard), SG (Shooting Guard), PF (Power Forward), SF (Small Forward), and C (Center). But what follows doesn't involve details about how basketball is played.\n",
+ "\n",
+ "The first row shows that Paul Millsap, Power Forward for the Atlanta Hawks, had a salary of almost $\\$18.7$ million in 2015-2016."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 15,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "<div>\n",
+ "<style scoped>\n",
+ " .dataframe tbody tr th:only-of-type {\n",
+ " vertical-align: middle;\n",
+ " }\n",
+ "\n",
+ " .dataframe tbody tr th {\n",
+ " vertical-align: top;\n",
+ " }\n",
+ "\n",
+ " .dataframe thead th {\n",
+ " text-align: right;\n",
+ " }\n",
+ "</style>\n",
+ "<table border=\"1\" class=\"dataframe\">\n",
+ " <thead>\n",
+ " <tr style=\"text-align: right;\">\n",
+ " <th></th>\n",
+ " <th>PLAYER</th>\n",
+ " <th>POSITION</th>\n",
+ " <th>TEAM</th>\n",
+ " <th>SALARY</th>\n",
+ " </tr>\n",
+ " </thead>\n",
+ " <tbody>\n",
+ " <tr>\n",
+ " <th>0</th>\n",
+ " <td>Paul Millsap</td>\n",
+ " <td>PF</td>\n",
+ " <td>Atlanta Hawks</td>\n",
+ " <td>18.671659</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>1</th>\n",
+ " <td>Al Horford</td>\n",
+ " <td>C</td>\n",
+ " <td>Atlanta Hawks</td>\n",
+ " <td>12.000000</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>2</th>\n",
+ " <td>Tiago Splitter</td>\n",
+ " <td>C</td>\n",
+ " <td>Atlanta Hawks</td>\n",
+ " <td>9.756250</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>3</th>\n",
+ " <td>Jeff Teague</td>\n",
+ " <td>PG</td>\n",
+ " <td>Atlanta Hawks</td>\n",
+ " <td>8.000000</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>4</th>\n",
+ " <td>Kyle Korver</td>\n",
+ " <td>SG</td>\n",
+ " <td>Atlanta Hawks</td>\n",
+ " <td>5.746479</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>...</th>\n",
+ " <td>...</td>\n",
+ " <td>...</td>\n",
+ " <td>...</td>\n",
+ " <td>...</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>412</th>\n",
+ " <td>Gary Neal</td>\n",
+ " <td>PG</td>\n",
+ " <td>Washington Wizards</td>\n",
+ " <td>2.139000</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>413</th>\n",
+ " <td>DeJuan Blair</td>\n",
+ " <td>C</td>\n",
+ " <td>Washington Wizards</td>\n",
+ " <td>2.000000</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>414</th>\n",
+ " <td>Kelly Oubre Jr.</td>\n",
+ " <td>SF</td>\n",
+ " <td>Washington Wizards</td>\n",
+ " <td>1.920240</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>415</th>\n",
+ " <td>Garrett Temple</td>\n",
+ " <td>SG</td>\n",
+ " <td>Washington Wizards</td>\n",
+ " <td>1.100602</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>416</th>\n",
+ " <td>Jarell Eddie</td>\n",
+ " <td>SG</td>\n",
+ " <td>Washington Wizards</td>\n",
+ " <td>0.561716</td>\n",
+ " </tr>\n",
+ " </tbody>\n",
+ "</table>\n",
+ "<p>417 rows × 4 columns</p>\n",
+ "</div>"
+ ],
+ "text/plain": [
+ " PLAYER POSITION TEAM SALARY\n",
+ "0 Paul Millsap PF Atlanta Hawks 18.671659\n",
+ "1 Al Horford C Atlanta Hawks 12.000000\n",
+ "2 Tiago Splitter C Atlanta Hawks 9.756250\n",
+ "3 Jeff Teague PG Atlanta Hawks 8.000000\n",
+ "4 Kyle Korver SG Atlanta Hawks 5.746479\n",
+ ".. ... ... ... ...\n",
+ "412 Gary Neal PG Washington Wizards 2.139000\n",
+ "413 DeJuan Blair C Washington Wizards 2.000000\n",
+ "414 Kelly Oubre Jr. SF Washington Wizards 1.920240\n",
+ "415 Garrett Temple SG Washington Wizards 1.100602\n",
+ "416 Jarell Eddie SG Washington Wizards 0.561716\n",
+ "\n",
+ "[417 rows x 4 columns]"
+ ]
+ },
+ "execution_count": 15,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "nba"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Fans of Stephen Curry can find his row by comparing the `PLAYER` column with his name using `==`."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 16,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "<div>\n",
+ "<style scoped>\n",
+ " .dataframe tbody tr th:only-of-type {\n",
+ " vertical-align: middle;\n",
+ " }\n",
+ "\n",
+ " .dataframe tbody tr th {\n",
+ " vertical-align: top;\n",
+ " }\n",
+ "\n",
+ " .dataframe thead th {\n",
+ " text-align: right;\n",
+ " }\n",
+ "</style>\n",
+ "<table border=\"1\" class=\"dataframe\">\n",
+ " <thead>\n",
+ " <tr style=\"text-align: right;\">\n",
+ " <th></th>\n",
+ " <th>PLAYER</th>\n",
+ " <th>POSITION</th>\n",
+ " <th>TEAM</th>\n",
+ " <th>SALARY</th>\n",
+ " </tr>\n",
+ " </thead>\n",
+ " <tbody>\n",
+ " <tr>\n",
+ " <th>121</th>\n",
+ " <td>Stephen Curry</td>\n",
+ " <td>PG</td>\n",
+ " <td>Golden State Warriors</td>\n",
+ " <td>11.370786</td>\n",
+ " </tr>\n",
+ " </tbody>\n",
+ "</table>\n",
+ "</div>"
+ ],
+ "text/plain": [
+ " PLAYER POSITION TEAM SALARY\n",
+ "121 Stephen Curry PG Golden State Warriors 11.370786"
+ ]
+ },
+ "execution_count": 16,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "nba[nba['PLAYER'] == 'Stephen Curry']"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "We can also create a new table called `warriors` consisting of just the data for the Golden State Warriors."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 17,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "<div>\n",
+ "<style scoped>\n",
+ " .dataframe tbody tr th:only-of-type {\n",
+ " vertical-align: middle;\n",
+ " }\n",
+ "\n",
+ " .dataframe tbody tr th {\n",
+ " vertical-align: top;\n",
+ " }\n",
+ "\n",
+ " .dataframe thead th {\n",
+ " text-align: right;\n",
+ " }\n",
+ "</style>\n",
+ "<table border=\"1\" class=\"dataframe\">\n",
+ " <thead>\n",
+ " <tr style=\"text-align: right;\">\n",
+ " <th></th>\n",
+ " <th>PLAYER</th>\n",
+ " <th>POSITION</th>\n",
+ " <th>TEAM</th>\n",
+ " <th>SALARY</th>\n",
+ " </tr>\n",
+ " </thead>\n",
+ " <tbody>\n",
+ " <tr>\n",
+ " <th>117</th>\n",
+ " <td>Klay Thompson</td>\n",
+ " <td>SG</td>\n",
+ " <td>Golden State Warriors</td>\n",
+ " <td>15.501000</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>118</th>\n",
+ " <td>Draymond Green</td>\n",
+ " <td>PF</td>\n",
+ " <td>Golden State Warriors</td>\n",
+ " <td>14.260870</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>119</th>\n",
+ " <td>Andrew Bogut</td>\n",
+ " <td>C</td>\n",
+ " <td>Golden State Warriors</td>\n",
+ " <td>13.800000</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>120</th>\n",
+ " <td>Andre Iguodala</td>\n",
+ " <td>SF</td>\n",
+ " <td>Golden State Warriors</td>\n",
+ " <td>11.710456</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>121</th>\n",
+ " <td>Stephen Curry</td>\n",
+ " <td>PG</td>\n",
+ " <td>Golden State Warriors</td>\n",
+ " <td>11.370786</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>122</th>\n",
+ " <td>Jason Thompson</td>\n",
+ " <td>PF</td>\n",
+ " <td>Golden State Warriors</td>\n",
+ " <td>7.008475</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>123</th>\n",
+ " <td>Shaun Livingston</td>\n",
+ " <td>PG</td>\n",
+ " <td>Golden State Warriors</td>\n",
+ " <td>5.543725</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>124</th>\n",
+ " <td>Harrison Barnes</td>\n",
+ " <td>SF</td>\n",
+ " <td>Golden State Warriors</td>\n",
+ " <td>3.873398</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>125</th>\n",
+ " <td>Marreese Speights</td>\n",
+ " <td>C</td>\n",
+ " <td>Golden State Warriors</td>\n",
+ " <td>3.815000</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>126</th>\n",
+ " <td>Leandro Barbosa</td>\n",
+ " <td>SG</td>\n",
+ " <td>Golden State Warriors</td>\n",
+ " <td>2.500000</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>127</th>\n",
+ " <td>Festus Ezeli</td>\n",
+ " <td>C</td>\n",
+ " <td>Golden State Warriors</td>\n",
+ " <td>2.008748</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>128</th>\n",
+ " <td>Brandon Rush</td>\n",
+ " <td>SF</td>\n",
+ " <td>Golden State Warriors</td>\n",
+ " <td>1.270964</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>129</th>\n",
+ " <td>Kevon Looney</td>\n",
+ " <td>SF</td>\n",
+ " <td>Golden State Warriors</td>\n",
+ " <td>1.131960</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>130</th>\n",
+ " <td>Anderson Varejao</td>\n",
+ " <td>PF</td>\n",
+ " <td>Golden State Warriors</td>\n",
+ " <td>0.289755</td>\n",
+ " </tr>\n",
+ " </tbody>\n",
+ "</table>\n",
+ "</div>"
+ ],
+ "text/plain": [
+ " PLAYER POSITION TEAM SALARY\n",
+ "117 Klay Thompson SG Golden State Warriors 15.501000\n",
+ "118 Draymond Green PF Golden State Warriors 14.260870\n",
+ "119 Andrew Bogut C Golden State Warriors 13.800000\n",
+ "120 Andre Iguodala SF Golden State Warriors 11.710456\n",
+ "121 Stephen Curry PG Golden State Warriors 11.370786\n",
+ "122 Jason Thompson PF Golden State Warriors 7.008475\n",
+ "123 Shaun Livingston PG Golden State Warriors 5.543725\n",
+ "124 Harrison Barnes SF Golden State Warriors 3.873398\n",
+ "125 Marreese Speights C Golden State Warriors 3.815000\n",
+ "126 Leandro Barbosa SG Golden State Warriors 2.500000\n",
+ "127 Festus Ezeli C Golden State Warriors 2.008748\n",
+ "128 Brandon Rush SF Golden State Warriors 1.270964\n",
+ "129 Kevon Looney SF Golden State Warriors 1.131960\n",
+ "130 Anderson Varejao PF Golden State Warriors 0.289755"
+ ]
+ },
+ "execution_count": 17,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "warriors = nba[nba['TEAM'] =='Golden State Warriors']\n",
+ "warriors"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
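+ "By default, `head()` displays the first five rows of a table. You can pass a number to `head` to display more or fewer. To display the entire table, type the name of the DataFrame; pandas abbreviates a long table such as `nba` by showing only its first and last few rows."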
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 18,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "<div>\n",
+ "<style scoped>\n",
+ " .dataframe tbody tr th:only-of-type {\n",
+ " vertical-align: middle;\n",
+ " }\n",
+ "\n",
+ " .dataframe tbody tr th {\n",
+ " vertical-align: top;\n",
+ " }\n",
+ "\n",
+ " .dataframe thead th {\n",
+ " text-align: right;\n",
+ " }\n",
+ "</style>\n",
+ "<table border=\"1\" class=\"dataframe\">\n",
+ " <thead>\n",
+ " <tr style=\"text-align: right;\">\n",
+ " <th></th>\n",
+ " <th>PLAYER</th>\n",
+ " <th>POSITION</th>\n",
+ " <th>TEAM</th>\n",
+ " <th>SALARY</th>\n",
+ " </tr>\n",
+ " </thead>\n",
+ " <tbody>\n",
+ " <tr>\n",
+ " <th>117</th>\n",
+ " <td>Klay Thompson</td>\n",
+ " <td>SG</td>\n",
+ " <td>Golden State Warriors</td>\n",
+ " <td>15.501000</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>118</th>\n",
+ " <td>Draymond Green</td>\n",
+ " <td>PF</td>\n",
+ " <td>Golden State Warriors</td>\n",
+ " <td>14.260870</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>119</th>\n",
+ " <td>Andrew Bogut</td>\n",
+ " <td>C</td>\n",
+ " <td>Golden State Warriors</td>\n",
+ " <td>13.800000</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>120</th>\n",
+ " <td>Andre Iguodala</td>\n",
+ " <td>SF</td>\n",
+ " <td>Golden State Warriors</td>\n",
+ " <td>11.710456</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>121</th>\n",
+ " <td>Stephen Curry</td>\n",
+ " <td>PG</td>\n",
+ " <td>Golden State Warriors</td>\n",
+ " <td>11.370786</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>122</th>\n",
+ " <td>Jason Thompson</td>\n",
+ " <td>PF</td>\n",
+ " <td>Golden State Warriors</td>\n",
+ " <td>7.008475</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>123</th>\n",
+ " <td>Shaun Livingston</td>\n",
+ " <td>PG</td>\n",
+ " <td>Golden State Warriors</td>\n",
+ " <td>5.543725</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>124</th>\n",
+ " <td>Harrison Barnes</td>\n",
+ " <td>SF</td>\n",
+ " <td>Golden State Warriors</td>\n",
+ " <td>3.873398</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>125</th>\n",
+ " <td>Marreese Speights</td>\n",
+ " <td>C</td>\n",
+ " <td>Golden State Warriors</td>\n",
+ " <td>3.815000</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>126</th>\n",
+ " <td>Leandro Barbosa</td>\n",
+ " <td>SG</td>\n",
+ " <td>Golden State Warriors</td>\n",
+ " <td>2.500000</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>127</th>\n",
+ " <td>Festus Ezeli</td>\n",
+ " <td>C</td>\n",
+ " <td>Golden State Warriors</td>\n",
+ " <td>2.008748</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>128</th>\n",
+ " <td>Brandon Rush</td>\n",
+ " <td>SF</td>\n",
+ " <td>Golden State Warriors</td>\n",
+ " <td>1.270964</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>129</th>\n",
+ " <td>Kevon Looney</td>\n",
+ " <td>SF</td>\n",
+ " <td>Golden State Warriors</td>\n",
+ " <td>1.131960</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>130</th>\n",
+ " <td>Anderson Varejao</td>\n",
+ " <td>PF</td>\n",
+ " <td>Golden State Warriors</td>\n",
+ " <td>0.289755</td>\n",
+ " </tr>\n",
+ " </tbody>\n",
+ "</table>\n",
+ "</div>"
+ ],
+ "text/plain": [
+ " PLAYER POSITION TEAM SALARY\n",
+ "117 Klay Thompson SG Golden State Warriors 15.501000\n",
+ "118 Draymond Green PF Golden State Warriors 14.260870\n",
+ "119 Andrew Bogut C Golden State Warriors 13.800000\n",
+ "120 Andre Iguodala SF Golden State Warriors 11.710456\n",
+ "121 Stephen Curry PG Golden State Warriors 11.370786\n",
+ "122 Jason Thompson PF Golden State Warriors 7.008475\n",
+ "123 Shaun Livingston PG Golden State Warriors 5.543725\n",
+ "124 Harrison Barnes SF Golden State Warriors 3.873398\n",
+ "125 Marreese Speights C Golden State Warriors 3.815000\n",
+ "126 Leandro Barbosa SG Golden State Warriors 2.500000\n",
+ "127 Festus Ezeli C Golden State Warriors 2.008748\n",
+ "128 Brandon Rush SF Golden State Warriors 1.270964\n",
+ "129 Kevon Looney SF Golden State Warriors 1.131960\n",
+ "130 Anderson Varejao PF Golden State Warriors 0.289755"
+ ]
+ },
+ "execution_count": 18,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "warriors"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "The `nba` table is sorted in alphabetical order of the team names. To see how the players were paid in 2015-2016, it is useful to sort the data by salary. Remember that by default, the sorting is in increasing order."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 19,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "<div>\n",
+ "<style scoped>\n",
+ " .dataframe tbody tr th:only-of-type {\n",
+ " vertical-align: middle;\n",
+ " }\n",
+ "\n",
+ " .dataframe tbody tr th {\n",
+ " vertical-align: top;\n",
+ " }\n",
+ "\n",
+ " .dataframe thead th {\n",
+ " text-align: right;\n",
+ " }\n",
+ "</style>\n",
+ "<table border=\"1\" class=\"dataframe\">\n",
+ " <thead>\n",
+ " <tr style=\"text-align: right;\">\n",
+ " <th></th>\n",
+ " <th>PLAYER</th>\n",
+ " <th>POSITION</th>\n",
+ " <th>TEAM</th>\n",
+ " <th>SALARY</th>\n",
+ " </tr>\n",
+ " </thead>\n",
+ " <tbody>\n",
+ " <tr>\n",
+ " <th>267</th>\n",
+ " <td>Thanasis Antetokounmpo</td>\n",
+ " <td>SF</td>\n",
+ " <td>New York Knicks</td>\n",
+ " <td>0.030888</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>327</th>\n",
+ " <td>Cory Jefferson</td>\n",
+ " <td>PF</td>\n",
+ " <td>Phoenix Suns</td>\n",
+ " <td>0.049709</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>326</th>\n",
+ " <td>Jordan McRae</td>\n",
+ " <td>SG</td>\n",
+ " <td>Phoenix Suns</td>\n",
+ " <td>0.049709</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>324</th>\n",
+ " <td>Orlando Johnson</td>\n",
+ " <td>SG</td>\n",
+ " <td>Phoenix Suns</td>\n",
+ " <td>0.055722</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>325</th>\n",
+ " <td>Phil Pressey</td>\n",
+ " <td>PG</td>\n",
+ " <td>Phoenix Suns</td>\n",
+ " <td>0.055722</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>...</th>\n",
+ " <td>...</td>\n",
+ " <td>...</td>\n",
+ " <td>...</td>\n",
+ " <td>...</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>131</th>\n",
+ " <td>Dwight Howard</td>\n",
+ " <td>C</td>\n",
+ " <td>Houston Rockets</td>\n",
+ " <td>22.359364</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>255</th>\n",
+ " <td>Carmelo Anthony</td>\n",
+ " <td>SF</td>\n",
+ " <td>New York Knicks</td>\n",
+ " <td>22.875000</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>72</th>\n",
+ " <td>LeBron James</td>\n",
+ " <td>SF</td>\n",
+ " <td>Cleveland Cavaliers</td>\n",
+ " <td>22.970500</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>29</th>\n",
+ " <td>Joe Johnson</td>\n",
+ " <td>SF</td>\n",
+ " <td>Brooklyn Nets</td>\n",
+ " <td>24.894863</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>169</th>\n",
+ " <td>Kobe Bryant</td>\n",
+ " <td>SF</td>\n",
+ " <td>Los Angeles Lakers</td>\n",
+ " <td>25.000000</td>\n",
+ " </tr>\n",
+ " </tbody>\n",
+ "</table>\n",
+ "<p>417 rows × 4 columns</p>\n",
+ "</div>"
+ ],
+ "text/plain": [
+ " PLAYER POSITION TEAM SALARY\n",
+ "267 Thanasis Antetokounmpo SF New York Knicks 0.030888\n",
+ "327 Cory Jefferson PF Phoenix Suns 0.049709\n",
+ "326 Jordan McRae SG Phoenix Suns 0.049709\n",
+ "324 Orlando Johnson SG Phoenix Suns 0.055722\n",
+ "325 Phil Pressey PG Phoenix Suns 0.055722\n",
+ ".. ... ... ... ...\n",
+ "131 Dwight Howard C Houston Rockets 22.359364\n",
+ "255 Carmelo Anthony SF New York Knicks 22.875000\n",
+ "72 LeBron James SF Cleveland Cavaliers 22.970500\n",
+ "29 Joe Johnson SF Brooklyn Nets 24.894863\n",
+ "169 Kobe Bryant SF Los Angeles Lakers 25.000000\n",
+ "\n",
+ "[417 rows x 4 columns]"
+ ]
+ },
+ "execution_count": 19,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "nba.sort_values('SALARY')"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "These figures are somewhat difficult to compare as some of these players changed teams during the season and received salaries from more than one team; only the salary from the last team appears in the table. \n",
+ "\n",
+ "The CNN report is about the other end of the salary scale – the players who are among the highest paid in the world. To identify these players we can sort in descending order of salary and look at the top few rows."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 20,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "<div>\n",
+ "<style scoped>\n",
+ " .dataframe tbody tr th:only-of-type {\n",
+ " vertical-align: middle;\n",
+ " }\n",
+ "\n",
+ " .dataframe tbody tr th {\n",
+ " vertical-align: top;\n",
+ " }\n",
+ "\n",
+ " .dataframe thead th {\n",
+ " text-align: right;\n",
+ " }\n",
+ "</style>\n",
+ "<table border=\"1\" class=\"dataframe\">\n",
+ " <thead>\n",
+ " <tr style=\"text-align: right;\">\n",
+ " <th></th>\n",
+ " <th>PLAYER</th>\n",
+ " <th>POSITION</th>\n",
+ " <th>TEAM</th>\n",
+ " <th>SALARY</th>\n",
+ " </tr>\n",
+ " </thead>\n",
+ " <tbody>\n",
+ " <tr>\n",
+ " <th>169</th>\n",
+ " <td>Kobe Bryant</td>\n",
+ " <td>SF</td>\n",
+ " <td>Los Angeles Lakers</td>\n",
+ " <td>25.000000</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>29</th>\n",
+ " <td>Joe Johnson</td>\n",
+ " <td>SF</td>\n",
+ " <td>Brooklyn Nets</td>\n",
+ " <td>24.894863</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>72</th>\n",
+ " <td>LeBron James</td>\n",
+ " <td>SF</td>\n",
+ " <td>Cleveland Cavaliers</td>\n",
+ " <td>22.970500</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>255</th>\n",
+ " <td>Carmelo Anthony</td>\n",
+ " <td>SF</td>\n",
+ " <td>New York Knicks</td>\n",
+ " <td>22.875000</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>131</th>\n",
+ " <td>Dwight Howard</td>\n",
+ " <td>C</td>\n",
+ " <td>Houston Rockets</td>\n",
+ " <td>22.359364</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>...</th>\n",
+ " <td>...</td>\n",
+ " <td>...</td>\n",
+ " <td>...</td>\n",
+ " <td>...</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>200</th>\n",
+ " <td>Elliot Williams</td>\n",
+ " <td>SG</td>\n",
+ " <td>Memphis Grizzlies</td>\n",
+ " <td>0.055722</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>324</th>\n",
+ " <td>Orlando Johnson</td>\n",
+ " <td>SG</td>\n",
+ " <td>Phoenix Suns</td>\n",
+ " <td>0.055722</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>327</th>\n",
+ " <td>Cory Jefferson</td>\n",
+ " <td>PF</td>\n",
+ " <td>Phoenix Suns</td>\n",
+ " <td>0.049709</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>326</th>\n",
+ " <td>Jordan McRae</td>\n",
+ " <td>SG</td>\n",
+ " <td>Phoenix Suns</td>\n",
+ " <td>0.049709</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>267</th>\n",
+ " <td>Thanasis Antetokounmpo</td>\n",
+ " <td>SF</td>\n",
+ " <td>New York Knicks</td>\n",
+ " <td>0.030888</td>\n",
+ " </tr>\n",
+ " </tbody>\n",
+ "</table>\n",
+ "<p>417 rows × 4 columns</p>\n",
+ "</div>"
+ ],
+ "text/plain": [
+ " PLAYER POSITION TEAM SALARY\n",
+ "169 Kobe Bryant SF Los Angeles Lakers 25.000000\n",
+ "29 Joe Johnson SF Brooklyn Nets 24.894863\n",
+ "72 LeBron James SF Cleveland Cavaliers 22.970500\n",
+ "255 Carmelo Anthony SF New York Knicks 22.875000\n",
+ "131 Dwight Howard C Houston Rockets 22.359364\n",
+ ".. ... ... ... ...\n",
+ "200 Elliot Williams SG Memphis Grizzlies 0.055722\n",
+ "324 Orlando Johnson SG Phoenix Suns 0.055722\n",
+ "327 Cory Jefferson PF Phoenix Suns 0.049709\n",
+ "326 Jordan McRae SG Phoenix Suns 0.049709\n",
+ "267 Thanasis Antetokounmpo SF New York Knicks 0.030888\n",
+ "\n",
+ "[417 rows x 4 columns]"
+ ]
+ },
+ "execution_count": 20,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "nba.sort_values('SALARY', ascending=False)"
+ ]
+ },
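+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "To look at only the top few rows, the sorted table can be followed by `head`. The cell below is a minimal sketch rather than part of the original analysis: `sample` is a hypothetical four-row table built from rows shown in the outputs above, so that the cell runs on its own. With the full data, the same pattern would be `nba.sort_values('SALARY', ascending=False).head(3)`."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import pandas as pd\n",
+ "\n",
+ "# Hypothetical stand-in for `nba`, using four rows copied from the outputs above\n",
+ "sample = pd.DataFrame({\n",
+ "    'PLAYER': ['Thanasis Antetokounmpo', 'LeBron James', 'Joe Johnson', 'Kobe Bryant'],\n",
+ "    'SALARY': [0.030888, 22.970500, 24.894863, 25.000000],\n",
+ "})\n",
+ "\n",
+ "# Sort from highest to lowest salary, then keep only the first three rows\n",
+ "sample.sort_values('SALARY', ascending=False).head(3)"
+ ]
+ },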
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Kobe Bryant, since retired, was the highest-earning NBA player in 2015-2016."
+ "Computers are designed to perform numerical calculations, but there are some important details about working with numbers that every programmer working with quantitative data should know. Python (and most other programming languages) distinguishes between two different types of numbers:\n",
+ "\n",
+ "* Integers are called `int` values in the Python language. They can only represent whole numbers (negative, zero, or positive) that don't have a fractional component.\n",
+ "* Real numbers are called `float` values (or *floating point values*) in the Python language. They can represent whole or fractional numbers but have some limitations.\n",
+ "\n",
+ "The type of a number is usually evident from the way it is displayed: `int` values have no decimal point, while `float` values are displayed with a decimal point or in scientific notation."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 1,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "2"
+ ]
+ },
+ "execution_count": 1,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "# Some int values\n",
+ "2"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 2,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "4"
+ ]
+ },
+ "execution_count": 2,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "1 + 3"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 3,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "-1234567890000000000"
+ ]
+ },
+ "execution_count": 3,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "-1234567890000000000"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 4,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "1.2"
+ ]
+ },
+ "execution_count": 4,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "# Some float values\n",
+ "1.2"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 5,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "3.0"
+ ]
+ },
+ "execution_count": 5,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "3.0"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "When a `float` value is combined with an `int` value using some arithmetic operator, then the result is always a `float` value. In most cases, two integers combine to form another integer, but any number (`int` or `float`) divided by another using the `/` operator will be a `float` value. Very large or very small `float` values are displayed using scientific notation."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 6,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "3.5"
+ ]
+ },
+ "execution_count": 6,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "1.5 + 2"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 7,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "3.0"
+ ]
+ },
+ "execution_count": 7,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "3 / 1"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 8,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "-1.23456789e+19"
+ ]
+ },
+ "execution_count": 8,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "-12345678900000000000.0"
+ ]
+ },
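+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "The division rule is specific to the `/` operator. As a brief illustrative sketch (the `//` operator is not used elsewhere in this section), Python also has floor division, which rounds down and returns an `int` when both operands are `int` values."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# True division always produces a float, even when the result is a whole number\n",
+ "print(type(6 / 3), 6 / 3)\n",
+ "\n",
+ "# Floor division of two ints rounds down and produces an int\n",
+ "print(type(7 // 2), 7 // 2)\n",
+ "\n",
+ "# Multiplying two ints produces an int\n",
+ "print(type(2 * 3), 2 * 3)"
+ ]
+ },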
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "The `type` function can be used to find the type of any number."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 9,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "int"
+ ]
+ },
+ "execution_count": 9,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "type(3)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 10,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "float"
+ ]
+ },
+ "execution_count": 10,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "type(3 / 1)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "The `type` of an expression is the type of its final value. So, the `type` function will never indicate that the type of an expression is a name, because names are always evaluated to their assigned values."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 11,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "int"
+ ]
+ },
+ "execution_count": 11,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "x = 3\n",
+ "type(x) # The type of x is an int, not a name"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 12,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "float"
+ ]
+ },
+ "execution_count": 12,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "type(x + 2.5)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## More About Float Values\n",
+ "\n",
+ "Float values are very flexible, but they do have limits. \n",
+ "\n",
+ "1. A `float` can represent extremely large and extremely small numbers. There are limits, but you will rarely encounter them.\n",
+ "2. A `float` only represents 15 or 16 significant digits for any number; the remaining precision is lost. This limited precision is enough for the vast majority of applications.\n",
+ "3. After combining `float` values with arithmetic, the last few digits may be incorrect. Small rounding errors are often confusing when first encountered.\n",
+ "\n",
+ "The first limit can be observed in two ways. If the result of a computation is too large to be represented, then it is represented as infinite (`inf`). If the result is too close to zero to be represented, then it is represented as zero."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 13,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "2e+307"
+ ]
+ },
+ "execution_count": 13,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "2e306 * 10"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 14,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "inf"
+ ]
+ },
+ "execution_count": 14,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "2e306 * 100"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 15,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "2e-323"
+ ]
+ },
+ "execution_count": 15,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "2e-322 / 10"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 16,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "0.0"
+ ]
+ },
+ "execution_count": 16,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "2e-322 / 100"
+ ]
+ },
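+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "The exact limits can be inspected with `sys.float_info` from Python's standard library. This is a small sketch for the curious and is not needed for the rest of the text; the printed values depend on the platform, though most computers report the same ones."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import sys\n",
+ "\n",
+ "# Largest finite float; multiplying it by 10 overflows to inf\n",
+ "print(sys.float_info.max)\n",
+ "print(sys.float_info.max * 10)\n",
+ "\n",
+ "# Smallest positive float that keeps full precision; values closer to zero\n",
+ "# lose precision gradually and eventually become 0.0\n",
+ "print(sys.float_info.min)\n",
+ "\n",
+ "# Number of decimal digits that a float can always represent faithfully\n",
+ "print(sys.float_info.dig)"
+ ]
+ },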
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "The second limit can be observed in an expression that involves numbers with more than 15 significant digits. These extra digits are discarded before any arithmetic is carried out."
+ "The third limit can be observed when taking the difference between two expressions that should be equivalent. For example, the expression `2 ** 0.5` computes the square root of 2, but squaring this value does not exactly recover 2."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 18,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "1.4142135623730951"
+ ]
+ },
+ "execution_count": 18,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "2 ** 0.5"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 19,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "2.0000000000000004"
+ ]
+ },
+ "execution_count": 19,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "(2 ** 0.5) * (2 ** 0.5)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 20,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "4.440892098500626e-16"
+ ]
+ },
+ "execution_count": 20,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "(2 ** 0.5) * (2 ** 0.5) - 2"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "The final result above is `0.0000000000000004440892098500626`, a number that is very close to zero. The correct answer to this arithmetic expression is 0, but a small error in the final significant digit appears very different in scientific notation. This behavior appears in almost all programming languages because it is the result of the standard way that arithmetic is carried out on computers. \n",
+ "\n",
+ "Although `float` values are not always exact, they are certainly reliable and work the same way across all different kinds of computers and programming languages. "
+ "Much of the world's data is text, and a piece of text represented in a computer is called a *string*. A string can represent a word, a sentence, or even the contents of every book in a library. Since text can include numbers (like this: 5) or truth values (True), a string can also describe those things.\n",
+ "\n",
+ "The meaning of an expression depends both upon its structure and the types of values that are being combined. So, for instance, adding two strings together produces another string. This expression is still an addition expression, but it is combining a different type of value."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 1,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "'datascience'"
+ ]
+ },
+ "execution_count": 1,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "\"data\" + \"science\""
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Addition is completely literal; it combines these two strings together without regard for their contents. It doesn't add a space just because these are different words; adding one is up to the programmer (you)."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 2,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "'data science'"
+ ]
+ },
+ "execution_count": 2,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "\"data\" + \" \" + \"science\""
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Single and double quotes can both be used to create strings: `'hi'` and `\"hi\"` are identical expressions. Double quotes are often preferred because they allow you to include apostrophes inside of strings."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 3,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "\"This won't work with a single-quoted string!\""
+ ]
+ },
+ "execution_count": 3,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "\"This won't work with a single-quoted string!\""
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Why not? Try it out."
+ ]
+ },
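+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "If you do want to keep single quotes, one option is to mark the apostrophe with a backslash so that Python does not treat it as the end of the string. The cell below is a small sketch of that idea; the names `escaped` and `plain` are introduced here only for illustration."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# The backslash tells Python that this apostrophe is part of the text\n",
+ "escaped = 'This won\\'t end the string early'\n",
+ "\n",
+ "# The same text written with double quotes needs no backslash\n",
+ "plain = \"This won't end the string early\"\n",
+ "\n",
+ "# Both expressions describe the same string\n",
+ "escaped == plain"
+ ]
+ },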
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "The `str` function returns a string representation of any value. Using this function, strings can be constructed that have embedded values."
+ "From an existing string, related strings can be constructed using string methods, which are functions that operate on strings. These methods are called by placing a dot after the string, then calling the function.\n",
+ "\n",
+ "For example, the following method generates an uppercased version of a string."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 1,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "'LOUD'"
+ ]
+ },
+ "execution_count": 1,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "\"loud\".upper()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Perhaps the most important method is `replace`, which replaces all instances of a substring within the string. The `replace` method takes two arguments, the text to be replaced and its replacement."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 2,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "'matchmaker'"
+ ]
+ },
+ "execution_count": 2,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "'hitchhiker'.replace('hi', 'ma')"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "String methods can also be invoked using variable names, as long as those names are bound to strings. So, for instance, the following two-step process generates the word \"degrade\" starting from \"train\" by first creating \"ingrain\" and then applying a second replacement."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 3,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "'degrade'"
+ ]
+ },
+ "execution_count": 3,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "s = \"train\"\n",
+ "t = s.replace('t', 'ing')\n",
+ "u = t.replace('in', 'de')\n",
+ "u"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Note that the line `t = s.replace('t', 'ing')` doesn't change the string `s`, which is still \"train\". The method call `s.replace('t', 'ing')` just has a value, which is the string \"ingrain\"."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 4,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "'train'"
+ ]
+ },
+ "execution_count": 4,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "s"
+ ]
+ },
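Because each call to `replace` returns a new string rather than changing the original, the two steps above can also be chained into a single expression. The following is only a small sketch of that idea; it reuses the names `s` and `u` from the example above for illustration.

```python
# Strings are immutable: each replace call returns a new string.
s = "train"

# Chain the two replacements instead of naming the intermediate result.
u = s.replace("t", "ing").replace("in", "de")

print(u)  # degrade
print(s)  # train (the original string is unchanged)
```

Naming the intermediate string, as in the two-step version, is often easier to read; chaining is simply a more compact way to write the same computation.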
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "This is the first time we've seen methods, but methods are not unique to strings. As we will see shortly, other types of objects can have them."
+ "From an existing string, related strings can be constructed using string methods, which are functions that operate on strings. These methods are called by placing a dot after the string, then calling the function.\n",
+ "\n",
+ "For example, the following method generates an uppercased version of a string."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 1,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "'LOUD'"
+ ]
+ },
+ "execution_count": 1,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "\"loud\".upper()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Perhaps the most important method is `replace`, which replaces all instances of a substring within the string. The `replace` method takes two arguments, the text to be replaced and its replacement."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 2,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "'matchmaker'"
+ ]
+ },
+ "execution_count": 2,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "'hitchhiker'.replace('hi', 'ma')"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "String methods can also be invoked using variable names, as long as those names are bound to strings. So, for instance, the following two-step process generates the word \"degrade\" starting from \"train\" by first creating \"ingrain\" and then applying a second replacement."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 3,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "'degrade'"
+ ]
+ },
+ "execution_count": 3,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "s = \"train\"\n",
+ "t = s.replace('t', 'ing')\n",
+ "u = t.replace('in', 'de')\n",
+ "u"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Note that the line `t = s.replace('t', 'ing')` doesn't change the string `s`, which is still \"train\". The method call `s.replace('t', 'ing')` just has a value, which is the string \"ingrain\"."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 4,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "'train'"
+ ]
+ },
+ "execution_count": 4,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "s"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "This is the first time we've seen methods, but methods are not unique to strings. As we will see shortly, other types of objects can have them."
+ "Much of the world's data is text, and a piece of text represented in a computer is called a *string*. A string can represent a word, a sentence, or even the contents of every book in a library. Since text can include numbers (like this: 5) or truth values (True), a string can also describe those things.\n",
+ "\n",
+ "The meaning of an expression depends both upon its structure and the types of values that are being combined. So, for instance, adding two strings together produces another string. This expression is still an addition expression, but it is combining a different type of value."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 1,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "'datascience'"
+ ]
+ },
+ "execution_count": 1,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "\"data\" + \"science\""
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Addition is completely literal; it combines these two strings together without regard for their contents. It doesn't add a space because these are different words; that's up to the programmer (you) to specify."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 2,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "'data science'"
+ ]
+ },
+ "execution_count": 2,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "\"data\" + \" \" + \"science\""
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Single and double quotes can both be used to create strings: `'hi'` and `\"hi\"` are identical expressions. Double quotes are often preferred because they allow you to include apostrophes inside of strings."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 3,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "\"This won't work with a single-quoted string!\""
+ ]
+ },
+ "execution_count": 3,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "\"This won't work with a single-quoted string!\""
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Why not? Try it out."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "The `str` function returns a string representation of any value. Using this function, strings can be constructed that have embedded values."
+ "Boolean values most often arise from comparison operators. Python includes a variety of operators that compare values. For example, `3` is larger than `1 + 1`."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 1,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "True"
+ ]
+ },
+ "execution_count": 1,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "3 > 1 + 1"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "The value `True` indicates that the comparison is valid; Python has confirmed this simple fact about the relationship between `3` and `1+1`. The full set of common comparison operators are listed below.\n",
+ "\n",
+ "| Comparison | Operator | True example | False Example |\n",
+ "An expression can contain multiple comparisons, and they all must hold in order for the whole expression to be `True`. For example, we can express that `1+1` is between `1` and `3` using the following expression."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 2,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "True"
+ ]
+ },
+ "execution_count": 2,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "1 < 1 + 1 < 3"
+ ]
+ },
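A chained comparison such as the one above behaves like its individual comparisons joined with `and`. The sketch below illustrates this; the names `x`, `chained`, and `spelled_out` are made up for this example.

```python
x = 1 + 1

# A chained comparison holds only if every individual comparison holds.
chained = 1 < x < 3
spelled_out = (1 < x) and (x < 3)

print(chained, spelled_out)  # True True
print(1 < x < 2)             # False, because x < 2 does not hold
```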
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "The average of two numbers is always between the smaller number and the larger number. We express this relationship for the numbers `x` and `y` below. You can try different values of `x` and `y` to confirm this relationship."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 3,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "True"
+ ]
+ },
+ "execution_count": 3,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "x = 12\n",
+ "y = 5\n",
+ "min(x, y) <= (x+y)/2 <= max(x, y)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Strings can also be compared, and their order is alphabetical. A shorter string is less than a longer string that begins with the shorter string."
+ "Boolean values most often arise from comparison operators. Python includes a variety of operators that compare values. For example, `3` is larger than `1 + 1`."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 1,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "True"
+ ]
+ },
+ "execution_count": 1,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "3 > 1 + 1"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "The value `True` indicates that the comparison is valid; Python has confirmed this simple fact about the relationship between `3` and `1+1`. The full set of common comparison operators are listed below.\n",
+ "\n",
+ "| Comparison | Operator | True example | False Example |\n",
+ "An expression can contain multiple comparisons, and they all must hold in order for the whole expression to be `True`. For example, we can express that `1+1` is between `1` and `3` using the following expression."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 2,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "True"
+ ]
+ },
+ "execution_count": 2,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "1 < 1 + 1 < 3"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "The average of two numbers is always between the smaller number and the larger number. We express this relationship for the numbers `x` and `y` below. You can try different values of `x` and `y` to confirm this relationship."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 3,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "True"
+ ]
+ },
+ "execution_count": 3,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "x = 12\n",
+ "y = 5\n",
+ "min(x, y) <= (x+y)/2 <= max(x, y)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Strings can also be compared, and their order is alphabetical. A shorter string is less than a longer string that begins with the shorter string."
+ "Every value has a type, and the built-in `type` function returns the type of the result of any expression."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "One type we have encountered already is a built-in function. Python indicates that the type is a `builtin_function_or_method`; the distinction between a *function* and a *method* is not important at this stage."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "type(abs)"
+ ]
+ },
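To get a feel for `type`, it can help to apply it to a few values of different kinds. The following sketch is only an illustration; the list `values` and the name `type_names` are invented for this example.

```python
# A few values of different kinds, including the built-in function abs.
values = [3, 3.0, "3", 3 > 2, abs]

# Collect the name of each value's type.
type_names = [type(value).__name__ for value in values]

for value, name in zip(values, type_names):
    print(repr(value), "has type", name)
```

Note that `3`, `3.0`, and `"3"` look similar but have three different types, which is why expressions that combine them can behave differently.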
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "This chapter will explore many useful types of data."
+ "While there are many kinds of collections in Python, we will work primarily with arrays in this class. We've already seen that the `make_array` function can be used to create arrays of numbers.\n",
+ "\n",
+ "Arrays can also contain strings or other types of values, but a single array can only contain a single kind of data. (It usually doesn't make sense to group together unlike data anyway.) For example:"
+ "Returning to the temperature data, we create arrays of average daily [high temperatures](http://berkeleyearth.lbl.gov/auto/Regional/TMAX/Text/global-land-TMAX-Trend.txt) for the decades surrounding 1850, 1900, 1950, and 2000."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 3,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "array([13.6 , 14.387, 14.585, 15.164])"
+ ]
+ },
+ "execution_count": 3,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "baseline_high = 14.48\n",
+ "highs = np.array([baseline_high - 0.880, \n",
+ " baseline_high - 0.093,\n",
+ " baseline_high + 0.105, \n",
+ " baseline_high + 0.684])\n",
+ "highs"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Arrays can be used in arithmetic expressions to compute over their contents. When an array is combined with a single number, that number is combined with each element of the array. Therefore, we can convert all of these temperatures to Fahrenheit by writing the familiar conversion formula."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 4,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "array([56.48 , 57.8966, 58.253 , 59.2952])"
+ ]
+ },
+ "execution_count": 4,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "(9/5) * highs + 32"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "<img src=\"array_arithmetic.png\" />"
+ ]
+ },
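One way to check the claim that a single number is combined with each element is to apply the conversion to the elements one at a time and compare. This is a minimal sketch, assuming NumPy is installed; the names `celsius`, `fahrenheit`, and `one_at_a_time` are illustrative, and the temperatures are copied from the rounded output shown above.

```python
import numpy as np

# Temperatures in Celsius, copied from the displayed contents of highs.
celsius = np.array([13.6, 14.387, 14.585, 15.164])

# Array arithmetic: the scalars 9/5 and 32 are applied to every element.
fahrenheit = (9 / 5) * celsius + 32

# The same conversion written out element by element.
one_at_a_time = np.array([(9 / 5) * c + 32 for c in celsius])

print(fahrenheit)
print(np.allclose(fahrenheit, one_at_a_time))  # True
```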
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Arrays also have *methods*, which are functions that operate on the array values. The `mean` of a collection of numbers is its average value: the sum divided by the length. Each pair of parentheses in the examples below is part of a call expression; it's calling a function with no arguments to perform a computation on the array called `highs`."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 5,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "4"
+ ]
+ },
+ "execution_count": 5,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "highs.size"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 6,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "57.736000000000004"
+ ]
+ },
+ "execution_count": 6,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "highs.sum()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 7,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "14.434000000000001"
+ ]
+ },
+ "execution_count": 7,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "highs.mean()"
+ ]
+ },
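The three results above are related: the mean is the sum divided by the size. A short sketch of that check follows, assuming NumPy is installed; the name `by_hand` is made up here, and the values are copied from the rounded output of `highs`.

```python
import numpy as np

highs = np.array([13.6, 14.387, 14.585, 15.164])

# The mean computed from the sum and the number of elements.
by_hand = highs.sum() / highs.size

print(by_hand)
print(highs.mean())
```

The two printed numbers should agree up to floating-point rounding.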
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Functions on Arrays\n",
+ "The `numpy` package, abbreviated `np` in programs, provides Python programmers with convenient and powerful functions for creating and manipulating arrays."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 8,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import numpy as np"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "For example, the `diff` function computes the difference between each adjacent pair of elements in an array. The first element of the `diff` is the second element minus the first. "
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 9,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "array([0.787, 0.198, 0.579])"
+ ]
+ },
+ "execution_count": 9,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "np.diff(highs)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "The [full Numpy reference](http://docs.scipy.org/doc/numpy/reference/) lists these functions exhaustively, but only a small subset are used commonly for data processing applications. These are grouped into different packages within `np`. Learning this vocabulary is an important part of learning the Python language, so refer back to this list often as you work through examples and problems.\n",
+ "\n",
+ "However, you **don't need to memorize these**. Use this as a reference.\n",
+ "\n",
+ "Each of these functions takes an array as an argument and returns a single value.\n",
+ "While there are many kinds of collections in Python, we will work primarily with arrays in this class. We've already seen that the `make_array` function can be used to create arrays of numbers.\n",
+ "\n",
+ "Arrays can also contain strings or other types of values, but a single array can only contain a single kind of data. (It usually doesn't make sense to group together unlike data anyway.) For example:"
+ "Returning to the temperature data, we create arrays of average daily [high temperatures](http://berkeleyearth.lbl.gov/auto/Regional/TMAX/Text/global-land-TMAX-Trend.txt) for the decades surrounding 1850, 1900, 1950, and 2000."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 3,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "array([13.6 , 14.387, 14.585, 15.164])"
+ ]
+ },
+ "execution_count": 3,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "baseline_high = 14.48\n",
+ "highs = np.array([baseline_high - 0.880, \n",
+ " baseline_high - 0.093,\n",
+ " baseline_high + 0.105, \n",
+ " baseline_high + 0.684])\n",
+ "highs"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Arrays can be used in arithmetic expressions to compute over their contents. When an array is combined with a single number, that number is combined with each element of the array. Therefore, we can convert all of these temperatures to Fahrenheit by writing the familiar conversion formula."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 4,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "array([56.48 , 57.8966, 58.253 , 59.2952])"
+ ]
+ },
+ "execution_count": 4,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "(9/5) * highs + 32"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "<img src=\"array_arithmetic.png\" />"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Arrays also have *methods*, which are functions that operate on the array values. The `mean` of a collection of numbers is its average value: the sum divided by the length. Each pair of parentheses in the examples below is part of a call expression; it's calling a function with no arguments to perform a computation on the array called `highs`."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 5,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "4"
+ ]
+ },
+ "execution_count": 5,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "highs.size"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 6,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "57.736000000000004"
+ ]
+ },
+ "execution_count": 6,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "highs.sum()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 7,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "14.434000000000001"
+ ]
+ },
+ "execution_count": 7,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "highs.mean()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Functions on Arrays\n",
+ "The `numpy` package, abbreviated `np` in programs, provides Python programmers with convenient and powerful functions for creating and manipulating arrays."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 8,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import numpy as np"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "For example, the `diff` function computes the difference between each adjacent pair of elements in an array. The first element of the `diff` is the second element minus the first. "
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 9,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "array([0.787, 0.198, 0.579])"
+ ]
+ },
+ "execution_count": 9,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "np.diff(highs)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "The [full Numpy reference](http://docs.scipy.org/doc/numpy/reference/) lists these functions exhaustively, but only a small subset are used commonly for data processing applications. These are grouped into different packages within `np`. Learning this vocabulary is an important part of learning the Python language, so refer back to this list often as you work through examples and problems.\n",
+ "\n",
+ "However, you **don't need to memorize these**. Use this as a reference.\n",
+ "\n",
+ "Each of these functions takes an array as an argument and returns a single value.\n",
+ "A *range* is an array of numbers in increasing or decreasing order, each separated by a regular interval. \n",
+ "Ranges are useful in a surprisingly large number of situations, so it's worthwhile to learn about them.\n",
+ "\n",
+ "Ranges are defined using the `np.arange` function, which takes either one, two, or three arguments: a start, and end, and a 'step'.\n",
+ "\n",
+ "If you pass one argument to `np.arange`, this becomes the `end` value, with `start=0`, `step=1` assumed. Two arguments give the `start` and `end` with `step=1` assumed. Three arguments give the `start`, `end` and `step` explicitly.\n",
+ "\n",
+ "A range always includes its `start` value, but does not include its `end` value. It counts up by `step`, and it stops before it gets to the `end`.\n",
+ "\n",
+ " np.arange(end): An array starting with 0 of increasing consecutive integers, stopping before end."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 2,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "array([0, 1, 2, 3, 4])"
+ ]
+ },
+ "execution_count": 2,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "np.arange(5)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Notice how the array starts at 0 and goes only up to 4, not to the end value of 5."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "\n",
+ " np.arange(start, end): An array of consecutive increasing integers from start, stopping before end."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 3,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "array([3, 4, 5, 6, 7, 8])"
+ ]
+ },
+ "execution_count": 3,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "np.arange(3, 9)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "\n",
+ " np.arange(start, end, step): A range with a difference of step between each pair of consecutive values, starting from start and stopping before end."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 4,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "array([ 3, 8, 13, 18, 23, 28])"
+ ]
+ },
+ "execution_count": 4,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "np.arange(3, 30, 5)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "This array starts at 3, then takes a step of 5 to get to 8, then another step of 5 to get to 13, and so on.\n",
+ "\n",
+ "When you specify a step, the start, end, and step can all be either positive or negative and may be whole numbers or fractions. "
+ "Though some math is needed to establish this, we can use arrays to convince ourselves that the formula works. Let's calculate the first 5000 terms of Leibniz's infinite sum and see if it is close to $\\pi$.\n",
+ "We will calculate this finite sum by adding all the positive terms first and then subtracting the sum of all the negative terms [[1]](#footnotes):\n",
+ "This is very close to $\\pi = 3.14159\\dots$. Leibniz's formula is looking good!"
+ ]
+ },
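The calculation described above can be sketched with `np.arange` as follows. This is one possible way to write it, assuming NumPy is installed and taking Leibniz's formula in its usual form, $\pi = 4 \cdot (1 - 1/3 + 1/5 - 1/7 + \dots)$; the variable names are illustrative and are not taken from the text.

```python
import numpy as np

# Denominators of the positive terms: 1, 5, 9, ..., 9997 (2500 of them).
positive_term_denominators = np.arange(1, 10000, 4)

# Denominators of the negative terms: 3, 7, 11, ..., 9999 (2500 of them).
negative_term_denominators = positive_term_denominators + 2

# Add the positive terms, subtract the negative terms, and multiply by 4.
leibniz_approx = 4 * (np.sum(1 / positive_term_denominators)
                      - np.sum(1 / negative_term_denominators))

print(leibniz_approx)
```

Together the two arrays account for the first 5000 terms of the sum. The step of 4 in `np.arange` is what skips from one positive term's denominator to the next.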
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "<a id='footnotes'></a>\n",
+ "### Footnotes\n",
+ "[1] Surprisingly, when we add *infinitely* many fractions, the order can matter! But our approximation to $\\pi$ uses only a large finite number of fractions, so it's okay to add the terms in any convenient order."
+ "A *range* is an array of numbers in increasing or decreasing order, each separated by a regular interval. \n",
+ "Ranges are useful in a surprisingly large number of situations, so it's worthwhile to learn about them.\n",
+ "\n",
+ "Ranges are defined using the `np.arange` function, which takes either one, two, or three arguments: a start, and end, and a 'step'.\n",
+ "\n",
+ "If you pass one argument to `np.arange`, this becomes the `end` value, with `start=0`, `step=1` assumed. Two arguments give the `start` and `end` with `step=1` assumed. Three arguments give the `start`, `end` and `step` explicitly.\n",
+ "\n",
+ "A range always includes its `start` value, but does not include its `end` value. It counts up by `step`, and it stops before it gets to the `end`.\n",
+ "\n",
+ " np.arange(end): An array starting with 0 of increasing consecutive integers, stopping before end."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 2,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "array([0, 1, 2, 3, 4])"
+ ]
+ },
+ "execution_count": 2,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "np.arange(5)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Notice how the array starts at 0 and goes only up to 4, not to the end value of 5."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "\n",
+ " np.arange(start, end): An array of consecutive increasing integers from start, stopping before end."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 3,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "array([3, 4, 5, 6, 7, 8])"
+ ]
+ },
+ "execution_count": 3,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "np.arange(3, 9)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "\n",
+ " np.arange(start, end, step): A range with a difference of step between each pair of consecutive values, starting from start and stopping before end."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 4,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "array([ 3, 8, 13, 18, 23, 28])"
+ ]
+ },
+ "execution_count": 4,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "np.arange(3, 30, 5)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "This array starts at 3, then takes a step of 5 to get to 8, then another step of 5 to get to 13, and so on.\n",
+ "\n",
+ "When you specify a step, the start, end, and step can all be either positive or negative and may be whole numbers or fractions. "
+ "Though some math is needed to establish this, we can use arrays to convince ourselves that the formula works. Let's calculate the first 5000 terms of Leibniz's infinite sum and see if it is close to $\\pi$.\n",
+ "We will calculate this finite sum by adding all the positive terms first and then subtracting the sum of all the negative terms [[1]](#footnotes):\n",
+ "This is very close to $\\pi = 3.14159\\dots$. Leibniz's formula is looking good!"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "<a id='footnotes'></a>\n",
+ "### Footnotes\n",
+ "[1] Surprisingly, when we add *infinitely* many fractions, the order can matter! But our approximation to $\\pi$ uses only a large finite number of fractions, so it's okay to add the terms in any convenient order."
+ "It's often necessary to compute something that involves data from more than one array. If two arrays are of the same size, Python makes it easy to do calculations involving both arrays.\n",
+ "\n",
+ "For our first example, we return once more to the temperature data. This time, we create arrays of average daily [high](http://berkeleyearth.lbl.gov/auto/Regional/TMAX/Text/global-land-TMAX-Trend.txt) and [low](http://berkeleyearth.lbl.gov/auto/Regional/TMIN/Text/global-land-TMIN-Trend.txt) temperatures for the decades surrounding 1850, 1900, 1950, and 2000."
+ "Suppose we'd like to compute the average daily *range* of temperatures for each decade. That is, we want to subtract the average daily high in the 1850s from the average daily low in the 1850s, and the same for each other decade.\n",
+ "\n",
+ "We could write this laboriously using `.item`:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 4,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "array([11.472, 12.016, 11.711, 11.436])"
+ ]
+ },
+ "execution_count": 4,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "# note - though the '.item()' can be used with numpy but not with pandas\n",
+ "# you can mix lines of pandas code with numpy code\n",
+ "\n",
+ "np.array(\n",
+ " [highs.item(0) - lows.item(0),\n",
+ " highs.item(1) - lows.item(1),\n",
+ " highs.item(2) - lows.item(2),\n",
+ " highs.item(3) - lows.item(3)]\n",
+ ")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "As when we converted an array of temperatures from Celsius to Fahrenheit, Python provides a much cleaner way to write this:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 5,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "array([11.472, 12.016, 11.711, 11.436])"
+ ]
+ },
+ "execution_count": 5,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "highs - lows"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "<img src=\"array_subtraction.png\" />"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "What we've seen in these examples are special cases of a general feature of arrays."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Elementwise arithmetic on pairs of numerical arrays\n",
+ "If an arithmetic operator acts on two arrays of the same size, then the operation is performed on each corresponding pair of elements in the two arrays. The final result is an array. \n",
+ "\n",
+ "For example, if `array1` and `array2` have the same number of elements, then the value of `array1 * array2` is an array. Its first element is the first element of `array1` times the first element of `array2`, its second element is the second element of `array1` times the second element of `array2`, and so on."
+ ]
+ },
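As a small illustration of the rule above, the sketch below multiplies two arrays of the same size. It assumes NumPy is installed; the contents of `array1` and `array2` are made up for this example.

```python
import numpy as np

array1 = np.array([1, 2, 3])
array2 = np.array([10, 20, 30])

# Each element of array1 is multiplied by the corresponding element of array2.
product = array1 * array2

print(product)  # [10 40 90]
```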
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Example: Wallis' Formula for $\\pi$ \n",
+ "The number $\\pi$ is important in many different areas of math. Centuries before computers were invented, mathematicians worked on finding simple ways to approximate the numerical value of $\\pi$. We have already seen Leibniz's formula for $\\pi$. About half a century before Leibniz, the English mathematician [John Wallis](https://en.wikipedia.org/wiki/John_Wallis) (1616-1703) also expressed $\\pi$ in terms of simple fractions, as an infinite product.\n",
+ "We're now ready to do the calculation. We start by creating an array of even numbers 2, 4, 6, and so on upto 1,000,000. Then we create two lists of odd numbers: 1, 3, 5, 7, ... upto 999,999, and 3, 5, 7, ... upto 1,000,001."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 6,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "even = np.arange(2, 1000001, 2)\n",
+ "one_below_even = even - 1\n",
+ "one_above_even = even + 1"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Remember that `np.prod` multiplies all the elements of an array together. Now we can calculate Wallis' product, to a good approximation."
+ "That's $\\pi$ correct to five decimal places. Wallis clearly came up with a great formula."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "<a id='footnotes'></a>\n",
+ "### Footnotes\n",
+ "[1] As we saw in the example about Leibniz's formula, when we add *infinitely* many fractions, the order can matter. The same is true with multiplying fractions, as we are doing here. But our approximation to $\\pi$ uses only a large finite number of fractions, so it's okay to multiply the terms in any convenient order."
+ "It's often necessary to compute something that involves data from more than one array. If two arrays are of the same size, Python makes it easy to do calculations involving both arrays.\n",
+ "\n",
+ "For our first example, we return once more to the temperature data. This time, we create arrays of average daily [high](http://berkeleyearth.lbl.gov/auto/Regional/TMAX/Text/global-land-TMAX-Trend.txt) and [low](http://berkeleyearth.lbl.gov/auto/Regional/TMIN/Text/global-land-TMIN-Trend.txt) temperatures for the decades surrounding 1850, 1900, 1950, and 2000."
+ "Suppose we'd like to compute the average daily *range* of temperatures for each decade. That is, we want to subtract the average daily low in the 1850s from the average daily high in the 1850s, and the same for each other decade.\n",
+ "\n",
+ "We could write this laboriously using `.item`:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 4,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "array([11.472, 12.016, 11.711, 11.436])"
+ ]
+ },
+ "execution_count": 4,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "# Note: '.item(i)' is a NumPy array method; pandas objects do not offer it in this form.\n",
+ "# NumPy code and pandas code can still be mixed freely in the same notebook.\n",
+ "\n",
+ "np.array(\n",
+ " [highs.item(0) - lows.item(0),\n",
+ " highs.item(1) - lows.item(1),\n",
+ " highs.item(2) - lows.item(2),\n",
+ " highs.item(3) - lows.item(3)]\n",
+ ")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "As when we converted an array of temperatures from Celsius to Fahrenheit, Python provides a much cleaner way to write this:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 5,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "array([11.472, 12.016, 11.711, 11.436])"
+ ]
+ },
+ "execution_count": 5,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "highs - lows"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "<img src=\"array_subtraction.png\" />"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "What we've seen in these examples are special cases of a general feature of arrays."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Elementwise arithmetic on pairs of numerical arrays\n",
+ "If an arithmetic operator acts on two arrays of the same size, then the operation is performed on each corresponding pair of elements in the two arrays. The final result is an array. \n",
+ "\n",
+ "For example, if `array1` and `array2` have the same number of elements, then the value of `array1 * array2` is an array. Its first element is the first element of `array1` times the first element of `array2`, its second element is the second element of `array1` times the second element of `array2`, and so on."
+ ]
+ },
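As a small illustration of the rule above, the sketch below multiplies and subtracts two arrays of the same size. The names `array1` and `array2` follow the text; the values in them are made up for this example.

```python
import numpy as np

# Two hypothetical arrays with the same number of elements
array1 = np.array([1, 2, 3, 4])
array2 = np.array([10, 20, 30, 40])

# Each operator is applied to corresponding pairs of elements
product = array1 * array2      # 1*10, 2*20, 3*30, 4*40
difference = array2 - array1   # 10-1, 20-2, 30-3, 40-4

print(product)
print(difference)
```

The result of each expression is itself an array with the same number of elements as the inputs.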
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Example: Wallis' Formula for $\\pi$ \n",
+ "The number $\\pi$ is important in many different areas of math. Centuries before computers were invented, mathematicians worked on finding simple ways to approximate the numerical value of $\\pi$. We have already seen Leibniz's formula for $\\pi$. About half a century before Leibniz, the English mathematician [John Wallis](https://en.wikipedia.org/wiki/John_Wallis) (1616-1703) also expressed $\\pi$ in terms of simple fractions, as an infinite product.\n",
+ "We're now ready to do the calculation. We start by creating an array of even numbers 2, 4, 6, and so on up to 1,000,000. Then we create two arrays of odd numbers: 1, 3, 5, 7, ... up to 999,999, and 3, 5, 7, ... up to 1,000,001."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 6,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "even = np.arange(2, 1000001, 2)\n",
+ "one_below_even = even - 1\n",
+ "one_above_even = even + 1"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Remember that `np.prod` multiplies all the elements of an array together. Now we can calculate Wallis' product, to a good approximation."
+ "That's $\\pi$ correct to five decimal places. Wallis clearly came up with a great formula."
+ ]
+ },
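One way the product might be computed is sketched below. It assumes Wallis' formula in the form $\pi/2 = \prod_n \frac{2n}{2n-1}\cdot\frac{2n}{2n+1}$ and reuses the array names from the cell above; the name `wallis_approx` is introduced here for illustration and the exact expression used in the notebook may differ.

```python
import numpy as np

# Even numbers 2, 4, ..., 1,000,000 and their odd neighbors
even = np.arange(2, 1000001, 2)
one_below_even = even - 1
one_above_even = even + 1

# Each even number contributes two fractions: even/(even-1) and even/(even+1).
# Division produces floating-point arrays, so the running product stays near pi/2.
wallis_approx = 2 * np.prod((even / one_below_even) * (even / one_above_even))

print(wallis_approx)
print(np.pi)
```

Multiplying the two fractions for each even number before calling `np.prod` keeps every factor close to 1, which avoids the very large and very small intermediate values that separate products of numerators and denominators would produce.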
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "<a id='footnotes'></a>\n",
+ "### Footnotes\n",
+ "[1] As we saw in the example about Leibniz's formula, when we add *infinitely* many fractions, the order can matter. The same is true with multiplying fractions, as we are doing here. But our approximation to $\\pi$ uses only a large finite number of fractions, so it's okay to multiply the terms in any convenient order."
+ "Values can be grouped together into collections, which allows programmers to organize those values and refer to all of them with a single name. By grouping values together, we can write code that performs a computation on many pieces of data at once.\n",
+ "\n",
+ "Calling the function `np.array` on several values places them into an *array*, which is a kind of sequential collection. Below, we collect four different temperatures into an array called `highs`. These are the [estimated average daily high temperatures](http://berkeleyearth.lbl.gov/regions/global-land) over all land on Earth (in degrees Celsius) for the decades surrounding 1850, 1900, 1950, and 2000, respectively, expressed as deviations from the average absolute high temperature between 1951 and 1980, which was 14.48 degrees."
+ "Collections allow us to pass multiple values into a function using a single name. For instance, the `sum` function computes the sum of all values in a collection, and the `len` function computes its length. (That's the number of values we put in it.) Using them together, we can compute the average of a collection."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 3,
+ "metadata": {
+ "scrolled": true
+ },
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "14.434000000000001"
+ ]
+ },
+ "execution_count": 3,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "sum(highs)/len(highs)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "The complete chart of daily high and low temperatures appears below. \n",
+ "\n",
+ "## Mean of Daily High Temperature\n",
+ "\n",
+ "\n",
+ "\n",
+ "## Mean of Daily Low Temperature\n",
+ "\n",
+ ""
+ "\"The NBA is the highest paying professional sports league in the world,\" [reported CNN](http://edition.cnn.com/2015/12/04/sport/gallery/highest-paid-nba-players/) in March 2016. The table `nba_salaries` contains the salaries of all National Basketball Association players in 2015-2016.\n",
+ "\n",
+ "Each row represents one player. The columns are:\n",
+ "|`'15-'16 SALARY` | Player's salary in 2015-2016, in millions of dollars|\n",
+ " \n",
+ "The code for the positions is PG (Point Guard), SG (Shooting Guard), PF (Power Forward), SF (Small Forward), and C (Center). But what follows doesn't involve details about how basketball is played.\n",
+ "\n",
+ "The first row shows that Paul Millsap, Power Forward for the Atlanta Hawks, had a salary of almost $\\$18.7$ million in 2015-2016."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 2,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "<div>\n",
+ "<style scoped>\n",
+ " .dataframe tbody tr th:only-of-type {\n",
+ " vertical-align: middle;\n",
+ " }\n",
+ "\n",
+ " .dataframe tbody tr th {\n",
+ " vertical-align: top;\n",
+ " }\n",
+ "\n",
+ " .dataframe thead th {\n",
+ " text-align: right;\n",
+ " }\n",
+ "</style>\n",
+ "<table border=\"1\" class=\"dataframe\">\n",
+ " <thead>\n",
+ " <tr style=\"text-align: right;\">\n",
+ " <th></th>\n",
+ " <th>PLAYER</th>\n",
+ " <th>POSITION</th>\n",
+ " <th>TEAM</th>\n",
+ " <th>'15-'16 SALARY</th>\n",
+ " </tr>\n",
+ " </thead>\n",
+ " <tbody>\n",
+ " <tr>\n",
+ " <th>0</th>\n",
+ " <td>Paul Millsap</td>\n",
+ " <td>PF</td>\n",
+ " <td>Atlanta Hawks</td>\n",
+ " <td>18.671659</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>1</th>\n",
+ " <td>Al Horford</td>\n",
+ " <td>C</td>\n",
+ " <td>Atlanta Hawks</td>\n",
+ " <td>12.000000</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>2</th>\n",
+ " <td>Tiago Splitter</td>\n",
+ " <td>C</td>\n",
+ " <td>Atlanta Hawks</td>\n",
+ " <td>9.756250</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>3</th>\n",
+ " <td>Jeff Teague</td>\n",
+ " <td>PG</td>\n",
+ " <td>Atlanta Hawks</td>\n",
+ " <td>8.000000</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>4</th>\n",
+ " <td>Kyle Korver</td>\n",
+ " <td>SG</td>\n",
+ " <td>Atlanta Hawks</td>\n",
+ " <td>5.746479</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>...</th>\n",
+ " <td>...</td>\n",
+ " <td>...</td>\n",
+ " <td>...</td>\n",
+ " <td>...</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>412</th>\n",
+ " <td>Gary Neal</td>\n",
+ " <td>PG</td>\n",
+ " <td>Washington Wizards</td>\n",
+ " <td>2.139000</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>413</th>\n",
+ " <td>DeJuan Blair</td>\n",
+ " <td>C</td>\n",
+ " <td>Washington Wizards</td>\n",
+ " <td>2.000000</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>414</th>\n",
+ " <td>Kelly Oubre Jr.</td>\n",
+ " <td>SF</td>\n",
+ " <td>Washington Wizards</td>\n",
+ " <td>1.920240</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>415</th>\n",
+ " <td>Garrett Temple</td>\n",
+ " <td>SG</td>\n",
+ " <td>Washington Wizards</td>\n",
+ " <td>1.100602</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>416</th>\n",
+ " <td>Jarell Eddie</td>\n",
+ " <td>SG</td>\n",
+ " <td>Washington Wizards</td>\n",
+ " <td>0.561716</td>\n",
+ " </tr>\n",
+ " </tbody>\n",
+ "</table>\n",
+ "<p>417 rows × 4 columns</p>\n",
+ "</div>"
+ ],
+ "text/plain": [
+ " PLAYER POSITION TEAM '15-'16 SALARY\n",
+ "0 Paul Millsap PF Atlanta Hawks 18.671659\n",
+ "1 Al Horford C Atlanta Hawks 12.000000\n",
+ "2 Tiago Splitter C Atlanta Hawks 9.756250\n",
+ "3 Jeff Teague PG Atlanta Hawks 8.000000\n",
+ "4 Kyle Korver SG Atlanta Hawks 5.746479\n",
+ ".. ... ... ... ...\n",
+ "412 Gary Neal PG Washington Wizards 2.139000\n",
+ "413 DeJuan Blair C Washington Wizards 2.000000\n",
+ "414 Kelly Oubre Jr. SF Washington Wizards 1.920240\n",
+ "415 Garrett Temple SG Washington Wizards 1.100602\n",
+ "416 Jarell Eddie SG Washington Wizards 0.561716\n",
+ "\n",
+ "[417 rows x 4 columns]"
+ ]
+ },
+ "execution_count": 2,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "# This table can be found online: https://www.statcrunch.com/app/index.php?dataid=1843341\n",
+ "The table contains 417 rows, one for each player. Only 10 of the rows are displayed. The `head` method allows us to specify the number of rows to display, with the default (no specification) being the first 5 rows of the table."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 3,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "<div>\n",
+ "<style scoped>\n",
+ " .dataframe tbody tr th:only-of-type {\n",
+ " vertical-align: middle;\n",
+ " }\n",
+ "\n",
+ " .dataframe tbody tr th {\n",
+ " vertical-align: top;\n",
+ " }\n",
+ "\n",
+ " .dataframe thead th {\n",
+ " text-align: right;\n",
+ " }\n",
+ "</style>\n",
+ "<table border=\"1\" class=\"dataframe\">\n",
+ " <thead>\n",
+ " <tr style=\"text-align: right;\">\n",
+ " <th></th>\n",
+ " <th>PLAYER</th>\n",
+ " <th>POSITION</th>\n",
+ " <th>TEAM</th>\n",
+ " <th>'15-'16 SALARY</th>\n",
+ " </tr>\n",
+ " </thead>\n",
+ " <tbody>\n",
+ " <tr>\n",
+ " <th>0</th>\n",
+ " <td>Paul Millsap</td>\n",
+ " <td>PF</td>\n",
+ " <td>Atlanta Hawks</td>\n",
+ " <td>18.671659</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>1</th>\n",
+ " <td>Al Horford</td>\n",
+ " <td>C</td>\n",
+ " <td>Atlanta Hawks</td>\n",
+ " <td>12.000000</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>2</th>\n",
+ " <td>Tiago Splitter</td>\n",
+ " <td>C</td>\n",
+ " <td>Atlanta Hawks</td>\n",
+ " <td>9.756250</td>\n",
+ " </tr>\n",
+ " </tbody>\n",
+ "</table>\n",
+ "</div>"
+ ],
+ "text/plain": [
+ " PLAYER POSITION TEAM '15-'16 SALARY\n",
+ "0 Paul Millsap PF Atlanta Hawks 18.671659\n",
+ "1 Al Horford C Atlanta Hawks 12.000000\n",
+ "2 Tiago Splitter C Atlanta Hawks 9.756250"
+ ]
+ },
+ "execution_count": 3,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "nba_salaries.head(3)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Glance through about 20 rows or so, and you will see that the rows are in alphabetical order by team name. It's also possible to list the same rows in alphabetical order by player name using the `sort_values` method. The argument to `sort_values` is a column label or a list of column labels."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 4,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "<div>\n",
+ "<style scoped>\n",
+ " .dataframe tbody tr th:only-of-type {\n",
+ " vertical-align: middle;\n",
+ " }\n",
+ "\n",
+ " .dataframe tbody tr th {\n",
+ " vertical-align: top;\n",
+ " }\n",
+ "\n",
+ " .dataframe thead th {\n",
+ " text-align: right;\n",
+ " }\n",
+ "</style>\n",
+ "<table border=\"1\" class=\"dataframe\">\n",
+ " <thead>\n",
+ " <tr style=\"text-align: right;\">\n",
+ " <th></th>\n",
+ " <th>PLAYER</th>\n",
+ " <th>POSITION</th>\n",
+ " <th>TEAM</th>\n",
+ " <th>'15-'16 SALARY</th>\n",
+ " </tr>\n",
+ " </thead>\n",
+ " <tbody>\n",
+ " <tr>\n",
+ " <th>68</th>\n",
+ " <td>Aaron Brooks</td>\n",
+ " <td>PG</td>\n",
+ " <td>Chicago Bulls</td>\n",
+ " <td>2.250000</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>291</th>\n",
+ " <td>Aaron Gordon</td>\n",
+ " <td>PF</td>\n",
+ " <td>Orlando Magic</td>\n",
+ " <td>4.171680</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>59</th>\n",
+ " <td>Aaron Harrison</td>\n",
+ " <td>SG</td>\n",
+ " <td>Charlotte Hornets</td>\n",
+ " <td>0.525093</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>235</th>\n",
+ " <td>Adreian Payne</td>\n",
+ " <td>PF</td>\n",
+ " <td>Minnesota Timberwolves</td>\n",
+ " <td>1.938840</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>1</th>\n",
+ " <td>Al Horford</td>\n",
+ " <td>C</td>\n",
+ " <td>Atlanta Hawks</td>\n",
+ " <td>12.000000</td>\n",
+ " </tr>\n",
+ " </tbody>\n",
+ "</table>\n",
+ "</div>"
+ ],
+ "text/plain": [
+ " PLAYER POSITION TEAM '15-'16 SALARY\n",
+ "68 Aaron Brooks PG Chicago Bulls 2.250000\n",
+ "291 Aaron Gordon PF Orlando Magic 4.171680\n",
+ "59 Aaron Harrison SG Charlotte Hornets 0.525093\n",
+ "To examine the players' salaries, it would be much more helpful if the data were ordered by salary.\n",
+ "\n",
+ "To do this, we will first simplify the label of the column of salaries (just for convenience), and then sort by the new label `SALARY`. \n",
+ "\n",
+ "This arranges all the rows of the table in *increasing* order of salary, with the lowest salary appearing first. The output is a new table with the same columns as the original but with the rows rearranged."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 5,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "<div>\n",
+ "<style scoped>\n",
+ " .dataframe tbody tr th:only-of-type {\n",
+ " vertical-align: middle;\n",
+ " }\n",
+ "\n",
+ " .dataframe tbody tr th {\n",
+ " vertical-align: top;\n",
+ " }\n",
+ "\n",
+ " .dataframe thead th {\n",
+ " text-align: right;\n",
+ " }\n",
+ "</style>\n",
+ "<table border=\"1\" class=\"dataframe\">\n",
+ " <thead>\n",
+ " <tr style=\"text-align: right;\">\n",
+ " <th></th>\n",
+ " <th>PLAYER</th>\n",
+ " <th>POSITION</th>\n",
+ " <th>TEAM</th>\n",
+ " <th>SALARY</th>\n",
+ " </tr>\n",
+ " </thead>\n",
+ " <tbody>\n",
+ " <tr>\n",
+ " <th>267</th>\n",
+ " <td>Thanasis Antetokounmpo</td>\n",
+ " <td>SF</td>\n",
+ " <td>New York Knicks</td>\n",
+ " <td>0.030888</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>327</th>\n",
+ " <td>Cory Jefferson</td>\n",
+ " <td>PF</td>\n",
+ " <td>Phoenix Suns</td>\n",
+ " <td>0.049709</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>326</th>\n",
+ " <td>Jordan McRae</td>\n",
+ " <td>SG</td>\n",
+ " <td>Phoenix Suns</td>\n",
+ " <td>0.049709</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>324</th>\n",
+ " <td>Orlando Johnson</td>\n",
+ " <td>SG</td>\n",
+ " <td>Phoenix Suns</td>\n",
+ " <td>0.055722</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>325</th>\n",
+ " <td>Phil Pressey</td>\n",
+ " <td>PG</td>\n",
+ " <td>Phoenix Suns</td>\n",
+ " <td>0.055722</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>...</th>\n",
+ " <td>...</td>\n",
+ " <td>...</td>\n",
+ " <td>...</td>\n",
+ " <td>...</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>131</th>\n",
+ " <td>Dwight Howard</td>\n",
+ " <td>C</td>\n",
+ " <td>Houston Rockets</td>\n",
+ " <td>22.359364</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>255</th>\n",
+ " <td>Carmelo Anthony</td>\n",
+ " <td>SF</td>\n",
+ " <td>New York Knicks</td>\n",
+ " <td>22.875000</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>72</th>\n",
+ " <td>LeBron James</td>\n",
+ " <td>SF</td>\n",
+ " <td>Cleveland Cavaliers</td>\n",
+ " <td>22.970500</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>29</th>\n",
+ " <td>Joe Johnson</td>\n",
+ " <td>SF</td>\n",
+ " <td>Brooklyn Nets</td>\n",
+ " <td>24.894863</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>169</th>\n",
+ " <td>Kobe Bryant</td>\n",
+ " <td>SF</td>\n",
+ " <td>Los Angeles Lakers</td>\n",
+ " <td>25.000000</td>\n",
+ " </tr>\n",
+ " </tbody>\n",
+ "</table>\n",
+ "<p>417 rows × 4 columns</p>\n",
+ "</div>"
+ ],
+ "text/plain": [
+ " PLAYER POSITION TEAM SALARY\n",
+ "267 Thanasis Antetokounmpo SF New York Knicks 0.030888\n",
+ "327 Cory Jefferson PF Phoenix Suns 0.049709\n",
+ "326 Jordan McRae SG Phoenix Suns 0.049709\n",
+ "324 Orlando Johnson SG Phoenix Suns 0.055722\n",
+ "325 Phil Pressey PG Phoenix Suns 0.055722\n",
+ ".. ... ... ... ...\n",
+ "131 Dwight Howard C Houston Rockets 22.359364\n",
+ "255 Carmelo Anthony SF New York Knicks 22.875000\n",
+ "72 LeBron James SF Cleveland Cavaliers 22.970500\n",
+ "29 Joe Johnson SF Brooklyn Nets 24.894863\n",
+ "169 Kobe Bryant SF Los Angeles Lakers 25.000000\n",
+ "These figures are somewhat difficult to compare as some of these players changed teams during the season and received salaries from more than one team; only the salary from the last team appears in the table. Point Guard Phil Pressey, for example, moved from Philadelphia to Phoenix during the year, and might be moving yet again to the Golden State Warriors. \n",
+ "\n",
+ "The CNN report is about the other end of the salary scale – the players who are among the highest paid in the world. \n",
+ "\n",
+ "To order the rows of the table in *decreasing* order of salary, we must use `sort_values` with the option `ascending=False`."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 6,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "<div>\n",
+ "<style scoped>\n",
+ " .dataframe tbody tr th:only-of-type {\n",
+ " vertical-align: middle;\n",
+ " }\n",
+ "\n",
+ " .dataframe tbody tr th {\n",
+ " vertical-align: top;\n",
+ " }\n",
+ "\n",
+ " .dataframe thead th {\n",
+ " text-align: right;\n",
+ " }\n",
+ "</style>\n",
+ "<table border=\"1\" class=\"dataframe\">\n",
+ " <thead>\n",
+ " <tr style=\"text-align: right;\">\n",
+ " <th></th>\n",
+ " <th>PLAYER</th>\n",
+ " <th>POSITION</th>\n",
+ " <th>TEAM</th>\n",
+ " <th>SALARY</th>\n",
+ " </tr>\n",
+ " </thead>\n",
+ " <tbody>\n",
+ " <tr>\n",
+ " <th>169</th>\n",
+ " <td>Kobe Bryant</td>\n",
+ " <td>SF</td>\n",
+ " <td>Los Angeles Lakers</td>\n",
+ " <td>25.000000</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>29</th>\n",
+ " <td>Joe Johnson</td>\n",
+ " <td>SF</td>\n",
+ " <td>Brooklyn Nets</td>\n",
+ " <td>24.894863</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>72</th>\n",
+ " <td>LeBron James</td>\n",
+ " <td>SF</td>\n",
+ " <td>Cleveland Cavaliers</td>\n",
+ " <td>22.970500</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>255</th>\n",
+ " <td>Carmelo Anthony</td>\n",
+ " <td>SF</td>\n",
+ " <td>New York Knicks</td>\n",
+ " <td>22.875000</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>131</th>\n",
+ " <td>Dwight Howard</td>\n",
+ " <td>C</td>\n",
+ " <td>Houston Rockets</td>\n",
+ " <td>22.359364</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>...</th>\n",
+ " <td>...</td>\n",
+ " <td>...</td>\n",
+ " <td>...</td>\n",
+ " <td>...</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>200</th>\n",
+ " <td>Elliot Williams</td>\n",
+ " <td>SG</td>\n",
+ " <td>Memphis Grizzlies</td>\n",
+ " <td>0.055722</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>324</th>\n",
+ " <td>Orlando Johnson</td>\n",
+ " <td>SG</td>\n",
+ " <td>Phoenix Suns</td>\n",
+ " <td>0.055722</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>327</th>\n",
+ " <td>Cory Jefferson</td>\n",
+ " <td>PF</td>\n",
+ " <td>Phoenix Suns</td>\n",
+ " <td>0.049709</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>326</th>\n",
+ " <td>Jordan McRae</td>\n",
+ " <td>SG</td>\n",
+ " <td>Phoenix Suns</td>\n",
+ " <td>0.049709</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>267</th>\n",
+ " <td>Thanasis Antetokounmpo</td>\n",
+ " <td>SF</td>\n",
+ " <td>New York Knicks</td>\n",
+ " <td>0.030888</td>\n",
+ " </tr>\n",
+ " </tbody>\n",
+ "</table>\n",
+ "<p>417 rows × 4 columns</p>\n",
+ "</div>"
+ ],
+ "text/plain": [
+ " PLAYER POSITION TEAM SALARY\n",
+ "169 Kobe Bryant SF Los Angeles Lakers 25.000000\n",
+ "29 Joe Johnson SF Brooklyn Nets 24.894863\n",
+ "72 LeBron James SF Cleveland Cavaliers 22.970500\n",
+ "255 Carmelo Anthony SF New York Knicks 22.875000\n",
+ "131 Dwight Howard C Houston Rockets 22.359364\n",
+ ".. ... ... ... ...\n",
+ "200 Elliot Williams SG Memphis Grizzlies 0.055722\n",
+ "324 Orlando Johnson SG Phoenix Suns 0.055722\n",
+ "327 Cory Jefferson PF Phoenix Suns 0.049709\n",
+ "326 Jordan McRae SG Phoenix Suns 0.049709\n",
+ "267 Thanasis Antetokounmpo SF New York Knicks 0.030888\n",
+ "\n",
+ "[417 rows x 4 columns]"
+ ]
+ },
+ "execution_count": 6,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "nba.sort_values('SALARY', ascending=False)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Kobe Bryant, in his final season with the Lakers, was the highest paid at a salary of $\\$25$ million. Notice that the MVP Stephen Curry doesn't appear among the top 10. He is quite a bit further down the list, as we will see later."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Named Arguments\n",
+ "\n",
+ "The `ascending=False` portion of this call expression is called a *named argument*. When a function or method is called, each argument has both a position and a name. Both are evident from the help text of a function or method."
+ ]
+ },
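The difference between a positional and a named argument can be seen on a tiny table. The sketch below uses a made-up three-row DataFrame called `toy` (not the real salary data) and passes the column label to `sort_values` first by position and then by its name, `by`.

```python
import pandas as pd

# A hypothetical miniature salary table, for illustration only
toy = pd.DataFrame({
    'PLAYER': ['A', 'B', 'C'],
    'SALARY': [2.0, 25.0, 8.0],
})

# Column label passed by position; sort direction passed by name
by_position = toy.sort_values('SALARY', ascending=False)

# Column label passed by name as well
by_name = toy.sort_values(by='SALARY', ascending=False)

# Both calls produce the same table
same_result = by_position.equals(by_name)
top_player = by_name['PLAYER'].iloc[0]

print(same_result)
print(top_player)
```

Writing `ascending=False` rather than a bare `False` makes the call easier to read, since the name says what the value controls.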
+ {
+ "cell_type": "code",
+ "execution_count": 7,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Help on method sort_values in module pandas.core.frame:\n",
+ "This describes the positions, names, and default values of the arguments to `sort_values`. When calling this method, you can use either positional arguments or named arguments, so the following three calls do exactly the same thing.\n",
+ "When an argument is simply `True` or `False`, it's a useful convention to include the argument name so that it's more obvious what the argument value means."
+ "\"The NBA is the highest paying professional sports league in the world,\" [reported CNN](http://edition.cnn.com/2015/12/04/sport/gallery/highest-paid-nba-players/) in March 2016. The table `nba_salaries` contains the salaries of all National Basketball Association players in 2015-2016.\n",
+ "\n",
+ "Each row represents one player. The columns are:\n",
+ "|`'15-'16 SALARY` | Player's salary in 2015-2016, in millions of dollars|\n",
+ " \n",
+ "The code for the positions is PG (Point Guard), SG (Shooting Guard), PF (Power Forward), SF (Small Forward), and C (Center). But what follows doesn't involve details about how basketball is played.\n",
+ "\n",
+ "The first row shows that Paul Millsap, Power Forward for the Atlanta Hawks, had a salary of almost $\\$18.7$ million in 2015-2016."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 2,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "<div>\n",
+ "<style scoped>\n",
+ " .dataframe tbody tr th:only-of-type {\n",
+ " vertical-align: middle;\n",
+ " }\n",
+ "\n",
+ " .dataframe tbody tr th {\n",
+ " vertical-align: top;\n",
+ " }\n",
+ "\n",
+ " .dataframe thead th {\n",
+ " text-align: right;\n",
+ " }\n",
+ "</style>\n",
+ "<table border=\"1\" class=\"dataframe\">\n",
+ " <thead>\n",
+ " <tr style=\"text-align: right;\">\n",
+ " <th></th>\n",
+ " <th>PLAYER</th>\n",
+ " <th>POSITION</th>\n",
+ " <th>TEAM</th>\n",
+ " <th>'15-'16 SALARY</th>\n",
+ " </tr>\n",
+ " </thead>\n",
+ " <tbody>\n",
+ " <tr>\n",
+ " <th>0</th>\n",
+ " <td>Paul Millsap</td>\n",
+ " <td>PF</td>\n",
+ " <td>Atlanta Hawks</td>\n",
+ " <td>18.671659</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>1</th>\n",
+ " <td>Al Horford</td>\n",
+ " <td>C</td>\n",
+ " <td>Atlanta Hawks</td>\n",
+ " <td>12.000000</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>2</th>\n",
+ " <td>Tiago Splitter</td>\n",
+ " <td>C</td>\n",
+ " <td>Atlanta Hawks</td>\n",
+ " <td>9.756250</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>3</th>\n",
+ " <td>Jeff Teague</td>\n",
+ " <td>PG</td>\n",
+ " <td>Atlanta Hawks</td>\n",
+ " <td>8.000000</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>4</th>\n",
+ " <td>Kyle Korver</td>\n",
+ " <td>SG</td>\n",
+ " <td>Atlanta Hawks</td>\n",
+ " <td>5.746479</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>...</th>\n",
+ " <td>...</td>\n",
+ " <td>...</td>\n",
+ " <td>...</td>\n",
+ " <td>...</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>412</th>\n",
+ " <td>Gary Neal</td>\n",
+ " <td>PG</td>\n",
+ " <td>Washington Wizards</td>\n",
+ " <td>2.139000</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>413</th>\n",
+ " <td>DeJuan Blair</td>\n",
+ " <td>C</td>\n",
+ " <td>Washington Wizards</td>\n",
+ " <td>2.000000</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>414</th>\n",
+ " <td>Kelly Oubre Jr.</td>\n",
+ " <td>SF</td>\n",
+ " <td>Washington Wizards</td>\n",
+ " <td>1.920240</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>415</th>\n",
+ " <td>Garrett Temple</td>\n",
+ " <td>SG</td>\n",
+ " <td>Washington Wizards</td>\n",
+ " <td>1.100602</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>416</th>\n",
+ " <td>Jarell Eddie</td>\n",
+ " <td>SG</td>\n",
+ " <td>Washington Wizards</td>\n",
+ " <td>0.561716</td>\n",
+ " </tr>\n",
+ " </tbody>\n",
+ "</table>\n",
+ "<p>417 rows × 4 columns</p>\n",
+ "</div>"
+ ],
+ "text/plain": [
+ " PLAYER POSITION TEAM '15-'16 SALARY\n",
+ "0 Paul Millsap PF Atlanta Hawks 18.671659\n",
+ "1 Al Horford C Atlanta Hawks 12.000000\n",
+ "2 Tiago Splitter C Atlanta Hawks 9.756250\n",
+ "3 Jeff Teague PG Atlanta Hawks 8.000000\n",
+ "4 Kyle Korver SG Atlanta Hawks 5.746479\n",
+ ".. ... ... ... ...\n",
+ "412 Gary Neal PG Washington Wizards 2.139000\n",
+ "413 DeJuan Blair C Washington Wizards 2.000000\n",
+ "414 Kelly Oubre Jr. SF Washington Wizards 1.920240\n",
+ "415 Garrett Temple SG Washington Wizards 1.100602\n",
+ "416 Jarell Eddie SG Washington Wizards 0.561716\n",
+ "\n",
+ "[417 rows x 4 columns]"
+ ]
+ },
+ "execution_count": 2,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "# This table can be found online: https://www.statcrunch.com/app/index.php?dataid=1843341\n",
+ "The table contains 417 rows, one for each player. Only 10 of the rows are displayed. The `head` method allows us to specify the number of rows to display, with the default (no specification) being the first 5 rows of the table."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 3,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "<div>\n",
+ "<style scoped>\n",
+ " .dataframe tbody tr th:only-of-type {\n",
+ " vertical-align: middle;\n",
+ " }\n",
+ "\n",
+ " .dataframe tbody tr th {\n",
+ " vertical-align: top;\n",
+ " }\n",
+ "\n",
+ " .dataframe thead th {\n",
+ " text-align: right;\n",
+ " }\n",
+ "</style>\n",
+ "<table border=\"1\" class=\"dataframe\">\n",
+ " <thead>\n",
+ " <tr style=\"text-align: right;\">\n",
+ " <th></th>\n",
+ " <th>PLAYER</th>\n",
+ " <th>POSITION</th>\n",
+ " <th>TEAM</th>\n",
+ " <th>'15-'16 SALARY</th>\n",
+ " </tr>\n",
+ " </thead>\n",
+ " <tbody>\n",
+ " <tr>\n",
+ " <th>0</th>\n",
+ " <td>Paul Millsap</td>\n",
+ " <td>PF</td>\n",
+ " <td>Atlanta Hawks</td>\n",
+ " <td>18.671659</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>1</th>\n",
+ " <td>Al Horford</td>\n",
+ " <td>C</td>\n",
+ " <td>Atlanta Hawks</td>\n",
+ " <td>12.000000</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>2</th>\n",
+ " <td>Tiago Splitter</td>\n",
+ " <td>C</td>\n",
+ " <td>Atlanta Hawks</td>\n",
+ " <td>9.756250</td>\n",
+ " </tr>\n",
+ " </tbody>\n",
+ "</table>\n",
+ "</div>"
+ ],
+ "text/plain": [
+ " PLAYER POSITION TEAM '15-'16 SALARY\n",
+ "0 Paul Millsap PF Atlanta Hawks 18.671659\n",
+ "1 Al Horford C Atlanta Hawks 12.000000\n",
+ "2 Tiago Splitter C Atlanta Hawks 9.756250"
+ ]
+ },
+ "execution_count": 3,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "nba_salaries.head(3)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Glance through about 20 rows or so, and you will see that the rows are in alphabetical order by team name. It's also possible to list the same rows in alphabetical order by player name using the `sort_values` method. The argument to `sort_values` is a column label or a list of column labels."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 4,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "<div>\n",
+ "<style scoped>\n",
+ " .dataframe tbody tr th:only-of-type {\n",
+ " vertical-align: middle;\n",
+ " }\n",
+ "\n",
+ " .dataframe tbody tr th {\n",
+ " vertical-align: top;\n",
+ " }\n",
+ "\n",
+ " .dataframe thead th {\n",
+ " text-align: right;\n",
+ " }\n",
+ "</style>\n",
+ "<table border=\"1\" class=\"dataframe\">\n",
+ " <thead>\n",
+ " <tr style=\"text-align: right;\">\n",
+ " <th></th>\n",
+ " <th>PLAYER</th>\n",
+ " <th>POSITION</th>\n",
+ " <th>TEAM</th>\n",
+ " <th>'15-'16 SALARY</th>\n",
+ " </tr>\n",
+ " </thead>\n",
+ " <tbody>\n",
+ " <tr>\n",
+ " <th>68</th>\n",
+ " <td>Aaron Brooks</td>\n",
+ " <td>PG</td>\n",
+ " <td>Chicago Bulls</td>\n",
+ " <td>2.250000</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>291</th>\n",
+ " <td>Aaron Gordon</td>\n",
+ " <td>PF</td>\n",
+ " <td>Orlando Magic</td>\n",
+ " <td>4.171680</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>59</th>\n",
+ " <td>Aaron Harrison</td>\n",
+ " <td>SG</td>\n",
+ " <td>Charlotte Hornets</td>\n",
+ " <td>0.525093</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>235</th>\n",
+ " <td>Adreian Payne</td>\n",
+ " <td>PF</td>\n",
+ " <td>Minnesota Timberwolves</td>\n",
+ " <td>1.938840</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>1</th>\n",
+ " <td>Al Horford</td>\n",
+ " <td>C</td>\n",
+ " <td>Atlanta Hawks</td>\n",
+ " <td>12.000000</td>\n",
+ " </tr>\n",
+ " </tbody>\n",
+ "</table>\n",
+ "</div>"
+ ],
+ "text/plain": [
+ " PLAYER POSITION TEAM '15-'16 SALARY\n",
+ "68 Aaron Brooks PG Chicago Bulls 2.250000\n",
+ "291 Aaron Gordon PF Orlando Magic 4.171680\n",
+ "59 Aaron Harrison SG Charlotte Hornets 0.525093\n",
+ "To examine the players' salaries, it would be much more helpful if the data were ordered by salary.\n",
+ "\n",
+ "To do this, we will first simplify the label of the column of salaries (just for convenience), and then sort by the new label `SALARY`. \n",
+ "\n",
+ "This arranges all the rows of the table in *increasing* order of salary, with the lowest salary appearing first. The output is a new table with the same columns as the original but with the rows rearranged."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 5,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "<div>\n",
+ "<style scoped>\n",
+ " .dataframe tbody tr th:only-of-type {\n",
+ " vertical-align: middle;\n",
+ " }\n",
+ "\n",
+ " .dataframe tbody tr th {\n",
+ " vertical-align: top;\n",
+ " }\n",
+ "\n",
+ " .dataframe thead th {\n",
+ " text-align: right;\n",
+ " }\n",
+ "</style>\n",
+ "<table border=\"1\" class=\"dataframe\">\n",
+ " <thead>\n",
+ " <tr style=\"text-align: right;\">\n",
+ " <th></th>\n",
+ " <th>PLAYER</th>\n",
+ " <th>POSITION</th>\n",
+ " <th>TEAM</th>\n",
+ " <th>SALARY</th>\n",
+ " </tr>\n",
+ " </thead>\n",
+ " <tbody>\n",
+ " <tr>\n",
+ " <th>267</th>\n",
+ " <td>Thanasis Antetokounmpo</td>\n",
+ " <td>SF</td>\n",
+ " <td>New York Knicks</td>\n",
+ " <td>0.030888</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>327</th>\n",
+ " <td>Cory Jefferson</td>\n",
+ " <td>PF</td>\n",
+ " <td>Phoenix Suns</td>\n",
+ " <td>0.049709</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>326</th>\n",
+ " <td>Jordan McRae</td>\n",
+ " <td>SG</td>\n",
+ " <td>Phoenix Suns</td>\n",
+ " <td>0.049709</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>324</th>\n",
+ " <td>Orlando Johnson</td>\n",
+ " <td>SG</td>\n",
+ " <td>Phoenix Suns</td>\n",
+ " <td>0.055722</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>325</th>\n",
+ " <td>Phil Pressey</td>\n",
+ " <td>PG</td>\n",
+ " <td>Phoenix Suns</td>\n",
+ " <td>0.055722</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>...</th>\n",
+ " <td>...</td>\n",
+ " <td>...</td>\n",
+ " <td>...</td>\n",
+ " <td>...</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>131</th>\n",
+ " <td>Dwight Howard</td>\n",
+ " <td>C</td>\n",
+ " <td>Houston Rockets</td>\n",
+ " <td>22.359364</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>255</th>\n",
+ " <td>Carmelo Anthony</td>\n",
+ " <td>SF</td>\n",
+ " <td>New York Knicks</td>\n",
+ " <td>22.875000</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>72</th>\n",
+ " <td>LeBron James</td>\n",
+ " <td>SF</td>\n",
+ " <td>Cleveland Cavaliers</td>\n",
+ " <td>22.970500</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>29</th>\n",
+ " <td>Joe Johnson</td>\n",
+ " <td>SF</td>\n",
+ " <td>Brooklyn Nets</td>\n",
+ " <td>24.894863</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>169</th>\n",
+ " <td>Kobe Bryant</td>\n",
+ " <td>SF</td>\n",
+ " <td>Los Angeles Lakers</td>\n",
+ " <td>25.000000</td>\n",
+ " </tr>\n",
+ " </tbody>\n",
+ "</table>\n",
+ "<p>417 rows × 4 columns</p>\n",
+ "</div>"
+ ],
+ "text/plain": [
+ " PLAYER POSITION TEAM SALARY\n",
+ "267 Thanasis Antetokounmpo SF New York Knicks 0.030888\n",
+ "327 Cory Jefferson PF Phoenix Suns 0.049709\n",
+ "326 Jordan McRae SG Phoenix Suns 0.049709\n",
+ "324 Orlando Johnson SG Phoenix Suns 0.055722\n",
+ "325 Phil Pressey PG Phoenix Suns 0.055722\n",
+ ".. ... ... ... ...\n",
+ "131 Dwight Howard C Houston Rockets 22.359364\n",
+ "255 Carmelo Anthony SF New York Knicks 22.875000\n",
+ "72 LeBron James SF Cleveland Cavaliers 22.970500\n",
+ "29 Joe Johnson SF Brooklyn Nets 24.894863\n",
+ "169 Kobe Bryant SF Los Angeles Lakers 25.000000\n",
+ "These figures are somewhat difficult to compare as some of these players changed teams during the season and received salaries from more than one team; only the salary from the last team appears in the table. Point Guard Phil Pressey, for example, moved from Philadelphia to Phoenix during the year, and might be moving yet again to the Golden State Warriors. \n",
+ "\n",
+ "The CNN report is about the other end of the salary scale – the players who are among the highest paid in the world. \n",
+ "\n",
+ "To order the rows of the table in *decreasing* order of salary, we use `sort_values` with the option `ascending=False`."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 6,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "<div>\n",
+ "<style scoped>\n",
+ " .dataframe tbody tr th:only-of-type {\n",
+ " vertical-align: middle;\n",
+ " }\n",
+ "\n",
+ " .dataframe tbody tr th {\n",
+ " vertical-align: top;\n",
+ " }\n",
+ "\n",
+ " .dataframe thead th {\n",
+ " text-align: right;\n",
+ " }\n",
+ "</style>\n",
+ "<table border=\"1\" class=\"dataframe\">\n",
+ " <thead>\n",
+ " <tr style=\"text-align: right;\">\n",
+ " <th></th>\n",
+ " <th>PLAYER</th>\n",
+ " <th>POSITION</th>\n",
+ " <th>TEAM</th>\n",
+ " <th>SALARY</th>\n",
+ " </tr>\n",
+ " </thead>\n",
+ " <tbody>\n",
+ " <tr>\n",
+ " <th>169</th>\n",
+ " <td>Kobe Bryant</td>\n",
+ " <td>SF</td>\n",
+ " <td>Los Angeles Lakers</td>\n",
+ " <td>25.000000</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>29</th>\n",
+ " <td>Joe Johnson</td>\n",
+ " <td>SF</td>\n",
+ " <td>Brooklyn Nets</td>\n",
+ " <td>24.894863</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>72</th>\n",
+ " <td>LeBron James</td>\n",
+ " <td>SF</td>\n",
+ " <td>Cleveland Cavaliers</td>\n",
+ " <td>22.970500</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>255</th>\n",
+ " <td>Carmelo Anthony</td>\n",
+ " <td>SF</td>\n",
+ " <td>New York Knicks</td>\n",
+ " <td>22.875000</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>131</th>\n",
+ " <td>Dwight Howard</td>\n",
+ " <td>C</td>\n",
+ " <td>Houston Rockets</td>\n",
+ " <td>22.359364</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>...</th>\n",
+ " <td>...</td>\n",
+ " <td>...</td>\n",
+ " <td>...</td>\n",
+ " <td>...</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>200</th>\n",
+ " <td>Elliot Williams</td>\n",
+ " <td>SG</td>\n",
+ " <td>Memphis Grizzlies</td>\n",
+ " <td>0.055722</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>324</th>\n",
+ " <td>Orlando Johnson</td>\n",
+ " <td>SG</td>\n",
+ " <td>Phoenix Suns</td>\n",
+ " <td>0.055722</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>327</th>\n",
+ " <td>Cory Jefferson</td>\n",
+ " <td>PF</td>\n",
+ " <td>Phoenix Suns</td>\n",
+ " <td>0.049709</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>326</th>\n",
+ " <td>Jordan McRae</td>\n",
+ " <td>SG</td>\n",
+ " <td>Phoenix Suns</td>\n",
+ " <td>0.049709</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>267</th>\n",
+ " <td>Thanasis Antetokounmpo</td>\n",
+ " <td>SF</td>\n",
+ " <td>New York Knicks</td>\n",
+ " <td>0.030888</td>\n",
+ " </tr>\n",
+ " </tbody>\n",
+ "</table>\n",
+ "<p>417 rows × 4 columns</p>\n",
+ "</div>"
+ ],
+ "text/plain": [
+ " PLAYER POSITION TEAM SALARY\n",
+ "169 Kobe Bryant SF Los Angeles Lakers 25.000000\n",
+ "29 Joe Johnson SF Brooklyn Nets 24.894863\n",
+ "72 LeBron James SF Cleveland Cavaliers 22.970500\n",
+ "255 Carmelo Anthony SF New York Knicks 22.875000\n",
+ "131 Dwight Howard C Houston Rockets 22.359364\n",
+ ".. ... ... ... ...\n",
+ "200 Elliot Williams SG Memphis Grizzlies 0.055722\n",
+ "324 Orlando Johnson SG Phoenix Suns 0.055722\n",
+ "327 Cory Jefferson PF Phoenix Suns 0.049709\n",
+ "326 Jordan McRae SG Phoenix Suns 0.049709\n",
+ "267 Thanasis Antetokounmpo SF New York Knicks 0.030888\n",
+ "\n",
+ "[417 rows x 4 columns]"
+ ]
+ },
+ "execution_count": 6,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "nba.sort_values('SALARY', ascending=False)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Kobe Bryant, in his final season with the Lakers, was the highest paid at a salary of $\\$25$ million. Notice that the MVP Stephen Curry doesn't appear among the top rows shown. He is quite a bit further down the list, as we will see later."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Named Arguments\n",
+ "\n",
+ "The `ascending=False` portion of this call expression is called a *named argument*. When a function or method is called, each argument has both a position and a name. Both are evident from the help text of a function or method."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 7,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Help on method sort_values in module pandas.core.frame:\n",
+ "This describes the positions, names, and default values of the arguments to `sort_values`. When calling this method, you can use either positional arguments or named arguments, so the following three calls do exactly the same thing.\n",
+ "When an argument is simply `True` or `False`, it's a useful convention to include the argument name so that it's more obvious what the argument value means."
+ "Often, we would like to extract just those rows that correspond to entries with a particular feature. For example, we might want only the rows corresponding to the Warriors, or to players who earned more than $\\$10$ million. Or we might just want the top five earners."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Specified Rows\n",
+ "The fact that a DataFrame creates an index by default starts to become very useful here, as we can specify which rows we wish to inspect by stating an index or an index range. The argument is a row index or an array of indices; given an array, it creates a new DataFrame consisting of only those rows.\n",
+ "\n",
+ "For example, if we wanted just the first row of `nba`, we could use `df.iloc[]` as follows."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 3,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "<div>\n",
+ "<style scoped>\n",
+ " .dataframe tbody tr th:only-of-type {\n",
+ " vertical-align: middle;\n",
+ " }\n",
+ "\n",
+ " .dataframe tbody tr th {\n",
+ " vertical-align: top;\n",
+ " }\n",
+ "\n",
+ " .dataframe thead th {\n",
+ " text-align: right;\n",
+ " }\n",
+ "</style>\n",
+ "<table border=\"1\" class=\"dataframe\">\n",
+ " <thead>\n",
+ " <tr style=\"text-align: right;\">\n",
+ " <th></th>\n",
+ " <th>PLAYER</th>\n",
+ " <th>POSITION</th>\n",
+ " <th>TEAM</th>\n",
+ " <th>SALARY</th>\n",
+ " </tr>\n",
+ " </thead>\n",
+ " <tbody>\n",
+ " <tr>\n",
+ " <th>0</th>\n",
+ " <td>Paul Millsap</td>\n",
+ " <td>PF</td>\n",
+ " <td>Atlanta Hawks</td>\n",
+ " <td>18.671659</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>1</th>\n",
+ " <td>Al Horford</td>\n",
+ " <td>C</td>\n",
+ " <td>Atlanta Hawks</td>\n",
+ " <td>12.000000</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>2</th>\n",
+ " <td>Tiago Splitter</td>\n",
+ " <td>C</td>\n",
+ " <td>Atlanta Hawks</td>\n",
+ " <td>9.756250</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>3</th>\n",
+ " <td>Jeff Teague</td>\n",
+ " <td>PG</td>\n",
+ " <td>Atlanta Hawks</td>\n",
+ " <td>8.000000</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>4</th>\n",
+ " <td>Kyle Korver</td>\n",
+ " <td>SG</td>\n",
+ " <td>Atlanta Hawks</td>\n",
+ " <td>5.746479</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>...</th>\n",
+ " <td>...</td>\n",
+ " <td>...</td>\n",
+ " <td>...</td>\n",
+ " <td>...</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>412</th>\n",
+ " <td>Gary Neal</td>\n",
+ " <td>PG</td>\n",
+ " <td>Washington Wizards</td>\n",
+ " <td>2.139000</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>413</th>\n",
+ " <td>DeJuan Blair</td>\n",
+ " <td>C</td>\n",
+ " <td>Washington Wizards</td>\n",
+ " <td>2.000000</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>414</th>\n",
+ " <td>Kelly Oubre Jr.</td>\n",
+ " <td>SF</td>\n",
+ " <td>Washington Wizards</td>\n",
+ " <td>1.920240</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>415</th>\n",
+ " <td>Garrett Temple</td>\n",
+ " <td>SG</td>\n",
+ " <td>Washington Wizards</td>\n",
+ " <td>1.100602</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>416</th>\n",
+ " <td>Jarell Eddie</td>\n",
+ " <td>SG</td>\n",
+ " <td>Washington Wizards</td>\n",
+ " <td>0.561716</td>\n",
+ " </tr>\n",
+ " </tbody>\n",
+ "</table>\n",
+ "<p>417 rows × 4 columns</p>\n",
+ "</div>"
+ ],
+ "text/plain": [
+ " PLAYER POSITION TEAM SALARY\n",
+ "0 Paul Millsap PF Atlanta Hawks 18.671659\n",
+ "1 Al Horford C Atlanta Hawks 12.000000\n",
+ "2 Tiago Splitter C Atlanta Hawks 9.756250\n",
+ "3 Jeff Teague PG Atlanta Hawks 8.000000\n",
+ "4 Kyle Korver SG Atlanta Hawks 5.746479\n",
+ ".. ... ... ... ...\n",
+ "412 Gary Neal PG Washington Wizards 2.139000\n",
+ "413 DeJuan Blair C Washington Wizards 2.000000\n",
+ "414 Kelly Oubre Jr. SF Washington Wizards 1.920240\n",
+ "415 Garrett Temple SG Washington Wizards 1.100602\n",
+ "416 Jarell Eddie SG Washington Wizards 0.561716\n",
+ "\n",
+ "[417 rows x 4 columns]"
+ ]
+ },
+ "execution_count": 3,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "nba"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 4,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "PLAYER Paul Millsap\n",
+ "POSITION PF\n",
+ "TEAM Atlanta Hawks\n",
+ "SALARY 18.6717\n",
+ "Name: 0, dtype: object"
+ ]
+ },
+ "execution_count": 4,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "nba.iloc[0]"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 5,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "<div>\n",
+ "<style scoped>\n",
+ " .dataframe tbody tr th:only-of-type {\n",
+ " vertical-align: middle;\n",
+ " }\n",
+ "\n",
+ " .dataframe tbody tr th {\n",
+ " vertical-align: top;\n",
+ " }\n",
+ "\n",
+ " .dataframe thead th {\n",
+ " text-align: right;\n",
+ " }\n",
+ "</style>\n",
+ "<table border=\"1\" class=\"dataframe\">\n",
+ " <thead>\n",
+ " <tr style=\"text-align: right;\">\n",
+ " <th></th>\n",
+ " <th>PLAYER</th>\n",
+ " <th>POSITION</th>\n",
+ " <th>TEAM</th>\n",
+ " <th>SALARY</th>\n",
+ " </tr>\n",
+ " </thead>\n",
+ " <tbody>\n",
+ " <tr>\n",
+ " <th>0</th>\n",
+ " <td>Paul Millsap</td>\n",
+ " <td>PF</td>\n",
+ " <td>Atlanta Hawks</td>\n",
+ " <td>18.671659</td>\n",
+ " </tr>\n",
+ " </tbody>\n",
+ "</table>\n",
+ "</div>"
+ ],
+ "text/plain": [
+ " PLAYER POSITION TEAM SALARY\n",
+ "0 Paul Millsap PF Atlanta Hawks 18.671659"
+ ]
+ },
+ "execution_count": 5,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "nba.iloc[[0]]"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "This is a new table with just the single row that we specified.\n",
+ "\n",
+ "We could also get the fourth, fifth, and sixth rows by specifying a range of indices as the argument."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 6,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "<div>\n",
+ "<style scoped>\n",
+ " .dataframe tbody tr th:only-of-type {\n",
+ " vertical-align: middle;\n",
+ " }\n",
+ "\n",
+ " .dataframe tbody tr th {\n",
+ " vertical-align: top;\n",
+ " }\n",
+ "\n",
+ " .dataframe thead th {\n",
+ " text-align: right;\n",
+ " }\n",
+ "</style>\n",
+ "<table border=\"1\" class=\"dataframe\">\n",
+ " <thead>\n",
+ " <tr style=\"text-align: right;\">\n",
+ " <th></th>\n",
+ " <th>PLAYER</th>\n",
+ " <th>POSITION</th>\n",
+ " <th>TEAM</th>\n",
+ " <th>SALARY</th>\n",
+ " </tr>\n",
+ " </thead>\n",
+ " <tbody>\n",
+ " <tr>\n",
+ " <th>3</th>\n",
+ " <td>Jeff Teague</td>\n",
+ " <td>PG</td>\n",
+ " <td>Atlanta Hawks</td>\n",
+ " <td>8.000000</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>4</th>\n",
+ " <td>Kyle Korver</td>\n",
+ " <td>SG</td>\n",
+ " <td>Atlanta Hawks</td>\n",
+ " <td>5.746479</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>5</th>\n",
+ " <td>Thabo Sefolosha</td>\n",
+ " <td>SF</td>\n",
+ " <td>Atlanta Hawks</td>\n",
+ " <td>4.000000</td>\n",
+ " </tr>\n",
+ " </tbody>\n",
+ "</table>\n",
+ "</div>"
+ ],
+ "text/plain": [
+ " PLAYER POSITION TEAM SALARY\n",
+ "3 Jeff Teague PG Atlanta Hawks 8.000000\n",
+ "4 Kyle Korver SG Atlanta Hawks 5.746479\n",
+ "5 Thabo Sefolosha SF Atlanta Hawks 4.000000"
+ ]
+ },
+ "execution_count": 6,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "nba.iloc[np.arange(3, 6)]"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "If we want a table of the top 5 highest paid players, we can first sort the table by salary and then use `df.iloc[]` to select the first five rows:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 7,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "<div>\n",
+ "<style scoped>\n",
+ " .dataframe tbody tr th:only-of-type {\n",
+ " vertical-align: middle;\n",
+ " }\n",
+ "\n",
+ " .dataframe tbody tr th {\n",
+ " vertical-align: top;\n",
+ " }\n",
+ "\n",
+ " .dataframe thead th {\n",
+ " text-align: right;\n",
+ " }\n",
+ "</style>\n",
+ "<table border=\"1\" class=\"dataframe\">\n",
+ " <thead>\n",
+ " <tr style=\"text-align: right;\">\n",
+ " <th></th>\n",
+ " <th>PLAYER</th>\n",
+ " <th>POSITION</th>\n",
+ " <th>TEAM</th>\n",
+ " <th>SALARY</th>\n",
+ " </tr>\n",
+ " </thead>\n",
+ " <tbody>\n",
+ " <tr>\n",
+ " <th>169</th>\n",
+ " <td>Kobe Bryant</td>\n",
+ " <td>SF</td>\n",
+ " <td>Los Angeles Lakers</td>\n",
+ " <td>25.000000</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>29</th>\n",
+ " <td>Joe Johnson</td>\n",
+ " <td>SF</td>\n",
+ " <td>Brooklyn Nets</td>\n",
+ " <td>24.894863</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>72</th>\n",
+ " <td>LeBron James</td>\n",
+ " <td>SF</td>\n",
+ " <td>Cleveland Cavaliers</td>\n",
+ " <td>22.970500</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>255</th>\n",
+ " <td>Carmelo Anthony</td>\n",
+ " <td>SF</td>\n",
+ " <td>New York Knicks</td>\n",
+ " <td>22.875000</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>131</th>\n",
+ " <td>Dwight Howard</td>\n",
+ " <td>C</td>\n",
+ " <td>Houston Rockets</td>\n",
+ " <td>22.359364</td>\n",
+ " </tr>\n",
+ " </tbody>\n",
+ "</table>\n",
+ "</div>"
+ ],
+ "text/plain": [
+ " PLAYER POSITION TEAM SALARY\n",
+ "169 Kobe Bryant SF Los Angeles Lakers 25.000000\n",
+ "29 Joe Johnson SF Brooklyn Nets 24.894863\n",
+ "72 LeBron James SF Cleveland Cavaliers 22.970500\n",
+ "255 Carmelo Anthony SF New York Knicks 22.875000\n",
+ "### Rows Corresponding to a Specified Feature\n",
+ "More often, we will want to access data in a set of rows that have a certain feature, but whose indices we don't know ahead of time. For example, we might want data on all the players who made more than $\\$10$ million, but we don't want to spend time counting rows in the sorted table.\n",
+ "\n",
+ "Array version: if we wish to work with an array of row positions, we can use `np.where(df['column'] criteria)`, where `criteria` stands for a comparison such as `> 10`. \n",
+ "DataFrame version: to select rows directly, index the DataFrame with a condition applied to one of its columns, i.e. `df[df['column_name'] criteria]`.\n",
+ "\n",
+ "In the first example, we extract the data for all those who earned more than $\\$10$ million."
+ "# or - this is an example of alternatives being available to select,\n",
+ "# this may depend upon preference, the task at hand, the impact of processing time or export requirements\n",
+ "\n",
+ "#nba[nba['SALARY'] >10]"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "The expression `nba[nba['SALARY'] > 10]` ensured that each selected row had a value of `SALARY` that was greater than 10.\n",
+ "\n",
+ "There are 69 rows in the new table, corresponding to the 69 players who made more than $10$ million dollars. Arranging these rows in order makes the data easier to analyze. DeMar DeRozan of the Toronto Raptors was the \"poorest\" of this group, at a salary of just over $10$ million dollars."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 9,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "<div>\n",
+ "<style scoped>\n",
+ " .dataframe tbody tr th:only-of-type {\n",
+ " vertical-align: middle;\n",
+ " }\n",
+ "\n",
+ " .dataframe tbody tr th {\n",
+ " vertical-align: top;\n",
+ " }\n",
+ "\n",
+ " .dataframe thead th {\n",
+ " text-align: right;\n",
+ " }\n",
+ "</style>\n",
+ "<table border=\"1\" class=\"dataframe\">\n",
+ " <thead>\n",
+ " <tr style=\"text-align: right;\">\n",
+ " <th></th>\n",
+ " <th>PLAYER</th>\n",
+ " <th>POSITION</th>\n",
+ " <th>TEAM</th>\n",
+ " <th>SALARY</th>\n",
+ " </tr>\n",
+ " </thead>\n",
+ " <tbody>\n",
+ " <tr>\n",
+ " <th>368</th>\n",
+ " <td>DeMar DeRozan</td>\n",
+ " <td>SG</td>\n",
+ " <td>Toronto Raptors</td>\n",
+ " <td>10.050000</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>298</th>\n",
+ " <td>Gerald Wallace</td>\n",
+ " <td>SF</td>\n",
+ " <td>Philadelphia 76ers</td>\n",
+ " <td>10.105855</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>204</th>\n",
+ " <td>Luol Deng</td>\n",
+ " <td>SF</td>\n",
+ " <td>Miami Heat</td>\n",
+ " <td>10.151612</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>144</th>\n",
+ " <td>Monta Ellis</td>\n",
+ " <td>SG</td>\n",
+ " <td>Indiana Pacers</td>\n",
+ " <td>10.300000</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>95</th>\n",
+ " <td>Wilson Chandler</td>\n",
+ " <td>SF</td>\n",
+ " <td>Denver Nuggets</td>\n",
+ " <td>10.449438</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>...</th>\n",
+ " <td>...</td>\n",
+ " <td>...</td>\n",
+ " <td>...</td>\n",
+ " <td>...</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>131</th>\n",
+ " <td>Dwight Howard</td>\n",
+ " <td>C</td>\n",
+ " <td>Houston Rockets</td>\n",
+ " <td>22.359364</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>255</th>\n",
+ " <td>Carmelo Anthony</td>\n",
+ " <td>SF</td>\n",
+ " <td>New York Knicks</td>\n",
+ " <td>22.875000</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>72</th>\n",
+ " <td>LeBron James</td>\n",
+ " <td>SF</td>\n",
+ " <td>Cleveland Cavaliers</td>\n",
+ " <td>22.970500</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>29</th>\n",
+ " <td>Joe Johnson</td>\n",
+ " <td>SF</td>\n",
+ " <td>Brooklyn Nets</td>\n",
+ " <td>24.894863</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>169</th>\n",
+ " <td>Kobe Bryant</td>\n",
+ " <td>SF</td>\n",
+ " <td>Los Angeles Lakers</td>\n",
+ " <td>25.000000</td>\n",
+ " </tr>\n",
+ " </tbody>\n",
+ "</table>\n",
+ "<p>69 rows × 4 columns</p>\n",
+ "</div>"
+ ],
+ "text/plain": [
+ " PLAYER POSITION TEAM SALARY\n",
+ "368 DeMar DeRozan SG Toronto Raptors 10.050000\n",
+ "298 Gerald Wallace SF Philadelphia 76ers 10.105855\n",
+ "204 Luol Deng SF Miami Heat 10.151612\n",
+ "144 Monta Ellis SG Indiana Pacers 10.300000\n",
+ "95 Wilson Chandler SF Denver Nuggets 10.449438\n",
+ ".. ... ... ... ...\n",
+ "131 Dwight Howard C Houston Rockets 22.359364\n",
+ "255 Carmelo Anthony SF New York Knicks 22.875000\n",
+ "72 LeBron James SF Cleveland Cavaliers 22.970500\n",
+ "29 Joe Johnson SF Brooklyn Nets 24.894863\n",
+ "169 Kobe Bryant SF Los Angeles Lakers 25.000000\n",
+ "\n",
+ "[69 rows x 4 columns]"
+ ]
+ },
+ "execution_count": 9,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "nba[nba['SALARY'] >10].sort_values('SALARY')"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "How much did Stephen Curry make? For the answer, we have to access the row where the value of `PLAYER` is equal to `Stephen Curry`. The result is a table consisting of just one row:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 10,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "<div>\n",
+ "<style scoped>\n",
+ " .dataframe tbody tr th:only-of-type {\n",
+ " vertical-align: middle;\n",
+ " }\n",
+ "\n",
+ " .dataframe tbody tr th {\n",
+ " vertical-align: top;\n",
+ " }\n",
+ "\n",
+ " .dataframe thead th {\n",
+ " text-align: right;\n",
+ " }\n",
+ "</style>\n",
+ "<table border=\"1\" class=\"dataframe\">\n",
+ " <thead>\n",
+ " <tr style=\"text-align: right;\">\n",
+ " <th></th>\n",
+ " <th>PLAYER</th>\n",
+ " <th>POSITION</th>\n",
+ " <th>TEAM</th>\n",
+ " <th>SALARY</th>\n",
+ " </tr>\n",
+ " </thead>\n",
+ " <tbody>\n",
+ " <tr>\n",
+ " <th>121</th>\n",
+ " <td>Stephen Curry</td>\n",
+ " <td>PG</td>\n",
+ " <td>Golden State Warriors</td>\n",
+ " <td>11.370786</td>\n",
+ " </tr>\n",
+ " </tbody>\n",
+ "</table>\n",
+ "</div>"
+ ],
+ "text/plain": [
+ " PLAYER POSITION TEAM SALARY\n",
+ "121 Stephen Curry PG Golden State Warriors 11.370786"
+ ]
+ },
+ "execution_count": 10,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "nba[nba['PLAYER'] == 'Stephen Curry']"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Curry made just under $\\$11.4$ million. That's a lot of money, but it's less than half the salary of LeBron James. You'll find that salary in the \"Top 5\" table earlier in this section, or you could find it by replacing `'Stephen Curry'` with `'LeBron James'` in the line of code above.\n",
+ "\n",
+ "In the same way, you can get a DataFrame containing only the rows where `TEAM` is exactly equal to `'Golden State Warriors'`:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 11,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "<div>\n",
+ "<style scoped>\n",
+ " .dataframe tbody tr th:only-of-type {\n",
+ " vertical-align: middle;\n",
+ " }\n",
+ "\n",
+ " .dataframe tbody tr th {\n",
+ " vertical-align: top;\n",
+ " }\n",
+ "\n",
+ " .dataframe thead th {\n",
+ " text-align: right;\n",
+ " }\n",
+ "</style>\n",
+ "<table border=\"1\" class=\"dataframe\">\n",
+ " <thead>\n",
+ " <tr style=\"text-align: right;\">\n",
+ " <th></th>\n",
+ " <th>PLAYER</th>\n",
+ " <th>POSITION</th>\n",
+ " <th>TEAM</th>\n",
+ " <th>SALARY</th>\n",
+ " </tr>\n",
+ " </thead>\n",
+ " <tbody>\n",
+ " <tr>\n",
+ " <th>117</th>\n",
+ " <td>Klay Thompson</td>\n",
+ " <td>SG</td>\n",
+ " <td>Golden State Warriors</td>\n",
+ " <td>15.501000</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>118</th>\n",
+ " <td>Draymond Green</td>\n",
+ " <td>PF</td>\n",
+ " <td>Golden State Warriors</td>\n",
+ " <td>14.260870</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>119</th>\n",
+ " <td>Andrew Bogut</td>\n",
+ " <td>C</td>\n",
+ " <td>Golden State Warriors</td>\n",
+ " <td>13.800000</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>120</th>\n",
+ " <td>Andre Iguodala</td>\n",
+ " <td>SF</td>\n",
+ " <td>Golden State Warriors</td>\n",
+ " <td>11.710456</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>121</th>\n",
+ " <td>Stephen Curry</td>\n",
+ " <td>PG</td>\n",
+ " <td>Golden State Warriors</td>\n",
+ " <td>11.370786</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>122</th>\n",
+ " <td>Jason Thompson</td>\n",
+ " <td>PF</td>\n",
+ " <td>Golden State Warriors</td>\n",
+ " <td>7.008475</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>123</th>\n",
+ " <td>Shaun Livingston</td>\n",
+ " <td>PG</td>\n",
+ " <td>Golden State Warriors</td>\n",
+ " <td>5.543725</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>124</th>\n",
+ " <td>Harrison Barnes</td>\n",
+ " <td>SF</td>\n",
+ " <td>Golden State Warriors</td>\n",
+ " <td>3.873398</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>125</th>\n",
+ " <td>Marreese Speights</td>\n",
+ " <td>C</td>\n",
+ " <td>Golden State Warriors</td>\n",
+ " <td>3.815000</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>126</th>\n",
+ " <td>Leandro Barbosa</td>\n",
+ " <td>SG</td>\n",
+ " <td>Golden State Warriors</td>\n",
+ " <td>2.500000</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>127</th>\n",
+ " <td>Festus Ezeli</td>\n",
+ " <td>C</td>\n",
+ " <td>Golden State Warriors</td>\n",
+ " <td>2.008748</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>128</th>\n",
+ " <td>Brandon Rush</td>\n",
+ " <td>SF</td>\n",
+ " <td>Golden State Warriors</td>\n",
+ " <td>1.270964</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>129</th>\n",
+ " <td>Kevon Looney</td>\n",
+ " <td>SF</td>\n",
+ " <td>Golden State Warriors</td>\n",
+ " <td>1.131960</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>130</th>\n",
+ " <td>Anderson Varejao</td>\n",
+ " <td>PF</td>\n",
+ " <td>Golden State Warriors</td>\n",
+ " <td>0.289755</td>\n",
+ " </tr>\n",
+ " </tbody>\n",
+ "</table>\n",
+ "</div>"
+ ],
+ "text/plain": [
+ " PLAYER POSITION TEAM SALARY\n",
+ "117 Klay Thompson SG Golden State Warriors 15.501000\n",
+ "118 Draymond Green PF Golden State Warriors 14.260870\n",
+ "119 Andrew Bogut C Golden State Warriors 13.800000\n",
+ "120 Andre Iguodala SF Golden State Warriors 11.710456\n",
+ "121 Stephen Curry PG Golden State Warriors 11.370786\n",
+ "122 Jason Thompson PF Golden State Warriors 7.008475\n",
+ "123 Shaun Livingston PG Golden State Warriors 5.543725\n",
+ "124 Harrison Barnes SF Golden State Warriors 3.873398\n",
+ "125 Marreese Speights C Golden State Warriors 3.815000\n",
+ "126 Leandro Barbosa SG Golden State Warriors 2.500000\n",
+ "127 Festus Ezeli C Golden State Warriors 2.008748\n",
+ "128 Brandon Rush SF Golden State Warriors 1.270964\n",
+ "129 Kevon Looney SF Golden State Warriors 1.131960\n",
+ "130 Anderson Varejao PF Golden State Warriors 0.289755"
+ ]
+ },
+ "execution_count": 11,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "nba[nba['TEAM'] == 'Golden State Warriors']"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "This portion of the table is already sorted by salary, because the original table listed players sorted by salary within each team. Because `.head()` is not used at the end of the line, all rows are shown, not just the first 5."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### Multiple Features\n",
+ "You can access rows that have multiple specified features, by using the boolean `&` operator. For example, here is a way to extract all the Point Guards whose salaries were over $\\$15$ million."
+ "By now you will have realized that the general way to create a new DataFrame by selecting rows with given features is to combine the appropriate conditions with `&` (and) or `|` (or), wrapping each condition in parentheses:\n",
+ "Often, we would like to extract just those rows that correspond to entries with a particular feature. For example, we might want only the rows corresponding to the Warriors, or to players who earned more than $\\$10$ million. Or we might just want the top five earners."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Specified Rows\n",
+ "The fact that a DataFrame creates an index by default starts to become very useful here, as we can specify which rows we wish to inspect by stating an index or an index range. The argument is a row index or an array of indices; given an array, it creates a new DataFrame consisting of only those rows.\n",
+ "\n",
+ "For example, if we wanted just the first row of `nba`, we could use `df.iloc[]` as follows."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 3,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "<div>\n",
+ "<style scoped>\n",
+ " .dataframe tbody tr th:only-of-type {\n",
+ " vertical-align: middle;\n",
+ " }\n",
+ "\n",
+ " .dataframe tbody tr th {\n",
+ " vertical-align: top;\n",
+ " }\n",
+ "\n",
+ " .dataframe thead th {\n",
+ " text-align: right;\n",
+ " }\n",
+ "</style>\n",
+ "<table border=\"1\" class=\"dataframe\">\n",
+ " <thead>\n",
+ " <tr style=\"text-align: right;\">\n",
+ " <th></th>\n",
+ " <th>PLAYER</th>\n",
+ " <th>POSITION</th>\n",
+ " <th>TEAM</th>\n",
+ " <th>SALARY</th>\n",
+ " </tr>\n",
+ " </thead>\n",
+ " <tbody>\n",
+ " <tr>\n",
+ " <th>0</th>\n",
+ " <td>Paul Millsap</td>\n",
+ " <td>PF</td>\n",
+ " <td>Atlanta Hawks</td>\n",
+ " <td>18.671659</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>1</th>\n",
+ " <td>Al Horford</td>\n",
+ " <td>C</td>\n",
+ " <td>Atlanta Hawks</td>\n",
+ " <td>12.000000</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>2</th>\n",
+ " <td>Tiago Splitter</td>\n",
+ " <td>C</td>\n",
+ " <td>Atlanta Hawks</td>\n",
+ " <td>9.756250</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>3</th>\n",
+ " <td>Jeff Teague</td>\n",
+ " <td>PG</td>\n",
+ " <td>Atlanta Hawks</td>\n",
+ " <td>8.000000</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>4</th>\n",
+ " <td>Kyle Korver</td>\n",
+ " <td>SG</td>\n",
+ " <td>Atlanta Hawks</td>\n",
+ " <td>5.746479</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>...</th>\n",
+ " <td>...</td>\n",
+ " <td>...</td>\n",
+ " <td>...</td>\n",
+ " <td>...</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>412</th>\n",
+ " <td>Gary Neal</td>\n",
+ " <td>PG</td>\n",
+ " <td>Washington Wizards</td>\n",
+ " <td>2.139000</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>413</th>\n",
+ " <td>DeJuan Blair</td>\n",
+ " <td>C</td>\n",
+ " <td>Washington Wizards</td>\n",
+ " <td>2.000000</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>414</th>\n",
+ " <td>Kelly Oubre Jr.</td>\n",
+ " <td>SF</td>\n",
+ " <td>Washington Wizards</td>\n",
+ " <td>1.920240</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>415</th>\n",
+ " <td>Garrett Temple</td>\n",
+ " <td>SG</td>\n",
+ " <td>Washington Wizards</td>\n",
+ " <td>1.100602</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>416</th>\n",
+ " <td>Jarell Eddie</td>\n",
+ " <td>SG</td>\n",
+ " <td>Washington Wizards</td>\n",
+ " <td>0.561716</td>\n",
+ " </tr>\n",
+ " </tbody>\n",
+ "</table>\n",
+ "<p>417 rows × 4 columns</p>\n",
+ "</div>"
+ ],
+ "text/plain": [
+ " PLAYER POSITION TEAM SALARY\n",
+ "0 Paul Millsap PF Atlanta Hawks 18.671659\n",
+ "1 Al Horford C Atlanta Hawks 12.000000\n",
+ "2 Tiago Splitter C Atlanta Hawks 9.756250\n",
+ "3 Jeff Teague PG Atlanta Hawks 8.000000\n",
+ "4 Kyle Korver SG Atlanta Hawks 5.746479\n",
+ ".. ... ... ... ...\n",
+ "412 Gary Neal PG Washington Wizards 2.139000\n",
+ "413 DeJuan Blair C Washington Wizards 2.000000\n",
+ "414 Kelly Oubre Jr. SF Washington Wizards 1.920240\n",
+ "415 Garrett Temple SG Washington Wizards 1.100602\n",
+ "416 Jarell Eddie SG Washington Wizards 0.561716\n",
+ "\n",
+ "[417 rows x 4 columns]"
+ ]
+ },
+ "execution_count": 3,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "nba"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 4,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "PLAYER Paul Millsap\n",
+ "POSITION PF\n",
+ "TEAM Atlanta Hawks\n",
+ "SALARY 18.6717\n",
+ "Name: 0, dtype: object"
+ ]
+ },
+ "execution_count": 4,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "nba.iloc[0]"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 5,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "<div>\n",
+ "<style scoped>\n",
+ " .dataframe tbody tr th:only-of-type {\n",
+ " vertical-align: middle;\n",
+ " }\n",
+ "\n",
+ " .dataframe tbody tr th {\n",
+ " vertical-align: top;\n",
+ " }\n",
+ "\n",
+ " .dataframe thead th {\n",
+ " text-align: right;\n",
+ " }\n",
+ "</style>\n",
+ "<table border=\"1\" class=\"dataframe\">\n",
+ " <thead>\n",
+ " <tr style=\"text-align: right;\">\n",
+ " <th></th>\n",
+ " <th>PLAYER</th>\n",
+ " <th>POSITION</th>\n",
+ " <th>TEAM</th>\n",
+ " <th>SALARY</th>\n",
+ " </tr>\n",
+ " </thead>\n",
+ " <tbody>\n",
+ " <tr>\n",
+ " <th>0</th>\n",
+ " <td>Paul Millsap</td>\n",
+ " <td>PF</td>\n",
+ " <td>Atlanta Hawks</td>\n",
+ " <td>18.671659</td>\n",
+ " </tr>\n",
+ " </tbody>\n",
+ "</table>\n",
+ "</div>"
+ ],
+ "text/plain": [
+ " PLAYER POSITION TEAM SALARY\n",
+ "0 Paul Millsap PF Atlanta Hawks 18.671659"
+ ]
+ },
+ "execution_count": 5,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "nba.iloc[[0]]"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "This is a new table with just the single row that we specified.\n",
+ "\n",
+ "We could also get the fourth, fifth, and sixth rows by specifying a range of indices as the argument."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 6,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "<div>\n",
+ "<style scoped>\n",
+ " .dataframe tbody tr th:only-of-type {\n",
+ " vertical-align: middle;\n",
+ " }\n",
+ "\n",
+ " .dataframe tbody tr th {\n",
+ " vertical-align: top;\n",
+ " }\n",
+ "\n",
+ " .dataframe thead th {\n",
+ " text-align: right;\n",
+ " }\n",
+ "</style>\n",
+ "<table border=\"1\" class=\"dataframe\">\n",
+ " <thead>\n",
+ " <tr style=\"text-align: right;\">\n",
+ " <th></th>\n",
+ " <th>PLAYER</th>\n",
+ " <th>POSITION</th>\n",
+ " <th>TEAM</th>\n",
+ " <th>SALARY</th>\n",
+ " </tr>\n",
+ " </thead>\n",
+ " <tbody>\n",
+ " <tr>\n",
+ " <th>3</th>\n",
+ " <td>Jeff Teague</td>\n",
+ " <td>PG</td>\n",
+ " <td>Atlanta Hawks</td>\n",
+ " <td>8.000000</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>4</th>\n",
+ " <td>Kyle Korver</td>\n",
+ " <td>SG</td>\n",
+ " <td>Atlanta Hawks</td>\n",
+ " <td>5.746479</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>5</th>\n",
+ " <td>Thabo Sefolosha</td>\n",
+ " <td>SF</td>\n",
+ " <td>Atlanta Hawks</td>\n",
+ " <td>4.000000</td>\n",
+ " </tr>\n",
+ " </tbody>\n",
+ "</table>\n",
+ "</div>"
+ ],
+ "text/plain": [
+ " PLAYER POSITION TEAM SALARY\n",
+ "3 Jeff Teague PG Atlanta Hawks 8.000000\n",
+ "4 Kyle Korver SG Atlanta Hawks 5.746479\n",
+ "5 Thabo Sefolosha SF Atlanta Hawks 4.000000"
+ ]
+ },
+ "execution_count": 6,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "nba.iloc[np.arange(3, 6)]"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "If we want a table of the 5 highest-paid players, we can first sort the table by salary in descending order and then use `df.iloc[]` to take the first five rows:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 7,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "<div>\n",
+ "<style scoped>\n",
+ " .dataframe tbody tr th:only-of-type {\n",
+ " vertical-align: middle;\n",
+ " }\n",
+ "\n",
+ " .dataframe tbody tr th {\n",
+ " vertical-align: top;\n",
+ " }\n",
+ "\n",
+ " .dataframe thead th {\n",
+ " text-align: right;\n",
+ " }\n",
+ "</style>\n",
+ "<table border=\"1\" class=\"dataframe\">\n",
+ " <thead>\n",
+ " <tr style=\"text-align: right;\">\n",
+ " <th></th>\n",
+ " <th>PLAYER</th>\n",
+ " <th>POSITION</th>\n",
+ " <th>TEAM</th>\n",
+ " <th>SALARY</th>\n",
+ " </tr>\n",
+ " </thead>\n",
+ " <tbody>\n",
+ " <tr>\n",
+ " <th>169</th>\n",
+ " <td>Kobe Bryant</td>\n",
+ " <td>SF</td>\n",
+ " <td>Los Angeles Lakers</td>\n",
+ " <td>25.000000</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>29</th>\n",
+ " <td>Joe Johnson</td>\n",
+ " <td>SF</td>\n",
+ " <td>Brooklyn Nets</td>\n",
+ " <td>24.894863</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>72</th>\n",
+ " <td>LeBron James</td>\n",
+ " <td>SF</td>\n",
+ " <td>Cleveland Cavaliers</td>\n",
+ " <td>22.970500</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>255</th>\n",
+ " <td>Carmelo Anthony</td>\n",
+ " <td>SF</td>\n",
+ " <td>New York Knicks</td>\n",
+ " <td>22.875000</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>131</th>\n",
+ " <td>Dwight Howard</td>\n",
+ " <td>C</td>\n",
+ " <td>Houston Rockets</td>\n",
+ " <td>22.359364</td>\n",
+ " </tr>\n",
+ " </tbody>\n",
+ "</table>\n",
+ "</div>"
+ ],
+ "text/plain": [
+ " PLAYER POSITION TEAM SALARY\n",
+ "169 Kobe Bryant SF Los Angeles Lakers 25.000000\n",
+ "29 Joe Johnson SF Brooklyn Nets 24.894863\n",
+ "72 LeBron James SF Cleveland Cavaliers 22.970500\n",
+ "255 Carmelo Anthony SF New York Knicks 22.875000\n",
+ "### Rows Corresponding to a Specified Feature\n",
+ "More often, we will want to access data in a set of rows that have a certain feature, but whose indices we don't know ahead of time. For example, we might want data on all the players who made more than $\\$10$ million, but we don't want to spend time counting rows in the sorted table.\n",
+ "\n",
+ "* Array version: to work with an array of row positions, use `np.where(df['column_name'] criteria)`, where `criteria` stands for a comparison such as `> 10`.\n",
+ "* DataFrame version: to select the rows directly, index the DataFrame with the condition applied to one of its columns, i.e. `df[df['column_name'] criteria]`.\n",
+ "\n",
+ "In the first example, we extract the data for all those who earned more than $\\$10$ million."
+ "# Alternatively, select the rows with a boolean condition. Which approach to use\n",
+ "# may depend on preference, the task at hand, processing time, or export requirements.\n",
+ "\n",
+ "#nba[nba['SALARY'] >10]"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Selecting with `df[df['SALARY'] > 10]` ensured that each selected row had a value of `SALARY` that was greater than 10.\n",
+ "\n",
+ "There are 69 rows in the new table, corresponding to the 69 players who made more than $10$ million dollars. Arranging these rows in order makes the data easier to analyze. DeMar DeRozan of the Toronto Raptors was the \"poorest\" of this group, at a salary of just over $10$ million dollars."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 9,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "<div>\n",
+ "<style scoped>\n",
+ " .dataframe tbody tr th:only-of-type {\n",
+ " vertical-align: middle;\n",
+ " }\n",
+ "\n",
+ " .dataframe tbody tr th {\n",
+ " vertical-align: top;\n",
+ " }\n",
+ "\n",
+ " .dataframe thead th {\n",
+ " text-align: right;\n",
+ " }\n",
+ "</style>\n",
+ "<table border=\"1\" class=\"dataframe\">\n",
+ " <thead>\n",
+ " <tr style=\"text-align: right;\">\n",
+ " <th></th>\n",
+ " <th>PLAYER</th>\n",
+ " <th>POSITION</th>\n",
+ " <th>TEAM</th>\n",
+ " <th>SALARY</th>\n",
+ " </tr>\n",
+ " </thead>\n",
+ " <tbody>\n",
+ " <tr>\n",
+ " <th>368</th>\n",
+ " <td>DeMar DeRozan</td>\n",
+ " <td>SG</td>\n",
+ " <td>Toronto Raptors</td>\n",
+ " <td>10.050000</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>298</th>\n",
+ " <td>Gerald Wallace</td>\n",
+ " <td>SF</td>\n",
+ " <td>Philadelphia 76ers</td>\n",
+ " <td>10.105855</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>204</th>\n",
+ " <td>Luol Deng</td>\n",
+ " <td>SF</td>\n",
+ " <td>Miami Heat</td>\n",
+ " <td>10.151612</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>144</th>\n",
+ " <td>Monta Ellis</td>\n",
+ " <td>SG</td>\n",
+ " <td>Indiana Pacers</td>\n",
+ " <td>10.300000</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>95</th>\n",
+ " <td>Wilson Chandler</td>\n",
+ " <td>SF</td>\n",
+ " <td>Denver Nuggets</td>\n",
+ " <td>10.449438</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>...</th>\n",
+ " <td>...</td>\n",
+ " <td>...</td>\n",
+ " <td>...</td>\n",
+ " <td>...</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>131</th>\n",
+ " <td>Dwight Howard</td>\n",
+ " <td>C</td>\n",
+ " <td>Houston Rockets</td>\n",
+ " <td>22.359364</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>255</th>\n",
+ " <td>Carmelo Anthony</td>\n",
+ " <td>SF</td>\n",
+ " <td>New York Knicks</td>\n",
+ " <td>22.875000</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>72</th>\n",
+ " <td>LeBron James</td>\n",
+ " <td>SF</td>\n",
+ " <td>Cleveland Cavaliers</td>\n",
+ " <td>22.970500</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>29</th>\n",
+ " <td>Joe Johnson</td>\n",
+ " <td>SF</td>\n",
+ " <td>Brooklyn Nets</td>\n",
+ " <td>24.894863</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>169</th>\n",
+ " <td>Kobe Bryant</td>\n",
+ " <td>SF</td>\n",
+ " <td>Los Angeles Lakers</td>\n",
+ " <td>25.000000</td>\n",
+ " </tr>\n",
+ " </tbody>\n",
+ "</table>\n",
+ "<p>69 rows × 4 columns</p>\n",
+ "</div>"
+ ],
+ "text/plain": [
+ " PLAYER POSITION TEAM SALARY\n",
+ "368 DeMar DeRozan SG Toronto Raptors 10.050000\n",
+ "298 Gerald Wallace SF Philadelphia 76ers 10.105855\n",
+ "204 Luol Deng SF Miami Heat 10.151612\n",
+ "144 Monta Ellis SG Indiana Pacers 10.300000\n",
+ "95 Wilson Chandler SF Denver Nuggets 10.449438\n",
+ ".. ... ... ... ...\n",
+ "131 Dwight Howard C Houston Rockets 22.359364\n",
+ "255 Carmelo Anthony SF New York Knicks 22.875000\n",
+ "72 LeBron James SF Cleveland Cavaliers 22.970500\n",
+ "29 Joe Johnson SF Brooklyn Nets 24.894863\n",
+ "169 Kobe Bryant SF Los Angeles Lakers 25.000000\n",
+ "\n",
+ "[69 rows x 4 columns]"
+ ]
+ },
+ "execution_count": 9,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "nba[nba['SALARY'] >10].sort_values('SALARY')"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "How much did Stephen Curry make? For the answer, we have to access the row where the value of `PLAYER` is equal to `Stephen Curry`. The result is a table consisting of just one row:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 10,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "<div>\n",
+ "<style scoped>\n",
+ " .dataframe tbody tr th:only-of-type {\n",
+ " vertical-align: middle;\n",
+ " }\n",
+ "\n",
+ " .dataframe tbody tr th {\n",
+ " vertical-align: top;\n",
+ " }\n",
+ "\n",
+ " .dataframe thead th {\n",
+ " text-align: right;\n",
+ " }\n",
+ "</style>\n",
+ "<table border=\"1\" class=\"dataframe\">\n",
+ " <thead>\n",
+ " <tr style=\"text-align: right;\">\n",
+ " <th></th>\n",
+ " <th>PLAYER</th>\n",
+ " <th>POSITION</th>\n",
+ " <th>TEAM</th>\n",
+ " <th>SALARY</th>\n",
+ " </tr>\n",
+ " </thead>\n",
+ " <tbody>\n",
+ " <tr>\n",
+ " <th>121</th>\n",
+ " <td>Stephen Curry</td>\n",
+ " <td>PG</td>\n",
+ " <td>Golden State Warriors</td>\n",
+ " <td>11.370786</td>\n",
+ " </tr>\n",
+ " </tbody>\n",
+ "</table>\n",
+ "</div>"
+ ],
+ "text/plain": [
+ " PLAYER POSITION TEAM SALARY\n",
+ "121 Stephen Curry PG Golden State Warriors 11.370786"
+ ]
+ },
+ "execution_count": 10,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "nba[nba['PLAYER'] == 'Stephen Curry']"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Curry made just under $\\$11.4$ million. That's a lot of money, but it's less than half the salary of LeBron James. You'll find that salary in the \"Top 5\" table earlier in this section, or you could find it by replacing `'Stephen Curry'` with `'LeBron James'` in the line of code above.\n",
+ "\n",
+ "In the same way, you can get a DataFrame of the rows where `TEAM` is exactly equal to `'Golden State Warriors'`:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 11,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "<div>\n",
+ "<style scoped>\n",
+ " .dataframe tbody tr th:only-of-type {\n",
+ " vertical-align: middle;\n",
+ " }\n",
+ "\n",
+ " .dataframe tbody tr th {\n",
+ " vertical-align: top;\n",
+ " }\n",
+ "\n",
+ " .dataframe thead th {\n",
+ " text-align: right;\n",
+ " }\n",
+ "</style>\n",
+ "<table border=\"1\" class=\"dataframe\">\n",
+ " <thead>\n",
+ " <tr style=\"text-align: right;\">\n",
+ " <th></th>\n",
+ " <th>PLAYER</th>\n",
+ " <th>POSITION</th>\n",
+ " <th>TEAM</th>\n",
+ " <th>SALARY</th>\n",
+ " </tr>\n",
+ " </thead>\n",
+ " <tbody>\n",
+ " <tr>\n",
+ " <th>117</th>\n",
+ " <td>Klay Thompson</td>\n",
+ " <td>SG</td>\n",
+ " <td>Golden State Warriors</td>\n",
+ " <td>15.501000</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>118</th>\n",
+ " <td>Draymond Green</td>\n",
+ " <td>PF</td>\n",
+ " <td>Golden State Warriors</td>\n",
+ " <td>14.260870</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>119</th>\n",
+ " <td>Andrew Bogut</td>\n",
+ " <td>C</td>\n",
+ " <td>Golden State Warriors</td>\n",
+ " <td>13.800000</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>120</th>\n",
+ " <td>Andre Iguodala</td>\n",
+ " <td>SF</td>\n",
+ " <td>Golden State Warriors</td>\n",
+ " <td>11.710456</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>121</th>\n",
+ " <td>Stephen Curry</td>\n",
+ " <td>PG</td>\n",
+ " <td>Golden State Warriors</td>\n",
+ " <td>11.370786</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>122</th>\n",
+ " <td>Jason Thompson</td>\n",
+ " <td>PF</td>\n",
+ " <td>Golden State Warriors</td>\n",
+ " <td>7.008475</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>123</th>\n",
+ " <td>Shaun Livingston</td>\n",
+ " <td>PG</td>\n",
+ " <td>Golden State Warriors</td>\n",
+ " <td>5.543725</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>124</th>\n",
+ " <td>Harrison Barnes</td>\n",
+ " <td>SF</td>\n",
+ " <td>Golden State Warriors</td>\n",
+ " <td>3.873398</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>125</th>\n",
+ " <td>Marreese Speights</td>\n",
+ " <td>C</td>\n",
+ " <td>Golden State Warriors</td>\n",
+ " <td>3.815000</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>126</th>\n",
+ " <td>Leandro Barbosa</td>\n",
+ " <td>SG</td>\n",
+ " <td>Golden State Warriors</td>\n",
+ " <td>2.500000</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>127</th>\n",
+ " <td>Festus Ezeli</td>\n",
+ " <td>C</td>\n",
+ " <td>Golden State Warriors</td>\n",
+ " <td>2.008748</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>128</th>\n",
+ " <td>Brandon Rush</td>\n",
+ " <td>SF</td>\n",
+ " <td>Golden State Warriors</td>\n",
+ " <td>1.270964</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>129</th>\n",
+ " <td>Kevon Looney</td>\n",
+ " <td>SF</td>\n",
+ " <td>Golden State Warriors</td>\n",
+ " <td>1.131960</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>130</th>\n",
+ " <td>Anderson Varejao</td>\n",
+ " <td>PF</td>\n",
+ " <td>Golden State Warriors</td>\n",
+ " <td>0.289755</td>\n",
+ " </tr>\n",
+ " </tbody>\n",
+ "</table>\n",
+ "</div>"
+ ],
+ "text/plain": [
+ " PLAYER POSITION TEAM SALARY\n",
+ "117 Klay Thompson SG Golden State Warriors 15.501000\n",
+ "118 Draymond Green PF Golden State Warriors 14.260870\n",
+ "119 Andrew Bogut C Golden State Warriors 13.800000\n",
+ "120 Andre Iguodala SF Golden State Warriors 11.710456\n",
+ "121 Stephen Curry PG Golden State Warriors 11.370786\n",
+ "122 Jason Thompson PF Golden State Warriors 7.008475\n",
+ "123 Shaun Livingston PG Golden State Warriors 5.543725\n",
+ "124 Harrison Barnes SF Golden State Warriors 3.873398\n",
+ "125 Marreese Speights C Golden State Warriors 3.815000\n",
+ "126 Leandro Barbosa SG Golden State Warriors 2.500000\n",
+ "127 Festus Ezeli C Golden State Warriors 2.008748\n",
+ "128 Brandon Rush SF Golden State Warriors 1.270964\n",
+ "129 Kevon Looney SF Golden State Warriors 1.131960\n",
+ "130 Anderson Varejao PF Golden State Warriors 0.289755"
+ ]
+ },
+ "execution_count": 11,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "nba[nba['TEAM'] == 'Golden State Warriors']"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "This portion of the table is already sorted by salary, because the original table listed players sorted by salary within the same team. Because the line does not end with `.head()`, all 14 rows are shown rather than only the first 5 (the default for `.head()`)."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### Multiple Features\n",
+ "You can access rows that have multiple specified features by combining conditions with the boolean `&` operator, wrapping each condition in parentheses. For example, here is a way to extract all the Point Guards whose salaries were over $\\$15$ million."
+ "By now you will have realized that the general way to create a new df by selecting rows with given features is to combine the appropriate conditions with `&` (and) or `|` (or):\n",
+ "We are now ready to work with large tables of data. The file below contains \"Annual Estimates of the Resident Population by Single Year of Age and Sex for the United States.\" Notice that `pd.read_csv` can read data directly from a URL."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 2,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "<div>\n",
+ "<style scoped>\n",
+ " .dataframe tbody tr th:only-of-type {\n",
+ " vertical-align: middle;\n",
+ " }\n",
+ "\n",
+ " .dataframe tbody tr th {\n",
+ " vertical-align: top;\n",
+ " }\n",
+ "\n",
+ " .dataframe thead th {\n",
+ " text-align: right;\n",
+ " }\n",
+ "</style>\n",
+ "<table border=\"1\" class=\"dataframe\">\n",
+ " <thead>\n",
+ " <tr style=\"text-align: right;\">\n",
+ " <th></th>\n",
+ " <th>SEX</th>\n",
+ " <th>AGE</th>\n",
+ " <th>CENSUS2010POP</th>\n",
+ " <th>ESTIMATESBASE2010</th>\n",
+ " <th>POPESTIMATE2010</th>\n",
+ " <th>POPESTIMATE2011</th>\n",
+ " <th>POPESTIMATE2012</th>\n",
+ " <th>POPESTIMATE2013</th>\n",
+ " <th>POPESTIMATE2014</th>\n",
+ " <th>POPESTIMATE2015</th>\n",
+ " </tr>\n",
+ " </thead>\n",
+ " <tbody>\n",
+ " <tr>\n",
+ " <th>0</th>\n",
+ " <td>0</td>\n",
+ " <td>0</td>\n",
+ " <td>3944153</td>\n",
+ " <td>3944160</td>\n",
+ " <td>3951330</td>\n",
+ " <td>3963087</td>\n",
+ " <td>3926540</td>\n",
+ " <td>3931141</td>\n",
+ " <td>3949775</td>\n",
+ " <td>3978038</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>1</th>\n",
+ " <td>0</td>\n",
+ " <td>1</td>\n",
+ " <td>3978070</td>\n",
+ " <td>3978090</td>\n",
+ " <td>3957888</td>\n",
+ " <td>3966551</td>\n",
+ " <td>3977939</td>\n",
+ " <td>3942872</td>\n",
+ " <td>3949776</td>\n",
+ " <td>3968564</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>2</th>\n",
+ " <td>0</td>\n",
+ " <td>2</td>\n",
+ " <td>4096929</td>\n",
+ " <td>4096939</td>\n",
+ " <td>4090862</td>\n",
+ " <td>3971565</td>\n",
+ " <td>3980095</td>\n",
+ " <td>3992720</td>\n",
+ " <td>3959664</td>\n",
+ " <td>3966583</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>3</th>\n",
+ " <td>0</td>\n",
+ " <td>3</td>\n",
+ " <td>4119040</td>\n",
+ " <td>4119051</td>\n",
+ " <td>4111920</td>\n",
+ " <td>4102470</td>\n",
+ " <td>3983157</td>\n",
+ " <td>3992734</td>\n",
+ " <td>4007079</td>\n",
+ " <td>3974061</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>4</th>\n",
+ " <td>0</td>\n",
+ " <td>4</td>\n",
+ " <td>4063170</td>\n",
+ " <td>4063186</td>\n",
+ " <td>4077551</td>\n",
+ " <td>4122294</td>\n",
+ " <td>4112849</td>\n",
+ " <td>3994449</td>\n",
+ " <td>4005716</td>\n",
+ " <td>4020035</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>...</th>\n",
+ " <td>...</td>\n",
+ " <td>...</td>\n",
+ " <td>...</td>\n",
+ " <td>...</td>\n",
+ " <td>...</td>\n",
+ " <td>...</td>\n",
+ " <td>...</td>\n",
+ " <td>...</td>\n",
+ " <td>...</td>\n",
+ " <td>...</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>301</th>\n",
+ " <td>2</td>\n",
+ " <td>97</td>\n",
+ " <td>53582</td>\n",
+ " <td>53605</td>\n",
+ " <td>54118</td>\n",
+ " <td>57159</td>\n",
+ " <td>59533</td>\n",
+ " <td>61255</td>\n",
+ " <td>62779</td>\n",
+ " <td>69285</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>302</th>\n",
+ " <td>2</td>\n",
+ " <td>98</td>\n",
+ " <td>36641</td>\n",
+ " <td>36675</td>\n",
+ " <td>37532</td>\n",
+ " <td>40116</td>\n",
+ " <td>42857</td>\n",
+ " <td>44359</td>\n",
+ " <td>46208</td>\n",
+ " <td>47272</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>303</th>\n",
+ " <td>2</td>\n",
+ " <td>99</td>\n",
+ " <td>26193</td>\n",
+ " <td>26214</td>\n",
+ " <td>26074</td>\n",
+ " <td>27030</td>\n",
+ " <td>29320</td>\n",
+ " <td>31112</td>\n",
+ " <td>32517</td>\n",
+ " <td>34064</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>304</th>\n",
+ " <td>2</td>\n",
+ " <td>100</td>\n",
+ " <td>44202</td>\n",
+ " <td>44246</td>\n",
+ " <td>45058</td>\n",
+ " <td>47556</td>\n",
+ " <td>50661</td>\n",
+ " <td>53902</td>\n",
+ " <td>58008</td>\n",
+ " <td>61886</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>305</th>\n",
+ " <td>2</td>\n",
+ " <td>999</td>\n",
+ " <td>156964212</td>\n",
+ " <td>156969328</td>\n",
+ " <td>157258820</td>\n",
+ " <td>158427085</td>\n",
+ " <td>159581546</td>\n",
+ " <td>160720625</td>\n",
+ " <td>161952064</td>\n",
+ " <td>163189523</td>\n",
+ " </tr>\n",
+ " </tbody>\n",
+ "</table>\n",
+ "<p>306 rows × 10 columns</p>\n",
+ "</div>"
+ ],
+ "text/plain": [
+ " SEX AGE CENSUS2010POP ESTIMATESBASE2010 POPESTIMATE2010 \\\n",
+ "# A local copy can be accessed here in case census.gov moves the file:\n",
+ "# data = path_data + 'nc-est2015-agesex-res.csv'\n",
+ "\n",
+ "full_census_table = pd.read_csv(data)\n",
+ "full_census_table"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Only the first 5 and last 5 rows of the DataFrame are displayed. Later we will see how to display the entire DataFrame; however, this is typically not useful with large tables.\n",
+ "\n",
+ "A [description of the table](http://www2.census.gov/programs-surveys/popest/datasets/2010-2015/national/asrh/nc-est2015-agesex-res.pdf) appears online. The `SEX` column contains numeric codes: `0` stands for the total, `1` for male, and `2` for female. The `AGE` column contains ages in completed years, but the special value `999` marks a row that holds the total population summed over all ages. The rest of the columns contain estimates of the US population.\n",
+ "\n",
+ "Typically, a public table will contain more information than necessary for a particular investigation or analysis. In this case, let us suppose that we are only interested in the population changes from 2010 to 2014. Let us `select` the relevant columns."
+ "We now have a table that is easy to work with. Each column of the table is an array of the same length, and so columns can be combined using arithmetic. Here is the change in population between 2010 and 2014."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 5,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "0 -1555\n",
+ "1 -8112\n",
+ "2 -131198\n",
+ "3 -104841\n",
+ "4 -71835\n",
+ " ... \n",
+ "301 8661\n",
+ "302 8676\n",
+ "303 6443\n",
+ "304 12950\n",
+ "305 4693244\n",
+ "Length: 306, dtype: int64"
+ ]
+ },
+ "execution_count": 5,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "us_pop['2014'] - us_pop['2010']"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Let us augment `us_pop` with columns that contain these changes, both in absolute terms and as percents relative to the value in 2010."
+ "Not surprisingly, the top row of the sorted table is the line that corresponds to the entire population: both sexes and all age groups. From 2010 to 2014, the population of the United States increased by about 9.5 million people, a change of just over 3%.\n",
+ "\n",
+ "The next two rows correspond to all the men and all the women respectively. The male population grew more than the female population, both in absolute and percentage terms. Both percent changes were around 3%.\n",
+ "\n",
+ "Now take a look at the next few rows. The percent change jumps from about 3% for the overall population to almost 30% for the people in their late sixties and early seventies. This stunning change contributes to what is known as the greying of America.\n",
+ "\n",
+ "By far the greatest absolute change was among those in the 64-67 age group in 2014. What could explain this large increase? We can explore this question by examining the years in which the relevant groups were born.\n",
+ "\n",
+ "- Those who were in the 64-67 age group in 2010 were born in the years 1943 to 1946. The attack on Pearl Harbor was in late 1941, and by 1942 U.S. forces were heavily engaged in a massive war that ended in 1945. \n",
+ "\n",
+ "- Those who were 64 to 67 years old in 2014 were born in the years 1947 to 1950, at the height of the post-WWII baby boom in the United States. \n",
+ "\n",
+ "The post-war jump in births is the major reason for the large changes that we have observed."
+ "We are now ready to work with large tables of data. The file below contains \"Annual Estimates of the Resident Population by Single Year of Age and Sex for the United States.\" Notice that `read_table` can read data directly from a URL."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 2,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "<div>\n",
+ "<style scoped>\n",
+ " .dataframe tbody tr th:only-of-type {\n",
+ " vertical-align: middle;\n",
+ " }\n",
+ "\n",
+ " .dataframe tbody tr th {\n",
+ " vertical-align: top;\n",
+ " }\n",
+ "\n",
+ " .dataframe thead th {\n",
+ " text-align: right;\n",
+ " }\n",
+ "</style>\n",
+ "<table border=\"1\" class=\"dataframe\">\n",
+ " <thead>\n",
+ " <tr style=\"text-align: right;\">\n",
+ " <th></th>\n",
+ " <th>SEX</th>\n",
+ " <th>AGE</th>\n",
+ " <th>CENSUS2010POP</th>\n",
+ " <th>ESTIMATESBASE2010</th>\n",
+ " <th>POPESTIMATE2010</th>\n",
+ " <th>POPESTIMATE2011</th>\n",
+ " <th>POPESTIMATE2012</th>\n",
+ " <th>POPESTIMATE2013</th>\n",
+ " <th>POPESTIMATE2014</th>\n",
+ " <th>POPESTIMATE2015</th>\n",
+ " </tr>\n",
+ " </thead>\n",
+ " <tbody>\n",
+ " <tr>\n",
+ " <th>0</th>\n",
+ " <td>0</td>\n",
+ " <td>0</td>\n",
+ " <td>3944153</td>\n",
+ " <td>3944160</td>\n",
+ " <td>3951330</td>\n",
+ " <td>3963087</td>\n",
+ " <td>3926540</td>\n",
+ " <td>3931141</td>\n",
+ " <td>3949775</td>\n",
+ " <td>3978038</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>1</th>\n",
+ " <td>0</td>\n",
+ " <td>1</td>\n",
+ " <td>3978070</td>\n",
+ " <td>3978090</td>\n",
+ " <td>3957888</td>\n",
+ " <td>3966551</td>\n",
+ " <td>3977939</td>\n",
+ " <td>3942872</td>\n",
+ " <td>3949776</td>\n",
+ " <td>3968564</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>2</th>\n",
+ " <td>0</td>\n",
+ " <td>2</td>\n",
+ " <td>4096929</td>\n",
+ " <td>4096939</td>\n",
+ " <td>4090862</td>\n",
+ " <td>3971565</td>\n",
+ " <td>3980095</td>\n",
+ " <td>3992720</td>\n",
+ " <td>3959664</td>\n",
+ " <td>3966583</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>3</th>\n",
+ " <td>0</td>\n",
+ " <td>3</td>\n",
+ " <td>4119040</td>\n",
+ " <td>4119051</td>\n",
+ " <td>4111920</td>\n",
+ " <td>4102470</td>\n",
+ " <td>3983157</td>\n",
+ " <td>3992734</td>\n",
+ " <td>4007079</td>\n",
+ " <td>3974061</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>4</th>\n",
+ " <td>0</td>\n",
+ " <td>4</td>\n",
+ " <td>4063170</td>\n",
+ " <td>4063186</td>\n",
+ " <td>4077551</td>\n",
+ " <td>4122294</td>\n",
+ " <td>4112849</td>\n",
+ " <td>3994449</td>\n",
+ " <td>4005716</td>\n",
+ " <td>4020035</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>...</th>\n",
+ " <td>...</td>\n",
+ " <td>...</td>\n",
+ " <td>...</td>\n",
+ " <td>...</td>\n",
+ " <td>...</td>\n",
+ " <td>...</td>\n",
+ " <td>...</td>\n",
+ " <td>...</td>\n",
+ " <td>...</td>\n",
+ " <td>...</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>301</th>\n",
+ " <td>2</td>\n",
+ " <td>97</td>\n",
+ " <td>53582</td>\n",
+ " <td>53605</td>\n",
+ " <td>54118</td>\n",
+ " <td>57159</td>\n",
+ " <td>59533</td>\n",
+ " <td>61255</td>\n",
+ " <td>62779</td>\n",
+ " <td>69285</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>302</th>\n",
+ " <td>2</td>\n",
+ " <td>98</td>\n",
+ " <td>36641</td>\n",
+ " <td>36675</td>\n",
+ " <td>37532</td>\n",
+ " <td>40116</td>\n",
+ " <td>42857</td>\n",
+ " <td>44359</td>\n",
+ " <td>46208</td>\n",
+ " <td>47272</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>303</th>\n",
+ " <td>2</td>\n",
+ " <td>99</td>\n",
+ " <td>26193</td>\n",
+ " <td>26214</td>\n",
+ " <td>26074</td>\n",
+ " <td>27030</td>\n",
+ " <td>29320</td>\n",
+ " <td>31112</td>\n",
+ " <td>32517</td>\n",
+ " <td>34064</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>304</th>\n",
+ " <td>2</td>\n",
+ " <td>100</td>\n",
+ " <td>44202</td>\n",
+ " <td>44246</td>\n",
+ " <td>45058</td>\n",
+ " <td>47556</td>\n",
+ " <td>50661</td>\n",
+ " <td>53902</td>\n",
+ " <td>58008</td>\n",
+ " <td>61886</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>305</th>\n",
+ " <td>2</td>\n",
+ " <td>999</td>\n",
+ " <td>156964212</td>\n",
+ " <td>156969328</td>\n",
+ " <td>157258820</td>\n",
+ " <td>158427085</td>\n",
+ " <td>159581546</td>\n",
+ " <td>160720625</td>\n",
+ " <td>161952064</td>\n",
+ " <td>163189523</td>\n",
+ " </tr>\n",
+ " </tbody>\n",
+ "</table>\n",
+ "<p>306 rows × 10 columns</p>\n",
+ "</div>"
+ ],
+ "text/plain": [
+ " SEX AGE CENSUS2010POP ESTIMATESBASE2010 POPESTIMATE2010 \\\n",
+ "# A local copy can be accessed here in case census.gov moves the file:\n",
+ "# data = path_data + 'nc-est2015-agesex-res.csv'\n",
+ "\n",
+ "full_census_table = pd.read_csv(data)\n",
+ "full_census_table"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Only the first 5 and last 5 rows of the DataFrame are displayed. Later we will see how to display the entire DataFrame; however, this is typically not useful with large tables.\n",
+ "\n",
+ "a [description of the table](http://www2.census.gov/programs-surveys/popest/datasets/2010-2015/national/asrh/nc-est2015-agesex-res.pdf) appears online. The `SEX` column contains numeric codes: `0` stands for the total, `1` for male, and `2` for female. The `AGE` column contains ages in completed years, but the special value `999` is a sum of the total population. The rest of the columns contain estimates of the US population.\n",
+ "\n",
+ "Typically, a public table will contain more information than necessary for a particular investigation or analysis. In this case, let us suppose that we are only interested in the population changes from 2010 to 2014. Let us `select` the relevant columns."
+ "We now have a table that is easy to work with. Each column of the table is an array of the same length, and so columns can be combined using arithmetic. Here is the change in population between 2010 and 2014."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 5,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "0 -1555\n",
+ "1 -8112\n",
+ "2 -131198\n",
+ "3 -104841\n",
+ "4 -71835\n",
+ " ... \n",
+ "301 8661\n",
+ "302 8676\n",
+ "303 6443\n",
+ "304 12950\n",
+ "305 4693244\n",
+ "Length: 306, dtype: int64"
+ ]
+ },
+ "execution_count": 5,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "us_pop['2014'] - us_pop['2010']"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Let us augment `us_pop` with columns that contain these changes, both in absolute terms and as percents relative to the value in 2010."
+ "Not surprisingly, the top row of the sorted table is the line that corresponds to the entire population: both sexes and all age groups. From 2010 to 2014, the population of the United States increased by about 9.5 million people, a change of just over 3%.\n",
+ "\n",
+ "The next two rows correspond to all the men and all the women respectively. The male population grew more than the female population, both in absolute and percentage terms. Both percent changes were around 3%.\n",
+ "\n",
+ "Now take a look at the next few rows. The percent change jumps from about 3% for the overall population to almost 30% for the people in their late sixties and early seventies. This stunning change contributes to what is known as the greying of America.\n",
+ "\n",
+ "By far the greatest absolute change was among those in the 64-67 age group in 2014. What could explain this large increase? We can explore this question by examining the years in which the relevant groups were born.\n",
+ "\n",
+ "- Those who were in the 64-67 age group in 2010 were born in the years 1943 to 1946. The attack on Pearl Harbor was in late 1941, and by 1942 U.S. forces were heavily engaged in a massive war that ended in 1945. \n",
+ "\n",
+ "- Those who were 64 to 67 years old in 2014 were born in the years 1947 to 1950, at the height of the post-WWII baby boom in the United States. \n",
+ "\n",
+ "The post-war jump in births is the major reason for the large changes that we have observed."
+ "DataFrames (dfs) are a fundamental object type for representing data sets. A df can be viewed in two ways:\n",
+ "* a sequence of named columns that each describe a single aspect of all entries in a data set, or\n",
+ "* a sequence of rows that each contain all information about a single entry in a data set.\n",
+ "\n",
+ "In order to use a DataFrame, import the module called `pandas`. By convention, it is imported as `pd`."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 2,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import pandas as pd"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Empty tables can be created using the `pd.DataFrame()` function. An empty table is useful because it can be extended to contain new rows and columns."
+ "When a new column is added to a DataFrame, a new DataFrame is **not** created, so the original DataFrame is modified. For example, the original DataFrame `flowers`, which had two columns before the third was added, now has three."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 7,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "<div>\n",
+ "<style scoped>\n",
+ " .dataframe tbody tr th:only-of-type {\n",
+ " vertical-align: middle;\n",
+ " }\n",
+ "\n",
+ " .dataframe tbody tr th {\n",
+ " vertical-align: top;\n",
+ " }\n",
+ "\n",
+ " .dataframe thead th {\n",
+ " text-align: right;\n",
+ " }\n",
+ "</style>\n",
+ "<table border=\"1\" class=\"dataframe\">\n",
+ " <thead>\n",
+ " <tr style=\"text-align: right;\">\n",
+ " <th></th>\n",
+ " <th>Number of petals</th>\n",
+ " <th>Name</th>\n",
+ " <th>Color</th>\n",
+ " </tr>\n",
+ " </thead>\n",
+ " <tbody>\n",
+ " <tr>\n",
+ " <th>0</th>\n",
+ " <td>8</td>\n",
+ " <td>lotus</td>\n",
+ " <td>{pink, yellow, red}</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>1</th>\n",
+ " <td>34</td>\n",
+ " <td>sunflower</td>\n",
+ " <td>{pink, yellow, red}</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>2</th>\n",
+ " <td>5</td>\n",
+ " <td>rose</td>\n",
+ " <td>{pink, yellow, red}</td>\n",
+ " </tr>\n",
+ " </tbody>\n",
+ "</table>\n",
+ "</div>"
+ ],
+ "text/plain": [
+ " Number of petals Name Color\n",
+ "0 8 lotus {pink, yellow, red}\n",
+ "1 34 sunflower {pink, yellow, red}\n",
+ "2 5 rose {pink, yellow, red}"
+ ]
+ },
+ "execution_count": 7,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "flowers"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "**Before** adding a third column, a copy of the df `flowers` can be created so that the two-column version is kept. In this case the new df is called `flowers_two_col`: `flowers_two_col = flowers.copy()`.\n",
+ "Creating dfs in this way involves a lot of typing. If the data have already been entered somewhere, it is usually possible to use Python to read it into a table, instead of typing it all in cell by cell.\n",
+ "\n",
+ "Often, dfs are created from files that contain comma-separated values. Such files are called CSV files.\n",
+ "\n",
+ "Below, we use the pandas function `pd.read_csv()` to read a CSV file that contains some of the data used by Minard in his graphic about Napoleon's Russian campaign. The data are placed in a df named `minard`.\n",
+ "We will use this small df to demonstrate some useful DataFrame methods. We will then use those same methods, and develop other methods, on much larger DataFrames."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## The Size of the Table\n",
+ "The attribute `df.shape` is a pair of numbers: `df.shape[1]` gives the number of columns in the table, and `df.shape[0]` the number of rows.\n",
+ "The same counts can also be found by using the `len()` function: `len(df)` gives the number of rows and `len(df.columns)` the number of columns. A DataFrame has no `.rows` attribute; `len()` applied to the df itself counts rows, so nothing needs to be added."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 13,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "5"
+ ]
+ },
+ "execution_count": 13,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "len(minard.columns)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 14,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "8"
+ ]
+ },
+ "execution_count": 14,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "len(minard)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### Column Labels\n",
+ "\n",
+ "The attribute `.columns` can be used to list the labels of all the columns. With `minard` we don't gain much by this, but it can be very useful for tables that are so large that not all columns are visible on the screen."
+ "We can change column labels using the `rename(columns={})` method. This creates a **new** df and leaves `minard` unchanged."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 16,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "<div>\n",
+ "<style scoped>\n",
+ " .dataframe tbody tr th:only-of-type {\n",
+ " vertical-align: middle;\n",
+ " }\n",
+ "\n",
+ " .dataframe tbody tr th {\n",
+ " vertical-align: top;\n",
+ " }\n",
+ "\n",
+ " .dataframe thead th {\n",
+ " text-align: right;\n",
+ " }\n",
+ "</style>\n",
+ "<table border=\"1\" class=\"dataframe\">\n",
+ " <thead>\n",
+ " <tr style=\"text-align: right;\">\n",
+ " <th></th>\n",
+ " <th>Longitude</th>\n",
+ " <th>Latitude</th>\n",
+ " <th>City Name</th>\n",
+ " <th>Direction</th>\n",
+ " <th>Survivors</th>\n",
+ " </tr>\n",
+ " </thead>\n",
+ " <tbody>\n",
+ " <tr>\n",
+ " <th>0</th>\n",
+ " <td>32.0</td>\n",
+ " <td>54.8</td>\n",
+ " <td>Smolensk</td>\n",
+ " <td>Advance</td>\n",
+ " <td>145000</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>1</th>\n",
+ " <td>33.2</td>\n",
+ " <td>54.9</td>\n",
+ " <td>Dorogobouge</td>\n",
+ " <td>Advance</td>\n",
+ " <td>140000</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>2</th>\n",
+ " <td>34.4</td>\n",
+ " <td>55.5</td>\n",
+ " <td>Chjat</td>\n",
+ " <td>Advance</td>\n",
+ " <td>127100</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>3</th>\n",
+ " <td>37.6</td>\n",
+ " <td>55.8</td>\n",
+ " <td>Moscou</td>\n",
+ " <td>Advance</td>\n",
+ " <td>100000</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>4</th>\n",
+ " <td>34.3</td>\n",
+ " <td>55.2</td>\n",
+ " <td>Wixma</td>\n",
+ " <td>Retreat</td>\n",
+ " <td>55000</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>5</th>\n",
+ " <td>32.0</td>\n",
+ " <td>54.6</td>\n",
+ " <td>Smolensk</td>\n",
+ " <td>Retreat</td>\n",
+ " <td>24000</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>6</th>\n",
+ " <td>30.4</td>\n",
+ " <td>54.4</td>\n",
+ " <td>Orscha</td>\n",
+ " <td>Retreat</td>\n",
+ " <td>20000</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>7</th>\n",
+ " <td>26.8</td>\n",
+ " <td>54.3</td>\n",
+ " <td>Moiodexno</td>\n",
+ " <td>Retreat</td>\n",
+ " <td>12000</td>\n",
+ " </tr>\n",
+ " </tbody>\n",
+ "</table>\n",
+ "</div>"
+ ],
+ "text/plain": [
+ " Longitude Latitude City Name Direction Survivors\n",
+ "0 32.0 54.8 Smolensk Advance 145000\n",
+ "1 33.2 54.9 Dorogobouge Advance 140000\n",
+ "2 34.4 55.5 Chjat Advance 127100\n",
+ "3 37.6 55.8 Moscou Advance 100000\n",
+ "4 34.3 55.2 Wixma Retreat 55000\n",
+ "5 32.0 54.6 Smolensk Retreat 24000\n",
+ "6 30.4 54.4 Orscha Retreat 20000\n",
+ "7 26.8 54.3 Moiodexno Retreat 12000"
+ ]
+ },
+ "execution_count": 16,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "minard.rename(columns={'City':'City Name'})"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "However, this method does not change the original DataFrame. "
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 17,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "<div>\n",
+ "<style scoped>\n",
+ " .dataframe tbody tr th:only-of-type {\n",
+ " vertical-align: middle;\n",
+ " }\n",
+ "\n",
+ " .dataframe tbody tr th {\n",
+ " vertical-align: top;\n",
+ " }\n",
+ "\n",
+ " .dataframe thead th {\n",
+ " text-align: right;\n",
+ " }\n",
+ "</style>\n",
+ "<table border=\"1\" class=\"dataframe\">\n",
+ " <thead>\n",
+ " <tr style=\"text-align: right;\">\n",
+ " <th></th>\n",
+ " <th>Longitude</th>\n",
+ " <th>Latitude</th>\n",
+ " <th>City</th>\n",
+ " <th>Direction</th>\n",
+ " <th>Survivors</th>\n",
+ " </tr>\n",
+ " </thead>\n",
+ " <tbody>\n",
+ " <tr>\n",
+ " <th>0</th>\n",
+ " <td>32.0</td>\n",
+ " <td>54.8</td>\n",
+ " <td>Smolensk</td>\n",
+ " <td>Advance</td>\n",
+ " <td>145000</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>1</th>\n",
+ " <td>33.2</td>\n",
+ " <td>54.9</td>\n",
+ " <td>Dorogobouge</td>\n",
+ " <td>Advance</td>\n",
+ " <td>140000</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>2</th>\n",
+ " <td>34.4</td>\n",
+ " <td>55.5</td>\n",
+ " <td>Chjat</td>\n",
+ " <td>Advance</td>\n",
+ " <td>127100</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>3</th>\n",
+ " <td>37.6</td>\n",
+ " <td>55.8</td>\n",
+ " <td>Moscou</td>\n",
+ " <td>Advance</td>\n",
+ " <td>100000</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>4</th>\n",
+ " <td>34.3</td>\n",
+ " <td>55.2</td>\n",
+ " <td>Wixma</td>\n",
+ " <td>Retreat</td>\n",
+ " <td>55000</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>5</th>\n",
+ " <td>32.0</td>\n",
+ " <td>54.6</td>\n",
+ " <td>Smolensk</td>\n",
+ " <td>Retreat</td>\n",
+ " <td>24000</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>6</th>\n",
+ " <td>30.4</td>\n",
+ " <td>54.4</td>\n",
+ " <td>Orscha</td>\n",
+ " <td>Retreat</td>\n",
+ " <td>20000</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>7</th>\n",
+ " <td>26.8</td>\n",
+ " <td>54.3</td>\n",
+ " <td>Moiodexno</td>\n",
+ " <td>Retreat</td>\n",
+ " <td>12000</td>\n",
+ " </tr>\n",
+ " </tbody>\n",
+ "</table>\n",
+ "</div>"
+ ],
+ "text/plain": [
+ " Longitude Latitude City Direction Survivors\n",
+ "0 32.0 54.8 Smolensk Advance 145000\n",
+ "1 33.2 54.9 Dorogobouge Advance 140000\n",
+ "2 34.4 55.5 Chjat Advance 127100\n",
+ "3 37.6 55.8 Moscou Advance 100000\n",
+ "4 34.3 55.2 Wixma Retreat 55000\n",
+ "5 32.0 54.6 Smolensk Retreat 24000\n",
+ "6 30.4 54.4 Orscha Retreat 20000\n",
+ "7 26.8 54.3 Moiodexno Retreat 12000"
+ ]
+ },
+ "execution_count": 17,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "minard"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "A common pattern is to **assign** the original name `minard` to the new table, so that all future uses of `minard` will refer to the relabeled table."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 18,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "<div>\n",
+ "<style scoped>\n",
+ " .dataframe tbody tr th:only-of-type {\n",
+ " vertical-align: middle;\n",
+ " }\n",
+ "\n",
+ " .dataframe tbody tr th {\n",
+ " vertical-align: top;\n",
+ " }\n",
+ "\n",
+ " .dataframe thead th {\n",
+ " text-align: right;\n",
+ " }\n",
+ "</style>\n",
+ "<table border=\"1\" class=\"dataframe\">\n",
+ " <thead>\n",
+ " <tr style=\"text-align: right;\">\n",
+ " <th></th>\n",
+ " <th>Longitude</th>\n",
+ " <th>Latitude</th>\n",
+ " <th>City Name</th>\n",
+ " <th>Direction</th>\n",
+ " <th>Survivors</th>\n",
+ " </tr>\n",
+ " </thead>\n",
+ " <tbody>\n",
+ " <tr>\n",
+ " <th>0</th>\n",
+ " <td>32.0</td>\n",
+ " <td>54.8</td>\n",
+ " <td>Smolensk</td>\n",
+ " <td>Advance</td>\n",
+ " <td>145000</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>1</th>\n",
+ " <td>33.2</td>\n",
+ " <td>54.9</td>\n",
+ " <td>Dorogobouge</td>\n",
+ " <td>Advance</td>\n",
+ " <td>140000</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>2</th>\n",
+ " <td>34.4</td>\n",
+ " <td>55.5</td>\n",
+ " <td>Chjat</td>\n",
+ " <td>Advance</td>\n",
+ " <td>127100</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>3</th>\n",
+ " <td>37.6</td>\n",
+ " <td>55.8</td>\n",
+ " <td>Moscou</td>\n",
+ " <td>Advance</td>\n",
+ " <td>100000</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>4</th>\n",
+ " <td>34.3</td>\n",
+ " <td>55.2</td>\n",
+ " <td>Wixma</td>\n",
+ " <td>Retreat</td>\n",
+ " <td>55000</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>5</th>\n",
+ " <td>32.0</td>\n",
+ " <td>54.6</td>\n",
+ " <td>Smolensk</td>\n",
+ " <td>Retreat</td>\n",
+ " <td>24000</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>6</th>\n",
+ " <td>30.4</td>\n",
+ " <td>54.4</td>\n",
+ " <td>Orscha</td>\n",
+ " <td>Retreat</td>\n",
+ " <td>20000</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>7</th>\n",
+ " <td>26.8</td>\n",
+ " <td>54.3</td>\n",
+ " <td>Moiodexno</td>\n",
+ " <td>Retreat</td>\n",
+ " <td>12000</td>\n",
+ " </tr>\n",
+ " </tbody>\n",
+ "</table>\n",
+ "</div>"
+ ],
+ "text/plain": [
+ " Longitude Latitude City Name Direction Survivors\n",
+ "We can use a column's label to access the array of data in the column."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 19,
+ "metadata": {
+ "scrolled": true
+ },
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "0 145000\n",
+ "1 140000\n",
+ "2 127100\n",
+ "3 100000\n",
+ "4 55000\n",
+ "5 24000\n",
+ "6 20000\n",
+ "7 12000\n",
+ "Name: Survivors, dtype: int64"
+ ]
+ },
+ "execution_count": 19,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "minard['Survivors']"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "#### type( )\n",
+ "\n",
+ "To determine the type of object created we can use the `type()` function."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 20,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "pandas.core.frame.DataFrame"
+ ]
+ },
+ "execution_count": 20,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "type(minard)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Using two sets of square brackets, the output is a DataFrame rather than a Series, and is displayed in DataFrame format."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 21,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "<div>\n",
+ "<style scoped>\n",
+ " .dataframe tbody tr th:only-of-type {\n",
+ " vertical-align: middle;\n",
+ " }\n",
+ "\n",
+ " .dataframe tbody tr th {\n",
+ " vertical-align: top;\n",
+ " }\n",
+ "\n",
+ " .dataframe thead th {\n",
+ " text-align: right;\n",
+ " }\n",
+ "</style>\n",
+ "<table border=\"1\" class=\"dataframe\">\n",
+ " <thead>\n",
+ " <tr style=\"text-align: right;\">\n",
+ " <th></th>\n",
+ " <th>Survivors</th>\n",
+ " </tr>\n",
+ " </thead>\n",
+ " <tbody>\n",
+ " <tr>\n",
+ " <th>0</th>\n",
+ " <td>145000</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>1</th>\n",
+ " <td>140000</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>2</th>\n",
+ " <td>127100</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>3</th>\n",
+ " <td>100000</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>4</th>\n",
+ " <td>55000</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>5</th>\n",
+ " <td>24000</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>6</th>\n",
+ " <td>20000</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>7</th>\n",
+ " <td>12000</td>\n",
+ " </tr>\n",
+ " </tbody>\n",
+ "</table>\n",
+ "</div>"
+ ],
+ "text/plain": [
+ " Survivors\n",
+ "0 145000\n",
+ "1 140000\n",
+ "2 127100\n",
+ "3 100000\n",
+ "4 55000\n",
+ "5 24000\n",
+ "6 20000\n",
+ "7 12000"
+ ]
+ },
+ "execution_count": 21,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "minard[['Survivors']]"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 22,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "pandas.core.frame.DataFrame"
+ ]
+ },
+ "execution_count": 22,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "type(minard)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### iloc[ ]\n",
+ "\n",
+ "(index location)\n",
+ "\n",
+ "The 5 columns are indexed 0, 1, 2, 3, and 4. The column `Survivors` can also be accessed by using the `iloc[]` indexer with the required column index. Notice that to select a column using `iloc[]` we first have to place a colon followed by a comma in the square brackets, because the first position inside `iloc[]` refers to rows.\n",
+ "The 8 items in the array are indexed 0, 1, 2, and so on, up to 7. The items in the column can be accessed by placing the item's index in square brackets, as with any array."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 24,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "145000"
+ ]
+ },
+ "execution_count": 24,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "minard.iloc[:,4][0]"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 25,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "24000"
+ ]
+ },
+ "execution_count": 25,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "minard.iloc[:,4][5]"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### Alternatively \n",
+ "\n",
+ "If we wish to find a particular member of a row, we select a row rather than a column. Notice that in this instance we have selected the row with index 5 and the column with index 4, that is, the 6th row and the 5th column, remembering that pandas refers to the first column as column 0 and the first row as row 0."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 26,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "24000"
+ ]
+ },
+ "execution_count": 26,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "minard.iloc[5][4]"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### Working with the Data in a Column\n",
+ "Because columns are arrays, we can use array operations on them to discover new information. For example, we can create a new column that contains the percent of all survivors at each city after Smolensk."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 27,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "<div>\n",
+ "<style scoped>\n",
+ " .dataframe tbody tr th:only-of-type {\n",
+ " vertical-align: middle;\n",
+ " }\n",
+ "\n",
+ " .dataframe tbody tr th {\n",
+ " vertical-align: top;\n",
+ " }\n",
+ "\n",
+ " .dataframe thead th {\n",
+ " text-align: right;\n",
+ " }\n",
+ "</style>\n",
+ "<table border=\"1\" class=\"dataframe\">\n",
+ " <thead>\n",
+ " <tr style=\"text-align: right;\">\n",
+ " <th></th>\n",
+ " <th>Longitude</th>\n",
+ " <th>Latitude</th>\n",
+ " <th>City Name</th>\n",
+ " <th>Direction</th>\n",
+ " <th>Survivors</th>\n",
+ " <th>Percent Surviving</th>\n",
+ " </tr>\n",
+ " </thead>\n",
+ " <tbody>\n",
+ " <tr>\n",
+ " <th>0</th>\n",
+ " <td>32.0</td>\n",
+ " <td>54.8</td>\n",
+ " <td>Smolensk</td>\n",
+ " <td>Advance</td>\n",
+ " <td>145000</td>\n",
+ " <td>1.000000</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>1</th>\n",
+ " <td>33.2</td>\n",
+ " <td>54.9</td>\n",
+ " <td>Dorogobouge</td>\n",
+ " <td>Advance</td>\n",
+ " <td>140000</td>\n",
+ " <td>0.965517</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>2</th>\n",
+ " <td>34.4</td>\n",
+ " <td>55.5</td>\n",
+ " <td>Chjat</td>\n",
+ " <td>Advance</td>\n",
+ " <td>127100</td>\n",
+ " <td>0.876552</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>3</th>\n",
+ " <td>37.6</td>\n",
+ " <td>55.8</td>\n",
+ " <td>Moscou</td>\n",
+ " <td>Advance</td>\n",
+ " <td>100000</td>\n",
+ " <td>0.689655</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>4</th>\n",
+ " <td>34.3</td>\n",
+ " <td>55.2</td>\n",
+ " <td>Wixma</td>\n",
+ " <td>Retreat</td>\n",
+ " <td>55000</td>\n",
+ " <td>0.379310</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>5</th>\n",
+ " <td>32.0</td>\n",
+ " <td>54.6</td>\n",
+ " <td>Smolensk</td>\n",
+ " <td>Retreat</td>\n",
+ " <td>24000</td>\n",
+ " <td>0.165517</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>6</th>\n",
+ " <td>30.4</td>\n",
+ " <td>54.4</td>\n",
+ " <td>Orscha</td>\n",
+ " <td>Retreat</td>\n",
+ " <td>20000</td>\n",
+ " <td>0.137931</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>7</th>\n",
+ " <td>26.8</td>\n",
+ " <td>54.3</td>\n",
+ " <td>Moiodexno</td>\n",
+ " <td>Retreat</td>\n",
+ " <td>12000</td>\n",
+ " <td>0.082759</td>\n",
+ " </tr>\n",
+ " </tbody>\n",
+ "</table>\n",
+ "</div>"
+ ],
+ "text/plain": [
+ " Longitude Latitude City Name Direction Survivors Percent Surviving\n",
+ "**N.B.** A peculiarity of working in a Jupyter notebook is that if you make a mistake, e.g. misspell a column name, a new column will be created when you run the formatting function. To remove this column, you can restart the kernel and rerun the cells.\n",
+ "\n",
+ "*Toolbar - Kernel - Restart & Clear Output*"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### Choosing Sets of Columns\n",
+ "To select particular columns we can use `df[['col1', 'col2']]`, which creates a new table that contains only the specified columns. When selecting a single column we can use one set of square brackets; when selecting multiple columns, two sets of square brackets are required."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 29,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "<div>\n",
+ "<style scoped>\n",
+ " .dataframe tbody tr th:only-of-type {\n",
+ " vertical-align: middle;\n",
+ " }\n",
+ "\n",
+ " .dataframe tbody tr th {\n",
+ " vertical-align: top;\n",
+ " }\n",
+ "\n",
+ " .dataframe thead th {\n",
+ " text-align: right;\n",
+ " }\n",
+ "</style>\n",
+ "<table border=\"1\" class=\"dataframe\">\n",
+ " <thead>\n",
+ " <tr style=\"text-align: right;\">\n",
+ " <th></th>\n",
+ " <th>Longitude</th>\n",
+ " <th>Latitude</th>\n",
+ " </tr>\n",
+ " </thead>\n",
+ " <tbody>\n",
+ " <tr>\n",
+ " <th>0</th>\n",
+ " <td>32.0</td>\n",
+ " <td>54.8</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>1</th>\n",
+ " <td>33.2</td>\n",
+ " <td>54.9</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>2</th>\n",
+ " <td>34.4</td>\n",
+ " <td>55.5</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>3</th>\n",
+ " <td>37.6</td>\n",
+ " <td>55.8</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>4</th>\n",
+ " <td>34.3</td>\n",
+ " <td>55.2</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>5</th>\n",
+ " <td>32.0</td>\n",
+ " <td>54.6</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>6</th>\n",
+ " <td>30.4</td>\n",
+ " <td>54.4</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>7</th>\n",
+ " <td>26.8</td>\n",
+ " <td>54.3</td>\n",
+ " </tr>\n",
+ " </tbody>\n",
+ "</table>\n",
+ "</div>"
+ ],
+ "text/plain": [
+ " Longitude Latitude\n",
+ "0 32.0 54.8\n",
+ "1 33.2 54.9\n",
+ "2 34.4 55.5\n",
+ "3 37.6 55.8\n",
+ "4 34.3 55.2\n",
+ "5 32.0 54.6\n",
+ "6 30.4 54.4\n",
+ "7 26.8 54.3"
+ ]
+ },
+ "execution_count": 29,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "minard[['Longitude', 'Latitude']]"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "The same selection can be made using column indices instead of labels.\n",
+ "\n",
+ "**N.B.** The column range selected is `0:2`, and the range is *bottom heavy*: it includes its bottom limit but excludes its top limit. Though the bottom limit is 0 and the top limit is 2, only elements 0 and 1 are processed, not elements 0, 1 and 2, i.e. the range is *bottom heavy* or *top light*."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 30,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "<div>\n",
+ "<style scoped>\n",
+ " .dataframe tbody tr th:only-of-type {\n",
+ " vertical-align: middle;\n",
+ " }\n",
+ "\n",
+ " .dataframe tbody tr th {\n",
+ " vertical-align: top;\n",
+ " }\n",
+ "\n",
+ " .dataframe thead th {\n",
+ " text-align: right;\n",
+ " }\n",
+ "</style>\n",
+ "<table border=\"1\" class=\"dataframe\">\n",
+ " <thead>\n",
+ " <tr style=\"text-align: right;\">\n",
+ " <th></th>\n",
+ " <th>Longitude</th>\n",
+ " <th>Latitude</th>\n",
+ " </tr>\n",
+ " </thead>\n",
+ " <tbody>\n",
+ " <tr>\n",
+ " <th>0</th>\n",
+ " <td>32.0</td>\n",
+ " <td>54.8</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>1</th>\n",
+ " <td>33.2</td>\n",
+ " <td>54.9</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>2</th>\n",
+ " <td>34.4</td>\n",
+ " <td>55.5</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>3</th>\n",
+ " <td>37.6</td>\n",
+ " <td>55.8</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>4</th>\n",
+ " <td>34.3</td>\n",
+ " <td>55.2</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>5</th>\n",
+ " <td>32.0</td>\n",
+ " <td>54.6</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>6</th>\n",
+ " <td>30.4</td>\n",
+ " <td>54.4</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>7</th>\n",
+ " <td>26.8</td>\n",
+ " <td>54.3</td>\n",
+ " </tr>\n",
+ " </tbody>\n",
+ "</table>\n",
+ "</div>"
+ ],
+ "text/plain": [
+ " Longitude Latitude\n",
+ "0 32.0 54.8\n",
+ "1 33.2 54.9\n",
+ "2 34.4 55.5\n",
+ "3 37.6 55.8\n",
+ "4 34.3 55.2\n",
+ "5 32.0 54.6\n",
+ "6 30.4 54.4\n",
+ "7 26.8 54.3"
+ ]
+ },
+ "execution_count": 30,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "minard.iloc[:, 0:2]"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
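+ "The result of using `df[[' ']]` is a new DataFrame, even when you select just one column. With a single set of square brackets, as in the next cell, selecting one column gives a Series instead."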
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 31,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "0 145000\n",
+ "1 140000\n",
+ "2 127100\n",
+ "3 100000\n",
+ "4 55000\n",
+ "5 24000\n",
+ "6 20000\n",
+ "7 12000\n",
+ "Name: Survivors, dtype: int64"
+ ]
+ },
+ "execution_count": 31,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "minard['Survivors']"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Another way to create a new table consisting of a set of columns is to `drop` the columns you don't want."
+ "Neither `df[[' ']]` nor `drop` changes the original DataFrame. Instead, they create new, smaller DataFrames that contain the same data. The fact that the original DataFrame is preserved is useful! You can generate multiple different tables that only consider certain columns without worrying that one analysis will affect the other."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 33,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "<div>\n",
+ "<style scoped>\n",
+ " .dataframe tbody tr th:only-of-type {\n",
+ " vertical-align: middle;\n",
+ " }\n",
+ "\n",
+ " .dataframe tbody tr th {\n",
+ " vertical-align: top;\n",
+ " }\n",
+ "\n",
+ " .dataframe thead th {\n",
+ " text-align: right;\n",
+ " }\n",
+ "</style>\n",
+ "<table border=\"1\" class=\"dataframe\">\n",
+ " <thead>\n",
+ " <tr style=\"text-align: right;\">\n",
+ " <th></th>\n",
+ " <th>Longitude</th>\n",
+ " <th>Latitude</th>\n",
+ " <th>City Name</th>\n",
+ " <th>Direction</th>\n",
+ " <th>Survivors</th>\n",
+ " <th>Percent Surviving</th>\n",
+ " </tr>\n",
+ " </thead>\n",
+ " <tbody>\n",
+ " <tr>\n",
+ " <th>0</th>\n",
+ " <td>32.0</td>\n",
+ " <td>54.8</td>\n",
+ " <td>Smolensk</td>\n",
+ " <td>Advance</td>\n",
+ " <td>145000</td>\n",
+ " <td>1.000000</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>1</th>\n",
+ " <td>33.2</td>\n",
+ " <td>54.9</td>\n",
+ " <td>Dorogobouge</td>\n",
+ " <td>Advance</td>\n",
+ " <td>140000</td>\n",
+ " <td>0.965517</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>2</th>\n",
+ " <td>34.4</td>\n",
+ " <td>55.5</td>\n",
+ " <td>Chjat</td>\n",
+ " <td>Advance</td>\n",
+ " <td>127100</td>\n",
+ " <td>0.876552</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>3</th>\n",
+ " <td>37.6</td>\n",
+ " <td>55.8</td>\n",
+ " <td>Moscou</td>\n",
+ " <td>Advance</td>\n",
+ " <td>100000</td>\n",
+ " <td>0.689655</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>4</th>\n",
+ " <td>34.3</td>\n",
+ " <td>55.2</td>\n",
+ " <td>Wixma</td>\n",
+ " <td>Retreat</td>\n",
+ " <td>55000</td>\n",
+ " <td>0.379310</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>5</th>\n",
+ " <td>32.0</td>\n",
+ " <td>54.6</td>\n",
+ " <td>Smolensk</td>\n",
+ " <td>Retreat</td>\n",
+ " <td>24000</td>\n",
+ " <td>0.165517</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>6</th>\n",
+ " <td>30.4</td>\n",
+ " <td>54.4</td>\n",
+ " <td>Orscha</td>\n",
+ " <td>Retreat</td>\n",
+ " <td>20000</td>\n",
+ " <td>0.137931</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>7</th>\n",
+ " <td>26.8</td>\n",
+ " <td>54.3</td>\n",
+ " <td>Moiodexno</td>\n",
+ " <td>Retreat</td>\n",
+ " <td>12000</td>\n",
+ " <td>0.082759</td>\n",
+ " </tr>\n",
+ " </tbody>\n",
+ "</table>\n",
+ "</div>"
+ ],
+ "text/plain": [
+ " Longitude Latitude City Name Direction Survivors Percent Surviving\n",
+ "Data scientists often need to classify individuals into groups according to shared features, and then identify some characteristics of the groups. For example, in the example using Galton's data on heights, we saw that it was useful to classify families according to the parents' midparent heights, and then find the average height of the children in each group.\n",
+ "\n",
+ "This section is about classifying individuals into categories that are not numerical. We begin by recalling the basic use of `groupby`."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Counting the Number in Each Category\n",
+ "The `groupby` method, combined with `count()`, counts the number of rows for each category in a column. The result contains one row per unique value in the grouped column.\n",
+ "\n",
+ "Here is a small table of data on ice cream cones. The `groupby` method can be used to list the distinct flavors and provide the counts of each flavor."
+ "There are two distinct categories, chocolate and strawberry. When we call `groupby` we must state what we want to do with the grouped data, e.g. `count()`. Applying the `count()` method creates a column of counts which by default keeps the name of the column being counted (here `Price`), and contains the number of rows in each category. To make this easier to read we could rename the count column to 'count'.\n",
+ "\n",
+ "Notice that this can all be worked out from just the `Flavor` column. Only the `Price` column name has been used, the data has not been used.\n",
+ "\n",
+ "But what if we wanted the total price of the cones of each different flavor? In this case we can apply a different method e.g. `sum()`, to `groupby`."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### Finding a Characteristic of Each Category\n",
+ "The method applied after `groupby` names the function that will be used to aggregate values in the other columns for all of the rows in each category. For instance, `sum()` will sum up the prices in all rows that match each category. This result also contains one row per unique value in the grouped column; the grouped column labels the rows, and the remaining columns hold the aggregated values.\n",
+ "\n",
+ "To find the total price of each flavor, we call `groupby` again, with `Flavor` as its argument as before. But this time we follow it with the `sum()` method."
+ "To create this new table, `groupby` has calculated the **sum** of the `Price` entries in all the rows corresponding to each distinct flavor. The prices in the three `chocolate` rows add up to 16.55 (in whatever currency). The prices in the two `strawberry` rows have a total of 8.80.\n",
+ "\n",
+ "Using pandas `groupby aggregation` we can compute a summary statistic (or statistics). The label of the newly created column is `Price sum`, which is created by taking the label of the column being summed, and appending the word `sum` in the *aggregation pipeline*. \n",
+ "\n",
+ "In this insatnce there are only two columns so when `group` finds the `sum` of all columns other than the one with the categories, there is no need to specify that it has to `sum` the prices. Using the [`Pandas NamedAgg`](https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html#aggregation) function we can name the column contining the results of the aggregation.\n",
+ "\n",
+ "To see in more detail what `group` is doing, notice that you could have figured out the total prices yourself, not only by mental arithmetic but also using code. For example, to find the total price of all the chocolate cones, you could start by creating a new table consisting of only the chocolate cones, and then accessing the column of prices:"
+ "Once again, `groupby` creates arrays of the prices in each `Flavor` category. But now it finds the `max` of each array:"
+ ]
+ },
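The aggregation described above might look like the following sketch. It assumes a hand-built `cones` DataFrame (the row order is a guess) and uses `pd.NamedAgg` to produce the `Price sum` label; the names `totals` and `choc_total` are illustrative and do not come from the notebook.

```python
import pandas as pd

# Hand-built copy of the ice cream data; the row order is an assumption.
cones = pd.DataFrame({
    'Flavor': ['strawberry', 'chocolate', 'chocolate', 'strawberry', 'chocolate'],
    'Price': [3.55, 4.75, 6.55, 5.25, 5.25],
})

# Sum the prices per flavor and name the result column explicitly.
# The label contains a space, so it is passed through a dictionary.
totals = cones.groupby('Flavor').agg(
    **{'Price sum': pd.NamedAgg(column='Price', aggfunc='sum')}
)

# The same total worked out by hand for a single category:
# keep only the chocolate rows, then take the array of prices.
cones_choc = cones[cones['Flavor'] == 'chocolate']['Price'].to_numpy()
choc_total = cones_choc.sum()

# Swapping the aggregation gives the maximum price per flavor instead.
price_max = cones.groupby('Flavor')['Price'].max()

print(totals)
print(price_max)
```

The manual filter makes the grouping explicit: `groupby` collects the prices for each flavor and then applies the chosen function to each collection.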
+ {
+ "cell_type": "code",
+ "execution_count": 29,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "<div>\n",
+ "<style scoped>\n",
+ " .dataframe tbody tr th:only-of-type {\n",
+ " vertical-align: middle;\n",
+ " }\n",
+ "\n",
+ " .dataframe tbody tr th {\n",
+ " vertical-align: top;\n",
+ " }\n",
+ "\n",
+ " .dataframe thead th {\n",
+ " text-align: right;\n",
+ " }\n",
+ "</style>\n",
+ "<table border=\"1\" class=\"dataframe\">\n",
+ " <thead>\n",
+ " <tr style=\"text-align: right;\">\n",
+ " <th></th>\n",
+ " <th>Flavor</th>\n",
+ " <th>Array of All the Prices</th>\n",
+ " <th>Sum of the Array</th>\n",
+ " <th>Max of the Array</th>\n",
+ " </tr>\n",
+ " </thead>\n",
+ " <tbody>\n",
+ " <tr>\n",
+ " <th>0</th>\n",
+ " <td>chocolate</td>\n",
+ " <td>[4.75, 6.55, 5.25]</td>\n",
+ " <td>16.55</td>\n",
+ " <td>6.55</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>1</th>\n",
+ " <td>strawberry</td>\n",
+ " <td>[3.55, 5.25]</td>\n",
+ " <td>8.80</td>\n",
+ " <td>5.25</td>\n",
+ " </tr>\n",
+ " </tbody>\n",
+ "</table>\n",
+ "</div>"
+ ],
+ "text/plain": [
+ " Flavor Array of All the Prices Sum of the Array Max of the Array\n",
+ "0 chocolate [4.75, 6.55, 5.25] 16.55 6.55\n",
+ "1 strawberry [3.55, 5.25] 8.80 5.25"
+ ]
+ },
+ "execution_count": 29,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "price_max = grouped_cones.copy()\n",
+ "\n",
+ "price_max['Max of the Array'] = np.array([max(cones_choc), max(cones_strawb)])\n",
+ "\n",
+ "price_max"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Indeed, the original call to `groupby` followed by `count()` has the same effect as using `len` as the function and then cleaning up the table."
+ ]
+ },
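As a quick check of that equivalence, the sketch below aggregates with Python's built-in `len` and compares the result with `count()`. The `cones` DataFrame is again a hand-built assumption, and `by_len`, `by_count` and `same` are illustrative names.

```python
import pandas as pd

# Hand-built copy of the ice cream data; the row order is an assumption.
cones = pd.DataFrame({
    'Flavor': ['strawberry', 'chocolate', 'chocolate', 'strawberry', 'chocolate'],
    'Price': [3.55, 4.75, 6.55, 5.25, 5.25],
})

# Apply len to the prices collected for each flavor.
by_len = cones.groupby('Flavor')['Price'].agg(len)

# Count the prices for each flavor with the built-in aggregation.
by_count = cones.groupby('Flavor')['Price'].count()

# True when both approaches give the same number for every flavor.
same = bool((by_len == by_count).all())
print(same)
```

The two agree here because no price is missing. In general `count()` skips missing values while `len` does not, so they can differ on tables with gaps.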
+ {
+ "cell_type": "code",
+ "execution_count": 30,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "<div>\n",
+ "<style scoped>\n",
+ " .dataframe tbody tr th:only-of-type {\n",
+ " vertical-align: middle;\n",
+ " }\n",
+ "\n",
+ " .dataframe tbody tr th {\n",
+ " vertical-align: top;\n",
+ " }\n",
+ "\n",
+ " .dataframe thead th {\n",
+ " text-align: right;\n",
+ " }\n",
+ "</style>\n",
+ "<table border=\"1\" class=\"dataframe\">\n",
+ " <thead>\n",
+ " <tr style=\"text-align: right;\">\n",
+ " <th></th>\n",
+ " <th>Flavor</th>\n",
+ " <th>Array of All the Prices</th>\n",
+ " <th>Sum of the Array</th>\n",
+ " <th>Length of the Array</th>\n",
+ " </tr>\n",
+ " </thead>\n",
+ " <tbody>\n",
+ " <tr>\n",
+ " <th>0</th>\n",
+ " <td>chocolate</td>\n",
+ " <td>[4.75, 6.55, 5.25]</td>\n",
+ " <td>16.55</td>\n",
+ " <td>3</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>1</th>\n",
+ " <td>strawberry</td>\n",
+ " <td>[3.55, 5.25]</td>\n",
+ " <td>8.80</td>\n",
+ " <td>2</td>\n",
+ " </tr>\n",
+ " </tbody>\n",
+ "</table>\n",
+ "</div>"
+ ],
+ "text/plain": [
+ " Flavor Array of All the Prices Sum of the Array Length of the Array\n",
+ "0 chocolate [4.75, 6.55, 5.25] 16.55 3\n",
+ "1 strawberry [3.55, 5.25] 8.80 2"
+ ]
+ },
+ "execution_count": 30,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "array_length = grouped_cones.copy()\n",
+ "\n",
+ "array_length['Length of the Array'] = np.array([len(cones_choc), len(cones_strawb)])\n",
+ "\n",
+ "array_length"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### Example: NBA Salaries\n",
+ "The table `nba` contains data on the 2015-2016 players in the National Basketball Association. We have examined these data earlier. Recall that salaries are measured in millions of dollars."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 31,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "<div>\n",
+ "<style scoped>\n",
+ " .dataframe tbody tr th:only-of-type {\n",
+ " vertical-align: middle;\n",
+ " }\n",
+ "\n",
+ " .dataframe tbody tr th {\n",
+ " vertical-align: top;\n",
+ " }\n",
+ "\n",
+ " .dataframe thead th {\n",
+ " text-align: right;\n",
+ " }\n",
+ "</style>\n",
+ "<table border=\"1\" class=\"dataframe\">\n",
+ " <thead>\n",
+ " <tr style=\"text-align: right;\">\n",
+ " <th></th>\n",
+ " <th>PLAYER</th>\n",
+ " <th>POSITION</th>\n",
+ " <th>TEAM</th>\n",
+ " <th>SALARY</th>\n",
+ " </tr>\n",
+ " </thead>\n",
+ " <tbody>\n",
+ " <tr>\n",
+ " <th>0</th>\n",
+ " <td>Paul Millsap</td>\n",
+ " <td>PF</td>\n",
+ " <td>Atlanta Hawks</td>\n",
+ " <td>18.671659</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>1</th>\n",
+ " <td>Al Horford</td>\n",
+ " <td>C</td>\n",
+ " <td>Atlanta Hawks</td>\n",
+ " <td>12.000000</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>2</th>\n",
+ " <td>Tiago Splitter</td>\n",
+ " <td>C</td>\n",
+ " <td>Atlanta Hawks</td>\n",
+ " <td>9.756250</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>3</th>\n",
+ " <td>Jeff Teague</td>\n",
+ " <td>PG</td>\n",
+ " <td>Atlanta Hawks</td>\n",
+ " <td>8.000000</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>4</th>\n",
+ " <td>Kyle Korver</td>\n",
+ " <td>SG</td>\n",
+ " <td>Atlanta Hawks</td>\n",
+ " <td>5.746479</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>...</th>\n",
+ " <td>...</td>\n",
+ " <td>...</td>\n",
+ " <td>...</td>\n",
+ " <td>...</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>412</th>\n",
+ " <td>Gary Neal</td>\n",
+ " <td>PG</td>\n",
+ " <td>Washington Wizards</td>\n",
+ " <td>2.139000</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>413</th>\n",
+ " <td>DeJuan Blair</td>\n",
+ " <td>C</td>\n",
+ " <td>Washington Wizards</td>\n",
+ " <td>2.000000</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>414</th>\n",
+ " <td>Kelly Oubre Jr.</td>\n",
+ " <td>SF</td>\n",
+ " <td>Washington Wizards</td>\n",
+ " <td>1.920240</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>415</th>\n",
+ " <td>Garrett Temple</td>\n",
+ " <td>SG</td>\n",
+ " <td>Washington Wizards</td>\n",
+ " <td>1.100602</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>416</th>\n",
+ " <td>Jarell Eddie</td>\n",
+ " <td>SG</td>\n",
+ " <td>Washington Wizards</td>\n",
+ " <td>0.561716</td>\n",
+ " </tr>\n",
+ " </tbody>\n",
+ "</table>\n",
+ "<p>417 rows × 4 columns</p>\n",
+ "</div>"
+ ],
+ "text/plain": [
+ " PLAYER POSITION TEAM SALARY\n",
+ "0 Paul Millsap PF Atlanta Hawks 18.671659\n",
+ "1 Al Horford C Atlanta Hawks 12.000000\n",
+ "2 Tiago Splitter C Atlanta Hawks 9.756250\n",
+ "3 Jeff Teague PG Atlanta Hawks 8.000000\n",
+ "4 Kyle Korver SG Atlanta Hawks 5.746479\n",
+ ".. ... ... ... ...\n",
+ "412 Gary Neal PG Washington Wizards 2.139000\n",
+ "413 DeJuan Blair C Washington Wizards 2.000000\n",
+ "414 Kelly Oubre Jr. SF Washington Wizards 1.920240\n",
+ "415 Garrett Temple SG Washington Wizards 1.100602\n",
+ "416 Jarell Eddie SG Washington Wizards 0.561716\n",
+ "**1.** How much money did each team pay for its players' salaries?\n",
+ "\n",
+ "The only columns involved are `TEAM` and `SALARY`. We have to group the rows by `TEAM` with `groupby` and then `sum` the salaries of the groups."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 32,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "<div>\n",
+ "<style scoped>\n",
+ " .dataframe tbody tr th:only-of-type {\n",
+ " vertical-align: middle;\n",
+ " }\n",
+ "\n",
+ " .dataframe tbody tr th {\n",
+ " vertical-align: top;\n",
+ " }\n",
+ "\n",
+ " .dataframe thead th {\n",
+ " text-align: right;\n",
+ " }\n",
+ "</style>\n",
+ "<table border=\"1\" class=\"dataframe\">\n",
+ " <thead>\n",
+ " <tr style=\"text-align: right;\">\n",
+ " <th></th>\n",
+ " <th>SALARY</th>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>TEAM</th>\n",
+ " <th></th>\n",
+ " </tr>\n",
+ " </thead>\n",
+ " <tbody>\n",
+ " <tr>\n",
+ " <th>Atlanta Hawks</th>\n",
+ " <td>69.573103</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>Boston Celtics</th>\n",
+ " <td>50.285499</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>Brooklyn Nets</th>\n",
+ " <td>57.306976</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>Charlotte Hornets</th>\n",
+ " <td>84.102397</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>Chicago Bulls</th>\n",
+ " <td>78.820890</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>Cleveland Cavaliers</th>\n",
+ " <td>102.312412</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>Dallas Mavericks</th>\n",
+ " <td>65.762559</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>Denver Nuggets</th>\n",
+ " <td>62.429404</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>Detroit Pistons</th>\n",
+ " <td>42.211760</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>Golden State Warriors</th>\n",
+ " <td>94.085137</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>Houston Rockets</th>\n",
+ " <td>85.285837</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>Indiana Pacers</th>\n",
+ " <td>62.695023</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>Los Angeles Clippers</th>\n",
+ " <td>66.074113</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>Los Angeles Lakers</th>\n",
+ " <td>68.607944</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>Memphis Grizzlies</th>\n",
+ " <td>93.796439</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>Miami Heat</th>\n",
+ " <td>81.528667</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>Milwaukee Bucks</th>\n",
+ " <td>52.258355</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>Minnesota Timberwolves</th>\n",
+ " <td>65.847421</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>New Orleans Pelicans</th>\n",
+ " <td>80.514606</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>New York Knicks</th>\n",
+ " <td>69.404994</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>Oklahoma City Thunder</th>\n",
+ " <td>96.832165</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>Orlando Magic</th>\n",
+ " <td>77.623940</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>Philadelphia 76ers</th>\n",
+ " <td>42.481345</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>Phoenix Suns</th>\n",
+ " <td>50.520815</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>Portland Trail Blazers</th>\n",
+ " <td>45.446878</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>Sacramento Kings</th>\n",
+ " <td>68.384890</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>San Antonio Spurs</th>\n",
+ " <td>84.652074</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>Toronto Raptors</th>\n",
+ " <td>74.672620</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>Utah Jazz</th>\n",
+ " <td>52.631878</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>Washington Wizards</th>\n",
+ " <td>90.047498</td>\n",
+ " </tr>\n",
+ " </tbody>\n",
+ "</table>\n",
+ "</div>"
+ ],
+ "text/plain": [
+ " SALARY\n",
+ "TEAM \n",
+ "Atlanta Hawks 69.573103\n",
+ "Boston Celtics 50.285499\n",
+ "Brooklyn Nets 57.306976\n",
+ "Charlotte Hornets 84.102397\n",
+ "Chicago Bulls 78.820890\n",
+ "Cleveland Cavaliers 102.312412\n",
+ "Dallas Mavericks 65.762559\n",
+ "Denver Nuggets 62.429404\n",
+ "Detroit Pistons 42.211760\n",
+ "Golden State Warriors 94.085137\n",
+ "Houston Rockets 85.285837\n",
+ "Indiana Pacers 62.695023\n",
+ "Los Angeles Clippers 66.074113\n",
+ "Los Angeles Lakers 68.607944\n",
+ "Memphis Grizzlies 93.796439\n",
+ "Miami Heat 81.528667\n",
+ "Milwaukee Bucks 52.258355\n",
+ "Minnesota Timberwolves 65.847421\n",
+ "New Orleans Pelicans 80.514606\n",
+ "New York Knicks 69.404994\n",
+ "Oklahoma City Thunder 96.832165\n",
+ "Orlando Magic 77.623940\n",
+ "Philadelphia 76ers 42.481345\n",
+ "Phoenix Suns 50.520815\n",
+ "Portland Trail Blazers 45.446878\n",
+ "Sacramento Kings 68.384890\n",
+ "San Antonio Spurs 84.652074\n",
+ "Toronto Raptors 74.672620\n",
+ "Utah Jazz 52.631878\n",
+ "Washington Wizards 90.047498"
+ ]
+ },
+ "execution_count": 32,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "teams_and_money = nba[['TEAM', 'SALARY']]\n",
+ "\n",
+ "teams_and_money.groupby('TEAM').sum()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "**2.** How many NBA players were there in each of the five positions?\n",
+ "\n",
+ "We have to classify by `POSITION`, and count. This can be achieved either by applying the `count()` method to a `groupby`, or by using the aggregation method with `aggfunc=\"count\"`."
+ "**3.** What was the average salary of the players at each of the five positions?\n",
+ "\n",
+ "This time, we have to group by `POSITION` and take the mean of the salaries. For clarity, we will work with a table of just the positions and the salaries."
+ "Center was the most highly paid position, at an average of over 6 million dollars.\n",
+ "\n",
+ "If we had not selected the two columns as our first step, `group` would not attempt to \"average\" the categorical columns in `nba`. (It is impossible to average two strings like \"Atlanta Hawks\" and \"Boston Celtics\".) It performs arithmetic only on numerical columns and leaves the rest blank."
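Questions 2 and 3 can be sketched as below. The sketch uses only the ten rows visible in the table display earlier in this section, so its numbers will not match the answers for the full 417-row `nba` table. The names `nba_sample`, `position_counts` and `mean_salary` are illustrative. Passing `numeric_only=True` is a precaution: in recent pandas versions, averaging a grouped table that still contains text columns may raise an error rather than skip them.

```python
import pandas as pd

# The ten rows shown in the display of `nba`; NOT the full table.
nba_sample = pd.DataFrame({
    'PLAYER': ['Paul Millsap', 'Al Horford', 'Tiago Splitter', 'Jeff Teague',
               'Kyle Korver', 'Gary Neal', 'DeJuan Blair', 'Kelly Oubre Jr.',
               'Garrett Temple', 'Jarell Eddie'],
    'POSITION': ['PF', 'C', 'C', 'PG', 'SG', 'PG', 'C', 'SF', 'SG', 'SG'],
    'TEAM': ['Atlanta Hawks'] * 5 + ['Washington Wizards'] * 5,
    'SALARY': [18.671659, 12.0, 9.75625, 8.0, 5.746479,
               2.139, 2.0, 1.92024, 1.100602, 0.561716],
})

# Question 2: number of players at each position, in a column named 'count'.
position_counts = nba_sample.groupby('POSITION').agg(
    count=pd.NamedAgg(column='PLAYER', aggfunc='count')
)

# Question 3: average salary at each position.
# numeric_only=True restricts the mean to numerical columns.
mean_salary = nba_sample.groupby('POSITION').mean(numeric_only=True)

print(position_counts)
print(mean_salary)
```

Selecting just `POSITION` and `SALARY` before grouping, as the text does, is an alternative to `numeric_only=True` and makes the intent clearer.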
+ "Often, data about the same individuals is maintained in more than one table. For example, one university office might have data about each student's time to completion of degree, while another has data about the student's tuition and financial aid.\n",
+ "\n",
+ "To understand the *students'* experience, it may be helpful to put the two datasets together. If the data are in two tables, each with one row per student, then we would want to put the columns together, making sure to match the rows so that each student's information remains on a single row.\n",
+ "\n",
+ "Let us do this in the context of a simple example, and then use the method with a larger dataset.\n",
+ "Each of the tables has a column that contains ice cream flavors: `cones` has the column `Flavor`, and `ratings` has the column `Kind`. The entries in these columns can be used to link the two tables.\n",
+ "\n",
+ "The method `join` creates a new table in which each cone in the `cones` table is augmented with the `Stars` information in the `ratings` table. For each cone in `cones`, `join` finds a row in `ratings` whose `Kind` matches the cone's `Flavor`.\n",
+ "\n",
+ "In this instance we are going to `join` two DataFrames by [`joining key columns on an index`](https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html#joining-key-columns-on-an-index). To implement a `join` on an index, we must first create the index we wish to use in the second DataFrame, and then tell `join` which column of the first DataFrame to match against it.\n",
+ "This will create a DataFrame that includes the `Kind` column, so the flavors are repeated. To display only the columns in which we are interested:"
+ "The new table `rated` allows us to work out the price per star, which you can think of as an informal measure of value. Low values are good – they mean that you are paying less for each rating star."
+ "Though strawberry has the lowest rating among the three flavors, the less expensive strawberry cone does well on this measure because it doesn't cost a lot per star."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "**Side note.** Does the order in which we list the two tables matter? Let's try it. As you can see, this changes the order in which the columns appear, and can potentially change the order of the rows, but it doesn't make any fundamental difference."
+ "Also note that the join will only contain information about items that appear in both tables. Let's see an example. Suppose there is a table of reviews of some ice cream cones, and we have found the average or `mean` of reviews for each flavor."
+ "Notice how the strawberry cones have disappeared. None of the reviews are for strawberry cones, so there is nothing to which the `strawberry` rows can be joined. This might be a problem, or it might not be - that depends on the analysis we are trying to perform with the joined table."
+ "(**N.B.** when using the term 'Table(s)' we are referring to DataFrames)\n",
+ "\n",
+ "We are building up a useful inventory of techniques for identifying patterns and themes in a data set by using functions already available in Python. We will now explore a core feature of the Python programming language: function definition.\n",
+ "\n",
+ "We have used functions extensively already in this text, but never defined a function of our own. The purpose of defining a function is to give a name to a computational process that may be applied multiple times. There are many situations in computing that require repeated computation. For example, it is often the case that we want to perform the same manipulation on every value in a column of a table."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Defining a Function\n",
+ "The definition of the `double` function below simply doubles a number."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 29,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Our first function definition\n",
+ "\n",
+ "def double(x):\n",
+ " \"\"\" Double x \"\"\"\n",
+ " return 2*x"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "We start any function definition by writing `def`. Here is a breakdown of the other parts (the *syntax*) of this small function:\n",
+ "\n",
+ ""
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "When we run the cell above, no particular number is doubled, and the code inside the body of `double` is not yet evaluated. In this respect, our function is analogous to a *recipe*. Each time we follow the instructions in a recipe, we need to start with ingredients. Each time we want to use our function to double a number, we need to specify a number.\n",
+ "\n",
+ "We can call `double` in exactly the same way we have called other functions. Each time we do that, the code in the body is executed, with the value of the argument given the name `x`."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 30,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "34"
+ ]
+ },
+ "execution_count": 30,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "double(17)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 31,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "-0.3"
+ ]
+ },
+ "execution_count": 31,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "double(-0.6/4)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "The two expressions above are both *call expressions*. In the second one, the value of the expression `-0.6/4` is computed and then passed as the argument named `x` to the `double` function. Each call expression results in the body of `double` being executed, but with a different value of `x`.\n",
+ "\n",
+ "The body of `double` has only a single line:\n",
+ "\n",
+ "`return 2*x`\n",
+ "\n",
+ "Executing this *`return` statement* completes execution of the `double` function's body and computes the value of the call expression.\n",
+ "\n",
+ "The argument to `double` can be any expression, as long as its value is a number. For example, it can be a name. The `double` function does not know or care how its argument is computed or stored; its only job is to execute its own body using the values of the arguments passed to it."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 32,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "84"
+ ]
+ },
+ "execution_count": 32,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "any_name = 42\n",
+ "double(any_name)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "The argument can also be any value that can be doubled. For example, a whole array of numbers can be passed as an argument to `double`, and the result will be another array."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 33,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "array([ 6, 8, 10])"
+ ]
+ },
+ "execution_count": 33,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "double(np.array([3, 4, 5]))"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "However, names that are defined inside a function, including arguments like `double`'s `x`, have only a fleeting existence. They are defined only while the function is being called, and they are only accessible inside the body of the function. We can't refer to `x` outside the body of `double`. The technical terminology is that `x` has *local scope*.\n",
+ "\n",
+ "Therefore the name `x` isn't recognized outside the body of the function, even though we have called `double` in the cells above."
+ "<ipython-input-34-6fcf9dfbd479> in <module>\n----> 1 x\n",
+ "NameError: name 'x' is not defined"
+ ]
+ }
+ ],
+ "source": [
+ "x"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "**Docstrings.** Though `double` is relatively easy to understand, many functions perform complicated tasks and are difficult to use without explanation. (You may have discovered this yourself!) Therefore, a well-composed function has a name that evokes its behavior, as well as documentation. In Python, this is called a *docstring* — a description of its behavior and expectations about its arguments. The docstring can also show example calls to the function, where the call is preceded by `>>>`.\n",
+ "\n",
+ "A docstring can be any string, as long as it is the first thing in a function's body. Docstrings are typically defined using triple quotation marks at the start and end, which allows a string to span multiple lines. The first line is conventionally a complete but short description of the function, while following lines provide further guidance to future users of the function.\n",
+ "\n",
+ "Here is a definition of a function called `percent` that takes two arguments. The definition includes a docstring."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 35,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# A function with more than one argument\n",
+ "\n",
+ "def percent(x, total):\n",
+ " \"\"\"Convert x to a percentage of total.\n",
+ " \n",
+ " More precisely, this function divides x by total,\n",
+ " multiplies the result by 100, and rounds the result\n",
+ " to two decimal places.\n",
+ " \n",
+ " >>> percent(4, 16)\n",
+ " 25.0\n",
+ " >>> percent(1, 6)\n",
+ " 16.67\n",
+ " \"\"\"\n",
+ " return round((x/total)*100, 2)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 36,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "16.5"
+ ]
+ },
+ "execution_count": 36,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "percent(33, 200)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Contrast the function `percent` defined above with the function `percents` defined below. The latter takes an array as its argument, and converts all the numbers in the array to percents out of the total of the values in the array. The percents are all rounded to two decimal places, this time replacing `round` by `np.round` because the argument is an array and not a number."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 37,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "def percents(counts):\n",
+ " \"\"\"Convert the values in counts to percents out of the total of counts.\"\"\"\n",
+ " total = counts.sum()\n",
+ " return np.round((counts/total)*100, 2)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "The function `percents` returns an array of percents that add up to 100 apart from rounding."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 38,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "array([33.33, 47.62, 19.05])"
+ ]
+ },
+ "execution_count": 38,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "some_array = np.array([7, 10, 4])\n",
+ "percents(some_array)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "It is helpful to understand the steps Python takes to execute a function. To facilitate this, we have put a function definition and a call to that function in the same cell below."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 39,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "The biggest difference is 5\n"
+ ]
+ }
+ ],
+ "source": [
+ "def biggest_difference(array_x):\n",
+ " \"\"\"Find the biggest difference in absolute value between two adjacent elements of array_x.\"\"\"\n",
+ "There can be multiple ways to generalize an expression or block of code, and so a function can take multiple arguments that each determine different aspects of the result. For example, the `percents` function we defined previously rounded to two decimal places every time. The following two-argument definition allows different calls to round to different amounts."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 40,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Rounded to 1 decimal place: [28.6 14.3 57.1]\n",
+ "Rounded to 2 decimal places: [28.57 14.29 57.14]\n",
+ "Rounded to 3 decimal places: [28.571 14.286 57.143]\n"
+ ]
+ }
+ ],
+ "source": [
+ "def percents(counts, decimal_places):\n",
+ " \"\"\"Convert the values in counts to percents out of the total of counts.\"\"\"\n",
+ "print(\"Rounded to 1 decimal place: \", percents(parts, 1))\n",
+ "print(\"Rounded to 2 decimal places:\", percents(parts, 2))\n",
+ "print(\"Rounded to 3 decimal places:\", percents(parts, 3))"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "The flexibility of this new definition comes at a small price: each time the function is called, the number of decimal places must be specified. Default argument values allow a function to be called with a variable number of arguments; any argument that isn't specified in the call expression is given its default value, which is stated in the first line of the `def` statement. For example, in this final definition of `percents`, the optional argument `decimal_places` is given a default value of 2."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 41,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Rounded to 1 decimal place: [28.6 14.3 57.1]\n",
+ "Rounded to the default number of decimal places: [28.57 14.29 57.14]\n"
+ ]
+ }
+ ],
+ "source": [
+ "def percents(counts, decimal_places=2):\n",
+ " \"\"\"Convert the values in counts to percents out of the total of counts.\"\"\"\n",
+ "print(\"Rounded to 1 decimal place:\", percents(parts, 1))\n",
+ "print(\"Rounded to the default number of decimal places:\", percents(parts))"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### Note: Methods\n",
+ "Functions are called by placing argument expressions in parentheses after the function name. Any function that is defined in isolation is called in this way. You have also seen examples of methods, which are like functions but are called using dot notation, such as `some_table.sort_values(some_label)`. The functions that you define will always be called using the function name first, passing in all of the arguments. \n",
+ "\n",
+ "**N.B.** Remember that a table is another name for a DataFrame."
+ "In many situations, actions and results depend on a specific set of conditions being satisfied. For example, individuals in randomized controlled trials receive the treatment if they have been assigned to the treatment group. A gambler makes money if she wins her bet. \n",
+ "\n",
+ "In this section we will learn how to describe such situations using code. A *conditional statement* is a multi-line statement that allows Python to choose among different alternatives based on the truth value of an expression. While conditional statements can appear anywhere, they appear most often within the body of a function in order to express alternative behavior depending on argument values.\n",
+ "\n",
+ "A conditional statement always begins with an `if` header, which is a single line followed by an indented body. The body is only executed if the expression directly following `if` (called the *if expression*) evaluates to a true value. If the *if expression* evaluates to a false value, then the body of the `if` is skipped.\n",
+ "\n",
+ "Let us start defining a function that returns the sign of a number."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 2,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "def sign(x):\n",
+ " \n",
+ " if x > 0:\n",
+ " return 'Positive'"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 3,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "'Positive'"
+ ]
+ },
+ "execution_count": 3,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "sign(3)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "This function returns the correct sign if the input is a positive number. But if the input is not a positive number, then the *if expression* evaluates to a false value, and so the `return` statement is skipped. The function call then returns `None`, which the notebook does not display."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 4,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "sign(-3)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "So let us refine our function to return `Negative` if the input is a negative number. We can do this by adding an `elif` clause, where `elif` is Python's shorthand for the phrase \"else, if\"."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 5,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "def sign(x):\n",
+ " \n",
+ " if x > 0:\n",
+ " return 'Positive'\n",
+ " \n",
+ " elif x < 0:\n",
+ " return 'Negative'"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Now `sign` returns the correct answer when the input is -3:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 6,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "'Negative'"
+ ]
+ },
+ "execution_count": 6,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "sign(-3)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "What if the input is 0? To deal with this case, we can add another `elif` clause:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 7,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "def sign(x):\n",
+ " \n",
+ " if x > 0:\n",
+ " return 'Positive'\n",
+ " \n",
+ " elif x < 0:\n",
+ " return 'Negative'\n",
+ " \n",
+ " elif x == 0:\n",
+ " return 'Neither positive nor negative'"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 8,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "'Neither positive nor negative'"
+ ]
+ },
+ "execution_count": 8,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "sign(0)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Equivalently, we can replace the final `elif` clause with an `else` clause, whose body will be executed only if all the previous comparisons are false; that is, if the input value is equal to 0."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 9,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "def sign(x):\n",
+ " \n",
+ " if x > 0:\n",
+ " return 'Positive'\n",
+ " \n",
+ " elif x < 0:\n",
+ " return 'Negative'\n",
+ " \n",
+ " else:\n",
+ " return 'Neither positive nor negative'"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 10,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "'Neither positive nor negative'"
+ ]
+ },
+ "execution_count": 10,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "sign(0)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## The General Form\n",
+ "A conditional statement can also have multiple clauses with multiple bodies, and only one of those bodies can ever be executed. The general format of a multi-clause conditional statement appears below.\n",
+ "\n",
+ " if <if expression>:\n",
+ " <if body>\n",
+ " elif <elif expression 0>:\n",
+ " <elif body 0>\n",
+ " elif <elif expression 1>:\n",
+ " <elif body 1>\n",
+ " ...\n",
+ " else:\n",
+ " <else body>\n",
+ " \n",
+ "There is always exactly one `if` clause, but there can be any number of `elif` clauses. Python will evaluate the `if` and `elif` expressions in the headers in order until one is found that is a true value, then execute the corresponding body. The `else` clause is optional. When an `else` header is provided, its *else body* is executed only if none of the header expressions of the previous clauses are true. The `else` clause must always come at the end (or not at all)."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### Example: Betting on a Die\n",
+ "Suppose I bet on a roll of a fair die. The rules of the game:\n",
+ "\n",
+ "- If the die shows 1 spot or 2 spots, I lose a dollar.\n",
+ "- If the die shows 3 spots or 4 spots, I neither lose money nor gain money.\n",
+ "- If the die shows 5 spots or 6 spots, I gain a dollar.\n",
+ "\n",
+ "We will now use conditional statements to define a function `one_bet` that takes the number of spots on the roll and returns my net gain."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 11,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "def one_bet(x):\n",
+ " \"\"\"Returns my net gain if the die shows x spots\"\"\"\n",
+ " if x <= 2:\n",
+ " return -1\n",
+ " elif x <= 4:\n",
+ " return 0\n",
+ " elif x <= 6:\n",
+ " return 1"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Let's check that the function does the right thing for each different number of spots."
+ "As a review of how conditional statements work, let's see what `one_bet` does when the input is 3.\n",
+ "\n",
+ "- First it evaluates the `if` expression, which is `3 <= 2`, which is `False`. So `one_bet` doesn't execute the `if` body.\n",
+ "- Then it evaluates the first `elif` expression, which is `3 <= 4`, which is `True`. So `one_bet` executes the first `elif` body and returns 0.\n",
+ "- Once the body has been executed, the process is complete. The next `elif` expression is not evaluated.\n",
+ "\n",
+ "If for some reason we use an input greater than 6, then the `if` expression evaluates to `False`, as do both of the `elif` expressions. So `one_bet` executes neither the `if` body nor either of the two `elif` bodies, and the call below returns `None`, which the notebook does not display."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 13,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "one_bet(17)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "To play the game based on one roll of a die, you can use `np.random.choice` to generate the number of spots and then use that as the argument to `one_bet`. Run the cell a few times to see how the output changes."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 14,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "1"
+ ]
+ },
+ "execution_count": 14,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "one_bet(np.random.choice(np.arange(1, 7)))"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "At this point it is natural to want to collect the results of all the bets so that we can analyze them. In the next section we develop a way to do this without running the cell over and over again."
+ "In the previous chapters we developed skills needed to make insightful *descriptions* of data. Data scientists also have to be able to understand **randomness**. For example, they have to be able to assign individuals to treatment and control groups at random, and then try to say whether any observed differences in the outcomes of the two groups are simply due to the random assignment or genuinely due to the treatment.\n",
+ "\n",
+ "In this chapter, we begin our analysis of randomness. To start off, we will use Python to make choices at random. In `numpy` there is a sub-module called `random` that contains many functions that involve random selection. One of these functions is called `choice`. It picks one item at random from an array, and it is equally likely to pick any of the items. The function call is `np.random.choice(array_name)`, where `array_name` is the name of the array from which to make the choice.\n",
+ "\n",
+ "Thus the following code evaluates to `treatment` with chance 50%, and `control` with chance 50%."
+ "The big difference between the code above and all the other code we have run thus far is that the code above doesn't always return the same value. It can return either `treatment` or `control`, and we don't know ahead of time which one it will pick. We can repeat the process by providing a second argument: the number of repetitions."
+ "A fundamental question about random events is whether or not they occur. For example:\n",
+ "\n",
+ "- Did an individual get assigned to the treatment group, or not?\n",
+ "- Is a gambler going to win money, or not?\n",
+ "- Has a poll made an accurate prediction, or not?\n",
+ "\n",
+ "Once the event has occurred, you can answer \"yes\" or \"no\" to all these questions. In programming, it is conventional to do this by labeling statements as True or False. For example, if an individual did get assigned to the treatment group, then the statement, \"The individual was assigned to the treatment group\" would be `True`. If not, it would be `False`."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Booleans and Comparison\n",
+ "\n",
+ "In Python, Boolean values, named for the logician [George Boole](https://en.wikipedia.org/wiki/George_Boole), represent truth and take only two possible values: `True` and `False`. Whether problems involve randomness or not, Boolean values most often arise from comparison operators. Python includes a variety of operators that compare values. For example, `3` is larger than `1 + 1`."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 4,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "True"
+ ]
+ },
+ "execution_count": 4,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "3 > 1 + 1"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "The value `True` indicates that the comparison is valid; Python has confirmed this simple fact about the relationship between `3` and `1+1`. The full set of common comparison operators is listed below.\n",
+ "\n",
+ "| Comparison | Operator | True example | False example |\n",
+ "Notice the two equal signs `==` in the comparison to determine equality. This is necessary because Python already uses `=` to mean assignment to a name, as we have seen. It can't use the same symbol for a different purpose. Thus if you want to check whether 5 is equal to 10/2, you have to be careful: `5 = 10/2` raises a `SyntaxError` because Python assumes you are trying to assign the value of the expression 10/2 to a name that is the numeral 5. Instead, you must use `5 == 10/2`, which evaluates to `True`."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 5,
+ "metadata": {
+ "tags": [
+ "raises-exception"
+ ]
+ },
+ "outputs": [
+ {
+ "ename": "SyntaxError",
+ "evalue": "cannot assign to literal (<ipython-input-5-e8c755f5e450>, line 1)",
+ "output_type": "error",
+ "traceback": [
+ "\u001b[0;36m File \u001b[0;32m\"<ipython-input-5-e8c755f5e450>\"\u001b[0;36m, line \u001b[0;32m1\u001b[0m\n\u001b[0;31m 5 = 10/2\u001b[0m\n\u001b[0m ^\u001b[0m\n\u001b[0;31mSyntaxError\u001b[0m\u001b[0;31m:\u001b[0m cannot assign to literal\n"
+ ]
+ }
+ ],
+ "source": [
+ "5 = 10/2"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 6,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "True"
+ ]
+ },
+ "execution_count": 6,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "5 == 10/2"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "An expression can contain multiple comparisons, and they all must hold in order for the whole expression to be `True`. For example, we can express that `1+1` is between `1` and `3` using the following expression."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 7,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "True"
+ ]
+ },
+ "execution_count": 7,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "1 < 1 + 1 < 3"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "The average of two numbers is always between the smaller number and the larger number. We express this relationship for the numbers `x` and `y` below. You can try different values of `x` and `y` to confirm this relationship."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 8,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "True"
+ ]
+ },
+ "execution_count": 8,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "x = 12\n",
+ "y = 5\n",
+ "min(x, y) <= (x+y)/2 <= max(x, y)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### Comparing Strings\n",
+ "\n",
+ "Strings can also be compared, and their order is alphabetical, except that every uppercase letter comes before every lowercase letter. A shorter string is less than a longer string that begins with the shorter string."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 9,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "True"
+ ]
+ },
+ "execution_count": 9,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "'Dog' > 'Catastrophe' > 'Cat'"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Let's return to random selection. Recall the array `two_groups`, which consists of just two elements, `treatment` and `control`. To see whether a randomly assigned individual went to the treatment group, you can use a comparison:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 10,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "True"
+ ]
+ },
+ "execution_count": 10,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "np.random.choice(two_groups) == 'treatment'"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "As before, the random choice will not always be the same, so the result of the comparison won't always be the same either. It will depend on whether `treatment` or `control` was chosen. With any cell that involves random selection, it is a good idea to run the cell several times to get a sense of the variability in the result."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### Comparing an Array and a Value\n",
+ "Recall that we can perform arithmetic operations on many numbers in an array at once. For example, `make_array(0, 5, 2)*2` is equivalent to `make_array(0, 10, 4)`. In similar fashion, if we compare an array and one value, each element of the array is compared to that value, and the comparison evaluates to an array of Booleans."