{ "cells": [ { "cell_type": "code", "execution_count": 1, "metadata": { "tags": [ "remove_input" ] }, "outputs": [], "source": [ "path_data = '../../../data/'\n", "\n", "import numpy as np\n", "import pandas as pd\n", "\n", "import matplotlib.pyplot as plt\n", "plt.style.use('fivethirtyeight')\n", "%matplotlib inline\n", "\n", "import warnings\n", "warnings.filterwarnings('ignore')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Sampling and Empirical Distributions ###\n", "An important part of data science consists of making conclusions based on the data in random samples. In order to correctly interpret their results, data scientists have to first understand exactly what random samples are.\n", "\n", "In this chapter we will take a more careful look at sampling, with special attention to the properties of large random samples. \n", "\n", "Let's start by drawing some samples. Our examples are based on the top_movies.csv data set." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
TitleStudioGrossGross (Adjusted)YearRow Index
0Star Wars: The Force AwakensBuena Vista (Disney)90672341890672340020150
1AvatarFox76050762584612080020091
2TitanicParamount658672302117862790019972
3Jurassic WorldUniversal65227062568772800020153
4Marvel's The AvengersBuena Vista (Disney)62335791066886660020124
\n", "
" ], "text/plain": [ " Title Studio Gross \\\n", "0 Star Wars: The Force Awakens Buena Vista (Disney) 906723418 \n", "1 Avatar Fox 760507625 \n", "2 Titanic Paramount 658672302 \n", "3 Jurassic World Universal 652270625 \n", "4 Marvel's The Avengers Buena Vista (Disney) 623357910 \n", "\n", " Gross (Adjusted) Year Row Index \n", "0 906723400 2015 0 \n", "1 846120800 2009 1 \n", "2 1178627900 1997 2 \n", "3 687728000 2015 3 \n", "4 668866600 2012 4 " ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "top_raw = pd.read_csv(path_data + 'top_movies.csv')\n", "\n", "top1 = top_raw.copy()\n", "\n", "top1['Row Index'] = np.arange(len(top1))\n", "\n", "top1.head(5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Column Position ###\n", "Notice that column we have created 'Row Index' is positioned last in the df, to make life easier we would like this column to be first in the df. There are several ways in which we can move the position of this column e.g. we could [`pop`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.pop.html) the column out of the df then re-insert it in the desired position or we could [`drop`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop.html) the column then re-insert into the df. 
Yet another method would be to [`insert`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.insert.html) the column 'Row Index' at the desired position in the df.\n", "\n", "[Pandas 'pop'](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.pop.html)\n", "\n", "[Pandas 'drop'](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop.html)\n", "\n", "[Pandas 'insert'](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.insert.html)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Insert ###" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Row IndexTitleStudioGrossGross (Adjusted)Year
00Star Wars: The Force AwakensBuena Vista (Disney)9067234189067234002015
11AvatarFox7605076258461208002009
22TitanicParamount65867230211786279001997
33Jurassic WorldUniversal6522706256877280002015
44Marvel's The AvengersBuena Vista (Disney)6233579106688666002012
.....................
195195The Caine MutinyColumbia217500003861735001954
196196The Bells of St. Mary'sRKO213333335458824001945
197197Duel in the SunSelz.204081634438775001946
198198Sergeant YorkWarner Bros.163618854186718001941
199199The Four Horsemen of the ApocalypseMPC91836733994898001921
\n", "

200 rows × 6 columns

\n", "
" ], "text/plain": [ " Row Index Title Studio \\\n", "0 0 Star Wars: The Force Awakens Buena Vista (Disney) \n", "1 1 Avatar Fox \n", "2 2 Titanic Paramount \n", "3 3 Jurassic World Universal \n", "4 4 Marvel's The Avengers Buena Vista (Disney) \n", ".. ... ... ... \n", "195 195 The Caine Mutiny Columbia \n", "196 196 The Bells of St. Mary's RKO \n", "197 197 Duel in the Sun Selz. \n", "198 198 Sergeant York Warner Bros. \n", "199 199 The Four Horsemen of the Apocalypse MPC \n", "\n", " Gross Gross (Adjusted) Year \n", "0 906723418 906723400 2015 \n", "1 760507625 846120800 2009 \n", "2 658672302 1178627900 1997 \n", "3 652270625 687728000 2015 \n", "4 623357910 668866600 2012 \n", ".. ... ... ... \n", "195 21750000 386173500 1954 \n", "196 21333333 545882400 1945 \n", "197 20408163 443877500 1946 \n", "198 16361885 418671800 1941 \n", "199 9183673 399489800 1921 \n", "\n", "[200 rows x 6 columns]" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "top2 = top1.drop(columns=['Row Index'])\n", "\n", "top2.insert(0, 'Row Index', np.arange(len(top2)))\n", "\n", "top2" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Rename Index ###\n", "Rather than creating a new column we can rename the existing axis, howeverby doing this we must remember that 'Row Index' is the actual df index and not simple 'column'" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Row IndexTitleStudioGrossGross (Adjusted)Year
0Star Wars: The Force AwakensBuena Vista (Disney)9067234189067234002015
1AvatarFox7605076258461208002009
2TitanicParamount65867230211786279001997
3Jurassic WorldUniversal6522706256877280002015
4Marvel's The AvengersBuena Vista (Disney)6233579106688666002012
..................
195The Caine MutinyColumbia217500003861735001954
196The Bells of St. Mary'sRKO213333335458824001945
197Duel in the SunSelz.204081634438775001946
198Sergeant YorkWarner Bros.163618854186718001941
199The Four Horsemen of the ApocalypseMPC91836733994898001921
\n", "

200 rows × 5 columns

\n", "
" ], "text/plain": [ "Row Index Title Studio \\\n", "0 Star Wars: The Force Awakens Buena Vista (Disney) \n", "1 Avatar Fox \n", "2 Titanic Paramount \n", "3 Jurassic World Universal \n", "4 Marvel's The Avengers Buena Vista (Disney) \n", ".. ... ... \n", "195 The Caine Mutiny Columbia \n", "196 The Bells of St. Mary's RKO \n", "197 Duel in the Sun Selz. \n", "198 Sergeant York Warner Bros. \n", "199 The Four Horsemen of the Apocalypse MPC \n", "\n", "Row Index Gross Gross (Adjusted) Year \n", "0 906723418 906723400 2015 \n", "1 760507625 846120800 2009 \n", "2 658672302 1178627900 1997 \n", "3 652270625 687728000 2015 \n", "4 623357910 668866600 2012 \n", ".. ... ... ... \n", "195 21750000 386173500 1954 \n", "196 21333333 545882400 1945 \n", "197 20408163 443877500 1946 \n", "198 16361885 418671800 1941 \n", "199 9183673 399489800 1921 \n", "\n", "[200 rows x 5 columns]" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "top = top1.drop(columns=['Row Index'])\n", "\n", "top = top.rename_axis('Row Index', axis='columns')\n", "\n", "top" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Number Formatting ###\n", "Before going on to process the data we may wish to adjust the format of data (as we have previously). 
To achieve this we can employ Pandas 'Display Values' which allows us to `format` an entire df or to specific columns.\n", "\n", "[Pandas format](https://pandas.pydata.org/pandas-docs/stable/user_guide/style.html#Finer-Control:-Display-Values)" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Row Index Title Studio Gross Gross (Adjusted) Year
0Star Wars: The Force AwakensBuena Vista (Disney)906,723,418906,723,4002015
1AvatarFox760,507,625846,120,8002009
2TitanicParamount658,672,3021,178,627,9001997
3Jurassic WorldUniversal652,270,625687,728,0002015
4Marvel's The AvengersBuena Vista (Disney)623,357,910668,866,6002012
5The Dark KnightWarner Bros.534,858,444647,761,6002008
6Star Wars: Episode I - The Phantom MenaceFox474,544,677785,715,0001999
7Star WarsFox460,998,0071,549,640,5001977
8Avengers: Age of UltronBuena Vista (Disney)459,005,868465,684,2002015
9The Dark Knight RisesWarner Bros.448,139,099500,961,7002012
" ], "text/plain": [ "" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "top.head(10).style.format({'Gross': \"{:,}\", 'Gross (Adjusted)': '{:,}'})" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Sampling Rows of a Table ###\n", "Each row of a data table represents an individual; in `top`, each individual is a movie. Sampling individuals can thus be achieved by sampling the rows of a table.\n", "\n", "The contents of a row are the values of different variables measured on the same individual. So the contents of the sampled rows form samples of values of each of the variables." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Deterministic Samples ###\n", "\n", "When you simply specify which elements of a set you want to choose, without any chances involved, you create a ***deterministic*** *sample*.\n", "\n", "You have done this many times, for example by using [`df.iloc`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.iloc.html) (**i**ndex **loc**ation and the df index values `[ ]`:\n", "\n", "[Pandas iloc](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.iloc.html)" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Row IndexTitleStudioGrossGross (Adjusted)Year
3Jurassic WorldUniversal6522706256877280002015
18Spider-ManSony4037063756045173002002
100Gone with the WindMGM19867645917577882001939
\n", "
" ], "text/plain": [ "Row Index Title Studio Gross Gross (Adjusted) Year\n", "3 Jurassic World Universal 652270625 687728000 2015\n", "18 Spider-Man Sony 403706375 604517300 2002\n", "100 Gone with the Wind MGM 198676459 1757788200 1939" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "top.iloc[np.array([3, 18, 100])]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can also use Pandas `contains` as a conditional operator:\n", "\n", "[Pandas where](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.contains.html)" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Row IndexTitleStudioGrossGross (Adjusted)Year
22Harry Potter and the Deathly Hallows Part 2Warner Bros.3810112194175122002011
43Harry Potter and the Sorcerer's StoneWarner Bros.3175755504864429002001
54Harry Potter and the Half-Blood PrinceWarner Bros.3019591973520988002009
59Harry Potter and the Order of the PhoenixWarner Bros.2920047383692502002007
62Harry Potter and the Goblet of FireWarner Bros.2900130363930248002005
69Harry Potter and the Chamber of SecretsWarner Bros.2619884823907681002002
76Harry Potter and the Prisoner of AzkabanWarner Bros.2495410693495986002004
\n", "
" ], "text/plain": [ "Row Index Title Studio \\\n", "22 Harry Potter and the Deathly Hallows Part 2 Warner Bros. \n", "43 Harry Potter and the Sorcerer's Stone Warner Bros. \n", "54 Harry Potter and the Half-Blood Prince Warner Bros. \n", "59 Harry Potter and the Order of the Phoenix Warner Bros. \n", "62 Harry Potter and the Goblet of Fire Warner Bros. \n", "69 Harry Potter and the Chamber of Secrets Warner Bros. \n", "76 Harry Potter and the Prisoner of Azkaban Warner Bros. \n", "\n", "Row Index Gross Gross (Adjusted) Year \n", "22 381011219 417512200 2011 \n", "43 317575550 486442900 2001 \n", "54 301959197 352098800 2009 \n", "59 292004738 369250200 2007 \n", "62 290013036 393024800 2005 \n", "69 261988482 390768100 2002 \n", "76 249541069 349598600 2004 " ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "top[top['Title'].str.contains('Harry Potter')]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "While these are samples, they are not random samples. They don't involve chance." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Probability Samples\n", "------------------" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For describing random samples, some terminology will be helpful.\n", "\n", "A *population* is the set of all elements from whom a sample will be drawn.\n", "\n", "A *probability sample* is one for which it is possible to calculate, before the sample is drawn, the chance with which any subset of elements will enter the sample.\n", "\n", "In a probability sample, all elements need not have the same chance of being chosen. 
\n", "\n", "### A Random Sampling Scheme ###\n", "\n", "For example, suppose you choose two people from a population that consists of three people A, B, and C, according to the following scheme:\n", "\n", "- Person A is chosen with probability 1.\n", "- One of Persons B or C is chosen according to the toss of a coin: if the coin lands heads, you choose B, and if it lands tails you choose C.\n", "\n", "This is a probability sample of size 2. Here are the chances of entry for all non-empty subsets:\n", "\n", "    A: 1 \n", "    B: 1/2\n", "    C: 1/2\n", "    AB: 1/2\n", "    AC: 1/2\n", "    BC: 0\n", "    ABC: 0\n", "\n", "Person A has a higher chance of being selected than Persons B or C; indeed, Person A is certain to be selected. Since these differences are known and quantified, they can be taken into account when working with the sample. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### A Systematic Sample ###\n", "\n", "Imagine all the elements of the population listed in a sequence. One method of sampling starts by choosing a random position early in the list, and then evenly spaced positions after that. The sample consists of the elements in those positions. Such a sample is called a *systematic sample*. \n", "\n", "Here we will choose a systematic sample of the rows of `top`. We will start by picking one of the first 10 rows at random, and then we will use the `take` method to pick every 10th row after that. \n", "\n", "[Pandas take](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.take.html)" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Row IndexTitleStudioGrossGross (Adjusted)Year
2TitanicParamount65867230211786279001997
12The Hunger Games: Catching FireLionsgate4246680474446974002013
22Harry Potter and the Deathly Hallows Part 2Warner Bros.3810112194175122002011
32American SniperWarner Bros.3501263723747960002014
42Iron ManParamount3184121013858081002008
52SkyfallSony3043602773292254002012
62Harry Potter and the Goblet of FireWarner Bros.2900130363930248002005
72JawsUniversal26000000011142857001975
82TwisterWarner Bros.2417215244757867001996
92GhostParamount2176313064477474001990
102Toy StoryBuena Vista (Disney)1917962333816544001995
112Pretty WomanBuena Vista (Disney)1784062683669349001990
122Batman ReturnsWarner Bros.1628316983413580001992
132101 DalmatiansDisney1448800148692801001961
142On Golden PondUniversal1192854323530837001981
152Kramer Vs. KramerColumbia1062600003742761001979
162Cinderella (1950)Disney931411495470502001950
172My Fair LadyWarner Bros.720000005220000001964
182GoldfingerUA510810625768100001964
192The Bridge on the River KwaiColumbia272000004732800001957
\n", "
" ], "text/plain": [ "Row Index Title Studio \\\n", "2 Titanic Paramount \n", "12 The Hunger Games: Catching Fire Lionsgate \n", "22 Harry Potter and the Deathly Hallows Part 2 Warner Bros. \n", "32 American Sniper Warner Bros. \n", "42 Iron Man Paramount \n", "52 Skyfall Sony \n", "62 Harry Potter and the Goblet of Fire Warner Bros. \n", "72 Jaws Universal \n", "82 Twister Warner Bros. \n", "92 Ghost Paramount \n", "102 Toy Story Buena Vista (Disney) \n", "112 Pretty Woman Buena Vista (Disney) \n", "122 Batman Returns Warner Bros. \n", "132 101 Dalmatians Disney \n", "142 On Golden Pond Universal \n", "152 Kramer Vs. Kramer Columbia \n", "162 Cinderella (1950) Disney \n", "172 My Fair Lady Warner Bros. \n", "182 Goldfinger UA \n", "192 The Bridge on the River Kwai Columbia \n", "\n", "Row Index Gross Gross (Adjusted) Year \n", "2 658672302 1178627900 1997 \n", "12 424668047 444697400 2013 \n", "22 381011219 417512200 2011 \n", "32 350126372 374796000 2014 \n", "42 318412101 385808100 2008 \n", "52 304360277 329225400 2012 \n", "62 290013036 393024800 2005 \n", "72 260000000 1114285700 1975 \n", "82 241721524 475786700 1996 \n", "92 217631306 447747400 1990 \n", "102 191796233 381654400 1995 \n", "112 178406268 366934900 1990 \n", "122 162831698 341358000 1992 \n", "132 144880014 869280100 1961 \n", "142 119285432 353083700 1981 \n", "152 106260000 374276100 1979 \n", "162 93141149 547050200 1950 \n", "172 72000000 522000000 1964 \n", "182 51081062 576810000 1964 \n", "192 27200000 473280000 1957 " ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "\"\"\"Choose a random start among rows 0 through 9;\n", "then take every 10th row.\"\"\"\n", "\n", "start = np.random.choice(np.arange(10))\n", "top.take(np.arange(start, len(top), 10))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Run the cell a few times to see how the output varies. \n", "\n", "This systematic sample is a probability sample. 
In this scheme, all rows have chance $1/10$ of being chosen. For example, Row 23 is chosen if and only if Row 3 is chosen, and the chance of that is $1/10$. \n", "\n", "But not all subsets have the same chance of being chosen. Because the selected rows are evenly spaced, most subsets of rows have no chance of being chosen. The only subsets that are possible are those that consist of rows all separated by multiples of 10. Any of those subsets is selected with chance 1/10. Other subsets, like the subset containing the first 11 rows of the table, are selected with chance 0." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Random Samples Drawn With or Without Replacement ###\n", "In this course, we will mostly deal with the two most straightforward methods of sampling. \n", "\n", "The first is random sampling with replacement, which (as we have seen earlier) is the default behavior of `np.random.choice` when it samples from an array. \n", "\n", "The other, called a \"simple random sample\", is a sample drawn at random *without* replacement. Sampled individuals are not replaced in the population before the next individual is drawn. This is the kind of sampling that happens when you deal a hand from a deck of cards, for example. \n", "\n", "In this chapter, we will use simulation to study the behavior of large samples drawn at random with or without replacement. \n", "\n", "[Numpy random.choice](https://het.as.utexas.edu/HET/Software/Numpy/reference/generated/numpy.random.choice.html)" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "Drawing a random sample requires care and precision. It is not haphazard, even though that is a colloquial meaning of the word \"random\". If you stand at a street corner and take as your sample the first ten people who pass by, you might think you're sampling at random because you didn't choose who walked by. But it's not a random sample – it's a *sample of convenience*. 
You didn't know ahead of time the probability of each person entering the sample; perhaps you hadn't even specified exactly who was in the population." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "anaconda-cloud": {}, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.12" } }, "nbformat": 4, "nbformat_minor": 1 }