{
"cells": [
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"tags": [
"remove_input"
]
},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"\n",
"np.set_printoptions(threshold=50)\n",
"path_data = '../../../data/'"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# DataFrames\n",
"\n",
"DataFrames (df's) are a fundamental object type for representing data sets. A df can be viewed in two ways:\n",
"* a sequence of named columns that each describe a single aspect of all entries in a data set, or\n",
"* a sequence of rows that each contain all information about a single entry in a data set.\n",
"\n",
"In order to use a DataFrame, import all of the module called `pandas`, by convention this is usually imported and as `pd`."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Empty tables can be created using the `pd.DataFrame()` function. An empty table is usefuly because it can be extended to contain new rows and columns."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" A | \n",
" B | \n",
" C | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
"
\n",
" \n",
" 1 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
"
\n",
" \n",
" 2 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
"
\n",
" \n",
" 3 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
"
\n",
" \n",
" 4 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" A B C\n",
"0 0 0 0\n",
"1 0 0 0\n",
"2 0 0 0\n",
"3 0 0 0\n",
"4 0 0 0"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.DataFrame(columns=['A', 'B', 'C'], index=(0,1,2,3,4)).fillna(0)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A dictionary is used to construct a new table with labeled columns. Each column of a df is an array.\n",
"\n",
"Below, we begin each example with an empty table that has no columns. \n",
"\n",
"Notice: \n",
"- the column heading is supplied as a `key` with the corresponding column content added as a`value` in a `{key:value}` dictionary\n",
"- when using a list as a data source for the `np.array()` function the list must be placed within square or `hard` brackets. \n",
"- an `index` has been added automatically"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Number of petals | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 8 | \n",
"
\n",
" \n",
" 1 | \n",
" 34 | \n",
"
\n",
" \n",
" 2 | \n",
" 5 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Number of petals\n",
"0 8\n",
"1 34\n",
"2 5"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.DataFrame({'Number of petals': np.array([8, 34, 5])})"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To add two (or more) new columns, provide the label and array for each column. All columns must have the same length, or an error will occur."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Number of petals | \n",
" Name | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 8 | \n",
" lotus | \n",
"
\n",
" \n",
" 1 | \n",
" 34 | \n",
" sunflower | \n",
"
\n",
" \n",
" 2 | \n",
" 5 | \n",
" rose | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Number of petals Name\n",
"0 8 lotus\n",
"1 34 sunflower\n",
"2 5 rose"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.DataFrame({\n",
" 'Number of petals': np.array([8, 34, 5]),\n",
" 'Name':np.array(['lotus', 'sunflower', 'rose'])\n",
"})"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can give this table a name, and then extend the table with another column."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Number of petals | \n",
" Name | \n",
" Color | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 8 | \n",
" lotus | \n",
" {pink, red, yellow} | \n",
"
\n",
" \n",
" 1 | \n",
" 34 | \n",
" sunflower | \n",
" {pink, red, yellow} | \n",
"
\n",
" \n",
" 2 | \n",
" 5 | \n",
" rose | \n",
" {pink, red, yellow} | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Number of petals Name Color\n",
"0 8 lotus {pink, red, yellow}\n",
"1 34 sunflower {pink, red, yellow}\n",
"2 5 rose {pink, red, yellow}"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"flowers = pd.DataFrame({'Number of petals':np.array([8, 34, 5]),'Name':np.array(['lotus', 'sunflower', 'rose'])})\n",
"\n",
"flowers_two_col = flowers.copy()\n",
"\n",
"flowers['Color'] = np.array({'pink', 'yellow', 'red'})\n",
"\n",
" \n",
"flowers"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"When a new column is added to a Dataframe a new DatFrame is **not** created, so the original DataFrame is affected. For example, the original DatFrame `flowers` before the third was added."
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Number of petals | \n",
" Name | \n",
" Color | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 8 | \n",
" lotus | \n",
" {pink, red, yellow} | \n",
"
\n",
" \n",
" 1 | \n",
" 34 | \n",
" sunflower | \n",
" {pink, red, yellow} | \n",
"
\n",
" \n",
" 2 | \n",
" 5 | \n",
" rose | \n",
" {pink, red, yellow} | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Number of petals Name Color\n",
"0 8 lotus {pink, red, yellow}\n",
"1 34 sunflower {pink, red, yellow}\n",
"2 5 rose {pink, red, yellow}"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"flowers"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Before** adding a third column a copy of df 'flowers' is created, in this case the new df created is called flowers_two_col. 'flowers_two_col = flowers`.copy()`'\n",
"\n",
"[Pandas 'df.copy()'](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.copy.html)"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Number of petals | \n",
" Name | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 8 | \n",
" lotus | \n",
"
\n",
" \n",
" 1 | \n",
" 34 | \n",
" sunflower | \n",
"
\n",
" \n",
" 2 | \n",
" 5 | \n",
" rose | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Number of petals Name\n",
"0 8 lotus\n",
"1 34 sunflower\n",
"2 5 rose"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"flowers_two_col"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Creating dfs in this way involves a lot of typing. If the data have already been entered somewhere, it is usually possible to use Python to read it into a table, instead of typing it all in cell by cell.\n",
"\n",
"Often, dfs are created from files that contain comma-separated values. Such files are called CSV files.\n",
"\n",
"Below, we use the Table method `pd.read_csv()` to read a CSV file that contains some of the data used by Minard in his graphic about Napoleon's Russian campaign. The data are placed in a df named `minard`.\n",
"\n",
"[pd.read_csv()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html)"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Longitude | \n",
" Latitude | \n",
" City | \n",
" Direction | \n",
" Survivors | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 32.0 | \n",
" 54.8 | \n",
" Smolensk | \n",
" Advance | \n",
" 145000 | \n",
"
\n",
" \n",
" 1 | \n",
" 33.2 | \n",
" 54.9 | \n",
" Dorogobouge | \n",
" Advance | \n",
" 140000 | \n",
"
\n",
" \n",
" 2 | \n",
" 34.4 | \n",
" 55.5 | \n",
" Chjat | \n",
" Advance | \n",
" 127100 | \n",
"
\n",
" \n",
" 3 | \n",
" 37.6 | \n",
" 55.8 | \n",
" Moscou | \n",
" Advance | \n",
" 100000 | \n",
"
\n",
" \n",
" 4 | \n",
" 34.3 | \n",
" 55.2 | \n",
" Wixma | \n",
" Retreat | \n",
" 55000 | \n",
"
\n",
" \n",
" 5 | \n",
" 32.0 | \n",
" 54.6 | \n",
" Smolensk | \n",
" Retreat | \n",
" 24000 | \n",
"
\n",
" \n",
" 6 | \n",
" 30.4 | \n",
" 54.4 | \n",
" Orscha | \n",
" Retreat | \n",
" 20000 | \n",
"
\n",
" \n",
" 7 | \n",
" 26.8 | \n",
" 54.3 | \n",
" Moiodexno | \n",
" Retreat | \n",
" 12000 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Longitude Latitude City Direction Survivors\n",
"0 32.0 54.8 Smolensk Advance 145000\n",
"1 33.2 54.9 Dorogobouge Advance 140000\n",
"2 34.4 55.5 Chjat Advance 127100\n",
"3 37.6 55.8 Moscou Advance 100000\n",
"4 34.3 55.2 Wixma Retreat 55000\n",
"5 32.0 54.6 Smolensk Retreat 24000\n",
"6 30.4 54.4 Orscha Retreat 20000\n",
"7 26.8 54.3 Moiodexno Retreat 12000"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"minard = pd.read_csv(path_data + 'minard.csv')\n",
"minard"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We will use this small df to demonstrate some useful DataFrame methods. We will then use those same methods, and develop other methods, on much larger DataFrames."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### The Size of the Table ###\n",
"\n",
"The method `df.shape(1)` gives the number of columns in the table, and `df.shape(0)` the number of rows.\n",
"\n",
"[df.shape[]](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.shape.html)"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(8, 5)"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"minard.shape"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"5"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"minard.shape[1]"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"8"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"minard.shape[0]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### len( ) ###\n",
"\n",
"The number of rows in a df can also be found by using the `len()` function. For number of rows `len(df.rows)`, and number of columns `len(df.columns)`. As the default parameter for the `len()` function is set for number of rows and if we want to know the number of rows we don't usually add '.rows' "
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"5"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"len(minard.columns)"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"8"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"len(minard)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Column Labels \n",
"The method `.columns` can be used to list the labels of all the columns. With `minard` we don't gain much by this, but it can be very useful for tables that are so large that not all columns are visible on the screen."
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Index(['Longitude', 'Latitude', 'City', 'Direction', 'Survivors'], dtype='object')"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"minard.columns"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can change column labels using the `rename(columns={})` method. This creates a **new** df and leaves `minard` unchanged."
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Longitude | \n",
" Latitude | \n",
" City Name | \n",
" Direction | \n",
" Survivors | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 32.0 | \n",
" 54.8 | \n",
" Smolensk | \n",
" Advance | \n",
" 145000 | \n",
"
\n",
" \n",
" 1 | \n",
" 33.2 | \n",
" 54.9 | \n",
" Dorogobouge | \n",
" Advance | \n",
" 140000 | \n",
"
\n",
" \n",
" 2 | \n",
" 34.4 | \n",
" 55.5 | \n",
" Chjat | \n",
" Advance | \n",
" 127100 | \n",
"
\n",
" \n",
" 3 | \n",
" 37.6 | \n",
" 55.8 | \n",
" Moscou | \n",
" Advance | \n",
" 100000 | \n",
"
\n",
" \n",
" 4 | \n",
" 34.3 | \n",
" 55.2 | \n",
" Wixma | \n",
" Retreat | \n",
" 55000 | \n",
"
\n",
" \n",
" 5 | \n",
" 32.0 | \n",
" 54.6 | \n",
" Smolensk | \n",
" Retreat | \n",
" 24000 | \n",
"
\n",
" \n",
" 6 | \n",
" 30.4 | \n",
" 54.4 | \n",
" Orscha | \n",
" Retreat | \n",
" 20000 | \n",
"
\n",
" \n",
" 7 | \n",
" 26.8 | \n",
" 54.3 | \n",
" Moiodexno | \n",
" Retreat | \n",
" 12000 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Longitude Latitude City Name Direction Survivors\n",
"0 32.0 54.8 Smolensk Advance 145000\n",
"1 33.2 54.9 Dorogobouge Advance 140000\n",
"2 34.4 55.5 Chjat Advance 127100\n",
"3 37.6 55.8 Moscou Advance 100000\n",
"4 34.3 55.2 Wixma Retreat 55000\n",
"5 32.0 54.6 Smolensk Retreat 24000\n",
"6 30.4 54.4 Orscha Retreat 20000\n",
"7 26.8 54.3 Moiodexno Retreat 12000"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"minard.rename(columns={'City':'City Name'})"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"However, this method does not change the original DataFrame. "
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Longitude | \n",
" Latitude | \n",
" City | \n",
" Direction | \n",
" Survivors | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 32.0 | \n",
" 54.8 | \n",
" Smolensk | \n",
" Advance | \n",
" 145000 | \n",
"
\n",
" \n",
" 1 | \n",
" 33.2 | \n",
" 54.9 | \n",
" Dorogobouge | \n",
" Advance | \n",
" 140000 | \n",
"
\n",
" \n",
" 2 | \n",
" 34.4 | \n",
" 55.5 | \n",
" Chjat | \n",
" Advance | \n",
" 127100 | \n",
"
\n",
" \n",
" 3 | \n",
" 37.6 | \n",
" 55.8 | \n",
" Moscou | \n",
" Advance | \n",
" 100000 | \n",
"
\n",
" \n",
" 4 | \n",
" 34.3 | \n",
" 55.2 | \n",
" Wixma | \n",
" Retreat | \n",
" 55000 | \n",
"
\n",
" \n",
" 5 | \n",
" 32.0 | \n",
" 54.6 | \n",
" Smolensk | \n",
" Retreat | \n",
" 24000 | \n",
"
\n",
" \n",
" 6 | \n",
" 30.4 | \n",
" 54.4 | \n",
" Orscha | \n",
" Retreat | \n",
" 20000 | \n",
"
\n",
" \n",
" 7 | \n",
" 26.8 | \n",
" 54.3 | \n",
" Moiodexno | \n",
" Retreat | \n",
" 12000 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Longitude Latitude City Direction Survivors\n",
"0 32.0 54.8 Smolensk Advance 145000\n",
"1 33.2 54.9 Dorogobouge Advance 140000\n",
"2 34.4 55.5 Chjat Advance 127100\n",
"3 37.6 55.8 Moscou Advance 100000\n",
"4 34.3 55.2 Wixma Retreat 55000\n",
"5 32.0 54.6 Smolensk Retreat 24000\n",
"6 30.4 54.4 Orscha Retreat 20000\n",
"7 26.8 54.3 Moiodexno Retreat 12000"
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"minard"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A common pattern is to **assign** the original name `minard` to the new table, so that all future uses of `minard` will refer to the relabeled table."
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Longitude | \n",
" Latitude | \n",
" City Name | \n",
" Direction | \n",
" Survivors | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 32.0 | \n",
" 54.8 | \n",
" Smolensk | \n",
" Advance | \n",
" 145000 | \n",
"
\n",
" \n",
" 1 | \n",
" 33.2 | \n",
" 54.9 | \n",
" Dorogobouge | \n",
" Advance | \n",
" 140000 | \n",
"
\n",
" \n",
" 2 | \n",
" 34.4 | \n",
" 55.5 | \n",
" Chjat | \n",
" Advance | \n",
" 127100 | \n",
"
\n",
" \n",
" 3 | \n",
" 37.6 | \n",
" 55.8 | \n",
" Moscou | \n",
" Advance | \n",
" 100000 | \n",
"
\n",
" \n",
" 4 | \n",
" 34.3 | \n",
" 55.2 | \n",
" Wixma | \n",
" Retreat | \n",
" 55000 | \n",
"
\n",
" \n",
" 5 | \n",
" 32.0 | \n",
" 54.6 | \n",
" Smolensk | \n",
" Retreat | \n",
" 24000 | \n",
"
\n",
" \n",
" 6 | \n",
" 30.4 | \n",
" 54.4 | \n",
" Orscha | \n",
" Retreat | \n",
" 20000 | \n",
"
\n",
" \n",
" 7 | \n",
" 26.8 | \n",
" 54.3 | \n",
" Moiodexno | \n",
" Retreat | \n",
" 12000 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Longitude Latitude City Name Direction Survivors\n",
"0 32.0 54.8 Smolensk Advance 145000\n",
"1 33.2 54.9 Dorogobouge Advance 140000\n",
"2 34.4 55.5 Chjat Advance 127100\n",
"3 37.6 55.8 Moscou Advance 100000\n",
"4 34.3 55.2 Wixma Retreat 55000\n",
"5 32.0 54.6 Smolensk Retreat 24000\n",
"6 30.4 54.4 Orscha Retreat 20000\n",
"7 26.8 54.3 Moiodexno Retreat 12000"
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"minard = minard.rename(columns={'City':'City Name'})\n",
"minard"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Accessing the Data in a Column ###\n",
"We can use a column's label to access the array of data in the column."
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/plain": [
"0 145000\n",
"1 140000\n",
"2 127100\n",
"3 100000\n",
"4 55000\n",
"5 24000\n",
"6 20000\n",
"7 12000\n",
"Name: Survivors, dtype: int64"
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"minard['Survivors']"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### type( )\n",
"\n",
"To determine the tupe of object created we can use the `type()` function."
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"pandas.core.frame.DataFrame"
]
},
"execution_count": 20,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"type(minard)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Using two sets of square brackets the output is displayed in DataFrame format."
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Survivors | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 145000 | \n",
"
\n",
" \n",
" 1 | \n",
" 140000 | \n",
"
\n",
" \n",
" 2 | \n",
" 127100 | \n",
"
\n",
" \n",
" 3 | \n",
" 100000 | \n",
"
\n",
" \n",
" 4 | \n",
" 55000 | \n",
"
\n",
" \n",
" 5 | \n",
" 24000 | \n",
"
\n",
" \n",
" 6 | \n",
" 20000 | \n",
"
\n",
" \n",
" 7 | \n",
" 12000 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Survivors\n",
"0 145000\n",
"1 140000\n",
"2 127100\n",
"3 100000\n",
"4 55000\n",
"5 24000\n",
"6 20000\n",
"7 12000"
]
},
"execution_count": 21,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"minard[['Survivors']]"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"pandas.core.frame.DataFrame"
]
},
"execution_count": 22,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"type(minard)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### iLoc[ ]\n",
"\n",
"(index location)\n",
"\n",
"The 5 columns are indexed 0, 1, 2, 3, and 4. The column `Survivors` can also be accessed by using the `iloc[]` method with the required column index. Notice that to select a column using the `iloc[]` method we have to first place a colon followed by a comma in the swuare brackets due to the default setting for `iloc[]` being set to 'rows'.\n",
"\n",
"[Pandas iloc []](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.iloc.html)"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0 145000\n",
"1 140000\n",
"2 127100\n",
"3 100000\n",
"4 55000\n",
"5 24000\n",
"6 20000\n",
"7 12000\n",
"Name: Survivors, dtype: int64"
]
},
"execution_count": 23,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"minard.iloc[:,4]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The 8 items in the array are indexed 0, 1, 2, and so on, up to 7. The items in the column can be accessed using `item`, as with any array."
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"145000"
]
},
"execution_count": 24,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"minard.iloc[:,4][0]"
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"24000"
]
},
"execution_count": 25,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"minard.iloc[:,4][5]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Alternatively \n",
"\n",
"if we wish to find a particular member of a row we select a row rather than a column. Notice that in this instance we have selected the 4th row and the 4th column, remembering that though there are 5 columns Pandas refers to the first column as column 0 and first row as row 0."
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"24000"
]
},
"execution_count": 26,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"minard.iloc[5][4]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Working with the Data in a Column ###\n",
"Because columns are arrays, we can use array operations on them to discover new information. For example, we can create a new column that contains the percent of all survivors at each city after Smolensk."
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Longitude | \n",
" Latitude | \n",
" City Name | \n",
" Direction | \n",
" Survivors | \n",
" Percent Surviving | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 32.0 | \n",
" 54.8 | \n",
" Smolensk | \n",
" Advance | \n",
" 145000 | \n",
" 1.000000 | \n",
"
\n",
" \n",
" 1 | \n",
" 33.2 | \n",
" 54.9 | \n",
" Dorogobouge | \n",
" Advance | \n",
" 140000 | \n",
" 0.965517 | \n",
"
\n",
" \n",
" 2 | \n",
" 34.4 | \n",
" 55.5 | \n",
" Chjat | \n",
" Advance | \n",
" 127100 | \n",
" 0.876552 | \n",
"
\n",
" \n",
" 3 | \n",
" 37.6 | \n",
" 55.8 | \n",
" Moscou | \n",
" Advance | \n",
" 100000 | \n",
" 0.689655 | \n",
"
\n",
" \n",
" 4 | \n",
" 34.3 | \n",
" 55.2 | \n",
" Wixma | \n",
" Retreat | \n",
" 55000 | \n",
" 0.379310 | \n",
"
\n",
" \n",
" 5 | \n",
" 32.0 | \n",
" 54.6 | \n",
" Smolensk | \n",
" Retreat | \n",
" 24000 | \n",
" 0.165517 | \n",
"
\n",
" \n",
" 6 | \n",
" 30.4 | \n",
" 54.4 | \n",
" Orscha | \n",
" Retreat | \n",
" 20000 | \n",
" 0.137931 | \n",
"
\n",
" \n",
" 7 | \n",
" 26.8 | \n",
" 54.3 | \n",
" Moiodexno | \n",
" Retreat | \n",
" 12000 | \n",
" 0.082759 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Longitude Latitude City Name Direction Survivors Percent Surviving\n",
"0 32.0 54.8 Smolensk Advance 145000 1.000000\n",
"1 33.2 54.9 Dorogobouge Advance 140000 0.965517\n",
"2 34.4 55.5 Chjat Advance 127100 0.876552\n",
"3 37.6 55.8 Moscou Advance 100000 0.689655\n",
"4 34.3 55.2 Wixma Retreat 55000 0.379310\n",
"5 32.0 54.6 Smolensk Retreat 24000 0.165517\n",
"6 30.4 54.4 Orscha Retreat 20000 0.137931\n",
"7 26.8 54.3 Moiodexno Retreat 12000 0.082759"
]
},
"execution_count": 27,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"initial = minard['Survivors'][0]\n",
"\n",
"minard['Percent Surviving'] = minard['Survivors']/initial\n",
"\n",
"minard"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Formatting\n",
"\n",
"To make the proportions in the new columns appear as percents, we can use the method `style.format()` with the option. \n",
"\n",
"[style.format()](https://pandas.pydata.org/pandas-docs/stable/user_guide/style.html)"
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
" | Longitude | Latitude | City Name | Direction | Survivors | Percent Surviving |
\n",
" \n",
" 0 | \n",
" 32.000000 | \n",
" 54.800000 | \n",
" Smolensk | \n",
" Advance | \n",
" 145000 | \n",
" 100.00% | \n",
"
\n",
" \n",
" 1 | \n",
" 33.200000 | \n",
" 54.900000 | \n",
" Dorogobouge | \n",
" Advance | \n",
" 140000 | \n",
" 96.55% | \n",
"
\n",
" \n",
" 2 | \n",
" 34.400000 | \n",
" 55.500000 | \n",
" Chjat | \n",
" Advance | \n",
" 127100 | \n",
" 87.66% | \n",
"
\n",
" \n",
" 3 | \n",
" 37.600000 | \n",
" 55.800000 | \n",
" Moscou | \n",
" Advance | \n",
" 100000 | \n",
" 68.97% | \n",
"
\n",
" \n",
" 4 | \n",
" 34.300000 | \n",
" 55.200000 | \n",
" Wixma | \n",
" Retreat | \n",
" 55000 | \n",
" 37.93% | \n",
"
\n",
" \n",
" 5 | \n",
" 32.000000 | \n",
" 54.600000 | \n",
" Smolensk | \n",
" Retreat | \n",
" 24000 | \n",
" 16.55% | \n",
"
\n",
" \n",
" 6 | \n",
" 30.400000 | \n",
" 54.400000 | \n",
" Orscha | \n",
" Retreat | \n",
" 20000 | \n",
" 13.79% | \n",
"
\n",
" \n",
" 7 | \n",
" 26.800000 | \n",
" 54.300000 | \n",
" Moiodexno | \n",
" Retreat | \n",
" 12000 | \n",
" 8.28% | \n",
"
\n",
"
"
],
"text/plain": [
""
]
},
"execution_count": 28,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"minard.style.format({'Percent Surviving': \"{:.2%}\"})"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**N.B.** a peculiarity of the Jupyter notebook is that if you make a mistake e.g. misspelling a column name, when you run the formatting function a nwe column will be created. to remive this colummn you must retart the kernel. \n",
"\n",
"*Toolbar - Kernel - Restart & Clear Output*"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Choosing Sets of Columns ###\n",
"To select particular columns we can use `df.['col1', 'col2']` which creates a new table that contains only the specified columns. When selecting a single column we can use one set of square brackets, when selecting multiple columns two sets of swuare brackets are required."
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Longitude | \n",
" Latitude | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 32.0 | \n",
" 54.8 | \n",
"
\n",
" \n",
" 1 | \n",
" 33.2 | \n",
" 54.9 | \n",
"
\n",
" \n",
" 2 | \n",
" 34.4 | \n",
" 55.5 | \n",
"
\n",
" \n",
" 3 | \n",
" 37.6 | \n",
" 55.8 | \n",
"
\n",
" \n",
" 4 | \n",
" 34.3 | \n",
" 55.2 | \n",
"
\n",
" \n",
" 5 | \n",
" 32.0 | \n",
" 54.6 | \n",
"
\n",
" \n",
" 6 | \n",
" 30.4 | \n",
" 54.4 | \n",
"
\n",
" \n",
" 7 | \n",
" 26.8 | \n",
" 54.3 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Longitude Latitude\n",
"0 32.0 54.8\n",
"1 33.2 54.9\n",
"2 34.4 55.5\n",
"3 37.6 55.8\n",
"4 34.3 55.2\n",
"5 32.0 54.6\n",
"6 30.4 54.4\n",
"7 26.8 54.3"
]
},
"execution_count": 29,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"minard[['Longitude', 'Latitude']]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The same selection can be made using column indices instead of labels.\n",
"\n",
"**N.B.** the column range selected is 0:2 with the range being *bottom heavy*. Though the range bottom limit is 0 and the top limit is 2 instead of processing elements 0, 1 and 2 only elements 0 and 1 will be processed i.e. *bottom heavy* or *top light*"
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Longitude | \n",
" Latitude | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 32.0 | \n",
" 54.8 | \n",
"
\n",
" \n",
" 1 | \n",
" 33.2 | \n",
" 54.9 | \n",
"
\n",
" \n",
" 2 | \n",
" 34.4 | \n",
" 55.5 | \n",
"
\n",
" \n",
" 3 | \n",
" 37.6 | \n",
" 55.8 | \n",
"
\n",
" \n",
" 4 | \n",
" 34.3 | \n",
" 55.2 | \n",
"
\n",
" \n",
" 5 | \n",
" 32.0 | \n",
" 54.6 | \n",
"
\n",
" \n",
" 6 | \n",
" 30.4 | \n",
" 54.4 | \n",
"
\n",
" \n",
" 7 | \n",
" 26.8 | \n",
" 54.3 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Longitude Latitude\n",
"0 32.0 54.8\n",
"1 33.2 54.9\n",
"2 34.4 55.5\n",
"3 37.6 55.8\n",
"4 34.3 55.2\n",
"5 32.0 54.6\n",
"6 30.4 54.4\n",
"7 26.8 54.3"
]
},
"execution_count": 30,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"minard.iloc[:, 0:2]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The result of using `df.[' ']` is a new DataFrame, even when you select just one column."
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0 145000\n",
"1 140000\n",
"2 127100\n",
"3 100000\n",
"4 55000\n",
"5 24000\n",
"6 20000\n",
"7 12000\n",
"Name: Survivors, dtype: int64"
]
},
"execution_count": 31,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"minard['Survivors']"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Another way to create a new table consisting of a set of columns is to `drop` the columns you don't want."
]
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" City Name | \n",
" Survivors | \n",
" Percent Surviving | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" Smolensk | \n",
" 145000 | \n",
" 1.000000 | \n",
"
\n",
" \n",
" 1 | \n",
" Dorogobouge | \n",
" 140000 | \n",
" 0.965517 | \n",
"
\n",
" \n",
" 2 | \n",
" Chjat | \n",
" 127100 | \n",
" 0.876552 | \n",
"
\n",
" \n",
" 3 | \n",
" Moscou | \n",
" 100000 | \n",
" 0.689655 | \n",
"
\n",
" \n",
" 4 | \n",
" Wixma | \n",
" 55000 | \n",
" 0.379310 | \n",
"
\n",
" \n",
" 5 | \n",
" Smolensk | \n",
" 24000 | \n",
" 0.165517 | \n",
"
\n",
" \n",
" 6 | \n",
" Orscha | \n",
" 20000 | \n",
" 0.137931 | \n",
"
\n",
" \n",
" 7 | \n",
" Moiodexno | \n",
" 12000 | \n",
" 0.082759 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" City Name Survivors Percent Surviving\n",
"0 Smolensk 145000 1.000000\n",
"1 Dorogobouge 140000 0.965517\n",
"2 Chjat 127100 0.876552\n",
"3 Moscou 100000 0.689655\n",
"4 Wixma 55000 0.379310\n",
"5 Smolensk 24000 0.165517\n",
"6 Orscha 20000 0.137931\n",
"7 Moiodexno 12000 0.082759"
]
},
"execution_count": 32,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"minard.drop(columns=['Longitude', 'Latitude', 'Direction'])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Neither `df.[' ']` nor `drop` change the original DataFrame. Instead, they create new smaller DataFrames that share the same data. The fact that the original DataFrame is preserved is useful! You can generate multiple different tables that only consider certain columns without worrying that one analysis will affect the other."
]
},
{
"cell_type": "code",
"execution_count": 33,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Longitude | \n",
" Latitude | \n",
" City Name | \n",
" Direction | \n",
" Survivors | \n",
" Percent Surviving | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 32.0 | \n",
" 54.8 | \n",
" Smolensk | \n",
" Advance | \n",
" 145000 | \n",
" 1.000000 | \n",
"
\n",
" \n",
" 1 | \n",
" 33.2 | \n",
" 54.9 | \n",
" Dorogobouge | \n",
" Advance | \n",
" 140000 | \n",
" 0.965517 | \n",
"
\n",
" \n",
" 2 | \n",
" 34.4 | \n",
" 55.5 | \n",
" Chjat | \n",
" Advance | \n",
" 127100 | \n",
" 0.876552 | \n",
"
\n",
" \n",
" 3 | \n",
" 37.6 | \n",
" 55.8 | \n",
" Moscou | \n",
" Advance | \n",
" 100000 | \n",
" 0.689655 | \n",
"
\n",
" \n",
" 4 | \n",
" 34.3 | \n",
" 55.2 | \n",
" Wixma | \n",
" Retreat | \n",
" 55000 | \n",
" 0.379310 | \n",
"
\n",
" \n",
" 5 | \n",
" 32.0 | \n",
" 54.6 | \n",
" Smolensk | \n",
" Retreat | \n",
" 24000 | \n",
" 0.165517 | \n",
"
\n",
" \n",
" 6 | \n",
" 30.4 | \n",
" 54.4 | \n",
" Orscha | \n",
" Retreat | \n",
" 20000 | \n",
" 0.137931 | \n",
"
\n",
" \n",
" 7 | \n",
" 26.8 | \n",
" 54.3 | \n",
" Moiodexno | \n",
" Retreat | \n",
" 12000 | \n",
" 0.082759 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Longitude Latitude City Name Direction Survivors Percent Surviving\n",
"0 32.0 54.8 Smolensk Advance 145000 1.000000\n",
"1 33.2 54.9 Dorogobouge Advance 140000 0.965517\n",
"2 34.4 55.5 Chjat Advance 127100 0.876552\n",
"3 37.6 55.8 Moscou Advance 100000 0.689655\n",
"4 34.3 55.2 Wixma Retreat 55000 0.379310\n",
"5 32.0 54.6 Smolensk Retreat 24000 0.165517\n",
"6 30.4 54.4 Orscha Retreat 20000 0.137931\n",
"7 26.8 54.3 Moiodexno Retreat 12000 0.082759"
]
},
"execution_count": 33,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"minard"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"All of the methods that we have used above can be applied to any DataFrame."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"anaconda-cloud": {},
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.12"
}
},
"nbformat": 4,
"nbformat_minor": 2
}