{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "tags": [
     "remove_input"
    ]
   },
   "outputs": [],
   "source": [
    "path_data = '../../data/'\n",
    "\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "\n",
    "%matplotlib inline\n",
    "import matplotlib.pyplot as plt\n",
    "plt.style.use('fivethirtyeight')\n",
    "\n",
    "import warnings\n",
    "warnings.filterwarnings('ignore')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Classifying by One Variable\n",
    "\n",
    "Data scientists often need to classify individuals into groups according to shared features, and then identify some characteristics of the groups. For example, in the example using Galton's data on heights, we saw that it was useful to classify families according to the parents' midparent heights, and then find the average height of the children in each group.\n",
    "\n",
    "This section is about classifying individuals into categories that are not numerical. We begin by recalling the basic use of `group`. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Counting the Number in Each Category\n",
    "The `group` method with a single argument counts the number of rows for each category in a column. The result contains one row per unique value in the grouped column.\n",
    "\n",
    "Here is a small table of data on ice cream cones. The `group` method can be used to list the distinct flavors and provide the counts of each flavor."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Flavor</th>\n",
       "      <th>Price</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>strawberry</td>\n",
       "      <td>3.55</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>chocolate</td>\n",
       "      <td>4.75</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>chocolate</td>\n",
       "      <td>6.55</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>strawberry</td>\n",
       "      <td>5.25</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>chocolate</td>\n",
       "      <td>5.25</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "       Flavor  Price\n",
       "0  strawberry   3.55\n",
       "1   chocolate   4.75\n",
       "2   chocolate   6.55\n",
       "3  strawberry   5.25\n",
       "4   chocolate   5.25"
      ]
     },
     "execution_count": 19,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "cones = pd.DataFrame({\n",
    "    'Flavor':np.array(['strawberry', 'chocolate', 'chocolate', 'strawberry', 'chocolate']),\n",
    "    'Price':np.array([3.55, 4.75, 6.55, 5.25, 5.25])}\n",
    ")\n",
    "cones"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>count</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Flavor</th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>chocolate</th>\n",
       "      <td>3</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>strawberry</th>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "            count\n",
       "Flavor           \n",
       "chocolate       3\n",
       "strawberry      2"
      ]
     },
     "execution_count": 20,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_grouped = cones.groupby([\"Flavor\"]).agg(\n",
    "    count=pd.NamedAgg(column=\"Flavor\", aggfunc=\"count\")\n",
    ")\n",
    "\n",
    "df_grouped"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "There are two distinct categories, chocolate and strawberry. When we call `groupby` we must state what we want to do with the group data e.g. `count()`. Applying the `count()` method will create a column of counts whcih takes the names of the first column in the df by default, and contains the number of rows in each category. To make this easier to read we could change the count column to 'count'.\n",
    "\n",
    "Notice that this can all be worked out from just the `Flavor` column. Only the `Price` column name has been used, the data has not been used.\n",
    "\n",
    "But what if we wanted the total price of the cones of each different flavor? In this case we can apply a different method e.g. `sum()`, to `groupby`."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Finding a Characteristic of Each Category\n",
    "The optional second argument of `group` names the function that will be used to aggregate values in other columns for all of those rows. For instance, `sum` will sum up the prices in all rows that match each category. This result also contains one row per unique value in the grouped column, but it has the same number of columns as the original table.\n",
    "\n",
    "To find the total price of each flavor, we call `group` again, with `Flavor` as its first argument as before. But this time there is a second argument: the function name `sum`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Price_sum</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Flavor</th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>chocolate</th>\n",
       "      <td>16.55</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>strawberry</th>\n",
       "      <td>8.80</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "            Price_sum\n",
       "Flavor               \n",
       "chocolate       16.55\n",
       "strawberry       8.80"
      ]
     },
     "execution_count": 21,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_grouped_by_price = cones.groupby([\"Flavor\"]).agg(\n",
    "    Price_sum=pd.NamedAgg(column=\"Price\", aggfunc=\"sum\")\n",
    ")\n",
    "\n",
    "df_grouped_by_price"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To create this new table, `groupby` has calculated the **sum** of the `Price` entries in all the rows corresponding to each distinct flavor. The prices in the three `chocolate` rows add up to 16.55 (in whatever currency). The prices in the two `strawberry` rows have a total of 8.80.\n",
    "\n",
    "Using pandas `groupby aggregation` we can compute a summary statistic (or statistics). The label of the newly created column is `Price sum`, which is created by taking the label of the column being summed, and appending the word `sum` in the *aggregation pipeline*. \n",
    "\n",
    "In this insatnce there are only two columns so when `group` finds the `sum` of all columns other than the one with the categories, there is no need to specify that it has to `sum` the prices. Using the [`Pandas NamedAgg`](https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html#aggregation) function we can name the column contining the results of the aggregation.\n",
    "\n",
    "To see in more detail what `group` is doing, notice that you could have figured out the total prices yourself, not only by mental arithmetic but also using code. For example, to find the total price of all the chocolate cones, you could start by creating a new table consisting of only the chocolate cones, and then accessing the column of prices:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "1    4.75\n",
       "2    6.55\n",
       "4    5.25\n",
       "Name: Price, dtype: float64"
      ]
     },
     "execution_count": 22,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "cones[cones['Flavor'] == 'chocolate']['Price']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Price</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>4.75</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>6.55</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>5.25</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   Price\n",
       "1   4.75\n",
       "2   6.55\n",
       "4   5.25"
      ]
     },
     "execution_count": 23,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "cones[cones['Flavor'] == 'chocolate'][['Price']]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Price    16.55\n",
       "dtype: float64"
      ]
     },
     "execution_count": 24,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "np.sum(cones[cones['Flavor'] == 'chocolate'][['Price']])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "16.55"
      ]
     },
     "execution_count": 25,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "np.sum([cones[cones['Flavor'] == 'chocolate'][['Price']]])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This is what `groupby` is doing for each distinct value in `Flavor`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Flavor</th>\n",
       "      <th>Array of All the Prices</th>\n",
       "      <th>Sum of the Array</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>chocolate</td>\n",
       "      <td>[4.75, 6.55, 5.25]</td>\n",
       "      <td>16.55</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>strawberry</td>\n",
       "      <td>[3.55, 5.25]</td>\n",
       "      <td>8.80</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "       Flavor Array of All the Prices  Sum of the Array\n",
       "0   chocolate      [4.75, 6.55, 5.25]             16.55\n",
       "1  strawberry            [3.55, 5.25]              8.80"
      ]
     },
     "execution_count": 26,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# For each distinct value in `Flavor, access all the rows\n",
    "# and create an array of `Price`\n",
    "\n",
    "cones_choc = cones[cones['Flavor'] == 'chocolate']['Price']\n",
    "\n",
    "cones_strawb = cones[cones['Flavor'] =='strawberry']['Price']\n",
    "\n",
    "# Display the arrays in a table\n",
    "\n",
    "cones_choc = np.array(cones_choc)\n",
    "cones_strawb = np.array(cones_strawb)\n",
    "\n",
    "grouped_cones = pd.DataFrame({\n",
    "    'Flavor':np.array(['chocolate', 'strawberry']),\n",
    "    'Array of All the Prices':[cones_choc, cones_strawb]}\n",
    ")\n",
    "\n",
    "#priceTotals\n",
    "\n",
    "# Append a column with the sum of the `Price` values in each array\n",
    "\n",
    "price_totals = grouped_cones\n",
    "\n",
    "price_totals['Sum of the Array'] = np.array([sum(cones_choc), sum(cones_strawb)])\n",
    "\n",
    "price_totals"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You can replace `sum` by any other functions that work on arrays. For example, you could use `max` to find the largest price in each category:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Price</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Flavor</th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>chocolate</th>\n",
       "      <td>6.55</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>strawberry</th>\n",
       "      <td>5.25</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "            Price\n",
       "Flavor           \n",
       "chocolate    6.55\n",
       "strawberry   5.25"
      ]
     },
     "execution_count": 27,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "cones.groupby('Flavor').max()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### *Or*"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Price_Max</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Flavor</th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>chocolate</th>\n",
       "      <td>6.55</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>strawberry</th>\n",
       "      <td>5.25</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "            Price_Max\n",
       "Flavor               \n",
       "chocolate        6.55\n",
       "strawberry       5.25"
      ]
     },
     "execution_count": 28,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "price_max = cones.groupby([\"Flavor\"]).agg(\n",
    "    Price_Max=pd.NamedAgg(column=\"Price\", aggfunc=\"max\")\n",
    ")\n",
    "\n",
    "price_max"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Once again, `groupby` creates arrays of the prices in each `Flavor` category. But now it finds the `max` of each array:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Flavor</th>\n",
       "      <th>Array of All the Prices</th>\n",
       "      <th>Sum of the Array</th>\n",
       "      <th>Max of the Array</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>chocolate</td>\n",
       "      <td>[4.75, 6.55, 5.25]</td>\n",
       "      <td>16.55</td>\n",
       "      <td>6.55</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>strawberry</td>\n",
       "      <td>[3.55, 5.25]</td>\n",
       "      <td>8.80</td>\n",
       "      <td>5.25</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "       Flavor Array of All the Prices  Sum of the Array  Max of the Array\n",
       "0   chocolate      [4.75, 6.55, 5.25]             16.55              6.55\n",
       "1  strawberry            [3.55, 5.25]              8.80              5.25"
      ]
     },
     "execution_count": 29,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "price_max = grouped_cones.copy()\n",
    "\n",
    "price_max['Max of the Array'] = np.array([max(cones_choc), max(cones_strawb)])\n",
    "\n",
    "price_max"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Indeed, the original call to `group` with just one argument has the same effect as using `len` as the function and then cleaning up the table."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Flavor</th>\n",
       "      <th>Array of All the Prices</th>\n",
       "      <th>Sum of the Array</th>\n",
       "      <th>Length of the Array</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>chocolate</td>\n",
       "      <td>[4.75, 6.55, 5.25]</td>\n",
       "      <td>16.55</td>\n",
       "      <td>3</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>strawberry</td>\n",
       "      <td>[3.55, 5.25]</td>\n",
       "      <td>8.80</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "       Flavor Array of All the Prices  Sum of the Array  Length of the Array\n",
       "0   chocolate      [4.75, 6.55, 5.25]             16.55                    3\n",
       "1  strawberry            [3.55, 5.25]              8.80                    2"
      ]
     },
     "execution_count": 30,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "array_length = grouped_cones.copy()\n",
    "\n",
    "array_length['Length of the Array'] = np.array([len(cones_choc), len(cones_strawb)])\n",
    "\n",
    "array_length"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Example: NBA Salaries\n",
    "The table `nba` contains data on the 2015-2016 players in the National Basketball Association. We have examined these data earlier. Recall that salaries are measured in millions of dollars."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>PLAYER</th>\n",
       "      <th>POSITION</th>\n",
       "      <th>TEAM</th>\n",
       "      <th>SALARY</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Paul Millsap</td>\n",
       "      <td>PF</td>\n",
       "      <td>Atlanta Hawks</td>\n",
       "      <td>18.671659</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Al Horford</td>\n",
       "      <td>C</td>\n",
       "      <td>Atlanta Hawks</td>\n",
       "      <td>12.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Tiago Splitter</td>\n",
       "      <td>C</td>\n",
       "      <td>Atlanta Hawks</td>\n",
       "      <td>9.756250</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>Jeff Teague</td>\n",
       "      <td>PG</td>\n",
       "      <td>Atlanta Hawks</td>\n",
       "      <td>8.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>Kyle Korver</td>\n",
       "      <td>SG</td>\n",
       "      <td>Atlanta Hawks</td>\n",
       "      <td>5.746479</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>412</th>\n",
       "      <td>Gary Neal</td>\n",
       "      <td>PG</td>\n",
       "      <td>Washington Wizards</td>\n",
       "      <td>2.139000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>413</th>\n",
       "      <td>DeJuan Blair</td>\n",
       "      <td>C</td>\n",
       "      <td>Washington Wizards</td>\n",
       "      <td>2.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>414</th>\n",
       "      <td>Kelly Oubre Jr.</td>\n",
       "      <td>SF</td>\n",
       "      <td>Washington Wizards</td>\n",
       "      <td>1.920240</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>415</th>\n",
       "      <td>Garrett Temple</td>\n",
       "      <td>SG</td>\n",
       "      <td>Washington Wizards</td>\n",
       "      <td>1.100602</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>416</th>\n",
       "      <td>Jarell Eddie</td>\n",
       "      <td>SG</td>\n",
       "      <td>Washington Wizards</td>\n",
       "      <td>0.561716</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>417 rows × 4 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "              PLAYER POSITION                TEAM     SALARY\n",
       "0       Paul Millsap       PF       Atlanta Hawks  18.671659\n",
       "1         Al Horford        C       Atlanta Hawks  12.000000\n",
       "2     Tiago Splitter        C       Atlanta Hawks   9.756250\n",
       "3        Jeff Teague       PG       Atlanta Hawks   8.000000\n",
       "4        Kyle Korver       SG       Atlanta Hawks   5.746479\n",
       "..               ...      ...                 ...        ...\n",
       "412        Gary Neal       PG  Washington Wizards   2.139000\n",
       "413     DeJuan Blair        C  Washington Wizards   2.000000\n",
       "414  Kelly Oubre Jr.       SF  Washington Wizards   1.920240\n",
       "415   Garrett Temple       SG  Washington Wizards   1.100602\n",
       "416     Jarell Eddie       SG  Washington Wizards   0.561716\n",
       "\n",
       "[417 rows x 4 columns]"
      ]
     },
     "execution_count": 31,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "nba1 = pd.read_csv(path_data + 'nba_salaries.csv')\n",
    "\n",
    "nba = nba1.rename(columns={\"'15-'16 SALARY\": 'SALARY'})\n",
    "\n",
    "nba"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**1.** How much money did each team pay for its players' salaries?\n",
    "\n",
    "The only columns involved are `TEAM` and `SALARY`. We have to `group` the rows by `TEAM` and then `sum` the salaries of the groups. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>SALARY</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>TEAM</th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>Atlanta Hawks</th>\n",
       "      <td>69.573103</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Boston Celtics</th>\n",
       "      <td>50.285499</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Brooklyn Nets</th>\n",
       "      <td>57.306976</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Charlotte Hornets</th>\n",
       "      <td>84.102397</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Chicago Bulls</th>\n",
       "      <td>78.820890</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Cleveland Cavaliers</th>\n",
       "      <td>102.312412</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Dallas Mavericks</th>\n",
       "      <td>65.762559</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Denver Nuggets</th>\n",
       "      <td>62.429404</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Detroit Pistons</th>\n",
       "      <td>42.211760</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Golden State Warriors</th>\n",
       "      <td>94.085137</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Houston Rockets</th>\n",
       "      <td>85.285837</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Indiana Pacers</th>\n",
       "      <td>62.695023</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Los Angeles Clippers</th>\n",
       "      <td>66.074113</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Los Angeles Lakers</th>\n",
       "      <td>68.607944</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Memphis Grizzlies</th>\n",
       "      <td>93.796439</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Miami Heat</th>\n",
       "      <td>81.528667</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Milwaukee Bucks</th>\n",
       "      <td>52.258355</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Minnesota Timberwolves</th>\n",
       "      <td>65.847421</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>New Orleans Pelicans</th>\n",
       "      <td>80.514606</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>New York Knicks</th>\n",
       "      <td>69.404994</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Oklahoma City Thunder</th>\n",
       "      <td>96.832165</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Orlando Magic</th>\n",
       "      <td>77.623940</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Philadelphia 76ers</th>\n",
       "      <td>42.481345</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Phoenix Suns</th>\n",
       "      <td>50.520815</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Portland Trail Blazers</th>\n",
       "      <td>45.446878</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Sacramento Kings</th>\n",
       "      <td>68.384890</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>San Antonio Spurs</th>\n",
       "      <td>84.652074</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Toronto Raptors</th>\n",
       "      <td>74.672620</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Utah Jazz</th>\n",
       "      <td>52.631878</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Washington Wizards</th>\n",
       "      <td>90.047498</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                            SALARY\n",
       "TEAM                              \n",
       "Atlanta Hawks            69.573103\n",
       "Boston Celtics           50.285499\n",
       "Brooklyn Nets            57.306976\n",
       "Charlotte Hornets        84.102397\n",
       "Chicago Bulls            78.820890\n",
       "Cleveland Cavaliers     102.312412\n",
       "Dallas Mavericks         65.762559\n",
       "Denver Nuggets           62.429404\n",
       "Detroit Pistons          42.211760\n",
       "Golden State Warriors    94.085137\n",
       "Houston Rockets          85.285837\n",
       "Indiana Pacers           62.695023\n",
       "Los Angeles Clippers     66.074113\n",
       "Los Angeles Lakers       68.607944\n",
       "Memphis Grizzlies        93.796439\n",
       "Miami Heat               81.528667\n",
       "Milwaukee Bucks          52.258355\n",
       "Minnesota Timberwolves   65.847421\n",
       "New Orleans Pelicans     80.514606\n",
       "New York Knicks          69.404994\n",
       "Oklahoma City Thunder    96.832165\n",
       "Orlando Magic            77.623940\n",
       "Philadelphia 76ers       42.481345\n",
       "Phoenix Suns             50.520815\n",
       "Portland Trail Blazers   45.446878\n",
       "Sacramento Kings         68.384890\n",
       "San Antonio Spurs        84.652074\n",
       "Toronto Raptors          74.672620\n",
       "Utah Jazz                52.631878\n",
       "Washington Wizards       90.047498"
      ]
     },
     "execution_count": 32,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "teams_and_money = nba[['TEAM', 'SALARY']]\n",
    "\n",
    "teams_and_money.groupby('TEAM').sum()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**2.** How many NBA players were there in each of the five positions?\n",
    "\n",
    "We have to classify by `POSITION`, and count. This can be achieved by applying the `count()` method to a `groupby` or and the aggregation method with `aggfunc=\"count\"`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>count</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>POSITION</th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>C</th>\n",
       "      <td>69</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>PF</th>\n",
       "      <td>85</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>PG</th>\n",
       "      <td>85</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>SF</th>\n",
       "      <td>82</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>SG</th>\n",
       "      <td>96</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "          count\n",
       "POSITION       \n",
       "C            69\n",
       "PF           85\n",
       "PG           85\n",
       "SF           82\n",
       "SG           96"
      ]
     },
     "execution_count": 33,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "#nba.groupby('POSITION').count()\n",
    "\n",
    "# -- or\n",
    "\n",
    "position_count = nba.groupby([\"POSITION\"]).agg(\n",
    "    count=pd.NamedAgg(column=\"PLAYER\", aggfunc=\"count\")\n",
    ")\n",
    "\n",
    "position_count"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**3.** What was the average salary of the players at each of the five positions?\n",
    "\n",
    "This time, we have to group by `POSITION` and take the mean of the salaries. For clarity, we will work with a table of just the positions and the salaries."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>SALARY</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>POSITION</th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>C</th>\n",
       "      <td>6.082913</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>PF</th>\n",
       "      <td>4.951344</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>PG</th>\n",
       "      <td>5.165487</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>SF</th>\n",
       "      <td>5.532675</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>SG</th>\n",
       "      <td>3.988195</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "            SALARY\n",
       "POSITION          \n",
       "C         6.082913\n",
       "PF        4.951344\n",
       "PG        5.165487\n",
       "SF        5.532675\n",
       "SG        3.988195"
      ]
     },
     "execution_count": 34,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "positions_and_money = nba[['POSITION', 'SALARY']]\n",
    "\n",
    "positions_and_money.groupby('POSITION').mean()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Center was the most highly paid position, at an average of over 6 million dollars.\n",
    "\n",
    "If we had not selected the two columns as our first step, `group` would not attempt to \"average\" the categorical columns in `nba`. (It is impossible to average two strings like \"Atlanta Hawks\" and \"Boston Celtics\".) It performs arithmetic only on numerical columns and leaves the rest blank."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 35,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>SALARY_mean</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>POSITION</th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>C</th>\n",
       "      <td>6.082913</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>PF</th>\n",
       "      <td>4.951344</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>PG</th>\n",
       "      <td>5.165487</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>SF</th>\n",
       "      <td>5.532675</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>SG</th>\n",
       "      <td>3.988195</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "          SALARY_mean\n",
       "POSITION             \n",
       "C            6.082913\n",
       "PF           4.951344\n",
       "PG           5.165487\n",
       "SF           5.532675\n",
       "SG           3.988195"
      ]
     },
     "execution_count": 35,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "nba_mean = nba.groupby('POSITION').mean()\n",
    "\n",
    "nba_mean\n",
    "\n",
    "nba_mean = nba.groupby([\"POSITION\"]).agg(\n",
    "    SALARY_mean=pd.NamedAgg(column=\"SALARY\", aggfunc=\"mean\")\n",
    ")\n",
    "\n",
    "nba_mean"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}