{ "cells": [ { "cell_type": "code", "execution_count": 1, "metadata": { "tags": [ "remove_input" ] }, "outputs": [], "source": [ "path_data = '../../data/'\n", "\n", "import numpy as np\n", "import pandas as pd\n", "\n", "%matplotlib inline\n", "import matplotlib.pyplot as plt\n", "plt.style.use('fivethirtyeight')\n", "\n", "import warnings\n", "warnings.filterwarnings('ignore')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Percentiles\n", "Numerical data can be sorted in increasing or decreasing order. Thus the values of a numerical data set have a *rank order*. A percentile is the value at a particular rank.\n", "\n", "For example, if your score on a test is at the 95th percentile, a common interpretation is that only 5% of the scores were higher than yours. The median is the 50th percentile; it is commonly assumed that 50% of the values in a data set are above the median.\n", "\n", "But some care is required in giving percentiles a precise definition that works for all ranks and all lists. To see why, consider an extreme example where all the students in a class score 75 on a test. Then 75 is a natural candidate for the median, but it's not true that 50% of the scores are above 75. Also, 75 is an equally natural candidate for the 95th percentile or the 25th or any other percentile. Ties – that is, equal data values – have to be taken into account when defining percentiles.\n", "\n", "You also have to be careful about exactly how far up the list to go when the relevant index isn't clear. For example, what should be the 87th percentile of a collection of 10 values? The 8th value of the sorted collection, or the 9th, or somewhere in between?\n", "\n", "In this section, we will give a definition that works consistently for all ranks and all lists." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## A Numerical Example\n", "Before giving a general definition of all percentiles, we will define the 80th percentile of a collection of values to be the smallest value in the collection that is at least as large as 80% of all of the values.\n", "\n", "For example, let's consider the sizes of the five largest continents – Africa, Antarctica, Asia, North America, and South America – rounded to the nearest million square miles." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "sizes = np.array([12, 17, 6, 9, 7])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The 80th percentile is the smallest value that is at least as large as 80% of the elements of `sizes`, that is, four-fifths of the five elements. That's 12:" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([ 6, 7, 9, 12, 17])" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "np.sort(sizes)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The 80th percentile is a value on the list, namely 12. You can see that 80% of the values are less than or equal to it, and that it is the smallest value on the list for which this is true.\n", "\n", "Analogously, the 70th percentile is the smallest value in the collection that is at least as large as 70% of the elements of `sizes`. Now 70% of 5 elements is \"3.5 elements\", so the 70th percentile is the 4th element on the list. That's 12, the same as the 80th percentile for these data." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### The `percentile` function\n", "The NumPy `percentile` function takes two arguments: an array as a source and a rank between 0 and 100. It returns the corresponding percentile of the array. 
One of the parameters of NumPy's `percentile` function is `interpolation`, whose options are *{‘linear’, ‘lower’, ‘higher’, ‘midpoint’, ‘nearest’}*. In NumPy 1.22 and later this parameter is named `method`, and `interpolation` is deprecated. The `'nearest'` option agrees with the definition above for this example, though not for every rank: for the 62nd percentile of `sizes` it gives 9, whereas the definition gives 12.\n", "\n", "[NumPy `percentile` documentation](https://numpy.org/doc/stable/reference/generated/numpy.percentile.html)" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "12" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "np.percentile(sizes, 70, interpolation='nearest')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### The General Definition\n", "Let $p$ be a number between 0 and 100. The $p$th percentile of a collection is the smallest value in the collection that is at least as large as $p$% of all the values.\n", "\n", "By this definition, any percentile between 0 and 100 can be computed for any collection of values, and it is always an element of the collection. \n", "\n", "In practical terms, suppose there are $n$ elements in the collection. To find the $p$th percentile:\n", "- Sort the collection in increasing order.\n", "- Find $p$% of $n$: $(p/100) \\times n$. Call that $k$.\n", "- If $k$ is an integer, take the $k$th element of the sorted collection.\n", "- If $k$ is not an integer, round it up to the next integer, and take that element of the sorted collection." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Example\n", "The table `scores_and_sections` contains one row for each student in a class of 359 students. The columns are the student's discussion section and midterm score. " ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", " | Section | Midterm |\n", "
---|---|---|\n", "
0 | 1 | 22 |\n", "
1 | 2 | 12 |\n", "
2 | 2 | 23 |\n", "
3 | 2 | 14 |\n", "
4 | 1 | 20 |\n", "
5 | 3 | 25 |\n", "
6 | 4 | 19 |\n", "
7 | 1 | 24 |\n", "
8 | 5 | 8 |\n", "
9 | 6 | 14 |\n", "