
#!/usr/bin/env python
# coding: utf-8
# In[1]:
from datascience import *
get_ipython().run_line_magic('matplotlib', 'inline')
path_data = '../../../../data/'
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
import numpy as np
# ### Percentiles ###
# Numerical data can be sorted in increasing or decreasing order. Thus the values of a numerical data set have a *rank order*. A percentile is the value at a particular rank.
#
# For example, if your score on a test is at the 95th percentile, a common interpretation is that only 5% of the scores were higher than yours. The median is the 50th percentile; it is commonly assumed that 50% of the values in a data set are above the median.
#
# But some care is required in giving percentiles a precise definition that works for all ranks and all lists. To see why, consider an extreme example where all the students in a class score 75 on a test. Then 75 is a natural candidate for the median, but it's not true that 50% of the scores are above 75. Also, 75 is an equally natural candidate for the 95th percentile or the 25th or any other percentile. Ties – that is, equal data values – have to be taken into account when defining percentiles.
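#
# The tie example can be checked with a short sketch in plain Python. The class size of 30 is an assumption made purely for illustration; any class size gives the same proportions.
# In[ ]:
# Hypothetical class in which every student scores 75
tied_scores = [75] * 30
candidate_median = 75
# Proportion of scores strictly above the candidate median
above = sum(score > candidate_median for score in tied_scores) / len(tied_scores)
# Proportion of scores less than or equal to the candidate median
at_or_below = sum(score <= candidate_median for score in tied_scores) / len(tied_scores)
print(above, at_or_below)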
#
# You also have to be careful about exactly how far up the list to go when the relevant index isn't clear. For example, what should be the 87th percentile of a collection of 10 values? The 8th value of the sorted collection, or the 9th, or somewhere in between?
#
# In this section, we will give a definition that works consistently for all ranks and all lists.
# ### A Numerical Example ###
# Before giving a general definition of all percentiles, we will define the 80th percentile of a collection of values to be the smallest value in the collection that is at least as large as 80% of all of the values.
#
# For example, let's consider the sizes of the five largest continents – Africa, Antarctica, Asia, North America, and South America – rounded to the nearest million square miles.
# In[2]:
sizes = make_array(12, 17, 6, 9, 7)
# The 80th percentile is the smallest value that is at least as large as 80% of the elements of `sizes`, that is, four-fifths of the five elements. That's 12:
# In[3]:
np.sort(sizes)
# The 80th percentile is a value on the list, namely 12. You can see that 80% of the values are less than or equal to it, and that it is the smallest value on the list for which this is true.
#
# Analogously, the 70th percentile is the smallest value in the collection that is at least as large as 70% of the elements of `sizes`. Now 70% of 5 elements is "3.5 elements", so the 70th percentile is the 4th element on the list. That's 12, the same as the 80th percentile for these data.
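#
# The definition can also be applied literally, by counting. The sketch below uses only plain Python; the helper name `smallest_value_covering` is hypothetical and is not part of the `datascience` library. For the continent sizes it gives 12 for both the 80th and the 70th percentile, in line with the reasoning above.
# In[ ]:
def smallest_value_covering(p, values):
    """Return the smallest value that is at least as large as p% of the values."""
    n = len(values)
    for candidate in sorted(values):
        # Count how many values the candidate is at least as large as
        covered = sum(v <= candidate for v in values)
        # Compare covered/n with p/100 without dividing
        if covered * 100 >= p * n:
            return candidate

sizes_list = [12, 17, 6, 9, 7]
print(smallest_value_covering(80, sizes_list))
print(smallest_value_covering(70, sizes_list))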
# ### The `percentile` function ###
# The `percentile` function takes two arguments: a rank between 0 and 100, and an array. It returns the corresponding percentile of the array.
# In[4]:
percentile(70, sizes)
# ### The General Definition ###
#
# Let $p$ be a number between 0 and 100. The $p$th percentile of a collection is the smallest value in the collection that is at least as large as $p$% of all the values.
#
# By this definition, any percentile between 0 and 100 can be computed for any collection of values, and it is always an element of the collection.
#
# In practical terms, suppose there are $n$ elements in the collection. To find the $p$th percentile:
# - Sort the collection in increasing order.
# - Find $p$% of $n$: $(p/100) \times n$. Call that $k$.
# - If $k$ is an integer, take the $k$th element of the sorted collection.
# - If $k$ is not an integer, round it up to the next integer, and take that element of the sorted collection.
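#
# The steps above can be sketched as a short function. This is an illustrative sketch, not the implementation behind `percentile`; the name `percentile_by_rank` is hypothetical, and treating $k = 0$ as the first element is an assumption, since the steps above do not say what to do in that case.
# In[ ]:
import math

def percentile_by_rank(p, values):
    """Apply the sort / find k / round up steps to a list of numbers."""
    ordered = sorted(values)        # sort in increasing order
    n = len(ordered)
    k = p * n / 100                 # p% of n (floating-point rounding could matter for non-integer p)
    rank = max(math.ceil(k), 1)     # round up if k is not an integer; assumed floor of 1 when k is 0
    return ordered[rank - 1]        # the rank-th element sits at index rank - 1

print(percentile_by_rank(70, [12, 17, 6, 9, 7]))
print(percentile_by_rank(80, [12, 17, 6, 9, 7]))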
# ### Example ###
# The table `scores_and_sections` contains one row for each student in a class of 359 students. The columns are the student's discussion section and midterm score.
# In[5]:
scores_and_sections = Table.read_table(path_data + 'scores_by_section.csv')
scores_and_sections
# In[6]:
scores_and_sections.select('Midterm').hist(bins=np.arange(-0.5, 25.6, 1))
# What was the 85th percentile of the scores? To use the `percentile` function, create an array `scores` containing the midterm scores, and find the 85th percentile:
# In[7]:
scores = scores_and_sections.column(1)
# In[8]:
percentile(85, scores)
# According to the `percentile` function, the 85th percentile was 22. To check that this is consistent with our new definition, let's apply the definition directly.
#
# First, put the scores in increasing order:
# In[9]:
sorted_scores = np.sort(scores_and_sections.column(1))
# There are 359 scores in the array. So next, find 85% of 359, which is 305.15.
# In[10]:
0.85 * 359
# That's not an integer. By our definition, the 85th percentile is the 306th element of `sorted_scores`, which, by Python's indexing convention, is item 305 of the array.
# In[11]:
# The 306th element of the sorted array
sorted_scores.item(305)
# That's the same as the answer we got by using `percentile`. In future, we will just use `percentile`.
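#
# The rank arithmetic in this check can be reproduced with the standard library alone. The variable names below are illustrative, and the sketch covers only the index calculation, not the scores themselves.
# In[ ]:
import math

n = 359                  # number of scores
k = 85 * n / 100         # 85% of 359
rank = math.ceil(k)      # k is not an integer, so round up to the next rank
index = rank - 1         # Python indexes from 0, so rank 306 is item 305
print(k, rank, index)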
# ### Quartiles ###
# The *first quartile* of a numerical collection is the 25th percentile. The terminology arises from *the first quarter*. The second quartile is the median, and the third quartile is the 75th percentile.
#
# For our `scores` data, those values are:
# In[12]:
percentile(25, scores)
# In[13]:
percentile(50, scores)
# In[14]:
percentile(75, scores)
# Distributions of scores are sometimes summarized by the "middle 50%" interval, between the first and third quartiles.
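#
# As an illustration of the "middle 50%" interval, the sketch below computes the three quartiles of the five continent sizes used earlier, following the rank definition from this section. The helper `rank_percentile` is hypothetical; with the `datascience` library one would call `percentile` instead.
# In[ ]:
import math

def rank_percentile(p, values):
    """Rank-based percentile of a list, rounding the rank up (assumed floor of 1)."""
    ordered = sorted(values)
    rank = max(math.ceil(p * len(ordered) / 100), 1)
    return ordered[rank - 1]

continent_sizes = [12, 17, 6, 9, 7]
first_quartile = rank_percentile(25, continent_sizes)
median = rank_percentile(50, continent_sizes)
third_quartile = rank_percentile(75, continent_sizes)
# The "middle 50%" interval runs from the first quartile to the third quartile
middle_half = (first_quartile, third_quartile)
print(first_quartile, median, third_quartile)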