{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "tags": [
     "remove_input"
    ]
   },
   "outputs": [],
   "source": [
    "path_data = '../../data/'\n",
    "\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "\n",
    "%matplotlib inline\n",
    "import matplotlib.pyplot as plt\n",
    "plt.style.use('fivethirtyeight')\n",
    "\n",
    "import warnings\n",
    "warnings.filterwarnings('ignore')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "tags": [
     "remove_input"
    ]
   },
   "outputs": [],
   "source": [
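    "def population(prior_prob_disease):\n",
    "    # Build a table of 100,000 patients: true condition and test result.\n",
    "    # Rounding before int() avoids losing a patient to floating-point truncation.\n",
    "    n_d = int(round(prior_prob_disease*100000))   # have the disease\n",
    "    n_nd = 100000 - n_d                           # don't have the disease\n",
    "    n_pos_d = int(round(0.99*n_d))                # true positives (99% of diseased)\n",
    "    n_neg_d = n_d - n_pos_d                       # false negatives\n",
    "    n_pos_nd = int(round(0.005*n_nd))             # false positives (0.5% of healthy)\n",
    "    n_neg_nd = n_nd - n_pos_nd                    # true negatives\n",
    "    condition = np.array(['Disease']*n_d + ['No Disease']*n_nd)\n",
    "    d_test = np.array(['Positive']*n_pos_d + ['Negative']*n_neg_d)\n",
    "    nd_test = np.array(['Positive']*n_pos_nd + ['Negative']*n_neg_nd)\n",
    "    test = np.append(d_test, nd_test)\n",
    "    t = pd.DataFrame(\n",
    "        {'True Condition':condition,\n",
    "        'Test Result':test}\n",
    "    )\n",
    "    return t"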
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Making Decisions\n",
    "A primary use of Bayes' Rule is to make decisions based on incomplete information, incorporating new information as it comes in. This section points out the importance of keeping your assumptions in mind as you make decisions."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Many medical tests for diseases return Positive or Negative results. A Positive result means that according to the test, the patient has the disease. A Negative result means the test concludes that the patient doesn't have the disease. \n",
    "\n",
    "Medical tests are carefully designed to be very accurate. But few tests are accurate 100% of the time. Almost all tests make errors of two kinds:\n",
    "\n",
    "- A **false positive** is an error in which the test concludes Positive but the patient doesn't have the disease.\n",
    "\n",
    "- A **false negative** is an error in which the test concludes Negative but the patient does have the disease.\n",
    "\n",
    "These errors can affect people's decisions. False positives can cause anxiety and unnecessary treatment (which in some cases is expensive or dangerous). False negatives can have even more serious consequences if the patient doesn't receive treatment because of their Negative test result."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## A Test for a Rare Disease\n",
    "Suppose there is a large population and a disease that strikes a tiny proportion of the population. The tree diagram below summarizes information about such a disease and about a medical test for it.\n",
    "\n",
    "![Tree Rare Disease](tree_disease_rare.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Overall, only 4 in 1000 people in the population have the disease. The test is quite accurate: its false positive rate is very small, 5 in 1000, and its false negative rate is somewhat larger but still small, 1 in 100.\n",
    "\n",
    "Individuals might or might not know whether they have the disease; typically, people get tested to find out whether they have it.\n",
    "\n",
    "So **suppose a person is picked at random from the population** and tested. If the test result is Positive, how would you classify them: Disease, or No disease?\n",
    "\n",
    "We can answer this by applying Bayes' Rule and using our \"more likely than not\" classifier. Given that the person has tested Positive, the chance that they have the disease is the proportion in the top branch (Disease and Test Positive), relative to the total proportion in the two Test Positive branches."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.44295302013422816"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "(0.004 * 0.99)/(0.004 * 0.99  +  0.996*0.005 )"
   ]
  },
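  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a sketch of where that expression comes from, here is Bayes' Rule written out with the numbers from the tree. The event notation is added here for illustration and is not used elsewhere in this section.\n",
    "\n",
    "$$\n",
    "P(\\text{Disease} \\mid \\text{Positive}) = \\frac{P(\\text{Disease}) \\cdot P(\\text{Positive} \\mid \\text{Disease})}{P(\\text{Disease}) \\cdot P(\\text{Positive} \\mid \\text{Disease}) + P(\\text{No Disease}) \\cdot P(\\text{Positive} \\mid \\text{No Disease})}\n",
    "$$\n",
    "\n",
    "$$\n",
    "= \\frac{0.004 \\times 0.99}{0.004 \\times 0.99 + 0.996 \\times 0.005} \\approx 0.443\n",
    "$$\n",
    "\n",
    "The numerator is the top branch of the tree. The denominator adds the two Test Positive branches, so it is the overall chance of a Positive result."
   ]
  },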
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Given that the person has tested Positive, the chance that they have the disease is about 44%. Since that is less than 50%, our \"more likely than not\" classifier labels them: No disease.\n",
    "\n",
    "This is a strange conclusion. We have a pretty accurate test, and a person who has tested Positive, and our classification is ... that they **don't** have the disease? That doesn't seem to make any sense.\n",
    "\n",
    "When faced with a disturbing answer, the first thing to do is to check the calculations. The arithmetic above is correct. Let's see if we can get the same answer in a different way.\n",
    "\n",
    "The function `population` returns a table of outcomes for 100,000 patients, with columns that show the `True Condition` and `Test Result`. The test is the same as the one described in the tree. But the proportion who have the disease is an argument to the function.\n",
    "\n",
    "We will call `population` with 0.004 as the argument, and then pivot to cross-classify each of the 100,000 people."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th>Test_Result</th>\n",
       "      <th>Negative</th>\n",
       "      <th>Positive</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>True Condition</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>Disease</th>\n",
       "      <td>4</td>\n",
       "      <td>396</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>No Disease</th>\n",
       "      <td>99102</td>\n",
       "      <td>498</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "Test_Result     Negative  Positive\n",
       "True Condition                    \n",
       "Disease                4       396\n",
       "No Disease         99102       498"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pop_004 = population(0.004)\n",
    "\n",
    "# Rename the column so it can be accessed as an attribute below.\n",
    "pop_004_rename = pop_004.rename(columns={'Test Result':'Test_Result'})\n",
    "\n",
    "# Count the patients in each (True Condition, Test Result) combination.\n",
    "pop_004_pivot = pop_004_rename.groupby(['True Condition', 'Test_Result']).Test_Result.agg('count').to_frame('count').reset_index()\n",
    "\n",
    "# Cross-classify: one row per true condition, one column per test result.\n",
    "pd.pivot_table(pop_004_pivot, values='count', index=['True Condition'], columns=['Test_Result'], aggfunc='sum').fillna(0)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The cells of the table have the right counts. For example, according to the description of the population, 4 in 1000 people have the disease. There are 100,000 people in the table, so 400 should have the disease. That's what the table shows: 4 + 396 = 400. Of these 400, 99% get a Positive test result: 0.99 x 400 = 396.\n",
    "\n",
    "Among the Positives, the proportion that have the disease is:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.4429530201342282"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "396/(396 + 498)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "That's the answer we got by using Bayes' Rule. The counts in the Positives column show why it is less than 1/2. Among the Positives, more people **don't** have the disease than do have the disease. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The reason is that a huge fraction of the population doesn't have the disease in the first place. The tiny fraction of them who falsely test Positive still outnumber the people who correctly test Positive. This is easier to visualize in the tree diagram:\n",
    "\n",
    "![Tree Rare Disease](tree_disease_rare.png)\n",
    "\n",
    "- The proportion of true Positives is a large fraction (0.99) of a tiny fraction (0.004) of the population.\n",
    "- The proportion of false Positives is a tiny fraction (0.005) of a large fraction (0.996) of the population.\n",
    "\n",
    "These two proportions are comparable (0.00396 and 0.00498); the second is a little larger.\n",
    "\n",
    "So, given that the randomly chosen person tested positive, we were right to classify them as more likely than not to **not** have the disease."
   ]
  },
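  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The comparison between the two Test Positive branches can be checked with a few lines of arithmetic. This is a minimal sketch: the rates come from the tree diagram, and the variable names are made up for this illustration."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Illustrative sketch; these variable names are not used elsewhere in the notebook.\n",
    "prior = 0.004                  # P(Disease), from the tree diagram\n",
    "p_pos_given_disease = 0.99     # 1 minus the false negative rate\n",
    "p_pos_given_healthy = 0.005    # false positive rate\n",
    "\n",
    "# A large fraction of a tiny fraction: P(Disease and Positive)\n",
    "true_pos = prior * p_pos_given_disease\n",
    "# A tiny fraction of a large fraction: P(No Disease and Positive)\n",
    "false_pos = (1 - prior) * p_pos_given_healthy\n",
    "\n",
    "print(f'True positives:  {true_pos:.5f}')\n",
    "print(f'False positives: {false_pos:.5f}')\n",
    "print(f'P(Disease | Positive): {true_pos / (true_pos + false_pos):.3f}')"
   ]
  },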
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## A Subjective Prior\n",
    "Being right isn't always satisfying. Classifying a Positive patient as not having the disease still seems somehow wrong, for such an accurate test. Since the calculations are right, let's take a look at the basis of our probability calculation: the assumption of randomness.\n",
    "\n",
    "Our assumption was that a randomly chosen person was tested and got a Positive result. But this doesn't happen in reality. People go in to get tested because they think they might have the disease, or because their doctor thinks they might have the disease. **People getting tested are not randomly chosen members of the population.**\n",
    "\n",
    "That is why our intuition about people getting tested was not fitting well with the answer that we got. We were imagining a realistic situation of a patient going in to get tested because there was some reason for them to do so, whereas the calculation was based on a randomly chosen person being tested.\n",
    "\n",
    "So let's redo our calculation under the more realistic assumption that the patient is getting tested because the doctor thinks there's a chance the patient has the disease.\n",
    "\n",
    "Here it's important to note that \"the doctor thinks there's a chance\" means that the chance is the doctor's opinion, not the proportion in the population. It is called a *subjective probability*. In our context of whether or not the patient has the disease, it is also a *subjective prior* probability.\n",
    "\n",
    "Some researchers insist that all probabilities must be relative frequencies, but subjective probabilities abound. The chance that a candidate wins the next election, the chance that a big earthquake will hit the Bay Area in the next decade, the chance that a particular country wins the next soccer World Cup: none of these are based on relative frequencies or long run frequencies. Each one contains a subjective element. All calculations involving them thus have a subjective element too."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Suppose the doctor's subjective opinion is that there is a 5% chance that the patient has the disease. Then just the prior probabilities in the tree diagram will change:\n",
    "\n",
    "![Tree: Subjective Prior](tree_disease_subj.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Given that the patient tests Positive, the chance that they have the disease is given by Bayes' Rule."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.9124423963133641"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "(0.05 * 0.99)/(0.05 * 0.99  +  0.95 * 0.005)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The effect of changing the prior is stunning. Even though the doctor has a pretty low prior probability (5%) that the patient has the disease, once the patient tests Positive the posterior probability of having the disease shoots up to more than 91%. \n",
    "\n",
    "If the patient tests Positive, it would be reasonable for the doctor to proceed as though the patient has the disease."
   ]
  },
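  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To get a feel for how strongly the posterior depends on the prior, one option is to wrap the Bayes' Rule calculation in a function and evaluate it at several priors. The sketch below assumes the same test as in the tree diagrams; the function name `posterior_given_positive` and the list of priors are hypothetical choices made for this illustration."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "def posterior_given_positive(prior, sensitivity=0.99, false_positive_rate=0.005):\n",
    "    # P(Disease | Positive) by Bayes' Rule; accepts a float or a NumPy array.\n",
    "    true_pos = prior * sensitivity\n",
    "    false_pos = (1 - prior) * false_positive_rate\n",
    "    return true_pos / (true_pos + false_pos)\n",
    "\n",
    "# Includes the population proportion (0.004) and the doctor's prior (0.05).\n",
    "priors = np.array([0.004, 0.01, 0.05, 0.1, 0.5])\n",
    "posteriors = posterior_given_positive(priors)\n",
    "\n",
    "for p, post in zip(priors, posteriors):\n",
    "    print(f'prior = {p:.3f}  ->  posterior = {post:.3f}')"
   ]
  },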
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Confirming the Answer\n",
    "Though the doctor's opinion is subjective, we can generate an artificial population in which 5% of the people have the disease and are tested using the same test. Then we can count people in different categories to see if the counts are consistent with the answer we got by using Bayes' Rule.\n",
    "\n",
    "We can use `population(0.05)` and `pd.pivot_table` to construct the corresponding population and look at the counts in the four cells."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th>Test_Result</th>\n",
       "      <th>Negative</th>\n",
       "      <th>Positive</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>True Condition</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>Disease</th>\n",
       "      <td>50</td>\n",
       "      <td>4950</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>No Disease</th>\n",
       "      <td>94525</td>\n",
       "      <td>475</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "Test_Result     Negative  Positive\n",
       "True Condition                    \n",
       "Disease               50      4950\n",
       "No Disease         94525       475"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pop_05 = population(0.05)\n",
    "\n",
    "# Rename the column so it can be accessed as an attribute below.\n",
    "pop_05_rename = pop_05.rename(columns={'Test Result':'Test_Result'})\n",
    "\n",
    "# Count the patients in each (True Condition, Test Result) combination.\n",
    "pop_05_pivot = pop_05_rename.groupby(['True Condition', 'Test_Result']).Test_Result.agg('count').to_frame('count').reset_index()\n",
    "\n",
    "# Cross-classify: one row per true condition, one column per test result.\n",
    "pd.pivot_table(pop_05_pivot, values='count', index=['True Condition'], columns=['Test_Result'], aggfunc='sum').fillna(0)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In this artificially created population of 100,000 people, 5000 people (5%) have the disease, and 99% of them test Positive, leading to 4950 true Positives. Compare this with 475 false Positives: among the Positives, the proportion that have the disease is the same as what we got by Bayes' Rule."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.9124423963133641"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "4950/(4950 + 475)"
   ]
  },
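  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Rather than typing the counts by hand, the same proportion can be read off the table by dividing each column by its total. The sketch below rebuilds the 5% table from the counts shown above so that it runs on its own; the names `counts`, `col_props`, and `prop_disease_among_pos` are assumptions made for this illustration."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "# Counts copied from the table for the 5% prior (100,000 people in total).\n",
    "counts = pd.DataFrame(\n",
    "    {'Negative': [50, 94525], 'Positive': [4950, 475]},\n",
    "    index=pd.Index(['Disease', 'No Disease'], name='True Condition'),\n",
    ")\n",
    "\n",
    "# Divide each column by its total to get the proportions\n",
    "# of each true condition among people with that test result.\n",
    "col_props = counts / counts.sum(axis=0)\n",
    "\n",
    "# Proportion with the disease among those who tested Positive.\n",
    "prop_disease_among_pos = col_props.loc['Disease', 'Positive']\n",
    "print(prop_disease_among_pos)"
   ]
  },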
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Because we can generate a population that has the right proportions, we can also use simulation to confirm that our answer is reasonable. The table `pop_05` contains a population of 100,000 people generated with the doctor's prior disease probability of 5% and the error rates of the test. We take a simple random sample of size 10,000 from the population, and extract the table `positive` consisting only of those in the sample that had Positive test results."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [],
   "source": [
    "pop_05 = population(0.05)\n",
    "\n",
    "sample = pop_05.sample(10000, replace=False)\n",
    "\n",
    "positive = sample[sample['Test Result'] == 'Positive']"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Among these Positive results, what proportion were true Positives? That's the proportion of Positives that had the disease:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.9171075837742504"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "len(positive[positive['True Condition'] == 'Disease'])/len(positive)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Run the two cells a few times and you will see that the proportion of true Positives among the Positives hovers around the value of 0.912 that we calculated by Bayes' Rule.\n",
    "\n",
    "You can also use the `population` function with a different argument to change the prior disease probability and see how the posterior probabilities are affected."
   ]
  }
 ],
 "metadata": {
  "anaconda-cloud": {},
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}