Bayesian Math for Dummies

Steve Miller wrote an article a couple weeks ago on using Bayesian statistics for risk management. He describes his friend receiving a positive test for a serious medical condition and being worried. He then goes on to show why his friend needn’t be worried, because statistically there was a low probability of actually having the condition, even with the positive test.

Understanding risk is an interest of mine, and while I’ve read articles about Bayesian math in the past, the math is above my head. I never studied statistics, nor do I plan to. But I am interested in the concepts behind statistics, so I can understand probabilities better. And I can do basic math. Steve’s article was dense with math I didn’t quite get, but I was able to translate it into something I could understand.

So now, for statistically challenged individuals, I present my translation of Steve’s calculations: Bayesian math for dummies.

The Problem

Steve’s friend received a positive test for a disease. The disease occurs infrequently in the general population. The test accurately identifies people who have the disease, but gives a false positive in 1 out of 20 tests of people who don’t have it, or 5% of the time. Should Steve’s friend be worried by his positive result?

In the example, we know four facts:

  • Overall Incidence Rate
    The disease occurs in 1 in 1,000 people, regardless of the test results.
  • True Positive Rate
    99% of people with the disease have a positive test.
  • False Positive Rate
    5% of people without the disease also have a positive test.
  • Present Condition
    Steve’s friend had a positive test.

The question is, given this information, what is the chance that Steve’s friend has the disease?

Before he had the test, we’d just use the overall incidence rate, since we have no other information. Thus, his chance would be 1 / 1000 = 0.1%. Given that he’s received a positive test result, the True Positive Rate of 99% looks scary and a 5% False Positive Rate sounds too small to matter. But what are his actual chances of having the disease?

The Long Way

Bayesian math presents an elegant way to calculate the chance Steve’s friend has the disease, and Steve shows the math in his article. But let’s do it the long way, which is much easier for me to understand.

To gain an intuitive understanding of the problem, I translated the abstract probabilities into actual numbers of people. This lets us normalize the percentage rates so we can compare them. Because while it sounds like we can compare the Overall Incidence Rate, True Positive Rate and False Positive Rate of 0.1%, 99% and 5%, each of these rates applies to a different-sized group. And, as we’ll see, the size of the group a rate applies to makes all the difference.

For these calculations, we’re going to look at a population of 100,000 people, all of whom we’ll assume took the test. Out of those people, how many have the disease and how many don’t?

100,000 people total
100 have the disease (1 in 1,000 or 0.1%)
99,900 don’t have the disease

Okay, 100 people have the disease. How many of these people tested positive or negative? Remember that we know that 99% of the people who have the disease test positive.

100 have the disease
99 test positive (99%)
1 tests negative (1%)

Out of the 99,900 people who don’t have the disease, how many tested positive or negative? Remember that 5% of those who don’t have the disease test positive anyway.

99,900 don’t have the disease
4,995 test positive (5%)
94,905 test negative (95%)

Now is where it gets interesting. How many people tested positive versus negative in our entire group?

100,000 people total
5,094 test positive (99 + 4,995)
94,906 test negative (1 + 94,905)

So 5,094 people tested positive, but we know only 99 of those actually have the disease. The probability of actually having the disease if you test positive is then:

99 tested positive and have the disease
5,094 tested positive in total
99 / 5,094 = 1.94% chance of having the disease if you tested positive

Which is the same result Steve arrived at, though he got there much more quickly with Bayesian math.
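If it helps to see that counting spelled out, here’s a minimal sketch of the same arithmetic in Python (the variable names are my own, not Steve’s):

population = 100_000
incidence_rate = 0.001      # 1 in 1,000 have the disease
true_positive_rate = 0.99   # 99% of people with the disease test positive
false_positive_rate = 0.05  # 5% of people without the disease test positive anyway

have_disease = population * incidence_rate                  # 100 people
no_disease = population - have_disease                      # 99,900 people

sick_and_positive = have_disease * true_positive_rate       # 99 people
healthy_and_positive = no_disease * false_positive_rate     # 4,995 people

total_positive = sick_and_positive + healthy_and_positive   # 5,094 people
print(f"{sick_and_positive / total_positive:.2%}")          # prints 1.94%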

The Short Way

For those who want a shortcut to arriving at this conclusion, I’ve translated Steve’s equation below.

Incidence_Rate * True_Positive_Rate
-------------------------------------------------------------------------------------------
( True_Positive_Rate * Incidence_Rate ) + ( False_Positive_Rate * ( 1 - Incidence_Rate ) )

Or, with the numbers from this example plugged in:

0.001 * 0.99
-------------------------------------------
( 0.99 * 0.001 ) + ( 0.05 * ( 1 - 0.001 ) )

Which works out to 0.00099 / 0.05094, the same result: 1.94%.
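For anyone who’d rather plug numbers into code than into an equation, the same formula could be wrapped in a small function. This is just a sketch in Python, and the function and argument names are mine:

def chance_given_positive(incidence_rate, true_positive_rate, false_positive_rate):
    # Bayes' theorem: chance of actually having the condition, given a positive test
    numerator = incidence_rate * true_positive_rate
    denominator = (true_positive_rate * incidence_rate
                   + false_positive_rate * (1 - incidence_rate))
    return numerator / denominator

# The disease example from above
print(f"{chance_given_positive(0.001, 0.99, 0.05):.2%}")  # prints 1.94%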

Business Applications

With this small understanding, what can you do with Bayesian math?

Let’s try a business example. Suppose you’ve been doing sales demos and you’re trying to determine how effective they are at closing business. Let’s say your close rate is 10%. You discover that 80% of buyers received a demo and only 20% of non-buyers received a demo. Clear and convincing evidence that demos work, right?

So what is the chance of someone buying if they see a demo? Let’s plug in the numbers:

0.10 * 0.80
-------------------------------------------
( 0.80 * 0.10 ) + ( 0.20 * ( 1 - 0.10 ) )

The result: only a 30.8% chance, meaning slightly fewer than 1 in 3 people who see the demo will buy.
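Here’s the same arithmetic for the demo example, again just a sketch with my own variable names:

close_rate = 0.10         # the "incidence rate": 10% of prospects buy
demo_given_buy = 0.80     # the "true positive rate": 80% of buyers saw a demo
demo_given_no_buy = 0.20  # the "false positive rate": 20% of non-buyers saw a demo

chance_buy_given_demo = (close_rate * demo_given_buy) / (
    demo_given_buy * close_rate + demo_given_no_buy * (1 - close_rate))

print(f"{chance_buy_given_demo:.1%}")  # prints 30.8%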

That’s it for now. Tell me how you’re using Bayesian math in your business, or your ideas on how to apply this in the comments below.

20 comments

  1. Keith McCormick says:

I thought this was extremely easy to read. I have been looking for ways to get the groups I train warmed up for understanding Bayesian Networks. I will certainly move on to the original article as well. Thanks!

  2. Scott Mutchler says:

The 5% false positive rate is the key factor here. Steve probably assumed that only 5% of the positive results (like his) were incorrect, not 5% of all the tests given.

  3. Tom says:

    This is not what the numbers tell us. Steve has a 1 in 20 chance or a 95% chance of having the disease. No doubt. He would not have been given the test unless someone already hypothesized that he had it or he would not have had it. 1 in 20 is the ratio of people that had the test that didn’t have it. You and I have a 1.94% chance of having the disease if we have yet to be tested.

    1. trevor says:

      Steve’s friend does not have a 95% chance of having the disease. This is a common mistake people make, which was the point of Steve’s article. A 5% false positive rate tells you nothing about your chances of having the disease if you have a positive test.

The assumption in the math above is that all 100,000 people are tested. The 1.94% is the chance of actually having the disease if that test comes back positive, not of having the disease with no test at all. The chance of having the disease before you are tested is 0.1%, the incidence rate of the disease.

A positive test raises your chances of having the disease by about 20 times. But the absolute chance is still small.

  4. Charles says:

    You err when you plug numbers in the short way example. The true positive rate in the denominator is .95, not the .99 shown in your example.

    1. trevor says:

      No, the math is correct. The true positive rate has been given as 99% or 0.99. The percentage of people who don’t receive a false positive is 95%, but this is unrelated to the true positive rate and isn’t used in the calculations above.

  5. Charles says:

Someone needs a math coach. 30.8% is not slightly more than 1 in 3. 1 in 3 equals 33.33%. Who is the dummy?

    1. trevor says:

      To calculate your odds, you divide 100% by your probability. Thus 100% / 30.8% is 1 in 3.25, slightly more than 1 in 3.

  6. no says:

    one in three is 33.3%. one in four is 25%. one in four buying is less than one in three buying.

Your math concerning “odds” is correct, but you stated quite simply
    “slightly more than 1 in 3 will buy”

    This is incorrect.

    Slightly fewer than 1 in 3 will buy.
    But the odds are more, namely 1 in 3.25.

    1. trevor says:

      You are correct. I was wrong here. I just updated the text to reflect that fewer than 1 in 3 will buy. Thanks.

  7. alex says:

Excellent article, thanks. Came across the concept of Bayesian statistics in a book by Daniel Kahneman (Thinking, Fast and Slow), who gave an example of ‘incorrect thinking’ and the correct answer, but did not give the worked example. Substituting the figures into the formula, the answer is as per his book. Without the formula, applying what I thought was logical, I was about 5% out.

  8. Liz says:

    Thank you for this. I am a researcher with a basic knowledge of stats needing to learn some specialized advanced stats independently of classes, and this helped my understanding of Bayesian Nets immensely.

  9. Esa says:

Studying philosophy as a hobby, I bumped into Bayes’ theorem and it has haunted me for weeks. Your explanation helped clear up what Richard Carrier’s site messed up in my head. For comparison go to http://www.richardcarrier.info/CarrierDec08.pdf

    1. trevor says:

      Wow, thanks. Richard’s paper definitely is dense, though it looks like it has lots of valuable information. I’ll have to dig through it sometime and see what I can understand.

  10. txe says:

Beautiful explanation, seriously, but I think I’ve found something to fix. Please correct me where I’m wrong. The way I read it, in a population of 100,000 people, 1% is not 100 but 1,000. So from that point on all the figures change, namely:
    1000 have the disease
    990 test positive (99%)
    10 test negative (1%)

99000 don’t have the disease
    4950 test positive (5%)
    94050 test negative (95%)

    so out of 100,000 people total
    5940 test +
    94060 test –

    990 tested positive and have the disease
    5940 tested positive in total. So
    990/5940=0.166666=16% chance of having disease if you tested positive.

    1. trevor says:

In the example given in the article, the rate of people who have the disease is 1 in 1,000 or 0.1%, not 1.0%, so the number of people with the disease out of 100,000 is 100, not 1,000.

      1. txe says:

Hi, yep. Now I got it, thank you so much. I was just wrong, I messed up the numbers.

  11. mike durham says:

Since you’ve used business and medical problems as models, I’ve wondered about doing the same for searching for runaways, finding missing persons, determining leads for unidentified bodies, and solving currently unsolved homicides. I am not a math whiz in any sense of the word, but I’ve found that these areas of human behavior need analysis to improve the likelihood of a positive or logical outcome.

However, in each of these areas, as in the search for the French airliner that went down in the Atlantic, false positives, incidence rates, and true positives are not always accurately reported. What would the equation look like for multiples? In the French airliner problem, I believe a grid was developed where only one incident and incidence rate was positive and the remainder of the grid cells were negative. How would this issue be handled in the equation?

    1. trevor says:

I’m not an expert in Bayesian statistics, so I can’t speak to its application in the areas you mentioned.

      One thing to keep in mind with all statistics is that you need to break the problem down in such a way that a) you have multiple comparables and b) you can get accurate data on those comparables. Otherwise you have nothing to calculate statistics on. How you choose those comparables can be just as important as how you calculate the statistics.

      For instance, take the case of an unsolved murder. You might have someone who was murdered with a knife in L.A. in the afternoon in the park. Are your comparables all the other people murdered with a knife in L.A. in the afternoon in the park? Or are they just people murdered with a knife in the park, anywhere at any time? Or is the time more important?

If you define the problem too narrowly, you don’t have a big enough sample size to produce useful statistics. If you define it using the wrong dimensions, you might wind up with useless or misleading statistics.

      For all statistics, defining the problem becomes key to whether the statistics have a practical application. And often better solutions come from reframing the problem in a new way.

  12. Mike Durham says:

Thanks, Trevor, for your March 31 response. Unfortunately, I’ve acquired more data, which has helped me define a portion of the problem I’m working on. The volume of incidents is extremely low at this point. However, the statement of the problem is getting clearer. For instance, in the knifing-in-the-park scenario you posited, time of day is generally related, but geographic characteristics are extremely relevant. Time of day then becomes more relevant as a piece of data.

1 ping

  1. Matemáticas Bayesianas para Dummies. | ungatosinbotas says:

[…] searching around, I found the article by T. Lohrbeer who, at the same point as me, explains in simple terms an article by Steve Miller in which he […]
