Steve Miller wrote an article a couple of weeks ago on using Bayesian statistics for risk management. He describes a friend receiving a positive test for a serious medical condition and being worried. He then goes on to show why his friend needn’t be worried, because statistically there was a low probability of actually having the condition, even with the positive test.
Understanding risk is an interest of mine, and while I’ve read articles about Bayesian math in the past, the math is above my head. I never studied statistics, nor do I plan to. But I am interested in the concepts behind statistics, so I can understand probabilities better. And I can do basic math. Steve’s article was dense with math I didn’t quite get, but I was able to translate it into something I could understand.
So now, for statistically challenged individuals, I present my translation of Steve’s calculations: Bayesian math for dummies.
The Problem
Steve’s friend received a positive test for a disease. The disease occurs infrequently in the general population. The test accurately identifies people who have the disease, but gives false positives in 1 out of 20 tests, or 5% of the time. Should Steve’s friend be worried by his positive result?
In the example, we know four facts:
- Overall Incidence Rate: The disease occurs in 1 in 1,000 people, regardless of the test results.
- True Positive Rate: 99% of people with the disease have a positive test.
- False Positive Rate: 5% of people without the disease also have a positive test.
- Present Condition: Steve’s friend had a positive test.
The question is: given this information, what is the chance that Steve’s friend has the disease?
Before he had the test, we’d just use the overall incidence rate, since we have no other information. Thus, his chance would be 1 / 1000 = 0.1%. Given that he’s received a positive test result, the True Positive Rate of 99% looks scary and a 5% False Positive Rate sounds too small to matter. But what are his actual chances of having the disease?
The Long Way
Bayesian math presents an elegant way to calculate the chance Steve’s friend has the disease. Steve presents the math in his article. But let’s do it the long way, which is much easier for me to understand.
To gain an intuitive understanding of the problem, I translated the abstract probabilities into actual numbers of people. This lets us normalize the percentage rates so we can compare them. While it sounds like we can compare the Overall Incidence Rate, True Positive Rate and False Positive Rate of 0.1%, 99% and 5%, each of these rates applies to a different-sized group. And, as we’ll see, the size of the group a rate applies to makes all the difference.
For these calculations, we’re going to look at a population of 100,000 people, all of whom we’re going to assume took the test. Out of those people, how many have the disease and how many don’t?
100,000 | people total
    100 | have the disease (1 in 1,000, or 0.1%)
 99,900 | don’t have the disease
Okay, 100 people have the disease. How many of them tested positive or negative? Remember that 99% of people who have the disease test positive.
100 | have the disease
 99 | test positive (99%)
  1 | test negative (1%)
Out of the 99,900 people who don’t have the disease, how many tested positive or negative? Remember that 5% of those who don’t have the disease test positive anyway.
99,900 | don’t have the disease
 4,995 | test positive (5%)
94,905 | test negative (95%)
Now is where it gets interesting. How many people tested positive versus negative in our entire group?
100,000 | people total
  5,094 | test positive
 94,906 | test negative
So 5,094 people tested positive, but we know only 99 of those actually have the disease. The probability of actually having the disease if you test positive is then:
   99 | tested positive, and have the disease
5,094 | tested positive in total
1.94% | chance of having the disease if you tested positive
Which is the same result Steve arrived at, though with the much quicker Bayesian math.
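If you prefer code to tables, here’s a minimal Python sketch of the same long-way counting. The variable names are mine, not Steve’s:

```python
# The long way: translate the rates into actual numbers of people.
population = 100_000
incidence_rate = 0.001      # 1 in 1,000 have the disease
true_positive_rate = 0.99   # 99% of the sick test positive
false_positive_rate = 0.05  # 5% of the healthy test positive anyway

sick = population * incidence_rate                # 100 people
healthy = population - sick                       # 99,900 people

sick_positive = sick * true_positive_rate         # 99 test positive
healthy_positive = healthy * false_positive_rate  # 4,995 test positive

total_positive = sick_positive + healthy_positive  # 5,094 positive tests
print(f"{sick_positive / total_positive:.2%}")     # 1.94%
```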
The Short Way
For those who want a shortcut to arriving at this conclusion, I’ve translated Steve’s equation below.
                      Incidence_Rate * True_Positive_Rate
------------------------------------------------------------------------------------------
( True_Positive_Rate * Incidence_Rate ) + ( False_Positive_Rate * ( 1 - Incidence_Rate ) )
Or, with the numbers from this example plugged in:
                0.001 * 0.99
---------------------------------------------
( 0.99 * 0.001 ) + ( 0.05 * ( 1 - 0.001 ) )
Which comes out to the same result: 1.94%.
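And here’s the shortcut wrapped up as a small Python function, so you can plug in your own rates. The function name chance_given_positive is just something I made up for illustration:

```python
def chance_given_positive(incidence_rate, true_positive_rate, false_positive_rate):
    """Bayes' rule: chance of having the condition given a positive test."""
    true_positives = true_positive_rate * incidence_rate
    false_positives = false_positive_rate * (1 - incidence_rate)
    return true_positives / (true_positives + false_positives)

# Steve's example: 0.1% incidence, 99% true positive rate, 5% false positive rate
print(f"{chance_given_positive(0.001, 0.99, 0.05):.2%}")  # 1.94%
```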
Business Applications
With this small understanding, what can you do with Bayesian math?
Let’s try a business example. Suppose you’ve been doing sales demos and you’re trying to determine how effective they are at closing business. Let’s say your close rate is 10%. You discover that 80% of buyers received a demo and only 20% of non-buyers received a demo. Clear and convincing evidence that demos work, right?
So what is the chance of someone buying if they see a demo? Let’s plug in the numbers:
                0.10 * 0.80
---------------------------------------------
( 0.80 * 0.10 ) + ( 0.20 * ( 1 - 0.10 ) )
The result: only a 30.8% chance, or slightly fewer than 1 in 3 people who see the demo will buy.
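Plugging the demo numbers into the same arithmetic in Python (the variable names are again my own invention):

```python
# Hypothetical demo example: does seeing a demo predict buying?
close_rate = 0.10            # overall "incidence": 10% of prospects buy
demo_given_buyer = 0.80      # "true positive" rate: 80% of buyers saw a demo
demo_given_non_buyer = 0.20  # "false positive" rate: 20% of non-buyers saw a demo

buy_given_demo = (demo_given_buyer * close_rate) / (
    demo_given_buyer * close_rate
    + demo_given_non_buyer * (1 - close_rate)
)
print(f"{buy_given_demo:.1%}")  # 30.8%
```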
That’s it for now. Tell me how you’re using Bayesian math in your business, or your ideas on how to apply this in the comments below.
20 comments
Keith McCormick says:
December 30, 2010 at 11:29 am (UTC -5)
I thought this was extremely easy to read. I have been looking for ways to get the groups I train warmed up for understanding Bayesian Networks. I will certainly move on to the original article as well. Thanks!
Scott Mutchler says:
February 16, 2011 at 11:22 am (UTC -5)
The 5% false positive rate is the key factor here. Steve probably assumed that only 5% of the positive results (like his) were incorrect, not 5% of all the tests given.
Tom says:
November 11, 2011 at 12:04 am (UTC -5)
This is not what the numbers tell us. Steve has a 1 in 20 chance or a 95% chance of having the disease. No doubt. He would not have been given the test unless someone already hypothesized that he had it or he would not have had it. 1 in 20 is the ratio of people that had the test that didn’t have it. You and I have a 1.94% chance of having the disease if we have yet to be tested.
trevor says:
November 11, 2011 at 3:04 pm (UTC -5)
Steve’s friend does not have a 95% chance of having the disease. This is a common mistake people make, which was the point of Steve’s article. A 5% false positive rate tells you nothing about your chances of having the disease if you have a positive test.
The assumption in the math above is that all 100,000 people are tested. The 1.94% is the chance of actually having the disease if that test turns out positive, not the chance of having the disease with no test at all. The chance of having the disease before you are tested is 0.1%, the incidence rate of the disease.
A positive test raises your chances of having the disease by about 20 times. But the absolute chance is still small.
Charles says:
February 6, 2012 at 8:09 pm (UTC -5)
You err when you plug numbers in the short way example. The true positive rate in the denominator is .95, not the .99 shown in your example.
trevor says:
February 13, 2012 at 10:17 am (UTC -5)
No, the math is correct. The true positive rate has been given as 99% or 0.99. The percentage of people who don’t receive a false positive is 95%, but this is unrelated to the true positive rate and isn’t used in the calculations above.
Charles says:
February 10, 2012 at 8:24 pm (UTC -5)
Someone needs a math coach. 30.8% is not slightly more than 1 in 3. 1 in 3 equals 33.33%. Who is the dummy?
trevor says:
February 13, 2012 at 10:18 am (UTC -5)
To calculate your odds, you divide 100% by your probability. Thus 100% / 30.8% is 1 in 3.25, slightly more than 1 in 3.
no says:
March 15, 2012 at 1:33 am (UTC -5)
one in three is 33.3%. one in four is 25%. one in four buying is less than one in three buying.
Your math concerning “odds” is correct, but you stated quite simply
“slightly more than 1 in 3 will buy”
This is incorrect.
Slightly fewer than 1 in 3 will buy.
But the odds are more, namely 1 in 3.25.
trevor says:
March 17, 2012 at 10:35 pm (UTC -5)
You are correct. I was wrong here. I just updated the text to reflect that fewer than 1 in 3 will buy. Thanks.
alex says:
July 11, 2012 at 5:13 am (UTC -5)
Excellent article, thanks. I came across the concept of Bayesian statistics in a book by Daniel Kahneman (Thinking, Fast and Slow), which gave an example of “incorrect thinking” and the correct answer, but did not give the worked example. Substituting the figures into the formula, the answer is as per his book. Without the formula, applying what I thought would be logical, I was about 5% out.
Liz says:
August 26, 2012 at 10:18 am (UTC -5)
Thank you for this. I am a researcher with a basic knowledge of stats needing to learn some specialized advanced stats independently of classes, and this helped my understanding of Bayesian Nets immensely.
Esa says:
October 11, 2012 at 4:37 pm (UTC -5)
Studying philosophy as a hobby I bumped into Bayes’ theorem and it has haunted me for weeks. Your explanation helped clear out what Richard Carrier’s site messed up in my head. For comparison go to http://www.richardcarrier.info/CarrierDec08.pdf
trevor says:
October 11, 2012 at 9:45 pm (UTC -5)
Wow, thanks. Richard’s paper definitely is dense, though it looks like it has lots of valuable information. I’ll have to dig through it sometime and see what I can understand.
txe says:
March 15, 2014 at 11:40 am (UTC -5)
Beautiful explanation, seriously, but I think I still found something to fix. Please correct me where I’m wrong. The way I see it, in a population of 100,000 people, 1% is not 100 but 1,000. So from that point on all the figures change, namely:
1000 have the disease
990 test positive (99%)
10 test negative (1%)
99000 don’t have the disease
4950 test positive (5%)
94050 test negative (95%)
so out of 100,000 people total
5940 test +
94060 test –
990 tested positive and have the disease
5940 tested positive in total. So
990/5940=0.166666=16% chance of having disease if you tested positive.
trevor says:
March 16, 2014 at 6:02 pm (UTC -5)
In the example given in the article, the rate of people who have the disease is 1 in 1,000, or 0.1%, not 1.0%, so the number of people out of 100,000 is 100, not 1,000.
txe says:
March 27, 2014 at 2:43 pm (UTC -5)
Hi, yeap. Now I got it, thank you so much. I was just wrong, messed up with numbers.
mike durham says:
March 30, 2014 at 10:25 pm (UTC -5)
Since you’ve used business and medical problems as models, I’ve wondered about doing the same for searching for runaways, finding missing persons, determining leads for unidentified bodies, and solving currently unsolved homicides. I am not a math whiz in any sense of the word. I’ve found, though, that these areas of human behavior need analysis to advance the likelihood of a positive or logical outcome.
However, in each of the areas, like in the search for the French airliner that went down in the Atlantic, a false positive, incidence rates, and true positives are not always accurately reported. What would the equation look like for multiples? In the French airliner problem, I believe a grid was developed where only one incident and incident rate was positive and the remainder of the grid cells was negative. How would this issue be handled in the equation?
trevor says:
March 31, 2014 at 9:38 am (UTC -5)
I’m not an expert in Bayesian statistics, so I can’t speak to its application in the areas you mentioned.
One thing to keep in mind with all statistics is that you need to break the problem down in such a way that a) you have multiple comparables and b) you can get accurate data on those comparables. Otherwise you have nothing to calculate statistics on. How you choose those comparables can be just as important as how you calculate the statistics.
For instance, take the case of an unsolved murder. You might have someone who was murdered with a knife in L.A. in the afternoon in the park. Are your comparables all the other people murdered with a knife in L.A. in the afternoon in the park? Or are they just people murdered with a knife in the park, anywhere at any time? Or is the time more important?
If you define too narrowly, you don’t have a big enough sample size to produce useful statistics. If you define using the wrong dimensions, you might wind up with useless or misleading statistics.
For all statistics, defining the problem becomes key to whether the statistics have a practical application. And often better solutions come from reframing the problem in a new way.
Mike Durham says:
August 21, 2014 at 5:45 pm (UTC -5)
Thanks, Trevor, for your March 31 response. Unfortunately, I’ve acquired more data, which has helped me define a portion of the problem I’m working on. The volume of incidences is extremely low at this point. However, the statement of the problem is getting clearer. For instance, in the knifing-in-the-park scenario you posited, time of day is generally related, but geographic characteristics are extremely relevant. Time of day then becomes a more relevant piece of data.