Tooth Fairy Science and Other Pitfalls: Applying Rigorous Science to Messy Medicine, Part 1
August 24, 2012
Part 1 of Harriet Hall, MD’s presentation from the 2009 Skeptic’s Toolbox conference.
Many of you know I wrote this book, entitled “Women Aren’t Supposed to Fly: The Memoirs of a Female Flight Surgeon.” I thought I would share with you the REAL reason women aren’t supposed to fly.
That’s her purse hanging from the instrument panel.
This workshop is called the Skeptic’s Toolbox, and every year we talk about what makes some people skeptics and others not. You can take two people who believe in dowsing, explain the ideomotor effect, and show how dowsing has consistently failed every properly controlled test. One person will accept the evidence and stop believing in dowsing. The other will ignore the evidence and continue to believe. Why is that? Bob Carroll, author of The Skeptic’s Dictionary, has said that critical thinking is an unnatural act. Science doesn’t come naturally.
Jerry Andrus used to come to the Toolbox and show us his close-up magic tricks and optical illusions. He would tell us, “The reason I can fool you is that you have a wonderful brain.” Our brain takes the odd contraption on the right and, seen from another point of view on the left, assembles it into a nonexistent, impossible box.
Psychiatrist Morgan Levy said, “‘Thinking like a human’ is not a logical way to think, but it is not a stupid way to think either. You could say that our thinking is intelligently illogical. Millions of years of evolution did not result in humans that think like a computer. It is precisely because we think in an intelligently illogical way that our predecessors were able to survive.”
He also said “Scientists expend an enormous amount of time and energy going to school in order to learn how to undo the effects of evolution so that they can investigate natural phenomena in a logical way.” Education helps, but it isn’t enough. We all know highly educated people who are not skeptics.
Ray Hyman has suggested that skeptics are mutants. Has something new evolved in our brains to help us overcome our intelligently illogical thinking processes? Well, Ray, I’m here to tell you you are right. Skeptics ARE mutants, and I have proof.
This is a picture of my daughter Kimberly.
(Title Slide again). But to get back to the subject of my talk… I’m going to explain some of the pitfalls we encounter when we apply the scientific method to clinical medicine. I’m going to try to give you a feel for how I evaluate a study to see if it is credible. I’m going to suggest some questions you may want to ask the next time you visit your doctor. This is the Skeptic’s Toolbox, and I’m hoping to offer you some tools so that the next time you see a report in the media claiming that broccoli causes cancer you will have a better handle on how to evaluate the report, what questions to ask, and how to decide whether you should immediately stop eating broccoli.
Before we had science we had to rely on these things. Folk remedies, old wives’ tales, herb women, witch doctors, superstitions. The plague doctor in the upper left picture wore a beak-like mask stuffed with herbs in the belief that it would keep him from catching the disease. Instead of scientific trials we had only case reports and testimonials – for centuries we used fleams like the one in the upper right picture for bloodletting and we kept doing it because both patients and doctors told us it worked. The proclamations of authorities like Hippocrates and Galen were never tested or questioned, and their errors were passed down for centuries.
Eventually, we crossed the line from proto-science to modern science, for instance from alchemy to chemistry: Cartoon: “You’ve turned lead into gold? Good. Do it again, write a detailed description of how you did it, and submit it to peer review.”
Here’s a short history of medicine: 2000 BC: eat this root. 1000 AD: that root is heathen, say this prayer. 1850 AD: that prayer is superstition, here, drink this potion. 1920 AD: that potion is snake oil, here, take this pill. 1965 AD: that pill is ineffective, here, take this antibiotic – and then…
Back to square one: 2000 AD: that antibiotic is artificial, here, eat this root. [Cartoon: Natural remedies, snake oil.] But that’s only in some circles. Mostly we stick to science-based medicine today.
The scientific method is nothing special; it’s really just a way of thinking about a problem: forming a hypothesis, selecting one variable at a time, and testing it. It is the only approach that allows us to reliably distinguish myth from fact. It can be as simple as troubleshooting to find out why the radio isn’t working. Hypothesis: it’s not plugged in. Test: check the plug. And so on.
We have gradually learned better ways to test. The picture on the left is of Don Quixote. In the novel, he is assembling some old armor to wear on his quest and discovers that the visor is no longer attached to the helmet. He attaches it with ribbons, and when he tests it by striking it with his sword, the visor falls off. He re-attaches it more carefully and the second time he decides he doesn’t really need to test it again. Modern science has developed more rigorous ways to do tests.
Hard science is easy, soft science is hard. When you mix chemicals A and B you always get chemical C, and you can calculate exactly how much will be produced. But medical studies are more problematic. Every molecule of a chemical is the same, but humans are not all the same. We have genetic differences, and there are confounders like age, diet, alcohol, and concurrent diseases that may all influence the response to a treatment. Subjects in a study may not take all their assigned pills. In medical studies, A plus B may appear to equal C or D or E.
The first modern clinical trial was done by James Lind of the Royal Navy in 1747. Back then, the sailing ships went out for years at a time and sailors had no access to fresh foods. Many of them developed scurvy: they became weak, unable to work, and suffered internal bleeding, bleeding gums, and other symptoms. Today we know this was due to a deficiency of vitamin C, but vitamins weren’t discovered for another two centuries. Lind had heard reports of successful treatments, and he had developed the hypothesis that scurvy was due to putrefaction and could be prevented by acids. He divided 12 sick sailors into 6 groups, kept them all on the same diet, and gave each group a different test remedy: a quart of cider, 25 drops of elixir of vitriol (sulfuric acid – I hope he diluted it in water before he gave it to them!), 6 spoons of vinegar, half a pint of seawater, two oranges and a lemon, or a spicy paste plus a drink of barley water. The winning remedy was two oranges and a lemon. It worked, but he still didn’t understand HOW it worked. He tried sending ships out with bottled juice to save storage space, and that didn’t work – the bottling process heated the juice and destroyed the vitamin C.
Applying scientific method to medicine led to lots of great discoveries. Vaccines are probably responsible for saving more lives than anything else. Reliable birth control allowed women to take control of their lives and contribute to society in all kinds of occupations. Antibiotics reduced the toll of infections. We developed fantastic imaging methods (x-rays, ultrasound, CT, MRI, PET scans) (that’s Homer Simpson’s skull x-ray) that enabled us to see inside the living body and make diagnoses without waiting for the autopsy. Diabetes used to kill all its victims: insulin keeps them alive today. We can even transplant organs. And, since I mentioned birth control pills for the women, I’ll mention Viagra for the men. Some men probably think that’s one of medical science’s greatest inventions.
Modern medicine has accomplished a lot. Cartoon: “These machines sure are life-savers, doc. The noise annoyed me right out of my coma.”
But the greatest discovery of all was probably the randomized controlled trial (RCT). It was the method that enabled us to make most of those great discoveries.
The R in RCT is randomization. When comparing the responses of two groups, you want to make sure the groups are comparable. If you steered all the sick, old people into the treatment group and all the young, healthy ones into the placebo group, an effective treatment might appear to be a dud. To make sure there is no bias in group assignment, it’s best to use concealed allocation, where the researcher who assigns patients to groups 1 and 2 doesn’t know which is the placebo group; only another researcher knows that. Once you have assigned patients randomly to the groups, you still need to go back and check that the two groups really are similar.
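Randomized assignment with concealed allocation can be sketched in a few lines of code. This is my illustration, not part of the talk; the function name and the neutral group labels are invented:

```python
import random

def randomize(patients, seed=None):
    """Randomly split patients into two equal groups labeled 1 and 2.

    The labels are deliberately uninformative: only a separate
    researcher holds the key saying which label is the placebo group
    (concealed allocation), so the person doing the assigning can't
    steer sicker patients into either arm."""
    rng = random.Random(seed)
    shuffled = list(patients)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return {"group 1": shuffled[:half], "group 2": shuffled[half:]}

groups = randomize(["pt%02d" % i for i in range(20)], seed=1)
print(len(groups["group 1"]), len(groups["group 2"]))  # 10 10
```

After assignment you would still compare the two groups on age, sex, weight, and so on, as in the table the talk shows next.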
In reports of well-designed studies, you will usually see a table like this.
Here’s a close-up. The details aren’t important. The point is that they looked at all kinds of parameters like age, sex, ethnic group, weight, height, blood pressure, heart rate, etc. and calculated statistical measures of how similar the groups were.
The second letter of RCT stands for Controlled. Treatments almost always “work” – even quack treatments often seem to work due to the placebo effect and due to the natural course of disease where some people improve without treatment. There is a phenomenon called the Hawthorne effect, where just being enrolled in a study leads to improvement. So instead of just treating subjects and showing that they improve, you need to compare them to a control group of patients who get a placebo, or a known effective treatment, or no treatment at all.
Here’s why that’s important. Almost any disease has a fluctuating course. Symptoms wax and wane unpredictably. Here’s an example of a woman with osteoarthritis pain in her knees. At the beginning of this particular month she hardly had any pain; then it got worse for a while before subsiding again. This is what it does with no treatment at all. Let’s say she decides to take a pill for the pain. She’s not likely to try it during the first few days when the pain is hardly noticeable. She’s more likely to try it when the pain peaks. And look what happens.
Her pain goes away! Of course this is exactly the same as the previous graph with the left half cut off. But it sure is convincing. It really looks like whatever she did worked.
Now let’s say she gets a placebo effect from the treatment. The pain was going away anyway, but now it goes away really fast. Of course she’s going to conclude that the treatment worked wonders. So if you want to prove that a new treatment works, you’re going to have to show that it produces more improvement than the natural course of disease and the placebo response.
The best RCTs are double blind. The subject doesn’t know whether he’s getting the treatment or the placebo, and the researcher doesn’t know which he’s giving the patient. This minimizes any unconscious effects of bias. But even double blinding isn’t foolproof. Sometimes patients can guess which group they were in due to side effects. Sometimes they even take the capsules apart and check whether the contents taste like sugar. So you really need to do an exit poll, asking patients which group they thought they were in. If they can guess better than chance, you didn’t have an adequate placebo and your results are tainted.
Now let’s look at how these RCTs are used in practice to develop new drugs. Cartoon: “Just for kicks, let’s come up with something that has a good side effect.”
The first step is to decide what’s worth testing. If an Amazon explorer reports that a tribe chews a certain leaf to treat infections, you don’t just bring those leaves home and give them to people. First you might test the leaf in a lab to identify its components and see whether they suppress growth in a bacterial culture. You might try it on animals with infections. In vitro is Latin for “in glass” and refers to lab testing with test tubes and Petri dishes. In vivo means in a living body.
Animal studies may not be valid. Cartoon: “I believe I have a new approach to psychotherapy, but, like everything else, the FDA tells me it first has to be tested on mice.” Animals are not always equivalent to humans. Aspirin would have been rejected on the basis of animal studies. It causes congenital defects in mice, but not in humans.
Once you’ve decided that a drug is worth testing, the next step is what we call a Phase I trial. It looks for toxicity, to determine whether the drug is safe. You give a single small dose of the drug to healthy volunteers. If they have no adverse effects, you test larger doses over longer periods. Sometimes there are unpleasant surprises. A monoclonal antibody called TGN1412 was tested in animals and appeared to be safe. Healthy volunteers were given a dose 500 times lower than the one found safe in animals. They were all hospitalized with multiple organ failure, and some required organ transplants to save their lives.
If the drug passes the Phase I safety trials, the next step is a Phase II trial to see if the drug is effective. Phase I is the first trial in humans; Phase II is the first trial in patients. It tests large numbers of patients with the target disease and compares different doses. Ideally, several trials are done with different patient groups.
If the Phase II trials show the drug works, the next step is a Phase III trial to see if it works better than other treatments. You compare the new drug to an older drug or to a placebo.
Placebo trials may not be ethical. Cartoon: “Half the diabetics were given the new drug and responded well. The other half got a placebo and went into shock.” If you have an effective treatment for a disease, you can’t risk patients’ lives by denying them that treatment and assigning them to a placebo group. We couldn’t do a trial today comparing appendectomy to a placebo for acute appendicitis.
The trials don’t end with approval and marketing. They continue with post-marketing Phase IV trials to see what more we can learn. The company may want to demonstrate that the drug works for other illnesses, or show that it works better than a competitor’s product.
And then there’s post-marketing surveillance. The number of subjects in a research trial is small compared to the number of patients who will be taking a drug after it is marketed. Some of them will be different from the people in the trials in various ways, for instance they may have concomitant diseases. The first rotavirus vaccine was taken off the market when they discovered that 1 in 10,000 children developed intussusception, a telescoping of the bowel that is life-threatening. 1 in 100,000 people who got the 1976 swine flu vaccine developed a paralysis called Guillain-Barré syndrome. How many people would you have to enroll in a premarketing study to detect a one-in-100,000 complication? One of the weirdest drug effects I came across was that in men taking a drug called Flomax for prostate symptoms, if they have cataract surgery they can develop a complication called “floppy iris syndrome.” No one could have predicted that!
A researcher named Ioannidis recently wrote a seminal paper showing that most published research findings are wrong. And he explained why.
There are any number of things that can go wrong in research. Here are some of them. If a drug company funds the research, it’s more likely to support their drug than if an independent lab does the study. If the researchers are true believers, all kinds of psychological factors come into play and even if they do their best to be objective, they are at risk of fooling themselves. People who volunteer for a study of acupuncture are likely to believe it might work; people who think acupuncture is nonsense probably won’t sign up. Maybe 3 studies were done and only one showed positive results and that’s the one they submitted for publication (the file drawer effect). Most researchers delegate the day-to-day details of research to subordinates. Sometimes the peons in the trenches are just doing a job and trying to please their boss. They may feed false data to the author or suppress information they know he doesn’t want to hear. Sometimes when you read the conclusion of a study and go back and look at the actual data, the data don’t justify the conclusion. The report can’t possibly contain every detail of the research – what are they not telling us? Were there a lot of dropouts? Maybe when it didn’t work, they quit, and only the ones who got results were left to be counted.
Here are just a few of the other things that can go wrong. Some countries only publish studies with positive results. In China, if you published a study showing something didn’t work you would lose face and lose your job. So I won’t trust any results out of China until they are confirmed in other countries. If there are only a few subjects, errors are more likely. If you study the net worth of 5 people and Bill Gates is one of the 5, you get skewed results. In general, the more subjects, the more you can trust the results. When they calculate the statistics, they can use the wrong method or make mistakes. They can misinterpret the findings. The file drawer effect is when negative studies are not submitted for publication; publication bias is when the journals are less likely to publish negative studies. Inappropriate data mining is when the study doesn’t show what they wanted, and they look at subgroups and tweak the data every which way until they get something that looks positive. Sometimes researchers outright lie and commit fraud to further their careers. Sometimes they get caught, sometimes they don’t.
In thinking about a study, there are several things to consider. Was the endpoint a lab value or a clinical benefit? The diabetes drug Avandia decreased levels of hemoglobin A1C, a blood test indicating that the disease is under control. Unfortunately, it increased the mortality rate. Good blood tests aren’t very important if you’re dead. We try to look for POEMs – Patient-Oriented Evidence that Matters. Instead of looking at cholesterol levels, we look at numbers of heart attacks. What kind of study was done? RCT, cohort, case-control, epidemiologic? Was it blinded? Was the study well designed? Did they use an intention-to-treat analysis to correct for dropouts?
How much background noise was there? If you look at enough endpoints you’re almost certain to find some correlation just by chance. In a study of Gulf War Syndrome they looked at veterans’ wives to see if the husbands had brought anything home to harm them. They found that the wives were less likely to have moles and benign skin lesions. Obviously, that was just noise, not a sign that Gulf War Syndrome improves the skin of spouses. Small effects (a 5% improvement) are not as trustworthy as large effects (a 60% improvement). It’s always better if different types of evidence from different sources arrive at the same conclusion. You should never trust a single paper, but should look at the entire body of published evidence. And you should trust empirical papers that test other people’s theories more than empirical papers that test the author’s own theory.
I want to mention the null hypothesis because it can be hard to understand – it has always bothered me because it seems sort of like a double negative. Instead of testing the claim itself – that X cures Y – you test the null hypothesis – that X doesn’t cure Y. There are only 2 options: you can reject the null hypothesis (more people are cured with X than with placebo) or you can fail to reject it (the same number of people are cured with X as with placebo). You can’t ACCEPT the null hypothesis, because it’s hard to prove a negative. If the null hypothesis is that there are no black swans, all it takes is one black swan to disprove it. No matter how many times you try to fly, you can never prove that you can’t fly. The more times you try, the less likely it seems, but science can never say absolutely never. It can only say that based on current evidence, the likelihood of a human being able to fly is so vanishingly small that no one in his right mind would jump off a cliff to try it. Of course, science remains open to new evidence and would be willing to reconsider if people could actually show that they could fly.
All this is very discouraging. If most published research is wrong, when CAN you believe the results? You can look for good quality studies published in good journals, studies that have been confirmed by other studies, studies that are consistent with other knowledge, and studies that have a reasonably high prior probability. As Carl Sagan said, extraordinary claims require extraordinary proof.
Look at this structure. I think you can see that the prior probability of your being able to take 12 dice and construct this is so low that you would not waste your time trying.
Clinical research usually uses the arbitrary level of p=0.05 as the cut-off for statistical significance. Research in physics demands much more. A p-value of 0.05 means that if the treatment actually does nothing, a result this extreme would still turn up by chance in about 1 out of 20 studies. Does this mean that a study that is statistically significant at p=0.05 is 95% likely to be correct? NO! It may be much less likely to be correct if there is low prior probability. To understand why, let’s look at Bayes’ Theorem.
Bayes’ theorem allows us to use prior probability to calculate posterior probability. Suppose there is a co-ed school where 60% of the students are boys and 40% are girls. The girls wear trousers or skirts in equal numbers; the boys all wear trousers. An observer sees a random student from a distance; all the observer can see is that this student is wearing trousers. What is the probability this student is a girl? The answer can be computed using Bayes’ theorem: 1 in 4.
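The arithmetic of the school example is simple enough to spell out; here it is as a short sketch (the variable names are mine):

```python
# Priors: 60% boys, 40% girls.
# Likelihoods: all boys wear trousers; girls wear trousers half the time.
p_girl, p_boy = 0.40, 0.60
p_trousers_given_girl = 0.5
p_trousers_given_boy = 1.0

# Total probability of seeing trousers on a random student
p_trousers = (p_trousers_given_girl * p_girl +
              p_trousers_given_boy * p_boy)

# Bayes' theorem: P(girl | trousers)
p_girl_given_trousers = p_trousers_given_girl * p_girl / p_trousers
print(round(p_girl_given_trousers, 2))  # 0.25, i.e. 1 in 4
```

Even though half the girls wear trousers, the trousered student is probably a boy, because the prior (60% boys, all in trousers) dominates. That is the same logic that deflates “significant” results for implausible treatments.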
Here’s a table showing what this means for clinical research. In the top row, if the p-value is 0.05 and the prior probability is 1%, the probability that the results are correct is only 3%. If the prior probability is 50%, the posterior probability is still only 73%. In the bottom row, if the p-value shows a phenomenally high significance level of 0.001, and the prior probability is 1%, the posterior probability is still only 50%. Most of us think the prior probability of homeopathy studies is well under 1%. This shows why the homeopathy studies that claim statistical significance can’t be trusted.
If you don’t consider prior probability, you can end up doing what I call Tooth Fairy Science. You can study whether leaving the tooth in a baggie generates more Tooth Fairy money than leaving it wrapped in Kleenex. You can study the average money left for the first tooth versus the last tooth. You can correlate Tooth Fairy proceeds with parental income. You can get reliable data that are reproducible, consistent, and statistically significant. You think you have learned something about the Tooth Fairy. But you haven’t. Your data has another explanation, parental behavior, that you haven’t even considered. You have deceived yourself by trying to do research on something that doesn’t exist.
Ray Hyman’s categorical directive: “Before we try to explain something, we should be sure it actually happened.” Hall’s corollary is “Before we do research on something, we should make sure it exists.”
There’s been a lot of research on the meridians and qi of acupuncture. Here’s an early acupuncture patient. Cartoon: mammoth is getting spears in the butt and thinks his neck pain suddenly feels better.
In acupuncture fairy science, the hypothesis is that sticking needles in specific acupuncture points along acupuncture meridians affects the flow of qi, which improves health. There is no evidence that specific acupuncture points or meridians exist, and no evidence that qi exists. If it did exist, there’s no reason to think it could flow, or that sticking needles in people could affect that flow, or that the flow could improve health. With acupuncture, what could you use as a placebo control? People generally notice when you stick a needle in them, so blinding is very difficult. They have come up with ingenious placebos – comparing random points on the skin to traditional acupuncture points or using sham needles that work like those stage daggers, where they appear to be penetrating the skin but actually just retract into the handle. One recent study used toothpicks that pricked but did not penetrate the skin. No matter what control you pick, a double blind study is impossible, because the acupuncturist has to know what he’s doing. The best studies using sham acupuncture controls have consistently shown that acupuncture works better than no treatment, and that sham acupuncture works just as well as real acupuncture. It doesn’t matter where you stick the needle or whether you use a needle at all. The only thing that seems to matter is whether the patient believes he got acupuncture. The acupuncture fairy believer’s conclusion is that we know acupuncture is effective so sham acupuncture must be effective too. The rational conclusion is that acupuncture works no better than placebo.
I write for the Science-Based Medicine blog, where we make a distinction between evidence-based medicine and science-based medicine. Evidence based medicine simply looks at the published clinical research and accepts the findings. Science based medicine considers preclinical research, prior probability, consistency with the rest of the body of scientific knowledge, and the fallibility of most research. I like to think of it as this formula: SBM = EBM + CT (critical thinking)
One of the pitfalls in evaluating published studies is disease clusters. If the mean incidence of cancer is X, that means some communities will have fewer than X cases and some will have more than X. If you spill rice on a grid, there will be an average number of grains per square, but some squares will have 0 grains and some will have lots of grains. It can be very tricky to determine whether a cluster is due to chance or whether it represents an increased risk in that particular area. In the Love Canal incident, a panel of distinguished doctors recently reviewed the scientific findings to date. They issued a surprising verdict. In their view, no scientific evidence has been offered that the people of Love Canal have suffered "acute health effects" from exposure to the hazardous wastes, nor has the threat of long-term damage been conclusively demonstrated.
Another tricky thing is confidence intervals. The top of each column shows the actual measurement found, and there is a 95% confidence that the true value falls somewhere along the red line. In the example on the left, the lowest end of the red line for the brown column is still higher than the highest end of the red line for the white column, so we can be confident brown is really greater than white. In the example on the right, the red lines overlap: a value within white’s interval could exceed a value within brown’s, so white might actually be greater than brown, and we can’t be confident there is a real difference.
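The overlap rule can be made concrete with a sketch. The data, the function names, and the 1.96 normal-approximation cutoff here are my illustrative assumptions, not from the slide:

```python
import math

def mean_ci95(values):
    """Sample mean plus/minus ~1.96 standard errors (normal approximation)."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / (n - 1)
    half_width = 1.96 * math.sqrt(var / n)
    return (mean - half_width, mean + half_width)

def clearly_greater(a, b):
    """True only if interval a sits entirely above interval b."""
    return a[0] > b[1]

# Made-up measurements for the "brown" and "white" columns:
brown = mean_ci95([52, 55, 54, 53, 56, 54, 55, 53])
white = mean_ci95([41, 44, 42, 43, 42, 45, 43, 42])
print(clearly_greater(brown, white))  # True
```

If the two intervals overlapped, `clearly_greater` would be False in both directions, which is the situation in the right-hand example: the point estimates differ, but the data can’t tell us which true value is higher.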
One of the most common human errors is forgetting that correlation doesn’t mean causation. The rise in autism was correlated with a rise in the number of pirates, but I doubt if anyone thinks pirates cause autism or autism causes pirates.
The logical fallacy here is post hoc, ergo propter hoc: assuming that because B follows A, it was caused by A. It’s easy to see why this is a fallacy when you show that the rooster crows every morning, followed by the sun rising. There is a consistent correlation, but we all know it was not the rooster’s crowing that made the sun come up. But think about this: I took a pill, and I got better; therefore the pill made me better. Suddenly it seems perfectly reasonable – but we have to remember that it might be just another rooster.
Does this make sense? It’s from Scientific American in 1858. A doctor studied the curative effects of light in hospitals and found that 4 times as many patients were cured in properly lighted rooms as in darkness. He said this was “due to the agency of light, without a full supply of which plants and animals maintain but a sickly and feeble existence.” The editors commented that the health statistics of all civilized countries had improved in the preceding century – maybe because houses were better built to admit more light. I think most of us can see that this correlation did not prove causation – we can come up with other explanations.
A scientist named Hill came up with a list of criteria to determine whether a correlation from epidemiologic studies showed causation. I’ll illustrate with the example of smoking and lung cancer. There is no way we could do a randomized controlled study of smoking – you can’t divide children in two groups and make one group smoke for decades and the other not, and you certainly couldn’t have a blinded study. So we had to approach the question by other routes. 1. There was a temporal relationship: people smoked first and got lung cancer later. 2. There was a strong relationship – lots more smokers than nonsmokers got lung cancer. 3. There was a dose-response relationship – the more cigarettes smoked, the higher the rate of lung cancer. 4. The results of various kinds of studies were all consistent. 5. The mechanism was plausible: we know there are cancer-causing compounds in cigarette smoke. 6. Alternate explanations were considered and ruled out. 7. They did experiments where they exposed lab animals to cigarette smoke and the animals developed cancer. 8. Specificity: cigarettes produced specific types of lung cancers, not a mixture of various unrelated symptoms. 9. Coherence. The data from different kinds of epidemiologic and lab studies and from all sources of information held together in a coherent body of evidence.
Cartoon: Tobacco Industry Research Centre: “Excellent health statistics – smokers are less likely to die of age related illnesses.”
Media reports of medical studies usually make findings sound more dramatic than they are by citing relative risk reduction rather than absolute risk reduction. Rather than percentages alone, we want to know how many actual people are harmed or helped.
Here’s an example. A study showed that the risk of breast cancer increases by 6% for every daily drink of alcohol, so the relative risk increase for two drinks a day is about 12%. Does that mean 2 drinks a day would give 12% of women breast cancer? NO. Overall, 9% of women get breast cancer: in every 100 women, 9 will get it. If every woman had 2 extra drinks a day, 10 of them would get breast cancer. Only 1 extra woman in 100 would be harmed; the other 99 would be unaffected. The absolute risk increase is 1 in 100, not 12 in 100.
Another example. Eating bacon increases the risk of colorectal cancer by 21% (relative risk). 5 men in 100 get cancer in their lifetime. If each ate a couple of slices of bacon every day, 6 would get cancer – only one more man, for an absolute risk of 1 in 100.
A recent study showed that using a cell phone doubled the risk of acoustic neuroma (a tumor in the ear). The relative risk was reported as 200%, and alarmed parents took their children’s phones away. But the baseline risk of acoustic neuroma is 1 in 100,000. Doubling that gives 2 in 100,000: the absolute increase was 1 more tumor per 100,000 people. Acoustic neuroma is a treatable, non-malignant tumor. The lead researcher said she would rather accept the risk and know where her kids were. She let them keep their cell phones. She warned that the results were provisional, the study small, and that different results might be found with a larger study. She was vindicated when a later, larger study found no increased risk.
What’s wrong with this? A British Heart Foundation press release said “We know that regular exposure to second-hand smoke increases the chances of developing heart disease by around 25%. This means that for every four non-smokers who work in a smoky environment like a pub, one of them will suffer disability and premature death from a heart condition because of second-hand smoke.”
If 4 in 100 nonsmokers develop heart disease, a 25% increase means 5 in 100 will. The added risk is 1 in 100, not 1 in 4.
As well as asking for absolute risk rather than relative risk, we can ask for NNT and NNH – the number needed to treat and the number needed to harm. When you use Tylenol for post-op pain, you have to give it to 3.6 patients for one to benefit. You have to treat 16 dog-bite patients with antibiotics for one to benefit: 15 out of 16 will take the antibiotics needlessly. The clot-buster drug tPA has to be given to 3.1 patients for one to benefit, and for every 30.1 patients, one will be harmed.
Lipitor is one of the statin drugs used to lower cholesterol and prevent heart attacks and strokes. When it is used for secondary prevention (in patients who already have heart disease) somewhere between 16 and 23 patients must be treated for one to benefit. When it is used for primary prevention (in patients who are at risk but don’t yet have heart disease) the NNT rises to somewhere between 70 and 250, depending on age, other risk factors, etc. Of every 200 patients taking the drug, one is harmed.
One cynic put it this way: "What if you put 250 people in a room and told them they would each pay $1,000 a year for a drug they would have to take every day, that many would get diarrhea and muscle pain, and that 249 would have no benefit? And that they could do just as well by exercising? How many would take that?" This is an exaggeration, but it illustrates that these drugs should be used selectively, based on individual factors like a total risk assessment and the patient’s preference for taking his chances versus taking a drug as insurance.
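The NNT figures above reduce to simple reciprocals: NNT is one over the absolute risk reduction, so dividing a group size by the NNT gives the expected number who benefit. A minimal sketch (function names are mine):

```python
def nnt(absolute_risk_reduction):
    """Number needed to treat: the reciprocal of the absolute risk reduction."""
    return 1 / absolute_risk_reduction

def expected_to_benefit(nnt_value, patients):
    """Expected number of `patients` who benefit, at a given NNT."""
    return patients / nnt_value

print(expected_to_benefit(3.6, 100))   # Tylenol for post-op pain: ~28 of 100 benefit
print(expected_to_benefit(16, 16))     # dog-bite antibiotics: 1 of 16 benefits
print(expected_to_benefit(250, 250))   # statin primary prevention, worst case: 1 of 250
```

Seen this way, an NNT of 250 means the absolute risk reduction is only 1/250, or 0.4% – which is why the cynic’s framing, exaggerated as it is, stings.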
We all tend to assume that a positive test means someone has a disease and a negative test means he doesn’t. But it’s not that simple. There’s no such thing as a definitive test. There are false positives, false negatives, and lab errors. The diagnostic value of a test depends on the pre-test probability that the patient has the disease. One lesson I learned over and over in my years of practice was never to believe one lab test. My mother was a case in point. On the basis of one sky-high blood glucose test, she was told she had diabetes. They wanted to start her on treatment, but I persuaded them to wait. We bought a home monitor and checked her regularly and never ever got a single abnormal reading. When the lab checked her again, she was normal. We still don’t know what happened – maybe her blood sample got switched with someone else’s.
If your mammogram is positive, how likely is it that you actually have breast cancer? They’ve done surveys asking this question, and most laypeople and even many doctors guess 90%. Actually it’s only 10%.
Mammograms are 90% accurate in spotting those who have cancer (this is called the sensitivity of the test). They are 93% accurate in spotting those who don’t have cancer (this is called the specificity of the test). 0.8% of women getting routine mammograms have cancer (this is the prevalence of the disease).
This means that of every 1000 women getting mammograms, 8 of them have cancer. Of those 8 women with cancer, 7 will have true positive results, and one will have a false negative result and be falsely reassured that she does not have cancer. 992 of the 1000 women do not have cancer. Of those, 70 will have false positive results and 922 will have true negative results. So in all, there will be 77 positive test results, and only 7 of those women will actually have cancer – roughly 10%.
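The same arithmetic, written out step by step in Python using the sensitivity, specificity, and prevalence figures above:

```python
sensitivity = 0.90   # fraction of cancers the test catches (true positive rate)
specificity = 0.93   # fraction of healthy women correctly cleared (true negative rate)
prevalence  = 0.008  # 0.8% of women getting routine mammograms have cancer

n = 1000
with_cancer     = n * prevalence                      # 8 women
true_positives  = with_cancer * sensitivity           # ~7
false_negatives = with_cancer - true_positives        # ~1, falsely reassured
without_cancer  = n - with_cancer                     # 992 women
false_positives = without_cancer * (1 - specificity)  # ~70 (69.4)
all_positives   = true_positives + false_positives    # ~77

ppv = true_positives / all_positives   # positive predictive value
print(round(ppv, 2))                   # ~0.09 -- roughly 10%, not 90%
```

The counterintuitive result comes entirely from the low prevalence: even a small false-positive rate applied to 992 healthy women swamps the 7 true positives.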
It gets worse. How many lives are saved by mammography? If 1000 women are screened for 10 years starting at age 50, one life will be saved. 2-10 women will be overdiagnosed and treated needlessly. 10-15 women will be told that they have breast cancer earlier than they would otherwise have been told, but this will not affect their prognosis. 100-500 women will have at least one false alarm, and about half of them will undergo a biopsy they didn’t really need.
It gets worse when you screen for multiple diseases. The PLCO Trial tested for cancers of the prostate, lung, colon, and ovary. After only 4 tests, 37% of men and 26% of women had false positives. After 14 tests, it went up to 60% and 49%. After 14 tests, 28% of men and 22% of women had undergone invasive tests to determine whether they really had cancer – this included biopsies, exploratory surgery and even hysterectomy.
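The compounding effect can be sketched with a deliberately simplified model: independent tests, each with a constant per-test false-positive rate. The real trial numbers won’t match exactly, since different tests have different rates and aren’t independent, but the pattern – repeated screening makes a false alarm nearly inevitable – falls right out of the arithmetic. Here the 7% mammogram false-positive rate from above is used as the illustrative per-test rate:

```python
def p_at_least_one_false_positive(per_test_fp_rate, n_tests):
    """Chance of at least one false positive across n independent tests,
    each with the same false-positive rate (a simplifying assumption)."""
    return 1 - (1 - per_test_fp_rate) ** n_tests

for n in (1, 4, 14):
    print(n, round(p_at_least_one_false_positive(0.07, n), 2))
# 1 test:  0.07 -- 4 tests: 0.25 -- 14 tests: 0.64
```

Even a modest 7% per-test rate yields roughly a two-in-three chance of at least one false alarm after 14 tests.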