Tooth Fairy Science and Other Pitfalls: Applying Rigorous Science to Messy Medicine, Part 2
August 24, 2012
Part 2 of Harriet Hall, MD’s presentation from the 2009 Skeptic’s Toolbox conference. Read Part 1 »
We hear a lot about risks, but the media seldom puts those risks into perspective for us. The swine flu had killed 263 people in the US as of July 10, 2009. Regular flu kills 36,000 each year. Smoking kills 440,000 each year. We should be far more afraid of cigarettes than of swine flu.
We are warned about the obesity epidemic and we are told the ideal body mass index is 19-25. For a BMI of 18 or lower, there are 34,000 excess deaths. For a BMI over 30, there are 112,000 excess deaths. But lo and behold, for BMIs only mildly elevated at 25-29, there are 86,000 FEWER deaths. Makes you wonder what “ideal weight” really means.
We all know how easy it is to lie with statistics. Here are some statistical benchmarks to keep in mind that will help you detect some of the lies. The US population is a little over 300 million, with 4 million births and 2.4 million deaths a year. Half of the deaths are from heart disease and cancer. The report that “more than 4 million women are battered to death by their husbands or boyfriends every year” can’t possibly be true: total homicides are only about 17,000 a year, and 4 million would exceed the country’s total annual mortality of 2.4 million.
Another area of confusion is the difference between prevalence and incidence. Prevalence is how many people in a population currently have the disease; incidence is how many people are newly diagnosed each year. One of my colleagues on the blog was discussing autism rates with an anti-vaccine activist. My colleague cited the incidence figures for Denmark; the activist tried to compare them to the prevalence figures for the US and ended up with egg on his face.
It’s very important to remember that probabilities are not predictions. Children who consistently spend more than 4 hours a day watching TV are more likely to be overweight. But that doesn’t mean you can make a child fat by making him watch TV. Or that you can make a child thinner by turning the TV off.
When there are a lot of small studies that don’t reach statistical significance, it is sometimes possible to combine their data into one big study that does: a meta-analysis. Usually, though, the studies are not similar enough to justify doing that, and if the small studies are not well designed, combining them just makes the Garbage In, Garbage Out (GIGO) problem worse. Systematic reviews examine all the published data using predetermined criteria and attempt to weigh the quality of the evidence. Because of those characteristics, if a systematic review’s results are negative they are probably true, but if they are positive they may still have a GIGO problem. Sometimes the reviewers are biased – they may be homeopaths reviewing the literature on homeopathy.
Homeopathy: the perfect placebo, nothing but water. Cartoon: Why homeopathy is such a bargain: just before you run out, use what’s left to mix up another gallon.
A systematic review of homeopathy studies showed that overall it worked better than placebo, but it didn’t work better for any specific condition. Whaaat? That’s like saying broccoli is good for everyone but is not good for men, women or children.
One of my readers educated me about how this can happen, and I thought it was neat so I’ll share it with you. It’s called Simpson’s paradox. Looking across the first row, testing treatments for symptom A, 30 patients were given placebo; 24 improved for an 80% improvement. 50 patients were given a homeopathic remedy and 40 improved, for an 80% improvement. In the rows for symptom B and C the same thing happened: the improvement percentages for placebo and homeopathy were exactly equal. But look what happens in the bottom row. When you add up rows A, B, and C, it looks like you got a 35% improvement with the placebo and a 48% improvement with homeopathy. This is statistical shenanigans, NOT evidence that homeopathy works better than placebo.
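The arithmetic behind Simpson’s paradox is easy to verify yourself. In this sketch, the symptom A figures (24/30 for placebo, 40/50 for homeopathy) are the ones quoted above; the figures for symptoms B and C are hypothetical numbers chosen so that the pooled totals come out to the 35% and 48% mentioned:

```python
# Simpson's paradox: per-symptom improvement rates are identical in both
# arms, yet the pooled totals appear to favor homeopathy.
# (improved, total) for each symptom and arm; B and C are illustrative.
data = {
    "A": {"placebo": (24, 30),  "homeopathy": (40, 50)},  # 80% vs 80%
    "B": {"placebo": (15, 30),  "homeopathy": (20, 40)},  # 50% vs 50%
    "C": {"placebo": (24, 120), "homeopathy": (12, 60)},  # 20% vs 20%
}

# Per-symptom rates: equal for both arms in every row.
for symptom, arms in data.items():
    for arm, (improved, n) in arms.items():
        print(f"{symptom} {arm}: {improved}/{n} = {improved/n:.0%}")

# Pooled rates: the uneven allocation across symptoms creates an
# apparent advantage for homeopathy out of nothing.
for arm in ("placebo", "homeopathy"):
    improved = sum(data[s][arm][0] for s in data)
    n = sum(data[s][arm][1] for s in data)
    print(f"pooled {arm}: {improved}/{n} = {improved/n:.0%}")
```

Each row shows identical improvement rates for the two arms; only the lopsided distribution of patients across symptoms makes the pooled homeopathy column look better.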
Poster of toilet: If water has a memory, then homeopathy is full of shit. Homeopathy: shit and sugar.
If a test is negative, how likely is it that the patient really doesn’t have the disease? If the test is positive, what is the likelihood that he really has the disease?
We can calculate likelihood ratios. Here’s a complex example, where the thickness of the lining of the uterus is measured on ultrasound to predict the likelihood of cancer of the uterus in women who have postmenopausal bleeding. If the thickness is less than 4 mm, the likelihood is 0.2%; if it is 21-25, the probability of cancer is 50%.
Some tests are better at ruling out disease and some are better at ruling it in. There is a blood test called the D-dimer test that is used to help rule out pulmonary embolus. If the doctor thinks there is a 10% probability that the patient has a PE, a positive D-dimer test raises the probability to 17%; a negative D-dimer test lowers the probability to 0.2%.
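The mechanics behind numbers like these are a simple odds calculation: convert the pretest probability to odds, multiply by the test’s likelihood ratio, and convert back. Here is a minimal sketch; the likelihood ratio values are back-calculated from the figures above so the example works out, not quoted from a published D-dimer study:

```python
def post_test_prob(pretest, lr):
    """Update a pretest probability with a likelihood ratio via odds."""
    pre_odds = pretest / (1 - pretest)   # probability -> odds
    post_odds = pre_odds * lr            # apply the likelihood ratio
    return post_odds / (1 + post_odds)   # odds -> probability

# With a 10% pretest probability of pulmonary embolus
# (illustrative LRs: ~1.84 for a positive test, ~0.018 for a negative):
print(post_test_prob(0.10, 1.84))    # roughly 0.17
print(post_test_prob(0.10, 0.018))   # roughly 0.002
```

The tiny negative likelihood ratio is what makes the D-dimer good at ruling out a PE, while the modest positive likelihood ratio means a positive result does little to rule one in.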
You have a sore throat. You get a throat culture. If the culture is positive, you have strep throat. If it’s negative, you don’t have strep throat. Right? WRONG! You might be an asymptomatic strep carrier, and your symptoms might be due to a virus. There could be a lab error. False positive and false negative test results are possible.
Rather than depending on a throat culture, we’ve developed some clinical decision rules. We can calculate a strep score based on points for age, exudate on tonsils, swollen lymph nodes, fever, and absence of cough.
If the score is 0-1 point, the probability of strep is essentially zero and no culture or treatment is necessary. If the score is 2-3 points, the probability of strep rises to 17-35%, so we get a throat culture and give antibiotics if the culture is positive. If the score is 4-5 points, the probability is over 50% and we can just give antibiotics without bothering to do a culture.
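The scoring just described matches the McIsaac modification of the Centor score. A sketch of how such a decision rule might be coded; the function names and the wording of the management options are my own, while the point values and cut-offs follow the description above:

```python
def strep_score(age, exudate, tender_nodes, fever, cough):
    """One point each for tonsillar exudate, swollen/tender nodes,
    fever, and absence of cough, plus an age adjustment."""
    score = int(exudate) + int(tender_nodes) + int(fever) + int(not cough)
    if age < 15:
        score += 1   # children are more likely to have strep
    elif age >= 45:
        score -= 1   # older adults are less likely to have strep
    return score

def plan(score):
    """Management thresholds as described in the text."""
    if score <= 1:
        return "no culture, no antibiotics"
    if score <= 3:
        return "culture; treat if positive"
    return "treat empirically"

# A feverish 10-year-old with exudate and swollen nodes but no cough:
print(plan(strep_score(10, exudate=True, tender_nodes=True,
                       fever=True, cough=False)))  # treat empirically
```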
Medicine is not an art or a science, but an applied science. It can be tricky to apply the evidence from published studies to the individual patient in the doctor’s office. Often there are no pertinent studies to go by. We can’t just use blind guesswork or intuition. We have to make the best possible judgment based on the best available evidence.
Medical science is flawed, but it’s better than the alternatives: testimonials, personal experience, belief-based treatments, or hypothetical, untested treatments. Cartoon: “But pastor, I need an antidote, not an anecdote.”
If you don’t have a reliable way to evaluate evidence, you can end up being fooled by quackery like the blue dot cure. A quack can do anything useless and ridiculous, like painting a blue dot on the patient’s nose, and can “show” that it works. There are only 3 things that can happen: the patient can get better, worse, or stay the same. If he gets better, the quack claims that the blue dot worked. If he gets worse, the quack laments “If only you’d come to me sooner, the blue dot would have had time to work.” If the patient stays the same, the quack says “The blue dot kept you from getting worse; we need to continue the treatment.”
People are easily fooled into believing that quack treatments work. Cartoon: “Strong and blind belief is a virtue.” “OK then. I will strongly believe that you don’t know much of anything.”
Real science and pseudoscience may be hard for the layman to tell apart, but it’s important to know the difference, because…
Quackery takes your money.
Pseudoscientific explanations may seem to make superficial sense if you don’t know anything about science. Cartoon: Why does the sun set? It’s because hot air rises. The sun’s hot in the middle of the day, so it rises high in the sky.
In the scientific method, the scientist says “Here are the facts. What conclusions can we draw from them?” In the pseudoscientific method (here represented by the creationist method) the pseudoscientist says “Here’s the conclusion. What facts can we find to support it?” The scientist asks IF something works; the pseudoscientist tries to SHOW that it DOES work.
As Bill Nye, the Science Guy, says: Science rules!
Here’s a brief overview of how to read a study.

The first thing is the abstract, a summary of what the study showed. It is only as good as the person who wrote it: it may be inaccurate, or it may draw conclusions not warranted by the data.

The introduction explains what we already knew about the subject and why the authors decided to do this particular study to learn more. It usually cites previous studies, and if the authors are biased, it may cherry-pick the literature and mention only studies that support their bias.

Then there is a Methods section that should describe in detail exactly what they did, so a reader could replicate their experiment in his own lab.

Then they give their Results. They should give us the raw data so we can do our own analysis. This is where they apply statistical tests and tell us whether their results are statistically significant and at what level of significance.

The Discussion section tells us what they think their data show. If they are good scientists, they will point out the limitations of their study and try to anticipate the objections that others might raise.

Then, with the Conclusion, they will tell us what they think the implications of their study are. A good scientist will usually say something like “If other studies confirm these findings, this treatment may turn out to be clinically useful.” A bad scientist might say something like “We have proved this treatment works and everyone should start using it right away.”

The references will be listed, and it’s worth checking them. Sometimes you can tell from the title alone that a reference has little to do with the claim it is meant to support. For instance, the statement “homeopathy cures rheumatoid arthritis” might be referenced by a homeopathy handbook or by a study entitled “The prevalence of a positive RA factor in a population with advanced rheumatoid arthritis.” I have even found references that attested to the exact opposite of the claim made in the article.
Then, at the very end, there should be a mention of the source of funding and a disclosure of any conflicts of interest on the part of the authors.
I’m going to go over a couple of examples of how I go about evaluating a study, first from an abstract alone, and then from an entire study. Let’s say you read in the newspaper about a new study that shows that taking a multivitamin will reduce your risk of a heart attack. How can you know if the newspaper report has represented the study fairly? Most studies are listed in PubMed at www.pubmed.gov with their abstracts. You can usually get enough clues from the news report to use the search function and find the abstract. You will see something like this:
Don’t worry about the details. I’m just going to go through and mention some of the kinds of things I look for. The red highlighted phrases caught my attention. This was a study in Sweden. They took 1296 people who had had a heart attack (myocardial infarction or MI) and compared them to 1685 controls picked from the general population and matched to the patients by sex, age, and catchment area. They asked everyone to self-report their use of diet supplements, and they found that the controls – the people who had not had a heart attack – were significantly more likely to have used diet supplements. This was a case-control study, which is less reliable than a randomized controlled study. It depended on recall rather than measuring in any way what the patients had actually taken. It did not try to assess whether any of the patients had been vitamin deficient. It only studied people between the ages of 45 and 70, so any conclusions drawn from it might not apply to people older or younger than that. It was done in Sweden; the authors themselves point out that consumption of fruits and vegetables is relatively low in that country, so the results might not have implications for other countries with different diets. Of those who said they used supplements, 80% said they used multivitamins. What about the other 20%? Could they have been taking some other supplement that had a larger effect than vitamins? They checked for some possible confounding factors and found that “never smoking” outweighed the effect of vitamins in women. The abstract concludes, “Findings from this study indicate that use of low dose multivitamin supplements may aid in the primary prevention of MI.” I don’t think the data support that conclusion at all. At any rate, the headline is clearly wrong. This study does NOT show that you should start taking a multivitamin to reduce your risk of a heart attack.
More importantly, when you search the medical literature you don’t find any other studies showing that multivitamins prevent heart attacks; in fact, there are several studies showing that supplemental vitamins either have no effect or make things worse.
Now let’s go over an entire study. I picked this one: “Dominican Children with HIV not Receiving Antiretrovirals: Massage Therapy Influences their Behavior and Development.” It was published in an online journal, Evidence-Based Complementary and Alternative Medicine (eCAM). The very title of the journal suggests bias, since the definition of CAM is that it is not supported by the kind of evidence that would lead to its adoption by mainstream medicine.
In short, they took children who were HIV positive and divided them into two groups, comparing massage therapy to play therapy. Play therapy was the placebo control they chose: you can judge for yourselves whether that is an adequate placebo.
The abstract explains that they studied 48 Dominican children between the ages of 2 and 8. They all had untreated HIV/AIDS. They were randomized to receive either massage or play therapy for 12 weeks. The massage group improved in self-help abilities and communication. Children over the age of 6 showed a decrease in depressive/anxious behaviors and negative thoughts.
The first thing I noticed was a number of careless errors. “Caribbean” was misspelled. Their phrase “for enhancing varying behavioral and developmental domains” was so vague as to be essentially meaningless. They said, “A second objective of our work was to determine the absence of antiretroviral treatment on the impact of HIV infected Dominican children’s mood and behavior.” The abstract said the sessions lasted 20 minutes, while the text said 30 minutes. By themselves, errors like these may not mean much, but they make me wonder whether the researchers were as careless about their experimental methods as they were about writing up the results. They also make me wonder what the peer reviewers and editors were doing – they certainly weren’t doing their job, or they would have caught and corrected errors like these.
In the Introduction section they offered a brief review of the literature. They summarized a few cherry-picked studies that didn’t really support their claims, they omitted other studies that contradicted their claims, and they didn’t address plausibility. The rationale for doing this study was unclear. They talked about massage enhancing immune function, but this study did not even try to measure immune function. It only looked for psychological and developmental effects.
In the Methods section, they described 30-minute sessions with either a standardized massage protocol or a play therapy protocol. The play therapy consisted of giving the child a choice of coloring/drawing, playing with blocks, playing cards, or reading children’s books. The parents were present throughout, and the parents’ reports were used to judge improvement.
They used a couple of scales to measure improvement. This one, the CBCL, measured these 8 items, and another scale measured several others. So we have a multiple end point problem and there is no evidence that they tried to correct for this.
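The simplest way to correct for a multiple endpoint problem is the Bonferroni adjustment: divide the significance threshold by the number of comparisons. A sketch with hypothetical p-values (the study’s actual p-values are not reproduced here):

```python
def bonferroni(p_values, alpha=0.05):
    """An endpoint counts as significant only if its p-value is below
    alpha divided by the number of endpoints tested."""
    threshold = alpha / len(p_values)
    return [p < threshold for p in p_values], threshold

# Eight endpoints: the corrected threshold is 0.05 / 8 = 0.00625, so a
# nominally "significant" p = 0.04 no longer passes.
flags, threshold = bonferroni([0.04, 0.20, 0.50, 0.33,
                               0.059, 0.70, 0.12, 0.01])
print(threshold, flags)
```

Bonferroni is deliberately conservative; the point is that the more endpoints you measure, the stronger each individual result has to be before you are entitled to call it significant.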
The Results section reported that for children under 5 there were no significant differences between the massage and play groups. For children over 6, they reported “significant” improvement in the massage group for anxious/depressed behavior, negative thoughts, and overall internalizing scores. Look at the p values. The value for negative thoughts is p=0.059, higher than the usual cut-off of 0.05. Most scientists would not report this as “significant.” And note that among all the many things they measured, only two showed true statistical significance.
They seemed to be confused about the meaning of significance. They reported that 100% of the children in the play group showed an increase in their score on rule-breaking behaviors, significant at p < 0.05. They commented that “this significant change was not clinically meaningful.” How could they determine that? And if that wasn’t clinically meaningful, how did they determine that their other findings WERE clinically meaningful? They characterized a change in IQ data as “marginally significant” for the massage group at p=0.07. That’s like being a “little bit pregnant.” The cut-off for significance is p=0.05, and most researchers would simply call anything over the cut-off not significant.
On the Developmental Profile scales, they found that the massage group improved in self-help and communication. The play group improved in social development, while the massage group showed decreased social development. They found no significant differences in the physical or academic scores. There is a bar graph printed in the article that appears to show that the play group did better with a gain of 9 points compared to 8 for the massage group.
In the Discussion section they said that “massage therapy was effective in reducing maladaptive internalizing behaviors in children aged 6 and over” and “children 2-8 years of age who received massage demonstrated enhanced self-help and communication skills.” They found it “interesting” that children in the massage group remained at the same social developmental level, suggesting that it was because those children had little or no play activity at home. (Did they? We don’t know.)
They were “puzzled” by their failure to find any effect on behaviors in the under-5 age group because they were so sure massage therapy improves children’s moods and anxiety levels. They tried to rationalize what might have gone wrong. They commented that “anecdotally, the nurses who conducted the massages reported changes in the children over time, including better mood.” This kind of anecdotal report is meaningless and has no place in an objective scientific study. It is a blatant attempt to make massage look better than what the data showed.
They recommended massage therapy as a cost-effective option to improve symptoms and functioning in children with untreated HIV. Sorry, but the study doesn’t even begin to support such a recommendation.
Why was this study really done? We already know that children need human interaction, play, touch, and TLC. Why massage? Does this study justify using massage as a “band-aid” on children who are denied life-saving anti-AIDS drugs?
I could have taken the data they collected and used it to support a very different conclusion: We studied a bunch of outcomes and found that massage is ineffective for all but a couple of them (and those are probably not clinically meaningful). One outcome was worse with massage (decreased social development). We did not show that massage is any better than TLC. Money would be better spent saving lives with effective drugs.
Cartoon: clown says “It may surprise you to hear that, actually, morphine is the best medicine.”
In the syllabus we have reprinted one top-of-the-line study and one worthless study. In your group discussion sessions, I’d like you to compare the two. Some of it is technical: don’t worry about the parts you don’t understand. Look for reasons one is better than the other.