Bad Statistics, and Bad Training, Are Sabotaging Drug Discovery

1/6/14

One of the most widely read college textbooks in the 1960s and ‘70s was How to Lie with Statistics by Darrell Huff. Despite the humorous title, the serious intent of the book (written by a journalist, not a statistician) was to illustrate how common errors in the use of statistics frequently lead to misleading conclusions.

Though popular, it may not have done the job it was intended to do. Decades later, author John Allen Paulos revisited the subject in his book Innumeracy: Mathematical Illiteracy and Its Consequences. Unfortunately, problems with the proper use of statistics still appear to be a serious and widespread concern. A recent paper by statistician Valen Johnson of Texas A&M University in College Station suggested that roughly one in four published scientific studies draws false conclusions because of weak statistical standards.
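
Johnson's one-in-four figure stops sounding surprising once you work through the arithmetic of significance thresholds. Here's a back-of-the-envelope sketch in Python; the prior and power numbers below are illustrative assumptions of mine, not Johnson's:

    # Why a p < 0.05 threshold breeds false positives: an illustrative
    # calculation with assumed numbers (not Johnson's actual model).
    prior_true = 0.10   # assume 10% of tested hypotheses are real effects
    power      = 0.80   # chance a real effect reaches significance
    alpha      = 0.05   # chance a null effect reaches significance anyway

    true_hits  = prior_true * power          # 0.08
    false_hits = (1 - prior_true) * alpha    # 0.045
    fdr = false_hits / (true_hits + false_hits)
    print(f"'significant' findings that are false: {fdr:.0%}")   # 36%

Under those assumptions, more than a third of "significant" results are false alarms, even with everyone running their tests correctly.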

The idea that many researchers simply aren’t running and analyzing their experiments properly surfaced in yet another article, one focused on understanding why mouse models often fail to predict drug responses in human diseases. A survey of 76 influential animal studies “found that half used five or fewer animals per group,” and many failed to properly randomize mice into control and treated groups. In a similar vein, a recent study compared the data obtained by two research groups that were testing cancer cell lines for their susceptibility to anti-cancer drugs. While some of the drugs gave similar results in both studies, the majority did not.

These findings dovetail with the widely reported observations of researchers at Amgen and Bayer Healthcare, who were unable to reproduce the data in most of the high-profile academic papers they tested. Their failure to replicate these experiments left them with a morass of untrustworthy data, and they decided not to move forward with plans to develop new medicines based on this information. This quagmire is territory where Big Pharma doesn’t want to find itself, given its increasing reliance on academia for new drug candidates, along with the widespread downsizing of many of its internal research programs.

Recognizing that this failure to replicate experiments was a serious problem in science, a for-profit group called the Science Exchange put forth a potential solution known as the Reproducibility Initiative. I’ve argued previously that the Reproducibility Initiative has its heart in the right place, but that it will fail for a number of reasons, including a lack of grant funding to pay for repeating the experiments as well as a number of other scientific and cultural issues. There is good news to report: a philanthropic organization contributed $1.3M to have 50 high-profile cancer biology articles put through the validation wringer, with results expected by the end of 2014. The average cost of $26,000 per article to repeat certain key experiments is quite high, and where funding would come from to pay for additional analyses is uncertain. The future of the initiative likely depends on how informative this pilot study turns out to be.

Other laboratory practices have recently been put under the microscope. A report from the Global Biological Standards Institute identified problems with standardization that were pervasive across both academic and industrial research settings. The authors say that irreproducibility “stems from undefined variance in reagents, practices, and assays between laboratories.” They advocate for the expansion of standard practices and reagents via educational programs and policy initiatives. Their report identified statistical analysis of data as one of many problematic areas, warning of “differences in statistical methods, including use of different mathematical approaches to analyze data or use of statistical approaches that might not be optimal for the particular data type.”

I’m not an expert on statistical methodology in the biological sciences, and statistics are not always required to make novel scientific discoveries. Certain problems in biology are focused on outcomes where statistics aren’t needed. Much of the work I did in my career was in cloning genes encoding previously undiscovered growth factors; the only possible outcomes were that I had cloned a new gene, or I hadn’t. While many of us trained in the biological sciences took a biostatistics class somewhere in our schooling, it is often a distant memory by the time we’ve landed a principal investigator position. One of my former colleagues once submitted a paper that came back with a reviewer’s comment that the statistical analysis should have included the Bonferroni correction. Neither he nor I had any idea what this was, but he was able to get assistance from one of our company’s statisticians to fine-tune the analysis. Knowing how to perform a particular statistical test isn’t sufficient; it’s equally important to know which statistical test is the best one to apply.
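
For readers who, like us back then, have never met it: the Bonferroni correction guards against false positives when many hypotheses are tested at once, simply by dividing the significance threshold by the number of tests. A minimal sketch in Python, with made-up p-values for illustration:

    # Bonferroni correction: a minimal sketch with invented p-values.
    # When m hypotheses are tested together, comparing each raw p-value
    # against alpha/m (instead of alpha) keeps the chance of even one
    # false positive across the whole family at or below alpha.
    raw_p_values = [0.001, 0.020, 0.035, 0.048]  # hypothetical results
    alpha = 0.05
    m = len(raw_p_values)
    bonferroni_threshold = alpha / m  # 0.0125 here

    for p in raw_p_values:
        verdict = "significant" if p < bonferroni_threshold else "not significant"
        print(f"p = {p:.3f}: {verdict}")
    # Only p = 0.001 survives; the other three would have "passed"
    # at the naive 0.05 cutoff.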

If reproducibility problems in the biological sciences are due to a mixture of poor statistical analysis and improperly done experiments, how do we fix this? And for all of the articles that we’ve read over the past few years pleading for new programs to get students amped up and interested in STEM careers, maybe a bigger problem is the insufficient training of a significant percentage of current scientists. These problems can’t be fixed quickly, but I’ll suggest two possible solutions that are not mutually exclusive:

Require Manuscripts to Have Solid Experimental Statistics

Journals, if they don’t already do so, should engage peer reviewers charged with making sure the statistics in submitted studies are up to snuff. At least one reviewer on every paper should be someone the editors have vetted as having a strong background in statistical analysis. Given that the federal government funds a large percentage of research studies via the NIH, it should consider assembling an extramural statistical group to help grantees with both the design of their experiments and the analysis of their data.

Exactly where the funding for such a group would come from is unclear, and we know that academia is already suffering greatly because of the across-the-board federal budget cuts known as the “sequester.” Big Pharma should be willing to throw some money in this direction, because its ability to develop new drugs is increasingly tied to the quality of the data coming out of academia. The industry currently provides at least 40 percent of the funding for the Tufts Center for the Study of Drug Development, a non-profit research group that analyzes various issues in the pharmaceuticals business. Companies could easily endow an independent organization to take on the task of helping with the statistical analysis of data and developing research standards. This is especially true given the enormous increase in future industry revenues predicted by the IMS Institute for Healthcare Informatics, assuming the successful adoption of the Affordable Care Act (aka Obamacare). Big Pharma companies have already come together to form TransCelerate Biopharma, whose focus is on solving industry-wide problems; the issues outlined above could certainly be part of this group’s mandate.

Alternatively, researchers could turn to private contract research organizations that have the statistical talent on staff to evaluate the data. How this additional statistical analysis would be paid for remains to be determined, but again, it’s a logical place for Big Pharma to invest some money from its multi-billion dollar war chests. Industry contributions to such an effort (in the form of unrestricted grants) could be proportional to the sales income or profitability of individual members.

Provide Better Training in Experimental Design and Implementation

Most graduate programs in the biological sciences feature a wide variety of coursework; the classes you take depend on your specific area of study. Courses in biostatistics are commonly available, but are probably not widely required. Getting universities to offer a broader spectrum of statistics classes might be helpful in the long run, but this would do little to solve the problem in the short term. I wonder how many departments or graduate programs actually offer courses on how to properly design and execute experiments; perhaps they need to add them. I think it’s generally assumed that this subject is a mentor’s responsibility, but suppose the mentors either don’t have the proper training themselves, or won’t take the time to teach the people in their labs? Having said that, I think that some procedures don’t readily lend themselves to fixing in the real world. The suggestion that researchers blind themselves so that they don’t know which animals get which treatments is one such example. This could readily be done in an industrial lab, but I’m having a hard time seeing how graduate students and post-docs, who frequently work alone while doing their experiments, could accomplish this task. (The mechanics of randomizing and coding the groups, at least, are simple to script, as sketched below; the hard part is having a second person to hold the key.)
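
A hypothetical sketch in Python (the animal IDs and vial labels are invented): a colleague or core facility runs it, keeps the key, and hands the experimenter only the coded labels until the data are collected.

    # Blinded randomization: a hypothetical sketch. A colleague runs
    # this and keeps 'key'; the experimenter sees only coded vial
    # labels until the data are collected.
    import random

    animals = [f"mouse-{i:02d}" for i in range(1, 21)]   # 20 animals
    treatments = ["control"] * 10 + ["treated"] * 10     # 10 per group
    random.shuffle(treatments)                           # randomize assignment

    # Unblinded key (animal -> treatment), held by the third party.
    key = dict(zip(animals, treatments))

    # Opaque dose labels for the experimenter (animal -> coded vial).
    codes = random.sample(range(1000, 10000), len(animals))
    labels = {a: f"vial-{c}" for a, c in zip(animals, codes)}

    for animal in animals:
        print(animal, labels[animal])   # what the experimenter sees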

One other problem that may be affecting experimental design across numerous disciplines is the reduced availability of grant money. Increasing the number of animals per group, for example, may be statistically advisable, but it will also raise the cost of an experiment. Researchers may be cutting back on the size of their experimental groups as a way to save money without realizing that doing so puts the outcome at risk. While understandable, this practice winds up being penny wise but pound foolish if it results in conclusions that are simply wrong and can’t be reproduced by others.
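
A quick power calculation shows just how steep that tradeoff is. Here's a sketch using Python's statsmodels, with an assumed "large" effect size (Cohen's d = 0.8; a real study would plug in an estimate from pilot data):

    # Sample size vs. statistical power: a sketch for a two-sample
    # t-test with an assumed large effect (Cohen's d = 0.8).
    from statsmodels.stats.power import TTestIndPower

    analysis = TTestIndPower()

    # Animals needed per group for 80% power at alpha = 0.05:
    n_needed = analysis.solve_power(effect_size=0.8, alpha=0.05, power=0.8)
    print(f"animals per group needed: {n_needed:.1f}")   # ~25.5

    # Power actually achieved with only 5 animals per group:
    p5 = analysis.solve_power(effect_size=0.8, alpha=0.05, nobs1=5)
    print(f"power with n = 5 per group: {p5:.2f}")       # ~0.18

In other words, the five-animal groups reported in the survey cited above would miss even a large effect roughly four times out of five.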

Several years ago I was asked to advise an academic group that wanted to start a new biotech company. They thought they had come up with a novel Alzheimer’s treatment using bioactive peptides. While reviewing their limited data I saw that they hadn’t included any negative control peptides in their experiments. Even a beginning graduate student should have recognized this was highly problematic. I inquired about this and was informed that they simply didn’t have the funds to purchase and include them in their study. I told them that investors were unlikely to back a company without this critical piece of data. Unfortunately, they simply couldn’t find the money to do these additional experiments, and the entire project was eventually abandoned.

Statistics can be used well or they can be abused. This observation was nicely captured in the old saw “There are three kinds of lies: lies, damned lies, and statistics.” The expression was popularized by Mark Twain (among others) to explain how politicians bolster weak arguments using dubious data. Valen Johnson’s analysis indicates that many scientists (who don’t know any better) also use statistics poorly, resulting in weak data that doesn’t hold up to careful scrutiny. There’s an old joke in the biological sciences: a researcher sharing his data with colleagues reported that “33 percent of the animals responded positively to the treatment, 33 percent of the animals showed no response, and the third mouse ran away.”

Given the current state of affairs, this joke isn’t sounding too funny anymore.

Stewart Lyman is Owner and Manager of Lyman BioPharma Consulting LLC in Seattle. He provides strategic advice to clients on their research programs and collaboration management issues, as well as preclinical data reviews.


  • David Miller

    A good start would be a certification/recertification requirement for CMEs specific to biostatistics. The number of practicing clinicians who cannot answer the following question correctly is staggering:

    Two identical trials and patient populations. Which drug is the superior drug for patients?

    Drug A: p-value = 0.001, Hazard Ratio=0.89
    Drug B: p-value = 0.01, Hazard Ratio=0.69

    That’s the easy question. Here’s the one that most everyone screws up, clinicians, researchers, and patients:

    Two identical trials and patient populations. Which drug is the superior drug for patients?

    Drug C: Median survival 2.0 months, Hazard Ratio = 0.49
    Drug D: Median survival 3.5 months, Hazard Ratio = 0.78

    • Anonymous

      I’m gonna say B & C. BTW I have absolutely no experience or background education in this field, but I do find it interesting and I’d like to know the right answer. Thanks, David.

      • david

        You’re right.

        In the first example, people often get confused and think the p-value has something to do with efficacy. It does not. All the p-value tells you is how likely you’d be to see a result at least that strong by chance alone if the drug actually did nothing; it says nothing about the size of the benefit.

        The second example highlights the median versus the entire population. The median tells you how a single patient from each arm performed: the middle patient. That’s useful information, but the hazard ratio is a far superior measure because it describes how the entire population performed, including those who didn’t respond as well and those who responded really well. Especially when looking at interim (immature) data, the median can be pretty deceptive.
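
        To make that concrete, here’s a toy simulation (all numbers invented, nothing to do with any real trial): two hypothetical drugs against the same control, one giving everyone a modest benefit and one giving a big benefit to only a quarter of patients. The medians come out nearly identical; survival at one year does not.

            # Toy survival simulation: why the median can hide what the
            # hazard ratio (or the full curve) reveals. All numbers invented.
            import numpy as np

            rng = np.random.default_rng(0)
            n = 100_000                      # large n so estimates are stable
            lam = np.log(2) / 2.0            # control hazard: 2-month median

            control = rng.exponential(1 / lam, n)

            # Drug X: modest benefit for everyone (hazard ratio = 0.78)
            drug_x = rng.exponential(1 / (lam * 0.78), n)

            # Drug Y: no benefit for 75% of patients, 5x lower hazard for 25%
            lucky = rng.random(n) < 0.25
            drug_y = np.where(lucky,
                              rng.exponential(1 / (lam / 5), n),
                              rng.exponential(1 / lam, n))

            for name, arm in [("control", control), ("drug X", drug_x),
                              ("drug Y", drug_y)]:
                print(f"{name}: median {np.median(arm):.1f} mo, "
                      f"alive at 12 mo: {np.mean(arm > 12):.1%}")
            # Medians: ~2.0, ~2.6, ~2.7 months. Alive at one year:
            # ~1.6%, ~3.9%, ~12%. Similar medians, very different outcomes.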

        • Anonymous

          Awesome. Thank You!!!

  • http://www.hollyip.com/ Suleman Ali

    When you’re looking at statistical significance, you can influence the p-value a lot by choosing how to define the parameters. For example, if you have a reading of 70 versus 65, with 60 in the control group, you can either compare 70 to 65, or you can subtract the control from both values and compare 10 with 5 to improve the apparent significance. Statisticians can choose either to cooperate with such approaches or to remain objective. However, if the result is not significant it won’t get published.

  • jl

    Interesting topic. One reason results don’t show up is also the warped financial models used to cherry-pick projects, and the incredible amount of cost loaded onto product development.
    best