The Science Reform movement seeks to change how experimental research is done. It was originally motivated by observations that questionable research practices were producing an appallingly high rate of false or unreproducible findings in some branches of Psychology. Reform initiatives based on studies of scientific practice in Psychology are now being applied to research in other fields. When practitioners in those fields push back, the pushback is often met with an insinuation that they simply don't want to do good science. But there could be other explanations. A recent commentary in Nature by David Peterson (1) raises a very important point: efforts to improve rigor in any discipline must start with understanding research practices within that discipline.
Take Biology, for example. Some complain that most biologists use statistical methods incorrectly. Based on nearly four decades of experience in experimental Biology research, I think this is true, at least in some subfields. In particular, many studies purport to use Null Hypothesis Significance Tests (NHSTs) and report p-values, yet fail to use the prospective study designs on which those tests depend. This raises legitimate concerns about rigor and reproducibility, because researcher flexibility can greatly elevate false positive rates (2).
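To make the concern concrete, here is a minimal simulation, my own sketch rather than anything from Simmons et al. (2), of a single researcher degree of freedom: peeking at the p-value as data accumulate and stopping as soon as it crosses .05. Under a true null, the planned single test holds the nominal 5% rate, while the flexible version inflates it several-fold. All parameters (sample sizes, peek interval) are arbitrary choices for illustration.

```python
# Toy simulation of one "researcher degree of freedom": optional stopping.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def false_positive_rates(n_max=100, step=10, alpha=0.05, n_sims=2000):
    fixed = flexible = 0
    for _ in range(n_sims):
        x = rng.normal(size=n_max)   # both groups drawn from the same
        y = rng.normal(size=n_max)   # distribution: the null is true
        # Prospective design: one test at the planned sample size.
        if stats.ttest_ind(x, y).pvalue < alpha:
            fixed += 1
        # Flexible design: test after every `step` observations per group
        # and stop as soon as the result is "significant".
        for n in range(step, n_max + 1, step):
            if stats.ttest_ind(x[:n], y[:n]).pvalue < alpha:
                flexible += 1
                break
    return fixed / n_sims, flexible / n_sims

fixed, flexible = false_positive_rates()
print(f"planned single test: {fixed:.3f}")     # ~0.05, the nominal rate
print(f"optional stopping:   {flexible:.3f}")  # roughly 0.15-0.20
```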
But it also raises an interesting question: if null hypothesis testing were the backbone of the scientific method, as some suppose, and if this method has not been used, or has been used mostly incorrectly, how has so much progress occurred? Knowledge in Biology has exploded in the past century, producing detailed models of basic biological processes at an ever-increasing pace. Real-world practical applications such as immune therapies for cancers or SARS-CoV-2 vaccines would not have been possible if these models were not essentially correct. So how is it that Biology is not drowning in a sea of mostly false positive results? In short, my answer is that the hypothetico-deductive method is not the backbone of, nor even a substantial component of, the scientific method in experimental, basic research in many fields of Biology.
Here are just a few observations from my experience of Biology research that differ dramatically from the fields of Psychology to which the science reform movement was originally responding.
First, in Biology the goal of research is often to develop concrete, causal, mechanistic models. Experiments thus arise within a strong theoretical context, and hypotheses are highly constrained by other information, such as knowledge about the components involved (e.g. specific proteins) and the laws governing their behavior (e.g. biophysics or thermodynamics). Why does this matter? The capacity to "present anything as significant" in other fields depends critically on the freedom to advance arbitrary hypotheses about associations in the absence of a theoretical basis or causal constraints.
Second, evidence for a model in Biology accrues over many experiments, with researchers explicitly seeking independent approaches that rely on different assumptions (e.g., biochemistry, genetics, structural biology, comparative biology). Provisional models become the basis of derivative experiments, which implicitly replicate earlier results. Thus, important conclusions at the level of the field do not hinge critically on any one study or on a "significant" effect in a single experiment. Why does this matter? An "effect" in a single experiment in Biology has the humble epistemic status of one loop in a strip of Velcro: it is never sufficient to establish a "fact".
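A back-of-the-envelope calculation illustrates the point (my toy model, with hypothetical numbers throughout, not anything from the commentary). Suppose each independent approach provides only modest evidence for a model, say a likelihood ratio of 3-4. Because the approaches rest on different assumptions, their contributions combine roughly multiplicatively:

```python
# Toy Bayesian sketch (all numbers hypothetical): independent lines of
# evidence, each individually modest, combine multiplicatively.
prior_odds = 1 / 9  # model initially thought unlikely: P(model) = 0.10
likelihood_ratios = {           # each LR is a made-up "modest" strength:
    "biochemistry":         3.0,  # data 3x likelier if the model is right
    "genetics":             4.0,
    "structural biology":   3.5,
    "comparative biology":  3.0,
}

odds = prior_odds
for approach, lr in likelihood_ratios.items():
    # Multiplying odds assumes the approaches fail independently, which is
    # exactly what choosing methods with different assumptions buys you.
    odds *= lr
    print(f"after {approach:<20} P(model) = {odds / (1 + odds):.2f}")
```

No single step is decisive, and removing any one line of evidence degrades the conclusion only gradually; it is the convergence of independent approaches, not any lone "significant" result, that carries the weight.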
Third, studies in Biology often rely on a tight iterative loop of experiment, observation, and model revision for their steady progress (Figure). Researchers don't just "peek" at the data along the way; they pore over it. Prospective study design is simply a bad fit for this workflow, which may explain why some biologists resist the idea of preregistration. In other words, perhaps the vast majority of experimental Biology research is what some statisticians would call "exploratory" or "hypothesis generating". If so, we should focus more on how to ensure or assess the rigor of inferences in this type of research.

It is possible that Biology's historical success is not in spite of, but because of, flexibility in data collection and analysis. An under-appreciated point is that deviations from an original study design can either increase or decrease false positives, depending on the decision heuristics; flexibility may also serve other functions, such as efficient search or avoiding false negatives. To determine whether and what kinds of harm or benefit researcher flexibility causes in Biology, it will be necessary to study the workflow of biologists.
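A toy contrast makes the point (again my own sketch with arbitrary parameters, not a claim about any actual lab's practice). Under a true null, the heuristic "if the first experiment misses, try again and report whichever attempt worked" doubles the false positive rate, while the heuristic "treat a significant result as provisional until an independent experiment confirms it" drives the rate far below nominal. Both are deviations from a fixed prospective design; only one is harmful.

```python
# Two "deviations from the plan" under a true null (toy model): one
# heuristic inflates false positives, the other suppresses them.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
alpha, n, n_sims = 0.05, 30, 5000

def experiment():
    """One two-sample comparison with no real effect present."""
    return stats.ttest_ind(rng.normal(size=n), rng.normal(size=n)).pvalue

retry = replicate = 0
for _ in range(n_sims):
    # Harmful heuristic: a miss prompts a second attempt, and either
    # "significant" attempt gets reported. (Short-circuit `or` means the
    # second experiment runs only after a first miss.)
    if experiment() < alpha or experiment() < alpha:
        retry += 1
    # Protective heuristic: a hit is provisional until an independent
    # experiment confirms it.
    if experiment() < alpha and experiment() < alpha:
        replicate += 1

print(f"retry until significant: {retry / n_sims:.4f}")      # ~0.10, 2x nominal
print(f"require confirmation:    {replicate / n_sims:.4f}")  # ~0.0025, far below
```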
In conclusion, statistical malpractice absolutely should be corrected. But in some fields of Biology, the best correction for now may not be to enforce correct NHST methods, but rather to drop the pretense of using NHST and instead do a better job of describing the true epistemic basis (and status) of our conclusions, however complex or qualitative. Some ways Metascience could help Biology improve rigor include: (1) documenting, codifying, and validating the de facto decision paths and inference heuristics that have been successful in Biology in the past; (2) identifying the failure points that actually arise in Biology research, so we can focus on ways to avoid or detect them; and (3) identifying alternative statistical methods that are more compatible with the established, effective research workflow.
Let’s fix what’s broken without breaking what’s working.
(1) Peterson D. (2021) The replication crisis won't be solved with broad brushstrokes. Nature 594, 151. doi:10.1038/d41586-021-01509-7
(2) Simmons JP, Nelson LD, Simonsohn U. (2011) False-Positive Psychology: Undisclosed Flexibility in Data Collection and Analysis Allows Presenting Anything as Significant. Psychological Science 22(11), 1359-1366. doi:10.1177/0956797611417632