Weighing the Weight of Evidence

We burn witches; we also burn wood. Wood doesn’t sink in water, nor does a duck. If the woman weighs the same as a duck, then she is made of wood, and therefore a witch. The woman weighs the same as a duck. Therefore, the woman is a witch. (Monty Python and the Holy Grail, 1975)

In Monty Python and the Holy Grail, this logic was used to determine that a woman was a witch after she sat on one side of a scale with a duck on the other. Weighing the evidence was also used in the real world: in sixteenth-century Germany and Holland, suspected witches were weighed to determine whether they were light enough to fly on brooms.

Recently (2022), the National Academy of Sciences (NAS) weighed in on how EPA judges scientific evidence to determine “causality of health and welfare effects.” How we assess a body of science or other evidence to determine causality affects virtually every public decision we make in law or regulation. The question before us is whether the current “weight of evidence” methodologies employed by courts and federal agencies should be replaced with advances in computational methods that identify hidden confounders more likely to be the root cause. An example is the hypothesis that obesity causes heart disease, when diet quality, a confounder, affects both obesity and heart disease.
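To make the confounding point concrete, here is a minimal sketch on synthetic data (the variable names and effect sizes are my own illustrative assumptions, not figures from any EPA or NAS analysis). In the toy model, poor diet raises the risk of both obesity and heart disease while obesity itself has no direct effect; the crude comparison nevertheless suggests one, and it largely disappears once the confounder is held fixed.

```python
# A minimal sketch on synthetic data (illustrative assumptions only): a confounder
# such as poor diet can make obesity and heart disease look causally linked even
# though, in this toy model, obesity has no direct effect on heart disease.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

poor_diet = rng.binomial(1, 0.4, n)                        # confounder
obesity = rng.binomial(1, 0.2 + 0.4 * poor_diet)           # caused partly by diet
heart_disease = rng.binomial(1, 0.05 + 0.15 * poor_diet)   # caused by diet only

# Crude (unadjusted) comparison: obesity appears to raise heart disease risk.
crude = heart_disease[obesity == 1].mean() - heart_disease[obesity == 0].mean()

# Stratify on the confounder: the apparent effect largely disappears.
stratified = np.mean([
    heart_disease[(obesity == 1) & (poor_diet == d)].mean()
    - heart_disease[(obesity == 0) & (poor_diet == d)].mean()
    for d in (0, 1)
])

print(f"crude risk difference:      {crude:.3f}")       # clearly positive (~0.06)
print(f"stratified risk difference: {stratified:.3f}")  # close to zero
```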

Most regulatory and court decisions (other than in criminal courts) attempt to assess evidence by weighing the body of research, i.e., the “weight of evidence” (WoE). This makes intuitive sense, but there are many instances where the scales aren’t fairly balanced. Decades of research by psychologists have shown convincingly that expert judgments tend to be vastly inferior to data-driven methods as guides to what is true (Pinker, 2021; Tetlock and Gardner, 2015; Kahneman, 2011). Nevertheless, the NAS found that the current use of the WoE approach by EPA (and other regulatory agencies) was consistent with its use by courts and that such consistency warranted continued use.

This decision appears to be based on historical expediency and needs to be revisited. EPA’s framework for weight of evidence has three basic steps:

  1. Assemble the evidence,

  2. Weigh the items of evidence, and

  3. Weigh the body of evidence.

Assembling the evidence is not a straightforward task. For example, EPA is currently re-examining PFAS, a diverse class of widely used chemicals now found in water, air, fish, and soil. A search on “PFAS” (10/24/2022) in Google Scholar brings up an imposing 36,000 papers.

Once the studies to be reviewed have been selected from tens of thousands of candidates, weighing the evidence has rested on how persuasive individual articles are to the people chosen to review them, who may bring their own biases and worldviews to the task. Over the last decade or so, scholars have examined human studies (epidemiology) and found enormous variability in the quality of papers, particularly in determining causality. Chronic diseases such as cancer and heart disease, for example, result from multiple causes that sometimes influence one another, and determining, or even adequately defining, a single factor’s contribution to causality remains a challenge. A primary difficulty is that the exposure of interest may merely be correlated with the disease, while a confounder is more likely the actual cause.

For example, for many years stress was considered to be the cause of peptic ulcers because the appearance of ulcers was correlated with stress. However, it turned out that the more likely cause was a bacterium, Helicobacter pylori.

Finally, a serious issue arises when weighing a body of evidence based on academic research. To publish in a scientific journal, get funded, and achieve tenure, academics face a bias among reviewers and journals toward accepting only results that report a positive exposure-response relationship, e.g., that a chemical “causes” a disease. Positive associations, routinely conflated with causation, are also acceptable in scientific publications related to health risks (Gerrits et al., 2019). One investigator analyzed 4,600 papers published between 1990 and 2007 and found that over 80% supported the a priori hypothesis, i.e., reported a positive relationship. This means that many likely high-quality studies showing no association, or a negative one, are never published. This unfairly tips the scales in favor of papers that report a positive relationship.
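A back-of-the-envelope simulation makes the distortion visible. The numbers below are purely illustrative assumptions of mine (they are not drawn from the 4,600-paper analysis or any real literature): even when the true effect is zero, if only statistically significant positive findings reach print, the published record still shows a seemingly consistent positive effect.

```python
# A rough, purely illustrative sketch (assumed numbers, not the cited analysis):
# when only statistically significant "positive" findings get published, the
# published record looks positive even though the true effect is zero.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
true_effect = 0.0                      # no real exposure-response relationship
n_studies, n_per_arm = 1_000, 50

published = []
for _ in range(n_studies):
    exposed = rng.normal(true_effect, 1.0, n_per_arm)
    control = rng.normal(0.0, 1.0, n_per_arm)
    t_stat, p_value = stats.ttest_ind(exposed, control)
    estimate = exposed.mean() - control.mean()
    if estimate > 0 and p_value < 0.05:  # only "positive" findings reach print
        published.append(estimate)

print(f"true effect:                {true_effect}")
print(f"share of studies published: {len(published) / n_studies:.0%}")
print(f"mean published effect:      {np.mean(published):.2f}")
```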

An alternative to WoE, also evaluated in the NAS report, is “probability of causation” (PoC). Using the same types of evidence used in WoE studies, PoC uses statistical techniques to model manipulating a potential cause (e.g., reducing exposure) and to examine the effect of that manipulation on the health outcome. In doing so, it estimates the likelihood (probability) that individual factors cause disease in light of all other possible causes and associations.
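As one hedged illustration of how such a calculation can work (this is my own simplified formalization on synthetic data, not the specific method the NAS reviewed), the sketch below estimates what disease risk would be if exposure were set to present or absent for everyone, adjusting for a measured confounder, and then expresses the excess risk as a probability that the exposure caused the disease among the exposed.

```python
# One simplified formalization of "probability of causation" on synthetic data
# (my assumption for illustration, not the method the NAS reviewed): estimate
# disease risk under a simulated manipulation of exposure, adjusting for a
# measured confounder, then express the excess risk as a probability.
import numpy as np

rng = np.random.default_rng(2)
n = 200_000

confounder = rng.binomial(1, 0.5, n)
exposure = rng.binomial(1, 0.2 + 0.3 * confounder)
# In this toy model the exposure has a real effect (0.10) on top of confounding.
disease = rng.binomial(1, 0.02 + 0.10 * exposure + 0.08 * confounder)

def risk_if_exposure_set_to(x):
    """Estimated risk if exposure were set to x for everyone (standardization)."""
    return sum(
        disease[(exposure == x) & (confounder == c)].mean() * (confounder == c).mean()
        for c in (0, 1)
    )

risk_exposed = risk_if_exposure_set_to(1)     # risk if exposure were imposed
risk_unexposed = risk_if_exposure_set_to(0)   # risk if exposure were removed
poc = (risk_exposed - risk_unexposed) / risk_exposed  # excess fraction among exposed

print(f"risk with exposure:       {risk_exposed:.3f}")    # about 0.16
print(f"risk without exposure:    {risk_unexposed:.3f}")  # about 0.06
print(f"probability of causation: {poc:.2f}")             # about 0.62
```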

Despite acknowledging the problems with WoE, the NAS decided that there was “no evidence to show that application of more formal methods” would be more reliable “than the consensus approach” characterized by WoE. However, such evidence can be found in abundance in a massive research literature developed over the past 70 years rigorously comparing the performance of consensus judgments (e.g., WoE) to data-driven (e.g., PoC) approaches (Pinker, 2021; Tetlock and Gardner, 2015; Kahneman, 2011; Meehl, 1986 and 1954).

This doesn’t end the discussion. If we are to avoid relying on periodic witch-hunting logic in the garb of “consensus science” and “weight of evidence” in our federal agencies (and courts), we need to look to sounder methods that will result in better judgment and better policies.

Richard Williams