Separating fact from fiction in chemical risk assessments: how reliable are evidence reviews?

June 29, 2015 at 9:26 am | Posted in Feature Articles | 1 Comment

Fact or fiction?

In which the preliminary findings of new research suggest that the methods used to evaluate evidence of health risks from chemicals may be in need of serious overhaul if they are to give accurate accounts of what is and is not known about chemical toxicity.


The importance of evidence reviews

It may go without saying, but evidence reviews play a fundamental role in distinguishing what is known in science from what is not. Chemicals policy has to avoid two errors: allowing chemicals to be used in circumstances with too great a potential for causing harm, and over-estimating harm such that unnecessary restrictions are imposed on chemical use. To inform appropriate risk-management decisions, there is therefore a premium on accurately identifying what is already known about a chemical’s toxicity and what remains to be determined.

The problem is, it seems that for almost every conclusion of a review about the risks to health posed by a chemical, it is possible to find a dissenting view. A poster-child of this problem is bisphenol-A, the ubiquitous monomer used in manufacturing food can linings, till receipts and polycarbonate plastics. Although several thousand studies investigating its potential toxicity have been published, what all this research ultimately means for health risks is still dividing the scientific and regulatory communities. It does not take long to find contradictory conclusions from expert reviews of evidence about potential health risks posed by BPA, which include:

  • “low health concern from aggregated exposure” (EFSA 2015)
  • “a potential risk to the unborn children of exposed pregnant women [relating to] a change in the structure of the mammary gland” (ANSES 2013)
  • “a TDI for BPA has to be 0.7 μg/kg bw/day or lower to be sufficiently protective” (National Food Institute, Denmark 2015)
  • “BPA is safe at the current levels occurring in foods” (US FDA 2014)
  • “low dose effects have been demonstrated for BPA [at] 1–4 magnitudes of order lower than the current LOAEL of 50 mg/kg/day” (Vandenberg et al. 2014)

This conflict ought to seem a bit strange. All the groups quoted above have, at least in theory, access to the same information about the toxicity of BPA. And after so much published research, one might expect that some degree of consensus on the health risks posed by BPA should be attainable. Yet we have five expert opinions on BPA safety which cannot all be true at the same time. How do we explain this?

One thing we can look at is how evidence is being reviewed by experts when they are coming to conclusions about the risks to health posed by a chemical.

To do this, we can define an ideal “gold standard” evidence review process which we can expect to produce consistent, valid results of what the available research says is and is not known about the risks to health posed by a chemical. If we find that the review practices employed by each expert group are different to the gold standard, it may be possible to explain not only how the differences in process are producing the differences in results, but also the extent to which these processes are producing credible answers regarding what is known about the health risks posed by a chemical.

What evidence reviews should do

One could fill several large textbooks with a detailed account of what should be done when reviewing evidence of risks to health posed by a chemical. The basics, however, are fairly simple: a review should start with a clear and unambiguous question of research importance relating to potential risks posed by the chemical in question; it should find all the evidence which might help answer that question; and from that evidence, it should select (without cherry-picking) everything of actual relevance to the research question.

Then, the included evidence should be synthesized into an overall “average” answer about the effect which the chemical in question has on the health outcomes in which one is interested. This answer then needs to be qualified by the overall strengths and weaknesses of the evidence base. This is a complex sub-process, but it includes assessing the evidence for its internal validity (i.e. the extent to which the evidence gives an unbiased answer to the question it poses), the extent to which the evidence gives a consistent answer to the question of interest, and how directly relevant the evidence is to the question being asked, among other features.
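To make the idea of an overall “average” answer concrete, here is a minimal sketch of one common way of pooling study results: a fixed-effect, inverse-variance weighted mean. This is an illustration only, not the method of any review discussed in this article. The function name and the numbers are hypothetical, and weighting by precision is not the same thing as weighting by internal validity, which has to be assessed separately.

```python
import math

def pooled_estimate(effects, std_errors):
    """Fixed-effect inverse-variance weighted mean and its standard error.

    effects:    effect estimates, one per included study
    std_errors: standard errors of those estimates (all > 0)
    """
    if not effects or len(effects) != len(std_errors):
        raise ValueError("need one standard error per effect estimate")
    # More precise studies (smaller standard error) receive more weight.
    weights = [1.0 / se ** 2 for se in std_errors]
    total = sum(weights)
    mean = sum(w * e for w, e in zip(weights, effects)) / total
    return mean, math.sqrt(1.0 / total)

# Hypothetical effect estimates from three studies (made-up numbers).
mean, se = pooled_estimate([0.2, 0.5, 0.1], [0.1, 0.2, 0.1])
print(f"pooled effect = {mean:.3f}, standard error = {se:.3f}")
```

A pooled number like this is only the first half of the job: it still has to be qualified by the validity, consistency and relevance of the studies behind it.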

Research being conducted at Lancaster University (by the author of this piece, part of which CPES is supporting with a grant) breaks the ideal literature review process down into 19 components, against which the credibility of an individual review can be assessed. So how do existing review practices stack up against this gold standard?

Problems in how evidence is reviewed: how things look so far

So far, the results are preliminary. The toolkit for appraising the credibility of literature reviews is more-or-less finalized and an analysis of a test set of 20 systematic reviews has been conducted. Systematic reviews were chosen for the test set because they should, in theory at least, represent best practice in synthesizing evidence, more so than a random collection of traditional and systematic reviews would.

The following charts present the results of the application of the toolkit to the test set. All the studies examine toxicological end-points except Klumper et al., which is a systematic review of the economic benefits of GM crop cultivation (it was appraised out of curiosity).

Individual study analyses

Figure: Results - Individual Studies

This heat map shows how each study in the test set performs against each domain in the ideal review process. There are four possible outcomes for each domain. (A) means the review has performed satisfactorily against the gold standard, with transparent documentation of methods and no apparent errors likely to mislead the reader or bias the results of the review. (B1) means the review did not document any method at all for the domain. (B2) means the review did document some methodology, but gave too little information to allow its performance to be evaluated. (C) means the review used a method likely to mislead the reader or bias its results.
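The coding scheme can be sketched in a few lines of Python. This is a hypothetical illustration of how heat-map data of this kind might be stored and tallied per review. The review names and codes below are invented, and only five domains are shown rather than the 19 in the toolkit.

```python
from collections import Counter

# Outcome codes as described in the text.
CODES = ("A", "B1", "B2", "C")

# Hypothetical appraisal results: review name -> one code per domain.
# Only 5 domains are shown for brevity; these are not real results.
appraisals = {
    "Review 1": ["A", "A", "B1", "B2", "C"],
    "Review 2": ["A", "B1", "B1", "B1", "B2"],
    "Review 3": ["A", "A", "A", "B2", "C"],
}

def tally(codes):
    """Count how many domains received each outcome code."""
    counts = Counter(codes)
    unknown = set(counts) - set(CODES)
    if unknown:
        raise ValueError(f"unrecognised codes: {sorted(unknown)}")
    return {code: counts.get(code, 0) for code in CODES}

for name, codes in appraisals.items():
    print(name, tally(codes))
```

The count of (A) codes for a review is the number of domains in which it documented a satisfactory method; the B1 and B2 counts together show how much of the review cannot be appraised at all.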

The basic lesson here is that, in most reviews, the large majority of domains are too poorly documented to allow appraisal of the validity of the methods used for synthesizing evidence. If we don’t know what the reviews have done, we cannot judge the credibility of their conclusions.

Domain performance frequencies

Figure: Preliminary Results - Test Set Frequencies

This chart makes it easier to see where the systemic challenges are. Generally speaking, the reviews in the test set state clear and justified objectives, declare interests and accurately summarize in abstract form their methods and results. The problem is, this is the easy part: in the methodological meat of the review process, we see serious challenges to either the validity of the reviews, or our ability to interpret that degree of validity.
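A frequency chart of this kind is simply the same appraisal codes counted by domain rather than by review. The sketch below shows the idea with invented data. The domain labels are loosely taken from the text and are not the toolkit's actual domain names, and the codes are hypothetical.

```python
from collections import Counter

# Illustrative domain labels taken loosely from the text; the toolkit's
# actual 19 domain names are not given in this article.
DOMAINS = ["objectives", "protocol", "search", "selection", "study appraisal"]

# Hypothetical codes, one row per review, one column per domain.
ROWS = [
    ["A", "B1", "B2", "C", "B1"],
    ["A", "B1", "A", "B2", "C"],
    ["A", "B2", "B2", "B2", "B1"],
]

def domain_frequencies(domains, rows):
    """Return {domain: Counter of outcome codes across all reviews}."""
    for row in rows:
        if len(row) != len(domains):
            raise ValueError("each review needs one code per domain")
    return {
        domain: Counter(row[i] for row in rows)
        for i, domain in enumerate(domains)
    }

freqs = domain_frequencies(DOMAINS, ROWS)
for domain in DOMAINS:
    print(f"{domain:16s}", dict(freqs[domain]))
```

Reading down the domains rather than across the reviews is what separates a systemic weakness (a domain where almost nobody scores A) from a problem with one particular review.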

There appear to be particular challenges to the validity of the test set reviews in the apparent lack of awareness of the importance of following pre-specified protocols when synthesizing research (protocols are rarely mentioned but important for preventing expectation bias affecting the review process), in the integrity of the search and selection processes (where reviews often seem at risk of cherry-picking the evidence they are appraising), and in the methods used for appraising the validity of included studies (such that reviews are unlikely to be consistently putting more weight on the better studies and less on the worse).


The results show that most of the reviews in the test set are of dubious scientific quality, typically documenting satisfactory methods in only six domains or fewer. Since most reviews state a clear objective, declare interests and satisfactorily summarize in abstract form their methods and results, this leaves serious weaknesses in the conduct and reporting of the methodological meat of the reviews in the test set.

There is a serious question, then, as to the extent of the scientific value of many of these reviews. If their methods are too poorly documented to be scrutinized for validity, or if the methods they employ could be misleading readers about the true size of potentially adverse health effects of the compounds they are investigating, to what extent can they be considered the sort of document which can arbitrate in the matter of what is and is not known about the toxicity of a given substance?

If what is happening in the test set were also true of the reviews of BPA, then we would not be able to determine why the various expert groups disagree about the safety of BPA, because their methods would not be transparent enough for us to explain why they come to different conclusions. The next step in the project is therefore to evaluate several syntheses of evidence which have been conducted by regulatory scientific advisory committees, including the European Food Safety Authority’s recent final opinion on risks to health posed by BPA, and find out if this is the case.
