Everyone’s research problem

December 18, 2013 at 5:18 pm | Posted in Feature Articles

A short lesson in distinguishing better studies from worse

Error and STROBE

We all understand that the concept of the fair test is central to the practice of science. This is as true for simple school experiments, such as testing the effect that temperature has on the speed at which copper sulphate dissolves in water, as it is for conducting complex clinical trials to determine the effect that a drug has on speed of recovery of stroke patients.

The point of making sure the test is fair is to make sure it is the thing we are varying (be it water temperature or drug dose) which is determining the outcome of our experiments, rather than something else in the experimental set-up.

This concept of the fair test is important when we assess the safety of a compound such as bisphenol-A (BPA). Because we don’t want to judge a chemical as being safe when it is not (nor vice versa), and because some experiments are more likely to give correct answers than others, we need to assess the methodological quality of each individual piece of research about the safety of BPA to make sure we pay more attention to better studies and less attention to worse.

So far, so obvious. The surprise is that while we understand that experiments have to be fair tests, as a society it seems we are not necessarily very good at understanding that the process by which we evaluate experiments for methodological quality also has to be a fair test.

A fair test has to do two things: give a reasonably accurate result; and when different people do the same test, they need to come up with similar answers. When we are assessing the quality of a set of studies, this means a fair test of methodological quality has to be valid (i.e. studies which it says are worse actually have to be worse) and produce consistent results between evaluations.

Actually doing this is trickier than many of us might imagine: while 30 instruments for evaluating the quality of animal studies have been developed, only two of these have been rigorously tested for validity or consistency (Krauth et al. 2013). This means that we don’t know if a fair test is being used to evaluate whether or not an experiment is a fair test.

An illustrative case of the pitfalls in assessing methodological quality is given by an interesting and recent review of all published epidemiological research looking at the neurodevelopmental effects of prenatal and postnatal exposure to organophosphate pesticides (González-Alzaga et al. 2013).

This study uses compliance with STROBE criteria, a checklist which has been developed to strengthen reporting standards in epidemiological research (STROBE 2013), as a way to score the methodological quality of each study in the review.

There are nine reporting requirements in STROBE which relate to the methods used in an epidemiological study. González-Alzaga et al. calculate a score out of 9 based on the number of criteria met by each study in their review. This is then simplified to a quality grade for each study: low (meeting 1-3 criteria), medium (4-6 criteria) or high (7-9 criteria).
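The scoring-and-binning scheme described above can be sketched in a few lines of code (a minimal sketch; the function name and the true/false input representation are my own, not taken from González-Alzaga et al.):

```python
def strobe_score(criteria_met):
    """Equal-weight quality score: count how many of the nine STROBE
    methods criteria a study meets, then bin the count into a grade."""
    score = sum(criteria_met)  # criteria_met: nine booleans, one per criterion
    if score <= 3:
        grade = "low"
    elif score <= 6:
        grade = "medium"
    else:
        grade = "high"
    return score, grade

# A study meeting 4 of the 9 criteria lands in the "medium" band.
print(strobe_score([True, True, True, True, False, False, False, False, False]))
```

Note that the binning treats every criterion as interchangeable: meeting any four criteria yields the same grade, whichever four they are. That assumption is exactly what the rest of the article questions.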

The first problem with this approach is that it conflates the quality of reporting of a study with the quality of its conduct. Since what a study reports was done may not reflect what was actually done, relying solely on what study authors report can lead to over- or underestimation of the quality of a piece of research.

Much more interesting, however, is the use of STROBE as a quality scale. This gets to the heart of what quality appraisal is all about, and why it is so challenging.

What is the point of assessing the methodological quality of a study? It is to get a sense of whether or not we can believe its results: in other words, the probability and extent to which it might have under- or overestimated the effect size. So our metric for study quality should address precisely this.

The problem for González-Alzaga et al. is that they have assumed each STROBE criterion counts equally towards the overall credibility of a study. But why should this be the case? It is a stretch to assume that a failure to present key elements of study design early in a paper (one of the STROBE criteria) should count as much against a study as a failure to explain how loss to follow-up of cohort members was addressed (another of the criteria). [A full list of the criteria can be downloaded from the STROBE website.]

If we assume each quality criterion in fact has a different effect on study credibility, we can see how a checklist can fall apart as a proxy for measuring study quality (see figure).

Here we see that for a group of nine studies, three for each quality category, variance in study error is not correlated with the judgment of study quality: the high-quality category contains a study with a greater degree of error than two of the studies judged to be of lowest quality; and one of the medium-quality studies has the second-highest error margin.

To make it work as a checklist, the overall error which failure to meet each STROBE criterion introduces into a study would have to be measured. The temptation to combine criteria into a single judgement would have to be resisted, as it can easily result in studies judged to be of overall higher quality in fact giving worse results than studies judged to be of lower quality. It would also be important to make sure the STROBE criteria did not miss anything which can introduce error into a study’s results.
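This failure mode can be made concrete with a toy example. Assume (purely for illustration; these error weights are invented, not measured) that one criterion contributes far more error than the other eight when it is not met:

```python
# Hypothetical per-criterion error contributions (arbitrary "error units")
# incurred when a criterion is NOT met. Invented for illustration only;
# as the article argues, real values would have to be measured empirically.
CRITERION_ERROR = [1, 1, 1, 1, 1, 1, 1, 1, 20]  # the ninth criterion dominates

def checklist_grade(met):
    """Equal-weight STROBE-style grade: count criteria met, then bin."""
    score = sum(met)
    return "low" if score <= 3 else "medium" if score <= 6 else "high"

def total_error(met):
    """Error a study accrues from the criteria it fails to meet."""
    return sum(err for ok, err in zip(met, CRITERION_ERROR) if not ok)

# Study A meets eight criteria but misses the one that matters most.
study_a = [True] * 8 + [False]
# Study B meets only three criteria, but one of them is the critical ninth.
study_b = [True, True, False, False, False, False, False, False, True]

print(checklist_grade(study_a), total_error(study_a))  # high 20
print(checklist_grade(study_b), total_error(study_b))  # low 6
```

Under these assumed weights, the study graded "high" carries more than three times the error of the study graded "low": the checklist score and the quantity we actually care about have come apart.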

This might sound like a niche research problem, but in fact it is a problem for everyone: we are constantly presented with pieces of conflicting research and then challenged to decide which is the more credible. Yet we are clearly still at a very early stage of understanding how to do this – as the recent Policy from Science Project report showed, not even the European Food Safety Authority has a robust method for distinguishing better research from worse when it evaluates the safety of compounds such as BPA (Whaley 2013).
