Yesterday's topic was reviewing the domain of risk of bias. 

Remember that in GRADE, RCT-derived evidence starts at High confidence and is rated down for deficiencies in one or more of the following domains:

  1. Risk of bias
  2. Inconsistency
  3. Indirectness
  4. Imprecision 
  5. Publication bias

Today it's time to review inconsistency.

One of the hallmarks of the scientific method is reproducibility. Science proceeds by testing a hypothesis, recording results and testing again. Findings must be replicated. The category of inconsistency is based on this concept.

If study results are not consistently replicated in multiple trials, we have lower confidence that the results are real. 

For example, if all studies show approximately the same strength and direction of effect (all beneficial, or all harmful, at about the same level), then we generally accept that they are consistent. If nearly all studies agree but one or a few do not, yet the confidence intervals of those outlier studies overlap the confidence intervals of the more uniform studies, then we may still consider the body of evidence consistent.

So what does GRADE mean by inconsistency and how do you rate it?

Remember that GRADE works in the negative, with guidance on inconsistency the same as that for risk of bias: RCT-derived evidence starts at High and is rated down for inconsistent findings across studies. Note that GRADE's review of inconsistency is specific to binary (not continuous) outcomes and relative (not absolute) measures of effect. When GRADEing, reviewers should be evaluating risk ratios and hazard ratios (preferred) or odds ratios.

So the first step is to establish whether there are differences in results across the studies. Are both the strength and direction of the results similar? Do the confidence intervals overlap? Statistical tests of heterogeneity should also support that the results are sufficiently similar to be combined. 
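This first screening step can be sketched in a few lines of Python. The post itself contains no code, so the helper names and study values below are purely illustrative: three hypothetical trials reporting risk ratios with 95% confidence intervals.

```python
# Minimal sketch of the first screening step: do all study results point
# in the same direction, and do all confidence intervals overlap?
# Function names and study data are hypothetical, not part of GRADE.

def same_direction(estimates):
    """True if all point estimates (here, risk ratios) fall on the same
    side of the null value of 1.0 (all beneficial or all harmful)."""
    return all(e > 1.0 for e in estimates) or all(e < 1.0 for e in estimates)

def all_cis_overlap(intervals):
    """True if every pair of (lower, upper) confidence intervals overlaps.
    In one dimension this holds exactly when the highest lower bound is
    no greater than the lowest upper bound."""
    highest_lower = max(lo for lo, hi in intervals)
    lowest_upper = min(hi for lo, hi in intervals)
    return highest_lower <= lowest_upper

# Three hypothetical trials: risk ratios and their 95% CIs.
rrs = [0.80, 0.75, 0.85]
cis = [(0.65, 0.98), (0.60, 0.94), (0.70, 1.04)]

print(same_direction(rrs))    # True: all suggest benefit (RR < 1)
print(all_cis_overlap(cis))   # True: the intervals share a common region
```

A real systematic review would pair checks like these with a formal heterogeneity test rather than rely on them alone, but they capture the "similar strength, similar direction, overlapping intervals" question in the paragraph above.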

When evaluating inconsistency in the results, especially large inconsistency, it is important to first look for reasons for the inconsistency. Begin by looking for important differences in the populations, interventions, and outcomes assessed. Search for reasons the studies may have different results, such as differences in disease severity between studies or differences in dosing or delivery of the interventions. There may be important clinical reasons that the studies should not be combined. In that case, separate analyses may need to be performed for the different subpopulations.

It's important to realize when the data suggest that they shouldn't be combined into a single summary measure.

If data can be meaningfully evaluated in a stratified analysis, it would not be necessary to downgrade for inconsistency.

Inconsistent results could also be associated with differences in study quality (risk of bias) between studies. Most systematic reviews will eliminate studies with severe risk of bias from the review, but measurement error could still account for differences in studies with moderate risk of bias.

If large inconsistency remains after evaluating the evidence for explanations, you will downgrade for inconsistency.

GRADE says judging the extent of inconsistency is based on a collective assessment of the similarity of the results, the overlap of the confidence intervals and the results of statistical tests for heterogeneity.

GRADE lists these 4 criteria to assess for downgrading. Downgrade when:

  1. Point estimates vary widely across studies.
  2. Confidence intervals show little or no overlap.
  3. The statistical test for heterogeneity - testing the null hypothesis that all studies are evaluating the same effect - shows a low P-value.
  4. The I squared - which describes the percentage of total variation across studies due to real differences rather than random error - is large. Determining what is large is a judgement. GRADE suggests an I squared less than 40% "is low", 30-60% "may be moderate", 50-90% "may be substantial", and 75-100% "is considerable". 
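The statistics behind criteria 3 and 4 can be sketched with the standard fixed-effect formulas: Cochran's Q is the weighted sum of squared deviations from the pooled estimate, and I squared is the share of Q exceeding its degrees of freedom. This is a minimal Python sketch; the function name and the four trial values are hypothetical, not drawn from the post.

```python
# Sketch of Cochran's Q and the I-squared statistic for a fixed-effect
# meta-analysis of log relative effects. Study data are hypothetical.

import math

def heterogeneity(log_effects, std_errors):
    """Return (Q, df, I_squared).

    Q tests the null hypothesis that all studies estimate the same
    effect (compared against a chi-squared distribution with df
    degrees of freedom to get the P-value in criterion 3). I-squared
    is the percentage of total variation across studies attributed to
    real differences rather than random error (criterion 4)."""
    weights = [1.0 / se ** 2 for se in std_errors]   # inverse-variance weights
    pooled = sum(w * e for w, e in zip(weights, log_effects)) / sum(weights)
    q = sum(w * (e - pooled) ** 2 for w, e in zip(weights, log_effects))
    df = len(log_effects) - 1
    i_squared = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0
    return q, df, i_squared

# Four hypothetical trials reporting log risk ratios and standard errors.
log_rr = [math.log(0.80), math.log(0.75), math.log(0.85), math.log(0.78)]
se = [0.10, 0.12, 0.15, 0.11]

q, df, i2 = heterogeneity(log_rr, se)
print(f"Q = {q:.2f} on {df} df, I^2 = {i2:.1f}%")
```

Note that the effects are pooled on the log scale, consistent with working in relative measures (risk, hazard, or odds ratios) as described above.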

You know where to go for more detail - the GRADE HANDBOOK.

TheEvidenceDoc November 16, 2015



GRADE points are intended to be brief summaries of the major concepts of rating evidence using the GRADE approach.

To summarize last week's blog series, here's your first list of what you've learned so far.

  1. GRADE summarizes the evidence by outcome across all the relevant studies.
  2. GRADE rates confidence in the body of evidence for each outcome as High, Moderate, Low, or Very Low.
  3. Confidence in the body of evidence can vary for each outcome, even when derived from the same set of studies.
  4. GRADE uses 5 domains to rate confidence in the evidence derived from randomized controlled trials.
  5. The 5 domains used to rate confidence in the evidence derived from RCT studies are independent and given equal weight.
  6. Each domain can be downgraded by one or two levels.
  7. GRADE starts evidence derived from randomized controlled trials at High and rates down for deficiencies in any of the domains.
  8. Downgrading by 3 levels results in a Very Low rating, which is the lowest rating no matter how many reasons for downgrading exist.
  9. Study quality, or risk of bias, is only one of the 5 domains. As such, it only accounts for 20% of the overall GRADE.

More GRADE points in future blogs. 

And you can always read ahead and get more detail in the GRADE HANDBOOK.

TheEvidenceDoc November 13, 2017




In yesterday's first lesson of the mini-series, I presented the 5 domains GRADE uses to rate down your confidence in the quality of evidence derived from randomized controlled trials (RCTs). Perhaps you've correctly surmised that GRADE starts RCT evidence at the highest level and then subtracts, or downgrades, for insufficiencies in that evidence for each specific outcome. 

Those 5 domains for downgrading are:

  1. Risk of bias = limitations in the design and conduct of the studies that impact validity
  2. Inconsistency = lack of reproducibility of the effect across multiple studies
  3. Indirectness = differences in any of the PICO elements: the population tested differs from the one of interest, the intervention itself differs, or surrogate outcomes are used
  4. Imprecision = when confidence intervals around the effect estimate include both benefit and harm and impact the clinical decision threshold
  5. Publication bias = difficult to assess, but GRADE provides some indicators for suspecting that positive studies have been selectively published on the topic

Are you wondering how the downgrading works? The starting point for RCT evidence is an overall rating of High, which is the highest rating possible under the GRADE approach.

Each domain can be downgraded by 1 or 2 levels depending on the seriousness of the deficiency. 

Each domain is considered equal.

So if the evidence is downgraded by one level for risk of bias and one level for inconsistency it would go from a rating of High to Low.

The rating levels in the GRADE approach are:

  • High
  • Moderate
  • Low 
  • Very low

So after being downgraded by 3 levels, the evidence rating cannot be further reduced.
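The downgrading arithmetic described above is simple enough to express directly: start at High, subtract the total number of downgrade levels across all domains, and floor the result at Very low. This is an illustrative sketch only; the function name and structure are mine, and the level names come from the list above.

```python
# Sketch of GRADE downgrading arithmetic for RCT-derived evidence.
# The level names are from the post; the function is illustrative.

LEVELS = ["Very low", "Low", "Moderate", "High"]

def rate_rct_evidence(downgrades):
    """Start RCT evidence at High, subtract the total number of
    downgrade levels (each domain contributes 0, 1, or 2), and
    floor the result at Very low."""
    start = LEVELS.index("High")   # RCT evidence starts at High
    return LEVELS[max(0, start - sum(downgrades))]

# One level down for risk of bias plus one for inconsistency -> Low.
print(rate_rct_evidence([1, 1]))   # Low
# Downgrading by 3 or more total levels cannot go below Very low.
print(rate_rct_evidence([2, 2]))   # Very low
```

The `max(0, ...)` floor captures the point that Very low is the lowest possible rating no matter how many reasons for downgrading exist.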

Tomorrow I'll show you how to interpret these levels. Unless, of course, you can't wait to find out. You know where to go for the answers - the official GRADE handbook.

TheEvidenceDoc November 9, 2017