GRADE POINTS FOR CLINICAL GUIDELINE PANELISTS - INCONSISTENCY

Yesterday's topic was reviewing the domain of risk of bias. 

Remember that in GRADE, RCT-derived evidence starts at HIGH confidence and is rated down for deficiencies in one or more of five domains:

  1. Risk of bias
  2. Inconsistency
  3. Indirectness
  4. Imprecision 
  5. Publication bias

Today it's time to review inconsistency.

One of the hallmarks of the scientific method is reproducibility. Science proceeds by testing a hypothesis, recording results and testing again. Findings must be replicated. The category of inconsistency is based on this concept.

If study results are not consistently replicated in multiple trials, we have lower confidence that the results are real. 

For example, if all studies have results of approximately the same strength and direction of effect (either all beneficial or all harmful at about the same level), then we generally accept they are consistent. If nearly all studies have results of approximately the same strength and in the same direction, but one or a few studies do not, yet the confidence intervals in those outlier studies overlap confidence intervals in the more uniform studies, then we may still consider this body of evidence to be consistent. 
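The overlap check described above can be sketched in a few lines of Python. This is only an illustration, not a GRADE tool: the function names and the confidence intervals are invented for the example, and overlap is just one input to the overall judgement.

```python
def intervals_overlap(ci_a, ci_b):
    """Return True if two (lower, upper) confidence intervals share any value."""
    low_a, high_a = ci_a
    low_b, high_b = ci_b
    return low_a <= high_b and low_b <= high_a


def overlaps_all(outlier_ci, other_cis):
    """Return True if the outlier's interval overlaps every other interval."""
    return all(intervals_overlap(outlier_ci, ci) for ci in other_cis)


# Hypothetical risk-ratio intervals, invented for illustration only.
uniform_studies = [(0.60, 0.90), (0.65, 0.95), (0.55, 0.85)]
outlier = (0.80, 1.30)

if overlaps_all(outlier, uniform_studies):
    print("Outlier interval overlaps the others; may still be consistent.")
else:
    print("Outlier interval does not overlap; look for an explanation.")
```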

So what does GRADE mean by inconsistency and how do you rate it?

Remember that GRADE works in the negative, and the guidance on inconsistency follows the same approach as that for risk of bias. For RCT-derived evidence, the rating starts at High and is rated down for inconsistent findings across studies. Note that GRADE's review of inconsistency is specific to binary (not continuous) outcomes and relative (not absolute) measures of effect. When applying GRADE, reviewers should evaluate risk ratios and hazard ratios (preferred) or odds ratios.
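As a reminder of what a relative measure looks like, here is a minimal Python sketch that computes a risk ratio and an approximate 95% confidence interval from the event counts of one two-arm trial. It uses the standard large-sample formula on the log scale and assumes no arm has zero events. The function name and the counts are hypothetical.

```python
import math


def risk_ratio(events_trt, n_trt, events_ctl, n_ctl, z=1.96):
    """Risk ratio with an approximate 95% CI (log-scale, large-sample method).

    Assumes every count is greater than zero; zero-event arms need a
    different approach that is not handled here.
    """
    rr = (events_trt / n_trt) / (events_ctl / n_ctl)
    se_log_rr = math.sqrt(1 / events_trt - 1 / n_trt + 1 / events_ctl - 1 / n_ctl)
    lower = math.exp(math.log(rr) - z * se_log_rr)
    upper = math.exp(math.log(rr) + z * se_log_rr)
    return rr, lower, upper


# Hypothetical trial: 15/100 events on treatment, 30/100 on control.
rr, lower, upper = risk_ratio(15, 100, 30, 100)
print(f"RR = {rr:.2f}, 95% CI {lower:.2f} to {upper:.2f}")
```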

So the first step is to establish whether results differ across the studies. Are both the strength and direction of the results similar? Are the confidence intervals also similar, and do they overlap? Statistical tests of heterogeneity should also support that the results are sufficiently similar to be combined.

When evaluating inconsistency in the results, especially large inconsistency, it is important to first look for reasons for the inconsistency. Begin by looking for important differences in the populations, interventions, and outcomes assessed. Search for reasons the studies may have different results, such as differences in disease severity between studies or differences in dosing or delivery of the interventions. There may be important clinical reasons that the studies should not be combined. In that case, separate analyses may need to be performed for the different subpopulations.

It's important to realize when the data suggest that they shouldn't be combined into a single summary measure.

If data can be meaningfully evaluated in a stratified analysis, it would not be necessary to downgrade for inconsistency.
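To make the idea of a stratified analysis concrete, here is a minimal Python sketch that pools risk ratios separately within each subgroup. It assumes a simple fixed-effect, inverse-variance method on the log scale, which is one common choice and not something GRADE prescribes. The function names, the strata, and the study values are all hypothetical.

```python
import math
from collections import defaultdict


def pool_fixed_effect(log_rrs, std_errors):
    """Inverse-variance (fixed-effect) pooled risk ratio from log RRs."""
    weights = [1.0 / se ** 2 for se in std_errors]
    pooled_log = sum(w * y for w, y in zip(weights, log_rrs)) / sum(weights)
    return math.exp(pooled_log)


def pool_by_stratum(studies):
    """Pool (stratum, risk_ratio, se_of_log_rr) rows within each stratum."""
    grouped = defaultdict(list)
    for stratum, rr, se in studies:
        grouped[stratum].append((math.log(rr), se))
    return {
        stratum: pool_fixed_effect([y for y, _ in rows], [se for _, se in rows])
        for stratum, rows in grouped.items()
    }


# Hypothetical studies: (disease severity, risk ratio, SE of the log risk ratio).
studies = [
    ("mild", 0.90, 0.10), ("mild", 1.00, 0.10),
    ("severe", 0.50, 0.10), ("severe", 0.60, 0.10),
]

pooled = pool_by_stratum(studies)
for stratum, rr in pooled.items():
    print(f"{stratum}: pooled RR = {rr:.2f}")
```

In this made-up example the studies agree within each severity stratum but not across strata, which is the situation where reporting the strata separately can make more sense than one summary measure.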

Inconsistent results could also be associated with differences in study quality (risk of bias) between studies. Most systematic reviews will eliminate studies with severe risk of bias from the review, but measurement error could still account for differences in studies with moderate risk of bias.

If large inconsistency remains after evaluating the evidence for explanations, you will downgrade for inconsistency.

GRADE says judging the extent of inconsistency is based on a collective assessment of the similarity of the results, the overlap of the confidence intervals and the results of statistical tests for heterogeneity.

GRADE lists these 4 criteria to assess for downgrading. Downgrade when:

  1. Point estimates vary widely across studies.
  2. Confidence intervals show little or no overlap.
  3. Statistical test for heterogeneity - testing the null hypothesis that all studies are evaluating the same effect - shows a low P-value.
  4. The I squared - which describes the percentage of total variation across studies due to real differences instead of random error - is large. Determining what is large is a judgement. GRADE suggests an I squared less than 40% "is low", 30-60% "may be moderate", 50-90% "may be substantial", and 75-100% "is considerable".
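For criteria 3 and 4, here is a minimal Python sketch of how Cochran's Q and the I squared are usually computed from study results on the log scale, followed by a lookup of the bands quoted above. It assumes inverse-variance weights. The function names and study values are hypothetical. Because the bands overlap, the lookup can return more than one label, which is part of why the call is a judgement. The P-value for Q is not computed here; it would come from a chi-squared distribution with (number of studies - 1) degrees of freedom.

```python
import math


def cochran_q(log_effects, std_errors):
    """Cochran's Q: weighted squared deviations from the pooled log effect."""
    weights = [1.0 / se ** 2 for se in std_errors]
    pooled = sum(w * y for w, y in zip(weights, log_effects)) / sum(weights)
    return sum(w * (y - pooled) ** 2 for w, y in zip(weights, log_effects))


def i_squared(q, n_studies):
    """I squared as a percentage: (Q - df) / Q, floored at zero."""
    df = n_studies - 1
    if q <= 0 or df <= 0:
        return 0.0
    return max(0.0, (q - df) / q) * 100.0


def describe_i_squared(i2):
    """Return every band label from the quoted guidance that contains i2."""
    labels = []
    if i2 < 40:
        labels.append("low")
    if 30 <= i2 <= 60:
        labels.append("moderate")
    if 50 <= i2 <= 90:
        labels.append("substantial")
    if 75 <= i2 <= 100:
        labels.append("considerable")
    return labels


# Hypothetical risk ratios from four trials, each with SE(log RR) = 0.15.
risk_ratios = [0.5, 0.6, 1.2, 1.1]
log_rrs = [math.log(rr) for rr in risk_ratios]
ses = [0.15] * len(risk_ratios)

q = cochran_q(log_rrs, ses)
i2 = i_squared(q, len(risk_ratios))
print(f"Q = {q:.1f}, I squared = {i2:.0f}%, bands: {describe_i_squared(i2)}")
```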

You know where to go for more detail - the GRADE HANDBOOK.

TheEvidenceDoc November 16, 2015