ICEMAN for Randomized Controlled Trials
Item 1 — Direction of effect modification hypothesized a priori?
Item explanation: Credibility is higher if investigators correctly anticipated the direction of the effect modification (e.g. that an intervention is more effective in younger than in older patients), lower if they failed to anticipate a direction, and lowest if they anticipated the opposite direction.
Correct anticipation of an effect modification implies that the investigators had a specific hypothesis in mind — usually based on a biologic or other causal rationale, or sometimes based on prior evidence. In the Bayesian framework, the idea of a specific a priori hypothesis corresponds to using an informative rather than non-informative prior.1, 2
In addition, the item captures that an explanation stated a priori is much more credible than a post hoc explanation. If the explanation is post hoc, investigators have likely considered many possible explanations, thereby creating a multiplicity problem.3, 4, 5, 6, 7, 8, 9 Hypotheses are most credible if they can be verified in a prior, ideally time-stamped, protocol or analysis plan.
Note that statements of pre-specification may not be reliable,10 nor do they imply a specific hypothesis, nor do they preclude issues of multiple analyses.
Response options:
| Option | Description |
|---|---|
| Definitely no | Clearly post-hoc, results inconsistent with hypothesized direction, or biologically very implausible |
| Probably no or unclear | Vague hypothesis or hypothesized direction unclear |
| Probably yes | No protocol available but unequivocal statement of a priori hypothesis with correct direction |
| Definitely yes | Prior protocol available and includes hypothesis with correct specification of direction, e.g. based on biologic rationale |
Definitely no — Example 1: The ISIS-2 trial testing aspirin (ASA) conducted a provocative post hoc subgroup analysis comparing patients born under different astrological signs.11 ASA had a slight adverse effect in patients born under the sign of Gemini or Libra but a substantial benefit in others. The example became famous because the finding was obviously post hoc and not compatible with any biological model.
Definitely no — Example 2: In a trial comparing vasopressin versus norepinephrine for septic shock, the authors had hypothesized that the benefit of vasopressin would be larger in more severely ill patients. It turned out that the benefit seemed to be greater in the less severely ill patients (RR 1.04 in more severe vs 0.74 in less severe septic shock, interaction p=0.10). The investigators’ failure to correctly identify the direction weakens any inference about differential benefit.12
Probably no — Example: The investigators of the first large aspirin trial for patients with transient ischaemic attacks reported a beneficial effect in men but not in women.11 Although investigators may have planned a priori to explore subgroup effects by sex, they had not anticipated a specific direction based on biologic rationale. Subsequent studies and meta-analyses failed to replicate the subgroup effect.13
Probably yes — Example: A trial in patients requiring dialysis compared jugular versus femoral catheterisation and found no significant difference with respect to catheter colonisation.14 An analysis of effect modification suggested that jugular catheters were superior in patients with high BMI but inferior in patients with a low BMI. The authors claimed that they had correctly anticipated the direction but there was no protocol available.
Definitely yes — Example: A trial comparing reamed versus unreamed intramedullary nails for tibial shaft fractures suggested that reamed nails were superior for closed but potentially inferior for open fractures.15 The investigators correctly anticipated the direction in their published protocol based on a biologic rationale: damage of endosteal blood supply through reamed nails may be more detrimental in open than in closed fractures.16
Item 2 — Effect modification supported by prior evidence?
Item explanation: Credibility is higher if the effect modification is supported by prior direct or indirect evidence, lower if observed for the first time (often labelled as exploratory), and lowest if inconsistent with prior evidence.
Replication, ideally in another RCT, makes chance a less likely explanation for an apparent effect modification. Attempts at replication, and successful replication, however, seem to be rare17 and prior evidence will mostly be unclear. Direction plays an important role: if two trials show different directions of effect modification, this markedly reduces credibility.
Response options:
| Option | Description |
|---|---|
| Inconsistent with prior evidence | Prior evidence suggested a different direction of effect modification |
| Little or no support or unclear | No prior evidence, or consistent with weak/very indirect prior evidence (e.g. animal study at high risk of bias), or unclear |
| Some support | Consistent with more limited or indirect prior evidence (e.g. large observational study, non-significant effect modification in prior RCT, or different population) |
| Strong support | Consistent with strong prior evidence directly applicable to the clinical scenario (e.g. significant effect modification in related RCT) |
Inconsistent — Example: A trial investigated the efficacy of neratinib in women with breast cancer.18 The authors mention two related trials that showed similar survival benefits irrespective of hormone receptor status,19, 20 and three other related trials that suggested greater survival in hormone receptor-negative patients, i.e. the opposite direction of effect modification.21, 22, 23
Little or no support — Example: A trial comparing jugular versus femoral catheterisation.14 The prior evidence was unclear because the cited cohort study24 provided no direct support for an interaction and was published after the trial had already started.
Some support — Example: A trial testing epoetin alfa for critically ill patients suggested a reduction in mortality in trauma patients.25 Although the interaction test was not significant (p=0.16), the finding was consistent with a similar effect modification seen in a previous RCT.26
Strong support — Example: A trial comparing low-carb versus low-fat diet found effect modification by insulin secretion (interaction p=0.01).27 A previous RCT found a similar, significant effect modification (interaction p=0.02).28
Item 3 — Does a test for interaction suggest that chance is an unlikely explanation?
Item explanation: Credibility is higher if a statistical test for interaction suggests that chance is an unlikely explanation for the apparent effect modification.29, 30, 31 Credibility is lower if an interaction test is compatible with chance, or if no test is reported and none can be computed from the reported data.
If no interaction p-value is reported, it can sometimes be calculated based on the reported data (point estimates of effect and confidence intervals in individual subgroups).35, 36 As a rule of thumb, the interaction p-value must be smaller than 0.05 if 95% confidence intervals of subgroup-specific estimates do not overlap.
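The calculation from reported subgroup estimates can be sketched as follows. This is a minimal sketch of one common approach, a z-test on the difference between two log-transformed ratio estimates, and it is an assumption that this matches the method of the cited references. The function names and the input numbers are hypothetical and are not taken from any cited trial; the sketch also assumes two independent subgroups and 95% confidence intervals that are symmetric on the log scale.

```python
import math

def se_from_ci(lower, upper, z_crit=1.959964):
    """Standard error of a log ratio estimate, recovered from its 95% CI."""
    return (math.log(upper) - math.log(lower)) / (2 * z_crit)

def interaction_p_value(est_a, ci_a, est_b, ci_b):
    """Two-sided p-value for the difference between two subgroup ratio estimates.

    est_a, est_b: ratio estimates (e.g. RR or HR) in subgroups A and B.
    ci_a, ci_b:   (lower, upper) 95% confidence limits of those estimates.
    """
    diff = math.log(est_a) - math.log(est_b)
    se_diff = math.sqrt(se_from_ci(*ci_a) ** 2 + se_from_ci(*ci_b) ** 2)
    z = diff / se_diff
    # erfc(|z| / sqrt(2)) equals the two-sided normal tail probability
    return math.erfc(abs(z) / math.sqrt(2))

# Hypothetical subgroup results (illustrative numbers only)
p_overlapping = interaction_p_value(0.70, (0.55, 0.89), 1.10, (0.85, 1.42))
p_non_overlapping = interaction_p_value(0.60, (0.50, 0.72), 1.20, (1.00, 1.44))
```

In the second hypothetical call the two confidence intervals do not overlap, so the resulting p-value falls below 0.05, in line with the rule of thumb above. The converse does not hold: overlapping intervals can still yield an interaction p-value below 0.05.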
We anchored the response options around typical p-value thresholds of 0.05, 0.01, and 0.005, with a p-value of 0.005 or smaller representing the most credible category. The response options recognise that p-value thresholds of 0.05 or even 0.01 may be too lenient for claiming statistical significance.37
Response options:
| Option | Threshold |
|---|---|
| Chance a very likely explanation | Interaction p-value > 0.05 |
| Chance a likely explanation or unclear | p ≤ 0.05 and > 0.01, or no test reported and not computable |
| Chance may not explain | p ≤ 0.01 and > 0.005 |
| Chance an unlikely explanation | p ≤ 0.005 |
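The thresholds in the table above can be written as a small lookup. This is only an illustrative sketch: the function name is hypothetical, and passing `None` is an assumed convention for "no test reported and not computable".

```python
def classify_interaction_p(p):
    """Map an interaction p-value to the Item 3 response option.

    p is None when no test is reported and none can be computed.
    """
    if p is None:
        return "Chance a likely explanation or unclear"
    if p > 0.05:
        return "Chance a very likely explanation"
    if p > 0.01:
        return "Chance a likely explanation or unclear"
    if p > 0.005:
        return "Chance may not explain"
    return "Chance an unlikely explanation"
```

Note that the boundaries are inclusive on the lower category: a p-value of exactly 0.05 falls under "likely or unclear", and a p-value of exactly 0.01 under "may not explain".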
Very likely — Example: A trial comparing prostatectomy versus observation for early prostate cancer found interaction p-values of 0.06 and 0.08 for potential benefits in specific subgroups.38 The most likely explanation is chance.
Likely/unclear — Example: The PLATO trial suggested that ticagrelor was inferior in patients from North America (interaction p=0.05).39 The p-value suggests that about 1 in 20 trials would show an effect modification this large or larger even if no true effect modification existed.
May not explain — Example: A trial comparing reamed versus unreamed intramedullary nailing of tibial shaft fractures. An interaction p-value of 0.01 provided modest support against chance.15
Unlikely — Example: The CRASH-2 trial of tranexamic acid in trauma patients found an RR of 0.68 for early treatment (≤1 h), 0.79 for treatment between 1 and 3 h, and 1.44 for treatment after 3 h. The interaction p-value was smaller than 0.0001.40
Item 4 — Small number of effect modifiers tested, or multiplicity considered?
Item explanation: Performing multiple tests is a major concern in the context of effect modification analysis. Because multiple tests increase the risk of a chance finding,41, 42, 43 credibility is higher if investigators have tested only a small number of effect modifiers.
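As a rough illustration of why the number of tests matters, consider the simplifying assumptions (not stated in the instrument) that $k$ candidate effect modifiers are tested independently, each at significance level $\alpha$, and that none of them truly modifies the effect. The probability of at least one spurious finding is then:

```latex
P(\text{at least one false positive}) = 1 - (1 - \alpha)^{k}
% With \alpha = 0.05:
%   k = 3  gives 1 - 0.95^{3}  \approx 0.14
%   k = 7  gives 1 - 0.95^{7}  \approx 0.30
%   k = 20 gives 1 - 0.95^{20} \approx 0.64
```

In practice, tests within one trial are usually correlated, so these figures are an approximation rather than an exact error rate.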
Multiplicity issues can arise in different ways,44 including multiple candidate effect modifiers, multiple time points, multiple scales,45 multiple outcomes, or multiple methods for testing the interaction.
An alternative to limiting the number of analyses is to statistically adjust the analysis for multiplicity, e.g. using correction of p-values considering the (familywise) type 1 error rate,46 testing all candidate effect modifiers in a common model, or shrinkage estimators.33, 47 All techniques inevitably reduce power.3, 48, 49
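Two widely used familywise corrections, Bonferroni and Holm, can be sketched as follows. They are chosen here purely as illustrations of p-value correction; the text above does not prescribe a specific procedure, and the input p-values are hypothetical.

```python
def bonferroni(p_values):
    """Bonferroni-adjusted p-values: each p multiplied by the number of tests."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]

def holm(p_values):
    """Holm step-down adjusted p-values, returned in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        # The smallest p is multiplied by m, the next by m - 1, and so on
        candidate = min(1.0, (m - rank) * p_values[i])
        # Enforce monotonicity so adjusted values never decrease with rank
        running_max = max(running_max, candidate)
        adjusted[i] = running_max
    return adjusted

# Hypothetical interaction p-values for four candidate effect modifiers
raw = [0.004, 0.03, 0.20, 0.60]
adj_bonferroni = bonferroni(raw)
adj_holm = holm(raw)
```

Holm's procedure controls the same familywise error rate as Bonferroni but is never less powerful, which is why its adjusted values are never larger than the Bonferroni ones.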
Assessment of multiplicity crucially depends on reporting; reporting guidelines for effect modification are available.50, 51, 52 Note that retrospective statements about the number of pre-specified subgroup analyses are not always reliable.10
Response options:
| Option | Description |
|---|---|
| Definitely no | Explicitly exploratory analysis, or large number of analyses (>10), and multiplicity not considered |
| Probably no or unclear | No mention of number, or 4–10 effect modifiers tested and not considered |
| Probably yes | No protocol but unequivocal statement of ≤3 effect modifiers tested |
| Definitely yes | Protocol available and ≤3 effect modifiers tested, or number considered in analysis |
Definitely no — Example: A trial assessing stroke risk after carotid endarterectomy tested more than 20 baseline factors for potential effect modification without adjustment for multiplicity.53 A subsequent trial found the opposite association,54 suggesting that the original finding was due to chance.
Probably no — Example: A trial comparing prostatectomy versus observation.38 The investigators did not take into account that they had tested at least seven effect modifiers for this outcome.
Probably yes — Example: A trial testing whether training of health professionals reduces caesarean deliveries. The effect modifier was one of two specified in the study protocol.55
Definitely yes — Example: A trial comparing endarterectomy versus medical therapy. The published study protocol specified only one effect modifier.53, 56
Item 5 — Arbitrary cut points avoided? (continuous variables only)
Item explanation: Categorising continuous effect modifiers is common57 but associated with a number of problems58, 59: cut points can introduce multiplicity, reduce power, mask linear or non-linear associations, and complicate comparisons across studies. Analyses that avoid cut points and make use of the full spectrum of values are the most credible.
Credibility is low if investigators selected the best-fitting data-driven cut point to maximise the effect modification. Such cut points are associated with a high rate of false positive claims.58, 60
Modelling continuous effect modifiers poses some challenges that are not part of the instrument but may lower credibility: for example, the model may be misspecified if the continuous relationship is driven by a few influential observations.61, 62, 63 The most credible continuous analyses are therefore those for which investigators have pre-specified the type of dependency of the treatment effect on the continuous variable, such as a linear or log relationship, or have considered only a small number of candidate functions.64
An alternative to use of cut points and potentially complex modelling is to consider overlapping subgroups (e.g. using a sliding window approach).65 The credibility is usually much higher than using arbitrary cut points but the interpretation can be difficult.
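A generic sliding-window analysis can be sketched as below. This is a minimal illustration of overlapping subgroups and not necessarily the method of the cited reference; the function names, the window size, the step, and the toy data are all hypothetical.

```python
def sliding_windows(values, window_size, step):
    """Yield (low, high, indices) for overlapping windows of patients,
    ordered by the value of the continuous effect modifier."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    for start in range(0, len(order) - window_size + 1, step):
        idx = order[start:start + window_size]
        yield values[idx[0]], values[idx[-1]], idx

def risk_ratio(events, treated, idx):
    """Risk ratio (treated vs control) among patients idx; None if undefined."""
    t = [events[i] for i in idx if treated[i]]
    c = [events[i] for i in idx if not treated[i]]
    if not t or not c or sum(c) == 0:
        return None
    return (sum(t) / len(t)) / (sum(c) / len(c))

# Toy data (hypothetical): age as modifier, alternating allocation
ages = [30, 35, 40, 45, 50, 55, 60, 65]
treated = [True, False, True, False, True, False, True, False]
events = [0, 1, 0, 1, 1, 1, 1, 0]

# Windows of 4 patients, shifted by 2, so neighbouring windows share 2 patients
results = [(low, high, risk_ratio(events, treated, idx))
           for low, high, idx in sliding_windows(ages, window_size=4, step=2)]
```

Because neighbouring windows share patients, the window-specific estimates are correlated, which is one reason the interpretation can be difficult.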
The credibility of a continuous analysis usually increases if investigators present a plot with confidence bands around the regression function and have carefully checked the proposed model. Formal interaction tests for continuous effect modification are available and should be applied.64
Response options:
| Option | Description |
|---|---|
| Definitely no | Analysis based on exploratory cut point (e.g. picking the cut point associated with the lowest interaction p-value) |
| Probably no or unclear | Analysis based on cut point(s) of unclear origin |
| Probably yes | Analysis based on pre-specified cut points (e.g. from prior RCT) |
| Definitely yes | Analysis based on the full continuum (e.g. assuming a linear or logarithmic relationship) |
| Not applicable | Effect modifier is not continuous |
Definitely no — Example: A trial tested the effect of everolimus in women with a specific type of breast cancer.66 The investigators dichotomised the continuous CIN score using the 75th percentile, explicitly selected because it “yielded the maximal difference in HR between high– versus low–CIN score subgroups.”
Probably no — Example: A trial comparing prostatectomy versus observation for early prostate cancer.38 The investigators hypothesised a potential effect modification by PSA value below or above 10, but provided no rationale for the chosen threshold.
Probably yes — Example: The CRASH-2 trial40 tested tranexamic acid in bleeding trauma patients. The investigators had pre-specified the three time-based categories in a published protocol.67
Definitely yes — Example: A trial comparing interferon-alpha versus medroxyprogesterone in patients with renal carcinoma.68 A subsequent analysis treated white cell count as a continuous variable, avoiding arbitrary cut points and maximising the power of the analysis. A plot of the continuous relationship with confidence bands was provided.69
Item 6 — Optional: Additional considerations
Item explanation: Methodologists have suggested a number of additional considerations that could be relevant for assessing the credibility of effect modifiers.70 They are not part of the core items because they are either less relevant, rarely apply, or are difficult to assess. The only response options are probably decreased and probably increased. Leaving this section blank does not affect credibility.
Sensitivity analysis suggesting robustness: A sensitivity analysis can help to increase the confidence in a proposed effect modification.71, 72, 73
Example: The CRASH-2 trial40 performed a sensitivity analysis treating time as a continuous effect modifier which suggested that results were robust, despite the pre-specified use of two cut points.
Dose-response effect across levels of the effect modifier: Credibility may be higher if effects increase or decrease monotonically with increases in the levels of the modifier, e.g. an effect that increases incrementally across three or more age groups.
Example: The CRASH-2 trial40 found that the effect on death due to bleeding decreased with increasing time from injury (interaction p<0.0001): early treatment (≤1 h) RR 0.68, between 1 and 3 h RR 0.79, after 3 h RR 1.44. The decrease across levels of the effect modifier strengthens the results.
Persistence after adjustment for other potential effect modifiers: Credibility may be higher if a multivariable analysis suggests that the apparent effect modifier is independent of other candidate modifiers.74
Example: The CRASH-2 trial.40 When including interaction terms for all four effect modifiers in a common model, the effect modification by time from injury remained highly significant (p<0.0001).
Risk of bias of the main effect of the RCTs: A commonly used instrument to formally assess the overall risk of bias is the Cochrane risk of bias tool.75 Some studies have suggested that industry-funded trials are at higher risk of spurious claims of effect modification than non-industry-funded studies, especially if the overall effect is not significant.76, 77, 78
Exceptionally high power: Methodologists have argued that the credibility of a proposed effect modification increases with its prospective power.49, 79 A rare situation of increased confidence would be a trial of over 10,000 patients with 80% power to detect a significant effect modification suggested in the study protocol.49
Consistency across related outcomes: Credibility might be higher if an effect modification is found for biologically related outcomes.
Example: In a trial of reamed versus unreamed nailing of tibial fractures, the difference in re-operations between smokers and non-smokers was also seen in quality of life outcomes.15
Item 7 — Overall credibility rating
Item explanation: The overall rating is a continuous visual analogue scale spanning four credibility areas, corresponding roughly to <25%, 25–50%, 50–75%, and >75% confidence that the apparent effect modification is true and not the result of chance or bias. The overall rating should be driven by the items that decrease credibility.
Recommended strategy:
- All responses definitely or probably reduced credibility or unclear → very low credibility
- Two or more responses definitely reduced credibility → maximum usually low credibility
- One response definitely reduced credibility → maximum usually moderate credibility
- Two responses probably reduced credibility → maximum usually moderate credibility
- No responses definitely or probably reduced credibility → high credibility very likely
| Credibility | Interpretation | Implication |
|---|---|---|
| Very low | Very likely no effect modification | Use overall estimate for each subgroup |
| Low | Likely no effect modification | Use overall estimate; note remaining uncertainty |
| Moderate | Likely effect modification | Use separate estimates; note remaining uncertainty |
| High | Very likely effect modification | Use separate estimates for each subgroup |
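The recommended strategy above can be sketched as a function that returns the usual ceiling for the overall rating. This is an illustrative sketch under stated assumptions: the function name and the response labels are hypothetical; each core item response is assumed to be coded by its position among the response options (first option "definitely reduced", second "probably reduced or unclear", and so on); items rated "Not applicable" are assumed to be left out; and treating three or more "probably reduced" responses like two is an assumption, since the strategy only names the case of two. The strategy gives no rule for exactly one "probably reduced" response, so the sketch reports that case as unspecified.

```python
def credibility_ceiling(responses):
    """Return the usual maximum overall credibility for core item responses.

    Each response is one of: 'definitely_reduced', 'probably_reduced',
    'probably_increased', 'definitely_increased'.
    """
    definitely = sum(r == "definitely_reduced" for r in responses)
    probably = sum(r == "probably_reduced" for r in responses)
    if responses and definitely + probably == len(responses):
        return "very low"
    if definitely >= 2:
        return "low"
    if definitely == 1 or probably >= 2:  # '>= 2' extends the stated 'two'
        return "moderate"
    if definitely == 0 and probably == 0:
        return "high"
    return "unspecified"  # exactly one 'probably reduced': no rule given

# Hypothetical coding of five core item responses
example = [
    "probably_increased",    # item 1
    "probably_reduced",      # item 2
    "definitely_increased",  # item 3
    "definitely_reduced",    # item 4
    "definitely_increased",  # item 5
]
ceiling = credibility_ceiling(example)
```

The result is a ceiling, not a final rating: the strategy uses "usually" throughout, so judgement on the continuous scale still applies.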
Section 4 provides more suggestions for using and presenting ICEMAN in context; section 5 provides a more detailed justification of why the scale is continuous and why low credibility suggests likely no effect modification.
Worked example — RCT (CRASH-2)
A secondary publication of the CRASH-2 trial investigated the effect of tranexamic acid versus placebo on death due to bleeding in trauma patients.40 The investigators reported that “the effect of tranexamic acid on death due to bleeding varied according to the time from injury to treatment (test for interaction p<0·0001).” Although the investigators labelled their analysis as exploratory, an assessment using ICEMAN suggests moderate credibility.
| Item | Response | Comment |
|---|---|---|
| 1. Direction a priori | Probably yes | Subgroups (not direction) pre-specified in published protocol;67 authors stated they had correctly anticipated direction |
| 2. Prior evidence | Little or no support | Three cited papers represent expert opinion but no prior trial or cohort study showing a similar effect modification |
| 3. Interaction test | Chance unlikely | p < 0.0001 |
| 4. Number of modifiers | Definitely no | Four pre-specified modifiers, but applied to outcomes other than those pre-specified (labelled exploratory); however, p is very small |
| 5. Cut points | Definitely yes | Two analyses: pre-specified cut points AND a continuous analysis; plot with 95% confidence bands provided |
| 6. Additional (optional) | Probably increases | Effect modification persisted after adjustment for other pre-specified modifiers (p<0.0001); dose-response pattern observed |
| 7. Overall | Moderate | Lack of prior evidence is a limitation; continuous analysis and very small p-value are reassuring |