ICEMAN for Meta-analyses
1 — Is the analysis of effect modification based on comparison within rather than between trials?
Effect modification suggested by a comparison between studies (subgroups of studies) is usually much less credible than effect modification suggested by a comparison within studies (subgroups of individuals).
An important concern with between-study comparisons is study-level confounding: an association observed between a study-level variable and an outcome may be confounded by other study-level variables.1, 2, 3, 4, 5, 6, 7, 8, 9, 10 The power to identify a true within-trial effect modification can be very low and an apparent effect modification might be largely driven by study-level confounding.9, 10, 11
Most common are aggregate-data meta-analyses in which analyses of effect modification are completely based on between-study comparisons, e.g. using meta-regression. Those analyses are at a high risk of study-level confounding and consequently lower credibility.
Sometimes, investigators combine within- and between-trial information using one of the following approaches2, 12: (1) estimate within- and between-trial effect modification separately, then combine both; (2) include a simple interaction term in a one-stage IPD meta-analysis; (3) first combine trials within subgroups, then compare summary effects between subgroups.
An analysis of effect modification is definitely free of study-level confounding if it is based completely on within-trial information. This is possible if all trials provide (or allow estimation of) within-trial effect modification and the estimates are then combined across trials in a separate step.2, 12, 13 Alternatively, more complex methods are available for individual-participant data meta-analyses.2, 12, 14
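As an illustration of the two-step, within-trial approach, the following minimal sketch pools per-trial interaction estimates with inverse-variance weights. The function name, the input format (one log ratio of effects and its standard error per trial), and the example numbers are hypothetical and not taken from any cited analysis. The sketch uses a common effect pooling step for brevity; a random effects version would add a between-trial variance term to each weight.

```python
import math

def pool_within_trial_interactions(estimates):
    """Pool per-trial interaction estimates (inverse-variance weighting).

    estimates: list of (log_ratio, se) pairs, one per trial, where log_ratio
    is the within-trial difference in log effect between two subgroups.
    Returns (pooled_log_ratio, pooled_se, two_sided_p).
    """
    weights = [1.0 / se ** 2 for _, se in estimates]
    total = sum(weights)
    pooled = sum(w * est for w, (est, _) in zip(weights, estimates)) / total
    pooled_se = math.sqrt(1.0 / total)
    z = pooled / pooled_se
    # Two-sided p-value from the standard normal distribution.
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return pooled, pooled_se, p

# Hypothetical within-trial interaction estimates from three trials.
trials = [(-0.30, 0.20), (-0.10, 0.25), (-0.25, 0.15)]
pooled, se, p = pool_within_trial_interactions(trials)
print(f"pooled ratio of effects = {math.exp(pooled):.2f}, p = {p:.3f}")
```

Because each input is a contrast estimated inside a single trial, differences between trials in population or design cannot produce the pooled interaction.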
A survey of published IPD meta-analyses suggested that only a small proportion of analyses of effect modification separate within- from between-trial information; most appear to combine the two.2 Therefore, unless stated otherwise, analyses of effect modification in an IPD meta-analysis likely combine within- and between-trial information and might not be free of study-level confounding.
| Response options | Description |
|---|---|
| Completely between | Subgroup analysis or meta-regression comparing overall effects of each individual trial (typical for aggregate data meta-analyses) |
| Mostly between or unclear | Most information from overall effects; some trials providing within-trial subgroup information |
| Mostly within | Most trials providing within-trial subgroup information; or IPD analysis combining within and between trial information |
| Completely within | IPD analysis that separates within from between trial information (e.g. meta-analysis of interactions) |
Completely between — Example 1: A meta-analysis assessing the effect of inpatient rehabilitation versus usual care found that patients undergoing orthopaedic-focused rehabilitation had a substantially larger functional benefit than patients undergoing geriatric-focused rehabilitation (interaction p=0.01).15 The analysis was based on between-study comparison only and was therefore at high risk of confounding.
Completely between — Example 2: An IPD meta-analysis based on three RCTs suggested that mobile phone text messages can improve adherence to antiretroviral therapy. Because the type of text message varied only between but not within studies, the significant interaction (p=0.01) reflects a between-study comparison at high risk of study-level confounding — even though individual participant data were used.
Mostly between — Example: A meta-analysis assessing the effect of preoperative chemotherapy for gastroesophageal adenocarcinoma on survival combined individual patient and aggregate data.16 The analysis suggested a potentially larger treatment effect in tumours of the gastroesophageal junction (interaction p=0.08). The apparent effect modification might be explained by study-level confounders, e.g. risk of bias.
Mostly within — Example: An IPD meta-analysis combined 13 trials comparing radiochemotherapy versus radiotherapy alone in patients with cervical cancer.17 The authors first pooled subgroup-specific effects of each trial, then applied a chi-square test for trend (p=0.017). This method combines within- and between-trial information and is therefore potentially affected by study-level confounding.2
Completely within — Example: A meta-analysis of individual patient data from 16 trials compared low intensity interventions for depression with usual care.18 The investigators chose a model that estimated the effect modification within each trial and separated out between-trial comparisons, including a forest plot illustrating the heterogeneity of effect modifications across trials.
2 — For within-trial comparisons, is the effect modification similar from trial to trial?
Credibility of effect modification increases if the effect modification has been replicated across independent studies. Replication provides the strongest protection against random error and decreases the likelihood of confounding.
If the item applies, it is helpful to quantify the magnitude of effect modification for each trial, e.g. by calculating a ratio of risk ratios.13
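A ratio of risk ratios for a single trial can be computed from the subgroup-specific event counts. The sketch below is a minimal illustration under the usual large-sample normal approximation on the log scale; the function names and the example counts are hypothetical.

```python
import math

def log_risk_ratio(events_trt, n_trt, events_ctl, n_ctl):
    """Return the log risk ratio and its standard error for one subgroup."""
    log_rr = math.log((events_trt / n_trt) / (events_ctl / n_ctl))
    se = math.sqrt(1 / events_trt - 1 / n_trt + 1 / events_ctl - 1 / n_ctl)
    return log_rr, se

def ratio_of_risk_ratios(subgroup_a, subgroup_b, z_crit=1.96):
    """Return (RRR, lower, upper) comparing subgroup A with subgroup B.

    Each subgroup is a tuple (events_trt, n_trt, events_ctl, n_ctl).
    The subgroups are assumed to contain different participants, so the
    variances of the two log risk ratios add.
    """
    log_a, se_a = log_risk_ratio(*subgroup_a)
    log_b, se_b = log_risk_ratio(*subgroup_b)
    diff = log_a - log_b
    se = math.sqrt(se_a ** 2 + se_b ** 2)
    return (math.exp(diff),
            math.exp(diff - z_crit * se),
            math.exp(diff + z_crit * se))

# Hypothetical trial: risk ratio 0.5 in subgroup A, 1.0 in subgroup B.
rrr, lower, upper = ratio_of_risk_ratios((10, 100, 20, 100), (20, 100, 20, 100))
print(f"RRR = {rrr:.2f} (95% CI {lower:.2f} to {upper:.2f})")
```

Calculating this quantity for every trial makes it possible to compare direction and magnitude of the effect modification from trial to trial.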
Note that this credibility consideration is different from assessing consistency (or heterogeneity) of treatment effects across studies (e.g. expressed by the I²-measure19).
| Response options | Description |
|---|---|
| Not applicable | No or only one within-RCT comparison available |
| Definitely not similar | Effect modification reported for ≥2 trials with clearly different directions |
| Probably not similar or unclear | Not reported for individual trials, or too imprecise to tell |
| Mostly similar | Reported for ≥2 trials, mostly similar direction but considerable differences in magnitude |
| Definitely similar | Reported for ≥2 trials, similar in direction, only some differences in magnitude |
Probably not similar — Example: An IPD meta-analysis combined 13 trials comparing radiochemotherapy versus radiotherapy alone in patients with cervical cancer.17 The authors reported the effect modification only for the combined dataset, not for individual trials. It was therefore not possible to assess consistency across trials.
Mostly similar — Example: A meta-analysis of individual patient data from 16 trials of low intensity interventions for depression.18 Considering the point estimates within the 16 trials, 12 suggested a direction consistent with the overall finding, 1 suggested no effect modification, and 3 were in the opposite direction but with wide confidence intervals.
Definitely similar — Example: An IPD meta-analysis of fixed-dose aspirin for primary prevention of cardiovascular events found a significant interaction with body weight.20 All six trials showed the same direction (more effective in lighter patients) with ratios of hazard ratios ranging between 0.5 and 0.9.
3 — For between-trial comparisons, is the number of trials large?
For analysis of effect modification based on between-study comparisons, the credibility increases with the number of studies (analogous to number of observations in a regression analysis). A large number of studies also increases the power of the analysis and improves modelling of between-study dispersion in a random effects model.21, 22, 23, 24
| Response options | Subgroup analysis | Continuous meta-regression |
|---|---|---|
| Very small | 1–2 in smallest subgroup | ≤5 studies total |
| Rather small or unclear | 3–4 in smallest subgroup | 6–10 studies |
| Rather large | 5–9 in smallest subgroup | 11–15 studies |
| Large | ≥10 in smallest subgroup | >15 studies |
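The thresholds in the table can be written as a small lookup. This is only a sketch for illustration; the function name and its `continuous` flag are hypothetical, and the "or unclear" part of the second category still requires judgement.

```python
def classify_trial_count(n_studies, continuous=False):
    """Map a study count to the response option in the table above.

    For a subgroup analysis, n_studies is the number of studies in the
    smallest subgroup; for a continuous meta-regression (continuous=True),
    it is the total number of studies.
    """
    if n_studies < 1:
        raise ValueError("n_studies must be at least 1")
    if continuous:
        cutoffs = [(5, "Very small"),
                   (10, "Rather small or unclear"),
                   (15, "Rather large")]
    else:
        cutoffs = [(2, "Very small"),
                   (4, "Rather small or unclear"),
                   (9, "Rather large")]
    for upper_bound, label in cutoffs:
        if n_studies <= upper_bound:
            return label
    return "Large"

print(classify_trial_count(2))                    # smallest subgroup of 2 studies
print(classify_trial_count(12, continuous=True))  # meta-regression of 12 studies
```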
Very small — Example: A meta-analysis comparing transcatheter versus surgical aortic valve replacement found a qualitative interaction (interaction p=0.01 using a random effects model). The smallest subgroup included only two studies.25
Rather small — Example: In a meta-analysis investigating the effect of low-intensity pulsed ultrasound on bone healing, the subgroup of 3 studies at low risk of bias suggested no benefit (interaction p<0.001).26
Rather large — Example: In a meta-analysis assessing the effect of inpatient rehabilitation versus usual care, each of the two subgroups included 6 studies.15
Large — Example: A meta-analysis comparing interventions for preventing hospital readmission performed a subgroup analysis by number of activities. The smaller subgroup included 16 studies and the larger subgroup 26 studies.27
4 — Was the direction of effect modification correctly hypothesized a priori?
Credibility is higher if investigators correctly anticipated the direction of the effect modification, lower if they failed to anticipate a direction, and lowest if they anticipated the opposite direction.
Correct anticipation of an effect modification implies that investigators had a specific hypothesis in mind, usually based on a biologic or other causal rationale, or sometimes based on prior evidence. An explanation stated a priori is much more credible than a post hoc explanation. If post hoc, investigators had likely considered many possible explanations, thereby creating a multiplicity problem.28, 29, 30, 31, 32, 33, 34
Because meta-analyses are retrospective, investigators may already know the key trials and most promising effect modifiers when they plan the analysis.3 If so, this item loses some of its value if it suggests increased credibility: correct anticipation of direction may essentially be data-driven. The item is more relevant if none of the key trials has tested the effect modifier of interest, and if the analysis is completely based on between-trial comparisons.
| Response options | Description |
|---|---|
| Definitely no | Clearly post-hoc, results inconsistent with hypothesized direction, or biologically very implausible |
| Probably no or unclear | Vague hypothesis or hypothesized direction unclear |
| Probably yes | No protocol available but unequivocal statement of a priori hypothesis with correct direction |
| Definitely yes | Prior protocol available and includes hypothesis with correct specification of direction, e.g. based on biologic rationale |
Probably no — Example: An IPD meta-analysis of fixed-dose aspirin for primary prevention of cardiovascular events found a significant interaction with body weight.20 The paper does not clarify whether the effect modification was hypothesized a priori.
Probably yes — Example: An IPD meta-analysis combined three trials comparing high versus low PEEP in ventilated patients. A subgroup analysis suggested that higher pressure was associated with longer survival in patients with ARDS (interaction p=0.02). The authors explicitly stated that they correctly anticipated the effect modification in their protocol which, however, was not published.35
Definitely yes — Example: A meta-analysis comparing transcatheter versus surgical aortic valve replacement. The investigators had anticipated this interaction with correct direction in a published protocol.25
5 — Does a test for interaction suggest that chance is an unlikely explanation?
Credibility is higher if a statistical test for interaction or meta-regression suggests that chance is an unlikely explanation for the apparent effect modification. Credibility is lower if the test is compatible with chance, or if no test is reported and none can be computed.
If no interaction or meta-regression p-value is reported, it can sometimes be calculated from the reported data. We anchored the response options around the commonly used p-value thresholds of 0.05, 0.01, and 0.005, with a p-value of 0.005 or smaller representing the most credible category.
| Response options | Threshold |
|---|---|
| Chance a very likely explanation | Interaction or meta-regression p-value > 0.05 |
| Chance a likely explanation or unclear | p ≤ 0.05 and > 0.01, or no test reported and not computable |
| Chance may not explain | p ≤ 0.01 and > 0.005 |
| Chance an unlikely explanation | p ≤ 0.005 |
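When only subgroup-specific summary estimates are reported, an approximate interaction p-value can sometimes be derived from them and mapped to the response options above. The sketch below assumes two independent subgroups with approximately normal log effect estimates; the function names and example values are hypothetical.

```python
import math

def interaction_p_value(log_effect_a, se_a, log_effect_b, se_b):
    """Two-sided p-value for the difference between two subgroup effects.

    Assumes the subgroups are independent and that the log effects are
    approximately normally distributed.
    """
    z = (log_effect_a - log_effect_b) / math.sqrt(se_a ** 2 + se_b ** 2)
    return math.erfc(abs(z) / math.sqrt(2.0))

def classify_chance(p):
    """Map an interaction p-value (or None if unavailable) to a response option."""
    if p is None:
        return "Chance a likely explanation or unclear"
    if p > 0.05:
        return "Chance a very likely explanation"
    if p > 0.01:
        return "Chance a likely explanation or unclear"
    if p > 0.005:
        return "Chance may not explain"
    return "Chance an unlikely explanation"

# Hypothetical subgroup risk ratios of 0.5 and 1.0, each with SE 0.2 on the log scale.
p = interaction_p_value(math.log(0.5), 0.2, math.log(1.0), 0.2)
print(f"p = {p:.3f}: {classify_chance(p)}")
```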
6 — Did the authors test only a small number of effect modifiers, or consider multiplicity?
Performing multiple tests is a major concern in the context of effect modification analysis. Because multiple tests increase the risk of a chance finding, credibility is higher if investigators tested only a small number of effect modifiers or statistically considered multiplicity.
Multiplicity issues can arise through multiple candidate effect modifiers, multiple time points, multiple scales, multiple outcomes, or multiple methods for testing the interaction. Assessment of multiplicity depends heavily on reporting, and retrospective statements about the number of pre-specified subgroup analyses are not always reliable.39
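The arithmetic behind the multiplicity concern can be sketched as follows. The calculation assumes independent tests, which is a simplification (tests on related outcomes or time points are usually correlated), and the function names are hypothetical. Bonferroni is shown as one simple way to consider multiplicity, not as the method used in any cited analysis.

```python
def familywise_error(n_tests, alpha=0.05):
    """Probability of at least one false positive among n independent tests."""
    return 1.0 - (1.0 - alpha) ** n_tests

def bonferroni_threshold(n_tests, alpha=0.05):
    """Per-test significance threshold that keeps the overall error near alpha."""
    return alpha / n_tests

for n in (3, 10, 54):
    print(n, round(familywise_error(n), 3), bonferroni_threshold(n))
```

With 3 tests the chance of at least one spurious finding is about 0.14. With 54 tests (for example, 9 effect modifiers for 3 outcomes at 2 time points) it is about 0.94 under the independence assumption.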
A potential limitation in meta-analyses is that investigators might have scanned key trials for promising effect modifiers before planning the meta-analysis. If so, a small number of tested effect modifiers might obscure potential multiplicity issues introduced in earlier selection processes in the individual trials.
| Response options | Description |
|---|---|
| Definitely no | Explicitly exploratory analysis, or large number of analyses (>10), and multiplicity not considered |
| Probably no or unclear | No mention of number, or 4–10 effect modifiers tested and not considered |
| Probably yes | No protocol but unequivocal statement of ≤3 effect modifiers tested |
| Definitely yes | Protocol available and ≤3 effect modifiers tested, or number considered in analysis |
Definitely no — Example: A meta-analysis investigating interventions to reduce early hospital readmissions reported results for 12 effect modifiers.27 The authors correctly highlighted the possibility of chance findings due to multiplicity.
Probably no — Example: In a meta-analysis assessing inpatient rehabilitation versus usual care, all reported meta-regression analyses were pre-specified in an analysis plan. Nevertheless, 9 effect modifiers were tested for 3 outcomes at 2 time points.15
Probably yes — Example: An IPD meta-analysis assessed the effect of adding whole brain radiation therapy to stereotactic radiosurgery in patients with brain metastases. The report includes an explicit statement that age was one of three pre-planned effect modifiers.40
Definitely yes — Example: A meta-analysis comparing the effect of low-intensity pulsed ultrasound versus sham ultrasound on bone healing. The investigators had pre-specified the analysis in the published protocol41 together with two other subgroup hypotheses. The low number of tested effect modifiers and the pre-specified definition make multiplicity issues less likely.26
7 — Did the authors use a random effects model?
The credibility of claimed effect modification is higher if investigators used a random effects model within subgroups, allowing true effects to differ among studies within subgroups and allowing generalisation of results beyond the included studies; this is almost always the model that should be used.42, 43
The credibility is lower if investigators used: (a) a common effect (fixed effect, singular) model — implying all studies within subgroups are based on the same population;42, 43 or (b) a fixed effects model — implying results will only apply to the studies included in the subgroup but cannot be generalised beyond them.42, 43
Simulation studies have shown that failure to assume random effects increases the risk of false positive claims for both study-level and individual participant-level meta-analysis.14, 22, 24 A random effects model strengthens a test of interaction because a significant result is usually harder to achieve.3, 6, 22, 42, 44
If investigators state that they used a mixed effects model without further specification, it usually implies they used a random effects model for between-study differences within subgroups (appropriate) and a fixed effects model for between-subgroup differences (also appropriate6, 42, 43). Therefore, the appropriate answer is usually definitely yes.
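The mixed effects approach described above can be sketched as follows: a random effects model within each subgroup, then a fixed effect comparison between the two subgroup summaries. The sketch uses the DerSimonian-Laird estimator of between-study variance as one common choice; this is an assumption for illustration, not necessarily the estimator used in the cited analyses. Function names and data are hypothetical.

```python
import math

def dersimonian_laird(effects, ses):
    """Random effects pooling; returns (pooled, pooled_se, tau2)."""
    w = [1.0 / se ** 2 for se in ses]
    sw = sum(w)
    fixed = sum(wi * yi for wi, yi in zip(w, effects)) / sw
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, effects))
    df = len(effects) - 1
    c = sw - sum(wi ** 2 for wi in w) / sw
    # Method-of-moments between-study variance, truncated at zero.
    tau2 = max(0.0, (q - df) / c) if df > 0 and c > 0 else 0.0
    w_star = [1.0 / (se ** 2 + tau2) for se in ses]
    pooled = sum(wi * yi for wi, yi in zip(w_star, effects)) / sum(w_star)
    return pooled, math.sqrt(1.0 / sum(w_star)), tau2

def subgroup_difference_p(subgroup_a, subgroup_b):
    """Two-sided p-value comparing two random effects subgroup summaries.

    Each subgroup is a pair (effects, ses) of study-level log effects and
    standard errors.
    """
    est_a, se_a, _ = dersimonian_laird(*subgroup_a)
    est_b, se_b, _ = dersimonian_laird(*subgroup_b)
    z = (est_a - est_b) / math.sqrt(se_a ** 2 + se_b ** 2)
    return math.erfc(abs(z) / math.sqrt(2.0))

# Hypothetical heterogeneous subgroup: the random effects SE is much wider
# than the common effect SE, so the interaction test is harder to pass.
pooled, se, tau2 = dersimonian_laird([-0.5, 0.0, 0.5], [0.1, 0.1, 0.1])
print(f"pooled = {pooled:.2f}, SE = {se:.3f}, tau^2 = {tau2:.2f}")
```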
The question also applies to individual-participant data meta-analysis, for which an empirical study has shown that most do not apply a random effects model.45
| Response options | Description |
|---|---|
| Definitely no | Fixed (or common) effect model explicitly stated |
| Probably no or unclear | Probably no random effects model, or unclear |
| Probably yes | Probably random (or mixed) effects model |
| Definitely yes | Random (or mixed) effects model explicitly stated |
Definitely no — Example: An IPD meta-analysis of aspirin for primary prevention of cardiovascular events. The authors explicitly state that they used a fixed effects model.20
Probably no — Example: An IPD meta-analysis combined 13 studies comparing radiochemotherapy versus radiotherapy alone in patients with cervical cancer.17 The authors did not explicitly report how they modelled between-study differences. Because they used a fixed effect model for the overall analysis, it is most likely that they also used a fixed effect model within subgroups.
Definitely yes — Example: In a meta-analysis assessing the effect of inpatient rehabilitation versus usual care, the authors explicitly specified a random effects model for between-study differences in the methods section.15
8 — Were arbitrary cut points avoided? (continuous variables only)
Categorising continuous effect modifiers is common2 but associated with problems.46, 47 In the context of meta-analysis, cut points can cause additional problems: if two studies assessed the same continuous effect modifier but used different cut points, it may be impossible to combine the within-study results in a meaningful way unless individual patient data are available. Credibility is low if investigators selected the best-fitting data-driven cut point.46, 48
Provided individual participant data are available, it is also possible to average functions across several studies and base conclusions on the resulting mean function (i.e. a meta-analysis of interactions49, 50).
See RCT Item 5 for full response option descriptions.
Probably no — Example: A meta-analysis investigating interventions to reduce early hospital readmissions reported a potential effect modification by the number of intervention components.27 The published protocol did not specify cut points and the investigators explicitly highlighted the exploratory character of the analysis.
Probably yes — Example: In a meta-analysis on inpatient rehabilitation versus usual care, the intervention was better in preventing nursing home admissions in patients younger than 80 than in patients older than 80 (p=0.045).15 According to the authors, the threshold was pre-specified.
Definitely yes — Example: An IPD meta-analysis investigated whether patients with ARDS benefit from higher PEEP ventilation strategies.50 A continuous analysis suggested a non-linear effect modification by degree of hypoxaemia. A previous analysis dichotomised the effect modifier and could not reveal the potential non-linear relationship.35
9 — Are there additional considerations that may increase or decrease credibility?
Similar to RCT Item 6, with additional meta-analysis-specific considerations.
Sensitivity analysis suggesting robustness51, 52, 53:
Example: A meta-analysis comparing the effect of low-intensity pulsed ultrasound versus sham on bone healing.26 In a sensitivity analysis requested by the editors, the investigators applied a stricter threshold for missing data (≥10%). Although different criteria led to reclassification of one trial, the effect modification remained significant (p=0.004).
Effect modification supported by external evidence:
Example: A meta-analysis comparing transcatheter versus surgical aortic valve replacement.25 A prior cohort study of 501 patients using propensity score matching had suggested that the transapical approach was associated with more adverse events and higher mortality.54
Dose-response effect across levels of the effect modifier:
Example: An IPD meta-analysis combined 13 trials comparing radiochemotherapy versus radiotherapy alone in patients with cervical cancer.17 A subgroup analysis based on tumour stage suggested that the relative benefit decreased with increasing tumour stage across three stages, suggesting a possible “dose-response” effect (chi-square test for trend, p=0.017).
Risk of bias of the main effects of the individual RCTs or the meta-analysis: Commonly used instruments to formally assess the overall risk of bias are the Cochrane risk of bias tool for individual trials55 and the ROBIS tool for systematic reviews.56 Note that reporting bias can be introduced if only some studies report an effect modifier.57 Also, industry-funded trials are at higher risk of spurious claims of effect modification.58, 59, 60
Example: An IPD meta-analysis combined three trials comparing high versus low PEEP in ventilated patients with lung injury or ARDS.35 A subgroup analysis suggested that higher pressure was associated with longer survival in patients with but not in patients without ARDS (interaction p=0.02). Although the p-value provides only modest support against chance, the high methodological quality of all three trials is reassuring.
Exceptionally high power.23, 61
Persistence after adjustment for other potential effect modifiers62:
Example: An IPD meta-analysis of fixed-dose aspirin for primary prevention of cardiovascular events.20 The effect modification by weight remained when the investigators stratified their analysis by both weight and age.
Consistency across related outcomes:
Example: A meta-analysis comparing transcatheter versus surgical aortic valve replacement.25 The qualitative interaction was consistent across the outcomes of mortality, stroke, acute kidney injury, and bleeding.
10 — Overall credibility rating
The overall rating is a continuous visual analogue scale spanning four credibility areas, corresponding roughly to <25%, 25–50%, 50–75%, and >75% confidence that the apparent effect modification is true and not the result of chance or bias. The overall rating should be driven by the items that decrease credibility.
Recommended strategy:
- All responses definitely or probably reduced credibility or unclear → very low credibility
- Two or more responses definitely reduced credibility → maximum usually low credibility
- One response definitely reduced credibility → maximum usually moderate credibility
- Two responses probably reduced credibility → maximum usually moderate credibility
- No responses definitely or probably reduced credibility → high credibility very likely
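The recommended strategy can be expressed as a rough decision rule. The sketch below is illustrative only: the response labels and the function name are hypothetical, the rules say "usually" and leave room for judgement, and the list does not cover every combination (for example, exactly one response that probably reduced credibility), in which case the function returns `None`. Treating "two responses probably reduced credibility" as "two or more" is also an assumption.

```python
def suggested_ceiling(responses):
    """Return the suggested maximum overall credibility, or None if the
    listed rules do not cover the combination of responses.

    responses: one label per item, each of "definitely_reduced",
    "probably_reduced", "unclear", "increased", or "not_applicable".
    """
    applicable = [r for r in responses if r != "not_applicable"]
    if not applicable:
        raise ValueError("no applicable item responses")
    definitely = applicable.count("definitely_reduced")
    probably = applicable.count("probably_reduced")
    lowering = {"definitely_reduced", "probably_reduced", "unclear"}
    if all(r in lowering for r in applicable):
        return "very low"
    if definitely >= 2:
        return "low"
    if definitely == 1:
        return "moderate"
    if probably >= 2:  # assumption: "two" is read as "two or more"
        return "moderate"
    if probably == 0:
        return "high"
    return None  # exactly one "probably reduced": not covered by the list

print(suggested_ceiling(["definitely_reduced", "increased", "increased"]))
```

The result is a ceiling rather than a rating: the final position on the scale still depends on the rater's overall judgement.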
| Credibility | Interpretation | Implication |
|---|---|---|
| Very low | Very likely no effect modification | Use overall estimate for each subgroup |
| Low | Likely no effect modification | Use overall estimate; note remaining uncertainty |
| Moderate | Likely effect modification | Use separate estimates; note remaining uncertainty |
| High | Very likely effect modification | Use separate estimates for each subgroup |
"How to use ICEMAN" provides more suggestions for using and presenting ICEMAN in context. "Concept and scope of ICEMAN" provides a more detailed justification of why the scale is continuous and why low credibility suggests that effect modification is unlikely.
Worked example — Meta-analysis (cervical cancer)
An individual patient data meta-analysis of 13 trials compared radiochemotherapy versus radiotherapy alone in women with cervical cancer.17 The authors report “a suggestion of a difference in the size of the survival benefit with tumour stage.” The credibility assessment suggested low credibility.
| Item | Response | Comment |
|---|---|---|
| 1. Within vs between | Mostly within | All trials provided IPD; authors likely combined within and between; mostly driven by within-study information |
| 2. Similarity across trials | Probably not similar | Effect modification within individual trials not reported |
| 3. Number of studies | Rather large | 13 trials; reduces risk of trial-level confounding |
| 4. Direction a priori | Probably no | No information provided |
| 5. Interaction test | Chance likely | p=0.017 for chi-square test of trend |
| 6. Number of modifiers | Probably no | ≥8 subgroup analyses; no published protocol; potential multiplicity |
| 7. Random effects model | Probably no | Not explicitly stated; fixed effect used for overall analysis |
| 8. Cut points | Not applicable | Effect modifier is not continuous |
| 9. Additional (optional) | Probably increases | Dose-response pattern across tumour stages; consistent across different outcomes |
| 10. Overall | Low | Consistency across studies unclear; p-value not very small, possibly inflated by multiple analyses and use of fixed effect model |