ICEMAN for Randomized Controlled Trials

1 — Was the direction of effect modification correctly hypothesized a priori?

Credibility is higher if investigators correctly anticipated the direction of the effect modification (e.g. that an intervention is more effective in younger than in older patients), lower if they failed to anticipate a direction, and lowest if they anticipated the opposite direction.

Correct anticipation of an effect modification implies that the investigators had a specific hypothesis in mind — usually based on a biologic or other causal rationale, or sometimes based on prior evidence. In the Bayesian framework, the idea of a specific a priori hypothesis corresponds to using an informative rather than non-informative prior.^{1, 2}

In addition, the item captures that an explanation stated a priori is much more credible than a post hoc explanation. When the explanation is post hoc, investigators have likely considered many possible explanations, which creates a multiplicity problem.^{3, 4, 5, 6, 7, 8, 9} Hypotheses are most credible if they can be verified in a prior, ideally time-stamped, protocol or analysis plan.

Note that statements of pre-specification may not be reliable,¹⁰ nor do they imply a specific hypothesis, nor do they preclude issues of multiple analyses.

Response options	Description
Definitely no	Clearly post-hoc, results inconsistent with hypothesized direction, or biologically very implausible
Probably no or unclear	Vague hypothesis or hypothesized direction unclear
Probably yes	No protocol available but unequivocal statement of a priori hypothesis with correct direction
Definitely yes	Prior protocol available and includes hypothesis with correct specification of direction, e.g. based on biologic rationale

Definitely no — Example 1: The ISIS-2 trial testing ASA conducted a provocative post hoc subgroup analysis comparing patients born under different astrological signs.¹¹ ASA had a slight adverse effect in patients born under the sign of Gemini or Libra but a substantial benefit in others. The example became famous because the finding was obviously post hoc and not compatible with any biological model.

Definitely no — Example 2: In a trial comparing vasopressin versus norepinephrine for septic shock, the authors had hypothesized that the benefit of vasopressin would be larger in more severely ill patients. It turned out that the benefit seemed to be greater in the less severely ill patients (RR 1.04 in more severe vs 0.74 in less severe septic shock, interaction p=0.10). The investigators’ failure to correctly identify the direction weakens any inference about differential benefit.¹²

Probably no — Example: The investigators of the first large aspirin trial for patients with transient ischaemic attacks reported a beneficial effect in men but not in women.¹¹ Although investigators may have planned a priori to explore subgroup effects by sex, they had not anticipated a specific direction based on biologic rationale. Subsequent studies and meta-analyses failed to replicate the subgroup effect.¹³

Probably yes — Example: A trial in patients requiring dialysis compared jugular versus femoral catheterisation and found no significant difference with respect to catheter colonisation.¹⁴ An analysis of effect modification suggested that jugular catheters were superior in patients with high BMI but inferior in patients with a low BMI. The authors claimed that they had correctly anticipated the direction but there was no protocol available.

Definitely yes — Example: A trial comparing reamed versus unreamed intramedullary nails for tibial shaft fractures suggested that reamed nails were superior for closed but potentially inferior for open fractures.¹⁵ The investigators correctly anticipated the direction in their published protocol based on a biologic rationale: damage of endosteal blood supply through reamed nails may be more detrimental in open than in closed fractures.¹⁶

2 — Was the effect modification supported by prior evidence?

Credibility is higher if the effect modification is supported by prior direct or indirect evidence, lower if observed for the first time (often labelled as exploratory), and lowest if inconsistent with prior evidence.

Replication, ideally in another RCT, makes chance a less likely explanation for an apparent effect modification. Attempts at replication, and successful replications, seem to be rare,¹⁷ so prior evidence will mostly be unclear. Direction plays an important role: if two trials show different directions of effect modification, this markedly reduces credibility.

Response options	Description
Inconsistent with prior evidence	Prior evidence suggested a different direction of effect modification
Little or no support or unclear	No prior evidence, or consistent with weak/very indirect prior evidence (e.g. animal study at high risk of bias), or unclear
Some support	Consistent with more limited or indirect prior evidence (e.g. large observational study, non-significant effect modification in prior RCT, or different population)
Strong support	Consistent with strong prior evidence directly applicable to the clinical scenario (e.g. significant effect modification in related RCT)

Inconsistent — Example: A trial investigated the efficacy of neratinib in women with breast cancer.¹⁸ The authors mention two related trials that showed similar survival benefits irrespective of hormone receptor status,^{19, 20} and three other related trials that suggested greater survival in hormone receptor-negative patients, i.e. the opposite direction of effect modification.^{21, 22, 23}

Little or no support — Example: A trial comparing jugular versus femoral catheterisation.¹⁴ The prior evidence was unclear because the cited cohort study²⁴ provided no direct support for an interaction and was published after the trial had already started.

Some support — Example: A trial testing Epoetin Alfa for critically ill patients suggested a reduction in mortality in trauma patients.²⁵ Although the interaction test was not significant (p=0.16), it was consistent with a similar effect modification seen in a previous RCT.²⁶

Strong support — Example: A trial comparing low-carb versus low-fat diet found effect modification by insulin secretion (interaction p=0.01).²⁷ A previous RCT found a similar, significant effect modification (interaction p=0.02).²⁸

3 — Does a test for interaction suggest that chance is an unlikely explanation?

Credibility is higher if a statistical test for interaction suggests that chance is an unlikely explanation for the apparent effect modification.^{29, 30, 31} Credibility is lower if an interaction test is compatible with chance, or if no test is reported and none can be computed.

Warning

Important: Showing that an effect is significant in one subgroup and not in another is of little use: it provides no information whether chance might explain differences in effects across subgroups.^{4, 30, 32, 33, 34}

If no interaction p-value is reported, it can sometimes be calculated from the reported data (point estimates of effect and confidence intervals in the individual subgroups).^{35, 36} As a rule of thumb, if the 95% confidence intervals of the subgroup-specific estimates do not overlap, the interaction p-value must be smaller than 0.05. The converse does not hold: overlapping intervals do not rule out an interaction p-value below 0.05.
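The calculation from subgroup estimates and confidence intervals can be sketched as follows. This is a minimal illustration, not the instrument's prescribed method: it assumes two independent subgroups, ratio measures (such as RR) that are approximately normal on the log scale, and 95% confidence intervals. The function name `interaction_p_value` and the numbers in the usage lines are hypothetical.

```python
import math

def interaction_p_value(est1, lo1, hi1, est2, lo2, hi2):
    """Two-sided p-value for the difference between two subgroup ratio estimates.

    Each subgroup is described by a ratio estimate (e.g. RR) and the limits
    of its 95% confidence interval. Assumes independent subgroups and
    approximate normality on the log scale.
    """
    z95 = 1.959964  # standard normal quantile behind a 95% interval
    # Recover the standard error of each log ratio from the width of its CI
    se1 = (math.log(hi1) - math.log(lo1)) / (2 * z95)
    se2 = (math.log(hi2) - math.log(lo2)) / (2 * z95)
    # Difference between the log ratios and its standard error
    diff = math.log(est1) - math.log(est2)
    se_diff = math.sqrt(se1 ** 2 + se2 ** 2)
    z = diff / se_diff
    # Two-sided tail probability of the standard normal distribution
    return math.erfc(abs(z) / math.sqrt(2))

# Hypothetical subgroups: RR 0.70 (0.55 to 0.89) versus RR 1.10 (0.90 to 1.34)
p = interaction_p_value(0.70, 0.55, 0.89, 1.10, 0.90, 1.34)
print(f"interaction p-value: {p:.4f}")
```

In the hypothetical usage lines the two intervals do not overlap, so the result falls below 0.05, in line with the rule of thumb above.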

We anchored the response options around typical p-value thresholds of 0.05, 0.01, and 0.005, with a p-value of 0.005 or smaller representing the most credible category. The response options recognise that p-value thresholds of 0.05 or even 0.01 may be too lenient for claiming statistical significance.³⁷

Response options	Threshold
Chance a very likely explanation	Interaction p-value > 0.05
Chance a likely explanation or unclear	p ≤ 0.05 and > 0.01, or no test reported and not computable
Chance may not explain	p ≤ 0.01 and > 0.005
Chance an unlikely explanation	p ≤ 0.005
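
The thresholds in the table can be written as a small lookup. This is only a sketch; the function name `classify_interaction_p` and the use of `None` for "no test reported and not computable" are illustrative assumptions.

```python
def classify_interaction_p(p=None):
    """Map an interaction p-value to the item 3 response option.

    Pass None when no test is reported and none can be computed.
    """
    if p is None:
        return "Chance a likely explanation or unclear"
    if not 0.0 <= p <= 1.0:
        raise ValueError("p-value must lie between 0 and 1")
    if p > 0.05:
        return "Chance a very likely explanation"
    if p > 0.01:
        return "Chance a likely explanation or unclear"
    if p > 0.005:
        return "Chance may not explain"
    return "Chance an unlikely explanation"

for value in (0.06, 0.05, 0.01, 0.0001, None):
    print(value, "->", classify_interaction_p(value))
```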

Very likely — Example: A trial comparing prostatectomy versus observation for early prostate cancer found interaction p-values of 0.06 and 0.08 for potential benefits in specific subgroups.³⁸ The most likely explanation is chance.

Likely/unclear — Example: The PLATO trial suggested Ticagrelor was inferior in patients from North America (interaction p=0.05).³⁹ The p-value suggests that 1 in 20 trials may show such an effect modification or larger, even if not true.

May not explain — Example: A trial comparing reamed versus unreamed intramedullary nailing of tibial shaft fractures. An interaction p-value of 0.01 provided modest support against chance.¹⁵

Unlikely — Example: The CRASH-2 trial of tranexamic acid in trauma patients reported RR 0.68 for early treatment (≤1 h), RR 0.79 for treatment between 1 and 3 h, and RR 1.44 for treatment after 3 h. The interaction p-value was smaller than 0.0001.⁴⁰

4 — Did the authors test only a small number of effect modifiers, or consider multiplicity?

Performing multiple tests is a major concern in the context of effect modification analysis. Because multiple tests increase the risk of a chance finding,^{41, 42, 43} credibility is higher if investigators have tested only a small number of effect modifiers.

Multiplicity issues can arise in different ways,⁴⁴ including multiple candidate effect modifiers, multiple time points, multiple scales,⁴⁵ multiple outcomes, or multiple methods for testing the interaction.

An alternative to limiting the number of analyses is to statistically adjust the analysis for multiplicity, e.g. using correction of p-values considering the (familywise) type 1 error rate,⁴⁶ testing all candidate effect modifiers in a common model, or shrinkage estimators.^{33, 47} All techniques inevitably reduce power.^{3, 48, 49}
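
To illustrate why the number of tests matters, the sketch below computes the chance of at least one false positive among several interaction tests, together with a Bonferroni-adjusted threshold as one simple familywise correction. It assumes independent tests at a nominal level of 0.05, which the text does not state; the function names are hypothetical.

```python
def familywise_error_rate(n_tests, alpha=0.05):
    """Probability of at least one false positive among n independent tests."""
    return 1.0 - (1.0 - alpha) ** n_tests

def bonferroni_threshold(n_tests, alpha=0.05):
    """Per-test threshold that keeps the familywise error rate at or below alpha."""
    return alpha / n_tests

for n in (1, 3, 10, 20):
    fwer = familywise_error_rate(n)
    threshold = bonferroni_threshold(n)
    print(f"{n:2d} tests: familywise error {fwer:.2f}, Bonferroni threshold {threshold:.4f}")
```

Under these assumptions, ten unadjusted tests already carry roughly a 40% chance of at least one spurious finding, which is why the response options below distinguish between up to three, four to ten, and more than ten effect modifiers.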

Assessment of multiplicity crucially depends on reporting; reporting guidelines for effect modification are available.^{50, 51, 52} Note that retrospective statements about the number of pre-specified subgroup analyses are not always reliable.¹⁰

Response options	Description
Definitely no	Explicitly exploratory analysis, or large number of analyses (>10), and multiplicity not considered
Probably no or unclear	No mention of number, or 4–10 effect modifiers tested and not considered
Probably yes	No protocol but unequivocal statement of ≤3 effect modifiers tested
Definitely yes	Protocol available and ≤3 effect modifiers tested, or number considered in analysis

Definitely no — Example: A trial assessing stroke risk after carotid endarterectomy tested more than 20 baseline factors for potential effect modification without adjustment for multiplicity.⁵³ A subsequent trial found the opposite association,⁵⁴ suggesting that the original finding was due to chance.

Probably no — Example: A trial comparing prostatectomy versus observation.³⁸ The investigators did not take into account that they had tested at least seven effect modifiers for this outcome.

Probably yes — Example: A trial testing whether training of health professionals reduces caesarean deliveries. The effect modifier was one of two specified in the study protocol.⁵⁵

Definitely yes — Example: A trial comparing endarterectomy versus medical therapy. The published study protocol specified only one effect modifier.^{53, 56}

5 — Were arbitrary cut points avoided? (continuous variables only)

Categorising continuous effect modifiers is common⁵⁷ but associated with a number of problems^{58, 59}: cut points can introduce multiplicity, reduce power, mask linear or non-linear associations, and complicate comparisons across studies. Analyses that avoid cut points and make use of the full spectrum of values are the most credible.

Credibility is low if investigators selected the best-fitting data-driven cut point to maximise the effect modification. Such cut points are associated with a high rate of false positive claims.^{58, 60}

Modelling continuous effect modifiers brings some challenges that are not part of the instrument but may lower credibility: for example, model misspecification can occur if the apparent continuous relationship is driven by a few influential observations.^{61, 62, 63} The most credible continuous analyses are therefore those for which investigators have pre-specified how the treatment effect depends on the continuous variable (such as a linear or log relationship), or have considered only a small number of candidate functions.⁶⁴

An alternative to use of cut points and potentially complex modelling is to consider overlapping subgroups (e.g. using a sliding window approach).⁶⁵ The credibility is usually much higher than using arbitrary cut points but the interpretation can be difficult.
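
One way to form overlapping subgroups is sketched below: patients are ordered by the value of the effect modifier and a window of fixed size is moved along that ordering. This is an illustrative assumption about how a sliding window could be set up, not the procedure of the cited method; `sliding_windows`, `window_size`, and `step` are hypothetical names, and the treatment effect would still have to be estimated within each window.

```python
def sliding_windows(values, window_size, step):
    """Return overlapping subgroups as lists of patient indices.

    Patients are sorted by their effect modifier value; each window holds
    `window_size` consecutive patients and starts `step` patients after
    the previous one, so windows overlap whenever step < window_size.
    """
    if window_size < 1 or step < 1:
        raise ValueError("window_size and step must be positive")
    order = sorted(range(len(values)), key=lambda i: values[i])
    windows = []
    for start in range(0, len(order) - window_size + 1, step):
        windows.append(order[start:start + window_size])
    return windows

# Hypothetical ages of five patients
ages = [50, 20, 40, 10, 30]
for window in sliding_windows(ages, window_size=3, step=1):
    print([ages[i] for i in window])
```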

The credibility of a continuous analysis usually increases if investigators present a plot with confidence bands around the regression function and have carefully checked the proposed model. Formal interaction tests for continuous effect modification are available and should be applied.⁶⁴

Response options	Description
Definitely no	Analysis based on exploratory cut point (e.g. picking cut point associated with lowest interaction p-value)
Probably no or unclear	Analysis based on cut point(s) of unclear origin
Probably yes	Analysis based on pre-specified cut points (e.g. from prior RCT)
Definitely yes	Analysis based on the full continuum (e.g. assuming a linear or logarithmic relationship)
Not applicable	Effect modifier is not continuous

Definitely no — Example: A trial tested the effect of everolimus in women with a specific type of breast cancer.⁶⁶ The investigators dichotomised the continuous CIN score using the 75th percentile, explicitly selected because it “yielded the maximal difference in HR between high– versus low–CIN score subgroups.”

Probably no — Example: A trial comparing prostatectomy versus observation for early prostate cancer.³⁸ The investigators hypothesised a potential effect modification by PSA value below or above 10, but provided no rationale for the chosen threshold.

Probably yes — Example: The CRASH-2 trial⁴⁰ tested tranexamic acid in bleeding trauma patients. The investigators had pre-specified the three time-based categories in a published protocol.⁶⁷

Definitely yes — Example: A trial comparing interferon-alpha versus medroxyprogesterone in patients with renal carcinoma.⁶⁸ A subsequent analysis treated white cell count as a continuous variable, avoiding arbitrary cut points and maximising the power of the analysis. A plot of the continuous relationship with confidence bands was provided.⁶⁹

6 — Are there additional considerations that may increase or decrease credibility?

Methodologists have suggested a number of additional considerations that could be relevant for assessing the credibility of effect modifiers.⁷⁰ They are not part of the core items because they either are less relevant, rarely apply, or are difficult to assess. The only response options are probably decreased and probably increased. Leaving this section blank does not affect credibility.

Sensitivity analysis suggesting robustness: A sensitivity analysis can help to increase the confidence in a proposed effect modification.^{71, 72, 73}

Example: The CRASH-2 trial⁴⁰ performed a sensitivity analysis treating time as a continuous effect modifier which suggested that results were robust, despite the pre-specified use of two cut points.

Dose-response effect across levels of the effect modifier: Credibility may be higher if effects increase or decrease monotonically with increases in the levels of the modifier, e.g. an effect that increases incrementally across three or more age groups.

Example: The CRASH-2 trial⁴⁰ found that the effect on death due to bleeding decreased with increasing time from injury (interaction p<0.0001): early treatment (≤1 h) RR 0.68, between 1 and 3 h RR 0.79, after 3 h RR 1.44. The decrease across levels of the effect modifier strengthens the results.

Persistence after adjustment for other potential effect modifiers: Credibility may be higher if a multivariable analysis suggests that the apparent effect modifier is independent of other candidate modifiers.⁷⁴

Example: The CRASH-2 trial.⁴⁰ When including interaction terms for all four effect modifiers in a common model, the effect modification by time from injury remained highly significant (p<0.0001).

Risk of bias of the main effect of the RCTs: A commonly used instrument to formally assess the overall risk of bias is the Cochrane risk of bias tool.⁷⁵ Some studies have suggested that industry-funded trials are at higher risk of spurious claims of effect modification than non-industry-funded studies, especially if the overall effect is not significant.^{76, 77, 78}

Exceptionally high power: Methodologists have argued that the credibility of a proposed effect modification increases with its prospective power.^{49, 79} A rare situation of increased confidence would be a trial of over 10,000 patients with 80% power to detect a significant effect modification suggested in the study protocol.⁴⁹

Consistency across related outcomes: Credibility might be higher if an effect modification is found for biologically related outcomes.

Example: In a trial of reamed versus unreamed nailing of tibial fractures, the effect modification by smoking status observed for re-operations was also observed for quality of life outcomes.¹⁵

7 — Overall credibility rating

The overall rating is a continuous visual analogue scale spanning four credibility areas, corresponding roughly to <25%, 25–50%, 50–75%, and >75% confidence that the apparent effect modification is true and not the result of chance or bias. The overall rating should be driven by the items that decrease credibility.

Recommended strategy:

All responses definitely or probably reduced credibility or unclear → very low credibility
Two or more responses definitely reduced credibility → maximum usually low credibility
One response definitely reduced credibility → maximum usually moderate credibility
Two responses probably reduced credibility → maximum usually moderate credibility
No responses definitely or probably reduced credibility → high credibility very likely
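
The recommended strategy can be sketched as a function that returns a suggested rating or ceiling. This is a hedged illustration only: the response labels, the function name `suggested_credibility`, the reading of "two responses" as "two or more", and the handling of a single "probably reduced" response (which the list does not cover) are all assumptions. The final rating remains a judgement on the continuous scale.

```python
ALLOWED = {
    "definitely_reduced",
    "probably_reduced",      # also used here for "unclear" responses
    "probably_increased",
    "definitely_increased",
}

def suggested_credibility(responses):
    """Return the rating or ceiling suggested by the recommended strategy."""
    if not responses:
        raise ValueError("at least one item response is required")
    for response in responses:
        if response not in ALLOWED:
            raise ValueError(f"unknown response: {response}")
    n_definitely = responses.count("definitely_reduced")
    n_probably = responses.count("probably_reduced")
    if n_definitely + n_probably == len(responses):
        return "very low"
    if n_definitely >= 2:
        return "low"       # maximum usually low
    if n_definitely == 1 or n_probably >= 2:
        return "moderate"  # maximum usually moderate
    if n_probably == 1:
        # Assumption: this case is not covered by the listed rules
        return "moderate to high (not covered by the listed rules)"
    return "high"

# Hypothetical responses for items 1 to 5
example = [
    "probably_increased",
    "probably_reduced",
    "definitely_increased",
    "definitely_reduced",
    "definitely_increased",
]
print(suggested_credibility(example))
```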

Credibility	Interpretation	Implication
Very low	Very likely no effect modification	Use overall estimate for each subgroup
Low	Likely no effect modification	Use overall estimate; note remaining uncertainty
Moderate	Likely effect modification	Use separate estimates; note remaining uncertainty
High	Very likely effect modification	Use separate estimates for each subgroup

How to use ICEMAN provides more suggestions for using and presenting ICEMAN in context; Concept and scope of ICEMAN provides a more detailed justification why the scale is continuous and why low credibility suggests likely no effect modification.

Worked example — RCT (CRASH-2)

A secondary publication of the CRASH-2 trial investigated the effect of tranexamic acid versus placebo on death due to bleeding in trauma patients.⁴⁰ The investigators reported that “the effect of tranexamic acid on death due to bleeding varied according to the time from injury to treatment (test for interaction p<0·0001).” Although the investigators labelled their analysis as exploratory, an assessment using ICEMAN suggests moderate credibility.

Item	Response	Comment
1. Direction a priori	Probably yes	Subgroups (not direction) pre-specified in published protocol;⁶⁷ authors stated they had correctly anticipated direction
2. Prior evidence	Little or no support	Three cited papers represent expert opinion but no prior trial or cohort study showing a similar effect modification
3. Interaction test	Chance unlikely	p < 0.0001
4. Number of modifiers	Definitely no	Four pre-specified modifiers but applied to outcomes other than those pre-specified (labelled exploratory); however p is very small
5. Cut points	Definitely yes	Two analyses: pre-specified cut points AND a continuous analysis; plot with 95% confidence bands provided
6. Additional (optional)	Probably increases	Effect modification persisted after adjustment for other pre-specified modifiers (p<0.0001); dose-response pattern observed
7. Overall	Moderate	Lack of prior evidence is a limitation; continuous analysis and very small p-value are reassuring

References

1. Laber EB, Qian M. Evaluating personalized treatment regimes [Internet]. In: Methods in comparative effectiveness research. Boca Raton, FL: CRC Press; 2017. p. 483–498.Available from: http://dx.doi.org/10.1201/9781315159409-16

2. Henderson NC, Louis TA, Wang C, Varadhan R. Bayesian analysis of heterogeneous treatment effects for patient-centered outcomes research [Internet]. Health Serv. Outcomes Res. Methodol. 2016 ;16(4):213–233.Available from: http://dx.doi.org/10.1007/s10742-016-0159-3

3. Yusuf S, Wittes J, Probstfield J, Tyroler HA. Analysis and interpretation of treatment effects in subgroups of patients in randomized clinical trials [Internet]. JAMA. 1991 ;266(1):93–98.Available from: http://dx.doi.org/10.1001/jama.1991.03470010097038

4. Oxman AD, Guyatt GH. A consumer’s guide to subgroup analyses [Internet]. Ann. Intern. Med. 1992 ;116(1):78–84.Available from: http://dx.doi.org/10.7326/0003-4819-116-1-78

5. Thompson SG. Why sources of heterogeneity in meta-analysis should be investigated [Internet]. BMJ. 1994 ;309(6965):1351–1355.Available from: http://dx.doi.org/10.1136/bmj.309.6965.1351

6. Fletcher J. Subgroup analyses: How to avoid being misled [Internet]. BMJ. 2007 ;335(7610):96–97.Available from: http://dx.doi.org/10.1136/bmj.39265.596262.AD

7. Dijkman B, Kooistra B, Bhandari M, Evidence-Based Surgery Working Group. How to work with a subgroup analysis. Can. J. Surg. 2009 ;52(6):515–522.

8. Gagnier JJ, Morgenstern H, Altman DG, Berlin J, Chang S, McCulloch P, Sun X, Moher D, Ann Arbor Clinical Heterogeneity Consensus Group. Consensus-based recommendations for investigating clinical heterogeneity in systematic reviews [Internet]. BMC Med. Res. Methodol. 2013 ;13(1):106.Available from: http://dx.doi.org/10.1186/1471-2288-13-106

9. Varadhan R, Stuart EA, Louis TA, Segal JB, Weiss CO. Review of guidance documents for selected methods in patient centered outcomes research: Standards in addressing heterogeneity of treatment effectiveness in observational and experimental patient centered outcomes research. pcori.org. 2012.

10. Kasenda B, Schandelmaier S, Sun X, Elm E von, You J, Blümle A, Tomonaga Y, Saccilotto R, Amstutz A, Bengough T, Meerpohl JJ, Stegert M, Olu KK, Tikkinen KAO, Neumann I, Carrasco-Labra A, Faulhaber M, Mulla SM, Mertz D, Akl EA, Bassler D, Busse JW, Ferreira-González I, Lamontagne F, Nordmann A, Gloy V, Raatz H, Moja L, Rosenthal R, Ebrahim S, Vandvik PO, Johnston BC, Walter MA, Burnand B, Schwenkglenks M, Hemkens LG, Bucher HC, Guyatt GH, Briel M, DISCO Study Group. Subgroup analyses in randomised controlled trials: Cohort study on trial protocols and journal publications [Internet]. BMJ. 2014 ;349(jul16 1):g4539.Available from: http://dx.doi.org/10.1136/bmj.g4539

11. ISIS-2 (Second International Study of Infarct Survival) Collaborative Group. Randomised trial of intravenous streptokinase, oral aspirin, both, or neither among 17,187 cases of suspected acute myocardial infarction: ISIS-2. Lancet. 1988 ;2(8607):349–360.

12. Russell JA, Walley KR, Singer J, Gordon AC, Hébert PC, Cooper DJ, Holmes CL, Mehta S, Granton JT, Storms MM, Cook DJ, Presneill JJ, Ayers D, VASST Investigators. Vasopressin versus norepinephrine infusion in patients with septic shock [Internet]. N. Engl. J. Med. 2008 ;358(9):877–887.Available from: http://dx.doi.org/10.1056/NEJMoa067373

13. Antiplatelet Trialists’ Collaboration. Collaborative overview of randomised trials of antiplatelet therapy prevention of death, myocardial infarction, and stroke by prolonged antiplatelet therapy in various categories of patients [Internet]. BMJ. 1994 ;308(6921):81–106.Available from: https://doi.org/10.1136/bmj.308.6921.81

14. Parienti J-J, Thirion M, Mégarbane B, Souweine B, Ouchikhe A, Polito A, Forel J-M, Marqué S, Misset B, Airapetian N, Daurel C, Mira J-P, Ramakers M, Cheyron D du, Le Coutour X, Daubin C, Charbonneau P, Members of the Cathedia Study Group. Femoral vs jugular venous catheterization and risk of nosocomial events in adults requiring acute renal replacement therapy: A randomized controlled trial [Internet]. JAMA. 2008 ;299(20):2413–2422.Available from: http://dx.doi.org/10.1001/jama.299.20.2413

15. Study to Prospectively Evaluate Reamed Intramedullary Nails in Patients with Tibial Fractures Investigators, Bhandari M, Guyatt G, Tornetta P 3rd, Schemitsch EH, Swiontkowski M, Sanders D, Walter SD. Randomized trial of reamed and unreamed intramedullary nailing of tibial shaft fractures: By the study to prospectively evaluate reamed intramedullary nails in patients with tibial fractures (SPRINT) investigators [Internet]. J. Bone Joint Surg. Am. 2008 Dec. ;90(12):2567–2578.Available from: http://dx.doi.org/10.2106/JBJS.G.01694

16. SPRINT Investigators, Bhandari M, Guyatt G, Tornetta P 3rd, Schemitsch E, Swiontkowski M, Sanders D, Walter SD. Study to prospectively evaluate reamed intramedually nails in patients with tibial fractures (S.P.R.I.N.T.): Study rationale and design [Internet]. BMC Musculoskelet. Disord. 2008 ;9(1):91.Available from: http://dx.doi.org/10.1186/1471-2474-9-91

17. Wallach JD, Sullivan PG, Trepanowski JF, Sainani KL, Steyerberg EW, Ioannidis JPA. Evaluation of evidence of statistical support and corroboration of subgroup claims in randomized clinical trials [Internet]. JAMA Intern. Med. 2017 ;177(4):554–560.Available from: http://dx.doi.org/10.1001/jamainternmed.2016.9125

18. Chan A, Delaloge S, Holmes FA, Moy B, Iwata H, Harvey VJ, Robert NJ, Silovski T, Gokmen E, Minckwitz G von, Ejlertsen B, Chia SKL, Mansi J, Barrios CH, Gnant M, Buyse M, Gore I, Smith J 2nd, Harker G, Masuda N, Petrakova K, Zotano AG, Iannotti N, Rodriguez G, Tassone P, Wong A, Bryce R, Ye Y, Yao B, Martin M, ExteNET Study Group. Neratinib after trastuzumab-based adjuvant therapy in patients with HER2-positive breast cancer (ExteNET): A multicentre, randomised, double-blind, placebo-controlled, phase 3 trial [Internet]. Lancet Oncol. 2016 Mar. ;17(3):367–377.Available from: http://dx.doi.org/10.1016/S1470-2045(15)00551-3

19. Perez EA, Romond EH, Suman VJ, Jeong J-H, Sledge G, Geyer CE Jr, Martino S, Rastogi P, Gralow J, Swain SM, Winer EP, Colon-Otero G, Davidson NE, Mamounas E, Zujewski JA, Wolmark N. Trastuzumab plus adjuvant chemotherapy for human epidermal growth factor receptor 2-positive breast cancer: Planned joint analysis of overall survival from NSABP B-31 and NCCTG N9831 [Internet]. J. Clin. Oncol. 2014 ;32(33):3744–3752.Available from: http://dx.doi.org/10.1200/JCO.2014.55.5730

20. Untch M, Gelber RD, Jackisch C, Procter M, Baselga J, Bell R, Cameron D, Bari M, Smith I, Leyland-Jones B, Azambuja E de, Wermuth P, Khasanov R, Feng-Yi F, Constantin C, Mayordomo JI, Su C-H, Yu S-Y, Lluch A, Senkus-Konefka E, Price C, Haslbauer F, Suarez Sahui T, Srimuninnimit V, Colleoni M, Coates AS, Piccart-Gebhart MJ, Goldhirsch A, HERA Study Team. Estimating the magnitude of trastuzumab effects within patient subgroups in the HERA trial [Internet]. Ann. Oncol. 2008 June ;19(6):1090–1096.Available from: http://dx.doi.org/10.1093/annonc/mdn005

21. Azambuja E de, Holmes AP, Piccart-Gebhart M, Holmes E, Di Cosimo S, Swaby RF, Untch M, Jackisch C, Lang I, Smith I, Boyle F, Xu B, Barrios CH, Perez EA, Azim HA Jr, Kim S-B, Kuemmel S, Huang C-S, Vuylsteke P, Hsieh R-K, Gorbunova V, Eniu A, Dreosti L, Tavartkiladze N, Gelber RD, Eidtmann H, Baselga J. Lapatinib with trastuzumab for HER2-positive early breast cancer (NeoALTTO): Survival outcomes of a randomised, open-label, multicentre, phase 3 trial and their association with pathological complete response [Internet]. Lancet Oncol. 2014 Sept. ;15(10):1137–1146.Available from: http://dx.doi.org/10.1016/S1470-2045(14)70320-1

22. De Laurentiis M, Arpino G, Massarelli E, Ruggiero A, Carlomagno C, Ciardiello F, Tortora G, D’Agostino D, Caputo F, Cancello G, Montagna E, Malorni L, Zinno L, Lauria R, Bianco AR, De Placido S. A meta-analysis on the interaction between HER-2 expression and response to endocrine treatment in advanced breast cancer [Internet]. Clin. Cancer Res. 2005 ;11(13):4741–4748.Available from: http://dx.doi.org/10.1158/1078-0432.CCR-04-2569

23. Dowsett M, Harper-Wynne C, Boeddinghaus I, Salter J, Hills M, Dixon M, Ebbs S, Gui G, Sacks N, Smith I. HER-2 amplification impedes the antiproliferative effects of hormone therapy in estrogen receptor-positive primary breast cancer [Internet]. Cancer Res. 2001 ;61(23):8452–8458.Available from: https://www.ncbi.nlm.nih.gov/pubmed/11731427

24. Trick WE, Miranda J, Evans AT, Charles-Damte M, Reilly BM, Clarke P. Prospective cohort study of central venous catheters among internal medicine ward patients [Internet]. Am. J. Infect. Control. 2006 Dec. ;34(10):636–641.Available from: http://dx.doi.org/10.1016/j.ajic.2006.02.008

25. Corwin HL, Gettinger A, Fabian TC, May A, Pearl RG, Heard S, An R, Bowers PJ, Burton P, Klausner MA, Corwin MJ, EPO Critical Care Trials Group. Efficacy and safety of epoetin alfa in critically ill patients [Internet]. N. Engl. J. Med. 2007 ;357(10):965–976.Available from: http://dx.doi.org/10.1056/NEJMoa071533

26. Corwin HL, Gettinger A, Pearl RG, Fink MP, Levy MM, Shapiro MJ, Corwin MJ, Colton T, EPO Critical Care Trials Group. Efficacy of recombinant human erythropoietin in critically ill patients: A randomized controlled trial [Internet]. JAMA. 2002 ;288(22):2827–2835.Available from: http://dx.doi.org/10.1001/jama.288.22.2827

27. Ebbeling CB, Leidig MM, Feldman HA, Lovesky MM, Ludwig DS. Effects of a low-glycemic load vs low-fat diet in obese young adults: A randomized trial [Internet]. JAMA. 2007 ;297(19):2092–2102.Available from: http://dx.doi.org/10.1001/jama.297.19.2092

28. Pittas AG, Das SK, Hajduk CL, Golden J, Saltzman E, Stark PC, Greenberg AS, Roberts SB. A low-glycemic load diet facilitates greater weight loss in overweight adults with high insulin secretion but not in overweight adults with low insulin secretion in the CALERIE trial [Internet]. Diabetes Care. 2005 Dec. ;28(12):2939–2941.Available from: http://dx.doi.org/10.2337/diacare.28.12.2939

29. VanderWeele TJ, Knol MJ. Interpretation of subgroup analyses in randomized trials: Heterogeneity versus secondary interventions [Internet]. Ann. Intern. Med. 2011 ;154(10):680–683.Available from: http://dx.doi.org/10.7326/0003-4819-154-10-201105170-00008

30. Brookes ST, Whitley E, Peters TJ, Mulheran PA, Egger M, Davey Smith G. Subgroup analyses in randomised controlled trials: Quantifying the risks of false-positives and false-negatives [Internet]. Health Technol. Assess. 2001 ;5(33):1–56.Available from: http://dx.doi.org/10.3310/hta5330

31. Rockette HE, Caplan RJ. Strategies for subgroup analysis in clinical trials [Internet]. Recent Results Cancer Res. 1988 ;111:49–54.Available from: http://dx.doi.org/10.1007/978-3-642-83419-6_6

32. Simon R. Patient subsets and variation in therapeutic efficacy [Internet]. Br. J. Clin. Pharmacol. 1982 Oct. ;14(4):473–482.Available from: http://dx.doi.org/10.1111/j.1365-2125.1982.tb02015.x

33. Tanniou J, Tweel I van der, Teerenstra S, Roes KCB. Estimates of subgroup treatment effects in overall nonsignificant trials: To what extent should we believe in them? [Internet]. Pharm. Stat. 2017 July ;16(4):280–295.Available from: https://doi.org/10.1002/pst.1810

34. Pocock SJ, Assmann SE, Enos LE, Kasten LE. Subgroup analysis, covariate adjustment and baseline comparisons in clinical trial reporting: Current practice and problems [Internet]. Stat. Med. 2002 ;21(19):2917–2930.Available from: http://dx.doi.org/10.1002/sim.1296

35. Altman DG, Bland JM. How to obtain the P value from a confidence interval [Internet]. BMJ. 2011 ;343(aug08 1):d2304.Available from: http://dx.doi.org/10.1136/bmj.d2304

36. Knol MJ, Pestman WR, Grobbee DE. The (mis)use of overlap of confidence intervals to assess effect modification [Internet]. Eur. J. Epidemiol. 2011 Apr. ;26(4):253–254.Available from: http://dx.doi.org/10.1007/s10654-011-9563-8

37. Benjamin DJ, Berger JO, Johannesson M, Nosek BA, Wagenmakers E-J, Berk R, Bollen KA, Brembs B, Brown L, Camerer C, Cesarini D, Chambers CD, Clyde M, Cook TD, De Boeck P, Dienes Z, Dreber A, Easwaran K, Efferson C, Fehr E, Fidler F, Field AP, Forster M, George EI, Gonzalez R, Goodman S, Green E, Green DP, Greenwald AG, Hadfield JD, Hedges LV, Held L, Hua Ho T, Hoijtink H, Hruschka DJ, Imai K, Imbens G, Ioannidis JPA, Jeon M, Jones JH, Kirchler M, Laibson D, List J, Little R, Lupia A, Machery E, Maxwell SE, McCarthy M, Moore DA, Morgan SL, Munafó M, Nakagawa S, Nyhan B, Parker TH, Pericchi L, Perugini M, Rouder J, Rousseau J, Savalei V, Schönbrodt FD, Sellke T, Sinclair B, Tingley D, Van Zandt T, Vazire S, Watts DJ, Winship C, Wolpert RL, Xie Y, Young C, Zinman J, Johnson VE. Redefine statistical significance [Internet]. Nat. Hum. Behav. 2018 Jan. ;2(1):6–10.Available from: http://dx.doi.org/10.1038/s41562-017-0189-z

38. Wilt TJ, Jones KM, Barry MJ, Andriole GL, Culkin D, Wheeler T, Aronson WJ, Brawer MK. Follow-up of prostatectomy versus observation for early prostate cancer [Internet]. N. Engl. J. Med. 2017 ;377(2):132–142.Available from: http://dx.doi.org/10.1056/NEJMoa1615869

39. Wallentin L, Becker RC, Budaj A, Cannon CP, Emanuelsson H, Held C, Horrow J, Husted S, James S, Katus H, Mahaffey KW, Scirica BM, Skene A, Steg PG, Storey RF, Harrington RA, PLATO Investigators, Freij A, Thorsén M. Ticagrelor versus clopidogrel in patients with acute coronary syndromes [Internet]. N. Engl. J. Med. 2009 ;361(11):1045–1057.Available from: http://dx.doi.org/10.1056/NEJMoa0904327

40. CRASH-2 collaborators, Roberts I, Shakur H, Afolabi A, Brohi K, Coats T, Dewan Y, Gando S, Guyatt G, Hunt BJ, Morales C, Perel P, Prieto-Merino D, Woolley T. The importance of early treatment with tranexamic acid in bleeding trauma patients: An exploratory analysis of the CRASH-2 randomised controlled trial [Internet]. Lancet. 2011 ;377(9771):1096–101, 1101.e1–2.Available from: http://dx.doi.org/10.1016/S0140-6736(11)60278-X

41. Mills JL. Data torturing [Internet]. N. Engl. J. Med. 1993 ;329(16):1196–1199.Available from: http://dx.doi.org/10.1056/NEJM199310143291613

42. Counsell CE, Clarke MJ, Slattery J, Sandercock PA. The miracle of DICE therapy for acute stroke: Fact or fictional product of subgroup analysis? [Internet]. BMJ. 1994 ;309(6970):1677–1681.Available from: http://dx.doi.org/10.1136/bmj.309.6970.1677

43. Higgins JPT, Thompson SG. Controlling the risk of spurious findings from meta-regression [Internet]. Stat. Med. 2004 ;23(11):1663–1682.Available from: http://dx.doi.org/10.1002/sim.1752

44. Li G, Taljaard M, Van den Heuvel ER, Levine MA, Cook DJ, Wells GA, Devereaux PJ, Thabane L. An introduction to multiplicity issues in clinical trials: The what, why, when and how [Internet]. Int. J. Epidemiol. 2017 ;46(2):746–755.Available from: http://dx.doi.org/10.1093/ije/dyw320

45. Starr JR, McKnight B. Assessing interaction in case-control studies: Type I errors when using both additive and multiplicative scales [Internet]. Epidemiology. 2004 July ;15(4):422–427.Available from: http://dx.doi.org/10.1097/01.ede.0000129508.82783.94

46. Shaffer JP. Multiple hypothesis testing. 1995.

47. Varadhan R, Wang S-J. Treatment effect heterogeneity for univariate subgroups in clinical trials: Shrinkage, standardization, or else [Internet]. Biom. J. 2016 Jan. ;58(1):133–153.Available from: http://dx.doi.org/10.1002/bimj.201400102

48. Grouin J-M, Coste M, Lewis J. Subgroup analyses in randomized clinical trials: Statistical and regulatory issues [Internet]. J. Biopharm. Stat. 2005 ;15(5):869–882.Available from: http://dx.doi.org/10.1081/BIP-200067988

49. Burke JF, Sussman JB, Kent DM, Hayward RA. Three simple rules to ensure reasonably credible subgroup analyses [Internet]. BMJ. 2015 ;351:h5651.Available from: http://dx.doi.org/10.1136/bmj.h5651

50. Kent DM, Rothwell PM, Ioannidis JPA, Altman DG, Hayward RA. Assessing and reporting heterogeneity in treatment effects in clinical trials: A proposal [Internet]. Trials. 2010 ;11(1):85.Available from: http://dx.doi.org/10.1186/1745-6215-11-85

51. Wang R, Lagakos SW, Ware JH, Hunter DJ, Drazen JM. Statistics in medicine–reporting of subgroup analyses in clinical trials [Internet]. N. Engl. J. Med. 2007 ;357(21):2189–2194.Available from: http://dx.doi.org/10.1056/NEJMsr077003

52. Knol MJ, VanderWeele TJ. Recommendations for presenting analyses of effect modification and interaction [Internet]. Int. J. Epidemiol. 2012 Apr. ;41(2):514–520.Available from: http://dx.doi.org/10.1093/ije/dyr218

53. Barnett HJ, Taylor DW, Eliasziw M, Fox AJ, Ferguson GG, Haynes RB. Benefit of carotid endarterectomy in patients with symptomatic moderate or severe stenosis. North American Symptomatic Carotid Endarterectomy Trial Collaborators. N. Engl. J. Med. 1998 ;339(20):1415–1425.

54. Taylor DW, Barnett HJ, Haynes RB, Ferguson GG, Sackett DL, Thorpe KE. Low-dose and high-dose acetylsalicylic acid for patients undergoing carotid endarterectomy: A randomised controlled trial. ASA and Carotid Endarterectomy (ACE) Trial Collaborators. Lancet. 1999 ;353(9171):2179–2184.

55. Chaillet N, Dumont A, Abrahamowicz M, Pasquier JC, Audibert F, Monnier P. A cluster-randomized trial to reduce cesarean delivery rates in Quebec. N. Engl. J. Med. 2015 ;372(18):1710–1721.

56. North American Symptomatic Carotid Endarterectomy Trial. Methods, patient characteristics, and progress. Stroke. 1991 ;22(6):711–720.

57. Fisher DJ, Carpenter JR, Morris TP, Freeman SC, Tierney JF. Meta-analytical methods to identify who benefits most from treatments: Daft, deluded, or deft approach? [Internet]. BMJ. 2017 ;356:j573.Available from: http://dx.doi.org/10.1136/bmj.j573

58. Royston P, Altman DG, Sauerbrei W. Dichotomizing continuous predictors in multiple regression: A bad idea [Internet]. Stat. Med. 2006 ;25(1):127–141.Available from: http://dx.doi.org/10.1002/sim.2331

59. Altman DG, Royston P. The cost of dichotomising continuous variables [Internet]. BMJ. 2006 ;332(7549):1080.Available from: http://dx.doi.org/10.1136/bmj.332.7549.1080

60. Altman DG, Lausen B, Sauerbrei W, Schumacher M. Dangers of using “optimal” cutpoints in the evaluation of prognostic factors [Internet]. J. Natl. Cancer Inst. 1994 ;86(11):829–835.Available from: http://dx.doi.org/10.1093/jnci/86.11.829

61. Royston P, Sauerbrei W. Interaction of treatment with a continuous variable: Simulation study of significance level for several methods of analysis [Internet]. Stat. Med. 2013 ;32(22):3788–3803.Available from: http://dx.doi.org/10.1002/sim.5813

62. Gilthorpe MS, Clayton DG. Statistical interactions and gene-environment joint effects [Internet]. In: Modern methods for epidemiology. Dordrecht: Springer Netherlands; 2012. p. 291–311.Available from: http://dx.doi.org/10.1007/978-94-007-3024-3_17

63. Royston P, Sauerbrei W. Interaction of treatment with a continuous variable: Simulation study of power for several methods of analysis [Internet]. Stat. Med. 2014 ;33(27):4695–4708.Available from: http://dx.doi.org/10.1002/sim.6308

64. Royston P, Sauerbrei W. A new approach to modelling interactions between treatment and continuous covariates in clinical trials by using fractional polynomials [Internet]. Stat. Med. 2004 ;23(16):2509–2525.Available from: http://dx.doi.org/10.1002/sim.1815

65. Bonetti M, Zahrieh D, Cole BF, Gelber RD. A small sample study of the STEPP approach to assessing treatment-covariate interactions in survival data [Internet]. Stat. Med. 2009 ;28(8):1255–1268.Available from: https://doi.org/10.1002/sim.3524

66. Hortobagyi GN, Chen D, Piccart M, Rugo HS, Burris HA 3rd, Pritchard KI, Campone M, Noguchi S, Perez AT, Deleu I, Shtivelband M, Masuda N, Dakhil S, Anderson I, Robinson DM, He W, Garg A, McDonald ER 3rd, Bitter H, Huang A, Taran T, Bachelot T, Lebrun F, Lebwohl D, Baselga J. Correlative analysis of genetic alterations and everolimus benefit in hormone receptor-positive, human epidermal growth factor receptor 2-negative advanced breast cancer: Results from BOLERO-2 [Internet]. J. Clin. Oncol. 2016 ;34(5):419–426.Available from: http://dx.doi.org/10.1200/JCO.2014.60.1971

67. The CRASH Trials Co-ordinating Centre. Protocol 05PRT/1. 2005.

68. Medical Research Council Renal Cancer Collaborators. Interferon-alpha and survival in metastatic renal carcinoma: Early results of a randomised controlled trial. Lancet. 1999 ;353(9146):14–17.

69. Royston P, Sauerbrei W, Ritchie A. Is treatment with interferon-alpha effective in all patients with metastatic renal carcinoma? A new approach to the investigation of interactions [Internet]. Br. J. Cancer. 2004 ;90(4):794–799.Available from: http://dx.doi.org/10.1038/sj.bjc.6601622

70. Schandelmaier S, Chang Y, Devasenapathy N, Devji T, Kwong JSW, Colunga Lozano LE, Lee Y, Agarwal A, Bhatnagar N, Ewald H, Zhang Y, Sun X, Thabane L, Walsh M, Briel M, Guyatt GH. A systematic survey identified 36 criteria for assessing effect modification claims in randomized trials or meta-analyses [Internet]. J. Clin. Epidemiol. 2019 Sept. ;113:159–167.Available from: http://dx.doi.org/10.1016/j.jclinepi.2019.05.014

71. Desai M, Pieper KS, Mahaffey K. Challenges and solutions to pre- and post-randomization subgroup analyses [Internet]. Curr. Cardiol. Rep. 2014 ;16(10):531.Available from: http://dx.doi.org/10.1007/s11886-014-0531-2

72. VanderWeele T. Explanation in causal inference: Methods for mediation and interaction. 1st ed. New York, NY: Oxford University Press; 2015.

73. Pearce N, Greenland S. Confounding and interaction [Internet]. In: Ahrens W, Pigeot I, editor(s). Handbook of epidemiology. New York, NY: Springer New York; 2014. p. 659–684.Available from: https://doi.org/10.1007/978-1-4614-6625-3_10

74. Varadhan R, Wang S-J. Standardization for subgroup analysis in randomized controlled trials [Internet]. J. Biopharm. Stat. 2014 ;24(1):154–167.Available from: http://dx.doi.org/10.1080/10543406.2013.856023

75. Higgins J, Sterne J, Savović J, Page M, Hróbjartsson A, Boutron I, Reeves B, Eldridge S. A revised tool for assessing risk of bias in randomized trials. Cochrane Database of Systematic Reviews. 2016 ;10:29–31.

76. Sun X, Briel M, Busse JW, You JJ, Akl EA, Mejza F, Bala MM, Bassler D, Mertz D, Diaz-Granados N, Vandvik PO, Malaga G, Srinathan SK, Dahm P, Johnston BC, Alonso-Coello P, Hassouneh B, Truong J, Dattani ND, Walter SD, Heels-Ansdell D, Bhatnagar N, Altman DG, Guyatt GH. The influence of study characteristics on reporting of subgroup analyses in randomised controlled trials: Systematic review [Internet]. BMJ. 2011 ;342(mar28 1):d1569.Available from: https://doi.org/10.1136/bmj.d1569

77. Barton S, Peckitt C, Sclafani F, Cunningham D, Chau I. The influence of industry sponsorship on the reporting of subgroup analyses within phase III randomised controlled trials in gastrointestinal oncology [Internet]. Eur. J. Cancer. 2015 Dec. ;51(18):2732–2739.Available from: http://dx.doi.org/10.1016/j.ejca.2015.08.030

78. Gabler NB, Duan N, Raneses E, Suttner L, Ciarametaro M, Cooney E, Dubois RW, Halpern SD, Kravitz RL. No improvement in the reporting of clinical trial subgroup effects in high-impact general medical journals [Internet]. Trials. 2016 ;17(1):320.Available from: http://dx.doi.org/10.1186/s13063-016-1447-5

79. Alosh M, Huque MF, Bretz F, D’Agostino RB Sr. Tutorial on statistical considerations on subgroup analysis in confirmatory clinical trials [Internet]. Stat. Med. 2017 ;36(8):1334–1360.Available from: http://dx.doi.org/10.1002/sim.7167