The Illusory Side of “Comparative Effectiveness Research”


Below, a Guest Post by Dr. Nortin M. Hadler, Professor of Medicine and Microbiology/Immunology, University of North Carolina at Chapel Hill and Dr. Robert A. McNutt,  Professor of Medicine, Chief,  Section on Medical Informatics and Patient Safety, Rush University Medical Center, Chicago.  Their argument that comparative effectiveness research (CER) needs an “anchor”—one treatment with known efficacy—is a good one, and gave me a new perspective on CER. In their analysis of randomized controlled trials, they highlight the crucial question: how high should we set the bar to consider the results of the trial compelling?


“Comparative effectiveness research” is now legislated as a priority for translational research. The goal is to inform decision making by assessing relative effectiveness in practice. An impressive effort has been mobilized to target efforts and establish a methodological framework. We argue that any such exercise requires a comparator with known and meaningful efficacy; there must be at least one anchoring group or subset for which a particular intervention has reproducible and meaningful benefit in randomized controlled trials. Without such, there is a likelihood that the effort will degenerate into comparative ineffectiveness research.

As charged in the American Recovery and Reinvestment Act, the Institute of Medicine defined comparative effectiveness research (CER) as “…the generation and synthesis of evidence that compares the benefits and harms of alternative methods to prevent, diagnose, treat and monitor a clinical condition, or to improve the delivery of care… at both the individual and population levels.”

However, you can’t compare treatments for effectiveness from observational data unless you are certain that one of the comparators is efficacious. There must be at least one group of patients for whom the treatment has unequivocal efficacy. Otherwise, CER might discern differences in relative ineffectiveness.  We argue that CER cannot succeed as the primary mechanism to assure the provision of rational health care.

 The difference between efficacy and effectiveness

 The science of efficacy tests the hypothesis that a particular intervention works in a particular group of patients. CER asks whether an intervention works better than other interventions in practice where patients are more heterogeneous than those recruited and accepted in a trial.  The gold standard of efficacy research is the randomized controlled trial (RCT). RCTs usually monitor defined, albeit sizeable populations for surrogate outcomes in order to detect a difference in the short term. Modern biostatistics has probed every nuance of the RCT paradigm. The result is a highly sophisticated understanding of its limitations.  A particularly vexing limitation is that the RCT fails to test hypotheses broadly enough; that is, RCTs limit the variability of patients making it difficult to generalize the value of treatments to those not studied.

 CER to the rescue?

The methodology employed for CER is not constrained by limits on patient variability as in RCTs. CER utilizes real-world data sets to deduce benefit/harm in a range of patients, including those who might reasonably be excluded from an RCT. This entails large clinical and administrative networks to provide data. Datasets must be large enough to capture individuals’ differences that affect the estimates of benefit/harm across the gamut of insurance, age, co-morbidities, and lifestyle. This inclusivity is paramount. For example, when we buy a book at Amazon, we are given a list of “other books bought by those who bought your book”. There is a data mining program in the background that links characteristics of the book you bought to characteristics of books like yours that others bought and to the characteristics of buyers. A different list of book recommendations results based on variations in buyer characteristics, like age, gender, and purchase history. This is a perfect analogy to what CER promises.

But there are fundamental differences between book buying and health care provision. In book buying there is a defined, homogeneous outcome, the book; health care outcomes are not homogeneous and are often subjective (life, function, jobs, fancier hospitals, etc.). It is hard to imagine the messy list from “Amazon” health we would see based on what we chose as a goal of health care; it is easy to imagine how readily perturbed the list would be by introducing nuances in outcome. One of the fundamental problems with attempts to rationalize health care is that we still don’t agree on how to measure either health or what is rational care.

Furthermore, Amazon is not the sole vendor of books. The associations at Amazon may not reflect the totality of characteristics (books and people) across all places books can be bought. Hence, any book list suggested solely by Amazon may be incomplete or flawed. For CER to be a valid “Amazon” for health care, it has to define and capture the nuances of health care outcomes and provision across all sites of care (including the home).  

Clearly any inference regarding relative benefits and harms from the analysis of large datasets is suspect. Shortcomings relating to benefits, harms and provision of care are lurking. Any statistical modeling would require assumptions and compromises. Hence, the validity of interpreting observational data will depend on the degree to which diagnosis, clinical course, interventions, coincident diseases, personal characteristics or outcomes are assumed and not quantified. No matter how compulsively this is done, CER demands judgments about the importance of each of these variables. Therefore, CER cannot be the engine of health care decision making.

As an example, total knee replacement (TKR) has at present escaped efficacy testing. How would we learn from observational research if TKR works? Some of the relevant variables to assess efficacy can be parsed from observational data such as patient demographics, type of hardware, co-morbidities, and the like. However, some variables are very difficult to parse in the best of circumstances – such as a definition of benefit; or surgical experience; or, more elusive, surgical skillfulness.

 Efficacy research is the horse; CER is the cart.

There are two alternative ways forward other than the present plans for CER. First, we could design efficacy trials that are efficient in providing gold standards across a wider range of patient characteristics. We would have to expand trials to larger populations. For the sake of validity, we would have to measure only a single clinically meaningful outcome even if that took a great deal of time. And we’d have to forswear all shortcuts that trade reliability off against efficiency (such as “tack on” questions for “post-hoc” analysis).

There is a second approach that is more straightforward. We can design elegant RCTs seeking a large enough clinically meaningful outcome on highly selected patient populations. If none is detected, we can either abandon the intervention or choose another highly selected population to study. If a clinically meaningful difference is detected, the result can serve as the anchoring comparator for CER.

However, to design such a straightforward RCT, we must also deal with the philosophical challenge in the design of efficacy trials; the challenge that relates to the notion of “clinically significant.” How high should we set the bar for the absolute difference in outcome between the treated and control groups to consider the results of the trial compelling?

One way to think about this is to convert the absolute difference into a more intuitive measure, the Number Needed to Treat (NNT). If the outcome is easily measured, such as death or stroke, for example, we might find an intervention valuable if we had to treat 20 patients to spare 1. Few students of efficacy would be persuaded if we had to treat more than 50 to spare 1. Between 20 and 50 delineates debate; smaller effects are ephemeral and subject to false positive assertions. For an outcome that is more difficult to measure, such as symptoms or quality of life, we would argue for a more stringent bar. If we framed the problem of RCT design like this, we may be able to engage a national debate on just how high the bar should be set for each clinical malady.
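To make the arithmetic concrete: the NNT is simply the reciprocal of the absolute difference in event rates between the control and treated groups, so treating 20 to spare 1 corresponds to an absolute difference of 5 percentage points, and treating 50 to spare 1 corresponds to 2 points. The short Python sketch below is only an illustration of that conversion. The function names, the category labels, and the example event rates are hypothetical, and the cutoffs of 20 and 50 are the ones we propose above for easily measured outcomes, not an established standard.

```python
def number_needed_to_treat(control_event_rate, treated_event_rate):
    """Return the NNT: 1 / absolute risk reduction.

    Both rates are proportions between 0 and 1 for the bad outcome
    (e.g. death or stroke) over the same follow-up period.
    """
    absolute_risk_reduction = control_event_rate - treated_event_rate
    if absolute_risk_reduction <= 0:
        # No absolute benefit was observed, so there is no NNT to report.
        raise ValueError("treatment shows no absolute benefit; NNT is undefined")
    return 1.0 / absolute_risk_reduction


def classify_nnt(nnt, persuasive_max=20, debatable_max=50):
    """Label an NNT using the illustrative bar for hard outcomes.

    The labels are hypothetical shorthand: up to 20 is taken as valuable,
    20 to 50 as open to debate, and above 50 as unpersuasive.
    """
    if nnt <= persuasive_max:
        return "persuasive"
    if nnt <= debatable_max:
        return "debatable"
    return "unpersuasive"


if __name__ == "__main__":
    # Made-up event rates, chosen only to show each category.
    examples = [(0.30, 0.20), (0.10, 0.07), (0.10, 0.09)]
    for control, treated in examples:
        nnt = number_needed_to_treat(control, treated)
        print(f"control {control:.0%}, treated {treated:.0%}: "
              f"NNT about {nnt:.0f} ({classify_nnt(nnt)})")
```

For a softer outcome such as symptoms or quality of life, the same function could be called with a smaller `persuasive_max` to reflect the more stringent bar argued for above; what that number should be is exactly the debate we would like to see.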

If we, then, applied this stringency to future RCT design, trials would be more efficient and reliable and would eliminate trials aimed to test equivalency. Then, armed with clinically meaningful RCT results in some subset of patients, we are in the position to turn to CER. CER will help us seek out other subsets of patients benefited at least as much and to identify subsets harmed. We feel that it would not be in the best interest of our public and personal health to prematurely seek answers in flawed datasets at the expense of forgoing best evidence in better RCT designs.  

CONFLICT OF INTEREST DISCLOSURES:  The authors have none to report.


8 thoughts on “The Illusory Side of ‘Comparative Effectiveness Research’”

  1. Some wording choice questions:
    “The Illusory Side of “Comparative Effectiveness Research”
    Do I take from that that you mean prospective CER utility is not ENTIRELY “illusory”?
    “The difference between efficacy and effectiveness”
    Maybe I’m missing something, but I don’t see any direct definitions of either term proffered. You go on to speak about “efficacy” broadly, but then never talk about the salient difference between it and “effectiveness.” Perhaps there’re some clinical “terms of art” nuances that escape us mere lay people who simply go to the dictionary when in search of guidance for precise lexical/semantic meaning bearing on some issue.
    Adversaries in court proceedings have to “stipulate” to explicitly shared denotative meanings of critical terms.
    Definition of EFFICACY
    :the power to produce an effect
    Synonyms: edge, effectiveness, effectualness, efficaciousness, efficacity, efficiency, productiveness
    Definition of EFFECTIVE
    a:producing a decided, decisive, or desired effect
    Synonyms: effectual, efficacious, efficient, fruitful, operative, potent, productive
    “CER might discern differences in relative ineffectiveness.”
    OK, “ineffective”? Application of some treatment that produces no empirically defensible outcome differential relative to some baseline/control?
    While I get your overall point (and I have my own reservations about CER), I worry about a little potential for “straw man” here: “CER cannot succeed as the primary mechanism to assure the provision of rational health care.”
    Is THAT what is being proposed? “primary mechanism”? Does PCAST say that? Versus “one more tool”? (I’d have to go back and look)
    Just asking.
    None of the foregoing is to underestimate the daunting difficulties that entail clinical outcomes analysis, be they from the trials or from the aggregated in-the-trenches practice data.

  2. We apologize. It was presumptuous of us to assume that all knew the difference between efficacy and effectiveness. This is epidemiology-speak. Efficacy relates to demonstrable benefit in a randomized controlled trial. Effectiveness relates to demonstrable benefit in more general use, beyond the constraints of the RCT, particularly the selection criteria for subjects. RCTs usually have many such specifications including gender, age, coincident disease, etc. General practice may have no such constraints.

  3. An important and interesting article. Thanks! The health care system can’t afford incremental expenses for procedures which do not contribute to improved outcomes. Most protocols are limited in their effectiveness because they are based on the results of clinical trials conducted on a general population, yet no two patients are alike. The efficacy of a drug is to produce a desired effect, but apparently this has nothing to do with survival. The goal should be to identify through real-time comparative effectiveness research which regimens work best for which patient groups. The types of research that would be most effective in providing the needed evidence would be synthesis of existing evidence (e.g. qualitative review, meta-analysis), primary research using existing health care databases, primary research using prospective data collection without randomization (e.g. observational study, registry), and primary research through a prospective randomized trial.

  4. In antibiotics, many RCTs are not placebo controlled, but compare against an existing therapy. The new drug isn’t required to prove superiority, just “non-inferiority” by a prescribed margin (i.e., not more than 20% less effective). In some cases, the comparator itself was approved through a non-inferiority trial years ago, or has degraded over time through resistance. See this GAO Report:

  5. Bobby G, Greg
    Bobby G–
    First, these are all very good questions.
    On efficacy and effectiveness, as Nortin says in his reply, this is epidemiology-speak. It’s an important distinction within the world of medicine, but in regular English, the two words are, as the dictionary indicates, virtually synonymous.
    I should have inserted something in the post pointing readers to what Hadler has written in the past about how he uses the words.
    In another essay, he explained that: “you can’t compare treatments for effectiveness unless you are quite certain that one of the comparators is truly efficacious. There must be a group of patients for whom one treatment has unequivocal and important efficacy.”
    In other words, efficacious means “more than” effective. (Usually the “-ious” at the end of a word suggests “more” or “full of”: ambitious, mellifluous, etc.)
    And yes, he doesn’t think that CER is entirely illusory. Hadler believes that CER is potentially very valuable, but should be seen in context.
    Finally, I agree with you. I don’t think that intelligent reformers suggest that CER is a silver bullet.
    But it is one of the very valuable tools that we can use to move toward patient-centered medicine.
    Here I follow Don Berwick in defining patient-centered medicine as trying to give each patient “the right treatment at the right time–no less than she needs, and no more than she wants.”
    Good to hear from you.
    You write: “The goal should be to identify through real-time comparative effectiveness research which regimens work best for which patient groups. The types of research that would be most effective in providing the needed evidence would be a synthesis . . .”
    I agree wholeheartedly.
    Comparative effectiveness research must look at groups of patients who fit a particular medical profile and ask which of these groups will benefit.

  6. Having seen firsthand some authors who flat out lie in their academic publications, I’m not sure that there is enough quality control in the literature. I have also seen amazing honesty, especially from authors of more advanced age.
    There are a number of other issues for sure as outlined in this letter to the editor in the WSJ:
    Those that place faith in this system need to be more wary.

  7. Dr. McNutt and I thank those who have commented. Honing the notions of efficacy and effectiveness is an important challenge. Informed medical decision making, rather than shared decision making, means that a patient must be able to tell if the potential value to gain for taking something rather than something else is worth the potential value to lose. All medical decisions are trade-offs, but only, in clarification, if there is some semblance of an idea that there is a sizable enough benefit and that the size of that benefit is worth the sizable harm. This requires that the absolute, marginal difference for some good outcome and some other bad outcome be real and important. If researchers are to be of value to medical care, they must be able to tell us these things.
    However, too many studies focus on the population; the average; and not me and you. We are not averages. While the marginal difference between treatment A and B may be, for example, 5%, that does not mean that everyone who takes the treatment gets the chance of that 5%. Some will be 0%; others may be 15%. Informing a patient with the average margin is not informing at all if the patient is at 0%.
    Clinical trials, as presently conducted, are not doing what they could to inform individuals: the patient characteristics that lead to variable marginal responses are excluded or under-included under present standards of research, so trials contain too little variation. This must change if we want to be serious about informing individuals and then letting the chips fall on medical resource use. The only ways, we think, to get to where we need to go are better and bigger trials with more variation, not less, or, even better, trials that look for meaningful differences (at least 5% marginal difference). We like the latter as patients will likely be able to “feel” that sort of difference. It is hard to imagine patients coming to grips with differences less than that.

  8. Nortin–
    Thank you very much for your comment.
    Yes, patients need to understand that the “average” benefit may or may not apply to them.
    This is why comparative effectiveness research must look at the benefits for patients that fit a particular medical profile, which means, as you say, that trials need to cover a much larger, more varied group of patients, and most importantly, trials should emphasize meaningful differences (which you define as 5% or more).
    Too many patients are given false hope based on trial results that just aren’t meaningful for them.