clinical trials

A few months ago, a controversial scientific paper came out. It happens every once in a while – a paper that questions or refutes something thought to have been relatively well established or that shines a light on something we’ve been doing wrong. I’ve written about this kind of stuff before (see here), but this one is a bit different. It calls into question something we’ve taken for granted for the past decade or so – information that forms the basis of treatment decisions affecting patients on a daily basis. More importantly, the story around the paper teaches a few important lessons about how we do research.

One of the treatments for an acute ischemic stroke is intravenous thrombolysis – a medication that helps break up the blood clot that’s blocking an artery and causing the stroke. For about a decade since the medication was approved for use, it was only used in patients who could be treated within three hours of their stroke symptoms starting. Early studies showed that, when patients are treated after three hours, the risks of the treatment (mostly bleeding, sometimes in the brain) outweigh the benefit.

A study conducted in 2008 changed all that. It showed that thrombolysis is effective and relatively safe up to 4.5 hours after stroke symptoms start. So the guidelines changed – at least in Europe. In the US, the FDA decided not to extend the treatment’s indication for reasons that are not entirely clear – still, even there, the study’s results led to more use of thrombolysis in this “extended” time window (as an “off-label” treatment).

The 2008 study drew some criticism early on, particularly because the two groups (those treated with thrombolysis and those given placebo) weren’t well matched – in this context, that just meant that patients in the thrombolysis group had, on average, less severe strokes and more patients in the placebo group had strokes prior to the one they were included in the study for. So the argument went: it’s possible that the thrombolysis group did better not because they received the medication, but because they were already less affected by their current stroke and fewer of them had old strokes.

So the authors of this new paper (the one that came out a few days ago – let’s call it the “2020 study”) got ahold of the data from the 2008 study and decided to reanalyze it, taking into account these “baseline differences” that indicate the groups are not “well-matched”. This is a summary of what they found:

Many of the results of the 2008 study could only be reproduced under a set of conditions that were not pre-specified by the investigators of the 2008 study. These included excluding some patients and turning some variables from continuous to categorical (all potentially justifiable things to do). Note that this is separate from the issue of “baseline differences” that were not adjusted for in the 2008 study – this is an attempt to reproduce the exact results of the 2008 study using the data from the 2008 study and the way the authors of the 2008 study reported that the data were analyzed.
After adjusting for the effect of the “baseline differences” that were not adjusted for in the 2008 study, the groups were no longer statistically significantly different in terms of any of the outcomes that the 2020 study authors looked at. The one exception was that the thrombolysis group had more brain bleeds than the placebo group.

This whole debacle brings several issues with the way we do science, particularly science that is used to influence how we treat patients, to the forefront:

First of all, relying on the results of a single study – no matter how large or seemingly robust – to change clinical practice is a bad idea (the authors of the 2020 study mention this as well). Every study has unique factors that threaten either its external or internal validity (sometimes both) and therefore limit the extent to which it can be relied upon to represent some kind of “truth”. This is a really hard pill for most clinicians to swallow. Some of them because they invest years in designing and conducting trials, many of them honestly doing their very best to come up with robust and reliable evidence. And I’m not saying those efforts are in vain – clearly studies exist on a spectrum of quality, and the decisions that investigators make can greatly influence that quality. But still, no matter how hard we try, there will never be such a thing as a perfect single study – with results that hold true under all circumstances (I don’t mean all conceivable circumstances – even under a particular limited set of circumstances). Even clinicians who aren’t involved in conducting trials find it hard to believe that there should be no such thing as a single “practice-changing” study – mostly because they are eager to help their patients (if you’ve ever been to a big clinical conference, note the standing ovations and crowd’s elation when “positive” clinical trial results are presented). Add to that the expectations from regulatory authorities (sometimes inadequate) as well as issues of equipoise and economics, and you start to understand why we, as a community, believe that as long as it’s a (relatively) well-designed randomized clinical trial, its results are good enough to change our practice.
Related to the first point: knowledge (defined in this case as the information we get from seemingly well-designed and robustly conducted studies) is fragile. Slightly changing the way a variable is defined (continuous vs categorical, for example) or removing a few subjects with missing data can swing your results one way or the other. This is a well-known issue and is related to not defining analysis strategies before data collection, the garden of forking paths, etc. But it seems that many clinicians and clinical researchers are either not aware of it or underestimate just how big a deal it is. I’ve had countless conversations with peers who believe that if you’ve got some “good data” (collected appropriately from a well-designed study, without any funny business of any kind), how the data is analyzed shouldn’t influence the results in a major way. The data are the data – they more or less should speak for themselves – as long as I didn’t tamper with the data and I used the “correct” tests, why would my analysis approach mislead me? It can, and very often does.
In the world of clinical trials, statistics are still commonly misunderstood and misused. The 2020 authors themselves make a very prevalent mistake – confusing a lack of a statistically significant difference between two groups as evidence that the groups are equivalent (or in their words, “matched”). For more information on this, see here and here. This isn’t just a statistical technicality – in the 2020 study, the only variables that were “adjusted” for in the analysis were the ones that were statistically significantly different between the groups, so many others were potentially missed. In fact, testing for “baseline differences”, regardless of how, is very much a contested practice (see here, here, and here), but clinical trials are full of it. That’s surprising, because there are often biostatisticians on the investigator panels of such trials and biostatisticians presumably review at least some of the published trial protocols and reports.

I’m not sure if the 2020 study will directly change stroke management – the authors are careful with the interpretation of their findings (rightfully so, in my opinion), saying that their study “reduce[s] [the] certainty” in the conclusions of the 2008 study. But I do hope we do learn some things from this – clinicians really need to rethink how they view single clinical trials, take matters like analytical flexibility more seriously, and avoid common statistical misconceptions.

IMG_20180516_102631

I just got back from one of the world’s largest stroke meetings, the European Stroke Organisation Conference (ESOC), held this year in Gothenburg, Sweden. The overwhelming focus of the conference is on groundbreaking large clinical trials, reports of which dominate the plenary-like sessions of the schedule. One thing I’ve noticed about talks on clinical trials is how, every year, the speakers go to great lengths to emphasize some (positive) ethical, methodological, or statistical aspect of their study.

This is the result of something that I like to call the RRRR cyle (pronounced “ARRARRARRARR” or “rrrr” or “quadruple R” or whichever way won’t scare/excite those in your immediate vicinity at that moment in time). It usually starts with loud opposition (reprimand) to some aspect of how clinical trials are run or reported. This usually comes from statisticians, ethicists, or more methodologically inclined researchers. Eventually, a small scandal ensues, and clinical researchers yield (usually after some resistance). They change their ways (repentance) and, in doing so, become fairly vocal about what they’re now doing better (representation)*.

Examples that I’ve experienced in my career as a stroke researcher thus far are:

Treating polytomous variables as such instead of binarizing them (the term “shift analysis” – in the context of outcome variables – is now an indispensable part of the clinical stroke researcher’s lexicon).
Pre-specifying hypotheses, especially when it comes to analyzing subgroups.
Declaring potential conflicts of interest.

Most of these practices are quite fundamental and may have been standard in other fields before making their way to the clinical trial world (delays might be caused by a lack of communication across fields). Still, it’s undoubtedly a good thing that we learn from our mistakes, change, and give ourselves a subtle pat on the back every time we put what we’ve learned to use.

The reason I bring it up is, maybe soon someone** could start making some noise about one of the following issues that come up way too often in my opinion:

(Mis-)interpreting p-values that are close to 0.05, and how this is affected by confirmation bias.

In the SAME talk:

A stat non-significant result that goes AGAINST the study's expectations = we can't really interpret this, underpowered

A stat non-significant result that CONFORMS to the study's expectations = interesting, more n would have given p<0.05 #ESOC2018 pic.twitter.com/wgjFQjITUm

— Ahmed Khalil (@AhmedAAKhalil) May 18, 2018

Testing if groups are “balanced” in terms of baseline/demographic variables in trials using “traditional” statistical methods instead of equivalence testing.

As the ESOC meeting keeps reminding me, a lot can be done in a year. So I’m pretty optimistic we can get some of these changes implemented by ESOC 2019 in Milan!

* If you think this particular acronym is unnecessary or a bit of a stretch, I fully agree. I also urge you to take a look at this paper for a list of truly ridiculous acronyms (all from clinical trials of course).

** I would, but I’m not really the type – I’d be glad to loudly bang the drums after someone gets the party started, though.

Ahmed A. Khalil

MD PhD

Clinical trials and the fragility of “knowledge”

Clinical trials and the RRRR cycle