Clinical trials and the fragility of “knowledge”

A few months ago, a controversial scientific paper came out. It happens every once in a while – a paper that questions or refutes something thought to have been relatively well established or that shines a light on something we’ve been doing wrong. I’ve written about this kind of stuff before (see here), but this one is a bit different. It calls into question something we’ve taken for granted for the past decade or so – information that forms the basis of treatment decisions affecting patients on a daily basis. More importantly, the story around the paper teaches a few important lessons about how we do research.

One of the treatments for an acute ischemic stroke is intravenous thrombolysis – a medication that helps break up the blood clot that’s blocking an artery and causing the stroke. For about a decade since the medication was approved for use, it was only used in patients who could be treated within three hours of their stroke symptoms starting. Early studies showed that, when patients are treated after three hours, the risks of the treatment (mostly bleeding, sometimes in the brain) outweigh the benefit.

A study conducted in 2008 changed all that. It showed that thrombolysis is effective and relatively safe up to 4.5 hours after stroke symptoms start. So the guidelines changed – at least in Europe. In the US, the FDA decided not to extend the treatment’s indication for reasons that are not entirely clear – still, even there, the study’s results led to more use of thrombolysis in this “extended” time window (as an “off-label” treatment).

The 2008 study drew some criticism early on, particularly because the two groups (those treated with thrombolysis and those given placebo) weren’t well matched – in this context, that just meant that patients in the thrombolysis group had, on average, less severe strokes, and more patients in the placebo group had already had a stroke before the one that qualified them for the study. So the argument went: it’s possible that the thrombolysis group did better not because they received the medication, but because they were already less affected by their current stroke and fewer of them had old strokes.

So the authors of this new paper (the one that came out a few days ago – let’s call it the “2020 study”) got ahold of the data from the 2008 study and decided to reanalyze it, taking into account these “baseline differences” that indicate the groups are not “well-matched”. This is a summary of what they found:

Many of the results of the 2008 study could only be reproduced under a set of conditions that were not pre-specified by the investigators of the 2008 study. These included excluding some patients and turning some variables from continuous to categorical (all potentially justifiable things to do). Note that this is separate from the issue of “baseline differences” that were not adjusted for in the 2008 study – this is an attempt to reproduce the exact results of the 2008 study using the data from the 2008 study and the way the authors of the 2008 study reported that the data were analyzed.
After adjusting for the effect of the “baseline differences” that were not adjusted for in the 2008 study, the groups were no longer statistically significantly different in terms of any of the outcomes that the 2020 study authors looked at. The one exception was that the thrombolysis group had more brain bleeds than the placebo group.
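
To make the second finding concrete, here is a minimal sketch in Python of how a baseline imbalance alone can produce an apparent benefit that disappears after adjustment. Everything in it is an assumption made for illustration: the data are simulated, the treatment is given no true effect at all, the imbalance in stroke severity is imposed by hand, and the adjustment is a simple analysis of covariance. It is not the model used in the 2020 study.

```python
import random
from statistics import mean


def simulate_trial(n_per_arm=2000, seed=1):
    """Simulate (treated, severity, outcome) rows with NO true treatment effect.

    Assumption: the treated arm starts with milder strokes (mean severity 9
    versus 11), and outcome depends only on severity plus noise.
    """
    rng = random.Random(seed)
    rows = []
    for treated, mean_severity in ((1, 9.0), (0, 11.0)):
        for _ in range(n_per_arm):
            severity = rng.gauss(mean_severity, 4.0)
            outcome = 50.0 - 2.0 * severity + rng.gauss(0.0, 5.0)
            rows.append((treated, severity, outcome))
    return rows


def unadjusted_difference(rows):
    """Difference in mean outcome, treated minus placebo, ignoring severity."""
    treated = [y for t, _, y in rows if t == 1]
    placebo = [y for t, _, y in rows if t == 0]
    return mean(treated) - mean(placebo)


def adjusted_difference(rows):
    """Treatment effect after adjusting for severity (analysis of covariance).

    Uses the pooled within-group slope of outcome on severity, which gives the
    same estimate as the treatment coefficient in outcome ~ treated + severity.
    """
    sxy = sxx = 0.0
    means = {}
    for arm in (0, 1):
        pairs = [(x, y) for t, x, y in rows if t == arm]
        mx = mean(x for x, _ in pairs)
        my = mean(y for _, y in pairs)
        means[arm] = (mx, my)
        sxy += sum((x - mx) * (y - my) for x, y in pairs)
        sxx += sum((x - mx) ** 2 for x, _ in pairs)
    slope = sxy / sxx
    (mx0, my0), (mx1, my1) = means[0], means[1]
    return (my1 - my0) - slope * (mx1 - mx0)


rows = simulate_trial()
print(f"unadjusted difference: {unadjusted_difference(rows):.2f}")
print(f"adjusted difference:   {adjusted_difference(rows):.2f}")
```

Under these assumptions the unadjusted comparison should show the treated arm doing clearly better, while the adjusted estimate should sit close to zero, which is the true effect built into the simulation.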

This whole debacle brings several issues with the way we do science, particularly science that is used to influence how we treat patients, to the forefront:

First of all, relying on the results of a single study – no matter how large or seemingly robust – to change clinical practice is a bad idea (the authors of the 2020 study mention this as well). Every study has unique factors that threaten either its external or internal validity (sometimes both) and therefore limit the extent to which it can be relied upon to represent some kind of “truth”. This is a really hard pill for most clinicians to swallow. For some, that is because they invest years in designing and conducting trials, many of them honestly doing their very best to come up with robust and reliable evidence. And I’m not saying those efforts are in vain – clearly studies exist on a spectrum of quality, and the decisions that investigators make can greatly influence that quality. But still, no matter how hard we try, there will never be such a thing as a perfect single study – with results that hold true under all circumstances (and I don’t mean all conceivable circumstances; I mean even under a particular, limited set of circumstances). Even clinicians who aren’t involved in conducting trials find it hard to believe that there should be no such thing as a single “practice-changing” study – mostly because they are eager to help their patients (if you’ve ever been to a big clinical conference, you’ll have noticed the standing ovations and the crowd’s elation when “positive” clinical trial results are presented). Add to that the expectations from regulatory authorities (sometimes inadequate) as well as issues of equipoise and economics, and you start to understand why we, as a community, believe that as long as it’s a (relatively) well-designed randomized clinical trial, its results are good enough to change our practice.
Related to the first point: knowledge (defined in this case as the information we get from seemingly well-designed and robustly conducted studies) is fragile. Slightly changing the way a variable is defined (continuous vs categorical, for example) or removing a few subjects with missing data can swing your results one way or the other. This is a well-known issue and is related to not defining analysis strategies before data collection, the garden of forking paths, etc. But it seems that many clinicians and clinical researchers are either not aware of it or underestimate just how big a deal it is. I’ve had countless conversations with peers who believe that if you’ve got some “good data” (collected appropriately from a well-designed study, without any funny business of any kind), how the data is analyzed shouldn’t influence the results in a major way. The data are the data – they more or less should speak for themselves – as long as I didn’t tamper with the data and I used the “correct” tests, why would my analysis approach mislead me? It can, and very often does.
In the world of clinical trials, statistics are still commonly misunderstood and misused. The 2020 authors themselves make a very common error – treating the lack of a statistically significant difference between two groups as evidence that the groups are equivalent (or in their words, “matched”). For more information on this, see here and here. This isn’t just a statistical technicality – in the 2020 study, the only variables that were “adjusted” for in the analysis were the ones that were statistically significantly different between the groups, so many others were potentially missed. In fact, testing for “baseline differences”, regardless of how, is very much a contested practice (see here, here, and here), but clinical trials are full of it. That’s surprising, because there are often biostatisticians on the investigator panels of such trials and biostatisticians presumably review at least some of the published trial protocols and reports.
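
The fragility described in the second point above is easy to reproduce with a toy example. The sketch below uses made-up counts on a 0 to 6 ordinal disability scale (assumed here to behave like the modified Rankin Scale, which the text above does not name) and shows how the apparent direction of a treatment effect can flip depending on where the scale is cut into “good” and “bad” outcomes. None of these numbers come from the 2008 or 2020 studies.

```python
# Hypothetical number of patients at each score, 0 (best) to 6 (worst).
treated = [10, 20, 10, 20, 20, 10, 10]
placebo = [10, 15, 20, 15, 20, 10, 10]


def good_outcome_rate(counts, cutoff):
    """Share of patients whose score is at or below the chosen cutoff."""
    return sum(counts[: cutoff + 1]) / sum(counts)


def risk_difference(cutoff):
    """Good-outcome rate in the treated arm minus the placebo arm."""
    return good_outcome_rate(treated, cutoff) - good_outcome_rate(placebo, cutoff)


for cutoff in (1, 2):
    print(f"good outcome = score 0-{cutoff}: risk difference {risk_difference(cutoff):+.2f}")
```

With these invented counts, defining a good outcome as a score of 0 to 1 favours treatment by five percentage points, while defining it as 0 to 2 favours placebo by the same amount. The data did not change between the two runs, only an analysis decision did.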

I’m not sure if the 2020 study will directly change stroke management – the authors are careful with the interpretation of their findings (rightfully so, in my opinion), saying that their study “reduce[s] [the] certainty” in the conclusions of the 2008 study. But I do hope we learn some things from this – clinicians really need to rethink how they view single clinical trials, take matters like analytical flexibility more seriously, and avoid common statistical misconceptions.

Publishers: we may do stupid things sometimes, but give us scientists some credit

Yesterday evening, I attended a Meetup organized by Impact of Science at the Alexander von Humboldt Institut for Internet and Society in Berlin. The topic was the highly publicized Projekt DEAL negotiations with Elsevier et al. and the future of Open Access research in Germany and beyond.

The speakers were Dr. Nick Fowler, Chief Academic Officer and Managing Director of Research Networks at Elsevier, and Prof. Dr. Gerard Meijer, Director of the Fritz Haber Institute of the Max Planck Society and a member of the German science council.

Each speaker gave their perspective on the past, present, and future of the negotiations. To me, it was refreshing hearing the “other side’s” take on the matter. Dr. Fowler highlighted some important issues that made the negotiations more complicated and difficult. Before that, I admit I had mostly been exposed to the viewpoint of the proponents of Projekt DEAL. So I started the evening thinking “OK, this is good. I’m learning some new stuff and some of these counter-arguments make sense.”

That feeling didn’t last long. After Prof. Meijer’s talk, we went into the discussion round. I asked Dr. Fowler to comment on what kind of value the big publishers add to an article. The issue had been brought up briefly by the speakers before this, and the consensus from both sides (including, it seemed, at least the general position of Projekt DEAL albeit not that of Prof. Meijer, as he explained later), was that publishers do add value to articles. I wanted to know what kind of value was added, how this is quantified, and how it relates to the price that researchers and institutes pay publishers for the processing of gold open access articles or subscribing to paywalled ones.

Do publishers improve the articles that are submitted to their journals?

Dr. Fowler explained how they have evidence that articles published by Elsevier are of higher quality than the rest of the published literature. He used that as proof that their publishing system improves articles, and casually swept aside the idea of added value – the difference between the quality of submitted manuscripts and their published counterparts.

Hold on, you might say. Did he just try to pull a fast one on a room full of scientists? I mean, scientists may have stuck with a flawed and exploitative publishing system for decades, but we know a potential confounding factor (and an inappropriate surrogate outcome) when we see one. Elsevier is the largest, most well-known scientific publisher on the planet. Of course they publish the highest-quality research, because they probably get the highest quality submissions.

Whether or not value is added during the submission and review process is not a minor detail. It’s at the core of the problem, and therein lies the solution. If the system that publishers provide does add value, how much of that is attributable to peer review, something that researchers do for free (a point emphasized during the discussion by Prof. Meijer)?
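
The distinction between published quality and added value can be spelled out in a few lines of Python. The scores below are entirely hypothetical, on an arbitrary 0 to 10 scale, and the publisher names are placeholders. The sketch only illustrates why the two measures need not agree.

```python
# Hypothetical quality scores; none of these numbers describe a real publisher.
publishers = {
    "big_publisher": {"submitted": 8.0, "published": 8.0},
    "small_publisher": {"submitted": 5.0, "published": 6.0},
}


def added_value(scores):
    """Added value = quality of the published article minus the submitted manuscript."""
    return scores["published"] - scores["submitted"]


def best_by(metric):
    """Name of the publisher that scores highest on the given metric."""
    return max(publishers, key=lambda name: metric(publishers[name]))


highest_published = best_by(lambda scores: scores["published"])
highest_added_value = best_by(added_value)
print(f"highest published quality: {highest_published}")
print(f"highest added value:       {highest_added_value}")
```

In this made-up case the publisher with the best published articles adds nothing to them, because it simply receives better submissions. Comparing published output alone cannot separate the two explanations.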

I’m not saying publishers don’t add value, but as a scientist, I want to see some evidence. Mind you, I wasn’t asking Dr. Fowler for a randomized controlled trial of their article processing system versus some generic processing-type intervention. I probably would have been satisfied with some As-Seen-On-TV-style before-and-after photos. Instead, the audience got a response that proves nothing.

Why most of “clinical” imaging research is methodological – and why that’s OK

When people ask me what kind of research I did during my PhD (and indeed what kind I do now), I tell them I did MRI methods research. But what I do is very different to the image that comes up in people’s minds when I tell them this. I don’t build radiofrequency coils for MRI scanners, nor do I write MR sequences or even develop new analysis methods. I spent the majority of my PhD making small, incremental changes to the way MRI data is acquired and analyzed, and then testing how these changes affect how we measure certain things in the brain.

This type of research exists within what I consider a spectrum of phases of clinical research (NB: this has nothing to do with the phases of clinical trials of interventions – it’s only “clinical” in the sense that the research is being done on humans):

1. New methods are developed

2. They are validated by testing how well they perform in certain settings and improvements are made accordingly (followed by more validation).

3. Then, when they’re good and ready (this can take years), they’re used to answer clinical or biological questions.

People often neglect the second phase – the validation, improvement, and re-validation. It’s sometimes completely overlooked, but arguably the bigger problem is that it’s often conflated with the latter stage – the testing of clinical or biological hypotheses. The line between these phases is often blurred and when, as a researcher, you try to emphasize the separation of the two, it’s considered pedantic and dull.

Several types of scenarios exist – for example, you can have a method that measures a phenomenon X in a different way to an established method or you can have an operationalized measurement of phenomenon X (i.e. an indirect measurement, almost like a surrogate marker). The key question has to be: am I measuring what I think I’m measuring? This can be done by directly comparing the performance of the new method to a more established method, or by testing to see if that method gives you the results you would expect in a biological or clinical situation that has been previously well studied and described.

For the record, I think the second option, although indirect, is completely valid – taking a method that’s under development and testing if it reflects a well-established biological phenomenon (if that’s what it’s meant to reflect) – that still counts as validation (I’ve done this myself on several occasions – e.g. here, here, and here). But the key thing is that it has to be well-established. Expecting a method you’re still trying to comprehensively understand to tell you something completely – or even mostly – new makes no sense.

Unfortunately, that’s often what’s expected of this kind of research. It’s expected from the researchers doing the work themselves, from their colleagues and/or supervisors, from manuscript reviewers, as well as from funding agencies. The reason is simple (albeit deeply misguided), and it confronts researchers working on improving and validating methods in clinical research very often: people want you to tell them something new about biology or pathophysiology. They very often don’t want to hear that you’re trying to reproduce something established, even if it is with a new method that might be better for very practical reasons (applicability, interpretability, etc.). This has presented itself to me over the years in many ways – reviewers bemoaning that my studies “provide no new biological insights” or well-meaning colleagues discouraging me from writing my PhD dissertation in a way that makes it sound “purely methodological” (“you need to tell the reader something new, something previously unknown”).

The irony is that, in the years I’ve spent doing (and reading) imaging research, I’ve become fairly convinced that the majority of clinical imaging studies should fall into the second category mentioned above. However, it’s often mixed up with, and presented as though it belongs to, the third category. Researchers use new method Y to test a “novel” hypothesis, and interpret the results assuming (without proper evidence) that method Y is indeed measuring what it’s supposed to be. I notice this when I read papers – the introduction talks about the study as if its aim is to test and validate method Y, and the discussion ends up focusing on all the wonderful new insights we’ve learned from the study.

To be clear, I’m in no way saying that the ultimate goal of a new method shouldn’t be to be taken to studies of the third category. Validate, improve, validate, then apply with the hope of learning something new – that should clearly be the goal of any new method meant for clinical use. But we shouldn’t expect both to be done simultaneously. Instead, we need to acknowledge the clear separation between the types of clinical research and their respective goals, and to recognize that not all research is new and exciting in terms of what it tells us about biology or pathophysiology.

Science Pokerface

There’s no such thing as a perfect research study. Every study has its strengths and weaknesses, and as scientists-in-training, we learn that discussing the limitations of our work is an essential part of presenting our work to the scientific community. The reality, however, is that this often is seen negatively. It’s an unfortunate paradox that can make it very difficult, especially for early career researchers, to reconcile what they’ve learned is the right thing to do with what gets their work published and out there for people to read.

I recently submitted, with my graduate student, my first manuscript as a senior author to a scientific journal. It’s a simple study, with an important (but not “fatal”) limitation. As I often do in my own papers, I encouraged my student (the first author) to openly embrace this limitation in the discussion section. To explain how, despite this limitation, the study is still valuable.

I’ve had all kinds of misguided reactions from co-authors when I do this: let the reviewers point it out; don’t actively make your own study weaker; too much “honesty” can be a bad thing.

I find it very hard to be understanding when I receive such comments. If I figured it out, most likely a reviewer, or worse yet, a reader of the future paper, will too. My main issue with this approach though is that it serves no constructive purpose.

Science isn’t a poker game – I should never feel like I have to bluff or hide something from anyone. I strongly believe that, if we can all agree that no one study is perfect, hiding a study’s imperfections in the hope that someone won’t notice is sneaky and counterproductive.

So what do you do when you receive comments from reviewers like the ones we did? One reviewer basically said “Well, the authors basically point out the flaws in their study and so I don’t see any point in it being published”. He or she offered no other constructive comments. Apparently they didn’t feel it was even worth properly reviewing for this same reason.

This says a lot about our strategy for disseminating scientific knowledge – try to get your paper out there, and if you have to hide something to do that, then so be it. That’s in and of itself problematic, but for early career researchers, this can be a huge source of confusion. You’re telling me that limitations belong in my discussion section, but I shouldn’t be too critical (whatever that means) of my own work? Where do we draw the line? What do I tell my students? My only explanation for them so far is that, regardless of how much experience some people have, they just don’t understand what science is about.

Clinical trials and the RRRR cycle

I just got back from one of the world’s largest stroke meetings, the European Stroke Organisation Conference (ESOC), held this year in Gothenburg, Sweden. The overwhelming focus of the conference is on groundbreaking large clinical trials, reports of which dominate the plenary-like sessions of the schedule. One thing I’ve noticed about talks on clinical trials is how, every year, the speakers go to great lengths to emphasize some (positive) ethical, methodological, or statistical aspect of their study.

This is the result of something that I like to call the RRRR cycle (pronounced “ARRARRARRARR” or “rrrr” or “quadruple R” or whichever way won’t scare/excite those in your immediate vicinity at that moment in time). It usually starts with loud opposition (reprimand) to some aspect of how clinical trials are run or reported. This usually comes from statisticians, ethicists, or more methodologically inclined researchers. Eventually, a small scandal ensues, and clinical researchers yield (usually after some resistance). They change their ways (repentance) and, in doing so, become fairly vocal about what they’re now doing better (representation)*.

Examples that I’ve experienced in my career as a stroke researcher thus far are:

Treating polytomous variables as such instead of binarizing them (the term “shift analysis” – in the context of outcome variables – is now an indispensable part of the clinical stroke researcher’s lexicon).
Pre-specifying hypotheses, especially when it comes to analyzing subgroups.
Declaring potential conflicts of interest.

Most of these practices are quite fundamental and may have been standard in other fields before making their way to the clinical trial world (delays might be caused by a lack of communication across fields). Still, it’s undoubtedly a good thing that we learn from our mistakes, change, and give ourselves a subtle pat on the back every time we put what we’ve learned to use.

The reason I bring it up is, maybe soon someone** could start making some noise about one of the following issues that come up way too often in my opinion:

(Mis-)interpreting p-values that are close to 0.05, and how this is affected by confirmation bias.

In the SAME talk:

A stat non-significant result that goes AGAINST the study's expectations = we can't really interpret this, underpowered

A stat non-significant result that CONFORMS to the study's expectations = interesting, more n would have given p<0.05 #ESOC2018 pic.twitter.com/wgjFQjITUm

— Ahmed Khalil (@AhmedAAKhalil) May 18, 2018

Testing if groups are “balanced” in terms of baseline/demographic variables in trials using “traditional” statistical methods instead of equivalence testing.
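
To illustrate the second issue, here is a rough sketch of the difference between a “traditional” test and an equivalence test (two one-sided tests, often called TOST). It works from summary statistics and uses a normal approximation to stay within Python’s standard library, where a t-based version would be more usual for small samples. The means, standard deviations, sample sizes, and the equivalence margin of 1.5 points are all invented for illustration.

```python
from math import sqrt
from statistics import NormalDist

STANDARD_NORMAL = NormalDist()


def standard_error(sd_a, sd_b, n_a, n_b):
    """Standard error of the difference between two independent group means."""
    return sqrt(sd_a ** 2 / n_a + sd_b ** 2 / n_b)


def difference_p_value(mean_a, mean_b, sd_a, sd_b, n_a, n_b):
    """Two-sided p-value for the null hypothesis that the group means are equal."""
    z = (mean_a - mean_b) / standard_error(sd_a, sd_b, n_a, n_b)
    return 2 * (1 - STANDARD_NORMAL.cdf(abs(z)))


def tost_p_value(mean_a, mean_b, sd_a, sd_b, n_a, n_b, margin):
    """Equivalence p-value: the larger of the two one-sided p-values.

    A small value supports the claim that the true difference lies
    within (-margin, +margin).
    """
    se = standard_error(sd_a, sd_b, n_a, n_b)
    diff = mean_a - mean_b
    p_lower = 1 - STANDARD_NORMAL.cdf((diff + margin) / se)  # H0: diff <= -margin
    p_upper = STANDARD_NORMAL.cdf((diff - margin) / se)      # H0: diff >= +margin
    return max(p_lower, p_upper)


# Hypothetical baseline severity in two small arms: means 10 and 11, SD 5, n = 50 each.
small_trial = dict(mean_a=10.0, mean_b=11.0, sd_a=5.0, sd_b=5.0, n_a=50, n_b=50)
print(f"difference test p = {difference_p_value(**small_trial):.3f}")
print(f"equivalence test p = {tost_p_value(**small_trial, margin=1.5):.3f}")
```

In this invented example neither p-value falls below 0.05. The groups are not shown to differ, but they are not shown to be equivalent either, so a non-significant difference test on its own says little about whether the arms are “balanced”.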

As the ESOC meeting keeps reminding me, a lot can be done in a year. So I’m pretty optimistic we can get some of these changes implemented by ESOC 2019 in Milan!

* If you think this particular acronym is unnecessary or a bit of a stretch, I fully agree. I also urge you to take a look at this paper for a list of truly ridiculous acronyms (all from clinical trials of course).

** I would, but I’m not really the type – I’d be glad to loudly bang the drums after someone gets the party started, though.

A Band-Aid on a Gunshot Wound: Redefining Statistical Significance

Two months ago, a preprint suggested that the scientific community do something huge. The 72 authors of the paper (behavioural economist Daniel Benjamin et al.) recommended changing the threshold for defining “statistical significance” from a p-value of 0.05 to 0.005, claiming that it would help alleviate the ongoing “replication crisis” plaguing psychological and biomedical research.

My PhD student friends and I had a good chuckle about it, lamenting half-jokingly about how, if Benjamin et al. got their way, we’d have a much harder time getting our degrees (publishing scientific papers, which is – unfortunately for us and for science – extremely difficult to do without “statistically significant” results, is a requirement for being awarded a PhD at the Charité).

But jokes aside, there are several reasons why it’s a bad idea. And it’s crucial that early career researchers, in particular, understand why.

A few days ago, we responded to Benjamin et al. In our paper, we point out that there is little empirical evidence that changing the statistical significance threshold will make studies more replicable. Their recommendation also distracts from the real issues and can have harmful consequences on how resources for research are allocated.

So what do we suggest instead? Just the uncomfortable truth – that there is no quick and easy fix. After countless hours brainstorming ideas, designing experiments, collecting data, and reading up on the latest research, scientists don’t have much left in the tank. Few of us actually think carefully about how we use statistics, and far fewer still do any more than relying on a strictly binary (significant vs non-significant) interpretation of the p-value. Those little asterisks on our plots and tables save us a lot of mental effort.

But this just doesn’t work – it never will, not with any blanket threshold. Properly interpreting our data will require us to do more – even after we think we’re finished and ready to write up the results. This makes our jobs harder, but there’s simply no avoiding it if we’re to practice good science. What’s worse is, we don’t yet know exactly what “doing more” entails – we have some ideas, but there’s a ton of work to be done figuring this out.

So what’s the harm in changing a threshold if the alternative is scratching our heads pondering a bunch of other things no one can even agree on?

Well, keep in mind that shifting the p-value threshold isn’t just about statistical rigour. It can have far-reaching consequences like influencing which studies get done (favouring “novel” instead of replication studies) and encouraging the wasteful use of animals in preclinical research.
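
A back-of-the-envelope calculation gives a feel for the resource implications. The sketch below uses the standard normal-approximation formula for the sample size of a two-group comparison of means. The inputs (80% power, a standardized effect size of 0.5, a two-sided test) are assumptions chosen for illustration and are not taken from Benjamin et al. or from our response.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(alpha, power=0.80, effect_size=0.5):
    """Approximate sample size per group for a two-sided, two-group comparison.

    Normal approximation: n = 2 * ((z_(1 - alpha/2) + z_power) / d) ** 2,
    where d is the standardized effect size.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return ceil(2 * ((z_alpha + z_power) / effect_size) ** 2)


for alpha in (0.05, 0.005):
    print(f"alpha = {alpha}: about {n_per_group(alpha)} per group")
```

Under these assumptions the stricter threshold needs about 107 subjects (or animals) per group instead of about 63, roughly 70% more, to keep the same power.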

Also, Benjamin et al.‘s suggestion takes us one step forward and two back. I’ve only been in science a few years and even I’ve noticed the progress that has been made in how we think about and use statistics. Few can argue with the fact that this has largely been due to a bottom-up effort, with more junior scientists driving the change.

Now here comes a paper penned by dozens of leading (and senior) scientists that shifts the focus back on p-values and effectively says: “Let’s just change this number and everything will be tip-top – go about with your science as you were.” Think of what that does to the graduate student trying to convince her supervisor that not everything that glitters (***) is gold.

There’s absolutely no doubt in my mind that Benjamin et al. do not intend their recommendation to be interpreted in the way I described above. But that’s irrelevant. The following has been quoted to the point it’s almost lost its meaning, but the inventor of the p-value himself never intended it to be used the way it has. That didn’t stop people from misusing it – and this might very well happen again if the new threshold lulls people into a false sense of security. We, the scientific community, should know better.

I know what you’re thinking – why not leave this whole discussion to the pros? But it can’t be entirely up to statisticians – after all, they’ve been trying, mostly unsuccessfully, to cure us of our obsession with p-values forever.

One thing’s for sure – by no means just take anyone’s word for it. Read up on both arguments, educate yourselves. I know most of us feel like we don’t know enough about statistics to even think about things like this. But they are an essential part of our work as scientists, so it’s our duty to inform ourselves and put in our two cents.

Sense and simplicity in science

I recently finished Atul Gawande’s book The Checklist Manifesto, which I highly recommend. It’s all about how very simple measures can have profound outcomes in fields as diverse as aviation, construction, and surgery.

What struck me the most about it wasn’t the author’s endorsement of using basic checklists to ensure things are done right in complex scenarios. Instead, it’s Dr Gawande’s insistence on testing the influence of everything, including a piece of paper with 4 or 5 reminders stuck to his operating theatre wall, that I found inspiring.

Why bother collecting evidence for something so apparently simple, so clearly useful, at all?

Talk of the town

Ischemic stroke, caused by the blockage of an artery in the brain by a blood clot, is as complex as anything in medicine. In fact, for such a common and debilitating illness, we have surprisingly few treatments at hand. Until recently, only two had been proven to help patients who suffered a stroke: giving them a drug that dissolves the clot and keeping them in a “stroke unit” where they receive specialised care that goes beyond what is offered in a general neurology ward.

But that all changed last year. The lectures and posters at the 2015 European Stroke Organisation conference in Glasgow, which I attended, were dominated by one thing. A new treatment for acute ischemic stroke had emerged – mechanical thrombectomy.

In the four months leading up to the conference, a number of large clinical trials had proven that this intervention worked wonderfully. Literally everyone at the conference was talking about it.

Isn’t that obvious?

Mechanical thrombectomy involves guiding a tube through a blood vessel (usually an artery in the groin) all the way up through the neck and into the brain, finding the blocked artery, and pulling out the clot. Just let that sink in for a moment. In the midst of stupendous amounts of research since the mid-90s into convoluted pathways leading to brain damage after stroke, fancy molecules that supposedly protect tissue from dying, and stem cells that we’ve banked on repairing and replacing what’s been lost, the only thing that’s worked so far is going in there and fishing out the clot. That’s all it takes.

After returning to Berlin, I told a former student of mine about the news. “Well, duh?”, she responded, just a bit sheepishly. My first instinct was to roll my eyes or storm out yelling “You obviously know nothing about how science works!”. But is this kind of naïveté all that surprising? Not really. Somehow we’re wired to believe that if something makes sense it has to be true (here’s a wonderful article covering this). As a scientist, do I have any right to believe that I’m different?

To paraphrase part of a speech given recently by Dr Gawande, what separates scientists from everyone else is not the diplomas hanging on their walls. It’s the deeply ingrained knowledge that science is not intuitive. How do we learn this? Every single day common sense takes a beating when put to the test of the scientific method. After a while, you just kind of accept it.

The result is that we usually manage to shove aside the temptation to follow common sense instead of the evidence. That’s the scientific method, and scientists are trained to stick to it at all costs. But we don’t always – I mean if it makes such clear and neat sense, it just has to be true, doesn’t it?

Never gonna give you up

The first few clinical trials showed that thrombectomy had no benefit to patients, which just didn’t make sense. If something is blocking my kitchen pipes, I call a plumber, they reach for their drain auger and pull it out, and everything flows nicely again. Granted, I need to do so early enough that the stagnant water doesn’t permanently damage my sink and pipes, but if I do, I can be reasonably sure that everything will be fine. But in this case, the evidence said no, flat out.

Despite these initial setbacks, the researchers chased the evidence for the better part of a decade and spent millions of dollars on larger trials with newer, more sophisticated equipment. I’m wondering if what kept them going after all those disappointing results was this same flawed faith in common sense. It works, I’ve seen it work and I don’t care what the numbers say – you hear such things from scientists pretty often.

Another important flaw in the way researchers sometimes think is that we tend to explain the outcomes of “negative” studies in retrospect, looking for mistakes far more scrupulously than we did before the studies started. I don’t mean imperfections in the technique itself (there’s nothing wrong with improving on how a drug or surgical tool works, then testing it again, of course). I’m talking about things that are less directly related to the outcome of an experiment, like the way a study is organised and designed. These factors can be tweaked and prodded in many ways, with consequences that most researchers rarely fully understand. And this habit tends to, in my opinion, propagate the unjustified faith in the authority of common sense.

There’s good evidence to suggest that the earlier mechanical thrombectomy trials were in some ways indeed flawed. But I still think this example highlights nicely that the way scientists think is far from ideal. Of course, in this case, the researchers turned out to be right – the treatment made sense and works marvellously. It’s hard to overemphasise what a big deal this is for the 15 million people who suffer a stroke each year.

Deafening silence

More than a year has passed since the Glasgow conference and this breakthrough has received little attention from the mainstream media. Keep in mind, this isn’t a small experiment on some obscure and outrageously complex intervention that showed a few hints here and there of being useful. It is an overwhelming amount of evidence proving that thrombectomy is by far the best thing to happen to the field of stroke for almost two decades. And not a peep. In fact, if you’re not a stroke researcher or clinician, you’ve probably never even heard of it.

Now, if you read this blog regularly, I know what you’re thinking. I rant a lot about how the media covers science, now I’m complaining that they’re silent? But doesn’t it make you wonder why the press stayed away from this one? I suppose it’s extremely difficult to sell a story about unclogging a drain.

Signing the Dotted Line: Four Years of New Beginnings

Yesterday marked four years since I moved to Europe and started my master’s. I tend to forget these kinds of milestones, but was reminded just a few days ago by one of my best friends, who admirably always seems to be on top of such things (she wrote a wonderful blog post about it).

It’s a bit ironic that I hadn’t realized this was coming up – just last week I wrapped up the latest issue of our graduate program’s newsletter. It’s a celebration of the fifteenth anniversary of the program, and my editorial team and I spent a great deal of time reflecting on the past decade-and-a-half in preparation.

In retrospect, I suppose there was nothing exceptionally momentous about the move itself for me. Compared to some of my fellow students, I hadn’t travelled particularly far, nor was Europe very unfamiliar to me. But the 21st of August 2012 is when I started, as the internet would put it, learning how to adult.

I tell this anecdote all the time, but perhaps it’s worth mentioning one more time. During our master’s program’s orientation week, we were given contracts to sign for the scholarships we were about to receive. They essentially stated that we’re committed to seeing out the whole two years of the program. I stared at that thing for what seemed like an eternity, taking it with me on a walk around the Université Bordeaux Segalen campus.

Two years, I thought – two whole years. That’ll feel like ages – not that I wasn’t going to sign it anyway, it was an incredible opportunity and I knew it. A week and a fourteen-hour train journey later, I was sitting on a bench waiting for the staff to prepare my hostel room. The street, Schönhauser Allee, has since become one of my favourite places in Berlin. I sat there thinking about how the next few years would pan out. I didn’t realize at the time how, like most things in the city, the train whizzing past was paradoxical, an “underground” line running half a dozen metres above street level.

Since then, time has hurried on just like that train – studying in two different cities (three if you count Edinburgh, which, as the initial inspiration for this blog, I definitely do) and simultaneously adapting to both a new career and a life outside my childhood comfort zone.

Almost two years ago, I once again signed a similar piece of paper for my PhD without so much as batting an eyelid.

The Fault in Our Software

Although the vast majority of scientific articles fly well below the radar of the mainstream media, every once in a while one gets caught in the net. A few weeks ago, a research paper attracted a lot of public attention. It wasn’t about some fancy new drug running amok and causing dreadful side-effects or a bitter battle over the rights to a groundbreaking new technology. It was a fairly math- and statistics-heavy paper that found a flaw in an analysis program used in neuroimaging research.

Soon after the article came out and the media took hold of the situation (with gusto), I received a flood of emails, messages, and tags on Facebook and Twitter. These came from well-meaning friends and colleagues who had read the stories and were concerned. So what was all the fuss about?

The headlines were along the lines of (I’m paraphrasing here but if anything my versions are less dramatic, just google it) “Decades of brain research down the drain”. Several scientists have already come out to explain that the whole thing has been blown out of proportion. In fact, it’s a typical example of irresponsible science reporting (see this previous post). After all, people love a good story. And that’s often all that matters.

Inaccurate reporting of science is nothing new.

The “damage” is exaggerated.

Not to state the obvious, but I feel like it’s worth emphasizing that it’s not all brain research that is affected by this bug. Brain imaging is a great tool, and over the past few decades its use in neuroscience has flourished. But neuroscientists use many, many other techniques to investigate the brain. This bug has nothing whatsoever to do with most brain research.

It’s not even all imaging research that’s affected by the bug. We have so many different neuroimaging techniques – like PET, CT, NIRS, SPECT – that I’m expecting we’ll run out of palatable acronyms soon. MRI is just one of them, and functional MRI (fMRI) is a single application of this imaging technique.

A new take on an old problem.

Not since the infamous Case of the Undead Salmon (2009) has fMRI attracted so much criticism and attention. Actually, the salmon study and the paper describing the bug are similar. The flaws they highlight mainly pertain to what is known as task-based fMRI.

Here, what essentially happens is that a subject is presented with a stimulus or instructed to perform a task. The resulting tiny changes in blood flow and oxygenation are disentangled from the brain’s massive “background” activity and all kinds of other (for these purposes) irrelevant signals from inside and outside the body. In fMRI, the brain is divided up into many small units of space called voxels. To find out if the tiny changes caused by the stimulus are distinguishable from the background, statistics are applied to each voxel (there are tens of thousands).

However, every time you run a statistical test you have a certain chance of getting a false positive, and the more times you run the test the higher that chance becomes. Some form of correction for doing this test many times needs to take place. In a nutshell, the Undead Salmon paper showed that if you don’t apply a correction at all, you’ll see things going on in the brain that should definitely not be there (because the salmon is … well, dead).
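To put rough numbers on that, here is a minimal Python sketch of the multiple-testing problem. It is only an illustration under simplifying assumptions: every voxel is treated as an independent test (real voxels are spatially correlated), the figure of 50,000 voxels is just a stand-in for “tens of thousands”, and the Bonferroni correction is used as the simplest example of a correction, not the method examined in either paper. The function names are made up for this example.

```python
import random


def familywise_error_rate(alpha: float, n_tests: int) -> float:
    """Chance of at least one false positive across n_tests independent tests."""
    return 1.0 - (1.0 - alpha) ** n_tests


def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test threshold that keeps the overall false positive rate near alpha."""
    return alpha / n_tests


def count_false_positives(n_voxels: int, threshold: float, rng: random.Random) -> int:
    """Simulate voxels with no real signal and count how many look 'active'.

    With no true effect, p-values are uniformly distributed between 0 and 1,
    so each voxel is simulated by drawing one uniform random number.
    """
    return sum(rng.random() < threshold for _ in range(n_voxels))


if __name__ == "__main__":
    alpha = 0.05
    n_voxels = 50_000  # illustrative assumption, not a figure from the papers
    rng = random.Random(2009)

    print("Chance of at least one false positive:")
    for n in (1, 10, 100, n_voxels):
        print(f"  {n:>6} tests: {familywise_error_rate(alpha, n):.4f}")

    uncorrected = count_false_positives(n_voxels, alpha, rng)
    corrected = count_false_positives(
        n_voxels, bonferroni_threshold(alpha, n_voxels), rng
    )
    print(f"'Active' voxels with no signal, uncorrected: {uncorrected}")
    print(f"'Active' voxels with no signal, Bonferroni:  {corrected}")
```

With a threshold of 0.05 and no correction, roughly one in twenty signal-free voxels will look “active” purely by chance, which is the dead salmon in a nutshell.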

The new paper showed that one approach used to limit the number of false positives, implemented in several commonly used fMRI analysis programs, doesn’t work. This failure was caused by two things – a bug in the code of one of the programs and the fact that, as the paper showed, fMRI data violate an important statistical assumption needed for the approach to be valid (basically, because the characteristics of the data do not fit the analysis strategy, the result is unreliable).

Both a bug in the code and an inherent problem with the analysis are to blame.

The reality in my case.

After reading the news, I read the actual paper. Several times, in fact, and I’m not completely sure if I fully understand it yet. It’s not really my research focus. Although I do use fMRI, I do it in an entirely different way. My project actually repurposes fMRI – which is one of the reasons why I like it so much, because I get to do a lot of creative/innovative thinking in my work.

It also comes with the seemingly obvious yet still underestimated realization that making something new – or putting an existing technique to new use – is very, very hard. In my field my peers and I rely heavily on people far smarter than me (this isn’t humility, I’m talking objectively smarter here). These are the biomedical engineers, physicists, statisticians, computer scientists, and bioinformaticians who develop the tools used in neuroimaging research. Ask any imaging researcher – these people are not “para-researchers” – their role is not “just” supportive, it’s fundamental.

Hoping the hyperbole brings about change.

The trouble is, most of the time we use these tools to test hypotheses without thinking much about how they’re doing the things that they do. That’s understandable in a way – these things can be very, very complicated. It’s just not what biomedical researchers (especially those with a medicine/biology background) are trained to do.

The stakes are high for research software developers.

But incidents like these give us reason to stop and think. It’s a fact that people make mistakes, and if your role is as important as developing a tool that will be used by thousands of researchers, the stakes are much higher. When I mess up, the damage is usually far less widespread and hence controllable.

But that doesn’t mean we can’t do something to help. As the authors of the article pointed out, this problem would probably have been discovered much earlier if people shared their data. If the data had been accessible, someone would have realized that something was amiss much sooner, not after the software had been in use for a whole 15 years and thousands of papers had been published.

Data sharing would have limited the damage.

Many research software developers encourage people to use their tools and openly share their experience with others so that bugs can be identified and fixed (all 3 software programs assessed in the paper are open source). Sadly, the way we currently do science is stubbornly resistant to that kind of policy.

Reflections on the Mechanics of Research

As I sit in my room on a lazy Sunday afternoon, I start to think about my last blog post – it’s been three days already. “I need to keep up the momentum”, I tell myself. That drive that, less than a week ago, relaunched this blog after I procrastinated for ages. If I don’t write something now, I won’t again for months, so here I am. But this isn’t a post reflecting on my writing habits, it’s about me and my peers – specifically, why we do things the way we do.

I’m writing this because I realize that I’m fortunate to work with incredible people who know many other incredible people and who enjoy collaborating and discussing ideas. Which means that, in my PhD, things are constantly in motion, new ideas pop up almost every day and there is always lots and lots to do.

That brings me back to a ubiquitous concept – momentum. In scientific research, after an idea is proposed, a plan is made to explore this idea. But every scientist knows the familiar feeling – weeks, months or years later the initial excitement fizzles out. Over the past few years of doing research, I’ve come to realize that this loss of momentum can’t be blamed solely on failure, it’s a combination of several different things.

Looking back, the idea just doesn’t seem as exciting as it used to.

Not because you later realize that someone has already tried it or that it’s generally a useless idea. It’s just not novel anymore – neither to you nor to the people with whom you’re working. Simply, it stops being new and therefore stops being attractive.

When the idea is first proposed, you know little about it – the possibilities are endless. Then during the research process, you (hopefully) learn more. And that should be enough to keep us going – as scientists we like to think that we find pleasure and motivation in the pursuit of knowledge. But the more familiar the topic becomes, the less drive we have to follow that particular line of research.

This is counter-productive of course because, although the idea is now familiarly boring to you, it is probably still novel and interesting to the scientific community. So the idea itself hasn’t lost its merit, but your valuation of it has diminished.

Is novelty more important to scientists than the pursuit of knowledge?

We lose confidence in the goals we set out to achieve.

It’s not that we lose confidence in our ability to achieve these goals. Although that’s also something that is extremely common and important, it’s a more gradual process that affects some people more than others. What I mean is that the goals themselves seem less within reach because of trivial and often illogical reasons.

Every setback – however small or easy to overcome – leaves a lasting mark on how “achievable” we think a goal is. It seems absurd that the fact that a reagent turned out to be faulty and you wasted three weeks running Western blots in vain should affect how likely it is that protein X protects against disease process Y. But this rather subtle logical fallacy (not sure if there’s a specific term for it in psychology, but there should be) commonly affects researchers and can be devastating.

This cumulative process sneaks up on you – seemingly harmless assumptions about your data pile up and little workarounds coalesce into the stuff of nightmares. Until finally, a swarm of bees invades the lab because someone left the window open and you just go home and give up on the entire project.

To me, proof that these small frustrations are responsible for destroying good ideas is that the people “in the trenches” are most prone to this loss of momentum. The undergraduate, MSc, and PhD students doing the hands-on work. I don’t think it’s because they’re less experienced and therefore less resilient. It’s just that they’ve seen things during the course of a project that their seniors, preoccupied with “bigger picture” thoughts, have not. Things that eventually have these poor souls perpetually repeating “Yeah, it was a nice idea – but it’s just not that simple”.

Give us a little push, and we’re off.

The problem may be that researchers, myself included, have low inertia – it’s easy to get us excited. Indeed, it’s very easy for us to excite ourselves as well (“What a great idea, if this works it could be groundbreaking!”). At least in the beginning when an idea is still fresh. This intellectual enthusiasm is a fundamental characteristic of a scientist, and I’m not saying it’s a bad thing.

What’s obviously bad is not pausing to think – really think – about an idea before diving into testing it. The sad thing is, the vast majority of researchers do stop and think. We list every possible outcome and every setback we can think of, until we’re convinced that we’re prepared for what’s to come. But often it’s still not enough to keep the momentum going down the road.

At this point in my career, I’m not sure if our low inertia is the gift/curse that will bless/haunt researchers forever no matter what we do, or if we can learn to be better at taking advantage of its perks and avoiding its drawbacks.

Scientists have low inertia.

I’d really like to hear what other scientists think about all this – because it could be just a matter of personality. Perhaps I’m an impatient defeatist with low self-confidence (trust me, I’m at least a little bit of each of those things) and thus, everything I mentioned applies to people like me but not the vast majority of researchers.

I have no clue – all I know is, even as I’m writing this post, it’s starting to look less like something that should be read by intelligent people and more like the incoherent ramblings of a frustrated graduate student.

Featured image: “Blurred motion Seattle Wheel at dusk 1” by Brent Moore, Flickr http://bit.ly/1UikTFI

Ahmed A. Khalil

MD PhD