A large software system is one of the most complex machines ever built. When you combine complex software with big data, the number of possible execution paths is inconceivably huge. So how is it that software engineers can build and run these enormously sophisticated machines? (Yes, I agree these systems are far from perfect, but in general I’ve seen significant progress over the many years that I’ve been in the field, so please bear with me.)
There are several answers to this question, ranging from encapsulation (e.g., object-oriented approaches used to hide complexity behind interfaces), to distributed processing technologies (enabling simple components to interoperate), to good software practices (such as code generators), to test-driven software processes (like the CTest/CDash process). All of these have been well known for many years, as even a brief search will tell you [MacDonald] [Kuhn]. The point is that as software developers, we realize that the technology foundations we build are critical to making future progress and that managing complexity is central to our practice. We know that our systems must generate reproducible results if we are to depend on them; otherwise applications built on them are worthless, and in the worst case may result in grievous harm.
With this generally successful example of managing complexity through reproducibility behind us, what are we to make of the recent dismal report in Nature, which states that on the order of 90% of papers published in science journals describing “landmark” breakthroughs in preclinical cancer research describe work that is not reproducible, and are thus just plain wrong? (C. Glenn Begley, formerly head of cancer research at pharmaceutical company Amgen, and Lee M. Ellis, a cancer researcher at the University of Texas, authored this astonishing paper.)
Repeat after me: approximately 90% of paper results were not reproducible. I don’t know about you, but this paper result, if true, is one of the most disturbing scientific findings that I have ever read. Just think of what this means to pharmaceutical companies and other healthcare providers who are building therapies based on these suspect results; we are talking possible harm to patients, billions of dollars misspent, and years of potentially wasted work. Mostly it shakes my faith in the foundations of this research, and I wonder how we are to advance the scientific frontier and maintain public trust in the scientific method. And once again I’m reminded of the imperative to practice Open Science.
What medical research and software have in common is that they are both enormously complex undertakings. Because the people doing the work are human, there will be errors and biases, and in the worst cases even fraud. The only way to combat these unfortunate human qualities is to provide methods to identify and correct issues. As any open source practitioner and dedicated scientist will tell you, producing reproducible results using transparent methods and data, which are then made publicly available to others, is one of the best ways to do this.
In the past we’ve gotten away with much. Methods and data were hidden or obfuscated behind opaque papers, firewalls, and claims regarding intellectual rights. Often the level of complexity was low enough that outsiders could work around these barriers by reproducing software or repeating a research trial. But times are different now, and the increasing sophistication of our technologies and the cost of scientific research mean that we have to pull out all the stops to address the looming Complexity Barrier. The scale of the challenge is such that if we do not, our technological progress will be greatly impeded if not halted. To make progress and build the foundations of technology, the Complexity Barrier demands that we produce transparent, open, and reproducible results that can be verified and ultimately relied upon. Some scientific method purists might say that this paper, which repudiates so many results, is simply science in action, and that given enough time the scientific method always corrects itself. This could be, but my suspicion is that self-correction is now harder to do, with closed systems contributing much of the burden.
Certainly there are moral, ethical, and philosophical arguments to be made here, which are probably not going to motivate some of us on a day-to-day basis. However, as an engineer-at-heart who likes to get things done and make a difference, I see a very pragmatic rationale for doing business and making a living in accordance with the three Open Science principles of Open Access, Open Data, and Open Source. That is: if you don’t practice these principles, it is likely that those who do will leave you in the dust, whether measured in terms of career accomplishments, innovation impact, or business success. Just as the common saying “Go Big or Go Home” describes a winning competitive strategy, we now have to mind the corollary “Go Open or Go Home” if we are to compete in the modern world.
In a sadly ironic twist, this important paper is only available behind a paywall. Further, the authors propose that these circumstances are due to the high-pressure research environment that forces researchers to publish or die. Specifically, they say researchers must be more willing to report negative findings in their papers and that research facilities should change their policies regarding publishing. All likely true, but they miss the essential conclusion: science that is not reproducible is not science; we must remove barriers to reproducibility if we are to hurdle the Complexity Barrier and advance scientific knowledge.
Fortunately some of the most prestigious journals are embracing the practice of reproducible verification, and leading the way towards restoring the practice of scientific research. It is now up to each one of us to support these initiatives and to promote the adoption of Open Science and reproducibility in the technical societies, institutions, and communities in which we participate.