The scientific method: it all starts with a simple, essential question. How can you "know" something?
How can we gather knowledge and have confidence in its correctness? The lucubrations of many smart minds over the centuries refined the answer into the following steps:
1. You can never know something with certainty.
2. You can propose a hypothesis.
3. You then proceed to test that hypothesis with experiments.
4. Each experiment must attempt to show that the hypothesis is false.
5. The more you fail at showing that the hypothesis is false, the more confidence you gain in the hypothesis.
6. If you ever succeed in showing that the hypothesis is false, you discard the hypothesis, let others know, get more coffee, and go back to step 2.
The reason why, contrary to popular belief, it is not possible in steps 4 and 5 to simply run an experiment to show that the hypothesis is true, is that the outcome of an experiment could also be explained by hundreds of other hypotheses; our current beloved hypothesis is not the only possible explanation for what we observe. The fact that the outcome of an experiment is consistent with what the hypothesis would predict does not imply that the hypothesis is correct. Such consistency only gives us some confidence that we can continue using that hypothesis as a good description of how the real world works. An experiment that fails to prove the hypothesis false is simply one more piece of evidence consistent with the hypothesis.
Such is the scientific method that we hold today as the best method to "know" something. The best hypotheses we have are the ones that have survived hundreds of repeated attempts to prove them false. We group hypotheses into larger bodies of knowledge that we call theories, and we like the theories for which we continue to fail in our repeated experimental attempts to prove that they are false.
Of course, the experiments that we run in step 4 may be imperfect. Experimental errors may creep in. Computation errors may have slipped into the data analysis. Software bugs may have been present in the program that we used to compute the outcomes. Lack of coffee may have misled the observations of the experimentalist, or gremlins and leprechauns may have been perturbing the measurement instruments... Therefore, before we trust the outcome of the experiment itself, we must repeat it, and we must do so many times. Enough times to make the argument that it is unlikely that the outcome is a simple statistical outlier. The repetitions must also be done by others, under a variety of conditions, to ensure that our mindset has not biased the experiment and the analysis that leads to our conclusions. Independent verifications also help to discard the potential effects of our particular set of equipment, and of the local conditions under which we performed our experiment.
When the experiments are replicated by others under processes that are independent of ours, and arrive at similar outcomes, we gain a lot of confidence in such experiments.
This is no news to software developers.
As developers, we are not satisfied with "the program runs on my computer," nor the lazy "it works for me" explanation. We demand that the program be tested on different computers, by different people. We expect the code to include tests, and the tests to have been run by many different people. Only then do we trust the code.
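The parallel can be made concrete. A unit test is, in effect, an attempt to falsify the hypothesis "this function is correct," and anyone with the code can rerun the experiment independently. Here is a minimal sketch in Python; the function `mean` and its test are hypothetical examples, not taken from any particular project:

```python
# Hypothetical example: a small function and a test that attempts to
# falsify the hypothesis "mean() is correct".

def mean(values):
    """Return the arithmetic mean of a non-empty sequence of numbers."""
    if not values:
        raise ValueError("mean() requires at least one value")
    return sum(values) / len(values)

def test_mean():
    # Each assertion is an experiment: a single failure falsifies
    # the hypothesis; passing only adds one more piece of evidence.
    assert mean([1, 2, 3]) == 2
    assert mean([10]) == 10
    assert abs(mean([0.1, 0.2]) - 0.15) < 1e-9

test_mean()  # anyone can repeat this "experiment" on their own machine
```

No number of passing runs proves `mean` correct; they only fail to prove it wrong, which is exactly the confidence the scientific method offers.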
It is essential, therefore, for the proper functioning of the scientific method to have a system of dissemination of information by which researchers can share with others the full description of their hypothesis, and the precise recipe used to perform their experiments. Armed with that precise recipe, other researchers are expected to repeat the experiments, and in turn report on the outcome of such experiments. Sometimes the outcomes will match, sometimes they will not, and then all the researchers involved are supposed to look at the details of the process to figure out the causes behind the differences.
The information dissemination tools used by the scientific community have changed over the centuries. Initially, they were simple letters sent to other scientists. Then, they evolved into publications printed by scientific societies focused on a particular scientific domain. The content of the publications was supposed to include the full details on how someone else could repeat the experiments. The recipients of the publications would attempt the recipe and then publish back their results. This process of experimental verification of reproducibility came to be called "peer review." It was typically done by your "peers," those other geeky scientists like you, who preferred spending time in the labs in the basement of the physics building over the pleasures of going to the beach on a Sunday.
That's how the scientific method is supposed to work. Galileo Galilei built a telescope, pointed it to the sky, found that Jupiter had four satellites, wrote a letter about it, and sent copies to his peers. The peers built similar telescopes, pointed them to the sky, and observed said satellites around Jupiter. Life was good! But that was the 1600s.
Since then, scientific research has become professionalized. It is now an economic activity that brings funding to academic institutions, promotes activities in R&D firms, and provides a significant number of people with careers and a means of making a living. This led to a degradation in the standards of quality in the process. The rigorous peer review that used to include the full replication of the experiment degraded into a cozier peer review in which the peer now takes the liberty of skipping the verification of the experiment. After all, what's the point of running an experiment if I already "know" what the outcome is going to be? (This is sarcasm.)
In today's environment of academic publishing, and its associated process of "peer review," the rigorous tradition of verifying reproducibility has been degraded and diluted. Too often, peer reviewers base their reviews solely on their reading of a particular submission and on their previous experience; they do not repeat the experiments, yet they go on to pronounce opinions on the correctness and value of the paper. The time that reviewers dedicate to writing a review is often in the range of two to four hours. This includes the time to read the paper and the time dedicated to figuring out a set of reasonable criticisms that show the editor that the reviewer was indeed qualified to have something to say about the paper. The activity of review is also unpaid, and normally anonymous. Many reviews are actually performed during flights and waiting time in airports, or after hours and on weekends. This approach to "peer review" has been degraded and devalued to the mere venality of "you say, I say." Such reviewers are, by definition, "pundits," not scientists.
The excuses, of course, are many. “It’s too expensive to repeat the experiment.” “It takes too long.” “I have another paper to write so it gets to be reviewed by others to feed the eternal cycle of publish-or-perish.”
Knowing that such is the process, authors don't bother with describing in their papers the "little details" of how to replicate their experiments. Some authors don't even bother with repeating their own experiments a couple of times before publishing, so they never get to discount the known interference effects of gremlins and leprechauns. Some authors go a step further and take the liberty of skipping the experiment altogether, since, after all, what's the point of running an experiment if they already "know" what the outcome is going to be? (Sarcasm here, again.)
Eric Raymond made the point that "open source is the application of the scientific method to the process of software development." In particular, the aspects of information sharing, educational commitment, peer review, and real experimental verification are ingrained in the way open source software projects run their daily business. These concepts were consciously or unconsciously copied from the way scientific communities used to run their operations. Today, there is a thing or two that open source communities can bring back to academic communities when it comes to remembering how knowledge is to be obtained, verified, and shared.
Many academic communities would benefit from Linus Torvalds's simple dictum, "Talk is cheap; show me the code."
It reminds us that merely talking and offering opinions about scientific topics is not enough. The means and practices of verifying the claims made in the talk must be part of the scientific and academic discourse.
In my next post, I'll cover how today's academic institutions, scientific and technical societies, academic publishers, funding agencies, and scientists implement the peer-review system, and what motivations and reward systems animate their interactions.