How Wikipedians put Researchers to Shame
While searching online for the latest figures on web browser market share,
I came across this interesting figure on Wikipedia:
As interesting as this figure is for driving endless geek discussions, the same Wikipedia page held a gem of much greater interest!
Under the section “Summary”, with the description:
We do indeed find the R source code:
Followed by the instructions on how to run the code in Linux
This is the Hallmark Signature of the true Scientist:
Describe the process of the experiment and provide all the elements needed for an independent observer to replicate the work.
This illustrates that we already have at hand all the tools needed to practice Reproducible Research on a daily basis. What we are mostly confronted with is a cultural shift: making it clear that it is no longer acceptable to publish papers that do not include the full set of elements needed to verify their replicability.
Of course, it is not enough to “take their word for it…”, so I followed the instructions, copy-pasted the R code into a file, and ran it on an Ubuntu Linux machine as Rscript webbrowsers.r, which produced the following replicated SVG file:
“Usage share of web browsers (Source StatCounter).svg”
Tip of the hat to the Wikipedia page authors:
Thanks for showing how scientific work ought to be performed and published!
“Nullius in Verba”
“Take Nobody’s word for it”
Stepan Roucka rightly pointed out that in the R example above, the data was
already embedded in the script and therefore, it was not a clean example of
reproducible research. Particularly because the provenance of the data was
not explicit, and it was not clear how the data was put into the script.
This addendum attempts to address Stepan’s point, this time using an
IPython script, and putting all the elements together in this git repository:
The data was downloaded from StatCounter, using the URL listed here:
and the same URL is used directly in the Python script,
taking advantage of pandas' ability to download and read files from a URL.
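To make that point concrete, here is a minimal sketch of how `pd.read_csv` accepts a URL, a local path, or a file-like object through the same call. The helper name `load_browsers` and the two-row sample are hypothetical and only stand in for the real StatCounter file; the `Date` column name is an assumption about the CSV layout, and the numbers are placeholders, not real measurements.

```python
import io

import pandas as pd

# Hypothetical stand-in for the StatCounter CSV. The browser column names
# match those used in the script; the "Date" column is an assumption and
# the values are placeholders, not real data.
SAMPLE_CSV = """Date,IE,Firefox,Chrome,Safari,Opera
2008-12,60.0,30.0,5.0,3.0,2.0
2009-01,59.0,30.5,5.5,3.0,2.0
"""


def load_browsers(source):
    """Read the browser-share table from a URL, a local path, or a buffer."""
    # pd.read_csv handles all three kinds of source with the same call,
    # so switching between the live download and a saved copy is one argument.
    return pd.read_csv(source)


# An in-memory buffer is used here so the sketch runs without network access.
browsers = load_browsers(io.StringIO(SAMPLE_CSV))
print(browsers.columns.tolist())
```

Passing the StatCounter URL or the path of the saved CSV to `load_browsers` instead of the buffer should work the same way, as long as the server still returns a CSV at that address.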
The full script is now:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# Read data from the local disk
# browsers = pd.read_csv('../data/browser-ww-monthly-200812-201402.csv')
# Or download it from StatCounter directly
browsers = pd.read_csv('http://gs.statcounter.com/chart.php?201402=undefined&device=Desktop%20%26%20Mobile%20%26%20Tablet%20%26%20Console&device_hidden=desktop%2Bmobile%2Btablet%2Bconsole&statType_hidden=browser&region_hidden=ww&granularity=monthly&statType=Browser&region=Worldwide&fromInt=200812&toInt=201402&fromMonthYear=2008-12&toMonthYear=2014-02&multi-device=true&csv=1')
somebrowsers = browsers[['IE','Firefox','Chrome','Safari','Opera']]
The repository also has the corresponding IPython notebook,
which generates the following figure:
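The listing above stops at the column selection, so the plotting step is not shown. What follows is only a sketch of how such a figure might be drawn, not the code from the repository. It assumes the CSV has a `Date` column, uses placeholder numbers instead of the real StatCounter data, and writes to a hypothetical file name `browser_share.svg`.

```python
import io

import matplotlib
matplotlib.use("Agg")  # render off-screen so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Placeholder rows standing in for the StatCounter CSV (not real data).
SAMPLE_CSV = """Date,IE,Firefox,Chrome,Safari,Opera
2008-12,60.0,30.0,5.0,3.0,2.0
2009-01,59.0,30.5,5.5,3.0,2.0
2009-02,58.0,31.0,6.0,3.0,2.0
"""

browsers = pd.read_csv(io.StringIO(SAMPLE_CSV))
somebrowsers = browsers[['IE', 'Firefox', 'Chrome', 'Safari', 'Opera']]

# One line per browser; the x axis is the row position, labelled by month.
ax = somebrowsers.plot()
ax.set_xticks(range(len(browsers)))
ax.set_xticklabels(browsers['Date'].tolist(), rotation=45)
ax.set_xlabel('Month')
ax.set_ylabel('Usage share (%)')
ax.set_title('Usage share of web browsers (Source StatCounter)')

plt.tight_layout()
plt.savefig('browser_share.svg')  # hypothetical output file name
```

Replacing the in-memory sample with the `browsers` DataFrame read in the script should give one line per browser over the full 2008-12 to 2014-02 range, although the tick labels would then need thinning to stay readable.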
Thanks, Stepan, for pointing this out.