In December 2019 the first cases of pneumonia of unknown etiology were reported in Wuhan city, People’s Republic of China (World Health Organization, Novel Coronavirus (2019-nCoV) SITUATION REPORT 1 – 21 JANUARY 2020). Since the outbreak of the disease, officially named COVID-19 by the World Health Organization (WHO), a multitude of papers have appeared. By one estimate, the COVID-19 literature published in January-May 2020 has reached more than 23,000 papers, among the biggest explosions of scientific literature ever.

In response to the COVID-19 pandemic, the White House and a coalition of leading research groups have prepared the COVID-19 Open Research Dataset (Wang, Lucy Lu, Kyle Lo, Yoganand Chandrasekhar, Russell Reas, Jiangjiang Yang, Darrin Eide, Kathryn Funk, et al. 2020. “CORD-19: The Covid-19 Open Research Dataset.” arXiv Preprint arXiv:2004.10706), a resource of over 134,000 scholarly articles about COVID-19, SARS-CoV-2, and related coronaviruses. The Center for Systems Science and Engineering at the Whiting School of Engineering, with technical support from ESRI and the Johns Hopkins University Applied Physics Laboratory, is maintaining an interactive web-based dashboard to track COVID-19 in real time (Dong, Ensheng, Hongru Du, and Lauren Gardner. 2020. “An Interactive Web-Based Dashboard to Track Covid-19 in Real Time.” The Lancet Infectious Diseases 20 (5): 533–34). All data collected and displayed are made freely available through a GitHub repository. A team of over one hundred Oxford University students and staff from every part of the world is collecting information on several different common policy responses governments have taken. The data are aggregated in The Oxford COVID-19 Government Response Tracker (Hale, Thomas, Samuel Webster, Anna Petherick, Toby Phillips, and Beatriz Kira. 2020. “Oxford Covid-19 Government Response Tracker.” Blavatnik School of Government). Google and Apple released mobility reports to help public health officials. Governments all over the world are releasing COVID-19 data to track the outbreak as it unfolds.

With so much heterogeneous data now available, harmonizing it has become critical to helping researchers and policy makers contain the epidemic. To this end, we developed the COVID-19 Data Hub, designed to aggregate data from several sources and to let contributors collaborate on implementing additional data providers. The goal of our project is to provide the research community with a unified data hub that collects worldwide fine-grained case data and merges it with exogenous variables helpful for a better understanding of COVID-19.

The data are processed hourly by a dedicated server and harmonized in CSV format on cloud storage, so that they are easily accessible from R, Python, MATLAB, Excel, and any other software. The data are available at three levels of granularity: 1) top-level administrative areas, usually countries; 2) states, regions, cantons; 3) cities, municipalities.
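As a rough illustration of how the three levels of granularity might be used, the sketch below filters a small CSV by level using only the Python standard library. The column names (`level`, `country`, `region`, `city`, `confirmed`) and all values are made up for this example and are not taken from the actual files; the real column layout should be checked against the project's documentation.

```python
import csv
import io

# Toy rows standing in for a harmonized CSV file. The column names and the
# numbers are illustrative assumptions, not real COVID-19 Data Hub data.
SAMPLE_CSV = """date,level,country,region,city,confirmed
2020-03-01,1,Switzerland,,,10
2020-03-01,2,Switzerland,Ticino,,4
2020-03-01,3,Switzerland,Ticino,Lugano,1
2020-03-02,1,Switzerland,,,12
"""

# Human-readable labels for the three levels described in the text.
LEVEL_NAMES = {
    1: "country",
    2: "state / region / canton",
    3: "city / municipality",
}


def rows_at_level(csv_text, level):
    """Return the rows whose granularity level equals `level` (1, 2, or 3)."""
    if level not in LEVEL_NAMES:
        raise ValueError("level must be 1, 2, or 3")
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row for row in reader if int(row["level"]) == level]


country_rows = rows_at_level(SAMPLE_CSV, 1)
region_rows = rows_at_level(SAMPLE_CSV, 2)
city_rows = rows_at_level(SAMPLE_CSV, 3)

for row in country_rows:
    print(row["date"], row["country"], row["confirmed"])
```

Because the files are plain CSV, the same filter could be written just as easily in R, MATLAB, or a spreadsheet; the point is only that a single level column lets one file layout serve all three granularities.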

The first prototype of the platform was developed in spring 2020, initially as part of a research project that was later published in Springer Nature and showcased on the website of the Joint Research Center of the European Commission. The project was then launched at the #CodeVSCovid19 hackathon in March, was funded by the Canadian Institute for Data Valorization (IVADO) in April, won the CovidR contest in May, was presented at the European R Users Meeting eRum2020 in June, and was published in the Journal of Open Source Software in July. At the time of writing, we count 3.43 million downloads and more than 100 members in the community around the project.

COVID-19 Data Hub has recently received support from the R Consortium, the worldwide organization that promotes key organizations and groups developing, maintaining, distributing and using R software as a leading platform for data science and statistical computing.

We are now in the process of establishing close cooperation with professors from the Department of Statistics and Biostatistics of the California State University, in a joint effort to maintain the project.

Emanuele Guidotti & David Ardia


Guidotti Emanuele / Ardia David, COVID-19 Data Hub, Journal of Open Source Software, 2020.

Author(s) of this contribution:

PhD student in Finance at the University of Neuchâtel. Emanuele is a partner at Algo Finance Sagl, a software start-up developing financial algorithms for the asset management industry. He is passionate about interdisciplinary research at the intersection of Finance, Data Science and Statistics.