By Guido España
Researchers who try to replicate other researchers' findings face several difficulties. For instance, the instructions could be incomplete, and the code and data might not be available. eLife's first computationally reproducible article aims to solve this issue by allowing readers to interact with the code that generates every figure in the paper. This open-source project (Stencila) is a step towards reproducible research. However, there are other obstacles that authors need to address in their work habits to produce research findings that can be reproduced by others and, more critically, by themselves.
Five obstacles to reproducible research
Publishing a manuscript is an iterative process with several steps, from a preliminary idea to an initial draft and, hopefully, to publication. This long process involves many changes to the manuscript, code, and figures, driven by revisions from collaborators and journal reviewers, or by the researchers themselves. Many things can get in the way of reproducibility. Personally, I have had trouble keeping results up to date because there are often disconnects or manual steps in the process of gathering and analyzing data and creating figures or tables. It is particularly difficult to remember all the steps needed to create a figure several months after the first draft of the manuscript.
In my opinion, there are five main obstacles in the path of reproducible computational research:

1. Keeping results up to date.
2. Remembering previous versions of the manuscript.
3. Collaborating with co-authors.
4. Responding to reviewers.
5. Sharing reproducible research with other researchers.
The first step towards reproducibility is file management. It is very important to name files in such a way that they are easy for both humans and computers to understand, and easy to recall. For humans, this means that names should be meaningful; for computers, this means that names should not contain problematic characters (spaces, commas, periods, or others like #$%^). Directory structure is also important for keeping research files organized. For instance, to separate the manuscript files from the scripts and data, one could keep each in its own subdirectory of a main project directory.
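As a sketch, a layout along these lines keeps everything easy to find (the directory names are just an example):

```
my_project/          # main project directory
├── data/            # raw and processed data, never edited by hand
├── scripts/         # analysis code that reads data/ and writes figures/
├── figures/         # generated figures; safe to delete and rebuild
└── manuscript/      # the .tex (or .Rmd/.org) source of the paper
```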
GNU make is a tool for building files (often executable files) that depend on many other files, and it is often used to automate the steps that turn raw data into figures, tables, and the final manuscript. For emacs enthusiasts, org-mode is a powerful alternative. See this tutorial for an introduction to manuscript writing with org-mode.
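A minimal Makefile for such a workflow might look like this (the file names are hypothetical; recipes must be indented with tabs):

```makefile
# Rebuild the manuscript whenever the text or the figure changes
manuscript.pdf: manuscript.tex figures/figure1.pdf
	pdflatex manuscript.tex

# Rebuild the figure whenever the script or the data change
figures/figure1.pdf: scripts/figure1.R data/results.csv
	Rscript scripts/figure1.R

.PHONY: clean
clean:
	rm -f figures/figure1.pdf manuscript.pdf
```

Running `make` after editing any of these files re-runs only the steps whose inputs changed.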
Using LaTeX (or knitr, or org-mode) would be ideal for anyone collaborating with a team of LaTeX users. However, this isn't always the case; in many fields, MS Word is the prevalent word processor. If you want the capabilities of git and LaTeX but still need to share Word documents with others, pandoc is probably the best tool for converting between .tex and .docx documents. A brief introduction for this purpose is Alexander Branham's guide to pandoc. Using pandoc, LaTeX documents can be converted to MS Word to share with collaborators, and their changes can then be incorporated into the main LaTeX document.
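The round trip looks roughly like this (file names are placeholders):

```sh
# LaTeX -> Word, to share with collaborators
pandoc -s manuscript.tex -o manuscript.docx

# Word -> LaTeX, to pull a collaborator's edits back
pandoc -s manuscript_edited.docx -o manuscript_edited.tex
```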
GitHub or GitLab can be used to create a remote repository and share it with collaborators. If the workflow to create the manuscript has been automated, then sharing with others should be straightforward with GitHub or GitLab, where other researchers could replicate the manuscript results and use them in their own research. To share with the scientific community, a manuscript published in a scientific journal could include the URL for the GitHub repository with all the necessary steps to create the manuscript.
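For a reader, reproducing such a manuscript can then be as simple as the following (the repository URL is hypothetical, and the last step assumes a Makefile like the sketch above drives the workflow):

```sh
git clone https://github.com/username/manuscript-repo.git
cd manuscript-repo
make  # re-run the analysis and rebuild the figures and manuscript
```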
This post is a brief summary of tools useful for reproducible computational research. A lot of the content in this post comes from Gandrud's book on reproducible research. I also created some slides and some demos of reproducible manuscripts using the tools mentioned here: LaTeX, knitr, and org-mode. You can download the slides here.
Furthermore, many more helpful materials are available online.
From hurricanes to mosquitoes: How epidemic forecasting can support UNICEF’s fight against infectious diseases
by Rachel Oidtman, Alex Perkins, Moritz Kraemer, Manuel Garcia-Herranz, and Elisa Omodei
Originally published on: https://www.unicef.org/innovation/stories/epidemicforecasting
The decision to grab your umbrella as you head off to work in the morning is shaped by the daily weather forecast. Although meteorology has been a formal science since the 19th century, it was only in the last 20 years that forecasts became accurate enough (>60% accuracy) for us to trust them and base decisions on them. In the background of our daily lives, there has been a “quiet revolution” in weather forecasting, kick-started by a number of technological and scientific advances over the last few decades.
Even with considerable technological advancements across all scientific disciplines, including a vast increase in the amount of data and more realistic models, meteorology is the only field that has achieved such clear success in forecasting. This raises the question: What if we could forecast influenza with the accuracy of rain forecasts? Or, what if we could forecast an emerging disease with the accuracy of hurricane forecasts?
Weather forecasting as a model for epidemic forecasting
With hurricanes, meteorologists know the seasons in which they occur, but they cannot begin forecasting the progression of a hurricane system until the first signs of a storm appear (i.e., a low pressure system with thunderstorms during hurricane months). Early on in hurricane forecasts, there is significant uncertainty surrounding both the spatial trajectory and the severity of the storm. With weekly weather forecasts, meteorologists know seasonal weather patterns and have extensive geographic data on weather variables. There is uncertainty around weekly weather forecasts, but it is much more constrained than for hurricane forecasts. To build and improve upon both hurricane and weekly weather forecasts, meteorologists collect more data, invest in better satellites (collecting more precise data), and consider more models.
Learning from the ways in which meteorologists improved both extreme and seasonal weather forecasts, it becomes clear that better models and more data could help to improve disease forecasting. Traditional disease surveillance data, however, cannot be collected as efficiently and consistently as weather data. Hence, the scientific community is devising more creative options to improve disease surveillance and epidemic forecasting. One such option is to use unconventional data sources (e.g., human mobility data, social media traces, web searches) to represent biological processes (e.g., pathogen dispersion) and thereby enhance the realism of assumptions about disease spread in forecasts. Another, complementary option is to organize forecasting “challenges”, which promote the development of more accurate models and offer a means to compare the accuracies of different models.
Forecasting familiar and foreign foes
Even with the sparse and varied nature of epidemiological data, there has nonetheless been progress in disease forecasting of both recurring seasonal diseases and emerging diseases.
“Familiar foes” that exhibit recurring, seasonal transmission, such as influenza and dengue, have received considerable attention. One of the first major initiatives was Google Flu Trends, launched in 2008, which used Google search queries to forecast flu incidence. Since then, there has been consistent progress as academic researchers have steadfastly worked on developing ever more accurate models. The most notable example is the CDC-developed flu forecasting initiative, “FluSight”, which serves as a platform for visualizing and sharing real-time weekly flu forecasts contributed by modeling teams around the world. Dengue has also been at the center of the scientific community's attention since 2015, when the Dengue Forecasting Project, a forecasting contest based on historical data, was launched.
“Foreign foes” that appear with little warning and can spread rapidly across international borders remain a more elusive challenge for disease forecasting, with recent notable examples including Ebola and Zika viruses. Chikungunya, a mosquito-borne disease that spread across the Americas in 2013-2015, offers another example. While retrospective analyses indicated some success with forecasting the sequence in which different islands in the Caribbean would be invaded, a near real-time forecasting challenge sponsored by the Defense Advanced Research Projects Agency (DARPA) showed that most models had difficulty forecasting other features of the epidemic, such as the week in which incidence would peak. Forecasting was also performed, somewhat controversially, in support of the response to the Ebola epidemic in West Africa around the same time.
Human movement and forecasting
A variety of biological characteristics among different pathogens may contribute to variation in how difficult they are to forecast. For example, outbreaks of some diseases depend almost exclusively on human contact (e.g. influenza), whereas others involve frequent spillover from animals (e.g. MERS). One way to advance the science of forecasting across such a wide range of infectious diseases is to focus on challenges that are common to forecasting of many pathogens. One such challenge is capturing the role of human movement in pathogen spread.
Consider an emerging disease, such as Zika. Early on in its epidemic in the Americas, there were a very limited number of locations that had confirmed the presence of Zika virus. Theoretically, if there was no human movement in or out of these locations, then the pathogen would have been restricted to these isolated locations (mosquitoes are also involved in Zika virus transmission but tend not to move very far). Because there is regular movement between locations, Zika virus can spread to new locations with humans as they travel for work, leisure, etc.
Capturing human movement is now possible thanks to the huge amount of data that each of us generates using portable technology. Every time we make a phone call or post a picture on Instagram, we record our location. If someone makes a call from Bogotá today and a call from Cali tomorrow, the phone company knows that one person has traveled from one city to the other. By aggregating this information across many users, telephone companies can provide UNICEF and other stakeholders with anonymized data on human mobility patterns across an entire country in real time.
Forecasting vector-borne disease outbreaks at UNICEF
Since spring 2018, the Office of Innovation at UNICEF has been working with academic researchers at the University of Notre Dame and Boston Children's Hospital to meld epidemic modeling approaches with innovative data sets to help the most vulnerable. Together, we developed a forecasting model for Zika in Colombia's more than one thousand municipalities. In each municipality, we use a basic model that accounts for the environmentally driven nature of Zika virus transmission, and for the accumulation of population immunity as the epidemic grows, to generate incidence forecasts. Across municipalities, we aim to model the spread of Zika virus using human mobility patterns obtained from anonymized and aggregated cell phone data, like that provided by UNICEF's partner, Telefónica. To explore the real-time capabilities of this forecasting model, we iteratively fit the model, make forecasts, and update the model as new incidence data become available.
As this collaborative venture continues together with UNICEF Colombia, our goal is to develop a user-friendly interface where public health officials from the Colombian Ministry of Health and National Institute of Health can plug in surveillance data and use our technically sophisticated modeling machinery to make forecasts for future Zika epidemics. This information will be used to make faster decisions on planning and prioritizing preparedness and response actions, thereby helping prevent the spread of such epidemics. The plan is to soon extend this work to include forecasts for other vector-borne diseases currently affecting Colombia, such as dengue and malaria. These forecasting tools will become a core component of MagicBox - UNICEF’s open-source software platform that enables collaboration and the use of new data sources and computational techniques, like AI and machine learning, for good. Recent Zika epidemics in Angola and India serve as a reminder that the threat posed by Zika is, unfortunately, not over. In this regard, the results from this collaboration will have direct implications for policy and public health.
This project builds on prior research funded by a RAPID grant from the National Science Foundation and a Branco Weiss Fellowship.
by Guido España
Shiny is an R package for building interactive web-based applications. In science, a Shiny App is a great tool for sharing research results. In this post, I will walk you through the basic steps of designing and publishing a Shiny App, using part of my research as an example.
The only licensed dengue vaccine, Dengvaxia, developed by Sanofi Pasteur, has shown an issue with post-vaccination infections: it increases the risk of severe dengue in children with no prior exposure to the dengue virus. Given this negative effect, the WHO recommends Dengvaxia only for children with confirmed prior exposure to the dengue virus. In our study, we determined the public health benefits of this type of vaccination.
We modeled vaccination of children who test positive for previous dengue exposure, using a wide range of accuracies for the serological screening test. We explored this vaccination strategy at several levels of transmission intensity. Because our simulation outputs are noisy, we used a random forest model as an emulator to smooth them. With this emulator, we calculated the number of cases averted and the cost-effectiveness of vaccination programs. In this post, I will use our approximated model for the number of cases averted. To learn more about our research, see the preprint of our manuscript on bioRxiv. Also, check out our web application to explore different scenarios with the model.
In this tutorial, I will use data from our model to estimate the number of severe cases averted by vaccination over a 30-year period. I will embed this model in a Shiny App whose controls set the parameters of the model. Finally, I will describe how to publish this app online. The model and all the code from this tutorial are available to download on Github. To follow this tutorial, you will need R, along with the shiny and randomForest packages (and, later, rsconnect for publishing).
1. The model
We first load the randomForest library, set a random number seed, and load the model.
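A minimal sketch of this step (the seed value and the file name are placeholders; the actual files are in the Github repository):

```r
# Load the random forest emulator of the simulation model
library(randomForest)
set.seed(1234)                        # fix the seed for reproducibility
model_rf <- readRDS("model_rf.rds")   # hypothetical file name for the saved model
```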
Our model includes five variables: the sensitivity and specificity of the serological screening test, the vaccination coverage, the transmission intensity (the proportion of 9-year-olds with previous exposure to dengue virus), and the age of vaccination.
The number of severe dengue cases averted depends on each of these parameters. For instance, vaccination would avert 12% of cases using an 80% sensitive and specific screening test, with a coverage of 80% in a setting with 70% of 9-year-olds with previous exposure to dengue virus. Try it yourself.
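For instance, the emulator can be queried for a single scenario along these lines (the column names are assumptions and must match those used to train the model in the repository):

```r
# 80% sensitive and specific test, 80% coverage,
# 70% of 9-year-olds previously exposed, vaccination at age 9
scenario <- data.frame(sensitivity = 0.8, specificity = 0.8,
                       coverage = 0.8, exposure9 = 0.7, age = 9)
predict(model_rf, newdata = scenario)  # predicted proportion of severe cases averted
```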
2. Designing the Shiny App
The goal of our Shiny App is to allow users to estimate the benefits of vaccination by exploring different vaccination scenarios. Our Shiny App consists of two components: the user interface and the server. The user interface receives the user's input and sends it to the server for processing. In this tutorial, we will keep the two components in separate files, saved as ui.R and server.R.
2.1. The user interface
Our user interface consists of a sidebar, where users can adjust the values of the model variables, and a main section, where the results are displayed.
Our application has three sliders to specify the transmission intensity of the region to vaccinate, the coverage of the intervention, and the age of vaccination. We specify the sidebar in Shiny with the functions sidebarLayout and sidebarPanel. This creates a space in which to place the sliders. Each slider is an input that the server will later use to create the main plot. The function sliderInput links the input value to the server through an inputId. Input values can be continuous or discrete (step argument). The minimum and maximum values of the sliders correspond to min and max, respectively. The default value of the slider is given by value.
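A sketch of this part of ui.R (the labels, ranges, and first two inputIds are assumptions; input$Age and plotAverted are the names used later in this tutorial):

```r
# ui.R -- sidebar with the three sliders and a main panel for the plot
library(shiny)

sidebar <- sidebarLayout(
  sidebarPanel(
    sliderInput(inputId = "Transmission",
                label = "9-year-olds with previous exposure (%)",
                min = 10, max = 90, value = 70, step = 10),
    sliderInput(inputId = "Coverage",
                label = "Vaccination coverage (%)",
                min = 0, max = 100, value = 80, step = 10),
    sliderInput(inputId = "Age",
                label = "Age of vaccination (years)",
                min = 9, max = 16, value = 9, step = 1)
  ),
  mainPanel(
    plotOutput("plotAverted")
  )
)
```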
Finally, we put these two pieces together in a fluidPage layout and assign it to a variable named ui.
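Putting it together (the page title is a placeholder):

```r
# Assemble the full page layout
ui <- fluidPage(
  titlePanel("Severe dengue cases averted by vaccination"),
  sidebar
)
```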
2.2. The server

The server reads the input and processes it to generate the output.
In this case, the server reads the input values of the three sliders, renders a plot and assigns it to the plotAverted component of the output.
The server reads the sliders' input using the inputId. For instance, age can be accessed as input$Age. The function plot_averted_heatmap uses the random-forest model and the inputs to create the main plot. In this plot, the X-axis corresponds to the specificity of the test and the Y-axis to the sensitivity.
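A sketch of server.R (plot_averted_heatmap is defined in the repository; its argument names here are assumptions):

```r
# server.R -- map the slider inputs to the rendered heatmap
library(shiny)

server <- function(input, output) {
  output$plotAverted <- renderPlot({
    plot_averted_heatmap(model     = model_rf,
                         exposure9 = input$Transmission,
                         coverage  = input$Coverage,
                         age       = input$Age)
  })
}
```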
So far, I have created two files, one for the user interface and one for the server. The function runApp() launches the app on your local computer. Without any arguments, it looks for the files ui.R and server.R in the current directory. You can download the server.R and ui.R files from my repository on Github.
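To run the app locally:

```r
library(shiny)
runApp()  # run from the directory that contains ui.R and server.R
```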
3. Publishing the Shiny App
Up to this point, you should be able to run the app on your computer. But this is not a good way to share the app with others: to run it, your users would need to install R and all the necessary libraries. Publishing the app to the web lets you share it with a wider audience. There are many ways to host Shiny Apps online. To start, I recommend Shinyapps.io: it hosts web apps for free, and the setup is straightforward. However, the free version comes with limitations; for instance, it allows only one user at a time. An alternative is to host Shiny Apps on your own server. If you are at a university, you might even get a virtual machine from your university for free! In the next sections of this tutorial, I will describe these two options.
3.1. Publishing to Shinyapps.io

For a more detailed guide, follow the tutorials for Shinyapps.io. You need to create an account at https://www.shinyapps.io. After creating your account, go to the dashboard, click on the account icon, and select Tokens. Choose your token and click Show, then click Show Secret.
Copy the command and paste it in your R script.
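The copied command looks like this (the token and secret are placeholders shown in your dashboard):

```r
rsconnect::setAccountInfo(name   = "youraccount",
                          token  = "YOUR_TOKEN",
                          secret = "YOUR_SECRET")
```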
Now, we need to install rsconnect package in R and load the library.
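In an R session:

```r
install.packages("rsconnect")
library(rsconnect)
```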
To deploy the app, you simply run the deployApp() command after setting your working directory to the app directory. Your app should then be live at https://name.shinyapps.io/appname.
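For example (the path is hypothetical):

```r
library(rsconnect)
setwd("~/projects/dengue-vaccine-app")  # directory containing ui.R and server.R
deployApp()
```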
3.2. Setting up your own server
With Shiny Server you can host and manage your web application on your own server. Currently, Shiny Server is only supported on Linux systems. Detailed instructions can be found in the Shiny Server user's guide. The following steps are run on the command line, so log in to your server through SSH and continue below.
You first need to install R and the shiny package. This can vary depending on your system; on Ubuntu you can use apt or apt-get.
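For example, on Ubuntu (package names can vary by release):

```sh
sudo apt-get update
sudo apt-get install r-base
```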
You can specify a secure default CRAN mirror in the file ~/.Rprofile
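For example:

```r
# ~/.Rprofile -- use an HTTPS CRAN mirror by default
options(repos = c(CRAN = "https://cloud.r-project.org"))
```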
Now, we need to install shiny. For this you can start an R session and type:
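```r
install.packages("shiny")
```

Note that Shiny Server runs apps under its own user, so the shiny package may need to be installed into a system-wide library (for instance, by running R with sudo).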
On Ubuntu you can install Shiny Server using gdebi. For more information on this step, follow the instructions on the shiny-server download page.
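A sketch of the install on a 64-bit Ubuntu system (the version number and URL are illustrative; check the download page for the current ones):

```sh
sudo apt-get install gdebi-core
wget https://download3.rstudio.org/ubuntu-14.04/x86_64/shiny-server-1.5.9.923-amd64.deb
sudo gdebi shiny-server-1.5.9.923-amd64.deb
```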
Before starting the server, you should edit the shiny-server configuration file to indicate the port for the server to listen on and the directory of your app. This configuration file is located at /etc/shiny-server/shiny-server.conf. Change the location of your app in site_dir; by default, shiny-server points to the /srv/shiny-server directory.
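A minimal configuration looks roughly like this:

```
# /etc/shiny-server/shiny-server.conf
run_as shiny;

server {
  listen 3838;                       # port the server listens on

  location / {
    site_dir /srv/shiny-server;      # change this to your app's directory
    log_dir /var/log/shiny-server;
  }
}
```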
Now, we are ready to start our server. Simply type the command:
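```sh
sudo systemctl start shiny-server   # on older init systems: sudo start shiny-server
```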
To stop or restart the server, type:
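```sh
sudo systemctl stop shiny-server
sudo systemctl restart shiny-server
```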
That's about it. Now, your Shiny App should be online. Visit your URL and make sure things are working as they should. If you need more help, the Shiny Server user's guide and the Shinyapps.io tutorials mentioned above are good places to start.
by Alex Perkins
I’m writing to highlight some work that lab postdoc Amir Siraj just published in a paper in BMJ Global Health about the potential risk of Zika virus (ZIKV) infection in Asia. In this paper, Amir applied methods to 15 countries in Asia that we originally developed in another paper to assess the population at risk of ZIKV infection in the Americas early in the epidemic there.
Some of the key findings from this new work include the following.
There are a lot of people in Asia, and a lot of them would be at risk of Zika virus infection IF they were all susceptible and a widespread epidemic occurred there. Ever since we first made our projections for the Americas, we have been interested in making similar projections for Asia. Both regions are inhabited by lots of people and are hotbeds of transmission for dengue virus, which is similar in many ways to ZIKV. For the Americas, our projections indicated that as many as 93 million people could become infected before the epidemic would burn out through the buildup of herd immunity. For Asia, Amir’s projections indicate that as many as 785 million would become infected, approximately eight times the equivalent figure for the Americas! On the one hand, the fact that this number is larger for Asia than the Americas is not surprising given how much larger the population is in the Asian countries we considered. On the other hand, it turns out that this projection is also quite a bit higher in Asia on a per capita basis than it is in the Americas. This suggests that, were ZIKV or some other ZIKV-like arbovirus to ever cause a widespread epidemic in Asia, it could be larger in scale than what we recently saw with ZIKV in the Americas.
Pre-existing immunity is clearly a dominant mitigating factor in the risk that Zika virus poses to Asia, but the interaction between pre-existing immunity and other factors is complicated. Although aggregate projections of millions of people at a regional scale are simplest to think about, we have always felt that the real value in our approach is in the projections it makes at a local level. After all, our methods do not account for spatial processes and are actually projections of epidemic size conditional on there being an epidemic in the first place. This is important to keep in mind because, while a continent-wide Zika epidemic is very unlikely to ever happen in Asia due to substantial pre-existing immunity (reviewed in this paper), local epidemics have happened in Asia and will continue to happen there over time. The situation we imagine in Asia is that, at some point in the past, many areas probably experienced a Zika epidemic that attained something along the lines of the size we projected, but since then the proportion immune to ZIKV infection has been declining as people alive during previous epidemics have died and others have been born. In Amir’s new paper, he applies some theoretical ideas to argue that a given level of pre-existing immunity should have a disproportionately large influence on reducing epidemic size in populations with relatively low transmission potential. Unfortunately, that means that populations with the highest transmission potential could still be relatively vulnerable to future epidemics, despite the presence of pre-existing immunity.
Now that there are a few empirical estimates of epidemic size from the Americas, it looks like our original projections are holding up fairly well. There is still relatively little information about how many people were infected by ZIKV in the epidemic in the Americas, but four local estimates have been published that can be compared with our projections. One appears spot on, two are within the range of uncertainty, and another was quite a bit higher than our projection; on average, our projections tended to be a little lower than the empirical estimates. We will be very interested to compare our projections with additional empirical estimates as they are reported, but for the time being we feel that this result reinforces the value of our projections for the Americas in the absence of more comprehensive data.
In addition to summarizing the key results, there are a couple of other important things to note.
While these results do help advance understanding of the population at risk of ZIKV infection in Asia to some degree, they should not be viewed as predictions of what we think will actually happen. For one thing, the Zika epidemic in the Americas and elsewhere has slowed down considerably over the last year. In fact, the bigger challenge at this stage is figuring out which areas will have any ZIKV infections so that vaccine trials can take place. In other work recently posted as a preprint on bioRxiv, we are trying to figure that out. In addition, the World Health Organization declared an end to the Zika Public Health Emergency of International Concern several months ago. Our feeling is that they were right to do so given how the immediacy of the situation has diminished but the long-term concern has solidified, especially given how little we know about what actually happened over the last few years in the Americas (more in this paper on that).
If this new paper does not reflect what we think will actually happen, then what good is it? Just because we do not think that these projections indicate what will actually happen does not mean that they cannot be useful. One way that these projections can be applied is to facilitate more realistic projections for a specific location following a serological survey. Another is to identify areas with the highest risk of a large epidemic, which could then be prioritized as targets for surveillance efforts or serological studies. Yet another is to provide early projections for this region in the event of a newly emerging disease with characteristics similar to Zika. This study alone will not answer all of the many questions about Zika in Asia, but nor will most other studies if considered in isolation. In the spirit of facilitating these and other applications that we cannot foresee, the code underlying this work and the detailed projections themselves are freely downloadable at http://github.com/asiraj-nd/zika-asia.
by Rachel Oidtman
Here, we will be hosting a (roughly) monthly blog, written by various members of the Perkins Lab, with commentary on conferences, teaching methods, random musings in the world of disease modeling / ecology / statistics / epidemiology, and more.
Once Alex and I agreed that a lab blog would be fun, and would not take time away from our research, we decided to have a lab meeting to see what the rest of the lab thought. During the lab meeting, we agreed that the goal of our blog is to set up a forum for communicating topics in our area of research (mathematical modeling of infectious disease) and topics closely related to it. Another goal is to disseminate other interesting ideas to people both inside and outside of the academic world. Although we do hope the blog will serve as a tool for outreach (and we will write posts geared more toward outreach), that is not our primary goal in starting this blog.
Housed in an integrated Biological Sciences department, members of our lab have backgrounds that range from ecology to statistics and mathematics to engineering and geography. This diversity allows us to have both diverging and converging opinions, which we believe will lead to a constructive, interesting, and fun blog. With a different member of the lab contributing each month, you will get a taste of our interests, personalities, and writing styles. At the same time, we will aim for consistent quality of writing by having one other member of the lab peer review a blog post before we share it.
We are excited to start this venture, and hope you enjoy what’s to come.
Check out the slides from our lab meeting for more background on our motivations and goals.