It could be said that the modern field of epidemiology owes its very existence to data science.
In August of 1854, the city of London found itself swept by a terrible epidemic of cholera. Over the course of three days, following a series of lesser outbreaks, 127 people died, mostly in the area of Broad Street. Within a week, 500 were dead and 75 percent of residents had fled the neighborhood.
A physician named John Snow thought it was suspicious that the deaths were so highly concentrated on Broad Street. Popular theory at the time attributed the transmission of cholera to an airborne vector, but Snow found anomalies that caused him to question the idea. No workers at the nearby Broad Street Brewery contracted the disease, for example.
Plotting the homes of the afflicted and using statistical calculations, Snow identified the most likely source of infection as the nearby Broad Street water pump. He prompted the local authorities to remove the pump handle. The outbreak stopped.
And the beer-drinking brewers who also used the Broad Street pump as their water source? Boiling the water as part of the brewing process killed the bacteria.
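The mapping step in Snow’s investigation, attributing each death to the closest water pump and seeing which pump stands out, can be sketched in a few lines of Python. This is a simplified illustration rather than a reconstruction of Snow’s actual calculations: the pump names other than Broad Street, all coordinates, and the use of straight-line distance are assumptions made up for the example.

```python
from collections import Counter
from math import hypot

# Hypothetical pump locations on an arbitrary grid (not historical coordinates).
PUMPS = {
    "Broad Street": (0.0, 0.0),
    "Pump B": (4.0, 1.0),
    "Pump C": (-3.0, 4.0),
}

# Hypothetical home locations of cholera victims (illustrative only).
DEATHS = [(0.5, 0.2), (-0.4, 0.9), (1.0, -0.6), (0.2, 1.1), (3.5, 1.2), (-2.0, 3.0)]


def nearest_pump(home, pumps):
    """Return the name of the pump closest to a home by straight-line distance."""
    return min(
        pumps,
        key=lambda name: hypot(home[0] - pumps[name][0], home[1] - pumps[name][1]),
    )


def deaths_per_pump(deaths, pumps):
    """Count how many deaths fall nearest to each pump."""
    return Counter(nearest_pump(home, pumps) for home in deaths)


counts = deaths_per_pump(DEATHS, PUMPS)
likely_source, death_count = counts.most_common(1)[0]
print(f"Most deaths cluster around: {likely_source} ({death_count} of {len(DEATHS)})")
```

A real analysis would also need to account for how many people live near each pump, since a crowded street produces more deaths even when the risk per person is the same.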
Snow’s investigative methods closely resembled the methodical approach a data scientist might take when faced with the same problem. Epidemiologists and the modern public health apparatus adopted both Snow’s methods and his motives. But before the computer age, their ability to analyze the information they gathered ran into major obstacles.
Today, the field of epidemiology is a refined science, but it’s in dire need of more master’s-prepared data scientists to help power it.
Fast Access to Accurate Analysis Helps Epidemiologists Save Lives
Modern data science offers epidemiologists a marked advantage in the single most important factor in tracking the spread of an outbreak: speed. Every day that data sits in storage without being processed can cost lives, as described in the Harvard School of Public Health Magazine.
However, it’s not just the speed with which multiple data sources can be integrated and analyzed with data science techniques. It’s also the accessibility of the resulting analyses. Only rarely do public health officials perform their most critical work in the safe and cloistered confines of offices, with powerful computers and high-speed Internet access at their fingertips.
Using the Ubiquity of Smart Phone Technology to Contain Outbreaks
More often, they find themselves in the field, at hospitals or ad-hoc testing and diagnostic sites, or interviewing suspected carriers or victims in their own homes. By taking the results of data analysis and offering them immediately via mobile applications in easily digested visual formats, data scientists are getting the most up-to-date data into the hands of the professionals who can use it to the greatest advantage.
During the 2014 Ebola outbreak in West Africa, authorities rapidly developed the Ebola-Info-App for mobile phones, pushing it out to health care providers on the ground in the region. Available in both French and English (the medical lingua franca), the app provided key information about the spread of infection to help coordinate the response among the many agencies working to contain the outbreak.
Getting Ahead of Disease Progression with Predictive Analysis
Traditionally, public health workers have been able to work only retrospectively, acting on information that comes in after crises are already underway. It takes time to draw blood, get lab results, and compile and interpret the information. Epidemiologists are sometimes days behind the curve when trying to assess the spread of disease out in the field.
But data scientists are changing all that. By digging into disparate sets of data, they are uncovering unexpected correlations that allow public health officials to work more proactively.
MINEing Data for Epidemiology … Not the Mine You’re Thinking Of
In one example, a tool named MINE (Maximal Information-based Nonparametric Exploration) was used to crunch through a World Health Organization database containing more than 200 potential correlative and causal relationships from 20 different countries. The analysis identified previously unknown relationships between female obesity and income level.
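The general idea behind this kind of exploration, scoring every pair of variables for dependence and ranking the results, can be sketched as follows. This is not the MINE algorithm itself: MINE’s statistic searches over many grid resolutions, while this sketch uses plain mutual information on one fixed grid of equal-frequency bins. The variable names and the synthetic data are assumptions invented for illustration.

```python
import random
from collections import Counter
from itertools import combinations
from math import log2


def rank_bins(values, n_bins):
    """Assign each value to one of n_bins equal-frequency bins by rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * n_bins // len(values)
    return bins


def mutual_information(xs, ys, n_bins=4):
    """Estimate mutual information (in bits) between two binned variables."""
    bx, by = rank_bins(xs, n_bins), rank_bins(ys, n_bins)
    n = len(xs)
    joint = Counter(zip(bx, by))
    px, py = Counter(bx), Counter(by)
    return sum(
        (c / n) * log2(c * n / (px[a] * py[b])) for (a, b), c in joint.items()
    )


# Synthetic columns: one nonlinear relationship and one unrelated variable.
rng = random.Random(0)
x = [rng.uniform(-1, 1) for _ in range(400)]
columns = {
    "x": x,
    "x_squared": [v * v for v in x],  # depends on x, but not linearly
    "noise": [rng.uniform(-1, 1) for _ in range(400)],  # independent of x
}

# Score every pair of columns and rank them by estimated dependence.
scores = {
    (a, b): mutual_information(columns[a], columns[b])
    for a, b in combinations(columns, 2)
}
best_pair = max(scores, key=scores.get)
for pair, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(pair, round(score, 3))
```

The pair with the curved relationship rises to the top even though a simple linear correlation between those two columns would be close to zero, which is the kind of pattern a nonparametric scan is meant to surface.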
Armed with such tools, public health workers and doctors can anticipate problems before people ever engage directly with the health care system, and work to correct those problems before they become chronic.
And as with other fields where data science has become common, investigations in public health are starting to become driven by analysis, rather than the investigation driving the analytical process. When data collection and analysis were difficult and time-intensive, scientists had to generate a hypothesis and then work specifically toward proving or disproving it with dedicated data gathering programs.
Today, rapid and often visual exploration of data can lead to new hypotheses directly, taking researchers in directions they might never have considered previously.
New Sources of Data Broaden Options for Public Health Officials
Modern technology isn’t just providing new ways to look at existing data. It’s also affording doctors and scientists new sources of data for investigating public health problems.
The most famous example of this might be Google Flu Trends. By analyzing search trends, data scientists were able to plot flu outbreaks well before they became rampant. When users made search queries about symptoms that were not yet serious enough to bring to the attention of medical professionals, they were actually giving clues that indicated they were in the early stages of the flu. This innovative approach to disease tracking provided a two-week lead on traditional influenza outbreak tracking.
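One simple way to measure how far an early signal runs ahead of official reporting is to shift one series against the other and see which shift lines them up best. The sketch below does that with a lagged correlation. It is not the method Google used, and the weekly numbers are synthetic: the "official" series is just the search series delayed by two weeks, built that way for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def best_lead(signal, reference, max_lag):
    """Find the lag (in time steps) at which signal best predicts reference."""
    scores = {}
    for lag in range(max_lag + 1):
        early = signal[: len(signal) - lag]  # signal, trimmed at the end
        late = reference[lag:]               # reference, shifted forward by lag
        scores[lag] = pearson(early, late)
    return max(scores, key=scores.get), scores


# Synthetic weekly counts (illustrative only, not real surveillance data).
search_queries = [1, 2, 4, 8, 15, 25, 30, 22, 12, 6, 3, 2, 1, 1]
official_cases = [1, 1] + search_queries[:-2]  # same curve, two weeks later

lead_weeks, lag_scores = best_lead(search_queries, official_cases, max_lag=4)
print(f"Search activity leads official counts by about {lead_weeks} weeks")
```

Real search data is far noisier than this, so a correlation peak on its own would not be enough to trust a signal without validating it against later seasons.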
Using Clues Left in Social Media and Other Sources to Identify Risk Factors
In the future, data scientists might look not only at such inadvertent self-reporting, but also at other information posted publicly to social media or other sources that describes risk factors rather than actual infection. During the 2014 Ebola outbreak in West Africa, the International Telecommunication Union rapidly developed methods to track movements using call data records in the outbreak area, helping epidemiologists build spread models to anticipate where the disease would head next.
All these different data sources that are now ready for comparison present a challenge to data scientists, however. Researchers and public health agencies have been collecting data for decades and sometimes centuries. The formats they use to store the information are often wildly incompatible, and the accuracy of older data is sometimes suspect.
How Data Science Will Contribute to Cancer Prevention
As described in the Harvard School of Public Health Magazine, a new effort dubbed “Investigation-Study-Assay” aims to help fix this issue. ISA offers a common data model with metadata attributes for tagging that can be used to store almost any sort of health-related information. A software suite allows data to be imported and converted between the ISA format and a variety of popular database systems.
Already, the system has paid off, giving investigators the tools to discover a relationship between cancer and stem cell data that allowed them to identify the genes responsible for certain cancers.
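The idea of a common data model with metadata tags can be illustrated with a small nested structure: an investigation contains studies, and each study contains assays. The sketch below is an assumption-laden illustration, not the real ISA schema or its software suite. The field names, the legacy column names, and the sample record are all hypothetical.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class Assay:
    measurement_type: str
    technology: str
    metadata: dict = field(default_factory=dict)  # free-form tags


@dataclass
class Study:
    title: str
    assays: list = field(default_factory=list)
    metadata: dict = field(default_factory=dict)


@dataclass
class Investigation:
    title: str
    studies: list = field(default_factory=list)
    metadata: dict = field(default_factory=dict)


def from_legacy_row(row):
    """Convert one flat legacy record (hypothetical columns) to the nested model."""
    assay = Assay(
        measurement_type=row["measurement"],
        technology=row["platform"],
        metadata={"units": row.get("units", "unknown")},
    )
    study = Study(title=row["study_name"], assays=[assay], metadata={"year": row["year"]})
    return Investigation(title=row["project"], studies=[study])


# A made-up record in an older, flat format.
legacy_record = {
    "project": "Example cancer project",
    "study_name": "Example stem cell study",
    "measurement": "gene expression",
    "platform": "microarray",
    "year": 1998,
}

investigation = from_legacy_row(legacy_record)
as_json = json.dumps(asdict(investigation), indent=2)
print(as_json)
```

Once records from incompatible sources share one shape like this, they can be compared and queried together, and missing details such as the units above are flagged explicitly instead of silently lost.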