The possibilities unleashed by the rapidly expanding ability to organize and analyze biological data continue to spin off new and fascinating disciplines almost weekly:
- Genome annotation, to mark biological features in a DNA sequence
- Comparative genomics, to establish relationships between genes across different, often distantly related, organisms
- Structural bioinformatics, to predict protein structure and function from the genetic sequence
Many scientists believe that the most spectacular developments will occur in the next fifty years or so, propelled, no doubt, by students entering data science master’s programs today.
Big Data, Big Discoveries in Genomics: From Mendel to the Human Genome Project
Many of the biggest breakthroughs in bioscience have come courtesy of an intensive study of large data sets. Anyone who has sat through a high school biology class can tell you that Gregor Mendel is the father of modern genetics. Some people might vaguely recall that his famous pea plant experiments involved a rigorous, multi-generational analysis of almost 30,000 individual plants. Very few, though, take the time to reflect on the enormous task of collating and analyzing all that information by hand, as Mendel was forced to do in the years leading up to 1863.
More recent discoveries have come from even larger data sets and more complex analysis—far beyond the capacity of even so dedicated a scientist as Mendel, or of any human being in a single lifetime for that matter. Today, the ability to compile those massive sets of data and to develop methods to find the statistical relationships within them is entirely in the hands of data scientists using ever more advanced DNA sequencers and ever more powerful computers.
The advances we’re seeing in biotech would not have occurred without massive computational processing power and the contributions of professional data scientists.
Consider this: DNA (deoxyribonucleic acid) was first identified in 1869, and its molecular structure was first deciphered in 1953. The first full DNA genome was sequenced in 1977. That genome belonged to a simple bacteriophage, and the sequence contained slightly more than 5,000 base pairs. In 2003, the Human Genome Project completed the first human genome sequence, which comprises more than 3 billion nucleotides.
The jump from 5,000 to 3 billion was made possible by Applied Biosystems’ automated DNA sequencer, first developed in 1986. Even with modern computing power in play, analysis of the 3 billion nucleotides identified with that 1980s-era technology is expected to continue for decades.
That single human genetic sequence, in purely informational terms, boils down to about three gigabytes of data. By modern standards that is not an enormous amount: a little larger than your favorite movie encoded as a high-definition video file. Yet it took those early automated sequencers 13 years to produce that data.
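The three-gigabyte figure can be checked with a little arithmetic. The sketch below assumes one byte per nucleotide, as in a plain-text file of the letters A, C, G and T, and, for contrast, a packed two-bit encoding, since four bases need only two bits each. Both encodings, the function name, and the use of decimal gigabytes are assumptions made for illustration.

```python
# Rough storage estimate for one human genome sequence.
# Assumption: about 3 billion nucleotides, as stated in the text.
NUCLEOTIDES = 3_000_000_000
BYTES_PER_GB = 1_000_000_000  # decimal gigabytes


def genome_size_gb(nucleotides: int, bits_per_base: int) -> float:
    """Return the storage needed, in gigabytes, for a sequence of this length."""
    total_bits = nucleotides * bits_per_base
    return total_bits / 8 / BYTES_PER_GB


# One byte (8 bits) per base, e.g. one text character per nucleotide.
text_gb = genome_size_gb(NUCLEOTIDES, bits_per_base=8)

# Two bits per base is enough to tell four bases apart.
packed_gb = genome_size_gb(NUCLEOTIDES, bits_per_base=2)

print(f"Plain text:   {text_gb:.2f} GB")
print(f"Packed 2-bit: {packed_gb:.2f} GB")
```

Under the one-byte-per-base assumption the result matches the roughly three gigabytes quoted above; the packed form is a quarter of that.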
Consider the fact that as of 2007 the cost of decoding a single genome was in the neighborhood of $10 million, while less than ten years later, in 2016, we’re fast approaching the $1,000 price point, and the industry believes it is not far from being able to sequence an entire genome in a matter of hours.
As our ability to sequence genomes increases at a rate fast enough to make Moore’s head spin, the loftiest hopes of the Human Genome Project are coming within reach. According to the Human Genome Project, most of the major practical advances in the use of genetic data to improve the lives and longevity of individuals are expected to come from analyzing the DNA of those very individuals. And there are 7.4 billion of us.
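The comparison with Moore’s law can be made concrete by asking how often the cost of sequencing would have to halve to fall as far as it has. The sketch below assumes a steady exponential decline and a two-year doubling period for Moore’s law, both simplifications. The 2007 starting cost of about $10 million comes from the text; the end prices are illustrative inputs that can be swapped for whatever price point you consider realistic, and the function name is hypothetical.

```python
import math

MOORE_DOUBLING_YEARS = 2.0  # assumption: commonly quoted Moore's law period


def halving_time_years(start_cost: float, end_cost: float, years: float) -> float:
    """Average time for the cost to halve, assuming a steady exponential decline."""
    halvings = math.log2(start_cost / end_cost)
    return years / halvings


START_COST = 10_000_000  # approximate 2007 cost per genome, from the text
YEARS = 9                # 2007 to 2016

# Illustrative end prices; substitute your own.
for end_cost in (1_000, 100):
    t = halving_time_years(START_COST, end_cost, YEARS)
    print(f"${end_cost:,}: cost halves about every {t:.2f} years, "
          f"{MOORE_DOUBLING_YEARS / t:.1f}x the pace of Moore's law")
```

With either end price the halving time comes out well under a year, several times faster than the two-year rhythm assumed for Moore’s law.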
A Healthcare Revolution: Disease Prevention at the Genetic Level
Today’s third-generation sequencers can produce up to 60 gigabytes of data a day at a cost of around $70 per gigabyte. At those rates, enormous amounts of data can be produced – perhaps holding the key to such grails as the origins of and cures for fatal diseases, the secrets of aging, and the roots of human brain function.
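To get a feel for those rates, the figures can be multiplied out. This is a back-of-the-envelope sketch that takes the 60 gigabytes a day and $70 per gigabyte from the text and assumes, unrealistically, that a single sequencer runs at peak output every day of the year. The function names are illustrative.

```python
GB_PER_DAY = 60       # peak daily output, from the text
COST_PER_GB_USD = 70  # approximate cost per gigabyte, from the text


def daily_cost_usd(gb_per_day: float, cost_per_gb: float) -> float:
    """Cost of one day of sequencing at the given output and unit cost."""
    return gb_per_day * cost_per_gb


def yearly_output_tb(gb_per_day: float, days: int = 365) -> float:
    """Data produced in a year, in decimal terabytes, at constant daily output."""
    return gb_per_day * days / 1_000


cost = daily_cost_usd(GB_PER_DAY, COST_PER_GB_USD)
output = yearly_output_tb(GB_PER_DAY)

print(f"Daily cost at peak output: ${cost:,.0f}")
print(f"Yearly output per sequencer: {output:.1f} TB")
```

Even under these simplified assumptions, one instrument generates on the order of tens of terabytes a year, which is the volume of data the analysis has to keep up with.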
And someone is needed to unlock all the mysteries buried in that information: data scientists.
Unlike in Mendel’s day, when squinting at plants and taking careful notes generated the information that would be studied, today advanced instruments handle much of the grunt work of data collection and collation. Yet the data collected is still worthless without human analysis. Data scientists are responsible for determining the rules of collection, the methods of study, and the interpretation of results.
The challenge is enormous but the benefits are incalculable.
The ability to detect risk factors in individual genetic data is already allowing some fortunate people to take steps to prevent or avoid serious chronic disease, and scientists are already looking into the possibility of directly editing DNA to eliminate risks at the genetic level.
Future advances will fuel improvement in human lives through the development of new drugs, improvements in sensor technology, and the detection of trigger factors for diseases, which today still lie cloaked in an impenetrable stew of protein chains.
Data Science in Biotechnology: Beyond Genomics
Over the next few years, graduates of data science master’s programs can expect to be at the forefront of world-changing advances in biotechnology. The sequencing and analysis of individual genomes have become the face of data science in biotechnology, but they are only the start.
Bioscience will shape agricultural development and sustainability in a world where expanding populations put increasing demands on the food supply, and data analysis will allow biologists to comb through millions of seeds in search of biodiversity. Efforts like the National Institutes of Health’s BRAIN Initiative will drive a deeper understanding of neurological disorders and the function of the human mind, perhaps even revealing the very basis of human consciousness.
The collection and use of Big Data has become more the norm than the exception in modern life, and there are opportunities for exciting data science careers in almost every industry. But few fields both drive data science forward and are driven forward by it to the extent that biotechnology is.