How Big Data Came to Be Bigger than Big Oil Almost Overnight

Ten years ago you would have been hard pressed to find a degree or certificate program containing the phrase “data science” anywhere. Today there are no fewer than 170 college degree and certificate programs in data science, one of the fastest-growing fields.

Data Science Back in the Day

It’s not exactly true that data science didn’t exist 10 years ago. From a certain perspective, data science is as old as applied statistics. What is drastically new is the computing power, the data storage capability, and the means of organizing data.

Dr. Kapoor describes it as a broad field made up of four components: data, programming, statistics, and operations research.

These components aren’t exactly new. Dr. Kapoor explains that 10 years ago at his university they had an MBA concentration in what was then called business intelligence and eventually came to be called business analytics. This covered subjects like applied statistics and modeling. Indeed, many components of what has become known as data science have been around for decades.

By looking back at the tremendous growth of data science’s seed programs – things like statistics and business analytics – we can track how data science emerged as the field we know today.

In 2016 the data visualization company Tableau published a landmark study that cataloged the emergence of new data science seed programs across the nation. It starts by noting the growth of new programs in statistics, which two decades later gave way to a surge in business analytics/intelligence programs and the emergence of data science as a full-fledged field in its own right. As you follow this progression, keep in mind how technology was also developing during these periods:

1980-1989: The number of statistics programs offered at universities grows by 90%. Programs in decision sciences and business analytics are in their infancy, while programs in data science and data mining are non-existent.
1990-1999: Statistics still accounts for the vast majority of data science seed programs at 77%. Programs in informatics make the next strongest debut at around 20% of new programs, while a few programs in business analytics have held on.
2000-2009: New programs in informatics catch up with new statistics programs, with each accounting for about 41% of the data science seed programs. The number of new business analytics programs has grown since the previous decade, and programs in data mining and analytics have come into existence and are starting to take hold.
2010-present: Business analytics/intelligence programs have gone from representing a small portion of degrees in the field to surpassing statistics and analytics, accounting for 32% of all programs in the field, more than any other niche. Statistics makes up 20% of new programs while informatics makes up 17%. More traditional analytics and data analytics programs still account for a significant percentage of programs, but they have already been eclipsed by a new field that has appeared in the past decade and already accounts for 13% of new programs: data science.

And the rest is history. There’s no doubt that if we measured the emergence of new data science programs in the last half of this decade it would eclipse all of its previous seed programs.

The critical component of data science that is not tracked on the progression above is technology. Mass data collection/storage and computing power, the remaining critical components data science needed, made exponential strides between 1980 and the present. These ingredients combined at the right time to be baked into the delicious cake we know today as data science.

But while the emergence of data science has been fascinating to watch, this still doesn’t explain exactly why this field is so lucrative.

The Monetization of Big Data

You don’t have to look any further than Facebook to know the single-biggest way data is monetized is through targeted advertising. We might take this for granted now. Today in the back of our minds we know data about us and our behavior is being gathered every time we use Facebook, search for something on Google, stream a video on YouTube or shop for something on Amazon.

But the techniques developed by those companies’ proto data scientists are recent innovations, only a handful of years old, and they are being improved on almost daily. As Dr. Kapoor explained:

“Data science starts with collecting data and storing them. Google and Amazon gave us the fantastic means to store data in a way that it can be accessed very fast. You can parse it. You can analyze it. The techniques that these few companies gave us opened up all the possibilities for being able to take large volumes of data – even if it is scattered in different places – and be able to retrieve the information that you need from all different locations. You have the information – you parse that information that you need – and you’re able to analyze that very quickly because the processing power is very fast now.”
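
The pattern Dr. Kapoor describes, parsing data where it sits and then combining only what you need, can be sketched in a few lines of Python. This is a toy illustration, not how Google's or Amazon's systems actually work: the three in-memory lists stand in for separate storage locations, and the log lines, the `parse_searches` function, and the `merge` function are all made up for the example.

```python
from collections import Counter

# Hypothetical "locations": each list stands in for logs stored on a separate machine.
shards = [
    ["search:shoes", "click:ad_17", "search:boots"],
    ["search:shoes", "view:video_3"],
    ["click:ad_17", "search:shoes", "search:tents"],
]

def parse_searches(shard):
    """Parse one location's raw log lines and count only the search terms."""
    counts = Counter()
    for line in shard:
        kind, _, value = line.partition(":")
        if kind == "search":
            counts[value] += 1
    return counts

def merge(partial_counts):
    """Combine the small per-location summaries into one overall tally."""
    total = Counter()
    for counts in partial_counts:
        total.update(counts)
    return total

search_counts = merge(parse_searches(shard) for shard in shards)
print(search_counts.most_common(1))  # [('shoes', 3)]
```

Each location is parsed independently, so in a real system that step could run on many machines at once; only the compact per-location counts need to travel to wherever the final tally is computed.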

The revolution in the monetization of data came when tech and social media companies figured out how to pair a user’s personal information and self-generated content with advertising. That may sound simple. But doing it on a massive, thorough, instantaneous, automated scale has been possible only in the last several years, and only because of the field of data science and, specifically, its algorithms.

Twenty years from now, data scientists will look back on the algorithms we are using today as relatively primitive. Many of today’s fundamental algorithms for data mining have only been seriously implemented in the last five years:

Artificial neural networks have taken off only since 2009, when computer scientists like Jürgen Schmidhuber made breakthroughs that were eventually incorporated into Google algorithms in 2015.
BIRCH, the first algorithm able to identify and group like items (cluster analysis) while effectively filtering out noise, was introduced in 1996 and was recognized for its lasting reliability only in 2006, when it received the SIGMOD Test of Time Award.
2011 marked the first time that algorithms were created that accurately develop hypotheses (Bayesian interpretation) to explain relationships between related groups of variables (regression analysis).
Fundamental machine learning techniques like predictive classification of variables into decision tree models (gradient boosting) developed by people like Leo Breiman have only started to be perfected and applied widely in the last decade.
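
To make the last item more concrete, here is a minimal sketch of the idea behind gradient boosting: fit a very simple decision tree to the current errors, add a small fraction of its prediction to the model, and repeat. It is an illustration under stated assumptions, not a production implementation. It does regression with squared-error loss rather than classification, uses one-split trees ("stumps"), and runs on made-up toy data; all function names are hypothetical.

```python
def fit_stump(xs, residuals):
    """Find the single threshold split on x that best fits the residuals."""
    best = None
    # Candidate thresholds exclude the largest x so both sides are non-empty.
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    return best[1:]  # (threshold, left_mean, right_mean)

def predict_stump(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean

def fit_boosted(xs, ys, rounds=50, learning_rate=0.1):
    """Start from the mean, then repeatedly fit a stump to the remaining errors."""
    base = sum(ys) / len(ys)
    predictions = [base] * len(xs)
    stumps = []
    for _ in range(rounds):
        # For squared-error loss, the negative gradient is just the residual.
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        predictions = [p + learning_rate * predict_stump(stump, x)
                       for p, x in zip(predictions, xs)]
    return base, stumps, learning_rate

def predict(model, x):
    base, stumps, learning_rate = model
    return base + learning_rate * sum(predict_stump(s, x) for s in stumps)

def mean_squared_error(model, xs, ys):
    return sum((predict(model, x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Toy data (made up): low values for small x, high values for large x.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.0, 1.2, 0.9, 1.1, 3.0, 3.2, 2.9, 3.1]

model_small = fit_boosted(xs, ys, rounds=1)
model_large = fit_boosted(xs, ys, rounds=100)
print(mean_squared_error(model_small, xs, ys))
print(mean_squared_error(model_large, xs, ys))
```

The second printed error is smaller than the first: each additional round corrects a little more of what the earlier rounds got wrong, which is the core of the technique.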

It’s with new algorithmic developments like these that data science has enabled just two companies – Google and Facebook – to capture all of the $32.7 billion growth in digital advertising spending in just the first half of 2016.

That illustrates one of the reasons why this field is so exciting. Because data science is still in its relative infancy, we can expect further radical changes in direction and developments that will revolutionize the field in ways we can’t even conceive of now.
