Analysis of vast amounts of data is proving an enormously powerful tool in understanding and predicting trends in everything from online shopping to healthcare. Some of the insights gleaned from that analysis are surprising and even counter-intuitive, but sometimes they are misunderstood by people outside the field.
As with any highly technical trend, the media tend to inflate some of the capabilities of big data analytics. Headlines regularly report astounding findings generated from massive data sets, but a lot of it seems like a stretch…
- A woman’s bra size is correlated to online spending
- Taking the train is the happiest way to commute to work
- Walmart is the Missed Connections capital of the United States
- Used orange cars are less likely to be lemons
But data scientists know that these correlations are not necessarily predictive, or even particularly unusual. They may, however, be very useful simply by demonstrating connections that occur just slightly more often than pure chance would predict. The biggest challenge in cutting-edge big data analysis today might be finding exactly where that line is drawn.
Spurious Correlation or Hidden Gem? … Sorting the Wheat From the Chaff in Data Analysis
The truth is that there are a lot of correlations that can be identified in very large data sets, and not all of them are predictive or even meaningful. Much can depend on the data set itself, how it was collected, and how the analysis is framed. Spurious correlations, connections that show up in the data without reflecting any real causal relationship, are one of the major potential pitfalls of big data analysis.
Big data is particularly susceptible to this sort of problem because the rise in the number of variables being examined also leads to a rise in noise: data that ultimately has no relationship to the focus of the investigation. When that noise is not itself random, it can masquerade as a genuine deviation from pure chance. And because data collection mechanisms force choices at some point, whether made by the data scientist or someone else, the noise in big data sets is frequently non-random.
To paraphrase Nate Silver, editor-in-chief of FiveThirtyEight, a website pioneering data-driven reporting: more data means more opportunities to be wrong.
As data sets grow larger and the number of variables climbs, the odds of finding spurious correlations rise. Guarding against being distracted or deceived by these false correlations is a basic rule of data analysis, but it's one that is being broken more and more frequently.
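The multiple-comparisons arithmetic behind that warning is easy to see in a quick simulation. The sketch below is purely illustrative and not drawn from any study mentioned here; the 1,000 observations, the 0.08 cutoff, and the variable counts are arbitrary assumptions. It correlates a random "outcome" against growing piles of equally random "predictors" and counts how many clear a naive threshold by chance alone.

```python
import numpy as np

# Illustrative simulation (hypothetical numbers): pure-noise predictors
# still yield "impressive" correlations once there are enough of them.
rng = np.random.default_rng(0)

n_rows = 1_000                          # observations in the data set
outcome = rng.normal(size=n_rows)       # the quantity we pretend to predict

for n_vars in (10, 100, 1_000):
    # candidate predictors that are, by construction, unrelated noise
    noise = rng.normal(size=(n_rows, n_vars))

    # correlation of each noise column with the outcome
    corrs = np.array([np.corrcoef(noise[:, j], outcome)[0, 1]
                      for j in range(n_vars)])

    # count correlations that look "better than random" at a naive cutoff
    strong = int(np.sum(np.abs(corrs) > 0.08))
    print(f"{n_vars:>5} noise variables -> {strong} spurious 'signals'")
```

With ten noise variables the cutoff is rarely crossed; with a thousand, a dozen or so phantom signals typically appear, none of which mean anything.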
Believe it or not, this is good news for data scientists. It means that just crunching numbers, which any computer can do, is among the least important parts of data analysis. More important are the very human aspects of defining data collection procedures and developing theories of analysis.
And possibly the most challenging of those aspects is defining what random actually is.
Defining Random Is The Challenge
Data scientists often look for results that are just slightly better than random. Given a large enough data set, that subtle difference can be worth tens of millions of dollars. Low-hanging fruit, the correlations that are easy to spot, doesn't usually require intensive analysis. A 2016 Gallup study found that more engaged employees tend to lead to safer workplaces: an excellent confirmation of the general hypothesis that paying attention to what you are doing keeps you from screwing it up, but a result that most data scientists would have been wasting their time studying.
Instead, data scientists are paid to find the needle in the haystack, the positive correlations that can’t easily be inferred from broad overviews or rudimentary analysis. And those results, necessarily, find themselves resting closer and closer to the line of true randomness the more subtle they are.
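To put numbers on how subtle and how valuable those needles can be, here is a back-of-the-envelope sketch. Every figure in it is a hypothetical assumption rather than anything reported above: a conversion rate nudged from 2.0% to 2.1%, half a billion sessions a year, and $40 of revenue per conversion. The point is the shape of the trade-off: a lift that small can be worth tens of millions of dollars at scale, yet it takes hundreds of thousands of observations per group just to distinguish it from chance.

```python
import math

# Hypothetical figures: a tweak lifts conversion from 2.0% to 2.1%,
# a result only slightly better than the existing baseline.
baseline, lifted = 0.020, 0.021
sessions_per_year = 500_000_000
revenue_per_conversion = 40.0

extra_revenue = sessions_per_year * (lifted - baseline) * revenue_per_conversion
print(f"Annual value of a 0.1-point lift: ${extra_revenue:,.0f}")

# Rough sample size per group needed to detect that lift at ~95%
# confidence and ~80% power (standard two-proportion approximation).
z_alpha, z_beta = 1.96, 0.84
p_bar = (baseline + lifted) / 2
n = ((z_alpha + z_beta) ** 2 * 2 * p_bar * (1 - p_bar)) / (lifted - baseline) ** 2
print(f"Sessions needed per group just to detect it: about {math.ceil(n):,}")
```

The closer an effect sits to pure chance, the more data it takes to confirm that it is real, which is exactly why the subtle results are the hard ones.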
But defining what is truly random has been a problem in computer science, and indeed in mathematics generally, for a long time. It turns out that there are subtle patterns in a lot of data, even when those patterns are essentially meaningless. Most random numbers in computer science are not truly random at all: they are produced by deterministic algorithms and can exhibit predictable patterns based on the algorithm and the seed that generate them. For true randomness, programmers have to turn to physical phenomena such as atmospheric noise or the unpredictable churn of fluid in lava lamps.
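The difference between seeded, algorithmic randomness and a physical entropy source is easy to demonstrate with nothing but the standard library. This is a minimal illustration, not anything referenced above: reseeding the generator reproduces the identical "random" sequence, while values drawn from the operating system's entropy pool are not reproducible that way.

```python
import random
import secrets

# A pseudorandom generator is a deterministic algorithm: the same seed
# always reproduces the same "random" sequence.
random.seed(42)
first_run = [random.random() for _ in range(5)]

random.seed(42)
second_run = [random.random() for _ in range(5)]

print(first_run == second_run)   # True: perfectly predictable given the seed

# The secrets module draws from the operating system's entropy pool,
# which gathers unpredictability from hardware events rather than a fixed seed.
print(secrets.randbits(32))      # not reproducible from run to run
```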
Randomness means a lack of predictability in results. Predictability, of course, is exactly what data scientists are looking for in their analyses. A just-better-than-random result makes for just barely predictable events… things that occur slightly more often in real life than they would if governed by pure chance. Whether it is a woman who buys large bras spending a few dollars more at each shopping session or used orange cars being slightly less likely to be lemons, the causation is often unimportant: in the end, the information can be used to add a few more dollars to the bottom line, so it serves the end goal.
But not so fast… deciding exactly where that random edge lies depends a lot on both the underlying data set and the setup of the analysis. And for one of those results, it turns out that it probably was just random chance after all. The survey linking orange used cars to fewer lemons failed to account for sample size issues that render the prediction meaningless. (The jury is still out on the bra-size to online spending correlation.)
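A quick simulation shows how a small sample can make a gap like that meaningless. The counts below are invented for illustration; the original survey's figures are not given here. Suppose 80 orange cars with 4 lemons against 1,000 other cars with 70 lemons: the observed gap looks tidy, but chance-only resampling produces a difference at least that large a substantial fraction of the time.

```python
import numpy as np

# Hypothetical counts, used only to illustrate the sample-size problem.
rng = np.random.default_rng(1)

n_orange, lemons_orange = 80, 4          # 5.0% observed lemon rate
n_other, lemons_other = 1_000, 70        # 7.0% observed lemon rate
observed_gap = lemons_other / n_other - lemons_orange / n_orange

# Null hypothesis: paint color is irrelevant and every car shares one rate.
pooled_rate = (lemons_orange + lemons_other) / (n_orange + n_other)

# Simulate 100,000 surveys under that null and count how often random
# sampling alone produces a gap at least as large as the observed one.
sims = 100_000
sim_orange = rng.binomial(n_orange, pooled_rate, sims) / n_orange
sim_other = rng.binomial(n_other, pooled_rate, sims) / n_other
p_value = float(np.mean((sim_other - sim_orange) >= observed_gap))

print(f"Observed gap: {observed_gap:.3f}")
print(f"Share of chance-only surveys with a gap that big: {p_value:.2f}")
```

With counts like these, roughly a quarter of the simulated chance-only surveys match or beat the observed difference, which is nowhere near strong enough evidence to call orange cars safer buys.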
As data scientists find themselves working with larger and larger data sets and working harder and harder to find results that are just slightly better than random, they will also have to spend significantly more time and effort accurately determining what exactly constitutes true randomness in the first place.