Statistical analysis is the process of generating statistics from stored data and analyzing the results to deduce or infer meaning about the underlying dataset or the reality that it attempts to describe.
Statistics is defined as “…the study of the collection, analysis, interpretation, presentation, and organization of data.” That’s basically the same as the definition of data science, and in fact the term data science was popularized in 2001 by statistician William S. Cleveland (now at Purdue) in the title of his paper “Data Science: An Action Plan for Expanding the Technical Areas of the Field of Statistics.”
So you might say that data science essentially is statistical analysis, together with its ancillary technical functions for gathering source data and presenting the analyzed information.
Statistical analysis may be used to:
- Present key findings revealed by a dataset.
- Summarize information.
- Calculate measures of cohesiveness, relevance, or diversity in data.
- Make future predictions based on previously recorded data.
- Test experimental predictions.
Any data scientist will spend much of their day performing some or all of those functions.
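As a small illustration of the summarizing and predicting functions above, here is a minimal sketch using NumPy and invented monthly sales figures; both the data and the one-step trend forecast are purely hypothetical:

```python
import numpy as np

# Hypothetical monthly sales figures (illustrative only, not from the article).
sales = np.array([102.0, 98.5, 110.2, 107.8, 115.4, 119.1])

# Summarize the dataset with basic descriptive statistics.
print("mean:", sales.mean())          # central tendency
print("std: ", sales.std(ddof=1))     # spread (sample standard deviation)

# Fit a straight-line trend and extrapolate one step ahead --
# a bare-bones version of "predicting from previously recorded data".
months = np.arange(len(sales))
slope, intercept = np.polyfit(months, sales, deg=1)
print("next month forecast:", slope * len(sales) + intercept)
```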
Statistical Analysis as a Component of Data Science
Data science programs focus heavily on teaching statistical analysis techniques not only to provide students with particular tools for their future trade, but also to help imbue them with statistical thinking patterns. Data scientists look at data problems in a different way than most people, both because they have the tools to break those problems down in interesting and mathematically valid ways, and also because they have an advanced understanding of probability theory.
Probability theory doesn’t just find use in dusty classrooms and computer labs. It’s an exercise in thinking that changes the way you look at the world. By understanding the mathematical likelihood of certain statistical results, data scientists can tell at a glance whether they are seeing a new and interesting insight emerge from their analysis, or simply a natural distribution of results governed by the laws of chance. Moreover, they can run the calculations on those results to show whether a pattern is random noise or a relevant signal.
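For a concrete sense of what such a calculation looks like, here is a minimal Monte Carlo sketch, assuming NumPy and invented figures (60 successes in 100 trials), that estimates how often pure chance would produce a result at least that extreme:

```python
import numpy as np

rng = np.random.default_rng(42)

# Suppose an A/B test shows 60 conversions out of 100 trials, and we want to
# know whether that could plausibly come from a 50% baseline rate by chance.
observed = 60
n_trials = 100

# Simulate many experiments under the "pure chance" (50/50) hypothesis.
simulated = rng.binomial(n=n_trials, p=0.5, size=100_000)

# Estimated probability of seeing a result at least this extreme by chance.
p_value = (simulated >= observed).mean()
print(f"estimated one-sided p-value: {p_value:.4f}")  # roughly 0.03
```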
Statistical analysis, and the serious mathematical foundation it rests on, is one significant reason why data science is not simply a career that a talented college drop-out can learn through self-study and hard work. A master’s degree in data science, statistics, or mathematics is the only effective way to learn this vital subject matter.
Data scientists will learn the core aspects of statistical analysis in any reputable master’s program. These will include tools and techniques like the following (two of them are sketched in code after the list):
- Bayesian analysis
- Conditional probability
- Data classification
- Linear regression
- Resampling
- Shrinkage
- Tree-based analysis
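As a concrete taste of two items from that list, here is a minimal sketch, with invented data, contrasting ordinary least-squares linear regression with ridge regression, a common shrinkage method. The nearly collinear features are contrived to show why shrinkage helps:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: y depends linearly on two correlated features plus noise.
n = 50
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.1, size=n)   # nearly collinear with x1
X = np.column_stack([x1, x2])
y = 2.0 * x1 + 1.0 * x2 + rng.normal(scale=0.5, size=n)

# Ordinary least squares: beta = (X'X)^-1 X'y
beta_ols = np.linalg.solve(X.T @ X, X.T @ y)

# Ridge (shrinkage) regression: beta = (X'X + lambda*I)^-1 X'y
lam = 1.0
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(2), X.T @ y)

print("OLS coefficients:  ", beta_ols)    # unstable under collinearity
print("Ridge coefficients:", beta_ridge)  # shrunk toward zero, more stable
```

The ridge penalty trades a small amount of bias for a large reduction in variance, a recurring theme in statistical learning.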
Most data scientists will also be exposed to the R programming language as part of their statistical analysis training. R is a highly specialized language for data analysis and has many built-in tools for statistical analysis and data visualization. It’s also highly extensible, so libraries can be created for even more esoteric statistical techniques.
Python, although it was not designed expressly for statistical analysis, is another language commonly used for that purpose. For data scientists, the NumPy library is a must-learn numerical package; together with companions such as SciPy and pandas, it offers statistical capabilities approaching those of R, while still letting a general-purpose, high-performance language handle integration and analysis.
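A brief sketch of the kind of descriptive work NumPy handles out of the box (the data here are randomly generated for illustration):

```python
import numpy as np

rng = np.random.default_rng(7)
data = rng.normal(loc=50, scale=10, size=1_000)  # hypothetical measurements

# A few of the descriptive tools NumPy provides directly.
print("median:          ", np.median(data))
print("90th percentile: ", np.percentile(data, 90))
print("histogram counts:", np.histogram(data, bins=5)[0])

# Correlation between two related series.
noise = rng.normal(scale=5, size=data.size)
print("correlation:", np.corrcoef(data, data + noise)[0, 1])
```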
Learning to Avoid the Pitfalls of Statistical Data
It’s important for data scientists to be careful when using statistical analysis, and to rigorously apply the methods of testing. As Mark Twain famously put it (crediting the quip to Benjamin Disraeli),
“There are three kinds of lies: lies, damned lies, and statistics.”
This statement is just a more colorful way of pointing out that statistics are both easily manipulated and easily misunderstood outside of context. Simpson’s Paradox is a classic case of this sort of failure: apparently clear trends visible in two or more groups of statistical data can disappear or even reverse when the data is combined.
A classic case of this occurred at the University of California, Berkeley, in 1973, when admissions figures showed that 44 percent of male applicants were admitted to graduate studies programs there, compared to only 34 percent of female applicants. Administrators, worried about the potential for lawsuits over gender discrimination at the height of the feminist movement, asked their statistics department to dive into the problem and find the source.
There was, in fact, a gender bias at UC Berkeley, statistician Dr. Peter Bickel found when he looked more closely at the data from the school’s 101 graduate programs; however, it ran in favor of women, not against them. Women tended to apply to programs with stricter admissions requirements and lower admission rates. Within individual departments they were admitted slightly more often than their male counterparts, but because men applied in greater numbers to departments with high admission rates, the aggregated figures showed an apparent bias against women.
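A tiny numerical sketch, using invented admissions figures rather than the actual Berkeley data, shows how the reversal happens:

```python
# Hypothetical admissions numbers (illustrative only -- not the real
# Berkeley data) showing how Simpson's Paradox can arise.
#            (applicants, admitted)
dept_a = {"men": (800, 480), "women": (100, 90)}   # easy department: high admit rate
dept_b = {"men": (200, 20),  "women": (800, 160)}  # strict department: low admit rate

for group in ("men", "women"):
    total_apps = dept_a[group][0] + dept_b[group][0]
    total_adm = dept_a[group][1] + dept_b[group][1]
    rate_a = dept_a[group][1] / dept_a[group][0]
    rate_b = dept_b[group][1] / dept_b[group][0]
    print(f"{group}: dept A {rate_a:.0%}, dept B {rate_b:.0%}, "
          f"overall {total_adm / total_apps:.0%}")
```

In this toy example women lead men in both departments (90% vs. 60%, and 20% vs. 10%), yet their overall rate (28%) trails the men’s (50%) because their applications are concentrated in the strict department.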
These are the kinds of errors that make data scientists wake up in a cold sweat in the middle of the night, fearing that a misstep of their own will lead to similar headlines in the national media. In fact, many of the so-called statistical proofs that the media seize on are meaningless or simply incorrect.
Data scientists are responsible both for presenting their findings in ways that do not lend themselves to such spurious news and for explaining statistical analyses in terms that laypersons can understand. Although this is most obviously an issue when dealing with mass media, the problem exists in business and government just as frequently. Executives who rely on statistical analysis for decision-making have to be able to trust that the numbers aren’t misleading.
A solid master’s program in data science provides both the tools to perform statistical analysis and the knowledge required to apply those tools in trustworthy ways. Careful experimental design and regression analysis techniques will keep you out of trouble in statistics and ensure that you are pulling out gems of data and not polishing up fool’s gold.