Data mining describes the process of using computational methods to extract meaning from large sets of data.
Data mining is one of the core processes that data scientists use to leverage new insights from existing data structures. This is an area where you’ll need a strong understanding of query languages, database structure, and analytical techniques, all of which you’ll find in the curriculum of a data science master’s program.
If that definition of data mining sounds suspiciously as if it simply describes most of the work performed by data scientists, you’re not wrong—the term has become muddy and imprecise with constant use, similar to how the term “cloud computing” has simply come to describe almost anything that can plausibly be connected to the internet. Today, data mining is casually used as a description of almost any type of data analysis.
Data Mining Led to the Emergence of the Modern Field of Data Science
The emergence of data mining in the ’90s was likely one of several developments in the database world that led directly to the data science profession.
Data mining first emerged as a descriptive phrase in the 1990s, when relational database systems were becoming large enough, and holding enough raw data, to make greenfield analysis projects both possible and potentially lucrative.
Until then, these systems had been used primarily for operational data management; information was stored and retrieved for particular, pre-defined business purposes, like tracking sales transactions or time clock information. But database administrators and business analysts realized that they could also create new reports and compare the information in new ways, yielding insights into business efficiency and functions that weren’t originally envisioned. The process of querying and crunching the stored data to draw new insights out of it came to be called data mining.
The word “mining” in the term didn’t refer to the creation of new information, as it might suggest to laypersons, but rather to the discovery of hidden patterns in existing data. Like miners digging down through rock formations to uncover the unseen gold veins within them, data miners crunch through boulders of retail data looking for blocks of customers grouped by recency, frequency, and monetary spending habits, then use those groups for targeted marketing efforts that can pay off even bigger than a gold strike.
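The recency, frequency, and monetary grouping mentioned above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the transaction records, the customer names, the `rfm_summary` helper, and the 30-day and 200-unit thresholds are all hypothetical, and real retail projects would typically run this kind of aggregation inside a database or a data analysis library.

```python
from collections import defaultdict
from datetime import date

# Hypothetical point-of-sale records: (customer_id, purchase_date, amount).
transactions = [
    ("alice", date(2024, 1, 5), 120.0),
    ("alice", date(2024, 3, 20), 80.0),
    ("bob", date(2023, 11, 2), 40.0),
    ("carol", date(2024, 3, 28), 300.0),
    ("carol", date(2024, 2, 14), 150.0),
    ("carol", date(2024, 1, 9), 50.0),
]


def rfm_summary(rows, as_of):
    """Return {customer: (recency_days, frequency, monetary_total)}."""
    last_purchase = {}
    frequency = defaultdict(int)
    monetary = defaultdict(float)
    for customer, day, amount in rows:
        # Track the most recent purchase date seen for each customer.
        if customer not in last_purchase or day > last_purchase[customer]:
            last_purchase[customer] = day
        frequency[customer] += 1
        monetary[customer] += amount
    return {
        customer: (
            (as_of - last_purchase[customer]).days,
            frequency[customer],
            monetary[customer],
        )
        for customer in last_purchase
    }


summary = rfm_summary(transactions, as_of=date(2024, 4, 1))

# Illustrative segment: bought within the last 30 days and spent at least 200.
# Both thresholds are arbitrary assumptions chosen for this example.
targets = sorted(
    customer
    for customer, (recency, _frequency, spend) in summary.items()
    if recency <= 30 and spend >= 200.0
)
print(targets)
```

The point of the sketch is that nothing new is created: the three measures are derived entirely from sales records that were originally stored for bookkeeping.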
Applications of Data Mining
Data mining is used in almost every application of data science. Any data research project that involves working with large, existing databases, or even non-traditional datasets such as the open web or real-time streams, involves mining data.
- Biologists are working with data scientists to perform data mining on mapped DNA sequences, probing the vast amounts of genetic data for relationships and patterns that can help establish the evolution of species or the origins of diseases.
- Almost all retail businesses engage in large-scale data mining operations today to understand what their marketing and point-of-sale system data implies about customer preferences and profitability.
- Spatial data mining is performed on large geographic information systems (GIS) datasets to detect trends in economics, social behavior, and transportation from mapping data.
- In national security and law enforcement, data mining is used to probe data sets to uncover patterns of behavior that can be associated with criminal or terror activities.
The limits of data mining projects exist only in the imagination. Much of the role of data scientists in any organization is to survey the available data stores and design data mining projects that exploit the information in them to improve decision support or deepen insight into business efficiency.
Defining and Learning Data Mining for Data Scientists
Because the process of data mining is so central to what most data scientists do, almost every aspect of data science education can be seen as a component of the data mining process. It can be difficult to view data mining as a separate topic from other discrete techniques, such as:
- Machine learning
- Statistical analysis
- Data visualization
Each of these has some role to play in the data mining schema. Different schools may define data mining differently and emphasize or teach different parts of the data science process under the data mining rubric.
An Old Organization Advances New Standards for Data Mining
The Association for Computing Machinery (ACM), a more than 70-year-old professional society whose slightly archaic name betrays its age, established a special interest group (SIG) in 1998 dealing specifically with knowledge discovery and data mining (KDD). SIGKDD proposed a formal model curriculum for data mining that helps to define the process and shape how modern data scientists learn to use it.
The curriculum breaks data mining down into foundational and advanced topics that include:
- Pre-processing (data scrubbing and transformation) considerations and methods
- Normalization, reduction, and data smoothing
- Data warehousing
- Statistical and clustering analysis
- Text-mining, stream-mining, and data visualization
- Societal and ethical considerations of data miners
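Two of the pre-processing topics in the list above, normalization and data smoothing, can be illustrated with a minimal sketch. The choice of min-max scaling and a simple moving average is an assumption made for this example, not something the SIGKDD curriculum prescribes; the function names and the `daily_sales` figures are hypothetical.

```python
def min_max_normalize(values):
    """Rescale values linearly so the smallest maps to 0.0 and the largest to 1.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical, so there is no range to scale by.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def moving_average(values, window=3):
    """Smooth a series by averaging each run of `window` consecutive values."""
    if window < 1 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


# Hypothetical daily sales counts with one spike on the third day.
daily_sales = [10, 12, 50, 14, 16]

normalized = min_max_normalize(daily_sales)
smoothed = moving_average(daily_sales, window=3)

print(normalized)
print(smoothed)
```

Normalization puts measurements on a common scale so that they can be compared or fed to a clustering method, while smoothing dampens one-off spikes such as the third value here. The smoothed series is shorter than the input because only full windows are averaged.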
Influenced by the SIGKDD model, many data science master’s programs are beginning to teach data mining in a similar vein.
In particular, considerations around ethics and propriety are becoming a focus as various abuses of data mining have come to light in recent years. Because most of the data now being mined was originally gathered for other purposes, it is often unclear whether consent was given for the new analyses, or whether consent is even a requirement for data mining.
In some cases, such as with impersonal, aggregated data like GIS information or anonymized medical data, this raises few concerns. In other cases, however, such as with DARPA’s Total Information Awareness program, the data consists of very personal information, gathered by third parties purportedly for purposes very different from those it was ultimately used for. This has caused not only public outcry but also some soul-searching among data scientists engaged in data mining.
The technique is so powerful and useful, however, that it is not going away any time soon. The data science world will be working away down in the data mines, chipping out nuggets of knowledge and insight, for decades to come.