In the final scene of the classic ’80s hacker movie WarGames, as the out-of-control WOPR supercomputer is about to crack the launch code and fire a massive nuclear first strike against the Soviet Union, the main character (played by a young Matthew Broderick) hits on a simple idea: knowing that the computer has calculated that a first strike offers the best chance of winning a nuclear engagement, he realizes that they somehow need to teach WOPR that there is actually no way to win. He asks the computer to play tic-tac-toe, a simple game that either player can easily force to a draw. When that proves too slow, he tells the machine to play against itself.
In a barrage of blindingly fast matches, WOPR squares off against itself as both players in a simulation of the game. It rapidly runs through all permutations and then moves on to simulating nuclear exchanges. After tense seconds of analysis, it calls off the launch.
“A strange game,” it declares. “The only winning move is not to play.”
Although WOPR was a figment of the filmmakers’ imagination, that final scene captures the essence of machine learning: creating a set of conditions that allow computers to learn the rules of systems on their own, without being explicitly taught or programmed by humans. One of the best tools available for teaching machines to teach themselves is big data… and at the same time, one of the most promising tools for analyzing big data today is machine learning.
The Search for Artificial Intelligence Sparked the Idea for Machine Learning
Machine learning was created as a mechanism for advancing the science of artificial intelligence (AI). One of the most difficult aspects of developing artificial intelligence was teaching computers how to interpret and act on observations of data.
Initially, this was done through enormously complex branching algorithms, with human programmers attempting to envision all possible decision trees ahead of time and creating code to deal with those eventualities. But this severely limited the utility of even primitive AI, since the coding required became too complex for all but the simplest tasks.
Arthur Samuel, a researcher at IBM, realized that programmers could instead seed their AI programs with a very basic set of instructions, which did not attempt to explicitly imagine all possible logic branches, but instead instructed the computer to analyze and compare incoming data, and adjust its own processes according to the results. By observing and analyzing patterns in the data, the machine could effectively program itself.
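Samuel’s actual checkers program is far more involved than a few lines of code, but the core mechanism he described, a program that compares its own output against incoming data and adjusts its internal parameters accordingly, can be sketched very simply. The function and toy data below are purely illustrative and are not drawn from Samuel’s work:

```python
# A minimal sketch of the self-adjusting idea: rather than hand-coding every
# rule, give the program a simple update procedure and let the data tune its
# internal parameter. This toy learner fits a single weight w so that
# predictions w * x track the observed outputs y.

def learn_weight(examples, learning_rate=0.01, epochs=100):
    """Adjust a single parameter from (input, output) pairs."""
    w = 0.0  # start with no knowledge of the underlying rule
    for _ in range(epochs):
        for x, y in examples:
            prediction = w * x
            error = y - prediction           # compare prediction with reality
            w += learning_rate * error * x   # nudge the parameter toward less error
    return w

# The rule y ≈ 2x is never written into the program; it is inferred from data.
data = [(1, 2.1), (2, 3.9), (3, 6.2), (4, 8.0)]
print(learn_weight(data))  # converges to roughly 2.0
```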
In AI circles, this approach is sometimes compared to the way a human infant learns. By starting with rudimentary programming (movement, feeding, a desire for attention, needs for sustenance), a baby constantly observes and interacts with the world around it until it learns the rules we all live by: don’t touch things that are hot, don’t eat things that taste bad, don’t wet the dry and comfortable bed.
When this idea was first advanced in the late 1950s, the problem was that learning from existing data requires an awful lot of data. Finding patterns that hold up in every situation requires examining and testing many types of situations. And at the time, such huge sets of data simply didn’t exist.
Because of this, machine learning fell out of favor in AI circles during the 1970s and 1980s. But it was still there, waiting on the shelf, when rising Internet adoption in the 1990s and 2000s began to generate reams and reams of raw data.
Data Science Finds a New Use for an Old Tool
The fact that machine learning was rooted in statistical analysis didn’t escape the notice of early data scientists looking for ways to make sense of all that information. Pattern recognition was a useful tool even if you weren’t trying to build a WOPR—the same machine learning processes could unearth trends in search engine queries or retail sales, trends that might never be easily discovered at more human scales of thought.
Data scientists today program machines to learn from large sets of data and make predictions, either about that data itself or about new information constantly flowing into the machine. A good example of this is the way Google uses machine learning in its RankBrain search optimization system.
The RankBrain algorithm refines queries asked of the search engine by drawing on the words used in past searches and how those words relate to one another. With more than three billion searches made through Google each day, plus information about which links were clicked in the results, there are many hidden patterns in the data. RankBrain is trained to use that information to prioritize results likely to be more relevant, based on the behavior of previous searchers.
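Google has published few details about RankBrain’s internals, so the sketch below is only a toy illustration of the general workflow described above: fit a model to past query-and-click behavior, then use it to score and rank new candidate results. The features, numbers, and variable names are invented for the example; this is not Google’s system.

```python
# Toy illustration: learn from past (query, result) examples and whether the
# result was clicked, then use the fitted model to rank new candidates.
# All features and values here are made up for illustration.
from sklearn.linear_model import LogisticRegression

# Each row: [term-overlap score, result freshness, past click-through rate]
X_past = [
    [0.9, 0.2, 0.30],
    [0.1, 0.8, 0.02],
    [0.7, 0.5, 0.25],
    [0.2, 0.1, 0.01],
]
y_past = [1, 0, 1, 0]  # 1 = the result was clicked, 0 = it was not

model = LogisticRegression()
model.fit(X_past, y_past)

# Score new candidate results for an incoming query and rank them.
candidates = [[0.8, 0.3, 0.20], [0.3, 0.9, 0.05]]
scores = model.predict_proba(candidates)[:, 1]  # estimated probability of a click
ranking = sorted(zip(scores.tolist(), candidates), reverse=True)
print(ranking)  # higher-scoring candidates first
```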
Data scientists apply machine learning to a number of different problems of statistical analysis, including:
- Classification – Systems train themselves to sort data points into predefined buckets or categories.
- Regression – Predicting a continuous numerical output rather than a discrete category.
- Clustering – Similar to classification, but the machine is tasked with defining the buckets for grouping data itself.
- Density estimation – Determining the distribution of inputs in a data set.
- Dimensionality reduction – Reducing the number of variables under consideration by mapping the data to a smaller set of informative features.
These methods may or may not require the data scientist to define a particular target for the machine to learn, and they typically include some feedback mechanism, such as scoring the machine’s output against human classifications of the same data. Every time you identify images in Google’s reCAPTCHA challenge, you are adding labels to a dataset used to train image recognition algorithms.
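As a rough illustration of how a few of the tasks listed above look in practice, here is a short sketch using scikit-learn on synthetic data; the dataset and parameter choices are arbitrary and exist only to show the shape of each task.

```python
# Rough sketch of regression, clustering, and dimensionality reduction
# on synthetic data.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))  # 200 samples, 5 features
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(scale=0.1, size=200)

# Regression: learn a continuous mapping from inputs to outputs.
reg = LinearRegression().fit(X, y)

# Clustering: the machine defines its own groups (no labels supplied).
clusters = KMeans(n_clusters=3, n_init=10).fit_predict(X)

# Dimensionality reduction: compress 5 features down to 2 while keeping
# as much of the variation in the data as possible.
X_2d = PCA(n_components=2).fit_transform(X)

print(reg.coef_.round(2), clusters[:10], X_2d.shape)
```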
How Machine Learning Is Taught and Used in Data Science
Machine learning is not a tool that all data scientists will use. In fact, machine learning scientists are emerging as a separate category of data scientist. They specialize in programming machine learning algorithms, which requires a solid understanding of programming languages like Python and a strong grounding in computational learning theory.
Statisticians have also come to rely heavily on machine learning, combining the two fields into a subset specialty they call statistical learning. This approach leans heavily on frequentist statistical methods and is finding applications in:
- Computer vision
- Speech recognition
- Bioinformatics
Because of the crossover with artificial intelligence applications, machine learning is as likely to be taught in computer science programs as in data science programs. However, most data science master’s degrees will at least include an introduction to the topic, and many offer electives that take you deeper into the subject.
Data scientists often use machine learning techniques to perform their own data analysis tasks, but they may also be involved in machine learning efforts oriented more toward artificial intelligence development. AI researchers benefit from having very large datasets for training their algorithms, yet they may not have any expertise in creating or managing those datasets, which makes data scientists valuable members of AI teams.
In that role, they may be responsible for defining collection procedures, scrubbing data to ensure inputs will be relevant, and assisting in creating interfaces between machine and data.
Understanding the capabilities that machine learning offers is important for all data scientists, however, even if they are not directly involved in coding machine learning algorithms. The technique is an important tool in the profession, and even if specialists or consultants have to be brought in for a specific project, it can be the right approach for certain data analysis problems you will encounter in your career.