Flickr, the online photo library, is awash in spectacular, high-quality digital photos: almost 4 million uploaded daily by some of the site’s nearly 90 million users, a collection that stood at 8 billion pictures at last count… and counting.
As image quality, and with it file size, improved, Flickr found itself in a bind: how could all those pictures be stored on its servers for users who paid little or nothing for the privilege?
Fortunately, the site’s parent company, Yahoo, already had an answer. Yahoo ran the largest Hadoop file storage and processing cluster in the world, more than 32,000 nodes with 600 petabytes of storage. But part of the point of amassing so much data is doing something with it, and image data is essentially unstructured, which presented Flickr’s data science team with a problem: how could they analyze it to improve the user experience?
As it happened, Hadoop, an Apache project, was also part of the answer to that question. Flickr was able to use the same system to build a deep learning environment on top of the cluster, leveraging all that storage and processing power to train models to comb through image data and recognize the objects in it.
Using Hadoop, the team unveiled a new Magic View feature in 2015, which allows users to search photos using common terms like “snow” or “pumpkins” and get matching images even if no human being ever tagged them as such. The Hadoop cluster had taught itself what a pumpkin looked like.
Hadoop Provides a Scalable Solution for Big Data
Data science has emerged as one of the fastest-growing fields simply because there is so much data to store, organize, clean, process, and interpret. Analyzing and making sense of all this data has become an industry in its own right.
But, for the same reason, storing and retrieving all of that information has evolved into another specialized silo of expertise. The sheer number of bits to be crammed onto storage media began to outstrip what individual hard drives could hold in the late 1990s, even as the capacity of those drives steadily increased. The SLED (Single Large Expensive Disk) began to fade as the preferred storage solution in favor of RAID (Redundant Array of Inexpensive Disks) on servers.
RAID had a number of advantages, including capacity and fault tolerance. But it had a disadvantage, too: performance. A RAID array could be configured for rapid storage, rapid retrieval, or redundancy… but not all three at once. A single server only had the processing power to make so many decisions per second about which disk to put a bit on or pull it from, and how to “stripe” it across multiple disks to ensure integrity if one of those cheap drives failed. The volumes of information that big websites were throwing at those servers overwhelmed that processing capacity.
Google, with its goal of indexing every single page on the web, struggled hard to meet this challenge. Programmers there realized that they essentially needed a version of RAID that worked with entire servers, not just hard drives. The Google File System was created to allow the company to use cheap, commodity server hardware to store buckets of data with nearly infinite scalability.
In 2003, the company released a paper describing the system. Three years later, in 2006, the open-source Hadoop project was created to duplicate the feat with open code that anyone could use, and it would bear the name of a yellow elephant plush toy that belonged to programmer Doug Cutting’s son.
The Anatomy of Hadoop: How It’s Put Together
Hadoop exists in a four-part architecture supporting two basic functions. The modules are:
- Hadoop Common – Essential utilities and tools referenced by the other modules
- Hadoop Distributed File System – The high-throughput file storage system
- Hadoop YARN – The job-scheduling framework for distributed process allocation
- Hadoop MapReduce – The parallel processing module based on YARN
Together, these provide a distributed filesystem (HDFS) that spreads data across the cluster and a processing framework (MapReduce, scheduled by YARN) that pushes computation out to the nodes where the data lives.
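To make that division of labor concrete, here is a minimal, self-contained sketch of the MapReduce pattern in plain Python rather than against any particular Hadoop API. On a real cluster, the same mapper and reducer logic would typically be split into separate scripts submitted through Hadoop Streaming (or written in Java), with YARN scheduling many copies of each phase near the HDFS blocks that hold the input.

```python
# wordcount_sketch.py -- a local illustration of the MapReduce pattern (word count).
# On a cluster, the mapper and reducer would run as separate Hadoop Streaming scripts.
import sys
from itertools import groupby
from operator import itemgetter

def mapper(lines):
    """Map phase: emit a (word, 1) pair for every word seen."""
    for line in lines:
        for word in line.strip().split():
            yield word, 1

def reducer(pairs):
    """Reduce phase: sum the counts for each word (pairs arrive grouped by key)."""
    for word, group in groupby(sorted(pairs), key=itemgetter(0)):
        yield word, sum(count for _, count in group)

if __name__ == "__main__":
    # Locally, stdin stands in for an HDFS input split; try: cat somefile.txt | python wordcount_sketch.py
    for word, total in reducer(mapper(sys.stdin)):
        print(f"{word}\t{total}")
```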
HDFS, like most filesystems, doesn’t care what type of data it is storing. It can be structured, as in conventional RDBMS tables, or unstructured, as with NoSQL key-value stores or even plain old binary files, like Flickr’s pictures.
The first challenge for data scientists is dealing with the unstructured nature of many Hadoop systems. The essential demands of speed compete with the overhead required to store data in useful relational structures, leaving data scientists to impose structure only when the data is read back out (the schema-on-read approach).
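As an illustration of that indifference to format, here is a minimal sketch that drops both delimited rows and raw image bytes into the same cluster. It assumes the third-party `hdfs` (WebHDFS) Python client; the namenode URL, user name, and paths are placeholders for whatever a real cluster exposes.

```python
# hdfs_is_agnostic.py -- a sketch of HDFS storing structured and unstructured data side by side.
# Assumes the third-party `hdfs` (WebHDFS) client; URL, user, and paths are placeholders.
from hdfs import InsecureClient

client = InsecureClient("http://namenode.example.com:9870", user="datasci")

# Structured data: delimited rows that a tool like Hive could later project a schema onto.
client.write("/data/sales/2015-10.csv",
             data="2015-10-01,42,1037.50\n2015-10-02,42,864.00\n",
             encoding="utf-8", overwrite=True)

# Unstructured data: raw JPEG bytes, stored exactly as-is; HDFS neither knows nor cares.
with open("pumpkin.jpg", "rb") as img:
    client.write("/data/photos/pumpkin.jpg", data=img, overwrite=True)

# Retrieval looks the same either way: stream the bytes back out of the cluster.
with client.read("/data/photos/pumpkin.jpg") as reader:
    photo = reader.read()
```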
Leveraging Hadoop for Data Analysis by Structuring the Data It Holds
As powerful as Hadoop is, it was designed to store and retrieve big chunks of data, not to make sense of them. Data scientists find Hadoop data stores a challenge to analyze when the data is kept raw and unstructured.
Tools are evolving for pulling information out of Hadoop clusters:
- Tableau, a leading provider of visual data analytics software, has built its tools to interface with Hadoop stores right out of the box.
- Revolution R Enterprise is a tool from Revolution Analytics that allows native R code to execute directly on Hadoop clusters.
- Apache’s own Hive is a dedicated data warehouse system built directly on top of Hadoop expressly to project structure onto previously stored data (see the sketch just after this list).
And more are on the way.
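To make the Hive idea concrete, here is a minimal sketch that lays a table definition over delimited files already sitting in HDFS and then queries them with ordinary SQL. It assumes the third-party PyHive client; the HiveServer2 host, port, user, and HDFS path are placeholders.

```python
# hive_schema_on_read.py -- a sketch of Hive "projecting structure" onto files already in HDFS.
# Assumes the third-party PyHive client; host, port, user, and the HDFS path are placeholders.
from pyhive import hive

conn = hive.Connection(host="hiveserver.example.com", port=10000, username="datasci")
cursor = conn.cursor()

# The files under /data/sales stay exactly where they are; the table is just a schema laid over them.
cursor.execute("""
    CREATE EXTERNAL TABLE IF NOT EXISTS sales (
        sale_date STRING,
        store     INT,
        total     DOUBLE
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    STORED AS TEXTFILE
    LOCATION '/data/sales'
""")

# From here on, the previously "unstructured" files answer ordinary SQL questions.
cursor.execute("SELECT store, SUM(total) FROM sales GROUP BY store")
for store, revenue in cursor.fetchall():
    print(store, revenue)
```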
Cloudera is working on a project called Ibis to create a Python library designed for data scientists to analyze data in Hadoop stores. Hardcore Python coders can already use basic wrappers like Hadoopy and Pydoop to access Hadoop clusters directly, albeit without sophisticated analytical tools included. Still, those interfaces give other Python data science libraries relatively easy access to the stored data for subsequent analysis.
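For instance, a Pydoop-based sketch might stream files straight out of HDFS and into pandas for analysis. The calls below assume Pydoop’s hdfs module, and the HDFS path and column names are placeholders.

```python
# pydoop_to_pandas.py -- a sketch of handing HDFS-resident data to ordinary Python data science tools.
# Assumes Pydoop's hdfs module; the HDFS path and column names below are placeholders.
import pandas as pd
import pydoop.hdfs as hdfs

# List the delimited part-files sitting under an HDFS directory...
frames = []
for path in hdfs.ls("/data/sales"):
    # ...and stream each one straight into pandas, no local copy required.
    with hdfs.open(path, "rt") as f:
        frames.append(pd.read_csv(f, names=["sale_date", "store", "total"]))

sales = pd.concat(frames, ignore_index=True)
print(sales.groupby("store")["total"].sum())
```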
Perhaps the easiest way to leverage Hadoop for big data analysis is simply to use it to store information in a structured format. With Hadoop used as the file store beneath an RDBMS-style engine, data scientists get essentially unlimited storage, high performance, and the ease of using familiar, fast tools like SQL and R without having to jump through additional hoops to impose structure on the data.
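Under that arrangement, day-to-day analysis can look like any other SQL workflow. Here is a minimal sketch, again assuming a Hive endpoint in front of HDFS-resident tables and the PyHive client alongside pandas; the connection details and table name are placeholders.

```python
# sql_on_hadoop.py -- a sketch of the "store it structured" approach: data lives in HDFS,
# a SQL engine (Hive here) sits in front of it, and analysis uses the usual tools.
# Assumes the PyHive client; connection details and the table name are placeholders.
import pandas as pd
from pyhive import hive

conn = hive.Connection(host="hiveserver.example.com", port=10000, username="datasci")

# Familiar SQL in, familiar DataFrame out; the Hadoop plumbing stays out of sight.
top_stores = pd.read_sql(
    "SELECT store, SUM(total) AS revenue FROM sales GROUP BY store ORDER BY revenue DESC LIMIT 10",
    conn,
)
print(top_stores)
```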