The petrochemical industry might not be the first to come to mind when you think about massive data stores, but there’s a lot of geographic and geologic data constantly being collected by the likes of Chevron and BP, which means they are building out their storage capacity and hiring data scientists just as fast as tech giants like Amazon and Facebook.
So there are more candidates for the top big data firms than you might expect, but it turns out they aren’t very easy to classify. Do you rank them by revenue? By funding? By total number of records?
Sure, it’s usually about marshalling data to add to the bottom line, but not always. And even when it is, different companies in different industries are going about it in different ways. It’s a complicated topic to parse… but then, that’s exactly what data scientists do.
No matter what angle you take to slice the big data market, chances are it’s going to be a different list tomorrow than it is today. The players are flexible and the space is fluid. It’s an exciting time to be involved in big data.
Wikimedia Foundation
Including Wikipedia and its associated sites in the list illustrates how big data isn’t always about big profits. Wikipedia, the largest single property within Wikimedia, stores data in a clustered MySQL database derivative where it is organized by topic and language. The plain text of the articles is augmented by logs of all changes that have been made over time. Frequently accessed articles are cached for increased front-end responsiveness. Dumps of the data are stored separately and are available for download to anyone interested in duplicating the wiki.
So how do you figure the overall size? Do you measure compressed or uncompressed size? Or do you go by information content, the number of articles and density of topics? Do you include the caching or does it count as duplicate data? What about the RDBMS overhead? And what to do with the fact that thousands of new pages and hundreds of thousands of edits are made each day, each of them changing all of those numbers?
The Wikimedia Statistics team has given up on that impossible task and gone on to measuring more interesting statistics like the fluctuation and flow of edits themselves instead.
Similar issues arise with all of the entries on this list. But those challenges are exactly what data scientists are being hired to address.
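To make the compressed-versus-uncompressed question concrete, here is a minimal Python sketch that measures the same text three different ways. The sample text and the `measure_sizes` helper are invented for illustration; nothing here reflects how Wikimedia actually measures its data.

```python
import zlib


def measure_sizes(text: str) -> dict:
    """Return three different 'sizes' for the same piece of text."""
    raw = text.encode("utf-8")          # bytes as they would sit uncompressed
    compressed = zlib.compress(raw, 9)  # same content, maximum zlib compression
    return {
        "characters": len(text),
        "uncompressed_bytes": len(raw),
        "compressed_bytes": len(compressed),
    }


# Hypothetical stand-in for article text; repetitive text compresses very well.
sample = "Wikipedia is a free online encyclopedia. " * 200
sizes = measure_sizes(sample)
print(sizes)
```

All three numbers describe the same content, which is why a single headline figure for "how big" a data store is can be misleading.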
Tencent
Tencent is a Chinese web giant that overtook Facebook in January 2018 to become the world’s fifth most valuable company… a company that few outside of China are familiar with. But with a native user base of almost 1.5 billion people who have been largely walled off from Western social media sources, the organic growth of Tencent’s data has been explosive.
With music, video, and online chat offerings that mirror those of Apple, Google, and Facebook, the company has enjoyed near-monopoly levels in the Chinese market and has leveraged that access to build out payment systems, gaming, and other data-intensive operations.
General Electric
Even if you already knew that massive consumer conglomerate GE makes more than lightbulbs, you might not have realized that it’s a big player in big data… and poised to become much bigger.
The multinational firm has divisions building everything from submarines to wind turbines, activities that generate plenty of data on their own, but it’s also a major investor in Internet of Things research. In a market expected to reach 20 billion individual smart devices by 2020, each generating reams of digital data every second, that makes GE a major force in data storage and analysis. The company’s Industrial Internet of Things Platform as a Service (PaaS) offerings have been experiencing 100 percent year-over-year growth, driven by edge analytics and artificial intelligence built out by company data scientists.
National Security Agency (NSA)
The National Security Agency made headlines in 2013 when analysts reported that its brand new Utah data center was expected to store up to a yottabyte worth of sensitive data, collected both openly and through illicit means, all toward the end of identifying and tracking down threats to the United States. Further estimates downgraded the likely capacity to only a couple of exabytes (only!) but the purpose remains the same, and the center is not the only place the agency stores data.
The NSA isn’t likely to ever admit exactly how much data it stores and processes, but its hunger for data scientists is no secret, and it’s a good bet all those new hires aren’t just sitting around twiddling their thumbs with nothing to analyze.
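For a sense of how large the gap between those two estimates is, the arithmetic can be sketched in a few lines of Python. This assumes the decimal (SI) definitions of the units; the `SI_BYTES` table and `ratio` helper are illustrative names only.

```python
# Decimal (SI) byte units, assumed here for illustration.
SI_BYTES = {
    "petabyte": 10**15,
    "exabyte": 10**18,
    "zettabyte": 10**21,
    "yottabyte": 10**24,
}


def ratio(larger: str, smaller: str) -> int:
    """How many of the smaller unit fit into one of the larger unit."""
    return SI_BYTES[larger] // SI_BYTES[smaller]


exabytes_per_yottabyte = ratio("yottabyte", "exabyte")
print(exabytes_per_yottabyte)  # 1000000
```

In other words, one yottabyte is a million exabytes, so the revised estimate of a couple of exabytes is several hundred thousand times smaller than the original headline figure.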
Acxiom
You’ve probably never heard of Acxiom, but the massive consumer marketing firm has definitely heard of you. Using advanced data mining techniques, the company quietly sifts some 50 trillion data transactions each year to analyze consumer spending habits and identify the most effective sales pitches to use for selling you everything from cars to credit cards.
In 2012, the New York Times reported that the company had amassed the world’s largest commercial database on consumers… a major accomplishment when compared to companies like Google and Facebook, which have consumers practically lining up to turn over personal data.
Chevron
Big oil is increasingly turning to big data to stay on top of a tightening petrochemical market, and Chevron leads the field according to INFORMS, the organization behind the popular CAP (Certified Analytics Professional) data scientist certification.
In 2015, INFORMS awarded Chevron its Best Analytics and Operations Research award for integrating advanced analytics into company operations. By blending sophisticated GIS data with distributed sensor networks and remote monitoring platforms in what it calls the “i-field” digital oil field, the company has been able to boost recovery and production rates by 6 to 8 percent.
Facebook
Facebook primarily stores your data… or at least data about you, and all your friends, and all their friends. That includes conversations, likes and dislikes, pictures, music, and videos. It takes a lot of space but it also represents the largest part of value in the company, which is designed to figure out what makes you tick so it can help other companies sell you stuff.
The popularity of the platform has had the company building out new data centers like crazy and you can bet they are filling them up just as fast; in 2014, users uploading 30 billion items of content each day had amassed over 300 petabytes of data in the Facebook data warehouse.
Microsoft
Microsoft is seen in some circles as a has-been in cutting-edge computing, but the Redmond, Washington-based company remains a giant in the big data world. SQL Server has been sitting at number 3 on the db-engines.com ranking, and a resurgence in popularity in 2017 ensures it remains a tool of choice for data scientists in every field.
Meanwhile, the company has quietly been moving into the cloud space with its Azure offering, and dragging users along with it on the Software as a Service Office 365 platform. The company hit the million server mark in 2013 and shows no signs of slowing down with its big data bet.
Amazon
Close behind Google is Amazon, amassing data not only on millions of products, but also on the people buying those products. Forays into online video and cloud storage have further upped the total amount of information stored in Amazon data centers.
Amazon stores not only its own data but also much of the operating data and storage infrastructure of other big data players like Netflix and Airbnb, which boosts the total enormously. The company had more than 450,000 servers deployed as of 2016.
Alphabet
Google’s parent company doesn’t just own the biggest Internet company in the world, which by itself houses the largest streaming video site, largest online document storage, and largest webmail service… it also owns other big data plays like Nest and Waymo.
This easily makes Alphabet the biggest of the big data firms. Estimates for the total size of data stored under its auspices are on the order of 15 exabytes.