In Thailand on his summer break in January 2018, Australian university student Nathan Ruser couldn’t entirely put his studies out of his mind. An international security and Middle East studies student at Australian National University in Canberra, Ruser found himself browsing Twitter for mapping and GIS projects during some downtime while touring the Thai-Myanmar border. There, he happened across a tweet from mobile fitness app company Strava.
The company had assembled aggregated, anonymized tracking activity from its user base into a colorful global heatmap showing the most popular routes followed by the walkers, bikers, and joggers who use the app. The post announcing the map was written by a data engineer and is a marvel of data processing and visualization tips and tricks, enough to make any data scientist green with envy. You can almost see the swagger as the engineer cites the billion activities tracked and the trillions of pixels rasterized for the visualization, going into lavish detail about how it was produced from the massive dataset behind it.
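To get a sense of what that kind of pipeline does at its core, the sketch below bins pooled GPS samples into a pixel grid and log-scales the counts, the essential operation behind any activity heatmap. It is a minimal illustration in Python with NumPy; the function name, grid resolution, and scaling choices are assumptions made for the example, not Strava’s actual code.

```python
# Minimal, illustrative heatmap rasterization: NOT Strava's pipeline.
# Pools (lat, lon) samples from many activities into a single pixel grid.
import numpy as np

def rasterize_tracks(points, lat_range, lon_range, height=1080, width=1920):
    """points: (N, 2) array of (lat, lon) samples aggregated across activities."""
    lats, lons = points[:, 0], points[:, 1]
    # Map each coordinate onto a pixel index inside the requested bounding box.
    rows = (lats - lat_range[0]) / (lat_range[1] - lat_range[0]) * (height - 1)
    cols = (lons - lon_range[0]) / (lon_range[1] - lon_range[0]) * (width - 1)
    rows = np.clip(rows.astype(int), 0, height - 1)
    cols = np.clip(cols.astype(int), 0, width - 1)
    # Count how many samples land in each pixel...
    grid = np.zeros((height, width))
    np.add.at(grid, (rows, cols), 1)
    # ...then log-scale so busy city streets don't drown out faint desert tracks.
    return np.log1p(grid) / np.log1p(grid.max())
```

Rendering is then just a matter of pushing those normalized intensities through a color ramp; the aggregation across many users is also precisely what was supposed to make the published map anonymous.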
But what caught Ruser’s eye wasn’t the technical detail; it was the tracks on the map itself. There were bright traces of activity in many of the parts of the world he typically studied: war zones. The average Syrian irregular or African rebel probably wasn’t wandering around wearing a Fitbit or Apple Watch. The tracks were almost certainly left by Westerners… and most Westerners in those regions were military forces working on classified missions out of secret bases.
Bases that weren’t so secret anymore. Big data had just spilled big secrets… and potentially cost some soldiers their lives.
Anonymizing One Data Source Doesn’t Make The Data Anonymous
Ruser’s observations went viral on Twitter within days, prompting a defensive response from Strava, which blamed users who couldn’t figure out the seven-step process required to secure their data. The company claimed that it had followed all best practices to anonymize the data, even though researchers building on Ruser’s analysis quickly found ways to pick out individual records, including following one French soldier through his entire deployment and return home.
The breach prompted one Twitter user to observe, “The data science version of ‘I’ll go check the basement’ in horror movies is ‘Oh, it’s OK, we anonymised the data.’”
That observation may fall into the funny-but-true category. The Strava debacle exposed secret bases, convoy routes, and potentially even guard posts and patrol patterns… all military intelligence of the highest value to opposing forces. But it’s not the first time that supposedly anonymized, open-source databases have been used to uncover activities meant to stay hidden.
Google Maps has been used almost since its inception to inspect overhead imagery of suspected military or intelligence bases. Initially, the company simply published the satellite shots it received from commercial imagery providers (a feat that required no small amount of big data crunching in the first place), but soon national security agencies began demanding that sensitive regions be censored. Then some researchers figured out that you could spot unacknowledged but sensitive facilities simply by looking for places with missing map data (Wikipedia has a convenient list), and now those areas are blurred out less obviously.
But for bases in foreign territories, there’s no hiding from the big eyes in the sky.
Although the Strava incident wasn’t the first case where big data spilled the beans, the bigger problem is that the more such incidents occur, the more they tend to corroborate one another. As Ruser was quick to point out, simply finding a Fitbit track in the desert was not absolute proof that it was worn by a soldier or marked the site of a military base. But the chances were high enough that most would bet on it; other explanations just seemed unlikely.
By combining the Strava data with other big data sources, though, other security researchers, such as Marcus Ranum, were quickly able to narrow those possibilities down to near-certainty. And if a handful of security researchers browsing Twitter in their spare time can compromise national security, what exactly is happening among the professionals who are paid to discover such compromises?
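A rough sense of how little such corroboration takes: the sketch below cross-references activity hotspots against a list of known civilian locations and flags any cluster with no nearby civilian explanation. It is a hypothetical illustration of the linkage idea rather than any researcher’s actual method; the function names, the 20 km threshold, and both coordinate lists are invented for the example.

```python
# Illustrative linkage check, not any researcher's actual tooling:
# flag activity hotspots that have no nearby civilian explanation.
from math import radians, sin, cos, asin, sqrt

def haversine_km(a, b):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(radians, (*a, *b))
    h = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371 * asin(sqrt(h))

def unexplained_hotspots(hotspots, civilian_sites, min_distance_km=20):
    """Return hotspots at least min_distance_km from every known civilian site."""
    return [spot for spot in hotspots
            if all(haversine_km(spot, site) >= min_distance_km for site in civilian_sites)]

# A bright cluster of jogging loops far from the nearest known town survives the
# filter and becomes a candidate for the "probably a base" pile.
print(unexplained_hotspots(hotspots=[(34.50, 40.75)], civilian_sites=[(33.51, 36.29)]))
```

Swap in richer civilian datasets (towns, gyms, tourist trails) and the list of plausible innocent explanations shrinks quickly, which is exactly the corroboration effect described above.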
The Incidents You Hear About Are The Tip of The Iceberg
This is big data turned against itself: supposedly anonymized information combined across comparison sets so large that it is, in fact, no longer anonymous or private. Bellingcat, a privately run open-source intelligence site, regularly performs such analyses and publishes techniques anyone can use to do the same, such as automatically verifying images against the entire YouTube video catalog.
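Image verification of that kind ultimately rests on a simple building block: reducing each frame to a compact perceptual fingerprint and comparing fingerprints rather than raw pixels. The snippet below sketches that comparison using the third-party Pillow and imagehash libraries; it is a generic illustration, not Bellingcat’s actual tooling, and the frame paths and distance threshold are placeholders.

```python
# Generic perceptual-hash comparison: a basic building block of automated
# image verification. Not Bellingcat's tooling; the paths and threshold
# below are placeholders.
from PIL import Image          # pip install pillow
import imagehash               # pip install imagehash

def frames_match(path_a, path_b, max_distance=8):
    """Treat two still frames as a match if their perceptual hashes are close."""
    hash_a = imagehash.phash(Image.open(path_a))
    hash_b = imagehash.phash(Image.open(path_b))
    return (hash_a - hash_b) <= max_distance   # subtraction gives the Hamming distance

# e.g. frames_match("claimed_strike_photo.png", "archived_video_frame.png")
```

Run at scale against a large video archive, the same comparison lets an analyst confirm or debunk where and when a widely shared image actually originated.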
And the public exposures may be only the tip of the iceberg. Ruser, a student, was the first to point out the Strava problem, but by then the data had been in the wild for more than two months. Professional intelligence agencies have dedicated units that hunt for exactly this kind of open-source intelligence scoop, and there are excellent odds that the information had been compromised long before Ruser tweeted it out.
Combined with other datasets that may only be available to intelligence services, the Strava data could be even more revealing.
And though Strava has been criticized for releasing the information it held, there is hardly any guarantee the data would have been safer had the company never published its heatmap. Hacking is an endemic condition for individuals and businesses alike today, with 780 reported data breaches in 2015 leading to the loss of 177,866,236 personal records, according to the Identity Theft Resource Center. And those are only the reported breaches; many more companies attempt to cover up security incidents, as Uber did with a breach in late 2016.
Many of those hackers are criminals, but others are state actors, presumably looking for exactly the sort of derived intelligence that the Strava leak exposed.
Data scientists who fail to consider the implications of collecting and holding very large datasets are failing their ethical obligations to the profession, and quite possibly damaging its future.