A pair of German researchers have shown just how easy it is to identify individuals and track their internet browsing habits in detail using supposedly ‘anonymised’ data sources.
The research, presented at DefCon in Las Vegas by journalist Svea Eckert and data scientist Andreas Dewes, sheds light on the practices of companies that collect user data, either for their own purposes or to sell on to third parties.
The data is, in theory, stripped of identifying information before being used, but Eckert and Dewes found that in some cases as few as 10 web addresses are enough to identify who a ‘clickstream’ belongs to.
Once the individual has been identified, often by matching particular information in the clickstream to data that’s publicly available, the stream indicates everything that user has been doing online, minute by minute, Eckert and Dewes said.
After matching one clickstream to a particular police detective, they found details of an ongoing police investigation by examining Google Translate URLs, which store the full text of any query.
In many cases identity could be established by examining particular URLs – for instance, if someone logs into their own analytics page on Twitter an address is generated that contains their own username.
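As a rough illustration of how a single URL can give away an identity, the sketch below pulls an account name out of an analytics-style address. The host name and the `/user/<username>/...` path layout are assumptions made for this example, not details confirmed by the researchers, and `username_from_analytics_url` is a hypothetical helper.

```python
from urllib.parse import urlparse


def username_from_analytics_url(url):
    """Return the account name embedded in an analytics-style URL, or None.

    Assumed (not confirmed by the source) layout:
    https://analytics.twitter.com/user/<username>/...
    """
    parsed = urlparse(url)
    if parsed.hostname != "analytics.twitter.com":
        return None
    # Drop empty segments produced by leading or trailing slashes.
    parts = [segment for segment in parsed.path.split("/") if segment]
    if len(parts) >= 2 and parts[0] == "user":
        return parts[1]
    return None


# Made-up example URL; "example_user" is a placeholder account name.
name = username_from_analytics_url(
    "https://analytics.twitter.com/user/example_user/tweets"
)
```

One such address in an otherwise pseudonymous clickstream would be enough to put a name to every other URL in it.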
In other cases the clickstream might show that the user visited a particular page, say a YouTube video, at a particular time, while the individual has mentioned looking at that video on a publicly visible source such as a blog.
“The increase in publicly available information on many people makes de-anonymisation via linkage attacks easier than ever before,” the researchers said in the presentation.
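A linkage attack of this kind can be sketched in a few lines. The version below is a simplified illustration, not the researchers' actual method: it assumes each clickstream is just a list of URLs under a pseudonymous ID, and it keeps only the streams that contain every URL a person is publicly known to have visited. All IDs and URLs are made up.

```python
def matching_streams(clickstreams, public_urls):
    """Return the IDs of clickstreams that contain every publicly known URL."""
    known = set(public_urls)
    return [
        stream_id
        for stream_id, urls in clickstreams.items()
        if known <= set(urls)  # subset test: all known URLs appear in the stream
    ]


# Hypothetical pseudonymous clickstreams.
clickstreams = {
    "id_001": ["https://example.org/news", "https://example.org/video/1"],
    "id_002": ["https://example.org/video/1", "https://example.org/blog/42"],
    "id_003": ["https://example.org/blog/42"],
}

# URLs a person has publicly said they visited, for example in a blog post.
public_urls = ["https://example.org/video/1", "https://example.org/blog/42"]

candidates = matching_streams(clickstreams, public_urls)
```

When only one candidate remains, the rest of that stream can be attributed to the person. Each additional known URL shrinks the candidate set, which is consistent with the finding that a handful of addresses can be enough.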
The data was surprisingly easy to obtain, with 95 percent of it coming from 10 popular browser extensions. Such extensions offer users a service, but also monitor everything they do online and either use that data themselves, for purposes such as targeting advertisements, or sell it to third parties.
They estimated up to 10,000 extensions collect detailed user data, but most have a relatively small user base.
“When thinking about surveillance, everyone worries about government agencies like the NSA and big corporations like Google and Facebook,” Eckert wrote in a blog post. “But actually there are hundreds of companies that have also discovered data collection as a revenue source… Most of them keep their data to themselves, some exchange it, but a few sell it to anyone who’s willing to pay.”
The researchers posed as a marketing company that wanted to buy browsing information to train its machine learning tools, and it took them about two weeks to obtain one month’s worth of browsing information on three million German users, compiling a database of 3 billion URLs spread across 9 million sites.
The information was so sensitive they deleted it after the investigation for fear it might fall into the hands of hackers. In her blog post Eckert said the way browser extension companies collect and resell user data is “often illegal” under European law.
A data broker gave Eckert and Dewes, free of charge, information obtained from browser plugins including Web Of Trust (WOT), which, ironically, provides reviews of websites’ privacy practices.
After German public broadcasting network NDR published a report based on the study last November, WOT reworked its extensions and mobile app to better protect users’ anonymity, also giving users the ability to opt out of data collection.
But Eckert and Dewes said it’s next to impossible to make a clickstream fully anonymous.
“High-dimensional, user-related data is really hard to robustly anonymise, even if you really try to do so,” they said in the DefCon presentation.
Users who want to anonymise their clickstream themselves can use services such as Tor or a VPN with rotating exit nodes, or client-side software that blocks trackers, they said.