XRay

New tool to increase the Web's transparency

The Problem

We live in a data-driven world. Many of the Web services, mobile apps, and third parties we interact with daily are collecting immense amounts of information about us – every location, click, search, email, document, and site that we visit. And they are using all of this information for various purposes. Some of these uses might be beneficial for us (e.g., recommendations for new videos to watch or songs to hear); other uses may not be as beneficial. The problem is that we have limited visibility into how our data is being used, and hence we are vulnerable to potential abuses.

For example, did you know that credit companies might be adjusting loan offers based on your Facebook data? Or that certain travel companies used to price-discriminate based on user profile and location? Or that some companies target ads at illness-related emails, and that clicking on those ads can leak sensitive information to the advertisers? Maybe you already knew these things in the abstract, but do you always know when such things are happening to you? Not always, we bet.

At Columbia, we have spent the past several years pondering the following related question: Can we build tools that increase visibility into what Web services are doing with users’ data? If Web services are tracking our data, we wish in turn to track their use of it. For example, wouldn’t it be great if we knew which emails trigger which ads, or which prior purchases trigger which recommendations or prices? Or whether our services share our data with third parties, and then how those parties use the data? We believe that such visibility would be valuable not only to users but also to auditors, such as researchers, journalists, or regulators, who can serve as watchdogs of this data-driven world.

Unfortunately, revealing data use in the uncontrolled Web is incredibly difficult, and hardly any tools exist to do so. Worse, the scientific foundations – the algorithms, mechanisms, and protocols – for doing so are largely non-existent. While some tools exist for revealing data collection by Web services, none of them can reveal data use. Our research, then, aims to build both the tools and the scientific building blocks necessary to reveal data use on the Web.

XRay

Today, we are releasing XRay, the first tool for revealing personal data use on the Web. It reveals which specific data inputs (such as emails) are used to target which outputs (such as ads). It is general and can track data use both within and across arbitrary Web services. The key idea behind XRay is to detect targeting through black-box input/output correlation. XRay populates a series of extra accounts with subsets of the inputs and then looks at the differences and commonalities between the outputs that they get in order to obtain correlation. This mechanism is effective at detecting certain types of data uses, though not all. For its details, please refer to our research paper, which will appear in August at USENIX Security 2014, a top systems security conference.
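To make the black-box correlation idea concrete, here is a deliberately simplified sketch. It is not the algorithm from the research paper, only an illustration under our own assumptions: each extra account holds a subset of the inputs, and an output is attributed to the input whose presence most increases how often that output appears. The function name `correlate`, the `min_gap` threshold, and the sample data are all hypothetical.

```python
from typing import Dict, List, Optional, Set


def correlate(
    account_inputs: List[Set[str]],
    account_outputs: List[Set[str]],
    min_gap: float = 0.5,
) -> Dict[str, Optional[str]]:
    """For each output, guess the input it most plausibly targets.

    account_inputs[k]  : inputs (e.g. email IDs) placed in extra account k
    account_outputs[k] : outputs (e.g. ad IDs) observed in extra account k
    min_gap            : hypothetical threshold on the appearance-rate gap
    """
    all_inputs = set().union(*account_inputs)
    all_outputs = set().union(*account_outputs)
    pairs = list(zip(account_inputs, account_outputs))
    guesses: Dict[str, Optional[str]] = {}

    for out in sorted(all_outputs):
        best_input, best_gap = None, min_gap
        for inp in sorted(all_inputs):
            with_inp = [o for i, o in pairs if inp in i]
            without_inp = [o for i, o in pairs if inp not in i]
            if not with_inp or not without_inp:
                continue  # no contrast available for this input
            # How often the output shows up with vs. without the input.
            rate_with = sum(out in o for o in with_inp) / len(with_inp)
            rate_without = sum(out in o for o in without_inp) / len(without_inp)
            gap = rate_with - rate_without
            if gap > best_gap:
                best_input, best_gap = inp, gap
        guesses[out] = best_input  # None means "no input stands out"
    return guesses


# Made-up example: four extra accounts, three emails, two ads.
inputs = [{"flight", "diet"}, {"flight", "loan"}, {"diet", "loan"}, {"loan"}]
outputs = [
    {"airline_ad", "generic_ad"},
    {"airline_ad", "generic_ad"},
    {"generic_ad"},
    {"generic_ad"},
]
guesses = correlate(inputs, outputs)
print(guesses)  # {'airline_ad': 'flight', 'generic_ad': None}
```

In this toy data the airline ad appears only in accounts that contain the flight email, so it is attributed to that email, while the generic ad appears everywhere and is reported as untargeted. Real services are far noisier, which is one reason to read the paper for the actual mechanism.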

Our current XRay prototype works with Gmail, YouTube, and Amazon. It can correlate ads in Gmail to the emails they target, and recommendations in YouTube and Amazon to the previously viewed videos and products, respectively, that trigger them. However, XRay’s correlation mechanism – its “brain” – is service-agnostic and can be reused as a building block to construct future tools that reveal targeting in other services.

We evaluated XRay across the three services it currently supports. Unlike Amazon and YouTube, Gmail does not provide detailed explanations of its targeting, so we manually validated XRay’s correlations. For all these very different services, XRay predicted targeting with 80-90% accuracy without requiring a single change in its correlation mechanisms. Moreover, we have shown both theoretically and experimentally that XRay scales surprisingly well, requiring only a modest number of extra accounts to track use of a large number of inputs.

We know of no other system that comes close to XRay’s generality, accuracy, or scale at detecting targeting on the Web. We hope that its reusable components can bolster the creation of a new generation of auditing tools that will help lift the curtain on how personal data is being used. We thus consider XRay a major new step toward increased transparency in this data-driven Web.

What We Release

While our long-term plans for XRay and Web transparency are ambitious, our prototype is still in a research stage. Many difficult challenges remain open for revealing data use in this complex Web world, including robustness in the face of malicious services, usability, and ease of instantiation on more services.

To spur further progress in this important, and largely unexplored, area of Web transparency, we are releasing several artifacts:

Our prototype’s source code, which can be used by researchers to both improve XRay and instantiate new tools on it to reveal data targeting on new Web services.
Our USENIX Security paper, which gives the necessary details to understand our system’s design, quirks, and limitations. It should be read before using our prototype!