Web transparency research at Columbia

Today’s Web services leverage users’ information – such as emails, search logs, or locations – and use it to target advertisements, prices, or products at users. Presently, users have little insight into how their data is used for such purposes. To enhance transparency, we are building a new set of tools that detect what data – such as emails or searches – is used to target which ads in Gmail, which prices in Amazon, etc. The key insight is to compare the ads and prices seen by different accounts that hold similar, but not identical, subsets of the data.


XRay is the first tool for revealing personal data use on the Web. It reveals which specific data inputs (such as emails) are used to target which outputs (such as ads). It is general and can track data use both within and across arbitrary Web services. The key idea behind XRay is to detect targeting through black-box input/output correlation. XRay populates a series of extra accounts with subsets of the inputs, then compares the differences and commonalities between the outputs those accounts receive to correlate each output with the inputs that most likely triggered it.
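
The correlation idea above can be illustrated with a minimal sketch. It is not XRay's actual algorithm: the function names, the scoring rule (difference in display rates between accounts with and without an input), and the 0.5 threshold are all assumptions chosen for illustration.

```python
def targeting_scores(accounts, observations, output):
    """Score each input by how strongly its presence predicts `output`.

    accounts:     list of sets, the inputs placed in each extra account
    observations: list of sets, the outputs each account received
    Returns {input: rate_with_input - rate_without_input}.
    (Hypothetical scoring rule, used here only for illustration.)
    """
    inputs = set().union(*accounts)
    scores = {}
    for inp in inputs:
        seen_with = [output in obs
                     for acc, obs in zip(accounts, observations) if inp in acc]
        seen_without = [output in obs
                        for acc, obs in zip(accounts, observations) if inp not in acc]
        if not seen_with or not seen_without:
            continue  # the input is in every account or in none: nothing to compare
        scores[inp] = (sum(seen_with) / len(seen_with)
                       - sum(seen_without) / len(seen_without))
    return scores


def likely_target(accounts, observations, output, threshold=0.5):
    """Return the best-scoring input, or None if no score reaches the threshold."""
    scores = targeting_scores(accounts, observations, output)
    if not scores:
        return None
    best = max(scores, key=scores.get)
    return best if scores[best] >= threshold else None


# Toy data: four extra accounts, each holding a different subset of three emails.
accounts = [{"email_a", "email_b"}, {"email_a"}, {"email_b"}, {"email_c"}]
# The ad appears exactly in the accounts that contain email_a.
observations = [{"ad_1"}, {"ad_1"}, set(), set()]

print(likely_target(accounts, observations, "ad_1"))
```

In this toy data the ad follows `email_a` across accounts, so that input gets the highest score. A real deployment would need many more accounts and a principled way to handle noise in what each account is shown.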


Sunlight is an analysis pipeline that provides causal targeting detection with statistical confidence, and at scale. In the paper, we propose a four-step pipeline to form and assess targeting hypotheses. Our pipeline is built in a modular way and allows extensive comparison of different algorithms. We highlight a fundamental scalability trade-off between the number of hypotheses we can make and the confidence we have in these hypotheses.
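
The trade-off between the number of hypotheses and the confidence in each can be illustrated with a standard multiple-testing correction. The sketch below uses the Bonferroni correction purely as an example; the text above does not say which correction Sunlight applies, and the function name and p-values are hypothetical.

```python
def bonferroni_accept(p_values, alpha=0.05):
    """Flag which hypotheses pass a Bonferroni-corrected significance test.

    With m hypotheses tested together, each one must reach
    p <= alpha / m so that the overall chance of any false
    positive stays at most alpha.
    """
    m = len(p_values)
    if m == 0:
        return []
    cutoff = alpha / m
    return [p <= cutoff for p in p_values]


# The same evidence (p = 0.001) tested alongside 9 or 999 other hypotheses.
few = bonferroni_accept([0.001] + [0.5] * 9)     # cutoff = 0.05 / 10
many = bonferroni_accept([0.001] + [0.5] * 999)  # cutoff = 0.05 / 1000

print(few[0], many[0])
```

The hypothesis with p = 0.001 passes when 10 hypotheses are tested but fails when 1000 are: testing more hypotheses raises the bar each one must clear.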

The team

Mathias Lecuyer, Riley Spahn, Yannis Spiliopoulos, Augustin Chaintreau, Roxana Geambasu, Daniel Hsu