Proper Dataset Valuation by Pointwise Mutual Information

Sunday, Jun. 22, 2025


Time:   3:30 p.m. to 4:10 p.m.
Location: Lecture Hall, 1st Floor, Lei Jun Building, Wuhan University

Shuran Zheng
Tsinghua University
Title:  Proper Dataset Valuation by Pointwise Mutual Information
Abstract:   Data plays a central role in the development of modern artificial intelligence, with high-quality data emerging as a key driver of model performance. This has prompted the development of various data curation methods in recent years. However, measuring the effectiveness of these data curation techniques remains a major challenge. Traditional evaluation methods, which assess a trained model's performance on specific benchmarks, risk promoting practices that merely make the data more similar to the test data. This issue exemplifies Goodhart’s law: when a measure becomes a target, it ceases to be a good measure. To address this, we propose an information-theoretic framework for evaluating data curation methods, in which the quality of a dataset is measured by its informativeness about the true model parameters under the Blackwell ordering. We compare informativeness using the Shannon mutual information between the evaluated data and the test data, and we propose a novel method for estimating the mutual information between datasets by training Bayesian models on embedded data and computing the mutual information from the models' parameter posteriors. Experiments on real-world data demonstrate that our mutual information-based evaluation assigns appropriately lower scores to data curation strategies that reduce dataset informativeness, whereas traditional test score-based evaluation methods may favor data curation strategies that overfit to the test set but compromise the training data's informativeness.
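To make the idea of computing mutual information from parameter posteriors concrete, here is a minimal sketch. It is not the speaker's estimator. It assumes a conjugate Bayesian linear regression (Gaussian prior on the parameters, Gaussian label noise) over already-embedded features, where the Shannon mutual information between the labels of two datasets has a closed form in the posterior precision matrices. The function names and the `prior_var` and `noise_var` parameters are hypothetical, chosen for illustration only. Note that in this toy model the Shannon (expected) mutual information depends only on the features; a pointwise variant would also depend on the observed labels.

```python
import numpy as np


def posterior_precision(X, prior_precision, noise_var):
    """Posterior precision of the weights after observing features X.

    Assumed model: y = X @ theta + eps, eps ~ N(0, noise_var * I),
    with a Gaussian prior on theta whose precision is prior_precision.
    """
    return prior_precision + X.T @ X / noise_var


def dataset_mutual_information(X_d, X_t, prior_var=1.0, noise_var=1.0):
    """Shannon MI (in nats) between the labels of datasets D and T.

    The two datasets are conditionally independent given theta, so
        I(D; T) = I(theta; D) + I(theta; T) - I(theta; D, T),
    and for a Gaussian model
        I(theta; D) = 0.5 * (logdet(posterior precision) - logdet(prior precision)).
    """
    dim = X_d.shape[1]
    prec_0 = np.eye(dim) / prior_var
    prec_d = posterior_precision(X_d, prec_0, noise_var)
    prec_t = posterior_precision(X_t, prec_0, noise_var)
    prec_dt = posterior_precision(np.vstack([X_d, X_t]), prec_0, noise_var)

    def logdet(m):
        # slogdet returns (sign, log|det|); precisions are positive definite.
        return np.linalg.slogdet(m)[1]

    return 0.5 * (logdet(prec_d) + logdet(prec_t) - logdet(prec_0) - logdet(prec_dt))
```

Under these assumptions, two datasets whose embedded features span orthogonal directions share no information about the parameters and score zero, while datasets that overlap in feature space score strictly higher.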
CV:   Shuran Zheng is a tenure-track Assistant Professor in the Institute for Interdisciplinary Information Sciences at Tsinghua University. She obtained her Ph.D. in Computer Science from Harvard University, and was a postdoctoral researcher at Carnegie Mellon University and a Student Researcher in the Market Algorithms Group at Google Research NYC. Her research lies at the intersection of Computer Science and Economics, and she is particularly interested in understanding the value of data and information. She explores various areas including data valuation, data markets, information elicitation, information aggregation, and information design.