TPDP 2025 will take place on June 2 and 3 at Google in Mountain View, CA.
Venue Information: Click here for venue information, including exact location of the workshop, parking information, and check-in information.
Logistics: The workshop will be held at Google Mountain View. Nearby airports include San Francisco International Airport (SFO) and San Jose International Airport (SJC). Nearby hotels include the Ameswell Hotel, Hyatt Centric Mountain View, Shashi Hotel Mountain View Palo Alto, and Aloft Mountain View.
Registration: Registration is closed.
DC-Area Watch Party: Christine Task at Knexus Research is hosting a DC-area Watch Party for those who can't travel to CA but would still like to meet in person to watch presentations and discuss research. The Watch Party will take place at Knexus Research (1951 Kidwell Dr, Vienna VA 22182). It will run 11:30am - 6pm EDT on June 2, and 11:30am - 3pm EDT on June 3. Lunch is brown bag (bring your own) with snacks and coffee/tea provided, and a group dinner will be organized if there is sufficient interest. Please register here for the DC-Area Watch Party by May 23. Registration for the DC-Area Watch Party is separate from TPDP registration (but also free) - please fill out both registration forms if you plan to attend. Contact Christine Task with DC-specific questions.
Day 1 (June 2)

8:00-9:00 | Breakfast
9:00-9:05 | Welcome
9:05-9:50 | Keynote #1
Practical differentially private statistical estimation should ideally combine strong error guarantees, computational efficiency, robustness, and minimal reliance on user-specified assumptions. In this talk, I will highlight recent progress toward these goals, through the fundamental problem of mean estimation. In the first part, I will present differentially private algorithms for high-dimensional mean estimation whose error optimally adapts to the effective dimensionality of the distribution. These estimators can achieve dimension-free error whenever possible—for instance, for distributions that are concentrated on a small number of principal components—overcoming a limitation of prior methods, which suffer from a curse of dimensionality and require sample sizes that scale with the ambient dimension even in these favorable regimes. In the second part, I will discuss recent results that uncover both limitations and new directions for designing computationally efficient estimators with affine-invariant error guarantees and robustness properties. This talk is primarily based on joint work with Gavin Brown, Yuval Dagan, Michael Jordan, Xuelin Yang, and Nikita Zhivotovskiy.
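For context, below is a minimal sketch of the standard clip-and-noise baseline for DP mean estimation. It is our own illustration (not the speaker's estimators), with hypothetical names; its added noise has norm on the order of sqrt(d), which is exactly the ambient-dimension dependence the talk aims to avoid.

```python
import numpy as np

def dp_mean_gaussian(x, clip_norm, eps, delta, rng=None):
    """Baseline (eps, delta)-DP mean estimate: clip each sample, average,
    add Gaussian noise. The noise vector has expected norm ~ sqrt(d) * sigma,
    independent of how the data are actually spread across directions."""
    rng = np.random.default_rng() if rng is None else rng
    n, d = x.shape
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    clipped = x * np.minimum(1.0, clip_norm / np.maximum(norms, 1e-12))
    # Replacing one sample moves the clipped mean by at most 2 * clip_norm / n.
    sensitivity = 2.0 * clip_norm / n
    # Classical Gaussian-mechanism calibration (valid for eps <= 1).
    sigma = sensitivity * np.sqrt(2.0 * np.log(1.25 / delta)) / eps
    return clipped.mean(axis=0) + rng.normal(0.0, sigma, size=d)
```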
9:50-10:35 | Contributed Talks: Session #1
We present a principled, per-instance approach to quantifying the difficulty of unlearning via fine-tuning. We begin by sharpening an analysis of noisy gradient descent for unlearning, obtaining a better utility--unlearning tradeoff by replacing worst-case privacy loss bounds with per-instance privacy losses, each of which bounds the (Rényi) divergence to retraining without an individual data point. To demonstrate the practical applicability of our theory, we present empirical results showing that our theoretical predictions are borne out both for Stochastic Gradient Langevin Dynamics (SGLD) and for standard fine-tuning without explicit noise. We further demonstrate that per-instance privacy losses correlate well with several existing data difficulty metrics, while also identifying harder groups of data points, and introduce novel evaluation methods based on loss barriers. Altogether, our findings provide a foundation for more efficient and adaptive unlearning strategies tailored to the unique properties of individual data points.

We consider the privacy amplification properties of a sampling scheme in which a user's data is used in $k$ steps chosen randomly and uniformly from a sequence (or set) of $t$ steps. This sampling scheme has been recently applied in the context of differentially private optimization [Chua et al., 2024a, Choquette-Choo et al., 2024] and is also motivated by communication-efficient high-dimensional private aggregation [Asi et al., 2025]. Existing analyses of this scheme either rely on privacy amplification by shuffling, which leads to overly conservative bounds, or require Monte Carlo simulations that are computationally prohibitive in most practical scenarios.

We study the problem of reconstructing tabular data from aggregate statistics, in which the attacker aims to identify interesting claims about the sensitive data that can be verified with 100% certainty given the aggregates. Successful attempts in prior work have conducted studies in settings where the set of published statistics is rich enough that entire datasets can be reconstructed with certainty. In our work, we instead focus on the regime where many possible datasets match the published statistics, making it impossible to reconstruct the entire private dataset perfectly (i.e., when approaches in prior work fail). We propose the problem of partial data reconstruction, in which the goal of the adversary is instead to output a subset of rows and/or columns that are guaranteed to be correct. We introduce a novel integer programming approach that first generates a set of claims and then verifies whether each claim holds for all possible datasets consistent with the published aggregates. We evaluate our approach on housing-level microdata from the U.S. Decennial Census release, demonstrating that privacy violations can still persist even when the information published about such data is relatively sparse.
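The first abstract in this session replaces worst-case privacy losses with per-instance ones. As a toy illustration of that idea (our own simplification, with illustrative names), the sketch below computes the order-alpha Rényi divergence between one noisy clipped-gradient step run with and without a given point, using the point's actual clipped gradient norm in place of the worst-case clipping bound.

```python
import numpy as np

def per_instance_rdp_one_step(grad_z, sigma, alpha=2.0, clip_norm=1.0):
    """Order-alpha Renyi divergence between one noisy-gradient step computed
    with and without the point z, when per-example gradients are clipped to
    clip_norm and Gaussian noise N(0, sigma^2 I) is added to their sum.
    A worst-case bound would use clip_norm; the per-instance bound uses the
    point's actual (clipped) gradient norm, which is often much smaller."""
    g = np.asarray(grad_z, dtype=float)
    shift = min(np.linalg.norm(g), clip_norm)        # how much z moves the mean
    return alpha * shift ** 2 / (2.0 * sigma ** 2)   # Renyi divergence of a Gaussian shift

# A point with a small gradient incurs a much smaller per-instance loss:
print(per_instance_rdp_one_step(np.array([0.05, 0.1]), sigma=1.0))
```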
10:35-11:00 | Break
11:00-12:30 | Poster Session #1
12:30-1:30 | Lunch (provided)
1:30-2:30 | Panel Discussion: Using DP Data

Panelists: Jörg Drechsler (Institute for Employment Research), Stefano Iacus (Harvard), Harikesh Nair (Google)
2:30-2:50 | Break
2:50-3:35 | Contributed Talks: Session #2
We propose a simple heuristic privacy analysis of noisy clipped stochastic gradient descent (DP-SGD) in the setting where only the last iterate is released and the intermediate iterates remain hidden. Namely, our heuristic assumes a linear structure for the model. We show experimentally that our heuristic is predictive of the outcome of privacy auditing applied to various training procedures; thus it can be used prior to training as a rough estimate of the final privacy leakage. We also probe the limitations of our heuristic by providing some artificial counterexamples where it underestimates the privacy leakage. The standard composition-based privacy analysis of DP-SGD effectively assumes that the adversary has access to all intermediate iterates, which is often unrealistic; however, this analysis remains the state of the art in practice. While our heuristic does not replace a rigorous privacy analysis, it illustrates the large gap between the best theoretical upper bounds and the privacy auditing lower bounds, and sets a target for further work to improve the theoretical privacy analyses. We also empirically support our heuristic and show that existing privacy auditing attacks are bounded by our heuristic analysis in both vision and language tasks.

Recent research has demonstrated that training large language models involves memorization of a significant fraction of the training data. Such memorization can lead to privacy violations when training on sensitive user data and thus motivates the study of data memorization's role in learning. In this work, we demonstrate that several simple and well-studied binary classification problems exhibit a trade-off between the number of samples available to a learning algorithm and the amount of information about the training data that the algorithm needs to memorize to be accurate. In particular, $\Omega(d)$ bits of information about the training data need to be memorized when a single $d$-dimensional example is available, which then decays as $\Theta(d/n)$ as the number of examples grows (for $n\leq \sqrt{d}$). Further, this rate is achieved (up to logarithmic factors) by simple learning algorithms. Our results build on the work of Brown et al. (2021) and establish a new framework for proving memorization lower bounds that is based on an approximate version of strong data processing inequalities.

We study differentially private algorithms for analyzing graphs in the challenging setting of continual release with fully dynamic updates, where edges are inserted and deleted over time and the algorithm is required to update the solution at every time step. Previous work has presented differentially private algorithms for many graph problems that can handle insertions only or deletions only (called partially dynamic algorithms) and obtained some hardness results for the fully dynamic setting. The only algorithms in the latter setting were for the edge count, given by Fichtenberger, Henzinger, and Ost (ESA 21), and for releasing the values of all graph cuts, given by Fichtenberger, Henzinger, and Upadhyay (ICML 23). We provide the first differentially private and fully dynamic graph algorithms for several other fundamental graph statistics (including the triangle count, the number of connected components, the size of the maximum matching, and the degree histogram), analyze their error, and show strong lower bounds on the error for all algorithms in this setting. We study two variants of edge differential privacy for fully dynamic graph algorithms: event-level and item-level. We give upper and lower bounds on the error of both event-level and item-level fully dynamic algorithms for several fundamental graph problems. No fully dynamic algorithms that are private at the item level (the more stringent of the two notions) were known before. In the case of item-level privacy, for several problems, our algorithms match our lower bounds.
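The first abstract in this session assumes a linear model, under which the released last iterate behaves like a single Gaussian mechanism: the example's clipped gradient enters the summed update once per participation, while the per-step noises accumulate. The sketch below (our own reading, with illustrative names and a standard zCDP-style conversion) turns that picture into a heuristic epsilon.

```python
import numpy as np

def last_iterate_heuristic_eps(clip_norm, noise_std, steps, participations, delta):
    """Heuristic epsilon for releasing only the last DP-SGD iterate under the
    linear-model assumption: gradients do not depend on the iterate, so the
    last iterate is an affine function of a single Gaussian mechanism.
    Sensitivity: the example shifts the summed update by at most
    participations * clip_norm. Noise: per-step Gaussians add up to variance
    steps * noise_std^2. The bound below is the usual zCDP -> (eps, delta)
    conversion for a Gaussian mechanism."""
    sensitivity = participations * clip_norm
    total_var = steps * noise_std ** 2
    rho = sensitivity ** 2 / (2.0 * total_var)
    return rho + 2.0 * np.sqrt(rho * np.log(1.0 / delta))

# Example: 1000 steps, per-step noise std equal to the clip norm,
# and an example that participates in 10 of those steps.
print(last_iterate_heuristic_eps(1.0, 1.0, steps=1000, participations=10, delta=1e-6))
```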
3:40-5:10 | Poster Session #2
5:20-7:00 | Reception
Day 2 (June 3)

8:00-9:00 | Breakfast
9:00-9:50 | Keynote #2
In this talk I try to stitch together some recent work - some my own and some others' - to probe practical, important, and intertwined questions: How can we provide and evaluate differentially private (DP) synthetic data to address the epistemic concerns of practitioners? Are we targeting the problems practitioners face every day - say, DP under extreme class imbalance, as in EHRs and fraud detection - or merely the ones that make tidy papers? With the advent of LLM supremacy, how do we want to treat LLM output (is it public? is it useful under DP?)? Do we understand the right LLM threat models, and do DP defenses make sense? I will trace these questions from theory to practice as best I can, and frame open challenges that I hope we can address as a community.
9:50-10:35 | Contributed Talks: Session #3
We initiate the study of differentially private learning in the proportional dimensionality regime, in which the number of data samples $n$ and the problem dimension $d$ approach infinity at rates proportional to one another, meaning that $d/n \to \delta$ as $n \to \infty$ for an arbitrary, given constant $0 < \delta < \infty$. This setting is significantly more challenging than that of all prior theoretical work in high-dimensional differentially private learning, which, despite the name, has assumed that $\delta = 0$ or is sufficiently small for problems of sample complexity $O(d)$, a regime typically considered “low-dimensional” or “classical” by modern standards in high-dimensional statistics.

This paper studies the problem of differentially private empirical risk minimization (DP-ERM) for binary linear classification. We obtain an efficient $(\varepsilon,\delta)$-DP algorithm with an empirical zero-one risk bound of $\tilde{O}\left(\frac{1}{\gamma^2\varepsilon n} + \frac{|S_{\mathrm{out}}|}{\gamma n}\right)$, where $n$ is the number of data points, $S_{\mathrm{out}}$ is an arbitrary subset of data one can remove, and $\gamma$ is the margin of linear separation of the remaining data points (after $S_{\mathrm{out}}$ is removed). Here, $\tilde{O}(\cdot)$ hides only logarithmic terms. In the agnostic case, we improve the existing results when the number of outliers is small. Our algorithm is highly adaptive because it does not require knowing the margin parameter $\gamma$ or the outlier subset $S_{\mathrm{out}}$.

We initiate a study of algorithms for model training with user-level differential privacy (DP), where each example may be attributed to multiple users, which we call the multi-attribution model. We first provide a carefully chosen definition of user-level DP under the multi-attribution model. Training in the multi-attribution model is facilitated by solving the contribution bounding problem, i.e., the problem of selecting a subset of the dataset for which each user is associated with a limited number of examples. We propose a greedy baseline algorithm for the contribution bounding problem. We then empirically study this algorithm for a synthetic logistic regression task and a transformer training task, including studying variants of this baseline algorithm that optimize the subset chosen using different techniques and criteria. We find that the baseline algorithm remains competitive with its variants in most settings, and we build a better understanding of the practical importance of a bias-variance tradeoff inherent in solutions to the contribution bounding problem.
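The last abstract in this session centers on the contribution bounding problem. A minimal greedy sketch of that problem is shown below (our own illustration; the paper's baseline and its variants may order or score examples differently): scan the examples once and keep an example only if none of its attributed users has already reached the per-user limit.

```python
from collections import defaultdict

def greedy_contribution_bounding(examples_users, max_per_user):
    """Select a subset of examples so that every user is attributed to at most
    max_per_user selected examples. examples_users maps example id -> set of
    user ids (multi-attribution: an example may belong to several users)."""
    counts = defaultdict(int)
    selected = []
    for ex_id, users in examples_users.items():
        if all(counts[u] < max_per_user for u in users):
            selected.append(ex_id)
            for u in users:
                counts[u] += 1
    return selected

# Example: four examples, some attributed to multiple users.
data = {0: {"alice"}, 1: {"alice", "bob"}, 2: {"bob"}, 3: {"alice", "carol"}}
print(greedy_contribution_bounding(data, max_per_user=1))  # -> [0, 2]
```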
10:35-11:00 | Break
11:00-12:30 | Panel Discussion: DP Law and Policy

Panelists: Ryan Steed (Princeton), Mayana Pereira (Capital One), Nitin Kohli (UC Berkeley), Alex Wood (Harvard)
12:30-1:30 | Lunch (provided)
1:30-2:30 | Contributed Talks: Session #4
Differential Privacy has become the gold standard for quantifying privacy in data analysis and machine learning algorithms. Thanks to nearly two decades of research, there are now multiple competing notions of differential privacy—(ε,δ)-DP, Rényi DP, and Privacy Loss Distribution (PLD) tail—each with its own advantages and limitations. In this talk, I will describe how all these notions can be viewed as primal and dual formulations of each other through Laplace transforms. Notably, the Laplace transform expressions of DP provide an elegant framework for reasoning about DP and its properties. To showcase this perspective, I will introduce a new composition theorem for (ε,δ)-DP that is tight—even in the constants—and improves upon the result of Kairouz et al. (2015), which was previously considered optimal.

We further explore applications of the fingerprinting method in private and adaptive data analysis. By combining the exponential family with a proper version of Stokes' theorem, we develop a simple and versatile fingerprinting framework, and prove several new results. First, we show a sample complexity lower bound of $\widetilde{\Omega}(\log(Q)\sqrt{\log(N)}/\alpha^3)$ for adaptive data analysis on linear queries, assuming the algorithm is both distributionally and empirically accurate. Up to the ``empirically accurate'' caveat, this is nearly tight and matches the upper bound from [BNSSSU'16]. Second, we characterize the sample complexity for answering a workload of $Q$ random counting queries over a universe of size $N$, in both ``high accuracy'' and ``low accuracy'' regimes. Third, we revisit the [BUV'14] sample complexity lower bound for answering a worst-case ensemble of counting queries, and improve the lower bound therein by a $\sqrt{\log(1/\delta)}$ factor, thereby achieving a tight-up-to-constant lower bound.

We study the fundamental problem of estimating an unknown discrete distribution $p$ over $d$ symbols, given $n$ i.i.d. samples from the distribution. We are interested in minimizing the KL divergence between the true distribution and the algorithm's estimate. We first show that the standard minimax objective is too ``worst-case'' to shed light on the optimality of algorithms for this problem; in particular, we prove that simple DP estimators with poor empirical performance are already minimax optimal. This is because the minimax objective focuses on minimizing the estimation error on the worst-case data distribution, and fails to shed light on an algorithm's performance on individual (non-worst-case) instances $p$. Thus, we study this problem from an instance-optimality viewpoint, where the algorithm's error on $p$ is compared to the minimum achievable estimation error over a small local neighborhood of $p$. Under natural notions of local neighborhood, we propose algorithms that achieve instance-optimality up to constant factors, with and without a differential privacy constraint. Our upper bounds rely on (private) variants of the Good-Turing estimator. Our lower bounds use additive local neighborhoods that more precisely capture the hardness of distribution estimation in KL divergence, compared to the permutation and multiplicative neighborhoods considered in prior works.
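The distribution-estimation abstract above argues that simple DP estimators can be minimax optimal yet far from instance optimal. For reference, here is a minimal baseline of that simple kind (our own illustration, not the paper's private Good-Turing variants): Laplace noise on the counts, clipping at zero, a little uniform smoothing so the KL divergence stays finite, and renormalization.

```python
import numpy as np

def dp_discrete_distribution(counts, eps, smoothing=None, rng=None):
    """Baseline eps-DP estimate of a discrete distribution from a histogram.
    Swapping one sample changes the count vector by at most 2 in L1 norm,
    so Laplace(2/eps) noise per coordinate gives eps-DP; the rest is
    post-processing and does not affect privacy."""
    rng = np.random.default_rng() if rng is None else rng
    counts = np.asarray(counts, dtype=float)
    n, d = counts.sum(), counts.size
    noisy = np.maximum(counts + rng.laplace(0.0, 2.0 / eps, size=d), 0.0)
    p_hat = noisy / max(noisy.sum(), 1e-12)
    alpha = smoothing if smoothing is not None else d / (n + d)  # heuristic mixing weight
    return (1.0 - alpha) * p_hat + alpha / d                     # keep all symbols positive

print(dp_discrete_distribution([50, 30, 20, 0], eps=1.0))
```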
We propose a framework to convert $(\varepsilon, \delta)$-approximate Differential Privacy (DP) mechanisms into $(\varepsilon', 0)$-pure DP mechanisms under certain conditions, a process we call ``purification.'' This algorithmic technique leverages randomized post-processing with calibrated noise to eliminate the $\delta$ parameter while achieving a near-optimal privacy-utility tradeoff for pure DP. It enables a new design strategy for pure DP algorithms: first run an approximate DP algorithm satisfying certain conditions, and then purify. This approach allows one to leverage techniques such as strong composition and propose-test-release, which require $\delta>0$, in designing pure-DP methods with $\delta=0$. We apply this framework in various settings, including Differentially Private Empirical Risk Minimization (DP-ERM), stability-based release, and query release tasks. To the best of our knowledge, this is the first work with a statistically and computationally efficient reduction from approximate DP to pure DP. Finally, we illustrate the use of this reduction for proving lower bounds under approximate DP constraints with explicit dependence on $\delta$, avoiding the sophisticated fingerprinting code construction. https://arxiv.org/abs/2503.21071
2:30-4:00 | Poster Session #3
4:00-4:45 | Contributed Talks: Session #5
This work introduces Mayfly, a federated analytics approach enabling aggregate queries over ephemeral on-device data streams without central persistence of sensitive user data. Mayfly minimizes data via SQL-programmable on-device windowing and contribution bounding, anonymizes user data via streaming differential privacy (DP), and mandates immediate in-memory cross-device aggregation on the server -- ensuring only privatized aggregates are revealed to data analysts. Deployed for a sustainability use case estimating transportation carbon emissions from private location data, Mayfly computed over 4 million statistics across more than 500 million devices with a per-device, per-week DP ε=2 while meeting strict data utility requirements. To achieve this, we designed a new DP mechanism for Group-By-Sum workloads that leverages statistical properties of location data, with potential applicability to other domains.

In distributed differential privacy, multiple parties collaboratively analyze their combined data while protecting the privacy of each party's data from the eyes of the others. Interestingly, for certain fundamental two-party functions like inner product and Hamming distance, the accuracy of distributed solutions significantly lags behind what can be achieved in the centralized model. However, under computational differential privacy, these limitations can be circumvented using oblivious transfer via secure multi-party computation. Yet, no results show that oblivious transfer is indeed necessary for accurately estimating a non-Boolean functionality. In particular, for the inner-product functionality, it was previously unknown whether oblivious transfer is necessary even for the best possible constant additive error. In this work, we prove that any computationally differentially private protocol that estimates the inner product over $\{-1,1\}^n \times \{-1,1\}^n$ up to an additive error of $O(n^{1/6})$ can be used to construct oblivious transfer. In particular, our result implies that protocols with sub-polynomial accuracy are equivalent to oblivious transfer. In this accuracy regime, our result improves upon Haitner, Mazor, Silbak, and Tsfadia [STOC '22], who showed that a key-agreement protocol is necessary.

The technical literature about data privacy largely consists of two complementary approaches: formal definitions of conditions sufficient for privacy preservation and attacks that demonstrate privacy breaches. Differential privacy is an accepted standard in the former sphere. However, differential privacy's powerful adversarial model and worst-case guarantees may make it too stringent in some situations, especially when achieving it comes at a significant cost to data utility. Meanwhile, privacy attacks aim to expose real and worrying privacy risks associated with existing data aggregation systems, but do not identify what properties are necessary to defend against them.
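The Mayfly abstract above mentions a DP mechanism for Group-By-Sum workloads with per-device contribution bounding. The sketch below is a generic baseline of that workload shape (our own illustration, not Mayfly's mechanism, which additionally exploits statistical properties of location data): cap each device's values, bound the number of groups it touches, and add Laplace noise calibrated to the resulting sensitivity.

```python
import numpy as np
from collections import defaultdict

def dp_group_by_sum(device_rows, value_cap, max_groups, eps, rng=None):
    """Baseline eps-DP Group-By-Sum with per-device contribution bounding.
    device_rows: dict device_id -> dict group_key -> value."""
    rng = np.random.default_rng() if rng is None else rng
    sums = defaultdict(float)
    for device_id, contribs in device_rows.items():
        for group, value in list(contribs.items())[:max_groups]:  # bound groups per device
            sums[group] += min(max(value, 0.0), value_cap)        # cap each contribution
    # Removing one device changes the group sums by at most max_groups * value_cap
    # in L1 norm, so Laplace(sensitivity / eps) noise per group gives eps-DP.
    scale = max_groups * value_cap / eps
    return {g: s + rng.laplace(0.0, scale) for g, s in sums.items()}
```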
Differential privacy (DP) is the leading framework for data analysis with rigorous privacy guarantees. In the last two decades, it has transitioned from the realm of pure theory to large-scale, real-world deployments.
Differential privacy is an inherently interdisciplinary field, drawing researchers from a variety of academic communities including machine learning, statistics, security, theoretical computer science, databases, and law. The combined effort across a broad spectrum of computer science is essential for differential privacy to realize its full potential. To this end, this workshop aims to stimulate discussion among participants about both the state-of-the-art in differential privacy and the future challenges that must be addressed to make differential privacy more practical.
Specific topics of interest for the workshop include (but are not limited to):
Submissions: Authors are invited to submit a short abstract of new work or of work published since June 2024 (the most recent TPDP submission deadline). Submissions must be 4 pages maximum, not including references. Submissions may also include appendices, but these are read only at the reviewers' discretion. There is no prescribed style file, but authors should ensure a minimum of 1-inch margins and 10pt font. Submissions are not anonymized and should include author names and affiliations.
Submissions will undergo a lightweight review process and will be judged on originality, relevance, interest, and clarity. Based on the volume of submissions to TPDP 2024 and the workshop's capacity constraints, we expect that the review process will be somewhat more competitive than in years past. Accepted abstracts will be presented at the workshop either as a talk or a poster.
The workshop will not have formal proceedings and is not intended to preclude later publication at another venue. In-person attendance is encouraged, though authors of accepted abstracts who cannot attend in person will be invited to submit a short video to be linked on the TPDP website.
Authors of selected papers from the workshop will be invited to submit a full version of their work for publication in a special issue of the Journal of Privacy and Confidentiality.
We are very grateful to our sponsors whose generosity has been critical to the continued success of the workshop. For information about sponsorship opportunities, please contact us at tpdp.chairs@gmail.com.
For concerns regarding submissions, please contact tpdp.chairs@gmail.com.