Differential Privacy Under Dependency

This post is a short discussion of my thoughts on some of the challenges facing differential privacy.

Differential privacy (DP) is a pretty cool mathematical framework. It was inspired by the observation that privacy cannot be retained under exact statistical analysis. Say you take the sum of everyone’s age. This may appear not to reveal any one person’s age, but if you can also obtain the sum of ages for any subset of the group, you can determine everyone’s age exactly: subtract the sum over everyone-except-one-person from the total, and out pops that person’s age. The same logic even applies to data used to train an ML model.
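To make that concrete, here’s a minimal sketch of the differencing attack (the ages are entirely made up; the point is just the subtraction):

```python
import numpy as np

ages = np.array([34, 29, 51, 42, 67])  # hypothetical ages

total = ages.sum()        # "harmless" aggregate: sum over everyone
subset = ages[:-1].sum()  # another "harmless" aggregate: everyone except the last person

print(total - subset)     # 67 -- the last person's exact age, no privacy left
```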

The Good of DP

DP provides a guarantee: the success of even the best possible attacker at finding the exact age of any participant is bounded, and the bound holds even if the attacker gets to choose which subgroups they want the summed ages of. DP does this by calculating the amount of random noise required to enforce that bound. It’s pretty magical, but there really is a specific amount of noise that prevents every attacker from learning any one person’s age. The math is quite cool!
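As a rough illustration of how that noise is calibrated, here’s a sketch of the Laplace mechanism applied to the age-sum query (the epsilon value and the age bound are arbitrary choices of mine, not prescriptions):

```python
import numpy as np

rng = np.random.default_rng(0)
ages = np.array([34, 29, 51, 42, 67])  # hypothetical ages

epsilon = 1.0      # privacy budget: smaller means more noise, more privacy
age_bound = 120    # clamp ages so one person changes the sum by at most this much
sensitivity = age_bound

# Laplace mechanism: add noise with scale = sensitivity / epsilon to the true sum
noisy_sum = ages.clip(0, age_bound).sum() + rng.laplace(scale=sensitivity / epsilon)
print(noisy_sum)
```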

DP goes further: by defending against all possible attackers, we also get some additional properties:

  • Any further transformation of the released output cannot weaken the privacy guarantee (post-processing immunity)
  • It protects any type of data (as long as the records are independent)
  • Even if the attacker knows the entire rest of the dataset, they still can’t pin down that one last person’s age (see the sketch after this list)!
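Here’s a small sketch of that last point, reusing the noisy age-sum from above (same made-up ages and epsilon): an attacker who already knows every age except one still only gets a noisy estimate of it.

```python
import numpy as np

rng = np.random.default_rng(1)
ages = np.array([34, 29, 51, 42, 67])  # attacker knows ages[:-1] exactly

epsilon, age_bound = 1.0, 120
noisy_sum = ages.sum() + rng.laplace(scale=age_bound / epsilon)

# Best possible move: subtract everything the attacker already knows
estimate = noisy_sum - ages[:-1].sum()
print(estimate)  # a guess at 67, but blurred by Laplace noise of scale 120
```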

Additionally, although this is difficult to quantify, DP has an impressive ability to adapt to different situations in data science. It started out only protecting sums, but has since been extended to ML models, stochastic ordinary differential equations, and likely any statistical analysis you could be interested in. The vast majority of other privacy-enhancing technologies end up being extremely specific to one situation.
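To give a flavor of one such extension, here’s a very rough sketch of the clip-and-noise idea behind DP training of ML models (in the spirit of DP-SGD; the clipping norm and noise scale are placeholder values, and this is nobody’s reference implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

def noisy_gradient(per_example_grads, clip_norm=1.0, noise_multiplier=1.1):
    """Average per-example gradients after clipping each one and adding Gaussian noise."""
    clipped = [
        g * min(1.0, clip_norm / max(np.linalg.norm(g), 1e-12))  # bound each person's influence
        for g in per_example_grads
    ]
    noise = rng.normal(scale=noise_multiplier * clip_norm, size=clipped[0].shape)
    return (np.sum(clipped, axis=0) + noise) / len(per_example_grads)

# Toy usage: four per-example gradients of a three-parameter model
grads = [rng.normal(size=3) for _ in range(4)]
print(noisy_gradient(grads))
```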

The Bad of DP

The most serious concern with DP is its panacea-like status. This was first noted right as DP started being picked up by fields outside theoretical math. It does indeed feel like a simple framework you can apply to any problem and then rest assured that everything is now private.

As with ML, DP has drawbacks that can create very dangerous situations when a practitioner fails to consider them.

The most widely understood drawback of DP is that it hurts performance. For instance, by training a classification model with DP, you should expect to receive a lower F1 score than the same model trained without DP. There are of course exceptions, but the intuition that DP = worse performance holds almost universally. Methods come out every week on how to reduce the performance penalty incurred by DP in some specific situation.

However, I’d argue that the performance penalty is not a large issue. Practitioners have a good understanding of it, and it’s hard to miss the performance impact. Instead, the far more concerning aspects are that:

  • DP doesn’t work (at all) for dependent data
  • When applied naively to ML, it often hurts fairness disproportionately
  • A large epsilon is meaningless: there is no real privacy advantage at that point

DP Under Dependency

DP assumes the data is sampled IID. This is a fair assumption in theoretical work, as many of the most widely used algorithms (such as SGD) make the same assumption. In reality, however, most data IS dependent! This is a huge problem for practitioners, who simply cannot apply DP unless the data is known to be independent. Unfortunately, this is overlooked in the vast majority of the literature, which effectively nullifies the point of DP.
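Here’s a toy illustration of the failure mode (all numbers made up; the noise is calibrated for a single record, exactly as in the sum sketch above). If ten records are perfectly correlated, say ten relatives who all share the same sensitive attribute, their combined contribution is ten times the single-record sensitivity, and the noise no longer hides it:

```python
import numpy as np

rng = np.random.default_rng(2)

# 90 independent records plus 10 perfectly correlated ones (one family)
independent = rng.integers(0, 2, size=90)   # hypothetical binary attribute
family_value = 1                            # shared by all 10 relatives
data = np.concatenate([independent, np.full(10, family_value)])

epsilon, sensitivity = 1.0, 1               # noise calibrated for ONE record
noisy_count = data.sum() + rng.laplace(scale=sensitivity / epsilon)

# An attacker who knows the 90 independent records (which DP explicitly allows)
# faces a signal of size 10 against noise of scale 1: the shared value leaks.
estimate = (noisy_count - independent.sum()) / 10
print(round(estimate))                      # recovers family_value almost every time
```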

There has been work on this problem, as it’s a huge red flag for wider adoption. In 2015, a rather nice paper showed how DP can work if (1) you know how many datapoints are dependent on one another and (2) you know the “strength” of this dependency. This was excellent theoretical work, though it’s still completely impractical as it’s incredibly difficult for anyone to know the underlying dependencies in a dataset. Since then, the problem has seen little progress, as dependency ruins the underlying assumptions of DP.
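For a sense of why you need to know the number of dependent points, the standard group-privacy bound (a textbook fact about pure ε-DP, stated here only for intuition) says the guarantee degrades linearly with the size of the correlated group:

```latex
% If a mechanism M is \varepsilon-DP with respect to changing one record,
% then for two datasets D, D' differing in a correlated group of k records:
\Pr[M(D) \in S] \;\le\; e^{k\varepsilon}\,\Pr[M(D') \in S]
% i.e. the effective budget is k\varepsilon -- meaningless unless you know k,
% and k is exactly the "how many datapoints are dependent" quantity above.
```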

The problem becomes even more concerning once you consider the widespread use of federated learning. That ML framework hinges on dependent datasets, which makes DP a terrible fit for the majority of its practical applications.

DP’s Disparate Impact