Differentially Private PCA

This notebook documents how to make a differentially private PCA release.


Any constructors that have not completed the proof-writing and vetting process may still be accessed if you opt in to “contrib”. Please contact us if you are interested in proof-writing. Thank you!

>>> import opendp.prelude as dp
>>> dp.enable_features("contrib", "floating-point", "honest-but-curious")

>>> import numpy as np

>>> def sample_microdata(*, num_columns=None, num_rows=None, cov=None):
...     if cov is None:  # `cov or ...` would fail on arrays, so test for None explicitly
...         cov = sample_covariance(num_columns)
...     # draw rows from a zero-mean multivariate normal with the given covariance
...     microdata = np.random.multivariate_normal(
...         np.zeros(cov.shape[0]), cov, size=num_rows or 100_000
...     )
...     microdata -= microdata.mean(axis=0)  # center the sample exactly
...     return microdata

>>> def sample_covariance(num_features):
...     # A.T @ A is positive semi-definite, so it is a valid covariance matrix
...     A = np.random.uniform(0, num_features, size=(num_features, num_features))
...     return A.T @ A

In this notebook we’ll be working with an example dataset generated from a random covariance matrix.

>>> num_columns = 4
>>> num_rows = 10_000
>>> example_dataset = sample_microdata(num_columns=num_columns, num_rows=num_rows)

Releasing a DP PCA model with the OpenDP Library is easy because OpenDP provides an API similar to scikit-learn's:

>>> model = dp.sklearn.decomposition.PCA(
...     epsilon=1.,
...     row_norm=1.,
...     n_samples=num_rows,
...     n_features=4,
... )
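
Here row_norm=1. tells the library that the privacy analysis may assume each row has an L2 norm of at most 1. If your data might exceed that bound, one transparent option is to clip rows yourself before fitting. A minimal numpy sketch, where clip_rows is an illustrative helper rather than part of the OpenDP API (the rest of this walkthrough continues with example_dataset as-is):

>>> def clip_rows(data, norm=1.):
...     # rescale any row whose L2 norm exceeds `norm`; shorter rows pass through
...     row_norms = np.linalg.norm(data, axis=1, keepdims=True)
...     return data * (norm / np.maximum(row_norms, norm))

>>> clipped_dataset = clip_rows(example_dataset)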

A private release occurs when you fit the model to the data.

>>> model.fit(example_dataset)
PCA(epsilon=1.0, n_components=4, n_features=4, n_samples=10000, row_norm=1.0)

The fitted model can then be introspected just like scikit-learn's non-private PCA:

>>> print(model.singular_values_)
[... ... ... ...]
>>> print(model.components_)
[[... ... ... ...]
 [... ... ... ...]
 [... ... ... ...]
 [... ... ... ...]]
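
The released values support arbitrary post-processing without further privacy cost. For instance, since squared singular values are proportional to the variance explained by each component, a rough explained-variance ratio can be computed from the release (a sketch; the released singular values are noisy, so small components may be distorted):

>>> s2 = model.singular_values_ ** 2
>>> print(s2 / s2.sum())
[... ... ... ...]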

Instead of fitting the model, you can also retrieve the measurement used to make the release, just as with other OpenDP APIs. This time we'll only fit 2 of the 4 components; since fewer eigenvectors are estimated, more of the privacy budget is allocated to each one internally.

>>> model = dp.sklearn.decomposition.PCA(
...     epsilon=1.,
...     row_norm=1.,
...     n_samples=num_rows,
...     n_features=4,
...     n_components=2,  # only estimate 2 of the 4 components this time
... )
>>> meas = model.measurement()

The measurement fits the model and then returns it:

>>> meas(example_dataset)
PCA(epsilon=1.0, n_components=2, n_features=4, n_samples=10000, row_norm=1.0)

.measurement() makes it more convenient to use the scikit-learn API with other OpenDP combinators, like compositors.
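
For example, two PCA measurements can be composed so that both releases are made in a single invocation, spending the sum of their budgets. This is a minimal sketch assuming dp.c.make_basic_composition, the basic composition constructor in several OpenDP releases (the constructor's name may differ in your version):

>>> model_a = dp.sklearn.decomposition.PCA(
...     epsilon=0.5, row_norm=1., n_samples=num_rows, n_features=4)
>>> model_b = dp.sklearn.decomposition.PCA(
...     epsilon=0.5, row_norm=1., n_samples=num_rows, n_features=4)
>>> composed = dp.c.make_basic_composition(
...     [model_a.measurement(), model_b.measurement()])

>>> # invoking the composition fits both models, spending epsilon=1 in total
>>> fitted_a, fitted_b = composed(example_dataset)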

Returning to the 2-component model fit above, the released attributes are available as usual:

>>> print(model.singular_values_)
[... ...]
>>> print(model.components_)
[[... ... ... ...]
 [... ... ... ...]]
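
Because the released components are differentially private, anything computed from them is too. For example, they define a projection into the 2-dimensional principal subspace (a sketch; note that projecting the original, sensitive microdata yields values that are themselves still sensitive, so only the projection of public or synthetic data may be shared):

>>> projected = example_dataset @ model.components_.T
>>> projected.shape
(10000, 2)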

Please reach out on Slack if you need a more tailored analysis: there are lower-level APIs for estimating only the eigenvalues or eigenvectors, or for avoiding mean estimation when your data is already bounded.