Differentially Private PCA
This notebook documents how to make a differentially private PCA release.
Any constructors that have not completed the proof-writing and vetting process may still be accessed if you opt in to “contrib”. Please contact us if you are interested in proof-writing. Thank you!
>>> import opendp.prelude as dp
>>> dp.enable_features("contrib", "floating-point", "honest-but-curious")
>>> import numpy as np
>>> def sample_microdata(*, num_columns=None, num_rows=None, cov=None):
...     # fall back to a random covariance matrix when none is supplied
...     cov = cov if cov is not None else sample_covariance(num_columns)
...     microdata = np.random.multivariate_normal(
...         np.zeros(cov.shape[0]), cov, size=num_rows or 100_000
...     )
...     # center each column of the sampled data
...     microdata -= microdata.mean(axis=0)
...     return microdata
>>> def sample_covariance(num_features):
...     A = np.random.uniform(0, num_features, size=(num_features, num_features))
...     # A.T @ A is symmetric positive semidefinite, hence a valid covariance
...     return A.T @ A
In this notebook we’ll be working with an example dataset generated from a random covariance matrix.
>>> num_columns = 4
>>> num_rows = 10_000
>>> example_dataset = sample_microdata(num_columns=num_columns, num_rows=num_rows)
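The sampled microdata has one row per individual and one column per feature:

>>> example_dataset.shape
(10000, 4)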
Releasing a DP PCA model with the OpenDP Library is straightforward because the library provides an API similar to scikit-learn's:
>>> model = dp.sklearn.decomposition.PCA(
...     epsilon=1.,
...     row_norm=1.,
...     n_samples=num_rows,
...     n_features=4,
... )
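The row_norm argument is the bound on each row's L2 norm that the privacy analysis relies on. As a minimal numpy sketch of what such a norm-clipping step looks like (illustrative only: clip_row_norms is a hypothetical helper, not the library's internal implementation):

>>> def clip_row_norms(data, norm):
...     # scale down any row whose L2 norm exceeds `norm`
...     row_norms = np.linalg.norm(data, axis=1, keepdims=True)
...     return data * np.minimum(1.0, norm / row_norms)
>>> clipped = clip_row_norms(example_dataset, norm=1.)
>>> bool(np.all(np.linalg.norm(clipped, axis=1) <= 1.0 + 1e-9))
True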
A private release occurs when you fit the model to the data.
>>> model.fit(example_dataset)
PCA(epsilon=1.0, n_components=4, n_features=4, n_samples=10000, row_norm=1.0)
The fitted model can then be introspected just like scikit-learn's non-private PCA:
>>> print(model.singular_values_)
[... ... ... ...]
>>> print(model.components_)
[[... ... ... ...]
[... ... ... ...]
[... ... ... ...]
[... ... ... ...]]
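Since the fitted object mirrors scikit-learn's PCA, the released attributes can be used downstream directly. For example, projecting the mean-centered data onto the private components is plain numpy (projected is just an illustrative name):

>>> projected = example_dataset @ model.components_.T
>>> projected.shape
(10000, 4)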
Instead of fitting the model, you can retrieve the measurement used to make the release, just like in other OpenDP APIs. This time we'll fit only 2 of the 4 components; since fewer eigenvectors are estimated, more of the privacy budget is allocated to each one internally.
>>> model = dp.sklearn.decomposition.PCA(
...     epsilon=1.,
...     row_norm=1.,
...     n_samples=num_rows,
...     n_features=4,
...     n_components=2,  # only estimate 2 of 4 components this time
... )
>>> meas = model.measurement()
The measurement fits the model and then returns it:
>>> meas(example_dataset)
PCA(epsilon=1.0, n_components=2, n_features=4, n_samples=10000, row_norm=1.0)
.measurement() makes it more convenient to use the scikit-learn API with other combinators, like compositors; a sketch follows the introspection below.
>>> print(model.singular_values_)
[... ...]
>>> print(model.components_)
[[... ... ... ...]
[... ... ... ...]]
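For example, the measurement can be submitted as a query to a sequential compositor. A minimal sketch, assuming dp.c.make_sequential_composition is available in your OpenDP version and that neighboring datasets differ by at most one row (d_in=1):

>>> sc_meas = dp.c.make_sequential_composition(
...     input_domain=meas.input_domain,
...     input_metric=meas.input_metric,
...     output_measure=meas.output_measure,
...     d_in=1,  # neighboring datasets differ by at most one row
...     d_mids=[meas.map(1)],  # budget reserved for each submitted query
... )
>>> qbl = sc_meas(example_dataset)  # a queryable over the sensitive data
>>> fitted = qbl(meas)  # release the DP PCA model through the compositor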
Please reach out on Slack if you need a more tailored analysis: there are lower-level APIs for estimating only the eigenvalues or eigenvectors, or for avoiding mean estimation when your data is already bounded.