Differentially Private PCA
This notebook documents how to make a differentially private PCA release with the OpenDP Library.
Any functions that have not completed the proof-writing and vetting process may still be accessed if you opt in to "contrib". Please contact us if you are interested in proof-writing. Thank you!
>>> import opendp.prelude as dp
>>> dp.enable_features(
... "contrib", "floating-point", "honest-but-curious"
... )
>>> import numpy as np
>>> def sample_microdata(
...     *, num_columns=None, num_rows=None, cov=None
... ):
...     # default to a random covariance matrix; use an explicit
...     # `is not None` check, since `or` raises on numpy arrays
...     cov = cov if cov is not None else sample_covariance(num_columns)
...     microdata = np.random.multivariate_normal(
...         np.zeros(cov.shape[0]),
...         cov,
...         size=num_rows or 100_000,
...     )
...     # center the synthetic data
...     microdata -= microdata.mean(axis=0)
...     return microdata
...
>>> def sample_covariance(num_features):
...     A = np.random.uniform(
...         0, num_features, size=(num_features, num_features)
...     )
...     # A.T @ A is symmetric positive semi-definite,
...     # hence a valid covariance matrix
...     return A.T @ A
...
In this notebook we’ll be working with an example dataset generated from a random covariance matrix.
>>> num_columns = 4
>>> num_rows = 10_000
>>> example_dataset = sample_microdata(
... num_columns=num_columns, num_rows=num_rows
... )
Releasing a DP PCA model with the OpenDP Library is easy because it provides an API similar to scikit-learn:
>>> model = dp.sklearn.decomposition.PCA(
... epsilon=1.0,
... row_norm=1.0,
... n_samples=num_rows,
... n_features=4,
... )
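The row_norm argument bounds each individual's influence on the release: conceptually, every row is limited to an L2 norm of at most row_norm. A purely illustrative numpy sketch of that kind of clipping (not the library's internal code):

>>> # illustrative only: scale rows down so each has L2 norm <= 1.0
>>> norms = np.linalg.norm(example_dataset, axis=1, keepdims=True)
>>> clipped = example_dataset / np.maximum(norms, 1.0)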
A private release occurs when you fit the model to the data.
>>> model.fit(example_dataset)
PCA(epsilon=1.0, n_components=4, n_features=4, n_samples=10000, row_norm=1.0)
The fitted model can then be introspected just like scikit-learn's non-private PCA:
>>> print(model.singular_values_)
[... ... ... ...]
>>> print(model.components_)
[[... ... ... ...]
[... ... ... ...]
[... ... ... ...]
[... ... ... ...]]
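Because differential privacy is closed under post-processing, the released values can be reworked freely without spending more budget. For example, a small numpy sketch that derives each component's share of the total variance from the released singular values:

>>> # post-processing a DP release costs no additional privacy budget
>>> variances = model.singular_values_ ** 2
>>> print(variances / variances.sum())
[... ... ... ...]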
Instead of fitting the model, you could retrieve the measurement used to make the release, just like with other OpenDP APIs. This time, we'll only fit 2 components; because of this, more budget is allocated to estimating each eigenvector internally.
>>> model = dp.sklearn.decomposition.PCA(
... epsilon=1.0,
... row_norm=1.0,
... n_samples=num_rows,
... n_features=4,
... n_components=2, # only estimate 2 of 4 components this time
... )
>>> meas = model.measurement()
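Like any OpenDP measurement, meas exposes a privacy map that bounds the privacy loss at a given dataset distance. A hedged check, assuming the input metric is the symmetric distance, under which changing one individual's row in a fixed-size dataset corresponds to a distance of 2:

>>> budget = meas.map(2)  # an upper bound on the privacy spend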
The measurement fits the model and then returns it:
>>> meas(example_dataset)
PCA(epsilon=1.0, n_components=2, n_features=4, n_samples=10000, row_norm=1.0)
The .measurement() method makes it more convenient to use the scikit-learn API with other combinators, like compositors, as sketched below.
>>> print(model.singular_values_)
[... ...]
>>> print(model.components_)
[[... ... ... ...]
[... ... ... ...]]
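As mentioned above, the returned measurement can be passed to compositors. A minimal sketch, assuming dp.c.make_basic_composition is available and using fresh, identically-configured models for each part; the privacy spend of the composition is the sum of its parts, here a total epsilon of 2.0:

>>> def make_pca_meas():
...     return dp.sklearn.decomposition.PCA(
...         epsilon=1.0,
...         row_norm=1.0,
...         n_samples=num_rows,
...         n_features=4,
...         n_components=2,
...     ).measurement()
...
>>> # composing spends the sum of the budgets: epsilon = 2.0 in total
>>> composed = dp.c.make_basic_composition(
...     [make_pca_meas(), make_pca_meas()]
... )
>>> both_models = composed(example_dataset)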
Please reach out on Slack if you need a more tailored analysis: there are lower-level APIs for estimating only the eigenvalues or eigenvectors, or for avoiding mean estimation when your data is already bounded.