Differentially Private PCA#

This notebook documents making a differentially private PCA release.


Any constructors that have not completed the proof-writing and vetting process may still be accessed if you opt-in to “contrib”. Please contact us if you are interested in proof-writing. Thank you!

[1]:
from opendp.mod import enable_features
enable_features("contrib", "floating-point", "honest-but-curious")
[2]:
import numpy as np

def sample_microdata(*, num_columns=None, num_rows=None, cov=None):
    cov = cov or sample_covariance(num_columns)
    microdata = np.random.multivariate_normal(
        np.zeros(cov.shape[0]), cov, size=num_rows or 100_000
    )
    microdata -= microdata.mean(axis=0)
    return microdata

def sample_covariance(num_features):
    A = np.random.uniform(0, num_features, size=(num_features, num_features))
    return A.T @ A

In this notebook we’ll be working with an example dataset generated from a random covariance matrix.

[3]:
num_columns = 4
num_rows = 10_000
example_dataset = sample_microdata(num_columns=num_columns, num_rows=num_rows)

Releasing a DP PCA model with the OpenDP Library is easy because it provides an API similar to scikit-learn:

[4]:
import opendp.prelude as dp

model = dp.sklearn.PCA(
    epsilon=1.,
    row_norm=1.,
    n_samples=num_rows,
    n_features=4,
)

A private release occurs when you fit the model to the data.

[5]:
model.fit(example_dataset)
[5]:
PCA(epsilon=1.0, n_components=4, n_features=4, n_samples=10000, row_norm=1.0)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.

The fitted model can then be introspected just like Scikit-Learn’s non-private PCA:

[6]:
print("singular values", model.singular_values_)
print("components")
model.components_
singular values [15.40825945 30.95765559 51.64750761 78.25285485]
components
[6]:
array([[ 0.32635704,  0.63916974,  0.62412528,  0.30890252],
       [ 0.84399485,  0.11060222, -0.5202029 , -0.06948945],
       [-0.42549121,  0.70204137, -0.557553  ,  0.12340906],
       [ 0.01100033, -0.29388281, -0.17026812,  0.94048958]])

Instead of fitting the model, you could instead retrieve the measurement used to make the release, just like other OpenDP APIs. This time, we’ll also only fit 2 components. Because of this, more budget will be allocated to estimating each eigenvector internally.

[7]:
model = dp.sklearn.PCA(
    epsilon=1.,
    row_norm=1.,
    n_samples=num_rows,
    n_features=4,
    n_components=2 # only estimate 2 of 4 components this time
)
meas = model.measurement()

The measurement fits model and then returns model:

[8]:
meas(example_dataset)
[8]:
PCA(epsilon=1.0, n_components=2, n_features=4, n_samples=10000, row_norm=1.0)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.

.measurement() makes it more convenient to use the Scikit-Learn API with other combinators, like compositors.

[9]:
print("singular values", model.singular_values_)
print("components")
model.components_
singular values [15.70942634 30.92864075]
components
[9]:
array([[ 0.54788797,  0.64408591,  0.36746136,  0.38722636],
       [ 0.66629274, -0.06883045, -0.73004368, -0.13547169]])

Please reach out on Slack if you need to a more tailored analysis: there are lower-level APIs for estimating only the eigenvalues or eigenvectors, or to avoid mean estimation when your data is already bounded.