.. _dp-pca:

Differentially Private PCA
==========================

This notebook documents making a differentially private PCA release.

--------------

Any constructors that have not completed the proof-writing and vetting
process may still be accessed if you opt in to “contrib”. Please
contact us if you are interested in proof-writing. Thank you!

.. tab-set::

    .. tab-item:: Python
        :sync: python

        .. code:: python

            >>> import opendp.prelude as dp
            >>> dp.enable_features("contrib", "floating-point", "honest-but-curious")

            >>> import numpy as np

            >>> def sample_covariance(num_features):
            ...     A = np.random.uniform(0, num_features, size=(num_features, num_features))
            ...     return A.T @ A

            >>> def sample_microdata(*, num_columns=None, num_rows=None, cov=None):
            ...     # test for None explicitly: `cov or ...` raises on NumPy arrays
            ...     cov = sample_covariance(num_columns) if cov is None else cov
            ...     microdata = np.random.multivariate_normal(
            ...         np.zeros(cov.shape[0]), cov, size=num_rows or 100_000
            ...     )
            ...     microdata -= microdata.mean(axis=0)
            ...     return microdata

In this notebook we’ll be working with an example dataset generated
from a random covariance matrix.

.. tab-set::

    .. tab-item:: Python
        :sync: python

        .. code:: python

            >>> num_columns = 4
            >>> num_rows = 10_000
            >>> example_dataset = sample_microdata(num_columns=num_columns, num_rows=num_rows)

Releasing a DP PCA model with the OpenDP Library is easy because the
library provides an API similar to scikit-learn’s:

.. tab-set::

    .. tab-item:: Python
        :sync: python

        .. code:: python

            >>> model = dp.sklearn.decomposition.PCA(
            ...     epsilon=1.,
            ...     row_norm=1.,
            ...     n_samples=num_rows,
            ...     n_features=4,
            ... )

A private release occurs when you fit the model to the data.

.. tab-set::

    .. tab-item:: Python
        :sync: python

        .. code:: python

            >>> model.fit(example_dataset)
            PCA(epsilon=1.0, n_components=4, n_features=4, n_samples=10000, row_norm=1.0)

The fitted model can then be introspected just like scikit-learn’s
non-private PCA:

.. tab-set::

    .. tab-item:: Python
        :sync: python

        .. code:: python

            >>> print(model.singular_values_)
            [... ... ... ...]

            >>> print(model.components_)
            [[... ... ... ...]
             [... ... ... ...]
             [... ... ... ...]
             [... ... ... ...]]

Instead of fitting the model, you can retrieve the measurement used to
make the release, just like with other OpenDP APIs. This time, we’ll
also fit only 2 of the 4 components. Because of this, more budget will
be allocated to estimating each eigenvector internally.

.. tab-set::

    .. tab-item:: Python
        :sync: python

        .. code:: python

            >>> model = dp.sklearn.decomposition.PCA(
            ...     epsilon=1.,
            ...     row_norm=1.,
            ...     n_samples=num_rows,
            ...     n_features=4,
            ...     n_components=2  # only estimate 2 of 4 components this time
            ... )
            >>> meas = model.measurement()

The measurement fits ``model`` and then returns ``model``:

.. tab-set::

    .. tab-item:: Python
        :sync: python

        .. code:: python

            >>> meas(example_dataset)
            PCA(epsilon=1.0, n_components=2, n_features=4, n_samples=10000, row_norm=1.0)

``.measurement()`` makes it more convenient to use the scikit-learn API
with other combinators, like compositors (see the composition sketch at
the end of this notebook).

.. tab-set::

    .. tab-item:: Python
        :sync: python

        .. code:: python

            >>> print(model.singular_values_)
            [... ...]

            >>> print(model.components_)
            [[... ... ... ...]
             [... ... ... ...]]

Please reach out on Slack if you need a more tailored analysis: there
are lower-level APIs for estimating *only* the eigenvalues or
eigenvectors, or to avoid mean estimation when your data is already
bounded.
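Since differential privacy is preserved under post-processing, the
released attributes can be reused freely without spending additional
privacy budget. As a minimal sketch (assuming the two-component
``model`` released above), you can project the example data onto the
released components:

.. tab-set::

    .. tab-item:: Python
        :sync: python

        .. code:: python

            >>> # post-processing a DP release consumes no additional budget
            >>> projected = example_dataset @ model.components_.T
            >>> projected.shape
            (10000, 2)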
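Working with the measurement directly also gives access to the standard
OpenDP ``Measurement`` introspection. The sketch below queries the
privacy map of ``meas``; the meaning of the dataset distance ``1``
depends on the measurement’s input metric, so treat this as
illustrative:

.. tab-set::

    .. tab-item:: Python
        :sync: python

        .. code:: python

            >>> # the privacy map reports the privacy loss (epsilon) incurred
            >>> # when releasing on datasets at a given distance
            >>> epsilon = meas.map(1)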
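As one example of combining ``meas`` with a compositor, the sketch
below passes it to OpenDP’s sequential composition combinator. The
choice of ``d_in=1`` and the two per-query budgets are illustrative
assumptions, and the combinator’s name and signature may vary across
OpenDP versions:

.. tab-set::

    .. tab-item:: Python
        :sync: python

        .. code:: python

            >>> sc_meas = dp.c.make_sequential_composition(
            ...     input_domain=meas.input_domain,
            ...     input_metric=meas.input_metric,
            ...     output_measure=meas.output_measure,
            ...     d_in=1,           # assumed dataset distance
            ...     d_mids=[1., 1.],  # per-query budgets; total epsilon is 2
            ... )
            >>> queryable = sc_meas(example_dataset)
            >>> first_model = queryable(meas)  # spends the first epsilon=1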