{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Differentially Private PCA\n", "\n", "This notebook documents making a differentially private PCA release." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "----\n", "Any constructors that have not completed the proof-writing and vetting process may still be accessed if you opt-in to \"contrib\".\n", "Please contact us if you are interested in proof-writing. Thank you!" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "from opendp.mod import enable_features\n", "enable_features(\"contrib\", \"floating-point\", \"honest-but-curious\")" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "\n", "def sample_microdata(*, num_columns=None, num_rows=None, cov=None):\n", " cov = cov or sample_covariance(num_columns)\n", " microdata = np.random.multivariate_normal(\n", " np.zeros(cov.shape[0]), cov, size=num_rows or 100_000\n", " )\n", " microdata -= microdata.mean(axis=0)\n", " return microdata\n", "\n", "def sample_covariance(num_features):\n", " A = np.random.uniform(0, num_features, size=(num_features, num_features))\n", " return A.T @ A" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this notebook we'll be working with an example dataset generated from a random covariance matrix." 
] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "num_columns = 4\n", "num_rows = 10_000\n", "example_dataset = sample_microdata(num_columns=num_columns, num_rows=num_rows)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Releasing a DP PCA model with the OpenDP Library is easy because it provides an API similar to scikit-learn:" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "import opendp.prelude as dp\n", "\n", "model = dp.sklearn.PCA(\n", " epsilon=1.,\n", " row_norm=1.,\n", " n_samples=num_rows,\n", " n_features=4,\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A private release occurs when you fit the model to the data." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
PCA(epsilon=1.0, n_components=4, n_features=4, n_samples=10000, row_norm=1.0)
" ], "text/plain": [ "PCA(epsilon=1.0, n_components=4, n_features=4, n_samples=10000, row_norm=1.0)" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "model.fit(example_dataset)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The fitted model can then be introspected just like Scikit-Learn's non-private PCA:" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "singular values [15.40825945 30.95765559 51.64750761 78.25285485]\n", "components\n" ] }, { "data": { "text/plain": [ "array([[ 0.32635704, 0.63916974, 0.62412528, 0.30890252],\n", " [ 0.84399485, 0.11060222, -0.5202029 , -0.06948945],\n", " [-0.42549121, 0.70204137, -0.557553 , 0.12340906],\n", " [ 0.01100033, -0.29388281, -0.17026812, 0.94048958]])" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "print(\"singular values\", model.singular_values_)\n", "print(\"components\")\n", "model.components_" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Instead of fitting the model, you could instead retrieve the measurement used to make the release, just like other OpenDP APIs.\n", "This time, we'll also only fit 2 components.\n", "Because of this, more budget will be allocated to estimating each eigenvector internally." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "model = dp.sklearn.PCA(\n", " epsilon=1.,\n", " row_norm=1.,\n", " n_samples=num_rows,\n", " n_features=4,\n", " n_components=2 # only estimate 2 of 4 components this time\n", ")\n", "meas = model.measurement()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The measurement fits `model` and then returns `model`:" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
PCA(epsilon=1.0, n_components=2, n_features=4, n_samples=10000, row_norm=1.0)
" ], "text/plain": [ "PCA(epsilon=1.0, n_components=2, n_features=4, n_samples=10000, row_norm=1.0)" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "meas(example_dataset)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`.measurement()` makes it more convenient to use the Scikit-Learn API with other combinators, like compositors." ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "singular values [15.70942634 30.92864075]\n", "components\n" ] }, { "data": { "text/plain": [ "array([[ 0.54788797, 0.64408591, 0.36746136, 0.38722636],\n", " [ 0.66629274, -0.06883045, -0.73004368, -0.13547169]])" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "print(\"singular values\", model.singular_values_)\n", "print(\"components\")\n", "model.components_" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Please reach out on Slack if you need to a more tailored analysis:\n", "there are lower-level APIs for estimating _only_ the eigenvalues or eigenvectors, \n", "or to avoid mean estimation when your data is already bounded.\n" ] } ], "metadata": { "kernelspec": { "display_name": "opendp", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.3" } }, "nbformat": 4, "nbformat_minor": 2 }