{ "cells": [ { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "# Working with Unknown Dataset Sizes\n", "\n", "This notebook demonstrates the features built into OpenDP to handle unknown or private dataset sizes." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Load exemplar dataset" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "execution": { "iopub.execute_input": "2021-04-15T16:41:40.471981Z", "iopub.status.busy": "2021-04-15T16:41:40.458021Z", "iopub.status.idle": "2021-04-15T16:41:40.634785Z", "shell.execute_reply": "2021-04-15T16:41:40.635247Z" } }, "outputs": [], "source": [ "# Define parameters up-front\n", "# Each parameter is either a guess, a DP release, or public information\n", "var_names = [\"age\", \"sex\", \"educ\", \"race\", \"income\", \"married\"] # public information\n", "age_bounds = (0., 120.) # an educated guess\n", "age_prior = 38. # average age for entire US population (public information)\n", "size = 1000 # records in dataset, public information\n", "\n", "# Load data\n", "import os\n", "import numpy as np\n", "data_path = os.path.join('..', 'data', 'PUMS_california_demographics_1000', 'data.csv')\n", "age = np.genfromtxt(data_path, delimiter=',', names=var_names)[:]['age'].tolist()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "By looking at the private data, we see this dataset has 1000 observations (rows).\n", "Sometimes the number of observations is public information.\n", "For example, a researcher might run a random poll of 1000 respondents and publicly announce the sample size.\n", "\n", "However, there are cases where simply the number of observations itself can leak private information.\n", "For example, if a dataset contained all the individuals with a rare disease in a community,\n", "then knowing the size of the dataset would reveal how many people in the community had that condition.\n", "In general, any given dataset may be some 
well-defined subset of a population.\n", "The given dataset's size is equivalent to a count query on that subset,\n", "so we should protect the dataset size just as we would protect any other query we want to provide privacy guarantees for.\n", "\n", "OpenDP assumes the sample size is private information.\n", "If you know the dataset size (or any other parameter) is publicly available,\n", "then you are free to make use of such information while building your measurement.\n", "\n", "OpenDP will not assume you truthfully or correctly know the size of the dataset.\n", "Moreover, OpenDP cannot respond with an error message if you get the size incorrect;\n", "doing so would permit an attack whereby an analyst could repeatedly guess different dataset sizes until the error message went away,\n", "thereby leaking the exact dataset size.\n", "\n", "If we know the dataset size, we can incorporate it into the analysis as below,\n", "where we provide `size` as an argument to a DP mean measurement on age." 
] }, { "cell_type": "code", "execution_count": 2, "metadata": { "execution": { "iopub.execute_input": "2021-04-15T16:41:40.645482Z", "iopub.status.busy": "2021-04-15T16:41:40.644181Z", "iopub.status.idle": "2021-04-15T16:41:40.686094Z", "shell.execute_reply": "2021-04-15T16:41:40.685529Z" }, "pycharm": { "name": "#%%\n" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "DP mean: 44.65291940000049\n" ] } ], "source": [ "from opendp.transformations import *\n", "from opendp.measurements import make_base_laplace, make_base_discrete_laplace\n", "from opendp.mod import enable_features, binary_search_chain\n", "\n", "enable_features(\"contrib\", \"floating-point\")\n", "\n", "dp_mean = (\n", " # Clamp age values\n", " make_clamp(bounds=age_bounds) >>\n", " # Resize with the known `size`\n", " make_bounded_resize(size=size, bounds=age_bounds, constant=age_prior) >>\n", " # Aggregate\n", " make_sized_bounded_mean(size=size, bounds=age_bounds) >>\n", " # Noise\n", " make_base_laplace(scale=1.)\n", ")\n", "\n", "print(\"DP mean:\", dp_mean(age))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Providing incorrect dataset size values\n", "\n", "However, if we provide an incorrect value of `size` we still receive an answer.\n", "\n", "`make_mean_measurement` is just a convenience constructor for building a mean measurement from a `size` argument." 
] }, { "cell_type": "code", "execution_count": 3, "metadata": { "execution": { "iopub.execute_input": "2021-04-15T16:41:40.694235Z", "iopub.status.busy": "2021-04-15T16:41:40.693539Z", "iopub.status.idle": "2021-04-15T16:41:40.711013Z", "shell.execute_reply": "2021-04-15T16:41:40.711551Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "DP mean (n=200): 44.59649541816684\n", "DP mean (n=1000): 45.723444891570175\n", "DP mean (n=2000): 46.28379378485838\n" ] } ], "source": [ "def make_mean_measurement(size):\n", " return make_clamp(age_bounds) >> \\\n", " make_bounded_resize(size=size, bounds=age_bounds, constant=age_prior) >> \\\n", " make_sized_bounded_mean(size=size, bounds=age_bounds) >> \\\n", " make_base_laplace(scale=1.0)\n", "\n", "lower_n = make_mean_measurement(size=200)(age)\n", "real_n = make_mean_measurement(size=1000)(age)\n", "higher_n = make_mean_measurement(size=2000)(age)\n", "\n", "print(\"DP mean (n=200): {0}\".format(lower_n))\n", "print(\"DP mean (n=1000): {0}\".format(real_n))\n", "print(\"DP mean (n=2000): {0}\".format(higher_n))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Analysis with no provided dataset size\n", "If we do not believe we have an accurate estimate for `size` we can instead pay some of our privacy budget\n", "to estimate the dataset size.\n", "Then we can use that estimate in the rest of the analysis.\n", "Here is an example:" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "execution": { "iopub.execute_input": "2021-04-15T16:41:40.731918Z", "iopub.status.busy": "2021-04-15T16:41:40.731318Z", "iopub.status.idle": "2021-04-15T16:41:40.740106Z", "shell.execute_reply": "2021-04-15T16:41:40.739600Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "DP count: 999\n", "DP mean: 44.80287323701519\n" ] } ], "source": [ "# First, estimate the number of records in the dataset.\n", "dp_count = make_count(TIA=float) >> 
make_base_discrete_laplace(scale=1.)\n", "dp_count_release = dp_count(age)\n", "print(\"DP count: {0}\".format(dp_count_release))\n", "\n", "# Then reuse the count to create a dp_mean measurement that resizes the dataset.\n", "dp_mean = make_mean_measurement(dp_count_release)\n", "dp_mean_release = dp_mean(age)\n", "print(\"DP mean: {0}\".format(dp_mean_release))" ] }, { "cell_type": "markdown", "metadata": { "collapsed": false }, "source": [ "There is an interesting trade-off to this approach that can be demonstrated visually via simulations.\n", "Before we move on to the simulation, let's make a few helper functions for building measurements that consume a specified privacy budget." ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "from functools import lru_cache\n", "\n", "@lru_cache(maxsize=None)\n", "def make_count_with(*, epsilon):\n", " counter = make_count(TIA=float)\n", " return binary_search_chain(\n", " lambda s: counter >> make_base_discrete_laplace(scale=s),\n", " d_in=1, d_out=epsilon, \n", " bounds=(0., 10000.))\n", "\n", "@lru_cache(maxsize=None)\n", "def make_mean_with(*, data_size, epsilon):\n", " mean_chain = (\n", " # Clamp age values\n", " make_clamp(bounds=age_bounds) >>\n", " # Resize the dataset to length `data_size`.\n", " # If there are fewer than `data_size` rows in the data, fill with a constant.\n", " # If there are more than `data_size` rows in the data, only keep `data_size` rows\n", " make_bounded_resize(size=data_size, bounds=age_bounds, constant=age_prior) >>\n", " # Compute the mean\n", " make_sized_bounded_mean(size=data_size, bounds=age_bounds)\n", " )\n", " return binary_search_chain(\n", " lambda s: mean_chain >> make_base_laplace(scale=s),\n", " d_in=1, d_out=epsilon, \n", " bounds=(0., 10.))\n", "\n", "@lru_cache(maxsize=None)\n", "def make_sum_with(*, epsilon):\n", " bounded_age_sum = (\n", " # Clamp age values\n", " 
make_clamp(bounds=age_bounds) >>\n", " # These bounds must be identical to the clamp bounds, otherwise chaining will fail\n", " make_bounded_sum(bounds=age_bounds)\n", " )\n", " return binary_search_chain(\n", " lambda s: bounded_age_sum >> make_base_laplace(scale=s),\n", " d_in=1, d_out=epsilon,\n", " bounds=(0., 1000.))" ] }, { "cell_type": "markdown", "metadata": { "collapsed": false }, "source": [ "\n", "In this simulation, we run the same procedure `n_simulations` times. In each iteration, we collect the estimated count and mean." ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Status:\n", "0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%\n" ] } ], "source": [ "import random\n", "\n", "n_simulations = 1000\n", "\n", "history_count = []\n", "history_mean = []\n", "\n", "print(\"Status:\")\n", "for i in range(n_simulations):\n", "    if i % 100 == 0:\n", "        print(f\"{i / n_simulations:.0%} \", end=\"\")\n", "\n", "    # See https://github.com/opendp/opendp/issues/357\n", "    random.shuffle(age)\n", "\n", "    # Spend a small epsilon on the count (the measurement is cached by lru_cache)\n", "    count_chain = make_count_with(epsilon=0.05)\n", "    history_count.append(count_chain(age))\n", "\n", "    # Resize to the released count, then release the mean\n", "    mean_chain = make_mean_with(data_size=history_count[-1], epsilon=1.)\n", "    history_mean.append(mean_chain(age))\n", "\n", "print(\"100%\")" ] }, { "cell_type": "markdown", "metadata": { "collapsed": false }, "source": [ "Now we plot our simulation data, with counts on the X axis and means on the Y axis." ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "import matplotlib.pyplot as plt\n", "import statistics\n", "\n", "true_mean_age = statistics.mean(age)\n", "\n", "# The light blue circles are individual DP means\n", "plt.plot(history_count, history_mean, 'o', fillstyle='none', color = 'cornflowerblue')\n", "\n", "def compute_expected_mean(count):\n", "    # Underestimated counts only drop rows, so the expected mean is unchanged;\n", "    # overestimated counts pad the data with `age_prior`\n", "    count = max(count, size)\n", "    return ((true_mean_age * size) + (count - size) * age_prior) / count\n", "\n", "expected_count = list(range(min(history_count), max(history_count)))\n", "expected_mean = list(map(compute_expected_mean, expected_count))\n", "\n", "# The indigo dots are the average DP mean per released dataset size\n", "for count in expected_count:\n", "    sims = [m for c, m in zip(history_count, history_mean) if c == count]\n", "    if len(sims) > 6:\n", "        plt.plot(count, statistics.mean(sims), 'o', color = 'indigo')\n", "\n", "# The red dashed line is the expected mean for each DP release of the dataset size\n", "plt.plot(expected_count, expected_mean, linestyle='--', color = 'tomato')\n", "plt.ylabel('DP Release of Age')\n", "plt.xlabel('DP estimate of row count')\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this plot, the red dashed line is the expected outcome,\n", "and each of the points represents a `(count, mean)` tuple from one iteration of the simulation.\n", "Due to the behavior of the resize preprocessing transformation,\n", "underestimated counts lead to higher-variance means,\n", "and overestimated counts bias the mean closer to the imputation constant.\n", "On the other hand, underestimated counts yield unbiased means, and overestimated counts yield means with reduced variance.\n", "Keep in mind that it is valid to postprocess the count to be smaller,\n", "which reduces the likelihood of introducing bias by imputing.\n", "If the count is overestimated, the amount of bias introduced to the statistic\n", "by imputation when resizing depends on how much the count estimate 
differs from the true dataset count,\n", "and how much the imputation constant differs from the true dataset mean.\n", "Since both of these quantities are private (and unknowable), they are not accounted for in accuracy estimates." ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "In the next plot, we see the range of DP means calculated as a function of the resized row count.\n", "Note that the spread of DP mean values decreases as the resized count increases, and that the DP mean moves\n", "closer to the imputation constant `age_prior` (38)." ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "collapsed": false, "pycharm": { "name": "#%%\n" } }, "outputs": [ { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "import pandas as pd\n", "import seaborn as sns\n", "\n", "releases = []\n", "# X axis ticks\n", "n_range = range(100, 2001, 200)\n", "# Number of samples per boxplot\n", "n_simulations = 50\n", "\n", "for n in n_range:\n", "    mean_chain = make_mean_with(data_size=n, epsilon=1.)\n", "    for _ in range(n_simulations):\n", "        # See https://github.com/opendp/opendp/issues/357\n", "        random.shuffle(age)\n", "\n", "        # Release the DP mean of age after resizing to n rows\n", "        releases.append((n, mean_chain(age)))\n", "\n", "# Collect the released values into a data frame\n", "df = pd.DataFrame.from_records(releases, columns=['resize to row count', 'DP mean'])\n", "\n", "# The boxplots show the distribution of releases per n\n", "plot = sns.boxplot(x = 'resize to row count', y = 'DP mean', data = df)\n", "# The blue line is the true mean\n", "plot.axhline(true_mean_age)\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": { "collapsed": false }, "source": [ "The results from this approach have a similar interpretation to those in the prior plot.\n", "Underestimated counts lead to higher-variance means,\n", "and overestimated counts lead to greater bias in means.\n", "Thankfully, the count is a low-sensitivity query, so count estimates are usually very close to the true count." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### OpenDP `resize` vs. other approaches\n", "The standard formula for the mean of a variable is:\n", "$\\bar{x} = \\frac{\\sum{x}}{n}$\n", "\n", "The conventional and simpler approach in the differential privacy literature is to:\n", "\n", "1. compute a DP sum of the variable for the numerator\n", "2. compute a DP count of the dataset rows for the denominator\n", "3. 
take their ratio\n", "\n", "This is sometimes called a 'plug-in' approach, as we are plugging in differentially private answers for each of the\n", "terms in the original formula, without any additional modifications, and using the resulting answer as our\n", "estimate while ignoring the noise processes of differential privacy. While this 'plug-in' approach does result in a\n", "differentially private value, its utility is generally lower than that of the resize-based solution in OpenDP. Because the number of\n", "terms summed in the numerator does not agree with the value in the denominator, the variance is increased and the\n", "resulting distribution becomes both biased and asymmetrical, which is visually noticeable in smaller samples." ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Status:\n", "0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%\n" ] } ], "source": [ "n_simulations = 1_000\n", "history_plugin = []\n", "history_resize = []\n", "\n", "# Sized estimators are more robust to noisy counts, so the count gets a small epsilon.\n", "# The less epsilon provided to this count, the more the result will be biased towards the prior.\n", "resize_count = make_count_with(epsilon=0.2)\n", "\n", "# Plug-in estimators want a much more accurate count\n", "plugin_count = make_count_with(epsilon=0.5)\n", "plugin_sum = make_sum_with(epsilon=0.5)\n", "\n", "print(\"Status:\")\n", "for i in range(n_simulations):\n", "    if i % 100 == 0:\n", "        print(f\"{i / n_simulations:.0%} \", end=\"\")\n", "\n", "    # See https://github.com/opendp/opendp/issues/357\n", "    random.shuffle(age)\n", "\n", "    # Plug-in estimate: ratio of a DP sum to a DP count\n", "    history_plugin.append(plugin_sum(age) / plugin_count(age))\n", "\n", "    # Resize estimate: DP count first, then a DP mean on the resized data\n", "    resize_mean = make_mean_with(data_size=resize_count(age), epsilon=.8)\n", "    history_resize.append(resize_mean(age))\n", "\n", "print('100%')" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "pycharm": { "name": "#%%\n" } }, 
"outputs": [ { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "fig, ax = plt.subplots()\n", "sns.kdeplot(history_resize, fill=True, linewidth=3,\n", " label = 'Resize Mean')\n", "sns.kdeplot(history_plugin, fill=True, linewidth=3,\n", " label = 'Plug-in Mean')\n", "\n", "# The dashed green line marks the true mean\n", "ax.plot([true_mean_age,true_mean_age], [0,2], linestyle='--', color = 'forestgreen')\n", "plt.xlabel('DP Release of Age')\n", "leg = ax.legend()" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "For the same privacy loss,\n", "the distribution of answers from OpenDP's resizing approach to the mean is tighter around the true dataset value (thus lower in error) than that of the conventional plug-in approach.\n", "\n", "*Note: in these simulations, the plug-in estimator divides epsilon equally between its sum and count releases,\n", "but higher utility (lower error) can generally be gained by moving more of the epsilon into the sum\n", "and using less in the count of the dataset rows, as in earlier examples.*" ] } ], "metadata": { "interpreter": { "hash": "3220da548452ac41acb293d0d6efded0f046fab635503eb911c05f743e930f34" }, "kernelspec": { "display_name": "Python 3.8.13 ('psi')", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.13" } }, "nbformat": 4, "nbformat_minor": 2 }