{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Differencing\n", "\n", "In this notebook, we will examine perhaps the simplest possible attack on an individual's private data and what the OpenDP library can do to mitigate it.\n", "\n", "## Loading the data\n", "\n", "The vetting process is currently underway for the code in the OpenDP Library.\n", "Any constructors that have not been vetted may still be accessed if you opt in to \"contrib\"." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": false, "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "import numpy as np\n", "import opendp.prelude as dp\n", "dp.enable_features('contrib')" ] }, { "cell_type": "markdown", "metadata": { "collapsed": false }, "source": [ "We begin by loading the data." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "![ -e data.csv ] || wget https://raw.githubusercontent.com/opendp/opendp/main/docs/source/data/PUMS_california_demographics_1000/data.csv" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false, "pycharm": { "name": "#%%\n" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['age', 'sex', 'educ', 'race', 'income', 'married']\n", "59,1,9,1,0,1\n", "31,0,1,3,17000,0\n", "36,1,11,1,0,1\n", "54,1,11,1,9100,1\n", "39,0,5,3,37000,0\n", "34,0,9,1,0,1\n" ] } ], "source": [ "with open('data.csv') as input_file:\n", "    data = input_file.read()\n", "\n", "col_names = [\"age\", \"sex\", \"educ\", \"race\", \"income\", \"married\"]\n", "print(col_names)\n", "print('\\n'.join(data.split('\\n')[:6]))" ] }, { "cell_type": "markdown", "metadata": { "collapsed": false }, "source": [ "The following code parses the data into a vector of incomes.\n", "More details on preprocessing can be found [here](../pums-data-analysis.ipynb)." 
] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false, "pycharm": { "name": "#%%\n" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[0.0, 17000.0, 0.0, 9100.0, 37000.0, 0.0, 6000.0]\n" ] } ], "source": [ "income_preprocessor = (\n", "    # Convert data into a dataframe where columns are of type Vec<str>\n", "    dp.t.make_split_dataframe(separator=\",\", col_names=col_names) >>\n", "    # Select a column of df, Vec<str>\n", "    dp.t.make_select_column(key=\"income\", TOA=str)\n", ")\n", "\n", "# make a transformation that casts from a vector of strings to a vector of floats\n", "cast_str_float = (\n", "    # Cast Vec<str> to Vec<Option<float>>\n", "    dp.t.then_cast(TOA=float) >>\n", "    # Replace any elements that failed to parse with 0., emitting a Vec<float>\n", "    dp.t.then_impute_constant(0.)\n", ")\n", "\n", "# replace the previous preprocessor: extend it with the caster\n", "income_preprocessor = income_preprocessor >> cast_str_float\n", "incomes = income_preprocessor(data)\n", "\n", "print(incomes[:7])" ] }, { "cell_type": "markdown", "metadata": { "collapsed": false }, "source": [ "## A simple attack\n", "\n", "Say there's an attacker whose target is the income of the first person in our data (i.e. the first income in the csv). In our case, it's simply `0` (but any number would work, e.g. 5000)." 
] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": false, "pycharm": { "name": "#%%\n" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "person of interest:\n", "\n", "0.0\n" ] } ], "source": [ "person_of_interest = incomes[0]\n", "print('person of interest:\\n\\n{0}'.format(person_of_interest))" ] }, { "cell_type": "markdown", "metadata": { "collapsed": false }, "source": [ "Now consider an attacker who doesn't know the POI's income, but does know the following: (1) the average income without the POI's income, and (2) the number of persons in the database.\n", "As we show next, if they also obtain the overall average income (including the POI's), simple arithmetic lets them back out the individual's income: it is the overall total minus the total of everyone else." ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": false, "pycharm": { "name": "#%%\n" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "poi_income: 0.0\n" ] } ], "source": [ "# attacker information: everyone else's mean, and their count.\n", "exact_mean = np.mean(incomes[1:])\n", "known_obs = len(incomes) - 1\n", "\n", "# assume the attacker legitimately obtains the overall mean (and hence can infer the total count)\n", "overall_mean = np.mean(incomes)\n", "n_obs = len(incomes)\n", "\n", "# back out POI's income\n", "poi_income = overall_mean * n_obs - known_obs * exact_mean\n", "print('poi_income: {0}'.format(poi_income))" ] }, { "cell_type": "markdown", "metadata": { "collapsed": false }, "source": [ "The attacker now knows with certainty that the POI has an income of $0.\n", "\n", "\n", "## Using OpenDP\n", "What happens if the attacker were made to interact with the data through OpenDP and was given a privacy budget of $\\epsilon = 1$?\n", "We will assume that the attacker is reasonably familiar with differential privacy and believes that they should use tighter data bounds than they expect the data to span in order to get a less 
noisy estimate.\n", "They will need to update their known mean (`exact_mean` in the code) accordingly." ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": false, "pycharm": { "name": "#%%\n" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "DP mean: 27660.389303281972\n", "Exact mean: 29805.858585858587\n" ] } ], "source": [ "max_influence = 1\n", "count_release = 100\n", "\n", "income_bounds = (0.0, 100_000.0)\n", "\n", "clamp_and_resize_data = (\n", "    (dp.vector_domain(dp.atom_domain(T=float)), dp.symmetric_distance()) >>\n", "    dp.t.then_clamp(bounds=income_bounds) >>\n", "    dp.t.then_resize(size=count_release, constant=10_000.0)\n", ")\n", "\n", "# the attacker's known mean of everyone else, recomputed on the clamped and resized data\n", "exact_mean = np.mean(clamp_and_resize_data(incomes)[1:])\n", "\n", "mean_measurement = (\n", "    clamp_and_resize_data >>\n", "    dp.t.then_mean() >>\n", "    dp.m.then_laplace(scale=1.0)\n", ")\n", "\n", "dp_mean = mean_measurement(incomes)\n", "\n", "print(\"DP mean:\", dp_mean)\n", "print(\"Exact mean:\", exact_mean)" ] }, { "cell_type": "markdown", "metadata": { "collapsed": false, "pycharm": { "name": "#%% md\n" } }, "source": [ "We use `n_sims` to repeat the process many times and get a sense of the range of possible outcomes for the attacker.\n", "In practice, they would see the result of only one simulation." 
] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": false, "pycharm": { "name": "#%%\n" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Known Mean Income (after truncation): 29805.858585858587\n", "Observed Mean Income: 28913.242083100424\n", "Estimated POI Income: -59455.79168995719\n", "True POI Income: 0.0\n" ] } ], "source": [ "# initialize lists to store the estimated overall means and POI incomes\n", "n_sims = 1_000\n", "n_queries = 1\n", "poi_income_ests = []\n", "estimated_means = []\n", "\n", "# get estimates of overall means\n", "for i in range(n_sims):\n", "    query_means = [mean_measurement(incomes) for j in range(n_queries)]\n", "\n", "    # average the queries, then back out an estimate of POI income\n", "    estimated_means.append(np.mean(query_means))\n", "    poi_income_ests.append(estimated_means[i] * count_release - (count_release - 1) * exact_mean)\n", "\n", "\n", "# get mean of estimates\n", "print('Known Mean Income (after truncation): {0}'.format(exact_mean))\n", "print('Observed Mean Income: {0}'.format(np.mean(estimated_means)))\n", "print('Estimated POI Income: {0}'.format(np.mean(poi_income_ests)))\n", "print('True POI Income: {0}'.format(person_of_interest))" ] }, { "cell_type": "markdown", "metadata": { "collapsed": false }, "source": [ "We see empirically that, in expectation, the attacker can get a reasonably good estimate of the POI's income. However, they will rarely (if ever) get it exactly and would have no way of knowing if they did.\n", "\n", "In our case, the mean estimated POI income does approach the true income as the number of simulations `n_sims` increases.\n", "Below is a plot showing the empirical distribution of estimates of POI income. Notice its concentration around `0` and the Laplace-shaped curve of the histogram." ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": false, "pycharm": { "name": "#%%\n" } }, "outputs": [ { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "import warnings\n", "import seaborn as sns\n", "\n", "# hide warning created by outstanding scipy.stats issue\n", "warnings.simplefilter(action='ignore', category=FutureWarning)\n", "\n", "# distribution of POI income estimates\n", "ax = sns.histplot(poi_income_ests, edgecolor='black', linewidth=1)\n", "ax.set(xlabel='Estimated POI income');" ] } ], "metadata": { "file_extension": ".py", "kernelspec": { "display_name": "Python 3.8.13 ('psi')", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.13" }, "mimetype": "text/x-python", "name": "python", "npconvert_exporter": "python", "pygments_lexer": "ipython3", "version": 3, "vscode": { "interpreter": { "hash": "3220da548452ac41acb293d0d6efded0f046fab635503eb911c05f743e930f34" } } }, "nbformat": 4, "nbformat_minor": 2 }