{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Differencing Attack\n", "\n", "This section demonstrates one of the simplest possible attacks on an individual's private data, the differencing attack, and how differential privacy mitigates it.\n", "We'll demonstrate the attack with public-use California income microdata." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": false, "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "import numpy as np\n", "import opendp.prelude as dp\n", "dp.enable_features('contrib') # OpenDP is vetting new features, so we need to enable them explicitly\n", "\n", "data_path = dp.examples.get_california_pums_path()\n", "var_names = [\"age\", \"sex\", \"educ\", \"race\", \"income\", \"married\"]\n", "incomes = np.genfromtxt(data_path, delimiter=',', names=var_names)[:]['income'].tolist() # type: ignore" ] }, { "cell_type": "markdown", "metadata": { "collapsed": false }, "source": [ "Say an attacker wants to know the income of the first person in our data (i.e. the first income in the CSV). \n", "In our case, it happens to be `0`." ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false, "pycharm": { "name": "#%%\n" } }, "outputs": [ { "data": { "text/plain": [ "0.0" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "target_income = incomes[0]\n", "target_income" ] }, { "cell_type": "markdown", "metadata": { "collapsed": false }, "source": [ "One way the attacker could deduce the income of the target individual is by acquiring the following information:\n", "1. the number of individuals in the dataset\n", "2. the average income\n", "3. 
the average income without the target individual" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false, "pycharm": { "name": "#%%\n" } }, "outputs": [ { "data": { "text/plain": [ "0.0" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# attacker information:\n", "n_individuals = len(incomes)\n", "mean = float(np.mean(incomes))\n", "mean_non_target = float(np.mean(incomes[1:]))\n", "\n", "def reconstruct_income(n_individuals, mean, mean_non_target):\n", "    \"\"\"Reconstruct the target's income from the mean and the mean of the non-targets.\"\"\"\n", "    # total income of everyone, minus the total income of everyone except the target\n", "    return mean * n_individuals - (n_individuals - 1) * mean_non_target\n", "\n", "recovered_income = reconstruct_income(n_individuals, mean, mean_non_target)\n", "recovered_income" ] }, { "cell_type": "markdown", "metadata": { "collapsed": false }, "source": [ "These queries seem more benign than directly requesting the target's income.\n", "The most suspicious of the three (mean without the target individual) \n", "could easily be hidden behind innocuous-looking predicates that _just so happen_ to exclude one individual.\n", "\n", "In general, it is impossible to anticipate how more complex combinations of queries could be used to violate the privacy of an individual in the data.\n", "A further complicating factor is that, when deciding whether to answer a query,\n", "data curators have no way of knowing what auxiliary information an adversary already holds.\n", "\n", "## Differential Privacy\n", "\n", "Differential privacy mathematically guarantees that data released to the adversary \n", "will only increase the adversary's knowledge about any one individual by a small amount.\n", "Therefore, when the privacy parameters are appropriately tuned (a rule of thumb being $\\epsilon = 1$), \n", "the adversary won't be able to infer the income of the target,\n", "even when the adversary has access to unlimited auxiliary information.\n", "\n", "Let's set 
up a mediator that the adversary can query that ensures the privacy loss will not exceed $\\epsilon = 1$." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "import polars as pl\n", "\n", "context = dp.Context.compositor(\n", "    data=pl.scan_csv(\n", "        data_path,\n", "        with_column_names=lambda _: var_names,\n", "        infer_schema_length=None,\n", "    ),\n", "    privacy_unit=dp.unit_of(contributions=1),\n", "    # allows a privacy loss of at most epsilon = 1 for each individual in the data\n", "    privacy_loss=dp.loss_of(epsilon=1.0),\n", "    # the adversary will be able to ask two queries\n", "    split_evenly_over=2,\n", "    # in this case, it is public info that there are at most 1000 individuals in the data\n", "    margins=[dp.polars.Margin(max_length=1000)]\n", ")\n", "# all further data access will be mediated by the context\n", "del data_path" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We'll start by assuming the adversary knows the number of individuals and average income.\n", "The adversary also knows the target, so they know the target is 59 years old and has education status 9, \n", "which is enough to single out the target.\n", "The adversary now sneakily requests the total income of _everyone else_,\n", "from which the non-target mean follows by dividing by the known count."
] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "2586576.0" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import polars as pl\n", "\n", "query = (\n", "    context.query()\n", "    # matches everyone except the target\n", "    .filter((pl.col.age != 59).or_(pl.col.educ != 9))\n", "    .select(pl.col(\"income\").cast(int).dp.sum((0, 200_000)))\n", ")\n", "dp_non_target_income = query.release().collect().item()\n", "\n", "dp_target_income = reconstruct_income(\n", "    n_individuals,\n", "    mean,\n", "    # if the adversary uses the DP release to reconstruct the income, ...\n", "    mean_non_target=dp_non_target_income / (n_individuals - 1),\n", ")\n", "# ...then the recovered income will be wildly inaccurate\n", "dp_target_income" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Even though the adversary had the advantage of auxiliary information, \n", "and all they needed was the total non-target income,\n", "they still couldn't use the DP release to get a good estimate of the target's income.\n", "\n", "Nonetheless, for an honest analyst, the estimate of the non-target income has reasonable utility:" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Non-DP mean income: 34414.4984984985\n", " DP mean income: 31825.333333333332\n" ] } ], "source": [ "print(\"Non-DP mean income:\", mean_non_target)\n", "print(\" DP mean income:\", dp_non_target_income / (n_individuals - 1))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What if the adversary directly queries for the target's income?\n", "They plug in the known attributes as predicates, like age and education status, to single out the target.\n", "They even lower the upper bound on income, based on a best guess, to reduce the noise:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ 
{ "data": { "text/plain": [ "86414" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "query = (\n", "    context.query()\n", "    .filter((pl.col.age == 59).and_(pl.col.educ == 9))\n", "    .select(pl.col(\"income\").cast(int).dp.sum((0, 70_000)))\n", ")\n", "dp_target_income = query.release().collect().item()\n", "dp_target_income" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Even though this query was explicitly crafted to single out the target, \n", "the estimate is still wildly inaccurate.\n", "\n", "## Distribution of Outcomes\n", "To demonstrate what differential privacy does to protect our data,\n", "let's now take off the adversary hat and put on the student hat.\n", "As students interested in learning about DP, who are _not_ working with sensitive data, \n", "let's re-run the release multiple times to reveal the distribution of `dp_target_income`." ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": false, "pycharm": { "name": "#%%\n" } }, "outputs": [ { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "import seaborn as sns\n", "\n", "m_target_income = query.resolve()\n", "data = pl.scan_csv(\n", "    dp.examples.get_california_pums_path(),\n", "    with_column_names=lambda _: var_names,\n", "    infer_schema_length=None,\n", ")\n", "\n", "# simulate 1,000 independent DP releases of the target's income\n", "dp_target_incomes = [m_target_income(data).collect().item() for _ in range(1_000)]\n", "ax = sns.histplot(dp_target_incomes, edgecolor='black', linewidth=1)\n", "ax.set(xlabel='Estimated Target Incomes');" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice how the estimates of the target's income vary wildly, \n", "even though they are centered around the true income.\n", "The adversary, only ever seeing one of these simulated releases, \n", "will have practically no better knowledge of the target's income than they did before,\n", "even when the mechanism happens to release zero.\n", "\n", "This is because the target can always appeal:\n", "> My income wasn't actually zero! It was the noise! I practically made six figures!" ] } ], "metadata": { "file_extension": ".py", "kernelspec": { "display_name": ".venv (3.13.7)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.13.7" }, "mimetype": "text/x-python", "name": "python", "npconvert_exporter": "python", "pygments_lexer": "ipython3", "version": 3 }, "nbformat": 4, "nbformat_minor": 2 }