{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Privatizing Histograms\n", "\n", "Sometimes we want to release the counts of individual outcomes in a dataset.\n", "When plotted, this makes a histogram.\n", "\n", "The library currently has two approaches:\n", "\n", "1. Known category set `make_count_by_categories`\n", "2. Unknown category set `make_count_by`\n", "\n", "The next code block imports handles boilerplate: imports, data loading, plotting." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%[ -e data.csv ] || wget https://raw.githubusercontent.com/opendp/opendp/main/docs/source/data/PUMS_california_demographics_1000/data.csv" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "import opendp.prelude as dp\n", "dp.enable_features(\"contrib\", \"floating-point\")\n", "max_influence = 1\n", "budget = (1., 1e-8)\n", "\n", "# public information\n", "col_names = [\"age\", \"sex\", \"educ\", \"race\", \"income\", \"married\"]\n", "size = 1000\n", "\n", "with open('data.csv') as input_data:\n", " data = input_data.read()\n", "\n", "def plot_histogram(sensitive_counts, released_counts):\n", " \"\"\"Plot a histogram that compares true data against released data\"\"\"\n", " import matplotlib.pyplot as plt\n", " import matplotlib.ticker as ticker\n", "\n", " fig = plt.figure()\n", " ax = fig.add_axes([1,1,1,1])\n", " plt.ylim([0,225])\n", " tick_spacing = 1.\n", " ax.xaxis.set_major_locator(ticker.MultipleLocator(tick_spacing))\n", " plt.xlim(0,15)\n", " width = .4\n", "\n", " ax.bar(list([x+width for x in range(0, len(sensitive_counts))]), sensitive_counts, width=width, label='True Value')\n", " ax.bar(list([x+2*width for x in range(0, len(released_counts))]), released_counts, width=width, label='DP Value')\n", " ax.legend()\n", " plt.title('Histogram of Education Level')\n", " plt.xlabel('Years of Education')\n", " 
plt.ylabel('Count')\n", " plt.show()" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "## Private histogram via `make_count_by_categories`\n", "\n", "This approach applies only when the set of values the data may take on is public information.\n", "If this information is not available, then use `make_count_by` instead.\n", "`make_count_by_categories` typically has greater utility than `make_count_by` until the category set grows larger than the dataset.\n", "In this data, we know that the category set is public information:\n", "the integers 1 through 19, represented as strings.\n", "\n", "The counting aggregator computes a vector of counts in the same order as the input categories.\n", "It also includes one extra count at the end of the vector,\n", "consisting of the number of elements that were not members of the category set.\n", "\n", "The noise mechanism, `then_laplace`, receives its input domain from the preceding transformation.\n", "A mechanism that noises a single scalar count would operate on `AtomDomain[int]`.\n", "However, in this situation, we are noising a vector of scalars,\n", "and thus the appropriate domain is `VectorDomain[AtomDomain[int]]`." ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Educational level counts:\n", " [33, 14, 38, 17, 24, 21, 31, 51, 201, 60, 165, 76, 178, 54, 24, 13, 0, 0, 0]\n", "DP Educational level counts:\n", " [34, 13, 38, 18, 24, 22, 31, 51, 201, 59, 166, 76, 178, 54, 24, 13, 0, 0, 3]\n", "DP estimate for the number of records that were not a member of the category set: 1\n" ] }, { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "# public information\n", "categories = list(map(str, range(1, 20)))\n", "\n", "histogram = (\n", " dp.t.make_split_dataframe(separator=\",\", col_names=col_names) >>\n", " dp.t.make_select_column(key=\"educ\", TOA=str) >>\n", " # Compute counts for each of the categories and null\n", " dp.t.then_count_by_categories(categories=categories)\n", ")\n", "\n", "noisy_histogram = dp.binary_search_chain(\n", " lambda s: histogram >> dp.m.then_laplace(scale=s),\n", " d_in=max_influence, d_out=budget[0])\n", "\n", "sensitive_counts = histogram(data)\n", "released_counts = noisy_histogram(data)\n", "\n", "print(\"Educational level counts:\\n\", sensitive_counts[:-1])\n", "print(\"DP Educational level counts:\\n\", released_counts[:-1])\n", "\n", "print(\"DP estimate for the number of records that were not a member of the category set:\", released_counts[-1])\n", "\n", "plot_histogram(sensitive_counts, released_counts)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "## Private histogram via `make_count_by` and `make_base_laplace_threshold`\\n\n", "This approach is applicable when the set of categories is unknown or very large.\n", "The `make_count_by` transformation computes a hashmap containing the count of each unique key,\n", "and `make_base_laplace_threshold` adds noise to the counts and censors counts less than some threshold.\\n\n", "\n", "On `make_base_laplace_threshold`, the noise scale parameter influences the epsilon parameter of the budget, \n", "and the threshold influences the delta parameter in the budget.\n", "Any category with a count sufficiently small is censored from the release.\n", "\n", "It is sometimes referred to as a \"stability histogram\" because it only releases counts for \"stable\" categories that exist in all datasets that are considered \"neighboring\" to your private dataset.\n", 
"\n", "I start out by defining a function that finds the tightest noise scale and threshold for which the stability histogram is `(d_in, d_out)`-close." ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "def make_base_laplace_threshold_budget(\n", " preprocess: dp.Transformation,\n", " d_in, d_out\n", ") -> dp.Measurement:\n", " \"\"\"Make a stability histogram that respects a given d_in, d_out.\"\"\"\n", " def privatize(s, t=1e8):\n", " return preprocess >> dp.m.then_base_laplace_threshold(scale=s, threshold=t)\n", " \n", " s = dp.binary_search_param(lambda s: privatize(s=s), d_in, d_out)\n", " t = dp.binary_search_param(lambda t: privatize(s=s, t=t), d_in, d_out)\n", "\n", " return privatize(s=s, t=t)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "I now use the `make_base_laplace_threshold_budget` constructor to release a private histogram on the education data." ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Educational level counts:\n", " [33, 14, 38, 17, 24, 21, 31, 51, 201, 60, 165, 76, 178, 54, 24, 13, 0, 0, 0, 0]\n", "DP Educational level counts:\n", " {'3': 39, '10': 60, '1': 32, '11': 160, '13': 178, '7': 31, '14': 56, '5': 23, '8': 51, '12': 77, '6': 23, '9': 201, '15': 24}\n" ] }, { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "preprocess = (\n", " dp.t.make_split_dataframe(separator=\",\", col_names=col_names) >>\n", " dp.t.make_select_column(key=\"educ\", TOA=str) >>\n", " dp.t.then_count_by(MO=dp.L1Distance[float], TV=float)\n", ")\n", "\n", "noisy_histogram = make_base_laplace_threshold_budget(\n", " preprocess,\n", " d_in=max_influence, d_out=budget)\n", "\n", "sensitive_counts = histogram(data)\n", "released_counts = noisy_histogram(data)\n", "# postprocess to make the results easier to compare\n", "postprocessed_counts = {k: round(v) for k, v in released_counts.items()}\n", "\n", "print(\"Educational level counts:\\n\", sensitive_counts)\n", "print(\"DP Educational level counts:\\n\", postprocessed_counts)\n", "\n", "def as_array(data):\n", " return [data.get(k, 0) for k in categories]\n", "\n", "plot_histogram(sensitive_counts, as_array(released_counts))" ] } ], "metadata": { "interpreter": { "hash": "3220da548452ac41acb293d0d6efded0f046fab635503eb911c05f743e930f34" }, "kernelspec": { "display_name": "Python 3.8.13 ('psi')", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.13" } }, "nbformat": 4, "nbformat_minor": 1 }