{ "cells": [ { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "# Randomized Response\n", "\n", "Randomized response is used to release categorical survey responses in the local-DP, or per-user, model.\n", "The randomized response algorithm is typically meant to be run on the edge, at the user's device, before data is submitted to a central server.\n", "Local DP is a stronger privacy model than the central model, because the central data aggregator is only ever privileged to privatized data.\n", "\n", "OpenDP currently only provides mechanisms that may be run on the edge device:\n", "You must handle network communication and aggregation.\n", "\n", "--------\n", "\n", "Any constructors that have not completed the proof-writing and vetting process may still be accessed if you opt-in to \"contrib\".\n", "Please contact us if you are interested in proof-writing. Thank you!" ] }, { "cell_type": "code", "execution_count": 310, "metadata": {}, "outputs": [], "source": [ "from opendp.mod import enable_features\n", "enable_features(\"contrib\")" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Privatization\n", "\n", "We'll start by privatizing a boolean response. Boolean randomized response is fully characterized by a _measurement_ containing the following six elements:\n", "\n", "
\n", " Elements of a Boolean Randomized Response Measurement\n", "\n", "1. We first define the **function** $f(\\cdot)$, that applies the randomized response to some boolean argument $x$. The function returns the correct answer with probability `prob`, otherwise it flips the answer.\n", "\n", "$$f(x) = x \\wedge \\neg \\mathrm{sample\\_bernoulli}(prob)$$\n", "\n", "2. $f(\\cdot)$ is only well-defined for boolean inputs. This (small) set of permitted inputs is described by the **input domain** (denoted `AtomDomain`).\n", "\n", "3. The set of possible outputs is described by the **output domain** (also `AtomDomain`).\n", "\n", "4. Randomized response has a privacy guarantee in terms of epsilon. \n", "This guarantee is represented by a **privacy map**, a function that computes the privacy loss $\\epsilon$ for any choice of sensitivity $\\Delta$.\n", "\n", "$$map(d_{in}) = d_{in} \\cdot \\ln(\\mathrm{prob} / (1 - \\mathrm{prob}))$$\n", "\n", "5. This map requires that $d_{in}$ be a discrete distance, which is either 0 if the elements are the same, or 1 if the elements are different. This is used as the **input metric** (`DiscreteDistance`).\n", "\n", "6. We similarly describe units on the output ($\\epsilon$) via the **output measure** (`MaxDivergence`).\n", "
\n", "\n", "`make_randomized_response_bool` returns the equivalent measurement:" ] }, { "cell_type": "code", "execution_count": 311, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "noisy release: True\n", "epsilon: 1.0986122886681098\n" ] } ], "source": [ "from opendp.measurements import make_randomized_response_bool\n", "\n", "# construct the measurement\n", "rr_bool_meas = make_randomized_response_bool(prob=0.75)\n", "\n", "# invoke the measurement on some survey response to execute the randomized response algorithm\n", "alice_survey_response = True\n", "print(\"noisy release:\", rr_bool_meas(alice_survey_response))\n", "\n", "# determine epsilon by invoking the map\n", "print(\"epsilon:\", rr_bool_meas.map(d_in=1))" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "A simple generalization of the previous algorithm is to randomize over a custom category set:\n" ] }, { "cell_type": "code", "execution_count": 322, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "noisy release: D\n", "epsilon: 2.1972245773362196\n" ] } ], "source": [ "from opendp.measurements import make_randomized_response\n", "\n", "# construct the measurement\n", "categories = [\"A\", \"B\", \"C\", \"D\"]\n", "rr_meas = make_randomized_response(categories, prob=0.75)\n", "\n", "# invoke the measurement on some survey response, to execute the randomized response algorithm\n", "alice_survey_response = \"C\"\n", "print(\"noisy release:\", rr_meas(alice_survey_response))\n", "\n", "# determine epsilon by invoking the map\n", "print(\"epsilon:\", rr_meas.map(d_in=1))" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Aggregation: Mean\n", "\n", "The privatized responses from many individuals may be aggregated to form a population-level inference.\n", "In the case of the boolean randomized response, you may want to estimate the proportion of individuals who actually responded with `True`.\n", "\n" ] }, { "cell_type": "code", "execution_count": 313, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.364" ] }, "execution_count": 313, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import numpy as np\n", "num_responses = 1000\n", "\n", "true_probability = .23\n", "\n", "private_bool_responses = []\n", "\n", "for _ in range(num_responses):\n", " response = bool(np.random.binomial(n=1, p=true_probability))\n", " randomized_response = rr_bool_meas(response)\n", " private_bool_responses.append(randomized_response)\n", "\n", "naive_proportion = np.mean(private_bool_responses)\n", "naive_proportion # pyright: ignore" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "We know the true probability is .23, so our estimate is off!\n", "\n", "The naive proportions can be corrected for bias via the following derivation:\n", "\n", "
\n", " Derivation of Boolean RR Bias Correction\n", "\n", "We want an unbiased estimate of $\\frac{\\sum X_i}{n}$.\n", "Denote the randomized response $Y_i = \\texttt{rr\\_bool\\_meas}(X_i)$.\n", "We first find the expectation of $Y_i$:\n", "$$\\begin{align*}\n", " E[Y_i] &= p X_i + (1 - p) (1 - X_i) \\\\\n", " &= p X_i + p X_i - p - X_i + 1 \\\\\n", " &= (2 p - 1) X_i - p + 1\n", "\\end{align*}$$\n", "\n", "This can be used as an unbiased estimator for the proportion of true answers:\n", "\n", "$$\\begin{align*}\n", " E[X_i] = \\frac{E[Y_i] + p - 1}{2 p - 1}\n", "\\end{align*}$$\n", "\n", "
\n", "\n", "\n", "The resulting expression is distilled into the following function:" ] }, { "cell_type": "code", "execution_count": 314, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.22799999999999976" ] }, "execution_count": 314, "metadata": {}, "output_type": "execute_result" } ], "source": [ "def debias_randomized_response_bool(mean_release, p):\n", " \"\"\"Adjust for the bias of the mean of a boolean RR dataset.\"\"\"\n", " assert 0 <= mean_release <= 1\n", " assert 0 <= p <= 1\n", " \n", " return (mean_release + p - 1) / (2 * p - 1)\n", "\n", "estimated_bool_proportion = debias_randomized_response_bool(naive_proportion, .75)\n", "estimated_bool_proportion" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "As expected, the bias correction admits a useful estimate of the population proportion (`.23`).\n", "\n", "The categorical randomized response will suffer the same bias:" ] }, { "cell_type": "code", "execution_count": 315, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[0.165, 0.349, 0.284, 0.202]" ] }, "execution_count": 315, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import numpy as np\n", "num_responses = 1000\n", "\n", "true_probability = [0.1, 0.4, 0.3, 0.2]\n", "\n", "private_cat_responses = []\n", "\n", "for _ in range(num_responses):\n", " response = np.random.choice(categories, p=true_probability)\n", " randomized_response = rr_meas(response)\n", " private_cat_responses.append(randomized_response)\n", "\n", "from collections import Counter\n", "\n", "counter = Counter(private_cat_responses)\n", "naive_cat_proportions = [counter[cat] / num_responses for cat in categories]\n", "naive_cat_proportions" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "\n", "We can do the same analysis to de-bias the categorical estimate:\n", "\n", "
\n", " Derivation of Categorical RR Bias Correction\n", "\n", "Denote the randomized response $Y_i = \\texttt{rr\\_meas}(X_i)$, and the $k^{th}$ category as $C_k$.\n", "\n", "We first find $E[I[Y_i = C_k]]$ (the expectation that noisy responses are equal to the $k^{th}$ category). \n", "This is done by considering the law of total probability over all categories.\n", "\n", "$$\\begin{align*}\n", " E[I[Y_i = C_k]] &= p \\cdot I[X_i = C_k] + \\sum_{j \\ne k} \\frac{1 - p}{K - 1} \\cdot I[X_i = C_j] \\\\\n", " &= p \\cdot I[X_i = C_k] + \\frac{1 - p}{K - 1} \\cdot (1 - I[X_i = C_k])\n", "\\end{align*}$$\n", "\n", "Then solve for $E[I[X_i = C_k]]$ (the expectation that raw responses are equal to the $k^{th}$ category):\n", "\n", "$$\\begin{align*}\n", " E[I[Y_i = C_k]] (K - 1) &= p \\cdot E[I[X_i = C_k]] (K - 1) + (1 - p)(1 - E[I[X_i = C_k]]) \\\\\n", " E[I[Y_i = C_k]] (K - 1) &= p \\cdot E[I[X_i = C_k]] K - p - E[I[X_i = C_k]] + 1 \\\\\n", " E[I[Y_i = C_k]] (K - 1) + p - 1 &= E[I[X_i = C_k]] (pK - 1) \\\\\n", " \\frac{E[I[Y_i = C_k]] (K - 1) + p - 1}{pK - 1} &= E[I[X_i = C_k]]\n", "\\end{align*}$$\n", "\n", "
\n", "\n", "This formula is represented in the following function:" ] }, { "cell_type": "code", "execution_count": 316, "metadata": {}, "outputs": [], "source": [ "def debias_randomized_response(mean_releases, p):\n", " \"\"\"Adjust for the bias of the mean of a categorical RR dataset.\"\"\"\n", " mean_releases = np.array(mean_releases)\n", " assert all(mean_releases >= 0) and abs(sum(mean_releases) - 1) < 1e-6\n", " assert 0 <= p <= 1\n", " \n", " k = len(mean_releases)\n", " return (mean_releases * (k - 1) + p - 1) / (p * k - 1)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "We similarly estimate population parameters in the categorical setting:" ] }, { "cell_type": "code", "execution_count": 317, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "true probability: [0.1, 0.4, 0.3, 0.2]\n", "estimated probability: [0.123, 0.398, 0.301, 0.178]\n" ] } ], "source": [ "estimated_cat_proportions = debias_randomized_response(naive_cat_proportions, .75)\n", "\n", "print(\"true probability:\", true_probability)\n", "print(\"estimated probability:\", list(estimated_cat_proportions.round(3)))" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Aggregation: Count\n", "\n", "Just like the mean was biased, so is a simple count of responses for each category:" ] }, { "cell_type": "code", "execution_count": 323, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "biased boolean count: 364\n", "biased categorical count: {'A': 165, 'B': 349, 'C': 284, 'D': 202}\n" ] } ], "source": [ "print(\"biased boolean count:\", np.sum(private_bool_responses))\n", "print(\"biased categorical count:\", dict(sorted(Counter(private_cat_responses).items())))" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Since the dataset size is known, simply post-process the mean estimates:" ] }, { "cell_type": "code", "execution_count": 320, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "unbiased boolean count: 227\n", "unbiased categorical count: {'A': 122, 'B': 398, 'C': 300, 'D': 178}\n" ] } ], "source": [ "estimated_bool_count = int(estimated_bool_proportion * num_responses)\n", "estimated_cat_count = dict(zip(categories, (estimated_cat_proportions * num_responses).astype(int)))\n", "\n", "print(\"unbiased boolean count:\", estimated_bool_count)\n", "print(\"unbiased categorical count:\", estimated_cat_count)" ] } ], "metadata": { "kernelspec": { "display_name": "psi", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.13 (default, Mar 16 2022, 20:38:07) \n[Clang 13.0.0 (clang-1300.0.29.30)]" }, "orig_nbformat": 4, "vscode": { "interpreter": { "hash": "3220da548452ac41acb293d0d6efded0f046fab635503eb911c05f743e930f34" } } }, "nbformat": 4, "nbformat_minor": 2 }