\n",
" ## Elements of a Boolean Randomized Response Measurement

\n",
"\n",
"1. We first define the **function** $f(\\cdot)$, that applies the randomized response to some boolean argument $x$. The function returns the correct answer with probability `prob`, otherwise it flips the answer.\n",
"\n",
"$$f(x) = x \\wedge \\neg \\mathrm{sample\\_bernoulli}(prob)$$\n",
"\n",
"2. $f(\\cdot)$ is only well-defined for boolean inputs. This (small) set of permitted inputs is described by the **input domain** (denoted `AtomDomain`).\n",
"\n",
"3. The set of possible outputs is described by the **output domain** (also `AtomDomain`).\n",
"\n",
"4. Randomized response has a privacy guarantee in terms of epsilon. \n",
"This guarantee is represented by a **privacy map**, a function that computes the privacy loss $\\epsilon$ for any choice of sensitivity $\\Delta$.\n",
"\n",
"$$map(d_{in}) = d_{in} \\cdot \\ln(\\mathrm{prob} / (1 - \\mathrm{prob}))$$\n",
"\n",
"5. This map requires that $d_{in}$ be a discrete distance, which is either 0 if the elements are the same, or 1 if the elements are different. This is used as the **input metric** (`DiscreteDistance`).\n",
"\n",
"6. We similarly describe units on the output ($\\epsilon$) via the **output measure** (`MaxDivergence`).\n",
"

\n",
"\n",
"`make_randomized_response_bool` returns the equivalent measurement:"
]
},
{
"cell_type": "code",
"execution_count": 311,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"noisy release: True\n",
"epsilon: 1.0986122886681098\n"
]
}
],
"source": [
"from opendp.measurements import make_randomized_response_bool\n",
"\n",
"# construct the measurement\n",
"rr_bool_meas = make_randomized_response_bool(prob=0.75)\n",
"\n",
"# invoke the measurement on some survey response to execute the randomized response algorithm\n",
"alice_survey_response = True\n",
"print(\"noisy release:\", rr_bool_meas(alice_survey_response))\n",
"\n",
"# determine epsilon by invoking the map\n",
"print(\"epsilon:\", rr_bool_meas.map(d_in=1))"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"A simple generalization of the previous algorithm is to randomize over a custom category set:\n"
]
},
{
"cell_type": "code",
"execution_count": 322,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"noisy release: D\n",
"epsilon: 2.1972245773362196\n"
]
}
],
"source": [
"from opendp.measurements import make_randomized_response\n",
"\n",
"# construct the measurement\n",
"categories = [\"A\", \"B\", \"C\", \"D\"]\n",
"rr_meas = make_randomized_response(categories, prob=0.75)\n",
"\n",
"# invoke the measurement on some survey response, to execute the randomized response algorithm\n",
"alice_survey_response = \"C\"\n",
"print(\"noisy release:\", rr_meas(alice_survey_response))\n",
"\n",
"# determine epsilon by invoking the map\n",
"print(\"epsilon:\", rr_meas.map(d_in=1))"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Aggregation: Mean\n",
"\n",
"The privatized responses from many individuals may be aggregated to form a population-level inference.\n",
"In the case of the boolean randomized response, you may want to estimate the proportion of individuals who actually responded with `True`.\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 313,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0.364"
]
},
"execution_count": 313,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import numpy as np\n",
"num_responses = 1000\n",
"\n",
"true_probability = .23\n",
"\n",
"private_bool_responses = []\n",
"\n",
"for _ in range(num_responses):\n",
" response = bool(np.random.binomial(n=1, p=true_probability))\n",
" randomized_response = rr_bool_meas(response)\n",
" private_bool_responses.append(randomized_response)\n",
"\n",
"naive_proportion = np.mean(private_bool_responses)\n",
"naive_proportion"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"We know the true probability is .23, so our estimate is off!\n",
"\n",
"The naive proportions can be corrected for bias via the following derivation:\n",
"\n",
"\n",
" ## Derivation of Boolean RR Bias Correction

\n",
"\n",
"We want an unbiased estimate of $\\frac{\\sum X_i}{n}$.\n",
"Denote the randomized response $Y_i = \\texttt{rr\\_bool\\_meas}(X_i)$.\n",
"We first find the expectation of $Y_i$:\n",
"$$\\begin{align*}\n",
" E[Y_i] &= p X_i + (1 - p) (1 - X_i) \\\\\n",
" &= p X_i + p X_i - p - X_i + 1 \\\\\n",
" &= (2 p - 1) X_i - p + 1\n",
"\\end{align*}$$\n",
"\n",
"This can be used as an unbiased estimator for the proportion of true answers:\n",
"\n",
"$$\\begin{align*}\n",
" E[X_i] = \\frac{E[Y_i] + p - 1}{2 p - 1}\n",
"\\end{align*}$$\n",
"\n",
"

\n",
"\n",
"\n",
"The resulting expression is distilled into the following function:"
]
},
{
"cell_type": "code",
"execution_count": 314,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0.22799999999999976"
]
},
"execution_count": 314,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"def debias_randomized_response_bool(mean_release, p):\n",
" \"\"\"Adjust for the bias of the mean of a boolean RR dataset.\"\"\"\n",
" assert 0 <= mean_release <= 1\n",
" assert 0 <= p <= 1\n",
" \n",
" return (mean_release + p - 1) / (2 * p - 1)\n",
"\n",
"estimated_bool_proportion = debias_randomized_response_bool(naive_proportion, .75)\n",
"estimated_bool_proportion"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"As expected, the bias correction admits a useful estimate of the population proportion (`.23`).\n",
"\n",
"The categorical randomized response will suffer the same bias:"
]
},
{
"cell_type": "code",
"execution_count": 315,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[0.165, 0.349, 0.284, 0.202]"
]
},
"execution_count": 315,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import numpy as np\n",
"num_responses = 1000\n",
"\n",
"true_probability = [0.1, 0.4, 0.3, 0.2]\n",
"\n",
"private_cat_responses = []\n",
"\n",
"for _ in range(num_responses):\n",
" response = np.random.choice(categories, p=true_probability)\n",
" randomized_response = rr_meas(response)\n",
" private_cat_responses.append(randomized_response)\n",
"\n",
"from collections import Counter\n",
"\n",
"counter = Counter(private_cat_responses)\n",
"naive_cat_proportions = [counter[cat] / num_responses for cat in categories]\n",
"naive_cat_proportions"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"We can do the same analysis to de-bias the categorical estimate:\n",
"\n",
"\n",
" ## Derivation of Categorical RR Bias Correction

\n",
"\n",
"Denote the randomized response $Y_i = \\texttt{rr\\_meas}(X_i)$, and the $k^{th}$ category as $C_k$.\n",
"\n",
"We first find $E[I[Y_i = C_k]]$ (the expectation that noisy responses are equal to the $k^{th}$ category). \n",
"This is done by considering the law of total probability over all categories.\n",
"\n",
"$$\\begin{align*}\n",
" E[I[Y_i = C_k]] &= p \\cdot I[X_i = C_k] + \\sum_{j \\ne k} \\frac{1 - p}{K - 1} \\cdot I[X_i = C_j] \\\\\n",
" &= p \\cdot I[X_i = C_k] + \\frac{1 - p}{K - 1} \\cdot (1 - I[X_i = C_k])\n",
"\\end{align*}$$\n",
"\n",
"Then solve for $E[I[X_i = C_k]]$ (the expectation that raw responses are equal to the $k^{th}$ category):\n",
"\n",
"$$\\begin{align*}\n",
" E[I[Y_i = C_k]] (K - 1) &= p \\cdot E[I[X_i = C_k]] (K - 1) + (1 - p)(1 - E[I[X_i = C_k]]) \\\\\n",
" E[I[Y_i = C_k]] (K - 1) &= p \\cdot E[I[X_i = C_k]] K - p - E[I[X_i = C_k]] + 1 \\\\\n",
" E[I[Y_i = C_k]] (K - 1) + p - 1 &= E[I[X_i = C_k]] (pK - 1) \\\\\n",
" \\frac{E[I[Y_i = C_k]] (K - 1) + p - 1}{pK - 1} &= E[I[X_i = C_k]]\n",
"\\end{align*}$$\n",
"\n",
"

\n",
"\n",
"This formula is represented in the following function:"
]
},
{
"cell_type": "code",
"execution_count": 316,
"metadata": {},
"outputs": [],
"source": [
"def debias_randomized_response(mean_releases, p):\n",
" \"\"\"Adjust for the bias of the mean of a categorical RR dataset.\"\"\"\n",
" mean_releases = np.array(mean_releases)\n",
" assert all(mean_releases >= 0) and abs(sum(mean_releases) - 1) < 1e-6\n",
" assert 0 <= p <= 1\n",
" \n",
" k = len(mean_releases)\n",
" return (mean_releases * (k - 1) + p - 1) / (p * k - 1)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"We similarly estimate population parameters in the categorical setting:"
]
},
{
"cell_type": "code",
"execution_count": 317,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"true probability: [0.1, 0.4, 0.3, 0.2]\n",
"estimated probability: [0.123, 0.398, 0.301, 0.178]\n"
]
}
],
"source": [
"estimated_cat_proportions = debias_randomized_response(naive_cat_proportions, .75)\n",
"\n",
"print(\"true probability:\", true_probability)\n",
"print(\"estimated probability:\", list(estimated_cat_proportions.round(3)))"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Aggregation: Count\n",
"\n",
"Just like the mean was biased, so is a simple count of responses for each category:"
]
},
{
"cell_type": "code",
"execution_count": 323,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"biased boolean count: 364\n",
"biased categorical count: {'A': 165, 'B': 349, 'C': 284, 'D': 202}\n"
]
}
],
"source": [
"print(\"biased boolean count:\", np.sum(private_bool_responses))\n",
"print(\"biased categorical count:\", dict(sorted(Counter(private_cat_responses).items())))"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Since the dataset size is known, simply post-process the mean estimates:"
]
},
{
"cell_type": "code",
"execution_count": 320,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"unbiased boolean count: 227\n",
"unbiased categorical count: {'A': 122, 'B': 398, 'C': 300, 'D': 178}\n"
]
}
],
"source": [
"estimated_bool_count = int(estimated_bool_proportion * num_responses)\n",
"estimated_cat_count = dict(zip(categories, (estimated_cat_proportions * num_responses).astype(int)))\n",
"\n",
"print(\"unbiased boolean count:\", estimated_bool_count)\n",
"print(\"unbiased categorical count:\", estimated_cat_count)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "psi",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.13 (default, Mar 16 2022, 20:38:07) \n[Clang 13.0.0 (clang-1300.0.29.30)]"
},
"orig_nbformat": 4,
"vscode": {
"interpreter": {
"hash": "3220da548452ac41acb293d0d6efded0f046fab635503eb911c05f743e930f34"
}
}
},
"nbformat": 4,
"nbformat_minor": 2
}