{ "cells": [ { "cell_type": "markdown", "id": "63a3bd92", "metadata": {}, "source": [ "# Approximate Laplace Projection\n", "\n", "When dealing with data that has an unknown key-set, the keys themselves need to be protected. \n", "One approach (shown in this section) is to release a differentially private low-dimensional projection of the key-space.\n", "The mechanism releases a queryable that can be queried with keys to retrieve noisy count estimates.\n", "\n", "Two other commonly-used approaches to deal with unknown key-set are \n", "\"explicit\" key release, where only pre-specified keys are released, and \n", "\"stable\" key release, where only keys contributed by many individuals are released.\n", "(Stable key release uses thresholded noise mechanisms.)" ] }, { "cell_type": "markdown", "id": "0361851a", "metadata": {}, "source": [ "----\n", "Any constructors that have not completed the proof-writing and vetting process may still be accessed if you opt-in to \"contrib\".\n", "Please contact us if you are interested in proof-writing. Thank you!" ] }, { "cell_type": "code", "execution_count": 1, "id": "0d3e1966", "metadata": {}, "outputs": [], "source": [ "import opendp.prelude as dp\n", "dp.enable_features(\"contrib\")" ] }, { "cell_type": "markdown", "id": "e48f695f", "metadata": {}, "source": [ "The mechanism takes as input a hash map, \n", "typically where the keys represent grouping keys, \n", "and values represent the number of occurences of those keys in the dataset.\n", "\n", "In this example, imagine you have a dataset where the keys correspond to the first names of individuals, and values correspond to the number of people with that name in a dataset." ] }, { "cell_type": "code", "execution_count": 2, "id": "60528e4d", "metadata": {}, "outputs": [], "source": [ "# Based on real-world name frequencies. \n", "# The data could just as well consist of many hundreds of more names.\n", "names = {'Michael': 750, 'James': 500, 'Sharon': 50}\n", "input_domain = dp.map_domain(dp.atom_domain(T=str), dp.atom_domain(T=int))" ] }, { "cell_type": "markdown", "id": "f4f8806d", "metadata": {}, "source": [ "Just like the thresholded noise mechanism, \n", "the sensitivity is expressed in terms of a triple:\n", "\n", "* $\\Delta_0$: how many values an individual may influence\n", "* $\\Delta_1$: the total influence an individual may have over all values\n", "* $\\Delta_\\infty$: the influence an individual may have on any one value\n", "\n", "Naturally, in this setting, all three of these quantities are one:" ] }, { "cell_type": "code", "execution_count": 3, "id": "b9d11dd8", "metadata": {}, "outputs": [], "source": [ "sensitivity = 1, 1, 1\n", "input_metric = dp.l01inf_distance(dp.absolute_distance(T=int))" ] }, { "cell_type": "markdown", "id": "b71358ea", "metadata": {}, "source": [ "A key part of the mechanism is determining how large the projected space should be.\n", "The size of the projected space is associated with a memory/utility tradeoff,\n", "where a larger projected space costs more memory, but retains greater representational capacity.\n", "If the projected space is too small, then many more strings will hash to the same value,\n", "resulting in a higher likelihood of hash collisions.\n", "The size of the projected space should ideally be large enough that it is unlikely \n", "more than one frequent name hashes to the same value.\n", "\n", "The two most important factors that influence the size of the projected space are:\n", "\n", "* `total_limit`: upper-bound the sum of values.\n", "* `value_limit`: upper-bound each value. Required if the input domain is not already bounded." ] }, { "cell_type": "code", "execution_count": 4, "id": "fdd1bf29", "metadata": {}, "outputs": [], "source": [ "epsilon = 1.0\n", "\n", "m_alp = dp.binary_search_chain(\n", " lambda s: dp.m.make_alp_queryable(\n", " input_domain, input_metric, scale=s, total_limit=50000, value_limit=1000\n", " ),\n", " d_in=sensitivity,\n", " d_out=epsilon,\n", ")\n", "\n", "qbl = m_alp(names)" ] }, { "cell_type": "markdown", "id": "0f2b0c70", "metadata": {}, "source": [ "This mechanism releases a queryable containing a differentially private, \n", "hash-based representation of the counts of all possible names." ] }, { "cell_type": "code", "execution_count": 5, "id": "f2e6796f", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(736.0, 496.0, 52.0, 0.0)" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "qbl(\"Michael\"), qbl(\"James\"), qbl(\"Sharon\"), qbl(\"Lancelot\")" ] }, { "cell_type": "markdown", "id": "d046da93", "metadata": {}, "source": [ "These counts roughly correspond to the input data. \n", "\n", "Notice that these counts work out to multiples of four.\n", "To reduce the size of the projection, \n", "the precision of answers in the compressed representation is controlled via a parameter `alpha`,\n", "which has a default of four.\n", "\n", "Finally, the size of the projected space can be scaled via the `size_multiplier` argument,\n", "which is set to a default of fifty.\n", "`alpha` and `size_multiplier`, together with `total_limit` and `value_limit`,\n", "comprise a heuristic to determine a reasonable domain size." ] } ], "metadata": { "kernelspec": { "display_name": ".venv (3.13.7)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.13.7" } }, "nbformat": 4, "nbformat_minor": 5 }