{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Grouping\n", "\n", "The OpenDP Library allows you to compute statistics over grouped data.\n", "We'll examine three approaches used to release queries that involve grouping:\n", "\n", "- Stable keys\n", "- Explicit keys\n", "- Invariant keys\n", "\n", "The [API Reference](../../api/python/opendp.extras.polars.rst) provides more information about the methods. \n", "We will use the [sample data](https://github.com/opendp/dp-test-datasets/blob/main/data/eurostat/README.ipynb) from the Labour Force Survey in France." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "data": { "text/html": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "application/javascript": "(function(root) {\n function now() {\n return new Date();\n }\n\n const force = true;\n const py_version = '3.6.1'.replace('rc', '-rc.').replace('.dev', '-dev.');\n const reloading = false;\n const Bokeh = root.Bokeh;\n\n // Set a timeout for this load but only if we are not already initializing\n if (typeof (root._bokeh_timeout) === \"undefined\" || (force || !root._bokeh_is_initializing)) {\n root._bokeh_timeout = Date.now() + 5000;\n root._bokeh_failed_load = false;\n }\n\n function run_callbacks() {\n try {\n root._bokeh_onload_callbacks.forEach(function(callback) {\n if (callback != null)\n callback();\n });\n } finally {\n delete root._bokeh_onload_callbacks;\n }\n console.debug(\"Bokeh: all callbacks have finished\");\n }\n\n function load_libs(css_urls, js_urls, js_modules, js_exports, callback) {\n if (css_urls == null) css_urls = [];\n if (js_urls == null) js_urls = [];\n if (js_modules == null) js_modules = [];\n if (js_exports == null) js_exports = {};\n\n root._bokeh_onload_callbacks.push(callback);\n\n if (root._bokeh_is_loading > 0) {\n // Don't load bokeh if it is still initializing\n console.debug(\"Bokeh: BokehJS is being loaded, scheduling callback at\", now());\n return null;\n } else if (js_urls.length === 0 && js_modules.length === 0 && Object.keys(js_exports).length === 0) {\n // There is nothing to load\n run_callbacks();\n return null;\n }\n\n function on_load() {\n root._bokeh_is_loading--;\n if (root._bokeh_is_loading === 0) {\n console.debug(\"Bokeh: all BokehJS libraries/stylesheets loaded\");\n run_callbacks()\n }\n }\n window._bokeh_on_load = on_load\n\n function on_error(e) {\n const src_el = e.srcElement\n console.error(\"failed to load \" + (src_el.href || src_el.src));\n }\n\n const skip = [];\n if (window.requirejs) {\n window.requirejs.config({'packages': {}, 'paths': {}, 'shim': {}});\n root._bokeh_is_loading = css_urls.length + 0;\n } else {\n root._bokeh_is_loading = css_urls.length + js_urls.length + js_modules.length + Object.keys(js_exports).length;\n }\n\n const existing_stylesheets = []\n const links = document.getElementsByTagName('link')\n for (let i = 0; i < links.length; i++) {\n const link = links[i]\n if (link.href != null) {\n existing_stylesheets.push(link.href)\n }\n }\n for (let i = 0; i < css_urls.length; i++) {\n const url = css_urls[i];\n const escaped = encodeURI(url)\n if (existing_stylesheets.indexOf(escaped) !== -1) {\n on_load()\n continue;\n }\n const element = document.createElement(\"link\");\n element.onload = on_load;\n element.onerror = on_error;\n element.rel = \"stylesheet\";\n element.type = \"text/css\";\n element.href = url;\n console.debug(\"Bokeh: injecting link tag for BokehJS stylesheet: \", url);\n document.body.appendChild(element);\n } var existing_scripts = []\n const scripts = document.getElementsByTagName('script')\n for (let i = 0; i < scripts.length; i++) {\n var script = scripts[i]\n if (script.src != null) {\n existing_scripts.push(script.src)\n }\n }\n for (let i = 0; i < js_urls.length; i++) {\n const url = js_urls[i];\n const escaped = encodeURI(url)\n if (skip.indexOf(escaped) !== -1 || existing_scripts.indexOf(escaped) !== -1) {\n if (!window.requirejs) {\n on_load();\n }\n continue;\n }\n const element = document.createElement('script');\n element.onload = on_load;\n element.onerror = on_error;\n element.async = false;\n element.src = url;\n console.debug(\"Bokeh: injecting script tag for BokehJS library: \", url);\n document.head.appendChild(element);\n }\n for (let i = 0; i < js_modules.length; i++) {\n const url = js_modules[i];\n const escaped = encodeURI(url)\n if (skip.indexOf(escaped) !== -1 || existing_scripts.indexOf(escaped) !== -1) {\n if (!window.requirejs) {\n on_load();\n }\n continue;\n }\n var element = document.createElement('script');\n element.onload = on_load;\n element.onerror = on_error;\n element.async = false;\n element.src = url;\n element.type = \"module\";\n console.debug(\"Bokeh: injecting script tag for BokehJS library: \", url);\n document.head.appendChild(element);\n }\n for (const name in js_exports) {\n const url = js_exports[name];\n const escaped = encodeURI(url)\n if (skip.indexOf(escaped) >= 0 || root[name] != null) {\n if (!window.requirejs) {\n on_load();\n }\n continue;\n }\n var element = document.createElement('script');\n element.onerror = on_error;\n element.async = false;\n element.type = \"module\";\n console.debug(\"Bokeh: injecting script tag for BokehJS library: \", url);\n element.textContent = `\n import ${name} from \"${url}\"\n window.${name} = ${name}\n window._bokeh_on_load()\n `\n document.head.appendChild(element);\n }\n if (!js_urls.length && !js_modules.length) {\n on_load()\n }\n };\n\n function inject_raw_css(css) {\n const element = document.createElement(\"style\");\n element.appendChild(document.createTextNode(css));\n document.body.appendChild(element);\n }\n\n const js_urls = [\"https://cdn.holoviz.org/panel/1.5.4/dist/bundled/reactiveesm/es-module-shims@^1.10.0/dist/es-module-shims.min.js\", \"https://cdn.bokeh.org/bokeh/release/bokeh-3.6.1.min.js\", \"https://cdn.bokeh.org/bokeh/release/bokeh-gl-3.6.1.min.js\", \"https://cdn.bokeh.org/bokeh/release/bokeh-widgets-3.6.1.min.js\", \"https://cdn.bokeh.org/bokeh/release/bokeh-tables-3.6.1.min.js\", \"https://cdn.holoviz.org/panel/1.5.4/dist/panel.min.js\"];\n const js_modules = [];\n const js_exports = {};\n const css_urls = [];\n const inline_js = [ function(Bokeh) {\n Bokeh.set_log_level(\"info\");\n },\nfunction(Bokeh) {} // ensure no trailing comma for IE\n ];\n\n function run_inline_js() {\n if ((root.Bokeh !== undefined) || (force === true)) {\n for (let i = 0; i < inline_js.length; i++) {\n try {\n inline_js[i].call(root, root.Bokeh);\n } catch(e) {\n if (!reloading) {\n throw e;\n }\n }\n }\n // Cache old bokeh versions\n if (Bokeh != undefined && !reloading) {\n var NewBokeh = root.Bokeh;\n if (Bokeh.versions === undefined) {\n Bokeh.versions = new Map();\n }\n if (NewBokeh.version !== Bokeh.version) {\n Bokeh.versions.set(NewBokeh.version, NewBokeh)\n }\n root.Bokeh = Bokeh;\n }\n } else if (Date.now() < root._bokeh_timeout) {\n setTimeout(run_inline_js, 100);\n } else if (!root._bokeh_failed_load) {\n console.log(\"Bokeh: BokehJS failed to load within specified timeout.\");\n root._bokeh_failed_load = true;\n }\n root._bokeh_is_initializing = false\n }\n\n function load_or_wait() {\n // Implement a backoff loop that tries to ensure we do not load multiple\n // versions of Bokeh and its dependencies at the same time.\n // In recent versions we use the root._bokeh_is_initializing flag\n // to determine whether there is an ongoing attempt to initialize\n // bokeh, however for backward compatibility we also try to ensure\n // that we do not start loading a newer (Panel>=1.0 and Bokeh>3) version\n // before older versions are fully initialized.\n if (root._bokeh_is_initializing && Date.now() > root._bokeh_timeout) {\n // If the timeout and bokeh was not successfully loaded we reset\n // everything and try loading again\n root._bokeh_timeout = Date.now() + 5000;\n root._bokeh_is_initializing = false;\n root._bokeh_onload_callbacks = undefined;\n root._bokeh_is_loading = 0\n console.log(\"Bokeh: BokehJS was loaded multiple times but one version failed to initialize.\");\n load_or_wait();\n } else if (root._bokeh_is_initializing || (typeof root._bokeh_is_initializing === \"undefined\" && root._bokeh_onload_callbacks !== undefined)) {\n setTimeout(load_or_wait, 100);\n } else {\n root._bokeh_is_initializing = true\n root._bokeh_onload_callbacks = []\n const bokeh_loaded = root.Bokeh != null && (root.Bokeh.version === py_version || (root.Bokeh.versions !== undefined && root.Bokeh.versions.has(py_version)));\n if (!reloading && !bokeh_loaded) {\n if (root.Bokeh) {\n root.Bokeh = undefined;\n }\n console.debug(\"Bokeh: BokehJS not loaded, scheduling load and callback at\", now());\n }\n load_libs(css_urls, js_urls, js_modules, js_exports, function() {\n console.debug(\"Bokeh: BokehJS plotting callback run at\", now());\n run_inline_js();\n });\n }\n }\n // Give older versions of the autoload script a head-start to ensure\n // they initialize before we start loading newer version.\n setTimeout(load_or_wait, 100)\n}(window));", "application/vnd.holoviews_load.v0+json": "" }, "metadata": {}, "output_type": "display_data" }, { "data": { "application/javascript": "\nif ((window.PyViz === undefined) || (window.PyViz instanceof HTMLElement)) {\n window.PyViz = {comms: {}, comm_status:{}, kernels:{}, receivers: {}, plot_index: []}\n}\n\n\n function JupyterCommManager() {\n }\n\n JupyterCommManager.prototype.register_target = function(plot_id, comm_id, msg_handler) {\n if (window.comm_manager || ((window.Jupyter !== undefined) && (Jupyter.notebook.kernel != null))) {\n var comm_manager = window.comm_manager || Jupyter.notebook.kernel.comm_manager;\n comm_manager.register_target(comm_id, function(comm) {\n comm.on_msg(msg_handler);\n });\n } else if ((plot_id in window.PyViz.kernels) && (window.PyViz.kernels[plot_id])) {\n window.PyViz.kernels[plot_id].registerCommTarget(comm_id, function(comm) {\n comm.onMsg = msg_handler;\n });\n } else if (typeof google != 'undefined' && google.colab.kernel != null) {\n google.colab.kernel.comms.registerTarget(comm_id, (comm) => {\n var messages = comm.messages[Symbol.asyncIterator]();\n function processIteratorResult(result) {\n var message = result.value;\n console.log(message)\n var content = {data: message.data, comm_id};\n var buffers = []\n for (var buffer of message.buffers || []) {\n buffers.push(new DataView(buffer))\n }\n var metadata = message.metadata || {};\n var msg = {content, buffers, metadata}\n msg_handler(msg);\n return messages.next().then(processIteratorResult);\n }\n return messages.next().then(processIteratorResult);\n })\n }\n }\n\n JupyterCommManager.prototype.get_client_comm = function(plot_id, comm_id, msg_handler) {\n if (comm_id in window.PyViz.comms) {\n return window.PyViz.comms[comm_id];\n } else if (window.comm_manager || ((window.Jupyter !== undefined) && (Jupyter.notebook.kernel != null))) {\n var comm_manager = window.comm_manager || Jupyter.notebook.kernel.comm_manager;\n var comm = comm_manager.new_comm(comm_id, {}, {}, {}, comm_id);\n if (msg_handler) {\n comm.on_msg(msg_handler);\n }\n } else if ((plot_id in window.PyViz.kernels) && (window.PyViz.kernels[plot_id])) {\n var comm = window.PyViz.kernels[plot_id].connectToComm(comm_id);\n comm.open();\n if (msg_handler) {\n comm.onMsg = msg_handler;\n }\n } else if (typeof google != 'undefined' && google.colab.kernel != null) {\n var comm_promise = google.colab.kernel.comms.open(comm_id)\n comm_promise.then((comm) => {\n window.PyViz.comms[comm_id] = comm;\n if (msg_handler) {\n var messages = comm.messages[Symbol.asyncIterator]();\n function processIteratorResult(result) {\n var message = result.value;\n var content = {data: message.data};\n var metadata = message.metadata || {comm_id};\n var msg = {content, metadata}\n msg_handler(msg);\n return messages.next().then(processIteratorResult);\n }\n return messages.next().then(processIteratorResult);\n }\n }) \n var sendClosure = (data, metadata, buffers, disposeOnDone) => {\n return comm_promise.then((comm) => {\n comm.send(data, metadata, buffers, disposeOnDone);\n });\n };\n var comm = {\n send: sendClosure\n };\n }\n window.PyViz.comms[comm_id] = comm;\n return comm;\n }\n window.PyViz.comm_manager = new JupyterCommManager();\n \n\n\nvar JS_MIME_TYPE = 'application/javascript';\nvar HTML_MIME_TYPE = 'text/html';\nvar EXEC_MIME_TYPE = 'application/vnd.holoviews_exec.v0+json';\nvar CLASS_NAME = 'output';\n\n/**\n * Render data to the DOM node\n */\nfunction render(props, node) {\n var div = document.createElement(\"div\");\n var script = document.createElement(\"script\");\n node.appendChild(div);\n node.appendChild(script);\n}\n\n/**\n * Handle when a new output is added\n */\nfunction handle_add_output(event, handle) {\n var output_area = handle.output_area;\n var output = handle.output;\n if ((output.data == undefined) || (!output.data.hasOwnProperty(EXEC_MIME_TYPE))) {\n return\n }\n var id = output.metadata[EXEC_MIME_TYPE][\"id\"];\n var toinsert = output_area.element.find(\".\" + CLASS_NAME.split(' ')[0]);\n if (id !== undefined) {\n var nchildren = toinsert.length;\n var html_node = toinsert[nchildren-1].children[0];\n html_node.innerHTML = output.data[HTML_MIME_TYPE];\n var scripts = [];\n var nodelist = html_node.querySelectorAll(\"script\");\n for (var i in nodelist) {\n if (nodelist.hasOwnProperty(i)) {\n scripts.push(nodelist[i])\n }\n }\n\n scripts.forEach( function (oldScript) {\n var newScript = document.createElement(\"script\");\n var attrs = [];\n var nodemap = oldScript.attributes;\n for (var j in nodemap) {\n if (nodemap.hasOwnProperty(j)) {\n attrs.push(nodemap[j])\n }\n }\n attrs.forEach(function(attr) { newScript.setAttribute(attr.name, attr.value) });\n newScript.appendChild(document.createTextNode(oldScript.innerHTML));\n oldScript.parentNode.replaceChild(newScript, oldScript);\n });\n if (JS_MIME_TYPE in output.data) {\n toinsert[nchildren-1].children[1].textContent = output.data[JS_MIME_TYPE];\n }\n output_area._hv_plot_id = id;\n if ((window.Bokeh !== undefined) && (id in Bokeh.index)) {\n window.PyViz.plot_index[id] = Bokeh.index[id];\n } else {\n window.PyViz.plot_index[id] = null;\n }\n } else if (output.metadata[EXEC_MIME_TYPE][\"server_id\"] !== undefined) {\n var bk_div = document.createElement(\"div\");\n bk_div.innerHTML = output.data[HTML_MIME_TYPE];\n var script_attrs = bk_div.children[0].attributes;\n for (var i = 0; i < script_attrs.length; i++) {\n toinsert[toinsert.length - 1].childNodes[1].setAttribute(script_attrs[i].name, script_attrs[i].value);\n }\n // store reference to server id on output_area\n output_area._bokeh_server_id = output.metadata[EXEC_MIME_TYPE][\"server_id\"];\n }\n}\n\n/**\n * Handle when an output is cleared or removed\n */\nfunction handle_clear_output(event, handle) {\n var id = handle.cell.output_area._hv_plot_id;\n var server_id = handle.cell.output_area._bokeh_server_id;\n if (((id === undefined) || !(id in PyViz.plot_index)) && (server_id !== undefined)) { return; }\n var comm = window.PyViz.comm_manager.get_client_comm(\"hv-extension-comm\", \"hv-extension-comm\", function () {});\n if (server_id !== null) {\n comm.send({event_type: 'server_delete', 'id': server_id});\n return;\n } else if (comm !== null) {\n comm.send({event_type: 'delete', 'id': id});\n }\n delete PyViz.plot_index[id];\n if ((window.Bokeh !== undefined) & (id in window.Bokeh.index)) {\n var doc = window.Bokeh.index[id].model.document\n doc.clear();\n const i = window.Bokeh.documents.indexOf(doc);\n if (i > -1) {\n window.Bokeh.documents.splice(i, 1);\n }\n }\n}\n\n/**\n * Handle kernel restart event\n */\nfunction handle_kernel_cleanup(event, handle) {\n delete PyViz.comms[\"hv-extension-comm\"];\n window.PyViz.plot_index = {}\n}\n\n/**\n * Handle update_display_data messages\n */\nfunction handle_update_output(event, handle) {\n handle_clear_output(event, {cell: {output_area: handle.output_area}})\n handle_add_output(event, handle)\n}\n\nfunction register_renderer(events, OutputArea) {\n function append_mime(data, metadata, element) {\n // create a DOM node to render to\n var toinsert = this.create_output_subarea(\n metadata,\n CLASS_NAME,\n EXEC_MIME_TYPE\n );\n this.keyboard_manager.register_events(toinsert);\n // Render to node\n var props = {data: data, metadata: metadata[EXEC_MIME_TYPE]};\n render(props, toinsert[0]);\n element.append(toinsert);\n return toinsert\n }\n\n events.on('output_added.OutputArea', handle_add_output);\n events.on('output_updated.OutputArea', handle_update_output);\n events.on('clear_output.CodeCell', handle_clear_output);\n events.on('delete.Cell', handle_clear_output);\n events.on('kernel_ready.Kernel', handle_kernel_cleanup);\n\n OutputArea.prototype.register_mime_type(EXEC_MIME_TYPE, append_mime, {\n safe: true,\n index: 0\n });\n}\n\nif (window.Jupyter !== undefined) {\n try {\n var events = require('base/js/events');\n var OutputArea = require('notebook/js/outputarea').OutputArea;\n if (OutputArea.prototype.mime_types().indexOf(EXEC_MIME_TYPE) == -1) {\n register_renderer(events, OutputArea);\n }\n } catch(err) {\n }\n}\n", "application/vnd.holoviews_load.v0+json": "" }, "metadata": {}, "output_type": "display_data" }, { "data": { "application/vnd.holoviews_exec.v0+json": "", "text/html": [ "
\n", "
\n", "
\n", "" ] }, "metadata": { "application/vnd.holoviews_exec.v0+json": { "id": "8cca09aa-9bff-4ce3-92e6-e2e44026d099" } }, "output_type": "display_data" } ], "source": [ "import polars as pl\n", "import opendp.prelude as dp\n", "import hvplot.polars\n", "\n", "dp.enable_features(\"contrib\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Stable Keys\n", "\n", "Partition keys can be extremely sensitive. \n", "For example, sharing a noisy count query grouped by social security number or credit card number \n", "would catastrophically violate the privacy of individuals in the data.\n", "Similarly, re-identification of individuals can be accomplished \n", "with very small combinations of grouping columns that may seem benign\n", "(for reference, see how [Latanya Sweeney reidentified the Governor of Massachusetts in healthcare data](https://arstechnica.com/tech-policy/2009/09/your-secrets-live-online-in-databases-of-ruin/)).\n", "\n", "For this reason, by default, the OpenDP Library discards any data partitions that may be unique to an individual\n", "by filtering out partitions with too few records.\n", "This is why the set of released partitions are considered \"stable\": \n", "the algorithm only releases partitions that remain stable (won't disappear) when any one individual is removed.\n", "OpenDP calibrates the filtering threshold such that the probability of releasing a partition with one individual\n", "is no greater than the privacy parameter delta (δ).\n", "\n", "In the following example, only counts for combinations of year and sex that are common among many workers are released. " ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "context = dp.Context.compositor(\n", " # Many columns contain mixtures of strings and numbers and cannot be parsed as floats,\n", " # so we'll set `ignore_errors` to true to avoid conversion errors.\n", " data=pl.scan_csv(dp.examples.get_france_lfs_path(), ignore_errors=True),\n", " privacy_unit=dp.unit_of(contributions=36),\n", " privacy_loss=dp.loss_of(epsilon=1.0 / 4, delta=1e-7),\n", " # allow for one query\n", " split_evenly_over=1,\n", ")\n", "\n", "query_age_ilostat = (\n", " context.query()\n", " .group_by(\"AGE\", \"ILOSTAT\")\n", " .agg(dp.len())\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Before releasing the query, let's take a look at properties we can expect of the output:" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "shape: (1, 6)
columnaggregatedistributionscaleaccuracythreshold
strstrstrf64f64u32
"len""Frame Length""Integer Laplace"144.0431.8845792773
" ], "text/plain": [ "shape: (1, 6)\n", "┌────────┬──────────────┬─────────────────┬───────┬────────────┬───────────┐\n", "│ column ┆ aggregate ┆ distribution ┆ scale ┆ accuracy ┆ threshold │\n", "│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │\n", "│ str ┆ str ┆ str ┆ f64 ┆ f64 ┆ u32 │\n", "╞════════╪══════════════╪═════════════════╪═══════╪════════════╪═══════════╡\n", "│ len ┆ Frame Length ┆ Integer Laplace ┆ 144.0 ┆ 431.884579 ┆ 2773 │\n", "└────────┴──────────────┴─────────────────┴───────┴────────────┴───────────┘" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "query_age_ilostat.summarize(alpha=.05)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Judging from the query description, \n", "noisy partition lengths returned by this query will differ by no more than the given accuracy with 1 - alpha = 95% confidence.\n", "Any partition with a noisy partition length below the given threshold will be filtered from the release.\n", "\n", "This level of utility seems suitable, so we'll go ahead and release the query:" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": {}, "metadata": {}, "output_type": "display_data" }, { "data": { "application/vnd.holoviews_exec.v0+json": "", "text/html": [ "
\n", "
\n", "
\n", "" ], "text/plain": [ ":Overlay\n", " .NdOverlay.I :NdOverlay [ILOSTAT]\n", " :Curve [AGE] (len)\n", " .NdOverlay.II :NdOverlay [ILOSTAT]\n", " :Scatter [AGE] (len)" ] }, "execution_count": 4, "metadata": { "application/vnd.holoviews_exec.v0+json": { "id": "5af1ad3b-ceb9-4382-91ea-00a6d0c36dc4" } }, "output_type": "execute_result" } ], "source": [ "df = query_age_ilostat.release().collect()\n", "\n", "line = df.sort(\"AGE\").hvplot.line(x=\"AGE\", y=\"len\", by=\"ILOSTAT\")\n", "scatter = df.sort(\"AGE\").hvplot.scatter(x=\"AGE\", y=\"len\", by=\"ILOSTAT\")\n", "line * scatter" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Referring back to the ILOSTAT legend:\n", "\n", "| ILOSTAT | Legend | Comments | \n", "| ------- | ------ | ----------- | \n", "| 1 | did any work for pay or profit | most common among people between 30 and 50 |\n", "| 2 | employed but not working | slightly more commonly observed among young adults |\n", "| 3 | was not working because of lay-off | clearly influenced by retirement |\n", "| 9 | not applicable, less than 15 years old | only present for the youngest age group |\n", "\n", "Where points are missing in the graph, \n", "there are not enough individuals for that combination of age and employment status to pass threshold." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Explicit Keys\n", "\n", "If you know partitions ahead-of-time, you can avoid spending the privacy parameter $\\delta$ to release them.\n", "\n", "> It is recommended to only ever create one Context that spans all queries you may make on your data.\n", "> We create another context here to demonstrate how grouping queries can be released without the use of the privacy parameter delta.\n" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "context = dp.Context.compositor(\n", " # Many columns contain mixtures of strings and numbers and cannot be parsed as floats,\n", " # so we'll set `ignore_errors` to true to avoid conversion errors.\n", " data=pl.scan_csv(dp.examples.get_france_lfs_path(), ignore_errors=True),\n", " privacy_unit=dp.unit_of(contributions=36),\n", " privacy_loss=dp.loss_of(epsilon=1.0 / 4),\n", " # allow for one query\n", " split_evenly_over=1,\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For example, you can reuse the stable partition keys released in the previous query:" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "shape: (1, 4)
columnaggregatedistributionscale
strstrstrf64
"len""Frame Length""Integer Laplace"144.0
" ], "text/plain": [ "shape: (1, 4)\n", "┌────────┬──────────────┬─────────────────┬───────┐\n", "│ column ┆ aggregate ┆ distribution ┆ scale │\n", "│ --- ┆ --- ┆ --- ┆ --- │\n", "│ str ┆ str ┆ str ┆ f64 │\n", "╞════════╪══════════════╪═════════════════╪═══════╡\n", "│ len ┆ Frame Length ┆ Integer Laplace ┆ 144.0 │\n", "└────────┴──────────────┴─────────────────┴───────┘" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "query_age_ilostat = (\n", " context.query()\n", " .group_by(\"AGE\", \"ILOSTAT\")\n", " .agg(dp.len())\n", " .with_keys(df[\"AGE\", \"ILOSTAT\"])\n", ")\n", "\n", "query_age_ilostat.summarize()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`.with_keys` adds a left join where the left dataset is your explicit list of partition keys \n", "and the right dataset contains the results of the grouped aggregation.\n", "You can also write the join yourself! When using the context API, this is easier to express through a right join:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "shape: (1, 4)
columnaggregatedistributionscale
strstrstrf64
"len""Frame Length""Integer Laplace"144.0
" ], "text/plain": [ "shape: (1, 4)\n", "┌────────┬──────────────┬─────────────────┬───────┐\n", "│ column ┆ aggregate ┆ distribution ┆ scale │\n", "│ --- ┆ --- ┆ --- ┆ --- │\n", "│ str ┆ str ┆ str ┆ f64 │\n", "╞════════╪══════════════╪═════════════════╪═══════╡\n", "│ len ┆ Frame Length ┆ Integer Laplace ┆ 144.0 │\n", "└────────┴──────────────┴─────────────────┴───────┘" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "query_age_ilostat = (\n", " context.query()\n", " .group_by(\"AGE\", \"ILOSTAT\")\n", " .agg(dp.len())\n", " .join(df[\"AGE\", \"ILOSTAT\"].lazy(), how=\"right\", on=[\"AGE\", \"ILOSTAT\"])\n", ")\n", "\n", "query_age_ilostat.summarize()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The OpenDP Library rewrites these kinds of queries to impute any missing statistics corresponding to explicit partition keys that don't exist in the real data.\n", "The imputed values are as if you released the differentially private statistics on an empty data partition." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Invariant Keys\n", "\n", "The OpenDP Library also allows you to explicitly describe partition keys as \n", "\"public information\" that can be released in the clear.\n", "\n", "Be aware that any aspect of your data labeled \"public information\" or an \"invariant\" is not subject to privacy protections.\n", "For this reason, you should be very reluctant to use invariants.\n", "Remember that even the absence of a partition can constitute a privacy violation.\n", "Nevertheless, this approach has seen use in high-profile data releases, including by the US Census Bureau.\n", "\n", "For example, in the Eurostat data, you may consider the quarters in which data has been collected to be public information.\n", "You can mark partition keys for any combination of year and quarter as invariant by setting `public_info=\"keys\"` as follows:" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "context = dp.Context.compositor(\n", " data=pl.scan_csv(dp.examples.get_france_lfs_path(), ignore_errors=True),\n", " privacy_unit=dp.unit_of(contributions=36),\n", " privacy_loss=dp.loss_of(epsilon=1.0 / 4),\n", " split_evenly_over=1,\n", " margins={\n", " # partition keys when grouped by \"YEAR\" and \"QUARTER\" are invariant\n", " (\"YEAR\", \"QUARTER\"): dp.polars.Margin(\n", " public_info=\"keys\",\n", " )\n", " },\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "> It is recommended to only ever create one Context that spans all queries you may make on your data.\n", "> We create another context here to demonstrate how margins will influence the analysis.\n", "\n", "Due to the existence of this margin descriptor, the library will now release these quarterly keys in the clear." ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "shape: (1, 5)
columnaggregatedistributionscaleaccuracy
strstrstrf64f64
"len""Frame Length""Integer Laplace"144.0431.884579
" ], "text/plain": [ "shape: (1, 5)\n", "┌────────┬──────────────┬─────────────────┬───────┬────────────┐\n", "│ column ┆ aggregate ┆ distribution ┆ scale ┆ accuracy │\n", "│ --- ┆ --- ┆ --- ┆ --- ┆ --- │\n", "│ str ┆ str ┆ str ┆ f64 ┆ f64 │\n", "╞════════╪══════════════╪═════════════════╪═══════╪════════════╡\n", "│ len ┆ Frame Length ┆ Integer Laplace ┆ 144.0 ┆ 431.884579 │\n", "└────────┴──────────────┴─────────────────┴───────┴────────────┘" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "query_quarterly_counts = (\n", " context.query()\n", " .group_by(\"YEAR\", \"QUARTER\")\n", " .agg(dp.len())\n", ")\n", "\n", "summary = query_quarterly_counts.summarize(alpha=.05)\n", "summary" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This query description no longer has a \"threshold\" field: all noisy statistics computed on each data partition will be released." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This visualization includes error bars showing 95% confidence intervals for the true value by pulling the accuracy estimate from the query description above." ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": {}, "metadata": {}, "output_type": "display_data" }, { "data": { "application/vnd.holoviews_exec.v0+json": "", "text/html": [ "
\n", "
\n", "
\n", "" ], "text/plain": [ ":Overlay\n", " .Curve.I :Curve [date] (len)\n", " .ErrorBars.I :ErrorBars [date] (len,accuracy)" ] }, "execution_count": 10, "metadata": { "application/vnd.holoviews_exec.v0+json": { "id": "69878488-2416-492d-98c1-067ceb35bc99" } }, "output_type": "execute_result" } ], "source": [ "df = query_quarterly_counts.release().collect()\n", "\n", "# build a date column\n", "df = df.with_columns(pl.date(pl.col(\"YEAR\"), pl.col(\"QUARTER\") * 4, 1))\n", "\n", "line = df.hvplot.line(x=\"date\", y=\"len\")\n", "errorbars = df.with_columns(accuracy=summary[\"accuracy\"][0]) \\\n", " .hvplot.errorbars(x=\"date\", y=\"len\", yerr1=\"accuracy\")\n", "line * errorbars" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Even though the noise scale and accuracy estimate is the same as in the previous protected group keys query,\n", "the relative error is now much larger because the group sizes are much smaller.\n", "In spite of this, the release clearly still shows that the number of respondents increased significantly from 2008 to 2010." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Invariant Partition Lengths\n", "\n", "It is also possible to declare partition sizes as public information (a data invariant).\n", "Setting this value implies that partition keys are public, \n", "and thus implies an even greater risk of damaging the integrity of the privacy guarantee \n", "than if you were to just specify partition keys as public.\n", "\n", "One way this could be used is as part of a release for the mean number of hours worked `HWUSUAL` by sex.\n", "Referring back to the `HWUSUAL` legend, a value of `99` means not applicable,\n", "and this encoding will significantly bias the outcome of the query.\n", "\n", "Unfortunately, filtering the data within the query results in the margin info being invalidated.\n", "One way to work around this limitation is to preprocess your data before passing it into the context:" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "lf_preprocessed = pl.scan_csv(dp.examples.get_france_lfs_path(), ignore_errors=True) \\\n", " .filter(pl.col(\"HWUSUAL\") < 99)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can now set up your analysis such that the margin applies to the preprocessed data:" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "context = dp.Context.compositor(\n", " data=lf_preprocessed,\n", " privacy_unit=dp.unit_of(contributions=36),\n", " privacy_loss=dp.loss_of(epsilon=1.0, delta=1e-7),\n", " split_evenly_over=1,\n", " margins={\n", " # total number of responses when grouped by \"SEX\" is public information\n", " (\"SEX\",): dp.polars.Margin(\n", " public_info=\"lengths\",\n", " max_partition_length=60_000_000, # population of France\n", " max_num_partitions=1,\n", " )\n", " },\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can now prepare a query that computes mean working hours by gender, where each response is clipped between 0 and 98:" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "shape: (2, 5)
columnaggregatedistributionscaleaccuracy
strstrstrf64f64
"HWUSUAL""Sum""Integer Laplace"7056.021138.386904
"HWUSUAL""Length""Integer Laplace"0.0NaN
" ], "text/plain": [ "shape: (2, 5)\n", "┌─────────┬───────────┬─────────────────┬────────┬──────────────┐\n", "│ column ┆ aggregate ┆ distribution ┆ scale ┆ accuracy │\n", "│ --- ┆ --- ┆ --- ┆ --- ┆ --- │\n", "│ str ┆ str ┆ str ┆ f64 ┆ f64 │\n", "╞═════════╪═══════════╪═════════════════╪════════╪══════════════╡\n", "│ HWUSUAL ┆ Sum ┆ Integer Laplace ┆ 7056.0 ┆ 21138.386904 │\n", "│ HWUSUAL ┆ Length ┆ Integer Laplace ┆ 0.0 ┆ NaN │\n", "└─────────┴───────────┴─────────────────┴────────┴──────────────┘" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "\n", "query_work_hours = (\n", " context.query()\n", " .group_by(\"SEX\")\n", " .agg(pl.col.HWUSUAL.cast(int).fill_null(0).dp.mean((0, 98)))\n", ")\n", "\n", "query_work_hours.summarize(alpha=.05)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This time the query description breaks down into two separate statistics for the resulting `HWUSUAL` column.\n", "The mean is computed by dividing separate sum and length estimates.\n", "Since the partition length is marked as public information, the length is released in the clear, without any noise.\n", "\n", "Reading the table, the smallest suitable noise scale parameter for the sum is 7056. \n", "This may seem large, but remember this noisy sum will be divided by the length, reducing the variance of the final estimate.\n", "You can divide the sum accuracy estimate by the public partition sizes to get disaggregated accuracy estimates for each partition. \n", "\n", "Be mindful that accuracy estimates for sums and means do not take into account bias introduced from clipping.\n", "However, for this specific dataset, \n", "the codebook limits the range of work hours in `HWUSUAL` to between 0 and 98,\n", "so the clipping introduced for differential privacy will not further increase bias." ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "shape: (2, 3)
SEXSEX_STRHWUSUAL
i64strf64
2"female"34.252673
1"male"41.05379
" ], "text/plain": [ "shape: (2, 3)\n", "┌─────┬─────────┬───────────┐\n", "│ SEX ┆ SEX_STR ┆ HWUSUAL │\n", "│ --- ┆ --- ┆ --- │\n", "│ i64 ┆ str ┆ f64 │\n", "╞═════╪═════════╪═══════════╡\n", "│ 2 ┆ female ┆ 34.252673 │\n", "│ 1 ┆ male ┆ 41.05379 │\n", "└─────┴─────────┴───────────┘" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = query_work_hours.release().collect()\n", "\n", "# released dataframes from the OpenDP Library are shuffled to conceal the ordering of rows in the original dataset\n", "# therefore, to ensure proper alignment, we use join instead of hstack to add labels\n", "pl.DataFrame({\"SEX\": [1, 2], \"SEX_STR\": [\"male\", \"female\"]}).join(df, on=\"SEX\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Throughout this analysis, observe how the noise scale always remained fixed, regardless of group size. \n", "Also observe how the more individuals there are in the data, the greater the magnitude of the statistic. \n", "Therefore, as there are more individuals in a bin, the relative amount of noise decreases.\n", "\n", "Grouping into too many partitions, however, can result in partitions having too few records and a loss of signal/poor utility.\n", "\n", "While grouping is a very useful strategy to conduct meaningful data analysis, there must be a balance, \n", "because excessive grouping can lead to most of your partitions being filtered, \n", "and/or results that are too noisy and misleading." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "While not used in this section, remember that you can request multiple different statistics in one query. \n", "Batching multiple statistics together can not only result in computational speedups, \n", "but also more balanced allocation of privacy budget across queries, \n", "and can help you avoid releasing the same protected partition keys multiple times \n", "(further dividing your privacy budget).\n", "\n", "In a real data setting where you want to mediate all access to the data through one Context,\n", "you can provide domain descriptors for many margins, \n", "allowing you to relax protections over specific grouping columns, \n", "while enforcing full protections for all other grouping columns." ] } ], "metadata": { "kernelspec": { "display_name": ".venv", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.13.0" } }, "nbformat": 4, "nbformat_minor": 2 }