{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Grouping\n", "\n", "The OpenDP Library allows you to compute statistics over grouped data.\n", "We'll examine three settings with progressively more public information:\n", "\n", "- Protected group keys: `public_info=None`\n", "- Public group keys: `public_info=\"keys\"`\n", "- Public group lengths: `public_info=\"lengths\"`\n", "\n", "The [API Reference](../../api/python/opendp.extras.polars.rst) provides more information about the methods. \n", "We will use the [sample data](https://github.com/opendp/dp-test-datasets/blob/main/data/eurostat/README.ipynb) from the Labor Force Survey in France." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "data": { "application/javascript": "(function(root) {\n function now() {\n return new Date();\n }\n\n var force = true;\n var py_version = '3.4.3'.replace('rc', '-rc.').replace('.dev', '-dev.');\n var reloading = false;\n var Bokeh = root.Bokeh;\n\n if (typeof (root._bokeh_timeout) === \"undefined\" || force) {\n root._bokeh_timeout = Date.now() + 5000;\n root._bokeh_failed_load = false;\n }\n\n function run_callbacks() {\n try {\n root._bokeh_onload_callbacks.forEach(function(callback) {\n if (callback != null)\n callback();\n });\n } finally {\n delete root._bokeh_onload_callbacks;\n }\n console.debug(\"Bokeh: all callbacks have finished\");\n }\n\n function load_libs(css_urls, js_urls, js_modules, js_exports, callback) {\n if (css_urls == null) css_urls = [];\n if (js_urls == null) js_urls = [];\n if (js_modules == null) js_modules = [];\n if (js_exports == null) js_exports = {};\n\n root._bokeh_onload_callbacks.push(callback);\n\n if (root._bokeh_is_loading > 0) {\n console.debug(\"Bokeh: BokehJS is being loaded, scheduling callback at\", now());\n return null;\n }\n if (js_urls.length === 0 && js_modules.length === 0 && Object.keys(js_exports).length === 0) {\n run_callbacks();\n return null;\n }\n if (!reloading) {\n console.debug(\"Bokeh: BokehJS not loaded, scheduling load and callback at\", now());\n }\n\n function on_load() {\n root._bokeh_is_loading--;\n if (root._bokeh_is_loading === 0) {\n console.debug(\"Bokeh: all BokehJS libraries/stylesheets loaded\");\n run_callbacks()\n }\n }\n window._bokeh_on_load = on_load\n\n function on_error() {\n console.error(\"failed to load \" + url);\n }\n\n var skip = [];\n if (window.requirejs) {\n window.requirejs.config({'packages': {}, 'paths': {}, 'shim': {}});\n root._bokeh_is_loading = css_urls.length + 0;\n } else {\n root._bokeh_is_loading = css_urls.length + js_urls.length + js_modules.length + Object.keys(js_exports).length;\n }\n\n var existing_stylesheets = []\n var links = document.getElementsByTagName('link')\n for (var i = 0; i < links.length; i++) {\n var link = links[i]\n if (link.href != null) {\n\texisting_stylesheets.push(link.href)\n }\n }\n for (var i = 0; i < css_urls.length; i++) {\n var url = css_urls[i];\n if (existing_stylesheets.indexOf(url) !== -1) {\n\ton_load()\n\tcontinue;\n }\n const element = document.createElement(\"link\");\n element.onload = on_load;\n element.onerror = on_error;\n element.rel = \"stylesheet\";\n element.type = \"text/css\";\n element.href = url;\n console.debug(\"Bokeh: injecting link tag for BokehJS stylesheet: \", url);\n document.body.appendChild(element);\n } var existing_scripts = []\n var scripts = document.getElementsByTagName('script')\n for (var i = 0; i < scripts.length; i++) {\n var script = scripts[i]\n if (script.src != null) {\n\texisting_scripts.push(script.src)\n }\n }\n for (var i = 0; i < js_urls.length; i++) {\n var url = js_urls[i];\n if (skip.indexOf(url) !== -1 || existing_scripts.indexOf(url) !== -1) {\n\tif (!window.requirejs) {\n\t on_load();\n\t}\n\tcontinue;\n }\n var element = document.createElement('script');\n element.onload = on_load;\n element.onerror = on_error;\n element.async = false;\n element.src = url;\n console.debug(\"Bokeh: injecting script tag for BokehJS library: \", url);\n document.head.appendChild(element);\n }\n for (var i = 0; i < js_modules.length; i++) {\n var url = js_modules[i];\n if (skip.indexOf(url) !== -1 || existing_scripts.indexOf(url) !== -1) {\n\tif (!window.requirejs) {\n\t on_load();\n\t}\n\tcontinue;\n }\n var element = document.createElement('script');\n element.onload = on_load;\n element.onerror = on_error;\n element.async = false;\n element.src = url;\n element.type = \"module\";\n console.debug(\"Bokeh: injecting script tag for BokehJS library: \", url);\n document.head.appendChild(element);\n }\n for (const name in js_exports) {\n var url = js_exports[name];\n if (skip.indexOf(url) >= 0 || root[name] != null) {\n\tif (!window.requirejs) {\n\t on_load();\n\t}\n\tcontinue;\n }\n var element = document.createElement('script');\n element.onerror = on_error;\n element.async = false;\n element.type = \"module\";\n console.debug(\"Bokeh: injecting script tag for BokehJS library: \", url);\n element.textContent = `\n import ${name} from \"${url}\"\n window.${name} = ${name}\n window._bokeh_on_load()\n `\n document.head.appendChild(element);\n }\n if (!js_urls.length && !js_modules.length) {\n on_load()\n }\n };\n\n function inject_raw_css(css) {\n const element = document.createElement(\"style\");\n element.appendChild(document.createTextNode(css));\n document.body.appendChild(element);\n }\n\n var js_urls = [\"https://cdn.bokeh.org/bokeh/release/bokeh-3.4.3.min.js\", \"https://cdn.bokeh.org/bokeh/release/bokeh-gl-3.4.3.min.js\", \"https://cdn.bokeh.org/bokeh/release/bokeh-widgets-3.4.3.min.js\", \"https://cdn.bokeh.org/bokeh/release/bokeh-tables-3.4.3.min.js\", \"https://cdn.holoviz.org/panel/1.4.5/dist/panel.min.js\"];\n var js_modules = [];\n var js_exports = {};\n var css_urls = [];\n var inline_js = [ function(Bokeh) {\n Bokeh.set_log_level(\"info\");\n },\nfunction(Bokeh) {} // ensure no trailing comma for IE\n ];\n\n function run_inline_js() {\n if ((root.Bokeh !== undefined) || (force === true)) {\n for (var i = 0; i < inline_js.length; i++) {\n\ttry {\n inline_js[i].call(root, root.Bokeh);\n\t} catch(e) {\n\t if (!reloading) {\n\t throw e;\n\t }\n\t}\n }\n // Cache old bokeh versions\n if (Bokeh != undefined && !reloading) {\n\tvar NewBokeh = root.Bokeh;\n\tif (Bokeh.versions === undefined) {\n\t Bokeh.versions = new Map();\n\t}\n\tif (NewBokeh.version !== Bokeh.version) {\n\t Bokeh.versions.set(NewBokeh.version, NewBokeh)\n\t}\n\troot.Bokeh = Bokeh;\n }} else if (Date.now() < root._bokeh_timeout) {\n setTimeout(run_inline_js, 100);\n } else if (!root._bokeh_failed_load) {\n console.log(\"Bokeh: BokehJS failed to load within specified timeout.\");\n root._bokeh_failed_load = true;\n }\n root._bokeh_is_initializing = false\n }\n\n function load_or_wait() {\n // Implement a backoff loop that tries to ensure we do not load multiple\n // versions of Bokeh and its dependencies at the same time.\n // In recent versions we use the root._bokeh_is_initializing flag\n // to determine whether there is an ongoing attempt to initialize\n // bokeh, however for backward compatibility we also try to ensure\n // that we do not start loading a newer (Panel>=1.0 and Bokeh>3) version\n // before older versions are fully initialized.\n if (root._bokeh_is_initializing && Date.now() > root._bokeh_timeout) {\n root._bokeh_is_initializing = false;\n root._bokeh_onload_callbacks = undefined;\n console.log(\"Bokeh: BokehJS was loaded multiple times but one version failed to initialize.\");\n load_or_wait();\n } else if (root._bokeh_is_initializing || (typeof root._bokeh_is_initializing === \"undefined\" && root._bokeh_onload_callbacks !== undefined)) {\n setTimeout(load_or_wait, 100);\n } else {\n root._bokeh_is_initializing = true\n root._bokeh_onload_callbacks = []\n var bokeh_loaded = Bokeh != null && (Bokeh.version === py_version || (Bokeh.versions !== undefined && Bokeh.versions.has(py_version)));\n if (!reloading && !bokeh_loaded) {\n\troot.Bokeh = undefined;\n }\n load_libs(css_urls, js_urls, js_modules, js_exports, function() {\n\tconsole.debug(\"Bokeh: BokehJS plotting callback run at\", now());\n\trun_inline_js();\n });\n }\n }\n // Give older versions of the autoload script a head-start to ensure\n // they initialize before we start loading newer version.\n setTimeout(load_or_wait, 100)\n}(window));", "application/vnd.holoviews_load.v0+json": "" }, "metadata": {}, "output_type": "display_data" }, { "data": { "application/javascript": "\nif ((window.PyViz === undefined) || (window.PyViz instanceof HTMLElement)) {\n window.PyViz = {comms: {}, comm_status:{}, kernels:{}, receivers: {}, plot_index: []}\n}\n\n\n function JupyterCommManager() {\n }\n\n JupyterCommManager.prototype.register_target = function(plot_id, comm_id, msg_handler) {\n if (window.comm_manager || ((window.Jupyter !== undefined) && (Jupyter.notebook.kernel != null))) {\n var comm_manager = window.comm_manager || Jupyter.notebook.kernel.comm_manager;\n comm_manager.register_target(comm_id, function(comm) {\n comm.on_msg(msg_handler);\n });\n } else if ((plot_id in window.PyViz.kernels) && (window.PyViz.kernels[plot_id])) {\n window.PyViz.kernels[plot_id].registerCommTarget(comm_id, function(comm) {\n comm.onMsg = msg_handler;\n });\n } else if (typeof google != 'undefined' && google.colab.kernel != null) {\n google.colab.kernel.comms.registerTarget(comm_id, (comm) => {\n var messages = comm.messages[Symbol.asyncIterator]();\n function processIteratorResult(result) {\n var message = result.value;\n console.log(message)\n var content = {data: message.data, comm_id};\n var buffers = []\n for (var buffer of message.buffers || []) {\n buffers.push(new DataView(buffer))\n }\n var metadata = message.metadata || {};\n var msg = {content, buffers, metadata}\n msg_handler(msg);\n return messages.next().then(processIteratorResult);\n }\n return messages.next().then(processIteratorResult);\n })\n }\n }\n\n JupyterCommManager.prototype.get_client_comm = function(plot_id, comm_id, msg_handler) {\n if (comm_id in window.PyViz.comms) {\n return window.PyViz.comms[comm_id];\n } else if (window.comm_manager || ((window.Jupyter !== undefined) && (Jupyter.notebook.kernel != null))) {\n var comm_manager = window.comm_manager || Jupyter.notebook.kernel.comm_manager;\n var comm = comm_manager.new_comm(comm_id, {}, {}, {}, comm_id);\n if (msg_handler) {\n comm.on_msg(msg_handler);\n }\n } else if ((plot_id in window.PyViz.kernels) && (window.PyViz.kernels[plot_id])) {\n var comm = window.PyViz.kernels[plot_id].connectToComm(comm_id);\n comm.open();\n if (msg_handler) {\n comm.onMsg = msg_handler;\n }\n } else if (typeof google != 'undefined' && google.colab.kernel != null) {\n var comm_promise = google.colab.kernel.comms.open(comm_id)\n comm_promise.then((comm) => {\n window.PyViz.comms[comm_id] = comm;\n if (msg_handler) {\n var messages = comm.messages[Symbol.asyncIterator]();\n function processIteratorResult(result) {\n var message = result.value;\n var content = {data: message.data};\n var metadata = message.metadata || {comm_id};\n var msg = {content, metadata}\n msg_handler(msg);\n return messages.next().then(processIteratorResult);\n }\n return messages.next().then(processIteratorResult);\n }\n }) \n var sendClosure = (data, metadata, buffers, disposeOnDone) => {\n return comm_promise.then((comm) => {\n comm.send(data, metadata, buffers, disposeOnDone);\n });\n };\n var comm = {\n send: sendClosure\n };\n }\n window.PyViz.comms[comm_id] = comm;\n return comm;\n }\n window.PyViz.comm_manager = new JupyterCommManager();\n \n\n\nvar JS_MIME_TYPE = 'application/javascript';\nvar HTML_MIME_TYPE = 'text/html';\nvar EXEC_MIME_TYPE = 'application/vnd.holoviews_exec.v0+json';\nvar CLASS_NAME = 'output';\n\n/**\n * Render data to the DOM node\n */\nfunction render(props, node) {\n var div = document.createElement(\"div\");\n var script = document.createElement(\"script\");\n node.appendChild(div);\n node.appendChild(script);\n}\n\n/**\n * Handle when a new output is added\n */\nfunction handle_add_output(event, handle) {\n var output_area = handle.output_area;\n var output = handle.output;\n if ((output.data == undefined) || (!output.data.hasOwnProperty(EXEC_MIME_TYPE))) {\n return\n }\n var id = output.metadata[EXEC_MIME_TYPE][\"id\"];\n var toinsert = output_area.element.find(\".\" + CLASS_NAME.split(' ')[0]);\n if (id !== undefined) {\n var nchildren = toinsert.length;\n var html_node = toinsert[nchildren-1].children[0];\n html_node.innerHTML = output.data[HTML_MIME_TYPE];\n var scripts = [];\n var nodelist = html_node.querySelectorAll(\"script\");\n for (var i in nodelist) {\n if (nodelist.hasOwnProperty(i)) {\n scripts.push(nodelist[i])\n }\n }\n\n scripts.forEach( function (oldScript) {\n var newScript = document.createElement(\"script\");\n var attrs = [];\n var nodemap = oldScript.attributes;\n for (var j in nodemap) {\n if (nodemap.hasOwnProperty(j)) {\n attrs.push(nodemap[j])\n }\n }\n attrs.forEach(function(attr) { newScript.setAttribute(attr.name, attr.value) });\n newScript.appendChild(document.createTextNode(oldScript.innerHTML));\n oldScript.parentNode.replaceChild(newScript, oldScript);\n });\n if (JS_MIME_TYPE in output.data) {\n toinsert[nchildren-1].children[1].textContent = output.data[JS_MIME_TYPE];\n }\n output_area._hv_plot_id = id;\n if ((window.Bokeh !== undefined) && (id in Bokeh.index)) {\n window.PyViz.plot_index[id] = Bokeh.index[id];\n } else {\n window.PyViz.plot_index[id] = null;\n }\n } else if (output.metadata[EXEC_MIME_TYPE][\"server_id\"] !== undefined) {\n var bk_div = document.createElement(\"div\");\n bk_div.innerHTML = output.data[HTML_MIME_TYPE];\n var script_attrs = bk_div.children[0].attributes;\n for (var i = 0; i < script_attrs.length; i++) {\n toinsert[toinsert.length - 1].childNodes[1].setAttribute(script_attrs[i].name, script_attrs[i].value);\n }\n // store reference to server id on output_area\n output_area._bokeh_server_id = output.metadata[EXEC_MIME_TYPE][\"server_id\"];\n }\n}\n\n/**\n * Handle when an output is cleared or removed\n */\nfunction handle_clear_output(event, handle) {\n var id = handle.cell.output_area._hv_plot_id;\n var server_id = handle.cell.output_area._bokeh_server_id;\n if (((id === undefined) || !(id in PyViz.plot_index)) && (server_id !== undefined)) { return; }\n var comm = window.PyViz.comm_manager.get_client_comm(\"hv-extension-comm\", \"hv-extension-comm\", function () {});\n if (server_id !== null) {\n comm.send({event_type: 'server_delete', 'id': server_id});\n return;\n } else if (comm !== null) {\n comm.send({event_type: 'delete', 'id': id});\n }\n delete PyViz.plot_index[id];\n if ((window.Bokeh !== undefined) & (id in window.Bokeh.index)) {\n var doc = window.Bokeh.index[id].model.document\n doc.clear();\n const i = window.Bokeh.documents.indexOf(doc);\n if (i > -1) {\n window.Bokeh.documents.splice(i, 1);\n }\n }\n}\n\n/**\n * Handle kernel restart event\n */\nfunction handle_kernel_cleanup(event, handle) {\n delete PyViz.comms[\"hv-extension-comm\"];\n window.PyViz.plot_index = {}\n}\n\n/**\n * Handle update_display_data messages\n */\nfunction handle_update_output(event, handle) {\n handle_clear_output(event, {cell: {output_area: handle.output_area}})\n handle_add_output(event, handle)\n}\n\nfunction register_renderer(events, OutputArea) {\n function append_mime(data, metadata, element) {\n // create a DOM node to render to\n var toinsert = this.create_output_subarea(\n metadata,\n CLASS_NAME,\n EXEC_MIME_TYPE\n );\n this.keyboard_manager.register_events(toinsert);\n // Render to node\n var props = {data: data, metadata: metadata[EXEC_MIME_TYPE]};\n render(props, toinsert[0]);\n element.append(toinsert);\n return toinsert\n }\n\n events.on('output_added.OutputArea', handle_add_output);\n events.on('output_updated.OutputArea', handle_update_output);\n events.on('clear_output.CodeCell', handle_clear_output);\n events.on('delete.Cell', handle_clear_output);\n events.on('kernel_ready.Kernel', handle_kernel_cleanup);\n\n OutputArea.prototype.register_mime_type(EXEC_MIME_TYPE, append_mime, {\n safe: true,\n index: 0\n });\n}\n\nif (window.Jupyter !== undefined) {\n try {\n var events = require('base/js/events');\n var OutputArea = require('notebook/js/outputarea').OutputArea;\n if (OutputArea.prototype.mime_types().indexOf(EXEC_MIME_TYPE) == -1) {\n register_renderer(events, OutputArea);\n }\n } catch(err) {\n }\n}\n", "application/vnd.holoviews_load.v0+json": "" }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "application/vnd.holoviews_exec.v0+json": "", "text/html": [ "
\n", "
\n", "
\n", "" ] }, "metadata": { "application/vnd.holoviews_exec.v0+json": { "id": "p1002" } }, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ " % Total % Received % Xferd Average Speed Time Time Time Current\n", " Dload Upload Total Spent Left Speed\n", " 0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0\n", " 0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0\n", "100 5933k 100 5933k 0 0 1660k 0 0:00:03 0:00:03 --:--:-- 1894k\n", "Archive: sample_FR_LFS.csv.zip\n", " inflating: sample_FR_LFS.csv \n", " inflating: __MACOSX/._sample_FR_LFS.csv \n" ] } ], "source": [ "import polars as pl\n", "import opendp.prelude as dp\n", "import hvplot\n", "\n", "dp.enable_features(\"contrib\")\n", "hvplot.extension(\"bokeh\") \n", "\n", "# Fetch data. \n", "![ -e sample_FR_LFS.csv ] || ( curl 'https://github.com/opendp/dp-test-datasets/blob/main/data/sample_FR_LFS.csv.zip?raw=true' --location --output sample_FR_LFS.csv.zip; unzip sample_FR_LFS.csv.zip )" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Protected Group Keys\n", "\n", "Grouping keys themselves can be extremely sensitive. \n", "For example, sharing a noisy count query grouped by social security numbers or credit card numbers \n", "would catastrophically violate the privacy of individuals in the data.\n", "Similarly, re-identification of individuals can be accomplished \n", "with very small combinations of grouping columns that may seem benign\n", "(for reference, see how [Latanya Sweeney reidentified the Governor of Massachusetts in healthcare data](https://arstechnica.com/tech-policy/2009/09/your-secrets-live-online-in-databases-of-ruin/)).\n", "\n", "For this reason, by default, the OpenDP Library discards any grouping key that may be unique to an individual\n", "by filtering out data partitions with too few records.\n", "OpenDP calibrates this filtering threshold such that the probability of releasing a sensitive grouping key \n", "is no greater than the privacy parameter delta (δ).\n", "\n", "In the following example, only counts for combinations of year and sex that are common among many workers are released. " ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "context = dp.Context.compositor(\n", " # Many columns contain mixtures of strings and numbers and cannot be parsed as floats,\n", " # so we'll set `ignore_errors` to true to avoid conversion errors.\n", " data=pl.scan_csv(\"sample_FR_LFS.csv\", ignore_errors=True),\n", " privacy_unit=dp.unit_of(contributions=36),\n", " privacy_loss=dp.loss_of(epsilon=1.0 / 3, delta=1e-7),\n", " # allow for one query\n", " split_evenly_over=1,\n", ")\n", "\n", "query_age_ilostat = (\n", " context.query()\n", " .group_by(\"AGE\", \"ILOSTAT\")\n", " .agg(pl.len().dp.noise())\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Before releasing the query, lets take a look at properties we can expect of the output:" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "shape: (1, 6)
columnaggregatedistributionscaleaccuracythreshold
strstrstrf64f64u32
"len""Len""Integer Laplace"108.0324.0379282089
" ], "text/plain": [ "shape: (1, 6)\n", "┌────────┬───────────┬─────────────────┬───────┬────────────┬───────────┐\n", "│ column ┆ aggregate ┆ distribution ┆ scale ┆ accuracy ┆ threshold │\n", "│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │\n", "│ str ┆ str ┆ str ┆ f64 ┆ f64 ┆ u32 │\n", "╞════════╪═══════════╪═════════════════╪═══════╪════════════╪═══════════╡\n", "│ len ┆ Len ┆ Integer Laplace ┆ 108.0 ┆ 324.037928 ┆ 2089 │\n", "└────────┴───────────┴─────────────────┴───────┴────────────┴───────────┘" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "query_age_ilostat.summarize(alpha=.05)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Judging from the query description, \n", "noisy counts returned by this query will differ by no more than the given accuracy with 1 - alpha = 95% confidence.\n", "Any grouping keys with noisy counts below the given threshold will be filtered from the release.\n", "\n", "This level of utility seems suitable, so we'll go ahead and release the query:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": {}, "metadata": {}, "output_type": "display_data" }, { "data": { "application/vnd.holoviews_exec.v0+json": "", "text/html": [ "
\n", "
\n", "
\n", "" ], "text/plain": [ ":Overlay\n", " .NdOverlay.I :NdOverlay [ILOSTAT]\n", " :Curve [AGE] (len)\n", " .NdOverlay.II :NdOverlay [ILOSTAT]\n", " :Scatter [AGE] (len)" ] }, "execution_count": 5, "metadata": { "application/vnd.holoviews_exec.v0+json": { "id": "p1006" } }, "output_type": "execute_result" } ], "source": [ "df = query_age_ilostat.release().collect()\n", "\n", "line = df.sort(\"AGE\").plot.line(x=\"AGE\", y=\"len\", by=\"ILOSTAT\")\n", "scatter = df.sort(\"AGE\").plot.scatter(x=\"AGE\", y=\"len\", by=\"ILOSTAT\")\n", "line * scatter" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Referring back to the ILOSTAT legend:\n", "\n", "| ILOSTAT | Legend | Comments | \n", "| ------- | ------ | ----------- | \n", "| 1 | did any work for pay or profit | most common among people between 30 and 50 |\n", "| 2 | employed but not working | slightly more commonly observed among young adults |\n", "| 3 | was not working because of lay-off | clearly influenced by retirement |\n", "| 9 | not applicable, less than 15 years old | only present for the youngest age group |\n", "\n", "Where points are missing in the graph, \n", "there are not enough individuals for that combination of age and employment status to pass threshold." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Public Group Keys\n", "\n", "The OpenDP Library also allows you to explicitly describe grouping keys as \n", "\"public information\" that can be released in the clear.\n", "\n", "Be aware that any aspect of your data labeled \"public information\" is not subject to privacy protections.\n", "For this reason, descriptors should be used conservatively, or not at all.\n", "When used judiciously, \n", "domain descriptors can improve the utility of your releases without damaging the integrity of your privacy guarantee.\n", "\n", "For example, in the Eurostat data, the quarters in which data has been collected may be considered public information.\n", "You can mark grouping keys for any combination of year and quarter by setting `public_info=\"keys\"` as follows:" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "context = dp.Context.compositor(\n", " data=pl.scan_csv(\"sample_FR_LFS.csv\", ignore_errors=True),\n", " privacy_unit=dp.unit_of(contributions=36),\n", " privacy_loss=dp.loss_of(epsilon=1.0 / 3),\n", " split_evenly_over=1,\n", " margins={\n", " # grouping keys by \"YEAR\" and \"QUARTER\" are public information\n", " (\"YEAR\", \"QUARTER\"): dp.polars.Margin(\n", " public_info=\"keys\",\n", " )\n", " },\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "> It is recommended to only ever create one Context that spans all queries you may make on your data.\n", "> We create a second context here to demonstrate how margins will influence the analysis.\n", "\n", "Due to the existence of this margin descriptor, the library will now release these quarterly keys in the clear." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "shape: (1, 5)
columnaggregatedistributionscaleaccuracy
strstrstrf64f64
"len""Len""Integer Laplace"108.0324.037928
" ], "text/plain": [ "shape: (1, 5)\n", "┌────────┬───────────┬─────────────────┬───────┬────────────┐\n", "│ column ┆ aggregate ┆ distribution ┆ scale ┆ accuracy │\n", "│ --- ┆ --- ┆ --- ┆ --- ┆ --- │\n", "│ str ┆ str ┆ str ┆ f64 ┆ f64 │\n", "╞════════╪═══════════╪═════════════════╪═══════╪════════════╡\n", "│ len ┆ Len ┆ Integer Laplace ┆ 108.0 ┆ 324.037928 │\n", "└────────┴───────────┴─────────────────┴───────┴────────────┘" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "query_quarterly_counts = (\n", " context.query()\n", " .group_by(\"YEAR\", \"QUARTER\")\n", " .agg(pl.len().dp.noise())\n", ")\n", "\n", "summary = query_quarterly_counts.summarize(alpha=.05)\n", "summary" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This query description no longer has a \"threshold\" field: all noisy statistics computed on each data partition will be released." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This visualization includes error bars showing 95% confidence intervals for the true value by pulling the accuracy estimate from the query description above." ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": {}, "metadata": {}, "output_type": "display_data" }, { "data": { "application/vnd.holoviews_exec.v0+json": "", "text/html": [ "
\n", "
\n", "
\n", "" ], "text/plain": [ ":Overlay\n", " .Curve.I :Curve [date] (len)\n", " .ErrorBars.I :ErrorBars [date] (len,accuracy)" ] }, "execution_count": 8, "metadata": { "application/vnd.holoviews_exec.v0+json": { "id": "p1200" } }, "output_type": "execute_result" } ], "source": [ "df = query_quarterly_counts.release().collect()\n", "\n", "# build a date column\n", "df = df.with_columns(pl.date(pl.col(\"YEAR\"), pl.col(\"QUARTER\") * 4, 1))\n", "\n", "line = df.plot.line(x=\"date\", y=\"len\")\n", "errorbars = df.with_columns(accuracy=summary[\"accuracy\"][0]) \\\n", " .plot.errorbars(x=\"date\", y=\"len\", yerr1=\"accuracy\")\n", "line * errorbars" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Even though the noise scale and accuracy estimate is the same as in the previous protected group keys query,\n", "the relative error is now much larger because the group sizes are much smaller.\n", "In spite of this, the release clearly still shows that the number of respondents increased significantly from 2008 to 2010." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Public Group Sizes\n", "\n", "It is also possible to declare partition sizes as public information.\n", "Setting this value implies that partition keys are public, \n", "and thus implies an even greater risk of damaging the integrity of the privacy guarantee \n", "than if you were to just specify grouping keys as public.\n", "\n", "Nevertheless, this approach has seen use in high-profile data releases, \n", "including the US Census Bureau in the form of \"data invariants\".\n", "\n", "One way this could be used is as part of a release for the mean number of hours worked `HWUSUAL` by sex.\n", "Referring back to the `HWUSUAL` legend, a value of `99` means not applicable,\n", "and this encoding will significantly bias the outcome of the query.\n", "\n", "Unfortunately, filtering the data within the query results in the margin info being invalidated.\n", "We are still working on expanding the preprocessing functionality and logic for preserving domain descriptors in the library,\n", "but one way to work around this limitation in the meantime is to preprocess your data before passing it into the context:" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "lf_preprocessed = pl.scan_csv(\"sample_FR_LFS.csv\", ignore_errors=True).filter(pl.col(\"HWUSUAL\") < 99)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can now set up your analysis such that the margin applies to the preprocessed data:" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "context = dp.Context.compositor(\n", " data=lf_preprocessed,\n", " privacy_unit=dp.unit_of(contributions=36),\n", " privacy_loss=dp.loss_of(epsilon=1.0 / 3),\n", " split_evenly_over=1,\n", " margins={\n", " # total number of responses when grouped by \"SEX\" is public information\n", " (\"SEX\",): dp.polars.Margin(\n", " public_info=\"lengths\",\n", " max_partition_length=60_000_000, # population of France\n", " max_num_partitions=1,\n", " )\n", " },\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can now prepare a query that computes mean working hours by gender, where each response is clipped between 0 and 98:" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "shape: (2, 5)
columnaggregatedistributionscaleaccuracy
strstrstrf64f64
"HWUSUAL""Sum""Float Laplace"5762.0240217261.481317
"HWUSUAL""Len"nullnull0.0
" ], "text/plain": [ "shape: (2, 5)\n", "┌─────────┬───────────┬───────────────┬────────────┬──────────────┐\n", "│ column ┆ aggregate ┆ distribution ┆ scale ┆ accuracy │\n", "│ --- ┆ --- ┆ --- ┆ --- ┆ --- │\n", "│ str ┆ str ┆ str ┆ f64 ┆ f64 │\n", "╞═════════╪═══════════╪═══════════════╪════════════╪══════════════╡\n", "│ HWUSUAL ┆ Sum ┆ Float Laplace ┆ 5762.02402 ┆ 17261.481317 │\n", "│ HWUSUAL ┆ Len ┆ null ┆ null ┆ 0.0 │\n", "└─────────┴───────────┴───────────────┴────────────┴──────────────┘" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "\n", "query_work_hours = (\n", " context.query()\n", " .group_by(\"SEX\")\n", " .agg(pl.col(\"HWUSUAL\").fill_null(0).dp.mean((0, 98)))\n", ")\n", "\n", "query_work_hours.summarize(alpha=.05)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This time the query description breaks down into two separate statistics for the resulting \"HWUSUAL\" column.\n", "The mean is computed by dividing separate sum and length estimates.\n", "Since the partition length is marked as public information, the length is released in the clear, without any noise.\n", "\n", "Reading the table, the smallest suitable noise scale parameter for the sum is 5762. \n", "This may seem large, but remember this noisy sum will be divided by the length, reducing the variance of the final estimate.\n", "You can divide the sum accuracy estimate by the public partition sizes to get disaggregated accuracy estimates for each partition. \n", "\n", "Be mindful that accuracy estimates for sums and means do not take into account bias introduced from clipping.\n", "However, for this specific dataset, \n", "the codebook for limits the range of work hours in `HWUSUAL` to between 0 and 98,\n", "so the clipping introduced for differential privacy will not further increase bias." ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "shape: (2, 3)
SEXSEX_STRHWUSUAL
i64strf64
1"male"40.66939
2"female"34.380722
" ], "text/plain": [ "shape: (2, 3)\n", "┌─────┬─────────┬───────────┐\n", "│ SEX ┆ SEX_STR ┆ HWUSUAL │\n", "│ --- ┆ --- ┆ --- │\n", "│ i64 ┆ str ┆ f64 │\n", "╞═════╪═════════╪═══════════╡\n", "│ 1 ┆ male ┆ 40.66939 │\n", "│ 2 ┆ female ┆ 34.380722 │\n", "└─────┴─────────┴───────────┘" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = query_work_hours.release().collect()\n", "\n", "# released dataframes from the OpenDP Library are shuffled to conceal the ordering of rows in the original dataset\n", "# therefore, to ensure proper alignment, we use join instead of hstack to add labels\n", "pl.DataFrame({\"SEX\": [1, 2], \"SEX_STR\": [\"male\", \"female\"]}).join(df, on=\"SEX\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Throughout this analysis, observe how the noise scale always remained fixed, regardless of group size. \n", "Also observe how the more individuals there are in the data, the greater the magnitude of the statistic. \n", "Therefore, as there are more individuals in a bin, the relative amount of noise decreases.\n", "\n", "Grouping by too many keys, however, can result in partitions having too few records and a loss of signal/poor utility.\n", "\n", "While grouping is a very useful strategy to conduct meaningful data analysis, there must be a balance, \n", "because excessive grouping can lead to most of your partitions being filtered, \n", "and/or results that are too noisy and misleading." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "While not used in this section, remember that you can request multiple different statistics in one query. \n", "Batching multiple statistics together can not only result in computational speedups, \n", "but also more balanced allocation of privacy budget across queries, \n", "and can help you avoid releasing the same protected grouping keys multiple times \n", "(further dividing your privacy budget).\n", "\n", "In a real data setting where you want to mediate all access to the data through one Context,\n", "you can provide domain descriptors for many margins, \n", "allowing you to relax protections over specific grouping columns, \n", "while enforcing full protections for all other grouping columns." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.12.4" } }, "nbformat": 4, "nbformat_minor": 2 }