{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Aggregation\n", "\n", "[[Polars Documentation](https://docs.pola.rs/api/python/stable/reference/expressions/aggregation.html)]\n", "\n", "The most common aggregators like length, sum, mean, median and quantile are covered in [essential statistics](../../../../getting-started/tabular-data/essential-statistics.ipynb).\n", "\n", "In addition to these aggregators, OpenDP also supports other variations of counting queries.\n", "A counting query tells you how many rows in a dataset meet a given condition." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import polars as pl \n", "import opendp.prelude as dp\n", "\n", "dp.enable_features(\"contrib\")\n", "\n", "# Fetch and unpack the data. \n", "![ -e ../sample_FR_LFS.csv ] || ( curl 'https://github.com/opendp/dp-test-datasets/blob/main/data/sample_FR_LFS.csv.zip?raw=true' --location --output sample_FR_LFS.csv.zip; unzip sample_FR_LFS.csv.zip -d ../ )" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To get started, we'll recreate the Context from the [tabular data introduction](../index.rst)." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "context = dp.Context.compositor(\n", " data=pl.scan_csv(\"../sample_FR_LFS.csv\", ignore_errors=True),\n", " privacy_unit=dp.unit_of(contributions=36),\n", " privacy_loss=dp.loss_of(epsilon=1.0),\n", " split_evenly_over=5,\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Frame Length vs Expression Length\n", "\n", "[Frame length](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.len.html)\n", "is not the same as [expression length](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.len.html). \n", "These quantities can differ if the expression changes the number of rows." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "shape: (2, 4)
columnaggregatedistributionscale
strstrstrf64
"len""Frame Length""Integer Laplace"360.0
"HWUSUAL""Length""Integer Laplace"360.0
" ], "text/plain": [ "shape: (2, 4)\n", "┌─────────┬──────────────┬─────────────────┬───────┐\n", "│ column ┆ aggregate ┆ distribution ┆ scale │\n", "│ --- ┆ --- ┆ --- ┆ --- │\n", "│ str ┆ str ┆ str ┆ f64 │\n", "╞═════════╪══════════════╪═════════════════╪═══════╡\n", "│ len ┆ Frame Length ┆ Integer Laplace ┆ 360.0 │\n", "│ HWUSUAL ┆ Length ┆ Integer Laplace ┆ 360.0 │\n", "└─────────┴──────────────┴─────────────────┴───────┘" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "query_len_variations = (\n", " context.query()\n", " .group_by(\"SEX\")\n", " .agg([\n", " # total number of rows in the frame, including nulls\n", " dp.len(),\n", " # total number of rows in the HWUSUAL column (including nulls)\n", " pl.col.HWUSUAL.dp.len(),\n", " ])\n", " # explicitly specifying keys makes the query satisfy pure-DP\n", " .with_keys(pl.LazyFrame({\"SEX\": [1, 2]}))\n", ")\n", "query_len_variations.summarize()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "These two statistics are equivalent, \n", "but the frame length (the first) can be used to release stable grouping keys, \n", "while the column length (the second) can be preprocessed with filtering.\n", "\n", "The OpenDP Library will still use margin descriptors that may reduce the sensitivity of the column length\n", "if it detects that the column has not been transformed in a way that changes the number of rows." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "shape: (2, 3)
SEXlenHWUSUAL
i64u32u32
19576295047
2103723104234
" ], "text/plain": [ "shape: (2, 3)\n", "┌─────┬────────┬─────────┐\n", "│ SEX ┆ len ┆ HWUSUAL │\n", "│ --- ┆ --- ┆ --- │\n", "│ i64 ┆ u32 ┆ u32 │\n", "╞═════╪════════╪═════════╡\n", "│ 1 ┆ 95762 ┆ 95047 │\n", "│ 2 ┆ 103723 ┆ 104234 │\n", "└─────┴────────┴─────────┘" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "query_len_variations.release().collect()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Unique Counts\n", "A count of the number of unique values in a column is as sensitive as the frame or column length when protecting user contributions.\n", "However, unlike the frame length, the sensitivity does not reduce to zero when protecting changed records, \n", "as a change in an individual's answer may result in one more, or one less, unique value." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "shape: (1, 4)
columnaggregatedistributionscale
strstrstrf64
"HWUSUAL""N Unique""Integer Laplace"180.0
" ], "text/plain": [ "shape: (1, 4)\n", "┌─────────┬───────────┬─────────────────┬───────┐\n", "│ column ┆ aggregate ┆ distribution ┆ scale │\n", "│ --- ┆ --- ┆ --- ┆ --- │\n", "│ str ┆ str ┆ str ┆ f64 │\n", "╞═════════╪═══════════╪═════════════════╪═══════╡\n", "│ HWUSUAL ┆ N Unique ┆ Integer Laplace ┆ 180.0 │\n", "└─────────┴───────────┴─────────────────┴───────┘" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "query_n_unique = context.query().select([\n", " # total number of unique elements in the HWUSUAL column (including null)\n", " pl.col.HWUSUAL.dp.n_unique(),\n", "])\n", "query_n_unique.summarize()" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "shape: (1, 1)
HWUSUAL
u32
0
" ], "text/plain": [ "shape: (1, 1)\n", "┌─────────┐\n", "│ HWUSUAL │\n", "│ --- │\n", "│ u32 │\n", "╞═════════╡\n", "│ 0 │\n", "└─────────┘" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "query_n_unique.release().collect()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Noise added to a count can make the count go negative, \n", "but since the output data type is an unsigned integer, the library may return zero. \n", "This is more likely to happen with the true value is small.\n", "\n", "This release tells us that the number of null values is relatively small." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Null and Non-Null Counts\n", "You can release a count of the number of null or non-null records, respectively, as follows:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "shape: (2, 4)
columnaggregatedistributionscale
strstrstrf64
"HWUSUAL""Count""Integer Laplace"360.0
"HWUSUAL""Null Count""Integer Laplace"360.0
" ], "text/plain": [ "shape: (2, 4)\n", "┌─────────┬────────────┬─────────────────┬───────┐\n", "│ column ┆ aggregate ┆ distribution ┆ scale │\n", "│ --- ┆ --- ┆ --- ┆ --- │\n", "│ str ┆ str ┆ str ┆ f64 │\n", "╞═════════╪════════════╪═════════════════╪═══════╡\n", "│ HWUSUAL ┆ Count ┆ Integer Laplace ┆ 360.0 │\n", "│ HWUSUAL ┆ Null Count ┆ Integer Laplace ┆ 360.0 │\n", "└─────────┴────────────┴─────────────────┴───────┘" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "query_counts = context.query().select([\n", " # total number of non-null elements in the HWUSUAL column\n", " pl.col.HWUSUAL.dp.count(),\n", " # total number of null elements in the HWUSUAL column\n", " pl.col.HWUSUAL.dp.null_count(),\n", "])\n", "query_counts.summarize()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice that the ``count`` and ``null_count`` are complementary:\n", "you could instead release ``len`` for ``HWUSUAL`` grouped by whether the value is null.\n", "\n", "You can take advantage of this to estimate both statistics with the same privacy loss, but with half as much noise." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "shape: (1, 4)
columnaggregatedistributionscale
strstrstrf64
"len""Frame Length""Integer Laplace"180.0
" ], "text/plain": [ "shape: (1, 4)\n", "┌────────┬──────────────┬─────────────────┬───────┐\n", "│ column ┆ aggregate ┆ distribution ┆ scale │\n", "│ --- ┆ --- ┆ --- ┆ --- │\n", "│ str ┆ str ┆ str ┆ f64 │\n", "╞════════╪══════════════╪═════════════════╪═══════╡\n", "│ len ┆ Frame Length ┆ Integer Laplace ┆ 180.0 │\n", "└────────┴──────────────┴─────────────────┴───────┘" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "query_counts_via_grouping = (\n", " context.query()\n", " .with_columns(pl.col(\"HWUSUAL\").is_null().alias(\"HWUSUAL_is_null\"))\n", " .group_by(\"HWUSUAL_is_null\")\n", " .agg(dp.len())\n", " # we're grouping on a bool column, so the groups are:\n", " .with_keys(pl.LazyFrame({\"HWUSUAL_is_null\": [True, False]}))\n", ")\n", "query_counts_via_grouping.summarize()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The noise scale dropped from 360 to 180..." ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "shape: (2, 2)
HWUSUAL_is_nulllen
boolu32
false197918
true1297
" ], "text/plain": [ "shape: (2, 2)\n", "┌─────────────────┬────────┐\n", "│ HWUSUAL_is_null ┆ len │\n", "│ --- ┆ --- │\n", "│ bool ┆ u32 │\n", "╞═════════════════╪════════╡\n", "│ false ┆ 197918 │\n", "│ true ┆ 1297 │\n", "└─────────────────┴────────┘" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "query_counts_via_grouping.release().collect()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "...but we still get answers to all of the same queries!" ] } ], "metadata": { "kernelspec": { "display_name": ".venv", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.13.0" } }, "nbformat": 4, "nbformat_minor": 2 }