{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Manipulation\n",
"[[Polars Documentation](https://docs.pola.rs/api/python/dev/reference/lazyframe/modify_select.html)]\n"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import polars as pl\n",
"import opendp.prelude as dp\n",
"\n",
"dp.enable_features(\"contrib\")\n",
"# Fetch and unpack the data. \n",
"![ -e ../sample_FR_LFS.csv ] || ( curl 'https://github.com/opendp/dp-test-datasets/blob/main/data/sample_FR_LFS.csv.zip?raw=true' --location --output sample_FR_LFS.csv.zip; unzip sample_FR_LFS.csv.zip -d ../ )\n",
"\n",
"context = dp.Context.compositor(\n",
" # Many columns contain mixtures of strings and numbers and cannot be parsed as floats,\n",
" # so we'll set `ignore_errors` to true to avoid conversion errors.\n",
" data=pl.scan_csv(\"../sample_FR_LFS.csv\", ignore_errors=True),\n",
" privacy_unit=dp.unit_of(contributions=36),\n",
" privacy_loss=dp.loss_of(epsilon=1.0, delta=1e-7),\n",
" split_evenly_over=4,\n",
" margins={(): dp.polars.Margin(max_partition_length=60_000_000 * 36)}\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"\n",
"## Cast\n",
"When a Polars LazyFrame is passed to the Context API, the data schema is read off of the dataframe.\n",
"This means that in common usage, the OpenDP Library considers the data schema to be public information,\n",
"and that the columns are already correctly typed.\n",
"\n",
"While the OpenDP Library supports cast expressions, \n",
"a drawback to their usage is that cast expressions on grouping columns will void any margin descriptors for those columns.\n",
"\n",
"One setting where you may find cast expressions useful is when computing a float sum on a large dataset.\n",
"OpenDP accounts for inexact floating-point arithmetic when computing the float sum,\n",
"and on data with large bounds and hundreds of thousands of records, this term can dominate the sensitivity."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"
shape: (1, 4)column | aggregate | distribution | scale |
---|
str | str | str | f64 |
"HWUSUAL" | "Sum" | "Float Laplace" | 843177.046991 |
"
],
"text/plain": [
"shape: (1, 4)\n",
"┌─────────┬───────────┬───────────────┬───────────────┐\n",
"│ column ┆ aggregate ┆ distribution ┆ scale │\n",
"│ --- ┆ --- ┆ --- ┆ --- │\n",
"│ str ┆ str ┆ str ┆ f64 │\n",
"╞═════════╪═══════════╪═══════════════╪═══════════════╡\n",
"│ HWUSUAL ┆ Sum ┆ Float Laplace ┆ 843177.046991 │\n",
"└─────────┴───────────┴───────────────┴───────────────┘"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"(\n",
" context.query()\n",
" .select(pl.col.HWUSUAL.fill_null(0.0).fill_nan(0.0).dp.sum((0, 100)))\n",
" .summarize()\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Casting to integers avoids this term, resulting in a much smaller noise scale to satisfy the same level of privacy. "
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
shape: (1, 4)column | aggregate | distribution | scale |
---|
str | str | str | f64 |
"HWUSUAL" | "Sum" | "Integer Laplace" | 14400.0 |
"
],
"text/plain": [
"shape: (1, 4)\n",
"┌─────────┬───────────┬─────────────────┬─────────┐\n",
"│ column ┆ aggregate ┆ distribution ┆ scale │\n",
"│ --- ┆ --- ┆ --- ┆ --- │\n",
"│ str ┆ str ┆ str ┆ f64 │\n",
"╞═════════╪═══════════╪═════════════════╪═════════╡\n",
"│ HWUSUAL ┆ Sum ┆ Integer Laplace ┆ 14400.0 │\n",
"└─────────┴───────────┴─────────────────┴─────────┘"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"context.query().select(pl.col.HWUSUAL.cast(int).fill_null(0).dp.sum((0, 100))).summarize()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The OpenDP Library forces that failed casts do not throw a (data-dependent) exception,\n",
"instead returning a null.\n",
"Therefore using this cast operation updates the output domain \n",
"to indicate that there may potentially be nulls.\n",
"You'll probably need to apply `.fill_null` before computing statistics with casted data."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"\n",
"## Clip\n",
"\n",
"Computing the sum and mean privately requires input data to be restricted between some lower and upper bound.\n",
"DP expressions like `.dp.sum` and `.dp.mean` automatically insert a `.clip` expression based on given data bounds.\n",
"However, a `.clip` transformation may be used anywhere, and it will establish a domain descriptor for the column being clipped.\n",
"When an aggregation is conducted, the library will check for the presence of this descriptor \n",
"if it is necessary to bound the sensitivity of the query.\n",
"\n",
"This is demonstrated in the following query, where the preprocessing is broken apart into different data processing phases."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
shape: (1, 4)column | aggregate | distribution | scale |
---|
str | str | str | f64 |
"HWUSUAL" | "Sum" | "Integer Laplace" | 14400.0 |
"
],
"text/plain": [
"shape: (1, 4)\n",
"┌─────────┬───────────┬─────────────────┬─────────┐\n",
"│ column ┆ aggregate ┆ distribution ┆ scale │\n",
"│ --- ┆ --- ┆ --- ┆ --- │\n",
"│ str ┆ str ┆ str ┆ f64 │\n",
"╞═════════╪═══════════╪═════════════════╪═════════╡\n",
"│ HWUSUAL ┆ Sum ┆ Integer Laplace ┆ 14400.0 │\n",
"└─────────┴───────────┴─────────────────┴─────────┘"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"(\n",
" context.query()\n",
" .with_columns(pl.col.HWUSUAL.cast(int).fill_null(0).clip(0, 100))\n",
" .select(pl.col.HWUSUAL.sum().dp.noise())\n",
" .summarize()\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"## Cut\n",
"\n",
"Cut is a transformation that bins numerical data according to a list of breaks. \n",
"The following example releases counts of the number of individuals working each hour range."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
shape: (4, 2)HWUSUAL | len |
---|
cat | u32 |
"(0, 20]" | 6546 |
"(20, 40]" | 53980 |
"(40, 60]" | 15430 |
"(98, inf]" | 119890 |
"
],
"text/plain": [
"shape: (4, 2)\n",
"┌───────────┬────────┐\n",
"│ HWUSUAL ┆ len │\n",
"│ --- ┆ --- │\n",
"│ cat ┆ u32 │\n",
"╞═══════════╪════════╡\n",
"│ (0, 20] ┆ 6546 │\n",
"│ (20, 40] ┆ 53980 │\n",
"│ (40, 60] ┆ 15430 │\n",
"│ (98, inf] ┆ 119890 │\n",
"└───────────┴────────┘"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"breaks = [0, 20, 40, 60, 80, 98]\n",
"\n",
"query = (\n",
" context.query()\n",
" .with_columns(pl.col.HWUSUAL.cut(breaks=breaks))\n",
" .group_by(\"HWUSUAL\")\n",
" .agg(dp.len())\n",
")\n",
"query.release().collect().sort(\"HWUSUAL\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In this setting it is not necessary to spend an additional $\\delta$ parameter to privately release the keys.\n",
"Instead we can construct an explicit key set based on the bin labels from grouping:"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
shape: (1, 4)column | aggregate | distribution | scale |
---|
str | str | str | f64 |
"len" | "Frame Length" | "Integer Laplace" | 144.0 |
"
],
"text/plain": [
"shape: (1, 4)\n",
"┌────────┬──────────────┬─────────────────┬───────┐\n",
"│ column ┆ aggregate ┆ distribution ┆ scale │\n",
"│ --- ┆ --- ┆ --- ┆ --- │\n",
"│ str ┆ str ┆ str ┆ f64 │\n",
"╞════════╪══════════════╪═════════════════╪═══════╡\n",
"│ len ┆ Frame Length ┆ Integer Laplace ┆ 144.0 │\n",
"└────────┴──────────────┴─────────────────┴───────┘"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"def cut_labels(breaks, left_closed=False):\n",
" edges = [\"-inf\", *breaks, \"inf\"]\n",
" bl, br = (\"[\", \")\") if left_closed else (\"(\", \"]\")\n",
" return [f\"{bl}{l}, {r}{br}\" for l, r in zip(edges[:-1], edges[1:])]\n",
"\n",
"labels = pl.Series(\"HWUSUAL\", cut_labels(breaks), dtype=pl.Categorical)\n",
"\n",
"query = (\n",
" context.query()\n",
" .with_columns(pl.col.HWUSUAL.cut(breaks=breaks))\n",
" .group_by(\"HWUSUAL\")\n",
" .agg(dp.len())\n",
" .with_keys(pl.LazyFrame([labels]))\n",
")\n",
"query.summarize()"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
shape: (7, 2)HWUSUAL | len |
---|
cat | u32 |
"(-inf, 0]" | 288 |
"(0, 20]" | 6216 |
"(20, 40]" | 54219 |
"(40, 60]" | 15574 |
"(60, 80]" | 2448 |
"(80, 98]" | 100 |
"(98, inf]" | 120004 |
"
],
"text/plain": [
"shape: (7, 2)\n",
"┌───────────┬────────┐\n",
"│ HWUSUAL ┆ len │\n",
"│ --- ┆ --- │\n",
"│ cat ┆ u32 │\n",
"╞═══════════╪════════╡\n",
"│ (-inf, 0] ┆ 288 │\n",
"│ (0, 20] ┆ 6216 │\n",
"│ (20, 40] ┆ 54219 │\n",
"│ (40, 60] ┆ 15574 │\n",
"│ (60, 80] ┆ 2448 │\n",
"│ (80, 98] ┆ 100 │\n",
"│ (98, inf] ┆ 120004 │\n",
"└───────────┴────────┘"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"query.release().collect().sort(\"HWUSUAL\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The output type is categorical, but with a data-independent encoding, meaning OpenDP allows grouping by these keys."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"## Fill NaN\n",
"`.fill_nan` replaces NaN float values. Not to be confused with `.fill_null`.\n",
"The output data is only considered non-nan if the fill expression is both non-null and non-nan.\n",
"\n",
"In common use throughout the documentation, the fill value has been simply a single scalar,\n",
"but more complicated expressions are valid:\n"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
shape: (1, 4)column | aggregate | distribution | scale |
---|
str | str | str | f64 |
"HWUSUAL" | "Sum" | "Float Laplace" | 843177.046991 |
"
],
"text/plain": [
"shape: (1, 4)\n",
"┌─────────┬───────────┬───────────────┬───────────────┐\n",
"│ column ┆ aggregate ┆ distribution ┆ scale │\n",
"│ --- ┆ --- ┆ --- ┆ --- │\n",
"│ str ┆ str ┆ str ┆ f64 │\n",
"╞═════════╪═══════════╪═══════════════╪═══════════════╡\n",
"│ HWUSUAL ┆ Sum ┆ Float Laplace ┆ 843177.046991 │\n",
"└─────────┴───────────┴───────────────┴───────────────┘"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"(\n",
" context.query()\n",
" # prepare actual work hours as a valid fill column\n",
" .with_columns(pl.col.HWACTUAL.fill_nan(0.0).fill_null(0.0))\n",
" # prepare usual work hours with actual work hours as a fill\n",
" .with_columns(pl.col.HWUSUAL.fill_nan(pl.col.HWACTUAL).fill_null(pl.col.HWACTUAL))\n",
" # compute the dp sum\n",
" .select(pl.col.HWUSUAL.dp.sum((0, 100)))\n",
" .summarize()\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"At this time `.fill_nan` always drops data bounds, so make sure your data is non-nan before running `.clip`.\n",
"\n",
"Even if you are in an aggregation context like `.select` or `.agg`,\n",
"OpenDP enforces that inputs to `.fill_nan` are row-by-row.\n",
"This is to ensure that the left and right arguments of binary operators have meaningful row alignment,\n",
"and that inputs share the same number of records, to avoid data-dependent errors that would violate the privacy guarantee."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"\n",
"## Fill Null\n",
"`.fill_null` replaces null values. Not to be confused with `.fill_nan`.\n",
"All data types in Polars may be null.\n",
"The output data is only considered non-null if the fill expression is non-null.\n",
"\n",
"In common use throughout the documentation, the fill value has been simply a single scalar,\n",
"but more complicated expressions are valid:"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
shape: (1, 4)column | aggregate | distribution | scale |
---|
str | str | str | f64 |
"HWUSUAL" | "Sum" | "Integer Laplace" | 14400.0 |
"
],
"text/plain": [
"shape: (1, 4)\n",
"┌─────────┬───────────┬─────────────────┬─────────┐\n",
"│ column ┆ aggregate ┆ distribution ┆ scale │\n",
"│ --- ┆ --- ┆ --- ┆ --- │\n",
"│ str ┆ str ┆ str ┆ f64 │\n",
"╞═════════╪═══════════╪═════════════════╪═════════╡\n",
"│ HWUSUAL ┆ Sum ┆ Integer Laplace ┆ 14400.0 │\n",
"└─────────┴───────────┴─────────────────┴─────────┘"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"(\n",
" context.query()\n",
" # prepare actual work hours as a valid fill column\n",
" .with_columns(pl.col.HWACTUAL.cast(int).fill_null(0.0))\n",
" # prepare usual work hours with actual work hours as a fill\n",
" .with_columns(pl.col.HWUSUAL.cast(int).fill_null(pl.col.HWACTUAL))\n",
" # compute the dp sum\n",
" .select(pl.col.HWUSUAL.dp.sum((0, 100)))\n",
" .summarize()\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"At this time `.fill_null` always drops data bounds, so make sure your data is non-null before running `.clip`.\n",
"\n",
"Just like `.fill_nan`, even if you are in an aggregation context like `.select` or `.agg`,\n",
"OpenDP enforces that inputs to `.fill_nan` are row-by-row."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"## To Physical\n",
"\n",
"`.to_physical` returns the underlying data representation categorical (`pl.Categorical`) or temporal (`pl.Date`, `pl.Time`, `pl.Datetime`) data types.\n",
"For example, you can use the `.to_physical` expression to retrieve the bin indices of the `.cut` expression."
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
shape: (7, 2)HWUSUAL | len |
---|
u32 | u32 |
0 | 366 |
1 | 6292 |
2 | 54002 |
3 | 15308 |
4 | 2492 |
5 | 0 |
6 | 119904 |
"
],
"text/plain": [
"shape: (7, 2)\n",
"┌─────────┬────────┐\n",
"│ HWUSUAL ┆ len │\n",
"│ --- ┆ --- │\n",
"│ u32 ┆ u32 │\n",
"╞═════════╪════════╡\n",
"│ 0 ┆ 366 │\n",
"│ 1 ┆ 6292 │\n",
"│ 2 ┆ 54002 │\n",
"│ 3 ┆ 15308 │\n",
"│ 4 ┆ 2492 │\n",
"│ 5 ┆ 0 │\n",
"│ 6 ┆ 119904 │\n",
"└─────────┴────────┘"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"breaks = [0, 20, 40, 60, 80, 98]\n",
"labels = pl.Series(\"HWUSUAL\", list(range(len(breaks) + 1)), dtype=pl.UInt32)\n",
"\n",
"query = (\n",
" context.query()\n",
" .with_columns(pl.col.HWUSUAL.cut(breaks=breaks).to_physical())\n",
" .group_by(\"HWUSUAL\")\n",
" .agg(dp.len())\n",
" .with_keys(pl.LazyFrame([labels]))\n",
")\n",
"query.release().collect().sort(\"HWUSUAL\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In the case of categorical data types, \n",
"OpenDP only allows this expression if the encoding is data-independent.\n",
"More information can be found in [Data Types](../data-types.ipynb)."
]
}
],
"metadata": {
"kernelspec": {
"display_name": ".venv",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.13.0"
}
},
"nbformat": 4,
"nbformat_minor": 2
}