Preparing Microdata#
Data is seldom already in the form you need it in. We use Polars expressions to describe how to build new columns and Polars contexts to describe how those expressions are applied to your data. More information can be found in the Polars User Guide.
This section explains OpenDP’s supported contexts for preparing microdata: adding columns, selecting columns, filtering rows, and grouping.
>>> import polars as pl
>>> import opendp.prelude as dp
>>> dp.enable_features("contrib")
>>> context = dp.Context.compositor(
...     # Many columns contain mixtures of strings and numbers and cannot be parsed as floats,
...     # so we'll set `ignore_errors` to true to avoid conversion errors.
...     data=pl.scan_csv(
...         dp.examples.get_france_lfs_path(),
...         ignore_errors=True,
...     ),
...     privacy_unit=dp.unit_of(contributions=36),
...     privacy_loss=dp.loss_of(epsilon=1.0, delta=1e-7),
...     split_evenly_over=4,
...     margins=[dp.polars.Margin(max_length=150_000 * 36)],
... )
OpenDP allows expressions used in the .select context and .agg context to change the number and order of rows, whereas expressions used in the .with_columns context, .filter context and .group_by context must be row-by-row.
With Columns#
.with_columns resolves each passed expression to a column and then adds those columns to the data.
>>> query_hwusual_binned = (
...     context.query()
...     # shadows the usual work hours "HWUSUAL" column with binned data
...     .with_columns(
...         pl.col.HWUSUAL.cut(
...             breaks=[0, 20, 40, 60, 80, 98],
...             left_closed=True,
...         )
...     )
...     .group_by(pl.col.HWUSUAL)
...     .agg(dp.len())
... )
>>> query_hwusual_binned.release().collect().sort(
...     "HWUSUAL"
... )
shape: (7, 2)
┌───────────┬─────────┐
│ HWUSUAL ┆ len │
│ --- ┆ --- │
│ cat ┆ u32 │
╞═══════════╪═════════╡
│ null ┆ ... │
│ [0, 20) ┆ ... │
│ [20, 40) ┆ ... │
│ [40, 60) ┆ ... │
│ [60, 80) ┆ ... │
│ [80, 98) ┆ ... │
│ [98, inf) ┆ ... │
└───────────┴─────────┘
To ensure that the privacy unit remains meaningful, expressions passed into .with_columns must be row-by-row, meaning that the expression could be represented as a function applied to each row in the data. The row-by-row property implies that the expression doesn’t break the alignment between an individual’s contributions in the data and that individual’s contributions in the newly constructed columns.
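For instance, sorting a column reorders its values independently of the other columns, so it is not row-by-row. The following sketch (illustrative only, not part of the analysis above) builds such a query; releasing it would be expected to be rejected:
>>> # sketch: `.sort` is not row-by-row, so it is not allowed in `.with_columns`
>>> query_not_row_by_row = (
...     context.query()
...     .with_columns(pl.col.HWUSUAL.sort())
...     .select(dp.len())
... )
>>> # query_not_row_by_row.release()  # expected to raise an error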
Another consideration is that any new columns added by .with_columns do not (currently) have margin descriptors. For instance, in the above query, any margin descriptors related to HWUSUAL would no longer apply to the new, shadowing HWUSUAL column after .with_columns.
Select#
.select resolves each passed expression to a column and then returns those columns. The behavior is the same as .with_columns, but only the columns specified in expressions will remain.
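As a sketch, the binning query from the previous section could be written with .select instead of .with_columns, keeping only the binned HWUSUAL column and dropping all others:
>>> query_hwusual_selected = (
...     context.query()
...     # keep only the binned HWUSUAL column; all other columns are dropped
...     .select(
...         pl.col.HWUSUAL.cut(
...             breaks=[0, 20, 40, 60, 80, 98],
...             left_closed=True,
...         )
...     )
...     .group_by(pl.col.HWUSUAL)
...     .agg(dp.len())
... )
Since the dropped columns are not used downstream, this describes the same release as the .with_columns query above.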
Filter#
.filter uses row-by-row boolean expressions to mask rows.
>>> query_total_hours_worked = (
...     context.query()
...     .with_columns(pl.col.HWUSUAL.cast(int))
...     .filter(pl.col.HWUSUAL != 99)
...     .select(pl.col.HWUSUAL.dp.sum((0, 80)))
... )
>>> print(
...     "sum:",
...     query_total_hours_worked.release().collect().item(),
... )
sum: ...
Filtering discards all invariants about the group keys and group sizes: margin descriptors are considered applicable to the input dataset, so data-dependent filtering renders these invariants invalid. Filtering preserves all other margin descriptors, because it only ever removes rows.
Group By (Private)#
.group_by also resolves each passed expression to a column, and then groups on those columns. Just like .select and .with_columns, these expressions must be row-by-row.
>>> query_hwusual_binned = (
...     context.query()
...     .group_by(
...         pl.col.HWUSUAL.cut(
...             [0, 20, 40, 60, 80, 98], left_closed=True
...         )
...     )
...     .agg(dp.len())
... )
>>> query_hwusual_binned.release().collect().sort(
...     "HWUSUAL"
... )
shape: (7, 2)
┌───────────┬─────────┐
│ HWUSUAL ┆ len │
│ --- ┆ --- │
│ cat ┆ u32 │
╞═══════════╪═════════╡
│ null ┆ ... │
│ [0, 20) ┆ ... │
│ [20, 40) ┆ ... │
│ [40, 60) ┆ ... │
│ [60, 80) ┆ ... │
│ [80, 98) ┆ ... │
│ [98, inf) ┆ ... │
└───────────┴─────────┘
This is the same query as shown above, but with the binning moved into the group by context.
Group By / Agg (Stable)#
group_by/agg can also be used earlier in the data pipeline, before the private group_by/agg or select aggregation. This is a generalization of the sample and aggregate framework. The approach is appealing because arbitrary expressions can be used in the agg argument, but it comes with the drawback that a large amount of data is needed to get reasonable utility.
The following query demonstrates how you can use the approach to compute arbitrary statistics, by first computing a statistic of interest (the min) on each of roughly 1000 groups, and then releasing a differentially private mean.
>>> query_hwusual_binned = (
...     context.query()
...     .filter(pl.col.HWUSUAL != 99)
...     # group 1000 ways
...     .group_by(pl.col.PIDENT % 1000)
...     .agg(pl.col.HWUSUAL.min())
...     # up to 1000 records left to work with to compute a DP mean
...     .select(pl.col.HWUSUAL.cast(int).dp.mean((0, 30)))
... )
>>> query_hwusual_binned.summarize()
shape: (2, 4)
┌─────────┬───────────┬─────────────────┬─────────┐
│ column ┆ aggregate ┆ distribution ┆ scale │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str ┆ f64 │
╞═════════╪═══════════╪═════════════════╪═════════╡
│ HWUSUAL ┆ Sum ┆ Integer Laplace ┆ 17280.0 │
│ HWUSUAL ┆ Length ┆ Integer Laplace ┆ 576.0 │
└─────────┴───────────┴─────────────────┴─────────┘
The noise scale is also relatively large. The current configuration of the context doesn’t know that all records from a user share the same PIDENT. This information can be added when building the context:
>>> context_pident = dp.Context.compositor(
...     data=pl.scan_csv(
...         dp.examples.get_france_lfs_path(),
...         ignore_errors=True,
...     ),
...     privacy_unit=dp.unit_of(
...         contributions=[
...             dp.polars.Bound(per_group=36),
...             # a user can only be in one group at a time when grouped this way
...             dp.polars.Bound(
...                 by=[pl.col.PIDENT % 1000], num_groups=1
...             ),
...         ]
...     ),
...     privacy_loss=dp.loss_of(epsilon=1.0, delta=1e-7),
...     split_evenly_over=4,
...     margins=[dp.polars.Margin(max_length=150_000 * 36)],
... )
>>> query_hwusual_binned = (
...     context_pident.query()
...     .filter(pl.col.HWUSUAL != 99)
...     # group 1000 ways
...     .group_by(pl.col.PIDENT % 1000)
...     .agg(pl.col.HWUSUAL.min())
...     # up to 1000 records left to work with to compute a DP mean
...     .select(pl.col.HWUSUAL.cast(int).dp.mean((0, 30)))
... )
>>> query_hwusual_binned.summarize()
shape: (2, 4)
┌─────────┬───────────┬─────────────────┬───────┐
│ column ┆ aggregate ┆ distribution ┆ scale │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str ┆ f64 │
╞═════════╪═══════════╪═════════════════╪═══════╡
│ HWUSUAL ┆ Sum ┆ Integer Laplace ┆ 480.0 │
│ HWUSUAL ┆ Length ┆ Integer Laplace ┆ 16.0 │
└─────────┴───────────┴─────────────────┴───────┘
Adding this Bound reduced the noise scale by a factor of 36, because at most one record in the aggregated dataset changes, instead of 36. Nevertheless, the group_by/agg doubles the amount of noise necessary, because contributing even a single record results in a change to an aggregated record.
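The factor of 36 can be read directly off the two summaries:
>>> 17280.0 / 480.0, 576.0 / 16.0
(36.0, 36.0)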