Contingency Tables
A contingency table counts the number of records in each group when the data is grouped by columns of a dataset. Contingency tables capture the relationships between columns in your data.
As the number of columns in the contingency table grows, the number of possible combinations of keys (AKA groups) grows exponentially. This results in the contingency table becoming too large to hold in memory or to measure with any utility. For this reason, it is recommended to instead estimate specific relationships between smaller sets of two or maybe three columns by releasing marginals (counts when data is grouped by some subset of the columns).
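For a back-of-the-envelope sense of scale, the number of cells in a full contingency table is the product of the column cardinalities (the cardinalities below are hypothetical):

>>> from math import prod
>>> cardinalities = [10] * 10  # ten columns, each with ten categories
>>> prod(cardinalities)  # cells in the full contingency table
10000000000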
OpenDP uses Private-PGM to build a contingency table that is consistent with a set of marginal queries via a process called model-based inference (mbi).
This functionality is not enabled by default. Add [mbi] to the package installation to enable model-based inference functionality.
pip install 'opendp[mbi]'
This installs the specific version of the Private-PGM package that is compatible with the OpenDP Library.
mbi (Private-PGM) is only available in Python.
Let’s get started by setting up the context for the Labor Force dataset.
>>> import opendp.prelude as dp
>>> import polars as pl
>>> dp.enable_features("contrib")
>>> context = dp.Context.compositor(
... data=pl.scan_csv(dp.examples.get_france_lfs_path(), ignore_errors=True),
... privacy_unit=dp.unit_of(contributions=36),
... privacy_loss=dp.loss_of(rho=0.2, delta=1e-7),
... )
.contingency_table releases an approximation to a contingency table over all columns by estimating lower-order marginals. To improve the utility of the mechanism, specify the key-sets for categorical data and cut-sets for numerical data. The mechanism will release stable keys for any columns that don't have key- or cut-sets.
>>> query = (
... context.query(rho=0.1, delta=1e-7)
... # transformations/truncation may be applied here
... .select("SEX", "AGE", "HWUSUAL", "ILOSTAT")
... .contingency_table(
... keys={"SEX": [1, 2]},
... cuts={"AGE": [20, 40, 60], "HWUSUAL": [1, 20, 40]},
... algorithm=dp.mbi.Fixed(
... queries=[
... dp.mbi.Count(("AGE", "HWUSUAL")),
... dp.mbi.Count(("HWUSUAL", "ILOSTAT")),
... ],
... # proportion of budget to spend estimating one-way marginal key-sets
... # (defaults to num_unknown / num_columns / 2)
... oneway_split=1 / 8
... )
... )
... )
By default, up to half of the privacy budget is used to estimate one-way marginals and their keys. The exact proportion depends on the share of columns with unknown keys or cuts. For example, in the prior query, keys or cuts are specified for all columns except ILOSTAT, so an eighth of the budget is used to estimate a one-way marginal on ILOSTAT. You can configure the privacy budget split for one-way marginals and key release via the oneway_split attribute on the algorithm.
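For instance, with one unknown column out of the four selected above, the default split works out to the eighth used in the query (a sketch of the stated default formula, not a library call):

>>> num_unknown, num_columns = 1, 4  # only ILOSTAT lacks keys or cuts
>>> num_unknown / num_columns / 2  # default oneway_split
0.125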
Before releasing, you can view the noise scale and threshold to be used when estimating one-way key-sets.
>>> query.oneway_scale
227.68399153212334
>>> query.oneway_threshold
1364
In this setting, the scale and threshold are reasonably small. To make the threshold smaller, consider adding more key-sets or bounding the number of groups a user may contribute.
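As a rough mental model (not the library's exact internals), a candidate key from a one-way marginal is only released when its noisy count clears the threshold:

>>> noisy_count = 2500.0  # hypothetical noisy count for one candidate key
>>> noisy_count >= query.oneway_threshold  # key survives stable-key release
True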
When you release, all marginals are estimated and stored inside an opendp.extras.mbi.ContingencyTable. The contingency table supports counting queries over arbitrary sets of grouping columns.
>>> table: dp.mbi.ContingencyTable = query.release()
Any keys in the data that are not present in the key-set, either because they are missing from the key-set (like a third category for SEX) or because they had frequencies too low to meet the threshold (like the uncommon ILOSTAT category 5), are replaced with null when estimating higher-order marginals.
>>> table.keys["ILOSTAT"]
shape: (5,)
Series: 'ILOSTAT' [i64]
[
1
2
3
9
null
]
In this example, the infrequent ILOSTAT categories 4 and 5 were absorbed into the null key. Preserve true nulls in your data by preprocessing them into another key via .select or .with_columns.
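For example, a minimal sketch using Polars' fill_null to recode true nulls into a sentinel key (the sentinel -1 is an arbitrary choice):

>>> lf = pl.LazyFrame({"ILOSTAT": [1, 2, None]}).with_columns(
...     pl.col("ILOSTAT").fill_null(-1)  # recode nulls before querying
... )
>>> lf.collect()["ILOSTAT"].to_list()
[1, 2, -1]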
To see counts of the number of records corresponding to each key, project the contingency table down to just ILOSTAT.
>>> table.project(("ILOSTAT",)).astype(int)
Array([1515729, 155912, 1450748, 689452, 109], dtype=int64)
The same projection can be viewed in a melted dataframe form:
>>> table.project_melted(("ILOSTAT",))
shape: (5, 2)
┌─────────┬──────────────┐
│ ILOSTAT ┆ len │
│ --- ┆ --- │
│ i64 ┆ f64 │
╞═════════╪══════════════╡
│ 1 ┆ 1.5157e6 │
│ 2 ┆ 156011.15206 │
│ 3 ┆ 1.4509e6 │
│ 9 ┆ 689551.39267 │
│ null ┆ 119.022842 │
└─────────┴──────────────┘
Since ("ILOSTAT",)
is covered by the query workload,
an estimate of the standard deviation can be derived.
Since all noise added is gaussian-distributed,
the resulting noise distribution remains approximately gaussian-distributed,
so it is possible to construct a confidence interval for each scalar in the projection.
>>> scale = table.std(("ILOSTAT",))
>>> dp.gaussian_scale_to_accuracy(scale, alpha=.05)
446.2524232592843
That is, the true count differs from the estimate by no more than the accuracy estimate, with (1 - alpha)100% confidence.
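Putting this together, a (1 - alpha)100% confidence interval for any cell is the estimate plus or minus the accuracy (a sketch reusing the values above):

>>> estimate = table.project(("ILOSTAT",))[0]  # count for ILOSTAT category 1
>>> accuracy = dp.gaussian_scale_to_accuracy(scale, alpha=.05)
>>> interval = (estimate - accuracy, estimate + accuracy)  # 95% interval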
Adaptive Estimation
Now consider the SEX column, which contains three keys: the two specified in keys, as well as null for any records not in the key-set.
>>> table.keys["SEX"]
shape: (3,)
Series: 'SEX' [i64]
[
1
2
null
]
SEX is not well-supported by the contingency table, because it is not included in any of the released marginals. As a result, the best estimate for the total number of records is distributed uniformly over the key-set, and the standard deviation is unknown.
>>> table.project(("SEX",)).astype(int)
Array([1270561, 1270561, 1270561], dtype=int64)
>>> table.std(("SEX",))
Traceback (most recent call last):
...
ValueError: attrs (('SEX',)) are not covered by the query set
To improve the utility of this projection, you could release an updated contingency table with more marginals.
>>> table2 = (
... context.query(rho=0.05, delta=0.0)
... .select("SEX")
... .contingency_table(
... table=table,
... algorithm=dp.mbi.Fixed(queries=[dp.mbi.Count(("SEX",))]),
... )
... .release()
... )
The updated table now much more accurately reflects the distribution of counts over SEX.
>>> table2.project(("SEX",)).astype(int)
Array([1828251, 1982295, 1228], dtype=int64)
>>> table2.std(("SEX",))
113.84199576606166
The contingency table mechanism can satisfy either pure-DP (epsilon) or zCDP (rho). If approximate-DP (delta) is not enabled, then the key-sets and cut-sets must be exhaustive. Since microdata is never materialized outside of Polars, datasets can be as large as the Polars library can handle.
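For instance, a pure-DP release might start from a context like the following sketch, in which case every queried column must then be given exhaustive key- or cut-sets:

>>> context_pure = dp.Context.compositor(
...     data=pl.scan_csv(dp.examples.get_france_lfs_path(), ignore_errors=True),
...     privacy_unit=dp.unit_of(contributions=36),
...     privacy_loss=dp.loss_of(epsilon=1.0),  # epsilon-only: no delta available
... )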
The sensitivity of the marginals released by the mechanism is derived in the same fashion as for any other OpenDP Polars marginal query, meaning they benefit from bounds on user contribution defined in the privacy unit or acquired through identifier truncation.