opendp.extras.mbi package#

Module contents#

This module requires extra installs: pip install 'opendp[mbi]'

mbi is short for “marginal-based inference”, and is the name of the Private-PGM package. OpenDP uses Private-PGM to postprocess DP releases made with the OpenDP Library.

For convenience, all the members of this module are also available from opendp.prelude. We suggest importing under the conventional name dp:

>>> import opendp.prelude as dp

The members of this module will then be accessible at dp.mbi.

class opendp.extras.mbi.AIM(*, estimator=<function mirror_descent>, oneway='all', oneway_split=None, queries=3, measure_split=0.9, max_size=80.0, rounds=None)[source]#

Bases: Algorithm

AIM mechanism from MMSM22.

Adaptively chooses and estimates the least-well-approximated marginal. The stronger the correlation amongst a clique of columns, the more likely AIM is to select the clique.

The algorithm starts with a small per-step privacy budget, and in each step increases the budget if the last measured marginal doesn’t sufficiently improve the model.

>>> import opendp.prelude as dp
>>> import polars as pl

>>> dp.enable_features("contrib")

>>> context = dp.Context.compositor(
...     data=pl.scan_csv(dp.examples.get_france_lfs_path(), ignore_errors=True),
...     privacy_unit=dp.unit_of(contributions=36),
...     privacy_loss=dp.loss_of(rho=0.1, delta=1e-7),
... )

>>> table_aim = (
...     context.query(rho=0.1, delta=1e-7)
...     # transformations/truncation may be applied here
...     .select("SEX", "AGE", "HWUSUAL", "ILOSTAT")
...     .contingency_table(
...         keys={"SEX": [1, 2]}, 
...         cuts={"AGE": [20, 40, 60], "HWUSUAL": [1, 20, 40]},
...         algorithm=dp.mbi.AIM()
...     )
...     .release()
... )

>>> table_aim.synthesize() 
shape: (3_807_732, 4)
┌─────┬───────────┬───────────┬─────────┐
│ SEX ┆ AGE       ┆ HWUSUAL   ┆ ILOSTAT │
│ --- ┆ ---       ┆ ---       ┆ ---     │
│ i64 ┆ f64       ┆ f64       ┆ i64     │
╞═════╪═══════════╪═══════════╪═════════╡
│ 1   ┆ 55.446336 ┆ 20.776579 ┆ 1       │
│ 1   ┆ 28.21838  ┆ 40.53348  ┆ 1       │
│ 2   ┆ 43.291215 ┆ 34.406155 ┆ 1       │
│ 1   ┆ 55.106615 ┆ 22.413161 ┆ 1       │
│ 2   ┆ 42.585227 ┆ 40.11279  ┆ 3       │
│ …   ┆ …         ┆ …         ┆ …       │
│ 1   ┆ 58.197292 ┆ 40.139579 ┆ 1       │
│ 1   ┆ 59.371221 ┆ 19.671153 ┆ 1       │
│ 2   ┆ 19.862917 ┆ 40.339046 ┆ 9       │
│ 1   ┆ 19.492355 ┆ 32.233661 ┆ 1       │
│ 2   ┆ 60.863244 ┆ 40.737908 ┆ 3       │
└─────┴───────────┴───────────┴─────────┘

The algorithm is similar to the Multiplicative Weights Exponential Mechanism (MWEM) introduced in HLM12, in that the exponential mechanism selects a marginal in each step.

AIM differs from MWEM in that the distribution is represented via a graphical model and fitted via mirror descent, instead of joint distribution densities and the multiplicative weights update rule. This allows AIM to support higher-dimensional datasets.
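To make the contrast concrete, here is a minimal, purely illustrative sketch of the multiplicative-weights update that MWEM applies to an explicit joint density (this is not the library's implementation, and the toy density and query are hypothetical):

```python
import math

def multiplicative_weights_step(density, query, noisy_answer, eta=1.0):
    """One illustrative MWEM-style update: reweight cells of an explicit
    joint density toward a noisy query answer.

    `density` maps cells to probabilities, `query` maps a cell to 0/1
    membership, and `noisy_answer` is the (noisy) query total."""
    current = sum(p for cell, p in density.items() if query(cell))
    error = noisy_answer - current
    # Cells inside the query are exponentially up/down-weighted by the error.
    updated = {
        cell: p * math.exp(eta * error * query(cell))
        for cell, p in density.items()
    }
    total = sum(updated.values())
    return {cell: p / total for cell, p in updated.items()}

# Toy 2x2 joint density over (SEX, EMPLOYED); the query counts EMPLOYED == 1.
density = {(1, 0): 0.25, (1, 1): 0.25, (2, 0): 0.25, (2, 1): 0.25}
updated = multiplicative_weights_step(density, lambda cell: cell[1] == 1, 0.7)
```

The explicit density above grows exponentially with the number of columns, which is exactly what AIM's graphical-model representation avoids.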

Parameters:
  • estimator (Callable) –

  • oneway (Literal['all', 'unkeyed']) –

  • oneway_split (float | None) –

  • queries (list[Count] | int) –

  • measure_split (float) –

  • max_size (float) –

  • rounds (int | None) –

make_marginals(input_domain, input_metric, output_measure, d_in, d_out, *, marginals, model)[source]#

Implements AIM (Adaptive and Iterative Mechanism) over ordinal data.

Parameters:
  • input_domain (LazyFrameDomain) – domain of input data

  • input_metric (FrameDistance) – how to compute distance between datasets

  • output_measure (Measure) – how to measure privacy of release

  • d_in (list[Bound]) – distance between adjacent input datasets

  • d_out (float) – upper bound on the privacy loss

  • marginals (dict[tuple[str, ...], Any]) – prior marginal releases

  • model – warm-start fit of MarkovRandomField

Return type:

Measurement

max_size: float = 80.0#

Maximum memory constraint in MB for the marginal selection.

measure_split: float = 0.9#

Remaining proportion of budget to allocate to measuring marginals.

The complement is spent on selecting marginals.

queries: list[Count] | int = 3#

Explicit workload of interactions, or maximum degree of interactions to consider.
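When queries is an integer, one way to read "maximum degree of interactions" is that every column subset up to that size is a candidate clique. A hypothetical sketch (candidate_cliques is not part of the API):

```python
from itertools import combinations

def candidate_cliques(columns, max_degree):
    """Enumerate all column subsets of size 1..max_degree, i.e. the
    interactions considered when `queries` is an int. Illustrative only."""
    return [
        clique
        for degree in range(1, max_degree + 1)
        for clique in combinations(columns, degree)
    ]

# With 4 columns and queries=3: 4 one-way + 6 two-way + 4 three-way cliques.
cliques = candidate_cliques(["SEX", "AGE", "HWUSUAL", "ILOSTAT"], max_degree=3)
```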

rounds: int | None = None#

Maximum number of rounds to run the algorithm.

class opendp.extras.mbi.Algorithm(*, estimator=<function mirror_descent>, oneway='all', oneway_split=None)[source]#

Bases: ABC

Base class for configuration of contingency table algorithms.

Parameters:
  • estimator (Callable) –

  • oneway (Literal['all', 'unkeyed']) –

  • oneway_split (float | None) –

estimator(loss_fn, *, potentials=None)#

Optimizer to use to fit a MarkovRandomField.

Defaults to opendp.extras.mbi.mirror_descent(). Any function matching the signature of mirror_descent can be used to customize how the MarkovRandomField is optimized/estimated. See mbi.estimation for other optimizers.

abstract make_marginals(input_domain, input_metric, output_measure, d_in, d_out, *, marginals, model)[source]#

Parameters:
  • input_domain (LazyFrameDomain) – domain of input data

  • input_metric (FrameDistance) – how to compute distance between datasets

  • output_measure (Measure) – how to measure privacy of release

  • d_in (list[Bound]) – distance between adjacent input datasets

  • d_out (float) – upper bound on the privacy loss

  • marginals (dict[tuple[str, ...], Any]) – prior marginal releases

  • model – warm-start fit of MarkovRandomField

Return type:

Measurement

oneway: Literal['all', 'unkeyed'] = 'all'#

Fit one-way marginals for all columns, or only unkeyed columns.

oneway_split: float | None = None#

Proportion of budget to use for oneway release.

When oneway_split is not set, defaults to half of the budget.

If oneway is unkeyed, budget is further reduced by the proportion of columns with missing keys or cuts. That is, when all columns have keys, then oneway_split is zero, and when no columns have keys, then oneway_split is one-half.
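The budget rule above can be sketched numerically. This hypothetical helper (not part of the API) covers the oneway="unkeyed" case, where the default half split is scaled by the fraction of columns missing keys or cuts:

```python
def effective_oneway_split(n_columns, n_unkeyed, oneway_split=None):
    """Sketch of the oneway budget rule for oneway="unkeyed".
    Hypothetical helper, not the library's implementation."""
    # When oneway_split is not set, it defaults to half of the budget.
    base = 0.5 if oneway_split is None else oneway_split
    # Scale by the proportion of columns with missing keys or cuts.
    return base * (n_unkeyed / n_columns)

# All columns keyed -> 0.0; no columns keyed -> 0.5; half keyed -> 0.25.
```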

class opendp.extras.mbi.ContingencyTable(keys, cuts, marginals, model, thresholds=<factory>)[source]#

Approximate a k-way contingency table via low-order marginals.

Parameters:
  • keys (dict[str, Any]) –

  • cuts (dict[str, Any]) –

  • marginals (dict[tuple[str, ...], Any]) –

  • model (Any) –

  • thresholds (dict[str, int]) –

cuts: dict[str, Any]#

Cut points (left-open) for each numeric column
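Left-open means a cut value itself falls into the bin below it: with cuts [20, 40, 60], bin 1 covers the interval (20, 40]. A small sketch of this binning convention using the standard library (illustrative, not the library's internal code):

```python
from bisect import bisect_left

def bin_index(value, cuts):
    """Left-open binning: bin i covers (cuts[i-1], cuts[i]], with
    open-ended bins below the first cut and above the last."""
    return bisect_left(cuts, value)

cuts = [20, 40, 60]  # the AGE cuts from the examples above
# bin 0: (-inf, 20], bin 1: (20, 40], bin 2: (40, 60], bin 3: (60, inf)
```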

keys: dict[str, Any]#

Unique keys for each column

marginals: dict[tuple[str, ...], Any]#

Mapping from clique (a tuple of column names) to the LinearMeasurement used to fit the model

model: Any#

MarkovRandomField spanning the same columns as keys

project(attrs)[source]#

Return an ndarray of counts for each combination of self.keys over the columns in attrs.

Parameters:

attrs (str | Sequence[str]) – attributes to preserve. All other attributes are marginalized.

project_melted(attrs)[source]#

Return a dataframe of counts for each combination of self.keys over the columns in attrs.

Parameters:

attrs (str | Sequence[str]) – attributes to preserve. All other attributes are marginalized.

schema#

Returns the data schema.

std(attrs)[source]#

Estimate the standard deviation of the count estimate. The estimate does not take cuts into account.

When marginals are estimated with Gaussian noise, the consistent (variance-weighted) marginals are also Gaussian-distributed, so the standard deviation matches the noise scale parameter.

When marginals are estimated with Laplace noise, the consistent marginals are sums of Laplace-distributed variables, which are not themselves Laplace-distributed, so the standard deviation does not correspond to a Laplace distribution. By the central limit theorem, the distribution tends towards Gaussian as more variables are summed.

If the estimate is Gaussian-distributed, use opendp.accuracy.gaussian_scale_to_accuracy() to construct a confidence interval.

Parameters:

attrs (Sequence[str]) – attributes to preserve in uncertainty estimate

Return type:

float
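Assuming the estimate is Gaussian with the scale returned by std(), the accuracy bound is the scale times a standard-normal quantile. A stand-in sketch using only the standard library (gaussian_accuracy is hypothetical; use opendp.accuracy.gaussian_scale_to_accuracy() in practice):

```python
from statistics import NormalDist

def gaussian_accuracy(scale, alpha):
    """Sketch of a (1 - alpha) accuracy bound for a Gaussian-distributed
    count: the true count lies within +/- this bound of the estimate
    with probability 1 - alpha."""
    return scale * NormalDist().inv_cdf(1 - alpha / 2)

# A 95% interval is roughly +/- 1.96 standard deviations.
acc = gaussian_accuracy(scale=10.0, alpha=0.05)
```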

synthesize(rows=None, method='round')[source]#

Generate synthetic data that is consistent with the contingency table.

Parameters:
  • rows (int | None) – if not set, approximately matches number of rows in original data

  • method (Literal['round', 'sample']) – “round” rounds to projection, “sample” samples from densities

thresholds: dict[str, int]#

Cut-off point for discovered stable keys. Any category appearing fewer than threshold times is attributed to the null category.
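A minimal sketch of that thresholding rule on a toy column (apply_threshold is a hypothetical helper, not the library's key-discovery code):

```python
from collections import Counter

def apply_threshold(values, threshold, null_category=None):
    """Categories appearing fewer than `threshold` times are attributed
    to the null category; the rest are kept as stable keys."""
    counts = Counter(values)
    return [v if counts[v] >= threshold else null_category for v in values]

# "b" appears only once, below the threshold of 2, so it becomes null.
data = ["a", "a", "a", "b", "c", "c"]
stable = apply_threshold(data, threshold=2)
```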

class opendp.extras.mbi.Count(by, weight=1.0)[source]#

Denotes a count query.

Parameters:
  • by (tuple[str, ...]) –

  • weight (float) –

by: tuple[str, ...]#

Columns to group by.

weight: float = 1.0#

Importance of this count query.

  • Used by AIM to prioritize cliques.

  • Used by Fixed to distribute privacy budget.
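One natural reading of weight-based budget distribution is a proportional split, sketched below (split_budget is a hypothetical helper; the library's exact allocation rule may differ):

```python
def split_budget(total, weights):
    """Divide a privacy budget across queries in proportion to their
    Count weights. Illustrative only."""
    scale = total / sum(weights)
    return [w * scale for w in weights]

# Three Count queries, with the two-way marginals weighted twice as heavily.
shares = split_budget(0.09, [1.0, 2.0, 2.0])
```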

class opendp.extras.mbi.Fixed(*, estimator=<function mirror_descent>, oneway='unkeyed', oneway_split=None, queries)[source]#

Bases: Algorithm

Releases all queries in a fixed workload.

Parameters:
  • estimator (Callable) –

  • oneway (Literal['all', 'unkeyed']) –

  • oneway_split (float | None) –

  • queries (list[Count]) –

make_marginals(input_domain, input_metric, output_measure, d_in, d_out, *, marginals, model)[source]#

Implements a “Fixed” algorithm over ordinal data.

The algorithm estimates fixed cliques.

Parameters:
  • input_domain (LazyFrameDomain) – domain of input data

  • input_metric (FrameDistance) – how to compute distance between datasets

  • output_measure (Measure) – how to measure privacy of release

  • d_in (list[Bound]) – distance between adjacent input datasets

  • d_out (float) – upper bound on the privacy loss

  • marginals (dict[tuple[str, ...], Any]) – prior marginal releases

  • model (Any) – warm-start fit of MarkovRandomField

oneway: Literal['all', 'unkeyed'] = 'unkeyed'#

Only fit one-way marginals for columns missing keys.

The Fixed algorithm differs from the other algorithms in that it only estimates one-way marginals for columns with missing keys, not all first-order marginals.

queries: list[Count]#

Workload of queries.

class opendp.extras.mbi.MST(*, estimator=<function mirror_descent>, oneway='all', oneway_split=None, measure_split=0.9, num_selections=None)[source]#

Bases: Algorithm

MST mechanism from MMS21.

MST greedily chooses pairs of columns that are most poorly represented by the DP contingency table in a way that guarantees all columns become connected by a minimum spanning tree. MST then releases all of the selected marginals.

>>> import opendp.prelude as dp
>>> import polars as pl

>>> dp.enable_features("contrib")

>>> context = dp.Context.compositor(
...     data=pl.scan_csv(dp.examples.get_france_lfs_path(), ignore_errors=True),
...     privacy_unit=dp.unit_of(contributions=36),
...     privacy_loss=dp.loss_of(rho=0.1, delta=1e-7),
... )

>>> table_mst = (
...     context.query(rho=0.1, delta=1e-7)
...     # transformations/truncation may be applied here
...     .select("SEX", "AGE", "HWUSUAL", "ILOSTAT")
...     .contingency_table(
...         keys={"SEX": [1, 2]}, 
...         cuts={"AGE": [20, 40, 60], "HWUSUAL": [1, 20, 40]},
...         algorithm=dp.mbi.MST()
...     )
...     .release()
... )

>>> table_mst.synthesize() 
shape: (3_807_732, 4)
┌─────┬───────────┬───────────┬─────────┐
│ SEX ┆ AGE       ┆ HWUSUAL   ┆ ILOSTAT │
│ --- ┆ ---       ┆ ---       ┆ ---     │
│ i64 ┆ f64       ┆ f64       ┆ i64     │
╞═════╪═══════════╪═══════════╪═════════╡
│ 1   ┆ 55.446336 ┆ 20.776579 ┆ 1       │
│ 1   ┆ 28.21838  ┆ 40.53348  ┆ 1       │
│ 2   ┆ 43.291215 ┆ 34.406155 ┆ 1       │
│ 1   ┆ 55.106615 ┆ 22.413161 ┆ 1       │
│ 2   ┆ 42.585227 ┆ 40.11279  ┆ 3       │
│ …   ┆ …         ┆ …         ┆ …       │
│ 1   ┆ 58.197292 ┆ 40.139579 ┆ 1       │
│ 1   ┆ 59.371221 ┆ 19.671153 ┆ 1       │
│ 2   ┆ 19.862917 ┆ 40.339046 ┆ 9       │
│ 1   ┆ 19.492355 ┆ 32.233661 ┆ 1       │
│ 2   ┆ 60.863244 ┆ 40.737908 ┆ 3       │
└─────┴───────────┴───────────┴─────────┘

Parameters:
  • estimator (Callable) –

  • oneway (Literal['all', 'unkeyed']) –

  • oneway_split (float | None) –

  • measure_split (float) –

  • num_selections (int | None) –

make_marginals(input_domain, input_metric, output_measure, d_in, d_out, *, marginals, model)[source]#

Implements MST (Minimum Spanning Tree) over ordinal data.

Parameters:
  • input_domain (LazyFrameDomain) – domain of input data

  • input_metric (FrameDistance) – how to compute distance between datasets

  • output_measure (Measure) – how to measure privacy of release

  • d_in (list[Bound]) – distance between adjacent input datasets

  • d_out (float) – upper bound on the privacy loss

  • marginals (dict[tuple[str, ...], Any]) – prior marginal releases

  • model (Any) – warm-start fit of MarkovRandomField

Return type:

Measurement

measure_split: float = 0.9#

Remaining proportion of budget to allocate to measuring marginals.

The complement is spent on selecting marginals.

num_selections: int | None = None#

Number of second-order marginals to estimate.

Defaults to one fewer than the number of columns in the data.
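The default follows from graph theory: a spanning tree over n columns has exactly n - 1 edges. A greedy Kruskal-style sketch of tree-constrained pair selection (the pair scores are hypothetical stand-ins, and the real MST mechanism selects noisily via the exponential mechanism rather than deterministically):

```python
def spanning_tree_pairs(columns, scores):
    """Greedily pick the highest-scoring column pairs that connect two
    previously disconnected components, yielding len(columns) - 1 pairs.
    Illustrative only; scores stand in for how poorly each pair is
    currently approximated."""
    parent = {c: c for c in columns}

    def find(c):  # union-find root lookup with path compression
        while parent[c] != c:
            parent[c] = parent[parent[c]]
            c = parent[c]
        return c

    selected = []
    for pair in sorted(scores, key=scores.get, reverse=True):
        a, b = find(pair[0]), find(pair[1])
        if a != b:  # keep only pairs that join two components
            parent[a] = b
            selected.append(pair)
    return selected

scores = {  # hypothetical approximation-error scores per column pair
    ("SEX", "AGE"): 0.2,
    ("SEX", "ILOSTAT"): 0.9,
    ("AGE", "ILOSTAT"): 0.8,
    ("AGE", "HWUSUAL"): 0.7,
    ("SEX", "HWUSUAL"): 0.1,
}
pairs = spanning_tree_pairs(["SEX", "AGE", "HWUSUAL", "ILOSTAT"], scores)
```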

class opendp.extras.mbi.Sequential(*, estimator=<function mirror_descent>, oneway='all', oneway_split=None, algorithms, weights=None)[source]#

Bases: Algorithm

Release multiple algorithms sequentially.

Use this in conjunction with Fixed to ensure specific marginals are estimated.

Parameters:
  • estimator (Callable) –

  • oneway (Literal['all', 'unkeyed']) –

  • oneway_split (float | None) –

  • algorithms (list[Algorithm]) –

  • weights (list[float] | None) –

algorithms: list[Algorithm]#

Sequence of algorithms.

make_marginals(input_domain, input_metric, output_measure, d_in, d_out, *, marginals, model)[source]#

Implements a “Sequential” algorithm over ordinal data.

The algorithm applies each algorithm sequentially.

Parameters:
  • input_domain (LazyFrameDomain) – domain of input data

  • input_metric (FrameDistance) – how to compute distance between datasets

  • output_measure (Measure) – how to measure privacy of release

  • d_in (list[Bound]) – distance between adjacent input datasets

  • d_out (float) – upper bound on the privacy loss

  • marginals (dict[tuple[str, ...], Any]) – prior marginal releases

  • model (Any) – warm-start fit of MarkovRandomField

weights: list[float] | None = None#

Budget allocation amongst algorithms.

Defaults to equal budget for each algorithm.

opendp.extras.mbi.make_contingency_table(input_domain, input_metric, output_measure, d_in, d_out, keys=None, cuts=None, table=None, algorithm=AIM(estimator=<function mirror_descent>, oneway='all', oneway_split=None, queries=3, measure_split=0.9, max_size=80.0, rounds=None))[source]#

Return a measurement that releases a ContingencyTable, as well as the noise scale and optional threshold for one-way marginals.

Parameters:
  • input_domain (LazyFrameDomain) – domain of input data

  • input_metric (FrameDistance) – how to compute distance between datasets

  • output_measure (Measure) – how to measure privacy of release

  • d_in (list[Bound]) – bounds on distances between adjacent input datasets

  • d_out (float | tuple[float, float]) – upper bound on the privacy loss

  • keys (Mapping[str, Sequence] | None) – dictionary of column names and unique categories

  • cuts (Mapping[str, Sequence[float]] | None) – dictionary of column names and left-open bin edges

  • table (ContingencyTable | None) – ContingencyTable from prior release

  • algorithm (Algorithm) – configuration for internal estimation algorithm

Return type:

tuple[Measurement, float | None, int | None]

opendp.extras.mbi.mirror_descent(domain, loss_fn, *, potentials=None)[source]#

Fit a MarkovRandomField over the domain and loss function using mirror descent.

Replicate the API of this function to use other optimizers from Private-PGM.
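A custom estimator only needs to match this call shape. The stub below shows the required signature; its body is a placeholder rather than a working optimizer (a real implementation would return a fitted MarkovRandomField, e.g. by delegating to an optimizer in mbi.estimation):

```python
import inspect

def my_estimator(domain, loss_fn, *, potentials=None):
    """Hypothetical custom estimator matching the signature of
    opendp.extras.mbi.mirror_descent. Must return a fitted
    MarkovRandomField; the body here is only a placeholder."""
    raise NotImplementedError

# Any Algorithm accepts such a callable, e.g. dp.mbi.AIM(estimator=my_estimator)
signature = inspect.signature(my_estimator)
```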