opendp.extras.mbi package#
Module contents#
This module requires extra installs: pip install 'opendp[mbi]'
“mbi” is short for “marginal-based inference”, and is the name of the Private-PGM package. OpenDP uses Private-PGM to postprocess DP releases made with the OpenDP Library.
For convenience, all the members of this module are also available from opendp.prelude. We suggest importing under the conventional name dp:
>>> import opendp.prelude as dp
The members of this module will then be accessible at dp.mbi.
- class opendp.extras.mbi.AIM(*, estimator=<function mirror_descent>, oneway='all', oneway_split=None, queries=3, measure_split=0.9, max_size=80.0, rounds=None)[source]#
Bases:
Algorithm
AIM mechanism from MMSM22.
Adaptively chooses and estimates the least-well-approximated marginal. The stronger the correlation amongst a clique of columns, the more likely AIM is to select the clique.
The algorithm starts with a small per-step privacy budget, and in each step increases the budget if the last measured marginal doesn’t sufficiently improve the model.
>>> import opendp.prelude as dp
>>> import polars as pl
>>> dp.enable_features("contrib")
>>> context = dp.Context.compositor(
...     data=pl.scan_csv(dp.examples.get_france_lfs_path(), ignore_errors=True),
...     privacy_unit=dp.unit_of(contributions=36),
...     privacy_loss=dp.loss_of(rho=0.1, delta=1e-7),
... )
>>> table_aim = (
...     context.query(rho=0.1, delta=1e-7)
...     # transformations/truncation may be applied here
...     .select("SEX", "AGE", "HWUSUAL", "ILOSTAT")
...     .contingency_table(
...         keys={"SEX": [1, 2]},
...         cuts={"AGE": [20, 40, 60], "HWUSUAL": [1, 20, 40]},
...         algorithm=dp.mbi.AIM()
...     )
...     .release()
... )
>>> table_aim.synthesize()
shape: (3_807_732, 4)
┌─────┬───────────┬───────────┬─────────┐
│ SEX ┆ AGE       ┆ HWUSUAL   ┆ ILOSTAT │
│ --- ┆ ---       ┆ ---       ┆ ---     │
│ i64 ┆ f64       ┆ f64       ┆ i64     │
╞═════╪═══════════╪═══════════╪═════════╡
│ 1   ┆ 55.446336 ┆ 20.776579 ┆ 1       │
│ 1   ┆ 28.21838  ┆ 40.53348  ┆ 1       │
│ 2   ┆ 43.291215 ┆ 34.406155 ┆ 1       │
│ 1   ┆ 55.106615 ┆ 22.413161 ┆ 1       │
│ 2   ┆ 42.585227 ┆ 40.11279  ┆ 3       │
│ …   ┆ …         ┆ …         ┆ …       │
│ 1   ┆ 58.197292 ┆ 40.139579 ┆ 1       │
│ 1   ┆ 59.371221 ┆ 19.671153 ┆ 1       │
│ 2   ┆ 19.862917 ┆ 40.339046 ┆ 9       │
│ 1   ┆ 19.492355 ┆ 32.233661 ┆ 1       │
│ 2   ┆ 60.863244 ┆ 40.737908 ┆ 3       │
└─────┴───────────┴───────────┴─────────┘
The algorithm is similar to the Multiplicative Weights Exponential Mechanism (MWEM) introduced in HLM10, in that the exponential mechanism selects a marginal in each step.
AIM differs from MWEM in that the distribution is represented via a graphical model and fitted via mirror descent, instead of joint distribution densities and the multiplicative weights update rule. This allows AIM to support higher-dimensional datasets.
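The exponential-mechanism selection step can be sketched as follows: each candidate clique gets a quality score (for AIM, roughly how poorly the current model approximates that clique's marginal), and a clique is drawn with probability proportional to exp(ε · score / 2Δ). This is an illustrative sketch, not AIM's actual scoring or calibration; the scores below are hypothetical.

```python
import math
import random

def exponential_mechanism(scores, epsilon, sensitivity=1.0):
    """Select a key with probability proportional to exp(eps * score / (2 * sensitivity))."""
    keys = list(scores)
    # subtract the max score for numerical stability; doesn't change the probabilities
    max_score = max(scores.values())
    weights = [math.exp(epsilon * (scores[k] - max_score) / (2 * sensitivity)) for k in keys]
    total = sum(weights)
    probs = [w / total for w in weights]
    return random.choices(keys, weights=probs)[0], dict(zip(keys, probs))

# hypothetical approximation errors for candidate cliques
scores = {("SEX", "AGE"): 120.0, ("AGE", "HWUSUAL"): 95.0, ("SEX", "ILOSTAT"): 40.0}
selected, probs = exponential_mechanism(scores, epsilon=0.1)
```

The higher a clique's score, the more likely it is selected, which matches the intuition above that strongly correlated cliques are favored.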
- Parameters:
estimator (Callable) –
oneway (Literal['all', 'unkeyed']) –
oneway_split (float | None) –
queries (list[Count] | int) –
measure_split (float) –
max_size (float) –
rounds (int | None) –
- make_marginals(input_domain, input_metric, output_measure, d_in, d_out, *, marginals, model)[source]#
Implements AIM (Adaptive and Iterative Mechanism) for ordinal data.
- Parameters:
input_domain (LazyFrameDomain) – domain of input data
input_metric (FrameDistance) – how to compute distance between datasets
output_measure (Measure) – how to measure privacy of release
d_in (list[Bound]) – distance between adjacent input datasets
d_out (float) – upper bound on the privacy loss
marginals (dict[tuple[str, ...], Any]) – prior marginal releases
model – warm-start fit of MarkovRandomField
- Return type:
- max_size: float = 80.0#
Maximum memory constraint in MB for the marginal selection.
- measure_split: float = 0.9#
Proportion of the remaining budget to allocate to measuring marginals.
The complement is spent on selecting marginals.
- queries: list[Count] | int = 3#
Explicit workload of interactions, or maximum degree of interactions to consider.
- rounds: int | None = None#
Maximum number of rounds to run the algorithm.
- class opendp.extras.mbi.Algorithm(*, estimator=<function mirror_descent>, oneway='all', oneway_split=None)[source]#
Bases:
ABC
Base class for configuration of contingency table algorithms.
- Parameters:
estimator (Callable) –
oneway (Literal['all', 'unkeyed']) –
oneway_split (float | None) –
- estimator(loss_fn, *, potentials=None)#
Optimizer to use to fit a MarkovRandomField.
Defaults to opendp.extras.mbi.mirror_descent(). Any function matching the signature of mirror_descent can be used to customize how the MarkovRandomField is optimized/estimated. See mbi.estimation for other optimizers.
- abstract make_marginals(input_domain, input_metric, output_measure, d_in, d_out, *, marginals, model)[source]#
- Parameters:
input_domain (LazyFrameDomain) –
input_metric (FrameDistance) –
output_measure (Measure) –
d_in (list[Bound]) –
d_out (float) –
marginals (dict[tuple[str, ...], Any]) –
- Return type:
- oneway: Literal['all', 'unkeyed'] = 'all'#
Fit one-way marginals for all columns, or only unkeyed columns.
- oneway_split: float | None = None#
Proportion of budget to use for oneway release.
When oneway_split is not set, it defaults to half of the budget.
If oneway is unkeyed, the budget is further reduced by the proportion of columns with missing keys or cuts. That is, when all columns have keys, then oneway_split is zero, and when no columns have keys, then oneway_split is one-half.
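The default budget rule just described can be sketched as a small helper. This is a reading of the documented behavior, not the library's actual internals:

```python
def default_oneway_split(oneway, columns, keyed_columns):
    """Default proportion of budget spent on one-way marginals, per the rule above."""
    base = 0.5  # half the budget when oneway_split is not set
    if oneway == "all":
        return base
    # "unkeyed": scale down by the proportion of columns that already have keys/cuts
    unkeyed = [c for c in columns if c not in keyed_columns]
    return base * len(unkeyed) / len(columns)

# all columns keyed -> 0.0; no columns keyed -> 0.5; half keyed -> 0.25
```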
- class opendp.extras.mbi.ContingencyTable(keys, cuts, marginals, model, thresholds=<factory>)[source]#
Approximate a k-way contingency table via low-order marginals.
- Parameters:
keys (dict[str, Any]) –
cuts (dict[str, Any]) –
marginals (dict[tuple[str, ...], Any]) –
model (Any) –
thresholds (dict[str, int]) –
- cuts: dict[str, Any]#
Cut points (left-open) for each numeric column
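Left-open cuts place a value v in bin i when cuts[i-1] < v <= cuts[i]. A sketch of this binning convention (not the library's implementation), using the standard library's bisect:

```python
from bisect import bisect_left

def bin_index(value, cuts):
    """Index of the left-open bin (cuts[i-1], cuts[i]] containing value.

    Bin 0 is (-inf, cuts[0]]; bin len(cuts) is (cuts[-1], +inf).
    """
    return bisect_left(cuts, value)

cuts = [20, 40, 60]
# 20 falls in bin 0 = (-inf, 20]; 21 falls in bin 1 = (20, 40]
```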
- keys: dict[str, Any]#
Unique keys for each column
- marginals: dict[tuple[str, ...], Any]#
Mapping from clique name to LinearMeasurement used to fit the model
- model: Any#
MarkovRandomField spanning the same columns as keys
- project(attrs)[source]#
Return counts for each combination of self.keys over attrs, as an ndarray.
- Parameters:
attrs (str | Sequence[str]) – attributes to preserve. All other attributes are marginalized.
- project_melted(attrs)[source]#
Return counts for each combination of self.keys over attrs, as a dataframe.
- Parameters:
attrs (str | Sequence[str]) – attributes to preserve. All other attributes are marginalized.
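Projecting a contingency table onto a subset of attributes amounts to summing over the axes being marginalized out. A numpy sketch of the idea (not the library's code):

```python
import numpy as np

def project(table, axes_names, attrs):
    """Sum out every axis not in attrs; return counts with axes ordered as attrs."""
    drop = tuple(i for i, name in enumerate(axes_names) if name not in attrs)
    out = table.sum(axis=drop)
    # reorder the surviving axes to match the requested attribute order
    remaining = [name for name in axes_names if name in attrs]
    return out.transpose([remaining.index(a) for a in attrs])

# a 2x3x2 table of counts over ("SEX", "AGE", "ILOSTAT")
table = np.arange(12).reshape(2, 3, 2)
sex_age = project(table, ["SEX", "AGE", "ILOSTAT"], ["SEX", "AGE"])  # shape (2, 3)
```

Note that totals are preserved under projection: every marginal sums to the same grand total as the full table.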
- schema#
Returns the data schema.
- std(attrs)[source]#
Estimate the standard deviation of the count estimate. The estimate does not take cuts into account.
When marginals are estimated with gaussian noise, the consistent (variance-weighted) marginals are also gaussian-distributed. Therefore the standard deviation matches the noise scale parameter.
When marginals are estimated with laplace noise, the consistent marginals are a sum of laplace random variables, which is not laplace-distributed. Therefore the standard deviation doesn’t correspond to the laplace distribution. Under the central limit theorem, the distribution will tend towards gaussian.
If the estimate is gaussian-distributed, use opendp.accuracy.gaussian_scale_to_accuracy() to construct a confidence interval.
- Parameters:
attrs (Sequence[str]) – attributes to preserve in uncertainty estimate
- Return type:
float
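The variance-weighted combination mentioned above follows the standard rule for independent gaussian estimates: weighting each measurement by the inverse of its variance yields a combined estimate with standard deviation 1/sqrt(Σ 1/sᵢ²). A small illustrative sketch (not the library's code):

```python
import math

def combined_gaussian_scale(scales):
    """Std of the inverse-variance-weighted mean of independent gaussian estimates."""
    precision = sum(1 / s**2 for s in scales)
    return 1 / math.sqrt(precision)

# two measurements at scale 1.0 halve the variance: combined scale = 1/sqrt(2)
```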
- synthesize(rows=None, method='round')[source]#
Generate synthetic data that is consistent with the contingency table.
- Parameters:
rows (int | None) – if not set, approximately matches number of rows in original data
method (Literal['round', 'sample']) – “round” rounds to projection, “sample” samples from densities
- thresholds: dict[str, int]#
Cut-off point for discovered stable keys. Any category appearing fewer than threshold times is attributed to the null category.
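The thresholding rule can be sketched as follows: any discovered category whose count falls below the threshold is folded into the null category. This is a sketch of the documented behavior; the null label and category names here are hypothetical.

```python
def apply_threshold(counts, threshold, null_key=None):
    """Fold categories with count below threshold into the null category."""
    out = {null_key: counts.get(null_key, 0)}
    for key, count in counts.items():
        if key is null_key:
            continue
        if count < threshold:
            out[null_key] += count  # rare category attributed to null
        else:
            out[key] = count
    return out

counts = {"employed": 900, "student": 40, "rare_status": 3}
folded = apply_threshold(counts, threshold=10)
```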
- class opendp.extras.mbi.Count(by, weight=1.0)[source]#
Denotes a count query.
- Parameters:
by (tuple[str, ...]) –
weight (float) –
- by: tuple[str, ...]#
Columns to group by.
- weight: float = 1.0#
Importance of this count query.
Used by AIM to prioritize cliques.
Used by Fixed to distribute privacy budget.
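For Fixed, the per-query budget split implied by weights can be sketched as proportional allocation. This is a reading of the documented role of weight, not the library's internals:

```python
def allocate_budget(d_out, weights):
    """Split a total privacy budget proportionally to query weights."""
    total = sum(weights)
    return [d_out * w / total for w in weights]

# rho=0.1 split across three queries weighted 1, 1, 2
shares = allocate_budget(0.1, [1.0, 1.0, 2.0])  # approximately [0.025, 0.025, 0.05]
```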
- class opendp.extras.mbi.Fixed(*, estimator=<function mirror_descent>, oneway='unkeyed', oneway_split=None, queries)[source]#
Bases:
Algorithm
Releases all queries in a fixed workload.
- Parameters:
estimator (Callable) –
oneway (Literal['all', 'unkeyed']) –
oneway_split (float | None) –
queries (list[Count]) –
- make_marginals(input_domain, input_metric, output_measure, d_in, d_out, *, marginals, model)[source]#
Implements a “Fixed” algorithm over ordinal data.
The algorithm estimates fixed cliques.
- Parameters:
input_domain (LazyFrameDomain) – domain of input data
input_metric (FrameDistance) – how to compute distance between datasets
output_measure (Measure) – how to measure privacy of release
d_in (list[Bound]) – distance between adjacent input datasets
d_out (float) – upper bound on the privacy loss
marginals (dict[tuple[str, ...], Any]) – prior marginal releases
model (Any) – warm-start fit of MarkovRandomField
- oneway: Literal['all', 'unkeyed'] = 'unkeyed'#
Only fit one-way marginals for columns missing keys.
The fixed algorithm differs from other algorithms in that it only estimates marginals with missing keys, not all unknown first-order marginals.
- class opendp.extras.mbi.MST(*, estimator=<function mirror_descent>, oneway='all', oneway_split=None, measure_split=0.9, num_selections=None)[source]#
Bases:
Algorithm
MST mechanism from MMS21.
MST greedily chooses pairs of columns that are most poorly represented by the DP contingency table in a way that guarantees all columns become connected by a minimum spanning tree. MST then releases all of the selected marginals.
>>> import opendp.prelude as dp
>>> import polars as pl
>>> dp.enable_features("contrib")
>>> context = dp.Context.compositor(
...     data=pl.scan_csv(dp.examples.get_france_lfs_path(), ignore_errors=True),
...     privacy_unit=dp.unit_of(contributions=36),
...     privacy_loss=dp.loss_of(rho=0.1, delta=1e-7),
... )
>>> table_mst = (
...     context.query(rho=0.1, delta=1e-7)
...     # transformations/truncation may be applied here
...     .select("SEX", "AGE", "HWUSUAL", "ILOSTAT")
...     .contingency_table(
...         keys={"SEX": [1, 2]},
...         cuts={"AGE": [20, 40, 60], "HWUSUAL": [1, 20, 40]},
...         algorithm=dp.mbi.MST()
...     )
...     .release()
... )
>>> table_mst.synthesize()
shape: (3_807_732, 4)
┌─────┬───────────┬───────────┬─────────┐
│ SEX ┆ AGE       ┆ HWUSUAL   ┆ ILOSTAT │
│ --- ┆ ---       ┆ ---       ┆ ---     │
│ i64 ┆ f64       ┆ f64       ┆ i64     │
╞═════╪═══════════╪═══════════╪═════════╡
│ 1   ┆ 55.446336 ┆ 20.776579 ┆ 1       │
│ 1   ┆ 28.21838  ┆ 40.53348  ┆ 1       │
│ 2   ┆ 43.291215 ┆ 34.406155 ┆ 1       │
│ 1   ┆ 55.106615 ┆ 22.413161 ┆ 1       │
│ 2   ┆ 42.585227 ┆ 40.11279  ┆ 3       │
│ …   ┆ …         ┆ …         ┆ …       │
│ 1   ┆ 58.197292 ┆ 40.139579 ┆ 1       │
│ 1   ┆ 59.371221 ┆ 19.671153 ┆ 1       │
│ 2   ┆ 19.862917 ┆ 40.339046 ┆ 9       │
│ 1   ┆ 19.492355 ┆ 32.233661 ┆ 1       │
│ 2   ┆ 60.863244 ┆ 40.737908 ┆ 3       │
└─────┴───────────┴───────────┴─────────┘
- Parameters:
estimator (Callable) –
oneway (Literal['all', 'unkeyed']) –
oneway_split (float | None) –
measure_split (float) –
num_selections (int | None) –
- make_marginals(input_domain, input_metric, output_measure, d_in, d_out, *, marginals, model)[source]#
Implements MST (Minimum Spanning Tree) over ordinal data.
- Parameters:
input_domain (LazyFrameDomain) – domain of input data
input_metric (FrameDistance) – how to compute distance between datasets
output_measure (Measure) – how to measure privacy of release
d_in (list[Bound]) – distance between adjacent input datasets
d_out (float) – upper bound on the privacy loss
marginals (dict[tuple[str, ...], Any]) – prior marginal releases
model (Any) – warm-start fit of MarkovRandomField
- Return type:
- measure_split: float = 0.9#
Proportion of the remaining budget to allocate to measuring marginals.
The complement is spent on selecting marginals.
- num_selections: int | None = None#
Number of second-order marginals to estimate.
Defaults to one fewer than the number of columns in the data.
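The default of one fewer selections than columns is exactly the number of edges in a spanning tree. The spanning-tree constraint on MST's selection can be sketched with Kruskal's algorithm over pairwise scores: take pairs from worst-approximated to best, skipping any pair that would close a cycle, until all columns are connected. This sketches only the selection structure; MST's actual scores are noisy and chosen privately.

```python
def select_spanning_pairs(columns, scores):
    """Pick len(columns) - 1 pairs, highest score first, without forming cycles."""
    parent = {c: c for c in columns}

    def find(c):
        # union-find root lookup with path compression
        while parent[c] != c:
            parent[c] = parent[parent[c]]
            c = parent[c]
        return c

    selected = []
    for (a, b), _ in sorted(scores.items(), key=lambda kv: -kv[1]):
        ra, rb = find(a), find(b)
        if ra != rb:  # joining two components keeps the selection acyclic
            parent[ra] = rb
            selected.append((a, b))
    return selected

# hypothetical approximation errors per pair of columns
scores = {("A", "B"): 9.0, ("B", "C"): 7.0, ("A", "C"): 1.0}
pairs = select_spanning_pairs(["A", "B", "C"], scores)
```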
- class opendp.extras.mbi.Sequential(*, estimator=<function mirror_descent>, oneway='all', oneway_split=None, algorithms, weights=None)[source]#
Bases:
Algorithm
Release multiple algorithms sequentially.
Use this in conjunction with Fixed to ensure specific marginals are estimated.
- Parameters:
estimator (Callable) –
oneway (Literal['all', 'unkeyed']) –
oneway_split (float | None) –
algorithms (list[Algorithm]) –
weights (list[float] | None) –
- make_marginals(input_domain, input_metric, output_measure, d_in, d_out, *, marginals, model)[source]#
Implements a “Sequential” algorithm over ordinal data.
The algorithm applies each algorithm sequentially.
- Parameters:
input_domain (LazyFrameDomain) – domain of input data
input_metric (FrameDistance) – how to compute distance between datasets
output_measure (Measure) – how to measure privacy of release
d_in (list[Bound]) – distance between adjacent input datasets
d_out (float) – upper bound on the privacy loss
marginals (dict[tuple[str, ...], Any]) – prior marginal releases
model (Any) – warm-start fit of MarkovRandomField
- weights: list[float] | None = None#
Budget allocation amongst algorithms.
Defaults to equal budget for each algorithm.
- opendp.extras.mbi.make_contingency_table(input_domain, input_metric, output_measure, d_in, d_out, keys=None, cuts=None, table=None, algorithm=AIM(estimator=<function mirror_descent>, oneway='all', oneway_split=None, queries=3, measure_split=0.9, max_size=80.0, rounds=None))[source]#
Return a measurement that releases a ContingencyTable, as well as the noise scale and optional threshold for one-way marginals.
- Parameters:
input_domain (LazyFrameDomain) – domain of input data
input_metric (FrameDistance) – how to compute distance between datasets
output_measure (Measure) – how to measure privacy of release
d_in (list[Bound]) – bounds on distances between adjacent input datasets
d_out (float | tuple[float, float]) – upper bound on the privacy loss
keys (Mapping[str, Sequence] | None) – dictionary of column names and unique categories
cuts (Mapping[str, Sequence[float]] | None) – dictionary of column names and left-open bin edges
table (ContingencyTable | None) – ContingencyTable from prior release
algorithm (Algorithm) – configuration for internal estimation algorithm
- Return type:
tuple[Measurement, float | None, int | None]
- opendp.extras.mbi.mirror_descent(domain, loss_fn, *, potentials=None)[source]#
Fit a MarkovRandomField over the domain and loss function using mirror descent.
Replicate the API of this function to use other optimizers from Private-PGM.
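A minimal picture of what mirror descent does over a probability simplex: multiplicative weight updates in the direction of the negative loss gradient, renormalizing each step. This illustrates the optimizer family only; it is not Private-PGM's mirror_descent and does not match its signature.

```python
import math

def mirror_descent_simplex(grad, p, steps=2000, step_size=0.1):
    """Minimize a loss over the simplex via exponentiated-gradient updates."""
    for _ in range(steps):
        g = grad(p)
        # multiplicative update, then renormalize back onto the simplex
        p = [pi * math.exp(-step_size * gi) for pi, gi in zip(p, g)]
        total = sum(p)
        p = [pi / total for pi in p]
    return p

# fit p to a target distribution q by minimizing squared error
q = [0.5, 0.3, 0.2]
grad = lambda p: [2 * (pi - qi) for pi, qi in zip(p, q)]
p = mirror_descent_simplex(grad, [1 / 3] * 3)
```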