opendp.extras.mbi package#
Module contents#
This module requires extra installs: pip install 'opendp[mbi]'
mbi is short for “marginal-based inference”,
and is the name of the Private-PGM package.
OpenDP uses Private-PGM to postprocess DP releases made with the OpenDP Library.
For convenience, all the members of this module are also available from opendp.prelude.
We suggest importing under the conventional name dp:
>>> import opendp.prelude as dp
The members of this module will then be accessible at dp.mbi.
- class opendp.extras.mbi.AIM(*, estimator=<function mirror_descent>, oneway='all', oneway_split=None, queries=3, measure_split=0.9, max_size=80.0, rounds=None)[source]#
Bases: Algorithm
AIM mechanism from MMSM22.
Adaptively chooses and estimates the least-well-approximated marginal. The stronger the correlation amongst a clique of columns, the more likely AIM is to select the clique.
The algorithm starts with a small per-step privacy budget, and in each step increases the budget if the last measured marginal doesn’t sufficiently improve the model.
>>> import opendp.prelude as dp
>>> import polars as pl
>>> dp.enable_features("contrib")
>>> context = dp.Context.compositor(
...     data=pl.scan_csv(dp.examples.get_france_lfs_path(), ignore_errors=True),
...     privacy_unit=dp.unit_of(contributions=36),
...     privacy_loss=dp.loss_of(rho=0.1, delta=1e-7),
... )
>>> table_aim = (
...     context.query(rho=0.1, delta=1e-7)
...     # transformations/truncation may be applied here
...     .select("SEX", "AGE", "HWUSUAL", "ILOSTAT")
...     .contingency_table(
...         keys={"SEX": [1, 2]},
...         cuts={"AGE": [20, 40, 60], "HWUSUAL": [1, 20, 40]},
...         algorithm=dp.mbi.AIM()
...     )
...     .release()
... )
>>> table_aim.synthesize()
shape: (3_807_732, 4)
┌─────┬───────────┬───────────┬─────────┐
│ SEX ┆ AGE       ┆ HWUSUAL   ┆ ILOSTAT │
│ --- ┆ ---       ┆ ---       ┆ ---     │
│ i64 ┆ f64       ┆ f64       ┆ i64     │
╞═════╪═══════════╪═══════════╪═════════╡
│ 1   ┆ 55.446336 ┆ 20.776579 ┆ 1       │
│ 1   ┆ 28.21838  ┆ 40.53348  ┆ 1       │
│ 2   ┆ 43.291215 ┆ 34.406155 ┆ 1       │
│ 1   ┆ 55.106615 ┆ 22.413161 ┆ 1       │
│ 2   ┆ 42.585227 ┆ 40.11279  ┆ 3       │
│ …   ┆ …         ┆ …         ┆ …       │
│ 1   ┆ 58.197292 ┆ 40.139579 ┆ 1       │
│ 1   ┆ 59.371221 ┆ 19.671153 ┆ 1       │
│ 2   ┆ 19.862917 ┆ 40.339046 ┆ 9       │
│ 1   ┆ 19.492355 ┆ 32.233661 ┆ 1       │
│ 2   ┆ 60.863244 ┆ 40.737908 ┆ 3       │
└─────┴───────────┴───────────┴─────────┘
The algorithm is similar to the Multiplicative Weights Exponential Mechanism (MWEM) introduced in HLM12, in that the exponential mechanism selects a marginal in each step.
AIM differs from MWEM in that the distribution is represented via a graphical model and fitted via mirror descent, instead of joint distribution densities and the multiplicative weights update rule. This allows AIM to support higher-dimensional datasets.
- Parameters:
estimator (Callable) –
oneway (Literal['all', 'unkeyed']) –
oneway_split (float | None) –
queries (list[Count] | int) –
measure_split (float) –
max_size (float) –
rounds (int | None) –
- make_marginals(input_domain, input_metric, output_measure, d_in, d_out, *, marginals, model)[source]#
Implements AIM (Adaptive and Iterative Mechanism) over ordinal data.
- Parameters:
input_domain (LazyFrameDomain) – domain of input data
input_metric (FrameDistance) – how to compute distance between datasets
output_measure (Measure) – how to measure privacy of release
d_in (list[Bound]) – distance between adjacent input datasets
d_out (float) – upper bound on the privacy loss
marginals (dict[tuple[str, ...], Any]) – prior marginal releases
model – warm-start fit of MarkovRandomField
- Return type:
- max_size: float = 80.0#
Maximum memory constraint in MB for the marginal selection.
- measure_split: float = 0.9#
Proportion of the remaining budget to allocate to measuring marginals.
The complement is spent on selecting marginals.
- queries: list[Count] | int = 3#
Explicit workload of interactions, or maximum degree of interactions to consider.
- rounds: int | None = None#
Maximum number of rounds to run the algorithm.
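The constructor parameters above can be combined to steer AIM toward specific interactions. A hedged configuration sketch, reusing the column names from the doctest above; the weights and limits are illustrative only:

```python
import opendp.prelude as dp

# Instead of letting AIM consider all interactions up to degree 3,
# supply an explicit workload of weighted count queries. Higher weights
# make the exponential mechanism more likely to select that clique.
algorithm = dp.mbi.AIM(
    queries=[
        dp.mbi.Count(by=("SEX", "ILOSTAT"), weight=2.0),
        dp.mbi.Count(by=("AGE", "HWUSUAL")),
    ],
    rounds=10,      # cap the number of select/measure iterations
    max_size=40.0,  # tighten the memory constraint (in MB) for selection
)
```

The resulting `algorithm` can then be passed to `.contingency_table(..., algorithm=algorithm)` as in the example above.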
- class opendp.extras.mbi.Algorithm(*, estimator=<function mirror_descent>, oneway='all', oneway_split=None)[source]#
Bases: ABC
Base class for configuration of contingency table algorithms.
- Parameters:
estimator (Callable) –
oneway (Literal['all', 'unkeyed']) –
oneway_split (float | None) –
- estimator(loss_fn, *, potentials=None)#
Optimizer to use to fit a MarkovRandomField.
Defaults to opendp.extras.mbi.mirror_descent(). Any function matching the signature of mirror_descent can be used to customize how the MarkovRandomField is optimized/estimated. See mbi.estimation for other optimizers.
- abstract make_marginals(input_domain, input_metric, output_measure, d_in, d_out, *, marginals, model)[source]#
- Parameters:
input_domain (LazyFrameDomain) –
input_metric (FrameDistance) –
output_measure (Measure) –
d_in (list[Bound]) –
d_out (float) –
marginals (dict[tuple[str, ...], Any]) –
- Return type:
- oneway: Literal['all', 'unkeyed'] = 'all'#
Fit one-way marginals for all columns, or only unkeyed columns.
- oneway_split: float | None = None#
Proportion of budget to use for oneway release.
When oneway_split is not set, it defaults to half of the budget.
If oneway is unkeyed, the budget is further reduced by the proportion of columns with missing keys or cuts. That is, when all columns have keys, oneway_split is zero, and when no columns have keys, oneway_split is one-half.
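The default described above can be written out as a small helper. This is a hedged sketch of the documented behavior, not the library's internal code:

```python
def default_oneway_split(oneway: str, n_unkeyed: int, n_columns: int) -> float:
    """Effective proportion of budget spent on one-way marginals
    when oneway_split is not set, per the rules documented above."""
    base = 0.5  # defaults to half of the budget
    if oneway == "unkeyed":
        # reduced by the proportion of columns with missing keys or cuts
        return base * n_unkeyed / n_columns
    return base

# all columns have keys -> no budget spent on one-way marginals
assert default_oneway_split("unkeyed", 0, 4) == 0.0
# no columns have keys -> one-half of the budget
assert default_oneway_split("unkeyed", 4, 4) == 0.5
# oneway="all" always uses half of the budget
assert default_oneway_split("all", 2, 4) == 0.5
```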
- class opendp.extras.mbi.ContingencyTable(keys, cuts, marginals, model, thresholds=<factory>)[source]#
Approximate a k-way contingency table via low-order marginals.
- Parameters:
keys (dict[str, Any]) –
cuts (dict[str, Any]) –
marginals (dict[tuple[str, ...], Any]) –
model (Any) –
thresholds (dict[str, int]) –
- cuts: dict[str, Any]#
Cut points (left-open) for each numeric column
- keys: dict[str, Any]#
Unique keys for each column
- marginals: dict[tuple[str, ...], Any]#
Mapping from clique name to LinearMeasurement used to fit the model
- model: Any#
MarkovRandomField spanning the same columns as keys
- project(attrs)[source]#
Return counts for each combination of self.keys over the columns in attrs, as an ndarray.
- Parameters:
attrs (str | Sequence[str]) – attributes to preserve. All other attributes are marginalized.
- project_melted(attrs)[source]#
Return counts for each combination of self.keys over the columns in attrs, as a dataframe.
- Parameters:
attrs (str | Sequence[str]) – attributes to preserve. All other attributes are marginalized.
- schema#
Returns the data schema.
- std(attrs)[source]#
Estimate the standard deviation of the count estimate. The estimate does not take cuts into account.
When marginals are estimated with gaussian noise, the consistent (variance-weighted) marginals are also gaussian-distributed, so the standard deviation matches the noise scale parameter.
When marginals are estimated with laplace noise, the consistent marginals are sums of laplace-distributed variables, which are not themselves laplace-distributed, so the standard deviation does not correspond to a laplace scale. By the central limit theorem, the distribution tends toward gaussian.
If the estimate is gaussian-distributed, use opendp.accuracy.gaussian_scale_to_accuracy() to construct a confidence interval.
- Parameters:
attrs (Sequence[str]) – attributes to preserve in uncertainty estimate
- Return type:
float
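The variance-weighted reasoning above can be illustrated directly: combining independent gaussian estimates of the same marginal by inverse-variance weighting yields a gaussian estimate whose scale is determined by the input scales. A hedged sketch of that arithmetic, not the library's implementation:

```python
import math

def consistent_gaussian_scale(scales):
    """Noise scale of the inverse-variance-weighted combination
    of independent gaussian estimates of the same quantity."""
    return math.sqrt(1 / sum(1 / s**2 for s in scales))

# combining two equally-noisy gaussian estimates halves the variance,
# shrinking the standard deviation by a factor of sqrt(2)
assert math.isclose(consistent_gaussian_scale([2.0, 2.0]), 2.0 / math.sqrt(2))
# a much noisier second estimate barely improves the result
assert consistent_gaussian_scale([1.0, 100.0]) < 1.0
```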
- synthesize(rows=None, method='round')[source]#
Generate synthetic data that is consistent with the contingency table.
- Parameters:
rows (int | None) – if not set, approximately matches number of rows in original data
method (Literal['round', 'sample']) – “round” rounds to projection, “sample” samples from densities
- thresholds: dict[str, int]#
Cut-off point for discovered stable keys. Any category appearing fewer than threshold times is attributed to the null category.
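The thresholding rule above can be sketched as a count-then-filter pass. This is a hedged illustration of the documented behavior, not the library's key-discovery code:

```python
from collections import Counter

def apply_threshold(values, threshold, null_category=None):
    """Keep categories appearing at least `threshold` times;
    attribute the rest to the null category."""
    counts = Counter(values)
    return [v if counts[v] >= threshold else null_category for v in values]

values = ["a", "a", "a", "b", "c", "c"]
# "b" appears only once, below the threshold of 2, so it becomes null
assert apply_threshold(values, 2) == ["a", "a", "a", None, "c", "c"]
```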
- class opendp.extras.mbi.Count(by, weight=1.0)[source]#
Denotes a count query.
- Parameters:
by (tuple[str, ...]) –
weight (float) –
- by: tuple[str, ...]#
Columns to group by.
- weight: float = 1.0#
Importance of this count query.
Used by AIM to prioritize cliques.
Used by Fixed to distribute privacy budget.
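The role of weight in budget distribution can be sketched as a simple proportional split. A hedged illustration of the idea, not the library's implementation:

```python
def split_budget(d_out, weights):
    """Distribute a total privacy budget proportionally to query weights."""
    total = sum(weights)
    return [d_out * w / total for w in weights]

# a query with twice the weight receives twice the budget
assert split_budget(3.0, [2.0, 1.0]) == [2.0, 1.0]
```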
- class opendp.extras.mbi.Fixed(*, estimator=<function mirror_descent>, oneway='unkeyed', oneway_split=None, queries)[source]#
Bases: Algorithm
Releases all queries in a fixed workload.
- Parameters:
estimator (Callable) –
oneway (Literal['all', 'unkeyed']) –
oneway_split (float | None) –
queries (list[Count]) –
- make_marginals(input_domain, input_metric, output_measure, d_in, d_out, *, marginals, model)[source]#
Implements a “Fixed” algorithm over ordinal data.
The algorithm estimates fixed cliques.
- Parameters:
input_domain (LazyFrameDomain) – domain of input data
input_metric (FrameDistance) – how to compute distance between datasets
output_measure (Measure) – how to measure privacy of release
d_in (list[Bound]) – distance between adjacent input datasets
d_out (float) – upper bound on the privacy loss
marginals (dict[tuple[str, ...], Any]) – prior marginal releases
model (Any) – warm-start fit of MarkovRandomField
- oneway: Literal['all', 'unkeyed'] = 'unkeyed'#
Only fit one-way marginals for columns missing keys.
The fixed algorithm differs from other algorithms in that it only estimates marginals with missing keys, not all unknown first-order marginals.
- class opendp.extras.mbi.MST(*, estimator=<function mirror_descent>, oneway='all', oneway_split=None, measure_split=0.9, num_selections=None)[source]#
Bases: Algorithm
MST mechanism from MMS21.
MST greedily chooses pairs of columns that are most poorly represented by the DP contingency table in a way that guarantees all columns become connected by a minimum spanning tree. MST then releases all of the selected marginals.
>>> import opendp.prelude as dp
>>> import polars as pl
>>> dp.enable_features("contrib")
>>> context = dp.Context.compositor(
...     data=pl.scan_csv(dp.examples.get_france_lfs_path(), ignore_errors=True),
...     privacy_unit=dp.unit_of(contributions=36),
...     privacy_loss=dp.loss_of(rho=0.1, delta=1e-7),
... )
>>> table_mst = (
...     context.query(rho=0.1, delta=1e-7)
...     # transformations/truncation may be applied here
...     .select("SEX", "AGE", "HWUSUAL", "ILOSTAT")
...     .contingency_table(
...         keys={"SEX": [1, 2]},
...         cuts={"AGE": [20, 40, 60], "HWUSUAL": [1, 20, 40]},
...         algorithm=dp.mbi.MST()
...     )
...     .release()
... )
>>> table_mst.synthesize()
shape: (3_807_732, 4)
┌─────┬───────────┬───────────┬─────────┐
│ SEX ┆ AGE       ┆ HWUSUAL   ┆ ILOSTAT │
│ --- ┆ ---       ┆ ---       ┆ ---     │
│ i64 ┆ f64       ┆ f64       ┆ i64     │
╞═════╪═══════════╪═══════════╪═════════╡
│ 1   ┆ 55.446336 ┆ 20.776579 ┆ 1       │
│ 1   ┆ 28.21838  ┆ 40.53348  ┆ 1       │
│ 2   ┆ 43.291215 ┆ 34.406155 ┆ 1       │
│ 1   ┆ 55.106615 ┆ 22.413161 ┆ 1       │
│ 2   ┆ 42.585227 ┆ 40.11279  ┆ 3       │
│ …   ┆ …         ┆ …         ┆ …       │
│ 1   ┆ 58.197292 ┆ 40.139579 ┆ 1       │
│ 1   ┆ 59.371221 ┆ 19.671153 ┆ 1       │
│ 2   ┆ 19.862917 ┆ 40.339046 ┆ 9       │
│ 1   ┆ 19.492355 ┆ 32.233661 ┆ 1       │
│ 2   ┆ 60.863244 ┆ 40.737908 ┆ 3       │
└─────┴───────────┴───────────┴─────────┘
- Parameters:
estimator (Callable) –
oneway (Literal['all', 'unkeyed']) –
oneway_split (float | None) –
measure_split (float) –
num_selections (int | None) –
- make_marginals(input_domain, input_metric, output_measure, d_in, d_out, *, marginals, model)[source]#
Implements MST (Minimum Spanning Tree) over ordinal data.
- Parameters:
input_domain (LazyFrameDomain) – domain of input data
input_metric (FrameDistance) – how to compute distance between datasets
output_measure (Measure) – how to measure privacy of release
d_in (list[Bound]) – distance between adjacent input datasets
d_out (float) – upper bound on the privacy loss
marginals (dict[tuple[str, ...], Any]) – prior marginal releases
model (Any) – warm-start fit of MarkovRandomField
- Return type:
- measure_split: float = 0.9#
Proportion of the remaining budget to allocate to measuring marginals.
The complement is spent on selecting marginals.
- num_selections: int | None = None#
Number of second-order marginals to estimate.
Defaults to one fewer than the number of columns in the data.
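The default of one fewer than the number of columns reflects the structure of a spanning tree: connecting n columns requires exactly n − 1 edges. A hedged Kruskal-style sketch of greedy pair selection over hypothetical per-pair scores (the scores are made up, and this is not the library's implementation):

```python
def spanning_tree_edges(n_columns, scored_pairs):
    """Greedily pick column pairs (highest score first) that connect
    all columns, Kruskal-style; returns the selected pairs."""
    parent = list(range(n_columns))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    selected = []
    for score, a, b in sorted(scored_pairs, reverse=True):
        ra, rb = find(a), find(b)
        if ra != rb:  # this edge connects two components, so keep it
            parent[ra] = rb
            selected.append((a, b))
    return selected

# hypothetical "how poorly represented" scores for 4 columns
pairs = [(0.9, 0, 1), (0.8, 1, 2), (0.7, 0, 2), (0.6, 2, 3)]
tree = spanning_tree_edges(4, pairs)
assert len(tree) == 4 - 1  # a spanning tree over n columns has n - 1 edges
```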
- class opendp.extras.mbi.Sequential(*, estimator=<function mirror_descent>, oneway='all', oneway_split=None, algorithms, weights=None)[source]#
Bases: Algorithm
Applies multiple algorithms sequentially.
Use this in conjunction with Fixed to ensure specific marginals are estimated.
- Parameters:
estimator (Callable) –
oneway (Literal['all', 'unkeyed']) –
oneway_split (float | None) –
algorithms (list[Algorithm]) –
weights (list[float] | None) –
- make_marginals(input_domain, input_metric, output_measure, d_in, d_out, *, marginals, model)[source]#
Implements a “Sequential” algorithm over ordinal data.
The algorithm applies each algorithm sequentially.
- Parameters:
input_domain (LazyFrameDomain) – domain of input data
input_metric (FrameDistance) – how to compute distance between datasets
output_measure (Measure) – how to measure privacy of release
d_in (list[Bound]) – distance between adjacent input datasets
d_out (float) – upper bound on the privacy loss
marginals (dict[tuple[str, ...], Any]) – prior marginal releases
model (Any) – warm-start fit of MarkovRandomField
- weights: list[float] | None = None#
Budget allocation amongst algorithms.
Defaults to equal budget for each algorithm.
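For example, a Sequential configuration might first force a specific marginal via Fixed and then spend the rest adaptively via AIM. A hedged sketch assuming illustrative column names and weights:

```python
import opendp.prelude as dp

# Spend a quarter of the budget guaranteeing the SEX x ILOSTAT marginal,
# then let AIM choose the remaining marginals adaptively.
algorithm = dp.mbi.Sequential(
    algorithms=[
        dp.mbi.Fixed(queries=[dp.mbi.Count(by=("SEX", "ILOSTAT"))]),
        dp.mbi.AIM(),
    ],
    weights=[0.25, 0.75],
)
```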
- opendp.extras.mbi.make_contingency_table(input_domain, input_metric, output_measure, d_in, d_out, keys=None, cuts=None, table=None, algorithm=AIM(estimator=<function mirror_descent>, oneway='all', oneway_split=None, queries=3, measure_split=0.9, max_size=80.0, rounds=None))[source]#
Return a measurement that releases a ContingencyTable, as well as the noise scale and optional threshold for one-way marginals.
- Parameters:
input_domain (LazyFrameDomain) – domain of input data
input_metric (FrameDistance) – how to compute distance between datasets
output_measure (Measure) – how to measure privacy of release
d_in (list[Bound]) – bounds on distances between adjacent input datasets
d_out (float | tuple[float, float]) – upper bound on the privacy loss
keys (Mapping[str, Sequence] | None) – dictionary of column names and unique categories
cuts (Mapping[str, Sequence[float]] | None) – dictionary of column names and left-open bin edges
table (ContingencyTable | None) – ContingencyTable from prior release
algorithm (Algorithm) – configuration for internal estimation algorithm
- Return type:
tuple[Measurement, float | None, int | None]
- opendp.extras.mbi.mirror_descent(domain, loss_fn, *, potentials=None)[source]#
Fit a MarkovRandomField over the domain and loss function using mirror descent.
Replicate the API of this function to use other optimizers from Private-PGM.
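Replicating the API means exposing the same `(domain, loss_fn, *, potentials=None)` call signature. A hedged sketch of a wrapper that records calls before delegating; `dummy_optimizer` is a stand-in for `mirror_descent` or any other optimizer, not a real mbi function:

```python
def make_logging_estimator(inner):
    """Wrap an optimizer exposing the (domain, loss_fn, *, potentials=None)
    signature so that each call is recorded before delegating."""
    calls = []

    def estimator(domain, loss_fn, *, potentials=None):
        calls.append((domain, potentials is not None))  # record the call
        return inner(domain, loss_fn, potentials=potentials)

    return estimator, calls

# stand-in for mirror_descent, just to exercise the wrapper
def dummy_optimizer(domain, loss_fn, *, potentials=None):
    return "fitted"

estimator, calls = make_logging_estimator(dummy_optimizer)
assert estimator("some-domain", lambda x: x) == "fitted"
assert calls == [("some-domain", False)]  # no warm-start potentials passed
```

An estimator built this way can be passed to any Algorithm via its `estimator` parameter.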