opendp.extras.polars package#

Module contents#

This module requires extra installs: pip install 'opendp[polars]'

The opendp.extras.polars module adds differential privacy to the Polars DataFrame library.

For convenience, all the members of this module are also available from opendp.prelude. We suggest importing under the conventional name dp:

>>> import opendp.prelude as dp

The methods of this module will then be accessible at dp.polars.

class opendp.extras.polars.DPExpr(expr)[source]#

If both opendp and polars have been imported, the methods of DPExpr are registered under the dp namespace in Polars expressions. An expression can be used as a plan in opendp.measurements.make_private_lazyframe(); see the full example there for more information.

In addition to the DP-specific methods here, many Polars Expr methods are also supported, and are documented in the API User Guide.

This class is typically not used directly by users: Instead its methods are registered under the dp namespace of Polars expressions.

>>> import polars as pl
>>> pl.len().dp
<opendp.extras.polars.DPExpr object at ...>
count(scale=None)[source]#

Compute a differentially private estimate of the number of elements in self, not including null values.

This function is a shortcut for the exact Polars count and then noise addition.

If scale is None it is filled by global_scale in opendp.measurements.make_private_lazyframe().

Parameters:

scale (float | None) – parameter for the noise distribution.

Example:

Count the number of records with known (non-null) visits:

>>> import polars as pl
>>> context = dp.Context.compositor(
...     data=pl.LazyFrame({"visits": [1, 2, None]}),
...     privacy_unit=dp.unit_of(contributions=1),
...     privacy_loss=dp.loss_of(epsilon=1.),
...     split_evenly_over=1,
... )
>>> query = context.query().select(pl.col("visits").dp.count())
>>> query.release().collect()
shape: (1, 1)
┌────────┐
│ visits │
│ ---    │
│ u32    │
╞════════╡
│ ...    │
└────────┘

Output is noise added to two: the null value is not counted.

gaussian(scale=None)[source]#

Add Gaussian noise to the expression.

If scale is None it is filled by global_scale in opendp.measurements.make_private_lazyframe().

Parameters:

scale (float | None) – Noise scale parameter for the Gaussian distribution. scale == standard_deviation

Example:

>>> import polars as pl
>>> context = dp.Context.compositor(
...     data=pl.LazyFrame({"A": list(range(100))}),
...     privacy_unit=dp.unit_of(contributions=1),
...     privacy_loss=dp.loss_of(rho=0.5),
...     split_evenly_over=1,
... )
>>> query = context.query().select(pl.len().dp.gaussian())
>>> query.release().collect()
shape: (1, 1)
┌─────┐
│ len │
│ --- │
│ u32 │
╞═════╡
│ ... │
└─────┘
laplace(scale=None)[source]#

Add Laplace noise to the expression.

If scale is None it is filled by global_scale in opendp.measurements.make_private_lazyframe().

Parameters:

scale (float | None) – Noise scale parameter for the Laplace distribution. scale == standard_deviation / sqrt(2)

Example:

>>> import polars as pl
>>> context = dp.Context.compositor(
...     data=pl.LazyFrame({"A": list(range(100))}),
...     privacy_unit=dp.unit_of(contributions=1),
...     privacy_loss=dp.loss_of(epsilon=1.),
...     split_evenly_over=1,
... )
>>> query = context.query().select(pl.len().dp.laplace())
>>> query.release().collect()
shape: (1, 1)
┌─────┐
│ len │
│ --- │
│ u32 │
╞═════╡
│ ... │
└─────┘
len(scale=None)[source]#

Compute a differentially private estimate of the number of elements in self, including null values.

If scale is None it is filled by global_scale in opendp.measurements.make_private_lazyframe().

Parameters:

scale (float | None) – parameter for the noise distribution.

Example:

This function is a shortcut for the exact Polars len and then noise addition:

>>> import polars as pl
>>> context = dp.Context.compositor(
...     data=pl.LazyFrame({"visits": [1, 2, None]}),
...     privacy_unit=dp.unit_of(contributions=1),
...     privacy_loss=dp.loss_of(epsilon=1.),
...     split_evenly_over=1,
... )
>>> query = context.query().select(pl.col("visits").dp.len())
>>> query.release().collect()
shape: (1, 1)
┌────────┐
│ visits │
│ ---    │
│ u32    │
╞════════╡
│ ...    │
└────────┘

Output is noise added to three.

It can differ from frame length (.select(dp.len())) if the expression uses transformations that change the number of rows, like filtering.
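
For instance, a plan like the following (an illustrative sketch; whether a given expression-level transformation such as Expr.filter is supported depends on your OpenDP version) counts only the rows that survive the filter, so the released value can be smaller than the frame length:

>>> query = context.query().select(
...     pl.col("visits").filter(pl.col("visits") > 1).dp.len()
... )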

mean(bounds, scale=(None, None))[source]#

Compute the differentially private mean.

The amount of noise added to the numerator (a sum) and the denominator (a count) is determined by scale. If either scale is None it is filled by global_scale in opendp.measurements.make_private_lazyframe().

Parameters:
  • bounds (tuple[float, float]) – clip the input data to these lower and upper bounds

  • scale (tuple[float | None, float | None]) – parameters for the noise distributions of the numerator and denominator

Example:

>>> import polars as pl
>>> context = dp.Context.compositor(
...     data=pl.LazyFrame({"visits": [1, 2, None]}),
...     privacy_unit=dp.unit_of(contributions=1),
...     privacy_loss=dp.loss_of(epsilon=1.),
...     split_evenly_over=1,
...     margins={(): dp.polars.Margin(max_partition_length=5)}
... )
>>> query = context.query().select(pl.col("visits").fill_null(0).dp.mean((0, 1)))
>>> with pl.Config(float_precision=0): # just to prevent doctest from failing
...     query.release().collect()
shape: (1, 1)
┌────────┐
│ visits │
│ ---    │
│ f64    │
╞════════╡
│ ...... │
└────────┘

Privately estimates the numerator and denominator separately, and then returns their ratio.
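
Conceptually, the query above behaves like the decomposition below (an illustrative sketch reusing the context from the example, not necessarily the exact internal plan):

>>> query = context.query().select(
...     pl.col("visits").fill_null(0).dp.sum((0, 1)).alias("numerator"),
...     pl.col("visits").fill_null(0).dp.len().alias("denominator"),
... )
>>> # the mean estimate is then numerator / denominator after release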

median(candidates, scale=None)[source]#

Compute a differentially private median.

The scale calibrates the level of entropy when selecting a candidate. If scale is None it is filled by global_scale in opendp.measurements.make_private_lazyframe().

Parameters:
  • candidates (list[float]) – Potential quantiles to select from.

  • scale (float | None) – How much noise to add to the scores of each candidate.

Example:

>>> import polars as pl
>>> context = dp.Context.compositor(
...     data=pl.LazyFrame({"age": list(range(100))}),
...     privacy_unit=dp.unit_of(contributions=1),
...     privacy_loss=dp.loss_of(epsilon=1.),
...     split_evenly_over=1,
...     margins={(): dp.polars.Margin(max_partition_length=100)}
... )
>>> candidates = [0, 10, 20, 30, 40, 50, 60, 70, 80, 90]
>>> query = context.query().select(pl.col("age").fill_null(0).dp.median(candidates))
>>> query.release().collect()
shape: (1, 1)
┌─────┐
│ age │
│ --- │
│ i64 │
╞═════╡
│ ... │
└─────┘

Output will be one of the candidates, with greater likelihood of being selected the closer the candidate is to the median.

n_unique(scale=None)[source]#

Compute a differentially private estimate of the number of unique elements in self.

This function is a shortcut for the exact Polars n_unique and then noise addition.

If scale is None it is filled by global_scale in opendp.measurements.make_private_lazyframe().

Parameters:

scale (float | None) – parameter for the noise distribution.

Example:

Count the number of unique visits:

>>> import polars as pl
>>> context = dp.Context.compositor(
...     data=pl.LazyFrame({"visits": [1, 2, None]}),
...     privacy_unit=dp.unit_of(contributions=1),
...     privacy_loss=dp.loss_of(epsilon=1.),
...     split_evenly_over=1,
... )
>>> query = context.query().select(pl.col("visits").dp.n_unique())
>>> query.release().collect()
shape: (1, 1)
┌────────┐
│ visits │
│ ---    │
│ u32    │
╞════════╡
│ ...    │
└────────┘

Output is noise added to three.

noise(scale=None, distribution=None)[source]#

Add noise to the expression.

If scale is None it is filled by global_scale in opendp.measurements.make_private_lazyframe(). If distribution is None, then the noise distribution will be chosen for you:

  • Pure-DP: Laplace noise, where scale == standard_deviation / sqrt(2)

  • zCDP: Gaussian noise, where scale == standard_deviation

Parameters:
  • scale (float | None) – Scale parameter for the noise distribution.

  • distribution (Literal['Laplace'] | Literal['Gaussian'] | None) – Either Laplace, Gaussian or None.

Example:

>>> import polars as pl
>>> context = dp.Context.compositor(
...     data=pl.LazyFrame({"A": list(range(100))}),
...     privacy_unit=dp.unit_of(contributions=1),
...     privacy_loss=dp.loss_of(epsilon=1.),
...     split_evenly_over=1,
... )
>>> query = context.query().select(pl.len().dp.noise())
>>> query.release().collect()
shape: (1, 1)
┌─────┐
│ len │
│ --- │
│ u32 │
╞═════╡
│ ... │
└─────┘
null_count(scale=None)[source]#

Compute a differentially private estimate of the number of null elements in self.

This function is a shortcut for the exact Polars null_count and then noise addition.

If scale is None it is filled by global_scale in opendp.measurements.make_private_lazyframe().

Parameters:

scale (float | None) – parameter for the noise distribution.

Example:

Count the number of records with unknown (null) visits:

>>> import polars as pl
>>> context = dp.Context.compositor(
...     data=pl.LazyFrame({"visits": [1, 2, None]}),
...     privacy_unit=dp.unit_of(contributions=1),
...     privacy_loss=dp.loss_of(epsilon=1.),
...     split_evenly_over=1,
... )
>>> query = context.query().select(pl.col("visits").dp.null_count())
>>> query.release().collect()
shape: (1, 1)
┌────────┐
│ visits │
│ ---    │
│ u32    │
╞════════╡
│ ...    │
└────────┘

Output is noise added to one.

If you want to count both the null and non-null records, consider combining the two queries into one: construct a boolean nullity column, group by it, and then use dp.len(), as sketched below.
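
A hedged sketch of that pattern (grouped queries generally also need the grouping keys to be releasable, for example via a Margin with public_info="keys" or via with_keys; details depend on your setup):

>>> query = (
...     context.query()
...     .with_columns(pl.col("visits").is_null().alias("is_null"))
...     .group_by("is_null")
...     .agg(dp.len())
... )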

quantile(alpha, candidates, scale=None)[source]#

Compute a differentially private quantile.

The scale calibrates the level of entropy when selecting a candidate.

Parameters:
  • alpha (float) – a value in [0, 1]. Choose 0.5 for median.

  • candidates (list[float]) – Potential quantiles to select from.

  • scale (float | None) – How much noise to add to the scores of each candidate.

Example:

>>> import polars as pl
>>> context = dp.Context.compositor(
...     data=pl.LazyFrame({"age": list(range(100))}),
...     privacy_unit=dp.unit_of(contributions=1),
...     privacy_loss=dp.loss_of(epsilon=1.),
...     split_evenly_over=1,
...     margins={(): dp.polars.Margin(max_partition_length=100)}
... )
>>> candidates = [0, 10, 20, 30, 40, 50, 60, 70, 80, 90]
>>> query = context.query().select(pl.col("age").fill_null(0).dp.quantile(0.25, candidates))
>>> query.release().collect()
shape: (1, 1)
┌─────┐
│ age │
│ --- │
│ i64 │
╞═════╡
│ ... │
└─────┘

Output will be one of the candidates, with greater likelihood of being selected the closer the candidate is to the first quartile.

sum(bounds, scale=None)[source]#

Compute the differentially private sum.

If scale is None it is filled by global_scale in opendp.measurements.make_private_lazyframe().

Parameters:
  • bounds (tuple[float, float]) – clip the input data to these lower and upper bounds

  • scale (float | None) – parameter for the noise distribution

Example:

This function is a shortcut which actually implies several operations:

  • Clipping the values

  • Summing them

  • Applying noise to the sum

>>> import polars as pl
>>> context = dp.Context.compositor(
...     data=pl.LazyFrame({"visits": [1, 2, None]}),
...     privacy_unit=dp.unit_of(contributions=1),
...     privacy_loss=dp.loss_of(epsilon=1.),
...     split_evenly_over=1,
...     margins={(): dp.polars.Margin(max_partition_length=5)}
... )
>>> query = context.query().select(pl.col("visits").fill_null(0).dp.sum((0, 1)))
>>> query.release().collect()
shape: (1, 1)
┌────────┐
│ visits │
│ ---    │
│ i64    │
╞════════╡
│ ...    │
└────────┘

Output is noise added to two due to each value being clipped to (0, 1).

class opendp.extras.polars.LazyFrameQuery(lf_plan, query)[source]#

A LazyFrameQuery may be returned by opendp.context.Context.query(). It mimics a Polars LazyFrame, but makes a few additions and changes as documented below.

filter(*predicates, **constraints)[source]#

Filter the rows in the LazyFrame based on a predicate expression.

OpenDP discards relevant margin descriptors in the domain when filtering.

Parameters:

constraints (Any) –

Return type:

LazyFrameQuery
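
For example (a minimal sketch; the released count reflects only the rows that pass the predicate):

>>> import polars as pl
>>> context = dp.Context.compositor(
...     data=pl.LazyFrame({"visits": [1, 2, None]}),
...     privacy_unit=dp.unit_of(contributions=1),
...     privacy_loss=dp.loss_of(epsilon=1.),
...     split_evenly_over=1,
... )
>>> query = context.query().filter(pl.col("visits") > 1).select(dp.len())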

group_by(*by, maintain_order=False, **named_by)[source]#

Start a group by operation.

OpenDP currently requires that grouping keys be simple column expressions.

Parameters:

maintain_order (bool) –

Return type:

LazyGroupByQuery
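
For example (a minimal sketch; the "region" column and its margin are hypothetical, and releasing grouped results generally requires the grouping keys to be public or joined with an explicit key-set):

>>> import polars as pl
>>> context = dp.Context.compositor(
...     data=pl.LazyFrame({"region": ["a", "b", "a", "b"]}),
...     privacy_unit=dp.unit_of(contributions=1),
...     privacy_loss=dp.loss_of(epsilon=1.),
...     split_evenly_over=1,
...     margins={("region",): dp.polars.Margin(public_info="keys")},
... )
>>> query = context.query().group_by("region").agg(dp.len())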

join(other, on=None, how='inner', *, left_on=None, right_on=None, suffix='_right', validate='m:m', join_nulls=False, coalesce=None, allow_parallel=True, force_parallel=False)[source]#

Add a join operation to the Logical Plan.

Parameters:
  • suffix (str) –

  • join_nulls (bool) –

  • coalesce (bool | None) –

  • allow_parallel (bool) –

  • force_parallel (bool) –

Return type:

LazyFrameQuery

release()[source]#

Release the query. The query must be part of a context.

Return type:

OnceFrame

resolve()[source]#

Resolve the query into a measurement.

Return type:

Measurement
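
For example (a minimal sketch, assuming query is a LazyFrameQuery built from a Context as in the examples above):

>>> meas = query.resolve()
>>> loss = meas.map(1)  # privacy loss at a dataset distance of one contribution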

select(*exprs, **named_exprs)[source]#

Select columns from this LazyFrame.

OpenDP expects expressions in select statements that don’t aggregate to be row-by-row.

Return type:

LazyFrameQuery

select_seq(*exprs, **named_exprs)[source]#

Select columns from this LazyFrame.

As with select, OpenDP expects expressions in select statements that don’t aggregate to be row-by-row.

Return type:

LazyFrameQuery

sort(by, *more_by, descending=False, nulls_last=False, maintain_order=False, multithreaded=True)[source]#

Sort the LazyFrame by the given columns.

Parameters:
  • descending (bool | Sequence[bool]) –

  • nulls_last (bool | Sequence[bool]) –

  • maintain_order (bool) –

  • multithreaded (bool) –

Return type:

LazyFrameQuery

summarize(alpha=None)[source]#

Summarize the statistics released by this query.

If alpha is passed, the resulting data frame includes an accuracy column.

If a threshold is configured for censoring small/sensitive partitions, a threshold column will be included, containing the cutoff for the respective count query being thresholded.

Parameters:

alpha (float | None) – optional. A value in [0, 1] denoting the statistical significance. For the corresponding confidence level, subtract from 1: for 95% confidence, use 0.05 for alpha.

Example:

>>> import polars as pl
>>> data = pl.LazyFrame([pl.Series("convicted", [0, 1, 1, 0, 1] * 50, dtype=pl.Int32)])
>>>
>>> context = dp.Context.compositor(
...     data=data,
...     privacy_unit=dp.unit_of(contributions=1),
...     privacy_loss=dp.loss_of(epsilon=1.0),
...     split_evenly_over=1,
...     margins={(): dp.polars.Margin(max_partition_length=1000)},
... )
>>>
>>> query = context.query().select(
...     dp.len(),
...     pl.col("convicted").fill_null(0).dp.sum((0, 1))
... )
>>>
>>> query.summarize(alpha=.05)  # type: ignore[union-attr]
shape: (2, 5)
┌───────────┬──────────────┬─────────────────┬───────┬──────────┐
│ column    ┆ aggregate    ┆ distribution    ┆ scale ┆ accuracy │
│ ---       ┆ ---          ┆ ---             ┆ ---   ┆ ---      │
│ str       ┆ str          ┆ str             ┆ f64   ┆ f64      │
╞═══════════╪══════════════╪═════════════════╪═══════╪══════════╡
│ len       ┆ Frame Length ┆ Integer Laplace ┆ 2.0   ┆ 6.429605 │
│ convicted ┆ Sum          ┆ Integer Laplace ┆ 2.0   ┆ 6.429605 │
└───────────┴──────────────┴─────────────────┴───────┴──────────┘

The accuracy in any given row can be interpreted with:

>>> def interpret_accuracy(distribution, scale, accuracy, alpha):
...     return (
...         f"When the {distribution} scale is {scale}, "
...         f"the DP estimate differs from the true value by no more than {accuracy} "
...         f"at a statistical significance level alpha of {alpha}, "
...         f"or with (1 - {alpha})100% = {(1 - alpha) * 100}% confidence."
...     )
...
>>> interpret_accuracy("Integer Laplace", 2.0, 6.429605, alpha=.05)
'When the Integer Laplace scale is 2.0, the DP estimate differs from the true value by no more than 6.429605 at a statistical significance level alpha of 0.05, or with (1 - 0.05)100% = 95.0% confidence.'
with_columns(*exprs, **named_exprs)[source]#

Add columns to this LazyFrame.

OpenDP requires that expressions in with_columns be row-by-row: expressions may not change the number or order of records.

Return type:

LazyFrameQuery

with_columns_seq(*exprs, **named_exprs)[source]#

Add columns to this LazyFrame.

As with with_columns, OpenDP requires that expressions in with_columns_seq be row-by-row: expressions may not change the number or order of records.

Return type:

LazyFrameQuery

with_keys(keys, on=None)[source]#

Shorthand to join with an explicit key-set.

Parameters:
  • keys – LazyFrame containing a key-set whose columns correspond to the grouping keys

  • on (list[str] | None) – optional, the names of columns to join on. Useful if the keys LazyFrame contains extra columns

Return type:

LazyFrameQuery
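
For example (a minimal sketch; the "region" data and key-set are hypothetical):

>>> import polars as pl
>>> context = dp.Context.compositor(
...     data=pl.LazyFrame({"region": ["a", "b", "a"]}),
...     privacy_unit=dp.unit_of(contributions=1),
...     privacy_loss=dp.loss_of(epsilon=1.),
...     split_evenly_over=1,
... )
>>> keys = pl.LazyFrame({"region": ["a", "b", "c"]})  # explicit public key-set
>>> query = context.query().group_by("region").agg(dp.len()).with_keys(keys)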

class opendp.extras.polars.LazyGroupByQuery(lgb_plan, query)[source]#

A LazyGroupByQuery is returned by opendp.extras.polars.LazyFrameQuery.group_by(). It mimics a Polars LazyGroupBy, but only supports APIs documented below.

agg(*aggs, **named_aggs)[source]#

Compute aggregations for each group of a group by operation.

Parameters:
  • aggs – expressions to apply in the aggregation context

  • named_aggs – named/aliased expressions to apply in the aggregation context

Return type:

LazyFrameQuery

class opendp.extras.polars.Margin(public_info=None, max_partition_length=None, max_num_partitions=None, max_partition_contributions=None, max_influenced_partitions=None)[source]#

The Margin class is used to describe what information is known publicly about a grouped dataset: like the values you might expect to find in the margins of a table.

Be aware that aspects of your data marked as “public information” are not subject to privacy protections, so public descriptors of the margin should be set conservatively, or not set at all.

Instances of this class are used by opendp.context.Context.compositor().

Parameters:
  • public_info (Literal['keys'] | Literal['lengths'] | None) –

  • max_partition_length (int | None) –

  • max_num_partitions (int | None) –

  • max_partition_contributions (int | None) –

  • max_influenced_partitions (int | None) –

max_influenced_partitions: int | None = None#

The greatest number of partitions any one individual can contribute to.

max_num_partitions: int | None = None#

An upper bound on the number of distinct partitions.

max_partition_contributions: int | None = None#

The greatest number of records an individual may contribute to any one partition.

This can significantly reduce the sensitivity of grouped queries under zero-Concentrated DP.

max_partition_length: int | None = None#

An upper bound on the number of records in any one partition.

If you don’t know how many records are in the data, you can specify a very loose upper bound, for example, the size of the total population you are sampling from.

This is used to resolve issues raised in the paper Widespread Underestimation of Sensitivity in Differentially Private Libraries and How to Fix It.

public_info: Literal['keys'] | Literal['lengths'] | None = None#

Identifies properties of grouped data that are considered public information.

  • "keys" designates that keys are not protected

  • "lengths" designates that both keys and partition lengths are not protected

class opendp.extras.polars.OnceFrame(queryable)[source]#

OnceFrame is a Polars LazyFrame that may only be collected into a DataFrame once.

The APIs on this class mimic those that can be found in Polars.

Differentially private guarantees on a given LazyFrame require the LazyFrame to be evaluated at most once. The purpose of this class is to protect against repeatedly evaluating the LazyFrame.

collect()[source]#

Collects a DataFrame from a OnceFrame, exhausting the OnceFrame.

lazy()[source]#

Extracts a LazyFrame from a OnceFrame, circumventing protections against multiple evaluations.

Each collection consumes the entire allocated privacy budget. To remain DP at the advertised privacy level, only collect the LazyFrame once.

Requires “honest-but-curious” because the privacy guarantees only apply if:

  1. The LazyFrame (compute plan) is only ever executed once.

  2. The analyst does not observe ordering of rows in the output.

To ensure that row ordering is not observed:

  1. Do not extend the compute plan with order-sensitive computations.

  2. Shuffle the output once collected (in Polars: sample all rows with shuffle enabled).
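
For example (a minimal sketch, assuming query was built from a Context as in the examples above):

>>> df = query.release().collect()
>>> shuffled = df.sample(fraction=1.0, shuffle=True)  # hide the row order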

opendp.extras.polars.dp_len(scale=None)[source]#

Compute a differentially private estimate of the number of rows.

If scale is None it is filled by global_scale in opendp.measurements.make_private_lazyframe().

Parameters:

scale (float | None) – parameter for the noise distribution.

Example:

This function is a shortcut for the exact Polars len and then noise addition:

>>> import polars as pl
>>> context = dp.Context.compositor(
...     data=pl.LazyFrame({"A": list(range(100))}),
...     privacy_unit=dp.unit_of(contributions=1),
...     privacy_loss=dp.loss_of(epsilon=1.),
...     split_evenly_over=1,
... )
>>> query = context.query().select(dp.len())
>>> query.release().collect()
shape: (1, 1)
┌─────┐
│ len │
│ --- │
│ u32 │
╞═════╡
│ ... │
└─────┘