This documentation is for a development version of OpenDP.

The current release of OpenDP is v0.11.1.

opendp.extras.polars package#

Module contents#

This module requires extra installs: pip install opendp[polars]

The opendp.extras.polars module adds differential privacy to the Polars DataFrame library.

For convenience, all the members of this module are also available from opendp.prelude. We suggest importing under the conventional name dp:

>>> import opendp.prelude as dp

The methods of this module will then be accessible at dp.polars.

class opendp.extras.polars.DPExpr(expr)[source]#

If both opendp and polars have been imported, the methods of DPExpr are registered under the dp namespace in Polars expressions. An expression can be used as a plan in opendp.measurements.make_private_lazyframe(); see the full example there for more information.

This class is typically not used directly by users: instead, its methods are registered under the dp namespace of Polars expressions.

>>> import polars as pl
>>> pl.len().dp
<opendp.extras.polars.DPExpr object at ...>

In addition to the DP-specific methods documented below, some Polars Expr methods are also supported. For these, the best documentation is the official Polars documentation.

Supported Polars Expr Methods#

  • alias: Rename the expression

  • eq, ne, lt, le, gt, ge: Comparison operators (the symbolic forms == != < <= > >= may be more readable)

  • and_, or_, xor: Bit-wise operators (the symbolic forms & | ^ may be more readable)

  • is_null, is_not_null, is_finite, is_not_finite, is_nan, is_not_nan, not: Boolean information

  • clip: Set values outside the given bounds to the boundary value

  • fill_null, fill_nan: Fill missing values with the provided value

  • lit: Return an expression representing a literal value

A few Expr aggregation methods are also available:

Supported Polars Expr Aggregation Methods#

  • len: Number of rows, including nulls

  • count: Number of rows, not including nulls

  • sum: Sum

gaussian(scale=None)[source]#

Add Gaussian noise to the expression.

If scale is None it is filled by global_scale in opendp.measurements.make_private_lazyframe().

Parameters:

scale (float | None) – Noise scale parameter for the Gaussian distribution. scale == standard_deviation

Example:

>>> import polars as pl
>>> expression = pl.len().dp.gaussian()
>>> print(expression)
len()...:noise([...])
laplace(scale=None)[source]#

Add Laplace noise to the expression.

If scale is None it is filled by global_scale in opendp.measurements.make_private_lazyframe().

Parameters:

scale (float | None) – Noise scale parameter for the Laplace distribution. scale == standard_deviation / sqrt(2)

Example:

>>> import polars as pl
>>> expression = pl.len().dp.laplace()
>>> print(expression)
len()...:noise([...])
mean(bounds, scale=None)[source]#

Compute the differentially private mean.

The amount of noise to be added to the sum is determined by the scale. If scale is None it is filled by global_scale in opendp.measurements.make_private_lazyframe().

Parameters:
  • bounds (Tuple[float, float]) – The bounds of the input data.

  • scale (float | None) – Noise scale parameter for the Laplace distribution. scale == standard_deviation / sqrt(2)

Example:

>>> import polars as pl
>>> expression = pl.col('numbers').dp.mean((0, 10))
>>> print(expression)
[(col("numbers").clip([...]).sum()...:noise([...])) / (len())]
median(candidates, scale=None)[source]#

Compute a differentially private median.

The scale calibrates the level of entropy when selecting a candidate. If scale is None it is filled by global_scale in opendp.measurements.make_private_lazyframe().

Parameters:
  • candidates (list[float]) – Potential quantiles to select from.

  • scale (float | None) – How much noise to add to the scores of each candidate.

Example:

>>> import polars as pl
>>> expression = pl.col('numbers').dp.median([1, 2, 3])
>>> print(expression)
col("numbers")...:discrete_quantile_score([...])...:report_noisy_max([...])...:index_candidates([...])
noise(scale=None, distribution=None)[source]#

Add noise to the expression.

If scale is None it is filled by global_scale in opendp.measurements.make_private_lazyframe(). If distribution is None, then the noise distribution will be chosen for you:

  • Pure-DP: Laplace noise, where scale == standard_deviation / sqrt(2)

  • zCDP: Gaussian noise, where scale == standard_deviation

Parameters:
  • scale (float | None) – Scale parameter for the noise distribution.

  • distribution (Literal['Laplace'] | Literal['Gaussian'] | None) – Either Laplace, Gaussian, or None.

Example:

>>> import polars as pl
>>> expression = pl.len().dp.noise()
>>> print(expression)
len()...:noise([...])
quantile(alpha, candidates, scale=None)[source]#

Compute a differentially private quantile.

The scale calibrates the level of entropy when selecting a candidate.

Parameters:
  • alpha (float) – a value in [0, 1]. Choose 0.5 for median.

  • candidates (list[float]) – Potential quantiles to select from.

  • scale (float | None) – How much noise to add to the scores of each candidate.

Example:

>>> import polars as pl
>>> expression = pl.col('numbers').dp.quantile(0.5, [1, 2, 3])
>>> print(expression)
col("numbers")...:discrete_quantile_score([...])...:report_noisy_max([...])...:index_candidates([...])
sum(bounds, scale=None)[source]#

Compute the differentially private sum.

If scale is None it is filled by global_scale in opendp.measurements.make_private_lazyframe().

Parameters:
  • bounds (Tuple[float, float]) – The bounds of the input data.

  • scale (float | None) – Noise scale parameter for the Laplace distribution. scale == standard_deviation / sqrt(2)

Example:

Note that sum is a shortcut which actually implies several operations:

  • Clipping the values

  • Summing them

  • Applying Laplace noise to the sum

>>> import polars as pl
>>> expression = pl.col('numbers').dp.sum((0, 10))
>>> print(expression)
col("numbers").clip([...]).sum()...:noise([...])
class opendp.extras.polars.LazyFrameQuery[source]#

A LazyFrameQuery may be returned by opendp.context.Context.query(). It mimics a Polars LazyFrame, but makes a few additions as documented below.

release()[source]#

Release the query. The query must be part of a context.

Return type:

OnceFrame

resolve()[source]#

Resolve the query into a measurement.

Return type:

Measurement

summarize(alpha=None)[source]#

Summarize the statistics released by this query.

Parameters:

alpha (float | None) –

class opendp.extras.polars.Margin(public_info=None, max_partition_length=None, max_num_partitions=None, max_partition_contributions=None, max_influenced_partitions=None)[source]#

The Margin class is used to describe what information is known publicly about a grouped dataset: like the values you might expect to find in the margins of a table.

Be aware that aspects of your data marked as “public information” are not subject to privacy protections, so public descriptors about the margin should be set conservatively, or not set at all.

Instances of this class are used by opendp.context.Context.compositor().

Parameters:
  • public_info (Literal['keys'] | Literal['lengths'] | None) –

  • max_partition_length (int | None) –

  • max_num_partitions (int | None) –

  • max_partition_contributions (int | None) –

  • max_influenced_partitions (int | None) –

max_influenced_partitions: int | None = None#

The greatest number of partitions any one individual can contribute to.

max_num_partitions: int | None = None#

An upper bound on the number of distinct partitions.

max_partition_contributions: int | None = None#

The greatest number of records an individual may contribute to any one partition.

This can significantly reduce the sensitivity of grouped queries under zero-Concentrated DP.

max_partition_length: int | None = None#

An upper bound on the number of records in any one partition.

If you don’t know how many records are in the data, you can specify a very loose upper bound, for example, the size of the total population you are sampling from.

This is used to resolve issues raised in the paper Widespread Underestimation of Sensitivity in Differentially Private Libraries and How to Fix It.

public_info: Literal['keys'] | Literal['lengths'] | None = None#

Identifies properties of grouped data that are considered public information.

  • "keys" designates that keys are not protected

  • "lengths" designates that both keys and partition lengths are not protected

class opendp.extras.polars.OnceFrame(queryable)[source]#
collect()[source]#

Collects a DataFrame from a OnceFrame, exhausting the OnceFrame.

lazy()[source]#

Extracts a LazyFrame from a OnceFrame, circumventing protections against multiple evaluations.

Each collection consumes the entire allocated privacy budget. To remain DP at the advertised privacy level, only collect the LazyFrame once.

Requires “honest-but-curious” because the privacy guarantees only apply if:

  1. The LazyFrame (compute plan) is only ever executed once.

  2. The analyst does not observe the ordering of rows in the output. To ensure this, shuffle the output.