Essential Statistics

This section demonstrates how to use the OpenDP Library to compute essential statistical measures with Polars.

  • Count

  • Sum

  • Mean

  • Median

  • Quantile

We will use sample data from the Labor Force Survey in France.

[21]:
import polars as pl
import opendp.prelude as dp

dp.enable_features("contrib")

# Fetch and unpack the data.
![ -e sample_FR_LFS.csv ] || ( curl 'https://github.com/opendp/dp-test-datasets/blob/main/data/sample_FR_LFS.csv.zip?raw=true' --location --output sample_FR_LFS.csv.zip; unzip sample_FR_LFS.csv.zip )

To get started, we’ll recreate the Context from the tabular data introduction.

[2]:
context = dp.Context.compositor(
    data=pl.scan_csv("sample_FR_LFS.csv", ignore_errors=True),
    privacy_unit=dp.unit_of(contributions=36),
    privacy_loss=dp.loss_of(epsilon=1.0),
    split_evenly_over=5,
)

In practice, it is recommended to only ever create one Context that spans all queries you may make on your data. However, to more clearly explain the functionality of the library, the following examples do not follow this recommendation.

Count

The simplest query is a count of the number of records in a dataset.

[3]:
query_num_responses = context.query().select(pl.len().dp.noise())

If you have not used Polars before, please familiarize yourself with the query syntax by reading Polars’ Getting Started. OpenDP specifically targets the lazy API, not the eager API.

You can obtain an accuracy estimate before committing to a release:

[4]:
query_num_responses.summarize(alpha=0.05)
[4]:
shape: (1, 5)
column  aggregate  distribution       scale  accuracy
str     str        str                f64    f64
"len"   "Len"      "Integer Laplace"  180.0  539.731115

If you were to release query_num_responses, then the DP len estimate will differ from the true len by no more than the given accuracy with 1 - alpha = 95% confidence.
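
The summary is an ordinary Polars DataFrame, so you can extract the accuracy programmatically, for example to report a confidence interval alongside the estimate. A minimal sketch (the variable names are just for illustration):

summary = query_num_responses.summarize(alpha=0.05)
accuracy = summary["accuracy"].item()

# once the query is released, the interval
#   (estimate - accuracy, estimate + accuracy)
# covers the true count with 1 - alpha = 95% confidence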

Assuming this level of utility justifies the loss of privacy (ε = 0.2), release the query:

[5]:
query_num_responses.release().collect().item()
[5]:
199895

As shown in these code snippets, OpenDP Polars differs from typical Polars in three ways:

  1. How you specify the data

    Instead of directly manipulating the data (a LazyFrame), you now manipulate context.query() (a LazyFrameQuery). You can think of context.query() as a mock for the real data (in reality, a LazyFrameQuery is an empty LazyFrame with some extra methods).

  2. How you construct the query

    OpenDP extends the Polars API to include differentially private methods and statistics. LazyFrame (now LazyFrameQuery) has additional methods, as shown above with .release and .summarize. Expressions have an additional namespace .dp, as shown above with .dp.noise.

  3. How you run the query

    When querying through OpenDP, you must call .release() before executing the computation with .collect(). .release() tells the OpenDP Library to ensure that the query satisfies differential privacy, to account for the privacy loss of releasing the query, and then to allow the query to be collected once.

We now provide more examples for other common statistics.

Sum

In this section we compute a privacy-preserving total of work hours across all responses.

The OpenDP Library is not yet able to reason about the privacy properties of many of the data transformations in Polars. In the meantime, it is possible to run data transformations yourself before passing the data into the context.

[6]:
data = (
    pl.scan_csv("sample_FR_LFS.csv")
        # TODO: support stable casting of dtype (https://github.com/opendp/opendp/issues/1710)
        .with_columns(HWUSUAL=pl.col.HWUSUAL.cast(int))
        # TODO: preserve more margin info through filtering (https://github.com/opendp/opendp/issues/1932)
        .filter(pl.col("HWUSUAL") != 99)
)

The OpenDP Library ensures that privacy guarantees take into account the potential for overflow and/or numerical instability. For this reason, many statistics require a known upper bound on how many records can be present in the data. This descriptor will need to be provided when you first construct the Context, in the form of a margin. A margin is used to describe certain properties that a potential adversary would already know about the data.

[7]:
context = dp.Context.compositor(
    data=data,
    privacy_unit=dp.unit_of(contributions=36),
    privacy_loss=dp.loss_of(epsilon=1.0),
    split_evenly_over=5,
    # NEW CODE STARTING HERE
    margins={
        # when data is not grouped (empty tuple)...
        (): dp.polars.Margin(
            # ...the biggest (and only) partition is no larger than
            #    France population * number of quarters
            max_partition_length=60_000_000 * 36
        ),
    },
)

Each dp.polars.Margin contains descriptors about the dataset when grouped by columns. Since we’re not yet grouping, the tuple of grouping columns is empty: (). The OpenDP Library references this margin when you use .select in a query.
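
For data grouped by columns, the key is the tuple of those column names. A hypothetical sketch (the "SEX" column and its bound are only for illustration; grouped queries are covered in the next section):

margins = {
    # the margin for ungrouped data, as above
    (): dp.polars.Margin(max_partition_length=60_000_000 * 36),
    # hypothetical: descriptors for data grouped by a "SEX" column
    ("SEX",): dp.polars.Margin(max_partition_length=60_000_000 * 36),
}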

This margin provides an upper bound on how large any partition can be (max_partition_length). An adversary could very reasonably surmise that there are no more responses in each quarter than the population of France. The population of France was about 60 million in 2004 (source: World Bank), so we'll use that as our maximum partition length. By giving up this relatively inconsequential fact about the data to a potential adversary, the library is able to ensure that overflow and/or numerical instability won't undermine privacy guarantees.

Now that you’ve become acquainted with margins, let’s release some queries that make use of them. We start by releasing the total number of work hours across responses.

[8]:
query_work_hours = context.query().select(
    pl.col("HWUSUAL").fill_null(0).dp.sum(bounds=(0, 80))
)

This query uses an expression .dp.sum that clips the range of each response, sums, and then adds sufficient noise to satisfy the differential privacy guarantee.

Since the sum is sensitive to null values, OpenDP also requires that inputs are not null. .fill_null fulfills this requirement by imputing null values with the provided expression (in this case zero).

Do not use private data to calculate imputed values or bounds: This could leak private information, reducing the integrity of the privacy guarantee. Instead, choose bounds and imputed values based on prior domain knowledge.
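
To see what this preprocessing does, here is the same imputation and clipping written in plain Polars, outside of any privacy machinery (illustrative only; .dp.sum applies the clipping internally):

# nulls are imputed with 0, then each response is clipped to the bounds (0, 80)
pl.DataFrame({"HWUSUAL": [None, 100, 35]}).select(
    pl.col("HWUSUAL").fill_null(0).clip(0, 80)
)
# yields the column [0, 80, 35]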

[9]:
query_work_hours.summarize(alpha=0.05)
[9]:
shape: (1, 5)
column     aggregate  distribution       scale    accuracy
str        str        str                f64      f64
"HWUSUAL"  "Sum"      "Integer Laplace"  14400.0  43139.04473

The noise scale 14400 comes from the product of 36 (number of contributions), 80 (max number of work hours), and 5 (number of queries).
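
Written out (the variable names here are just for illustration):

max_contributions = 36  # one individual may appear in up to 36 records
upper_bound = 80        # each clipped response adds at most 80 hours to the sum
num_queries = 5         # the total budget is split evenly over five queries

# the sensitivity of the clipped sum, scaled to the per-query budget
noise_scale = max_contributions * upper_bound * num_queries  # 14400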

If you were to release query_work_hours, then the DP sum estimate will differ from the clipped sum by no more than the given accuracy with 1 - alpha = 95% confidence. Notice that the accuracy estimate does not take into account bias introduced by clipping responses.

[10]:
query_work_hours.release().collect()
[10]:
shape: (1, 1)
HWUSUAL
i64
2961886

Even though the accuracy estimate may have seemed large, in retrospect it is actually quite tight: our noisy release of nearly 3 million work hours likely differs from the total clipped work hours by no more than about 43,000.

One adjustment we made to improve utility was casting the column being summed to an integer. When max_partition_length is very large, the worst-case error from summing floating-point numbers also becomes very large, and this numerical imprecision can significantly reduce the utility of the release.

Mean

As of the current release of the OpenDP Library, it is recommended to estimate means by separately releasing sum and count estimates.

It is possible to release both of the above statistics in one query:

[11]:
query_work_hours = context.query().select(
    pl.col("HWUSUAL").fill_null(0).dp.sum(bounds=(0, 80)),
    pl.len().dp.noise()
)

query_work_hours.summarize(alpha=.05)
[11]:
shape: (2, 5)
column     aggregate  distribution       scale    accuracy
str        str        str                f64      f64
"HWUSUAL"  "Sum"      "Integer Laplace"  28800.0  86277.589474
"len"      "Len"      "Integer Laplace"  360.0    1078.963271

This joint query satisfies the same privacy guarantee as each of the previous individual queries by adding twice as much noise to each statistic: the sum's noise scale doubles from 14400.0 to 28800.0, and the count's from 180.0 to 360.0.

Feel free to reuse the same noisy count estimate to estimate several means on different columns.

[12]:
query_work_hours.release().collect()
[12]:
shape: (1, 2)
HWUSUAL  len
i64      u32
2948401  78484
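
Dividing one released value by another is post-processing, so it incurs no additional privacy loss. A minimal sketch, assuming the frame collected above was bound to a variable named released:

# released = query_work_hours.release().collect()  # the frame shown above
dp_sum = released.item(0, "HWUSUAL")  # noisy total of work hours
dp_len = released.item(0, "len")      # noisy number of responses

dp_mean = dp_sum / dp_len  # 2948401 / 78484 ≈ 37.57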

If the dataset size is considered public information, you can instead use .dp.mean. Specify this in the margin via public_info="lengths" (we plan to allow use of .dp.mean without public lengths in future work).

[13]:
context_bounded_dp = dp.Context.compositor(
    data=data,
    privacy_unit=dp.unit_of(contributions=36),
    privacy_loss=dp.loss_of(epsilon=1.0),
    split_evenly_over=5,
    margins={
        (): dp.polars.Margin(
            max_partition_length=60_000_000 * 36,
            # ADDITIONAL CODE STARTING HERE
            # make partition size public
            public_info="lengths"
        ),
    },
)

query_mean_work_hours = context_bounded_dp.query().select(
    pl.col("HWUSUAL").fill_null(0).dp.mean(bounds=(0, 80))
)

When public_info="lengths" is set, the number of records in the data is not protected (for those familiar with DP terminology, this is equivalent to bounded-DP). Therefore when computing the mean, a noisy sum is released and subsequently divided by the exact length. This behavior can be observed in the query summary:

[14]:
query_mean_work_hours.summarize(alpha=.05)
[14]:
shape: (2, 5)
column     aggregate  distribution       scale   accuracy
str        str        str                f64     f64
"HWUSUAL"  "Sum"      "Integer Laplace"  7200.0  21569.772352
"HWUSUAL"  "Len"      null               null    0.0
[15]:
query_mean_work_hours.release().collect()
[15]:
shape: (1, 1)
HWUSUAL
f64
37.67509

To recap, we’ve shown how to estimate linear statistics like counts, sums and means. These estimates were all privatized by output perturbation (adding noise to a value).

Median

Unfortunately, output perturbation does not work well for releasing private medians (.dp.median) and quantiles (.dp.quantile). Instead of passing bounds, the technique used to release these quantities requires you to specify candidates: potential outcomes for the mechanism to select from. The expression privately selects the candidate that is nearest to the true median (or quantile).

For example, to privately release the median over HWUSUAL you might set candidates to the whole numbers from 20 to 59:

[16]:
candidates = list(range(20, 60))

query_median_hours = context.query().select(
    pl.col("HWUSUAL").fill_null(40).dp.median(candidates)
)

The more candidates you pass, the higher the computational overhead (growing logarithmically) and memory overhead (growing linearly).

[17]:
query_median_hours.summarize(alpha=0.05)
[17]:
shape: (1, 5)
column     aggregate       distribution  scale  accuracy
str        str             str           f64    f64
"HWUSUAL"  "0.5-Quantile"  "GumbelMin"   360.0  null

The aggregate value shows “0.5-Quantile” because .dp.median internally just calls .dp.quantile with an alpha parameter set to 0.5.
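
In other words, the median query above could equivalently have been written as:

query_median_hours = context.query().select(
    pl.col("HWUSUAL").fill_null(40).dp.quantile(0.5, candidates)
)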

This time the accuracy estimate is unknown because the algorithm isn’t directly adding noise: it’s scoring each candidate, adding noise to each score, and then releasing the candidate with the best noisy score. While this approach results in much better utility than output perturbation would for this kind of query, it prevents us from providing accuracy estimates.
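
To build intuition for this select-by-noisy-score approach, here is a rough sketch in plain Python (a simplification, not the OpenDP Library's implementation; the actual scoring and noise calibration differ):

import numpy as np

def sketch_dp_quantile(data, candidates, alpha, scale):
    data = np.asarray(data)
    # score each candidate by how close it sits to the alpha-quantile:
    # an ideal candidate has about alpha * len(data) records below it
    scores = [-abs((data < c).sum() - alpha * len(data)) for c in candidates]
    # perturb every score with Gumbel noise, then release only the candidate
    # with the best noisy score (the scores themselves are never released)
    noisy = np.array(scores) + np.random.gumbel(scale=scale, size=len(scores))
    return candidates[int(np.argmax(noisy))]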

[18]:
query_median_hours.release().collect()
[18]:
shape: (1, 1)
HWUSUAL
i64
37

This median estimate is consistent with the mean estimate from the previous section.

Quantile

.dp.quantile additionally requires an alpha parameter between zero and one, designating the proportion of records less than the desired release.

For example, the following query computes the .25-quantile (25th percentile) of work hours:

[19]:
query_025_quantile_hours = context.query().select(
    pl.col("HWUSUAL").fill_null(40).dp.quantile(alpha=0.25, candidates=candidates)
)
query_025_quantile_hours.release().collect()
[19]:
shape: (1, 1)
HWUSUAL
i64
35

Since work hours tend to be concentrated around 40, it makes sense that the first quartile isn’t much less than the median.

Multi-Statistics

You can release multiple statistics in one query. For example, the following query simultaneously releases multiple quantiles:

[20]:
query_multi_quantiles = context.query().select(
    pl.col("HWUSUAL").fill_null(40).dp.quantile(a, candidates).alias(f"{a}-Quantile")
    for a in [0.25, 0.5, 0.75]
)
query_multi_quantiles.summarize()
[20]:
shape: (3, 4)
column           aggregate        distribution  scale
str              str              str           f64
"0.25-Quantile"  "0.25-Quantile"  "GumbelMin"   3240.0
"0.5-Quantile"   "0.5-Quantile"   "GumbelMin"   1080.0
"0.75-Quantile"  "0.75-Quantile"  "GumbelMin"   3240.0

When you do not set the scale parameter yourself, the privacy budget is distributed evenly across each statistic. Judging from the scale parameters in the summary table, it may seem that more of the privacy budget was allocated for the median, but this is only due to internal implementation details.

Throughout this notebook, all .dp expressions take an optional scale parameter that can be used to more finely control how much noise is added to queries. The library then rescales all of these parameters up or down to satisfy a global privacy guarantee.
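
For example, you could allocate relatively more of a joint query's budget to the count than to the sum by setting their scales explicitly. A sketch (the scale values are illustrative, not a recommendation):

query_weighted = context.query().select(
    pl.col("HWUSUAL").fill_null(0).dp.sum(bounds=(0, 80), scale=28800.0),
    pl.len().dp.noise(scale=90.0),  # relatively less noise on the count
)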

Now that you have a handle on the essential statistics, the next section will introduce you to applying these statistics over groupings of your data.