Aggregation#

[Polars Documentation]

The most common aggregators like length, sum, mean, median and quantile are covered in essential statistics.

In addition to these aggregators, OpenDP also supports other variations of counting queries. A counting query tells you how many rows in a dataset meet a given condition.

[1]:
import polars as pl
import opendp.prelude as dp

dp.enable_features("contrib")

# Fetch and unpack the data.
![ -e ../sample_FR_LFS.csv ] || ( curl 'https://github.com/opendp/dp-test-datasets/blob/main/data/sample_FR_LFS.csv.zip?raw=true' --location --output sample_FR_LFS.csv.zip; unzip sample_FR_LFS.csv.zip -d ../ )

To get started, we’ll recreate the Context from the tabular data introduction.

[2]:
context = dp.Context.compositor(
    data=pl.scan_csv("../sample_FR_LFS.csv", ignore_errors=True),
    privacy_unit=dp.unit_of(contributions=36),
    privacy_loss=dp.loss_of(epsilon=1.0),
    split_evenly_over=5,
)

Frame Length vs Expression Length#

Frame length is not the same as expression length. These quantities can differ if the expression changes the number of rows.

[3]:
query_len_variations = (
    context.query()
    .group_by("SEX")
    .agg([
        # total number of rows in the frame, including nulls
        dp.len(),
        # total number of rows in the HWUSUAL column (including nulls)
        pl.col.HWUSUAL.dp.len(),
    ])
    # explicitly specifying keys makes the query satisfy pure-DP
    .with_keys(pl.LazyFrame({"SEX": [1, 2]}))
)
query_len_variations.summarize()
[3]:
shape: (2, 4)
┌───────────┬────────────────┬───────────────────┬───────┐
│ column    ┆ aggregate      ┆ distribution      ┆ scale │
│ ---       ┆ ---            ┆ ---               ┆ ---   │
│ str       ┆ str            ┆ str               ┆ f64   │
╞═══════════╪════════════════╪═══════════════════╪═══════╡
│ "len"     ┆ "Frame Length" ┆ "Integer Laplace" ┆ 360.0 │
│ "HWUSUAL" ┆ "Length"       ┆ "Integer Laplace" ┆ 360.0 │
└───────────┴────────────────┴───────────────────┴───────┘

These two statistics are equivalent, but the frame length (the first) can be used to release stable grouping keys, while the column length (the second) can be preprocessed with filtering.

If the OpenDP Library detects that the column has not been transformed in a way that changes the number of rows, it will still apply margin descriptors, which may reduce the sensitivity of the column length.

[4]:
query_len_variations.release().collect()
[4]:
shape: (2, 3)
┌─────┬────────┬─────────┐
│ SEX ┆ len    ┆ HWUSUAL │
│ --- ┆ ---    ┆ ---     │
│ i64 ┆ u32    ┆ u32     │
╞═════╪════════╪═════════╡
│ 1   ┆ 95762  ┆ 95047   │
│ 2   ┆ 103723 ┆ 104234  │
└─────┴────────┴─────────┘

Unique Counts#

A count of the number of unique values in a column is as sensitive as the frame or column length when protecting user contributions. Unlike the frame length, however, the sensitivity does not drop to zero when protecting changed records, because a change in an individual's answer may result in one more, or one fewer, unique value.

[5]:
query_n_unique = context.query().select([
    # total number of unique elements in the HWUSUAL column (including null)
    pl.col.HWUSUAL.dp.n_unique(),
])
query_n_unique.summarize()
[5]:
shape: (1, 4)
┌───────────┬────────────┬───────────────────┬───────┐
│ column    ┆ aggregate  ┆ distribution      ┆ scale │
│ ---       ┆ ---        ┆ ---               ┆ ---   │
│ str       ┆ str        ┆ str               ┆ f64   │
╞═══════════╪════════════╪═══════════════════╪═══════╡
│ "HWUSUAL" ┆ "N Unique" ┆ "Integer Laplace" ┆ 180.0 │
└───────────┴────────────┴───────────────────┴───────┘
[6]:
query_n_unique.release().collect()
[6]:
shape: (1, 1)
┌─────────┐
│ HWUSUAL │
│ ---     │
│ u32     │
╞═════════╡
│ 0       │
└─────────┘

Noise added to a count can push it negative, but since the output data type is an unsigned integer, the library may return zero instead. This is more likely to happen when the true value is small.

This release tells us that the number of unique values is relatively small.
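As a rough simulation of this effect (using NumPy's continuous Laplace sampler as a stand-in for the library's discrete noise mechanism; the true count here is made up):

```python
import numpy as np

rng = np.random.default_rng()
true_count = 50  # hypothetical small true value
scale = 180.0    # the noise scale reported by summarize() above

noisy = true_count + rng.laplace(scale=scale)
# the output dtype is unsigned, so a negative noisy count saturates to zero
released = max(0, round(noisy))
```

With a true value of 50 and a noise scale of 180, a large fraction of draws land below zero and release as exactly zero.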

Null and Non-Null Counts#

You can release a count of the number of null or non-null records, respectively, as follows:

[7]:
query_counts = context.query().select([
    # total number of non-null elements in the HWUSUAL column
    pl.col.HWUSUAL.dp.count(),
    # total number of null elements in the HWUSUAL column
    pl.col.HWUSUAL.dp.null_count(),
])
query_counts.summarize()
[7]:
shape: (2, 4)
┌───────────┬──────────────┬───────────────────┬───────┐
│ column    ┆ aggregate    ┆ distribution      ┆ scale │
│ ---       ┆ ---          ┆ ---               ┆ ---   │
│ str       ┆ str          ┆ str               ┆ f64   │
╞═══════════╪══════════════╪═══════════════════╪═══════╡
│ "HWUSUAL" ┆ "Count"      ┆ "Integer Laplace" ┆ 360.0 │
│ "HWUSUAL" ┆ "Null Count" ┆ "Integer Laplace" ┆ 360.0 │
└───────────┴──────────────┴───────────────────┴───────┘

Notice that the count and null_count are complementary: you could instead release len for HWUSUAL grouped by whether the value is null.

You can take advantage of this to estimate both statistics with the same privacy loss, but with half as much noise.

[8]:
query_counts_via_grouping = (
    context.query()
    .with_columns(pl.col("HWUSUAL").is_null().alias("HWUSUAL_is_null"))
    .group_by("HWUSUAL_is_null")
    .agg(dp.len())
    # we're grouping on a bool column, so the groups are:
    .with_keys(pl.LazyFrame({"HWUSUAL_is_null": [True, False]}))
)
query_counts_via_grouping.summarize()
[8]:
shape: (1, 4)
┌────────┬────────────────┬───────────────────┬───────┐
│ column ┆ aggregate      ┆ distribution      ┆ scale │
│ ---    ┆ ---            ┆ ---               ┆ ---   │
│ str    ┆ str            ┆ str               ┆ f64   │
╞════════╪════════════════╪═══════════════════╪═══════╡
│ "len"  ┆ "Frame Length" ┆ "Integer Laplace" ┆ 180.0 │
└────────┴────────────────┴───────────────────┴───────┘

The noise scale dropped from 360 to 180…

[9]:
query_counts_via_grouping.release().collect()
[9]:
shape: (2, 2)
┌─────────────────┬────────┐
│ HWUSUAL_is_null ┆ len    │
│ ---             ┆ ---    │
│ bool            ┆ u32    │
╞═════════════════╪════════╡
│ false           ┆ 197918 │
│ true            ┆ 1297   │
└─────────────────┴────────┘

…but we still get answers to all of the same queries!