Grouping

The OpenDP Library allows you to compute statistics over grouped data. We’ll examine three settings with progressively more public information:

  • Protected group keys: public_info=None

  • Public group keys: public_info="keys"

  • Public group lengths: public_info="lengths"

The API Reference provides more information about the methods. We will use the sample data from the Labor Force Survey in France.

[1]:
import polars as pl
import opendp.prelude as dp
import hvplot

dp.enable_features("contrib")
hvplot.extension("bokeh")

# Fetch data.
![ -e sample_FR_LFS.csv ] || ( curl 'https://github.com/opendp/dp-test-datasets/blob/main/data/sample_FR_LFS.csv.zip?raw=true' --location --output sample_FR_LFS.csv.zip; unzip sample_FR_LFS.csv.zip )
Archive:  sample_FR_LFS.csv.zip
  inflating: sample_FR_LFS.csv
  inflating: __MACOSX/._sample_FR_LFS.csv

Protected Group Keys

Grouping keys themselves can be extremely sensitive. For example, sharing a noisy count query grouped by social security numbers or credit card numbers would catastrophically violate the privacy of individuals in the data. Similarly, re-identification of individuals can be accomplished with very small combinations of grouping columns that may seem benign (for reference, see how Latanya Sweeney reidentified the Governor of Massachusetts in healthcare data).

For this reason, by default, the OpenDP Library discards any grouping key that may be unique to an individual by filtering out data partitions with too few records. OpenDP calibrates this filtering threshold such that the probability of releasing a sensitive grouping key is no greater than the privacy parameter delta (δ).

In the following example, only counts for combinations of age and employment status (ILOSTAT) that are common among many workers are released.

[3]:
context = dp.Context.compositor(
    # Many columns contain mixtures of strings and numbers and cannot be parsed as floats,
    # so we'll set `ignore_errors` to true to avoid conversion errors.
    data=pl.scan_csv("sample_FR_LFS.csv", ignore_errors=True),
    privacy_unit=dp.unit_of(contributions=36),
    privacy_loss=dp.loss_of(epsilon=1.0 / 3, delta=1e-7),
    # allow for one query
    split_evenly_over=1,
)

query_age_ilostat = (
    context.query()
    .group_by("AGE", "ILOSTAT")
    .agg(pl.len().dp.noise())
)

Before releasing the query, let's take a look at the properties we can expect of the output:

[4]:
query_age_ilostat.summarize(alpha=.05)
[4]:
shape: (1, 6)
column  aggregate  distribution       scale  accuracy    threshold
str     str        str                f64    f64         u32
"len"   "Len"      "Integer Laplace"  108.0  324.037928  2089

According to the query description, each noisy count returned by this query will differ from the true count by no more than the given accuracy with 1 - alpha = 95% confidence. Any grouping key whose noisy count falls below the given threshold will be filtered from the release.
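The scale, accuracy and threshold in the summary can be approximately reproduced by hand. The following is a back-of-the-envelope sketch, not the library's implementation. It assumes the noise scale is contributions / epsilon, uses a closed-form accuracy bound for discrete Laplace noise, and uses a continuous-Laplace tail bound for the threshold, under the further assumption that one individual may influence up to 36 partitions by up to 36 records each. The helper names are hypothetical.

```python
import math

contributions = 36   # from privacy_unit
epsilon = 1.0 / 3    # from privacy_loss
delta = 1e-7         # from privacy_loss
alpha = 0.05         # from summarize(alpha=.05)

# Assumption: Laplace scale = sensitivity / epsilon, with sensitivity = contributions.
scale = contributions / epsilon


def discrete_laplace_accuracy(scale: float, alpha: float) -> float:
    """Accuracy bound on the noise magnitude, held with probability 1 - alpha."""
    return scale * math.log(2 / (alpha * (math.exp(1 / scale) + 1))) + 1


def approx_threshold(scale: float, delta: float,
                     max_partitions: int, max_per_partition: int) -> int:
    """Continuous-Laplace approximation (an assumption, not OpenDP's code).

    delta is split evenly over the partitions one individual may influence;
    in each, a group holding only that individual's records must exceed the
    threshold with probability at most delta / max_partitions.
    """
    tail = scale * math.log(max_partitions / (2 * delta))
    return math.ceil(max_per_partition + tail)


accuracy = discrete_laplace_accuracy(scale, alpha)
threshold = approx_threshold(scale, delta, contributions, contributions)

print(f"scale={scale:.1f}, accuracy={accuracy:.6f}, threshold={threshold}")
```

Under these assumptions the values agree with the summary table, which suggests that the threshold grows with the noise scale and with the logarithm of 1/δ.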

This level of utility seems suitable, so we’ll go ahead and release the query:

[5]:
df = query_age_ilostat.release().collect()

line = df.sort("AGE").plot.line(x="AGE", y="len", by="ILOSTAT")
scatter = df.sort("AGE").plot.scatter(x="AGE", y="len", by="ILOSTAT")
line * scatter
[5]:

Referring back to the ILOSTAT legend:

ILOSTAT  Legend                                  Comments
1        did any work for pay or profit          most common among people between 30 and 50
2        employed but not working                slightly more commonly observed among young adults
3        was not working because of lay-off      clearly influenced by retirement
9        not applicable, less than 15 years old  only present for the youngest age group

Where points are missing in the graph, there are not enough individuals with that combination of age and employment status to pass the threshold.
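The filtering behaviour can be imitated with a small simulation. This is only a sketch: it uses continuous Laplace noise from the standard library rather than the library's integer mechanism, and the true counts are hypothetical values chosen for illustration, not figures from the survey.

```python
import random


def laplace(scale: float, rng: random.Random) -> float:
    # The difference of two i.i.d. exponentials is Laplace-distributed.
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


rng = random.Random(0)
scale, threshold = 108.0, 2089  # values reported in the query summary

# Hypothetical true counts per (AGE, ILOSTAT) key.
true_counts = {(30, 1): 50_000, (30, 3): 100, (70, 2): 40}

released = {}
for key, count in true_counts.items():
    noisy = round(count + laplace(scale, rng))
    # Keys whose noisy count falls below the threshold are dropped entirely.
    if noisy >= threshold:
        released[key] = noisy

print(released)
```

Only the large group survives: the small groups would need an extremely unlikely noise draw to reach the threshold, so their keys never appear in the output.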

Public Group Keys

The OpenDP Library also allows you to explicitly describe grouping keys as “public information” that can be released in the clear.

Be aware that any aspect of your data labeled “public information” is not subject to privacy protections. For this reason, descriptors should be used conservatively, or not at all. When used judiciously, domain descriptors can improve the utility of your releases without damaging the integrity of your privacy guarantee.

For example, in the Eurostat data, the quarters in which data has been collected may be considered public information. You can mark grouping keys for any combination of year and quarter by setting public_info="keys" as follows:

[6]:
context = dp.Context.compositor(
    data=pl.scan_csv("sample_FR_LFS.csv", ignore_errors=True),
    privacy_unit=dp.unit_of(contributions=36),
    privacy_loss=dp.loss_of(epsilon=1.0 / 3),
    split_evenly_over=1,
    margins={
        # grouping keys by "YEAR" and "QUARTER" are public information
        ("YEAR", "QUARTER"): dp.polars.Margin(
            public_info="keys",
        )
    },
)

It is recommended to only ever create one Context that spans all queries you may make on your data. We create a second context here to demonstrate how margins will influence the analysis.

Due to the existence of this margin descriptor, the library will now release these quarterly keys in the clear.

[7]:
query_quarterly_counts = (
    context.query()
    .group_by("YEAR", "QUARTER")
    .agg(pl.len().dp.noise())
)

summary = query_quarterly_counts.summarize(alpha=.05)
summary
[7]:
shape: (1, 5)
column  aggregate  distribution       scale  accuracy
str     str        str                f64    f64
"len"   "Len"      "Integer Laplace"  108.0  324.037928

This query description no longer has a “threshold” field: all noisy statistics computed on each data partition will be released.

The visualization below includes error bars showing 95% confidence intervals for the true value, using the accuracy estimate from the query description above.

[8]:
df = query_quarterly_counts.release().collect()

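# build a date column from the first month of each quarter:
# quarters 1-4 map to months 1, 4, 7, 10 (assumes QUARTER is coded 1-4)
df = df.with_columns(
    pl.date(pl.col("YEAR"), (pl.col("QUARTER") - 1) * 3 + 1, 1)
).sort("date")  # released rows are shuffled, so sort before drawing a line

line = df.plot.line(x="date", y="len")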
errorbars = df.with_columns(accuracy=summary["accuracy"][0]) \
    .plot.errorbars(x="date", y="len", yerr1="accuracy")
line * errorbars
[8]:

Even though the noise scale and accuracy estimate are the same as in the previous protected group keys query, the relative error is now much larger because the group sizes are much smaller. In spite of this, the release still clearly shows that the number of respondents increased significantly from 2008 to 2010.

Public Group Sizes

It is also possible to declare partition sizes as public information. Setting this value implies that partition keys are public, and thus implies an even greater risk of damaging the integrity of the privacy guarantee than if you were to just specify grouping keys as public.

Nevertheless, this approach has seen use in high-profile data releases, including by the US Census Bureau, in the form of “data invariants”.

One way this could be used is as part of a release of the mean number of hours worked (HWUSUAL) by sex. Referring back to the HWUSUAL legend, a value of 99 means not applicable, and this encoding would significantly bias the outcome of the query.

Unfortunately, filtering the data within the query results in the margin info being invalidated. We are still working on expanding the preprocessing functionality and logic for preserving domain descriptors in the library, but one way to work around this limitation in the meantime is to preprocess your data before passing it into the context:

[9]:
lf_preprocessed = pl.scan_csv("sample_FR_LFS.csv", ignore_errors=True).filter(pl.col("HWUSUAL") < 99)

You can now set up your analysis such that the margin applies to the preprocessed data:

[10]:
context = dp.Context.compositor(
    data=lf_preprocessed,
    privacy_unit=dp.unit_of(contributions=36),
    privacy_loss=dp.loss_of(epsilon=1.0 / 3),
    split_evenly_over=1,
    margins={
        # total number of responses when grouped by "SEX" is public information
        ("SEX",): dp.polars.Margin(
            public_info="lengths",
            max_partition_length=60_000_000, # population of France
            max_num_partitions=1,
        )
    },
)

You can now prepare a query that computes mean working hours by sex, where each response is clipped between 0 and 98:

[11]:
query_work_hours = (
    context.query()
    .group_by("SEX")
    .agg(pl.col("HWUSUAL").fill_null(0).dp.mean((0, 98)))
)

query_work_hours.summarize(alpha=.05)
[11]:
shape: (2, 5)
column     aggregate  distribution     scale        accuracy
str        str        str              f64          f64
"HWUSUAL"  "Sum"      "Float Laplace"  5762.024021  17261.481317
"HWUSUAL"  "Len"      null             null         0.0

This time the query description breaks down into two separate statistics for the resulting “HWUSUAL” column. The mean is computed by dividing separate sum and length estimates. Since the partition length is marked as public information, the length is released in the clear, without any noise.

Reading the table, the smallest suitable noise scale parameter for the sum is 5762. This may seem large, but remember this noisy sum will be divided by the length, reducing the variance of the final estimate. You can divide the sum accuracy estimate by the public partition sizes to get disaggregated accuracy estimates for each partition.
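As a rough illustration of that division, the sketch below converts the sum accuracy into a per-partition accuracy for the mean. The partition sizes are hypothetical placeholders, since the actual public lengths are not shown in this section; substitute the real values for your data.

```python
sum_accuracy = 17261.481317  # 95% accuracy of the noisy sum, from the summary

# Hypothetical public partition sizes keyed by SEX (placeholders, not survey values).
partition_sizes = {1: 90_000, 2: 100_000}

# The length is released without noise, so dividing the sum's accuracy
# by the partition size gives the accuracy of that partition's mean.
mean_accuracy = {sex: sum_accuracy / n for sex, n in partition_sizes.items()}

for sex, acc in mean_accuracy.items():
    print(f"SEX={sex}: mean within about ±{acc:.3f} hours at 95% confidence")
```

With partitions of this size, a sum accuracy in the tens of thousands shrinks to a fraction of an hour for the mean.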

Be mindful that accuracy estimates for sums and means do not take into account bias introduced by clipping. However, for this specific dataset, the codebook limits the range of work hours in HWUSUAL to between 0 and 98, so the clipping introduced for differential privacy will not further increase bias.

[19]:
df = query_work_hours.release().collect()

# released dataframes from the OpenDP Library are shuffled to conceal the ordering of rows in the original dataset
# therefore, to ensure proper alignment, we use join instead of hstack to add labels
pl.DataFrame({"SEX": [1, 2], "SEX_STR": ["male", "female"]}).join(df, on="SEX")
[19]:
shape: (2, 3)
SEX  SEX_STR   HWUSUAL
i64  str       f64
1    "male"    40.66939
2    "female"  34.380722

Throughout this analysis, the noise scale remained fixed regardless of group size, while the magnitude of a count or sum grows with the number of individuals in a group. Therefore, the more individuals there are in a bin, the smaller the noise is relative to the statistic.
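A quick calculation makes this concrete. The sketch below divides the fixed 95% accuracy of a noisy count by several group sizes; the group sizes are hypothetical and chosen only to span a few orders of magnitude.

```python
accuracy = 324.037928  # 95% accuracy of a noisy count, from the count summaries

# Hypothetical true group sizes, for illustration only.
group_sizes = [500, 5_000, 50_000]

# Relative error: the fixed absolute accuracy as a fraction of the group size.
relative_error = {n: accuracy / n for n in group_sizes}

for n, rel in relative_error.items():
    print(f"group size {n:>6}: relative error about {rel:.1%}")
```

Each tenfold increase in group size cuts the relative error by a factor of ten, because the absolute accuracy does not change.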

Grouping is a very useful strategy for meaningful data analysis, but there must be a balance: grouping by too many keys leaves partitions with too few records, so most of your partitions may be filtered out and/or the results may be too noisy to be useful.

While not used in this section, remember that you can request multiple different statistics in one query. Batching multiple statistics together can not only result in computational speedups, but also in a more balanced allocation of privacy budget across queries, and it can help you avoid releasing the same protected grouping keys multiple times (which would further divide your privacy budget).

In a real data setting where you want to mediate all access to the data through one Context, you can provide domain descriptors for many margins, allowing you to relax protections over specific grouping columns, while enforcing full protections for all other grouping columns.