Typical Workflow#

A differentially private analysis in OpenDP typically has the following steps:

  1. Identify the unit of privacy

  2. Set privacy loss parameters

  3. Collect public information

  4. Mediate access to data

  5. Submit DP queries

(Diagram: typical data flow with OpenDP, from a raw CSV to a differentially private release.)

We’ll illustrate these steps by releasing a differentially private mean of a small vector of random numbers. Each code example below appears three times: first using the Python Context API, then using the lower-level Python Framework API, and finally in R.

1. Identify the Unit of Privacy#

The first step in a differentially private analysis is to determine what you are protecting: the unit of privacy.

Releases on the data should conceal the addition or removal of any one individual’s data. If you know that each individual may contribute at most one row to the data set, then the unit of privacy corresponds to one row contribution.

>>> import opendp.prelude as dp
>>> dp.enable_features("contrib")

>>> privacy_unit = dp.unit_of(contributions=1)
>>> privacy_unit
(SymmetricDistance(), 1)

>>> import opendp.prelude as dp
>>> dp.enable_features("contrib")

>>> d_in = 1 # neighboring data set distance is at most d_in...
>>> input_metric = dp.symmetric_distance() # ...in terms of additions/removals
>>> input_domain = dp.vector_domain(dp.atom_domain(T=float))

library(opendp)
enable_features("contrib")

d_in <- 1L # neighboring data set distance is at most d_in...
input_metric <- symmetric_distance() # ...in terms of additions/removals
input_domain <- vector_domain(atom_domain(.T = f64))

The privacy unit specifies how distances are computed between two data sets (input_metric), and how large the distance can be (d_in).

Broadly speaking, differential privacy can be applied to any medium of data for which you can define a unit of privacy. In other contexts, the unit of privacy may correspond to multiple rows, a user ID, or nodes or edges in a graph.

The unit of privacy may also be more general or more precise than a single individual.

  • more general: unit of privacy is an entire household, or a company

  • more precise: unit of privacy is a person-month, or device

It is highly recommended to choose a unit of privacy that is at least as general as an individual.
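For example, if each individual could contribute up to three rows, you would declare a larger distance between neighboring data sets. A minimal sketch reusing dp.unit_of from above (the contribution bound of 3 is hypothetical):

>>> # hypothetical: each individual contributes at most 3 rows
>>> dp.unit_of(contributions=3)
(SymmetricDistance(), 3)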

2. Set Privacy Loss Parameters#

Next, you should determine what level of privacy protection to provide to your units of privacy. This choice may be governed by a variety of factors, such as the amount of harm that individuals could experience if their data were revealed, and your ethical and legal obligations as a data custodian.

The level of privacy afforded to units of privacy in a data set is quantified by privacy-loss parameters. Under pure differential privacy, there is a single privacy-loss parameter, typically denoted epsilon (ε). Epsilon is a non-negative number, where larger values afford less privacy. Epsilon can be viewed as a proxy for the worst-case risk to a unit of privacy. It is customary to refer to a data release with such bounded risk as epsilon-differentially private (ε-DP).

A common rule of thumb is to limit ε to 1.0, but this limit will vary depending on the considerations mentioned above. See Hsu et al. for a more elaborate discussion on setting epsilon.

>>> privacy_loss = dp.loss_of(epsilon=1.)
>>> privacy_loss
(MaxDivergence(f64), 1.0)

>>> d_out = 1. # output distributions have distance at most d_out (ε)...
>>> privacy_measure = dp.max_divergence(T=float) # ...in terms of pure-DP

d_out <- 1.0 # output distributions have distance at most d_out (ε)...
privacy_measure <- max_divergence(.T = f64) # ...in terms of pure-DP

The privacy loss specifies how distances are measured between distributions (privacy_measure), and how large the distance can be (d_out).
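To build intuition for what d_out bounds: under pure DP, adding or removing one unit of privacy changes the probability of any outcome by a factor of at most e^ε. A small illustration of that factor (plain Python arithmetic, not OpenDP API):

>>> import math
>>> math.exp(1.0)  # with ε = 1, any outcome's probability changes by at most ~2.72x
2.718281828459045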

3. Collect Public Information#

The next step is to identify public information about the data set. This could include:

  • Information that is invariant across all potential input data sets

  • Information that is publicly available from other sources

  • Information from other DP releases

Frequently we’ll specify bounds on the data based on prior knowledge of the domain.

>>> bounds = (0.0, 100.0)
>>> imputed_value = 50.0

>>> bounds = (0.0, 100.0)
>>> imputed_value = 50.0

bounds <- c(0.0, 100.0)
imputed_value <- 50.0

A data invariant is information about your data set that you are explicitly choosing not to protect, typically because it is already public or non-sensitive. Be careful: if an invariant does contain sensitive information, you risk violating the privacy of individuals in your data set.

On the other hand, using public information significantly improves the utility of your results.
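For instance, bounds directly determine the sensitivity of downstream aggregates. A rough sketch of the reasoning (plain Python, not OpenDP API): once values are clamped to [0, 100], adding or removing one row changes a sum by at most 100.

>>> lower, upper = bounds
>>> max(abs(lower), abs(upper))  # one addition/removal moves a clamped sum by at most this
100.0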

4. Mediate Access to Data#

Ideally, you have not accessed the sensitive data set before this point, and this step should be the only place in the process where it is accessed. To ensure that your specified differential privacy protections are maintained, the OpenDP Library should mediate all access to the sensitive data set.

>>> from random import randint
>>> data = [float(randint(0, 100)) for _ in range(100)]

>>> context = dp.Context.compositor(
...     data=data,
...     privacy_unit=privacy_unit,
...     privacy_loss=privacy_loss,
...     split_evenly_over=3
... )

dp.Context.compositor creates a sequential composition measurement. You can now submit up to three queries to context, in the form of measurements.

>>> from random import randint
>>> data = [float(randint(0, 100)) for _ in range(100)]

>>> m_sc = dp.c.make_sequential_composition(
...     input_domain=input_domain,
...     input_metric=input_metric,
...     output_measure=privacy_measure,
...     d_in=d_in,
...     d_mids=[d_out / 3] * 3,
... )

>>> # Call measurement with data to create a queryable:
>>> queryable = m_sc(data)

dp.c.make_sequential_composition creates a sequential composition measurement. You can now submit up to three queries to queryable, in the form of measurements.

data <- runif(100L, min = 0.0, max = 100.0)

m_sc <- make_sequential_composition(
  input_domain = input_domain,
  input_metric = input_metric,
  output_measure = privacy_measure,
  d_in = d_in,
  d_mids = rep(d_out / 3L, 3L)
)

# Call measurement with data to create a queryable:
queryable <- m_sc(arg = data) # Different from Python, which does not require "arg".

make_sequential_composition creates a sequential composition measurement. You can now submit up to three queries to queryable, in the form of measurements.

Since the privacy loss budget is at most ε = 1 and we partition it evenly among three queries, each query will be calibrated to satisfy ε = 1/3.
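You can sanity-check the arithmetic of the split (plain Python): the three per-query budgets compose by summation back to the total.

>>> sum([d_out / 3] * 3)
1.0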

5. Submit DP Queries#

You can now create differentially private releases. Here’s a differentially private count:

>>> count_query = (
...     context.query()
...     .count()
...     .laplace()
... )

>>> scale = count_query.param()
>>> scale
3.0000000000000004

>>> accuracy = dp.discrete_laplacian_scale_to_accuracy(scale=scale, alpha=0.05)
>>> accuracy
9.445721638273584

>>> dp_count = count_query.release()
>>> confidence_interval = (dp_count - accuracy, dp_count + accuracy)

That is, with 95% confidence (alpha = 0.05), the exact count lies within the released confidence_interval.

>>> count_transformation = (
...     dp.t.make_count(input_domain, input_metric)
... )

>>> count_sensitivity = count_transformation.map(d_in)
>>> count_sensitivity
1

>>> count_measurement = dp.binary_search_chain(
...     lambda scale: count_transformation >> dp.m.then_laplace(scale),
...     d_in,
...     d_out / 3
... )
>>> dp_count = queryable(count_measurement)

count_transformation <- (
  make_count(input_domain, input_metric)
)

count_sensitivity <- count_transformation(d_in = d_in) # Different from Python, which uses ".map".
cat("count_sensitivity:", count_sensitivity, "\n")
# 1

count_measurement <- binary_search_chain(
  function(scale) count_transformation |> then_laplace(scale), d_in, d_out / 3L
)
dp_count <- queryable(query = count_measurement) # Different from Python, which does not require "query".
cat("dp_count:", dp_count, "\n")

Here’s a differentially private mean:

>>> mean_query = (
...     context.query()
...     .clamp(bounds)
...     .resize(size=dp_count, constant=imputed_value)
...     .mean()
...     .laplace()
... )

>>> dp_mean = mean_query.release()

>>> mean_transformation = (
...     dp.t.make_clamp(input_domain, input_metric, bounds) >>
...     dp.t.then_resize(size=dp_count, constant=imputed_value) >>
...     dp.t.then_mean()
... )

>>> mean_measurement = dp.binary_search_chain(
...     lambda scale: mean_transformation >> dp.m.then_laplace(scale), d_in, d_out / 3
... )

>>> dp_mean = queryable(mean_measurement)

mean_transformation <- (
  make_clamp(input_domain, input_metric, bounds)
  |> then_resize(size = dp_count, constant = imputed_value)
  |> then_mean()
)

mean_measurement <- binary_search_chain(
  function(scale) mean_transformation |> then_laplace(scale), d_in, d_out / 3L
)

dp_mean <- queryable(query = mean_measurement) # Different from Python, which does not require "query".
cat("dp_mean:", dp_mean, "\n")

Other features#

The OpenDP Library supports more statistics, like the variance, various ways to compute histograms and quantiles, and PCA. The library also supports other mechanisms:

  • the Gaussian Mechanism, which provides tighter privacy accounting when releasing a large number of queries (see the sketch below)

  • the Thresholded Laplace Mechanism, for releasing counts on data sets with unknown key sets

  • variations of randomized response
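As a hedged sketch of the Gaussian Mechanism mentioned above: it is accounted for under zero-concentrated DP (parameter ρ) rather than pure ε-DP, so it pairs with a different privacy measure. Assuming the Framework API names used earlier in this document:

>>> # scalar input space: one float aggregate, with sensitivity measured by absolute distance
>>> input_space = dp.atom_domain(T=float), dp.absolute_distance(T=float)
>>> m_gauss = dp.m.make_gaussian(*input_space, scale=1.0)
>>> m_gauss.map(d_in=1.0)  # privacy loss in terms of ρ: (d_in / scale)^2 / 2
0.5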