Differential Privacy with OpenDP#

This notebook brings together two threads:

  • The previous page introduced basic DP ideas like sensitivity and epsilon, as well as terms that have particular meaning in OpenDP: transformations, measures, measurements, and stability.

  • The User Guide introduced OpenDP as a programming framework, without diving into the mathematics.

Here we’ll see how the programming framework is related to the underlying mathematics of differential privacy. The modular framework helps ensure that we’re using the tools of DP appropriately, but also has the flexibility to explore different approaches to DP.


Any constructors that have not completed the proof-writing and vetting process may still be accessed if you opt-in to “contrib”. Please contact us if you are interested in proof-writing. Thank you!

>>> import opendp.prelude as dp
>>> dp.enable_features("contrib")

The Laplace Mechanism#

The Laplace mechanism is a ubiquitous algorithm in the DP ecosystem that is typically used to privatize an aggregate, like a sum or mean.

An instance of the Laplace mechanism is captured by a measurement containing the following five elements:

  1. We first define the function \(f(\cdot)\), that applies the Laplace mechanism to some argument \(x\). This function simply samples from the Laplace distribution centered at \(x\), with a fixed noise scale.

\[f(x) = Laplace(\mu=x, b=scale)\]
  1. Importantly, \(f(\cdot)\) is only well-defined for any finite float input. This set of permitted inputs is described by the input domain atom_domain(T=f64).

  2. The Laplace mechanism has a privacy guarantee in terms of epsilon. This guarantee is represented by a privacy map, a function that computes the privacy loss \(\epsilon\) for any choice of sensitivity \(\Delta\).

\[map(\Delta) = \Delta / scale \le \epsilon\]
  1. This map only promises that the privacy loss will be at most \(\epsilon\) if inputs from any two neighboring datasets may differ by no more than some quantity \(\Delta\) under the absolute distance input metric absolute_distance(T=f64).

  2. We similarly describe units on the output (\(\epsilon\)) via the output measure max_divergence().

The OpenDP Library consists of constructor functions that can be called with simple arguments and always return valid measurements. The make_laplace constructor function returns the equivalent of the Laplace measurement described above. Since it returns a measurement, you can find it under dp.m:

>>> # call the constructor to produce the measurement `base_lap`
>>> base_lap = dp.m.make_laplace(
...     dp.atom_domain(T=float),
...     dp.absolute_distance(T=float),
...     scale=5.
... )

The supporting elements on this transformation match those described above:

>>> print("input domain:  ", base_lap.input_domain)
input domain:   AtomDomain(T=f64)
>>> print("input metric:  ", base_lap.input_metric)
input metric:   AbsoluteDistance(f64)
>>> print("output measure:", base_lap.output_measure)
output measure: MaxDivergence

We now invoke the measurement on some aggregate 0., to sample \(Laplace(\mu=0., scale=5.)\) noise:

>>> aggregate = 0.
>>> print("noisy aggregate:", base_lap(aggregate))
noisy aggregate: ...

If we are using base_lap on its own, we must know the sensitivity of aggregate (i.e. how much the aggregate can differ on two adjacent datasets) to determine epsilon. In this case, we know base_lap has an absolute distance input metric, so the sensitivity should represent the greatest possible absolute distance between aggregates on adjacent datasets.

>>> absolute_distance = 10.
>>> print("epsilon:", base_lap.map(d_in=absolute_distance))
epsilon: 2.0

This tells us that when the sensitivity is 10, and the noise scale is 5, the epsilon consumption of a release is 2.

Transformation Example: Sum#

We package computations with bounded stability into transformations.

A transformation that computes the sum of a vector dataset contains a very similar set of six elements:

  1. We first define the function \(f(\cdot)\), that computes the sum of some argument \(x\).

\[f(x) = \sum x_i\]
  1. \(f(\cdot)\) is only well-defined for any vector input of a specific type. Each element must be bounded between some lower bound L and upper bound U. Thus the input domain is of type vector_domain(atom_domain(T=f64)) with elements restricted between L and U.

  2. The output domain consists of any single finite f64 scalar: atom_domain(T=f64).

  3. The sum transformation has a stability guarantee in terms of sensitivity. This guarantee is represented by a stability map, which is a function that computes the stability \(d_{out}\) for any choice of dataset distance \(d_{in}\). In this case \(d_{out}\) is in terms of the sensitivity.

\[map(d_{in}) = d_{in} \cdot \max(|L|, U) \le d_{out}\]
  1. This map only promises a sensitivity of \(d_{out}\) under the assumption that neighboring datasets differ by no more than some quantity \(d_{in}\) under the symmetric distance input metric symmetric_distance().

  2. The sensitivity is computed with respect to the absolute distance. This gives units to the output (\(d_{out}\)) via the output metric absolute_distance(T=f64).

make_sum constructs the equivalent of the sum transformation described above. It is important to note that since the bounds are float, the resulting transformation is calibrated to work for floating-point numbers. You will need to be careful and intentional about the types you use. Since it returns a transformation, you can find it under dp.t:

>>> # call the constructor to produce the transformation `bounded_sum`
>>> # notice that `make_sum` expects an input domain consisting of bounded data:
>>> input_domain = dp.vector_domain(dp.atom_domain(bounds=(0., 5.)))
>>> bounded_sum = dp.t.make_sum(input_domain, dp.symmetric_distance())

According to the documentation, this transformation expects a vector of data with non-null elements bounded between 0. and 5.. We now invoke the transformation on some mock dataset that satisfies this constraint. Remember that since this component is a transformation, and not a measurement, the resulting output is not differentially private.

>>> # under the condition that the input data is a member of the input domain...
>>> bounded_mock_dataset = [1.3, 3.8, 0., 5.]
>>> # ...the exact sum is:
>>> bounded_sum(bounded_mock_dataset)
10.1

It can help to understand a simple example of how a stability map works, but going forward you don’t need to understand why the maps give the numbers they give in order to use the library.

The stability argument for this transformation’s advertised sensitivity goes roughly as follows:

If the input data consists of numbers bounded between 0. and 5.,
then the addition or removal of any one row can influence the sum by \(max(|0.|, 5.)\).
In addition, if one individual may contribute up to k rows,
then the sensitivity should further be multiplied by k.

In practice, the calculated sensitivity may be larger under certain conditions to account for the inexactness of arithmetic on finite data types.

>>> # under the condition that one individual may contribute up to 2 records to `bounded_mock_dataset`...
>>> max_contributions = 2
>>> # ...then the sensitivity, expressed in terms of the absolute distance, is:
>>> bounded_sum.map(d_in=max_contributions)
10.0...

As we would expect, the sensitivity is roughly 2 * max(|0.|, 5.).

Transformation Example: Clamp#

The sum transformation has an input domain of vectors with bounded elements. We now construct a transformation that clamps/clips each element to a given set of bounds.

Instead of listing the components of a clamp transformation as I’ve done above, going forward you can check the **Supporting Elements** section of the relevant API documentation entry:

>>> help(dp.t.make_clamp)
Help on function make_clamp in module opendp.transformations:
...

Documentation for specific types may be found behind the following links:

>>> input_domain = dp.vector_domain(dp.atom_domain(T=float))
>>> input_metric = dp.symmetric_distance()

>>> # call the constructor to produce the transformation `clamp`
>>> clamp = dp.t.make_clamp(input_domain, input_metric, bounds=(0., 5.))

>>> # `clamp` expects vectors of non-null, unbounded elements
>>> mock_dataset = [1.3, 7.8, -2.5, 7.0]
>>> # `clamp` emits data that is suitable for `bounded_sum`
>>> clamp(mock_dataset)
[1.3, 5.0, 0.0, 5.0]

According to the API documentation, the input and output metric is set by the user. We passed in a symmetric distance metric. Therefore, the stability map accepts a dataset distance describing the maximum number of contributions an individual may make, and emits the same.

The stability argument for the clamp transformation is very simple:

If an individual may influence at most k records in a dataset, then after clamping each element,
an individual may still influence at most k records in a dataset.
>>> # dataset distance in... dataset distance out
>>> clamp.map(max_contributions)
2

Chaining#

The OpenDP library supports chaining a transformation with a transformation to produce a compound transformation, or a transformation with a measurement to produce a compound measurement.

When any two compatible computations are chained, all six components of each primitive are used to construct the new primitive.

A measurement produced from chaining a transformation with a measurement contains the same set of six elements as in previous examples:

  1. A function \(f(\cdot)\). When you chain, the output domain of the transformation must match the input domain of the measurement.

\[f(x) = measurement(transformation(x))\]
  1. The input domain from the transformation.

  2. The output domain from the measurement.

  3. A privacy_map \(map(\cdot)\). When you chain, the output metric of the transformation must match the input metric of the measurement.

\[map(d_{in}) = measurement.map(transformation.map(d_{in}))\]
  1. The input metric from the transformation.

  2. The output measure from the measurement.

A similar logic is used when chaining a transformation with a transformation.

We know that the

  • output domain of bounded_sum matches the input domain of base_lap, and the

  • output metric of bounded_sum matches the input metric of base_lap.

The same holds for clamp and bounded_sum. Therefore, we can chain all of these primitives to form a new compound measurement:

>>> dp_sum = clamp >> bounded_sum >> base_lap

>>> # compute the DP sum of a dataset of bounded elements
>>> print("DP sum:", dp_sum(mock_dataset))
DP sum: ...

>>> # evaluate the privacy loss of the dp_sum, when an individual can contribute at most 2 records
>>> print("epsilon:", dp_sum.map(d_in=max_contributions))
epsilon: ...

Retrospective#

Now that you have a more thorough understanding of what’s going on, we can breeze through an entire release:

>>> # establish public info
>>> max_contributions = 2
>>> bounds = (0., 5.)

>>> # construct the measurement
>>> dp_sum = (
...     dp.t.make_clamp(dp.vector_domain(dp.atom_domain(T=float)), dp.symmetric_distance(), bounds) >>
...     dp.t.make_sum(dp.vector_domain(dp.atom_domain(bounds=bounds)), dp.symmetric_distance()) >>
...     dp.m.make_laplace(dp.atom_domain(T=float), dp.absolute_distance(T=float), 5.)
... )

>>> # evaluate the privacy expenditure and make a DP release
>>> mock_dataset = [0.7, -0.3, 1., -1.]
>>> print("epsilon:", dp_sum.map(max_contributions))
epsilon: ...
>>> print("DP sum release:", dp_sum(mock_dataset))
DP sum release: ...

Partial Constructors#

You may notice some redundancy in the code for dp_sum above: The output domain of a transformation will always match the input of its successor. We can make this shorter by using then_* constructors: These are paired with make_* constructors, but delay application of the input_domain and input_metric arguments. We can rewrite dp_sum in an equivalent but more concise form:

>>> dp_sum = (
...     (input_domain, input_metric) >>
...     dp.t.then_clamp((0., 5.)) >>
...     dp.t.then_sum() >>
...     dp.m.then_laplace(5.)
... )

You’ll notice that the start of the chain is special: We provide a tuple to specify the input_domain and input_metric for then_clamp.