Differential Privacy with OpenDP#
This notebook brings together two threads:
The previous page introduced basic DP ideas like sensitivity and epsilon, as well as terms that have particular meaning in OpenDP: transformations, measures, measurements, and stability.
The User Guide introduced OpenDP as a programming framework, without diving into the mathematics.
Here we’ll see how the programming framework is related to the underlying mathematics of differential privacy. The modular framework helps ensure that we’re using the tools of DP appropriately, but also has the flexibility to explore different approaches to DP.
Any constructors that have not completed the proof-writing and vetting process may still be accessed if you opt-in to “contrib”. Please contact us if you are interested in proof-writing. Thank you!
>>> import opendp.prelude as dp
>>> dp.enable_features("contrib")
The Laplace Mechanism#
The Laplace mechanism is a ubiquitous algorithm in the DP ecosystem that is typically used to privatize an aggregate, like a sum or mean.
An instance of the Laplace mechanism is captured by a measurement containing the following five elements:
We first define the function \(f(\cdot)\), that applies the Laplace mechanism to some argument \(x\). This function simply samples from the Laplace distribution centered at \(x\), with a fixed noise scale.
\[f(x) = Laplace(\mu=x, b=scale)\]
Importantly, \(f(\cdot)\) is only well-defined for any finite float input. This set of permitted inputs is described by the input domain
atom_domain(T=f64)
.The Laplace mechanism has a privacy guarantee in terms of epsilon. This guarantee is represented by a privacy map, a function that computes the privacy loss \(\epsilon\) for any choice of sensitivity \(\Delta\).
\[map(\Delta) = \Delta / scale \le \epsilon\]
This map only promises that the privacy loss will be at most \(\epsilon\) if inputs from any two neighboring datasets may differ by no more than some quantity \(\Delta\) under the absolute distance input metric
absolute_distance(T=f64)
.We similarly describe units on the output (\(\epsilon\)) via the output measure
max_divergence()
.
The OpenDP Library consists of constructor functions that can be
called with simple arguments and always return valid measurements. The
make_laplace
constructor function returns the equivalent of the
Laplace measurement described above. Since it returns a measurement, you
can find it under dp.m
:
>>> # call the constructor to produce the measurement `base_lap`
>>> base_lap = dp.m.make_laplace(
... dp.atom_domain(T=float),
... dp.absolute_distance(T=float),
... scale=5.
... )
The supporting elements on this transformation match those described above:
>>> print("input domain: ", base_lap.input_domain)
input domain: AtomDomain(T=f64)
>>> print("input metric: ", base_lap.input_metric)
input metric: AbsoluteDistance(f64)
>>> print("output measure:", base_lap.output_measure)
output measure: MaxDivergence
We now invoke the measurement on some aggregate 0.
, to sample
\(Laplace(\mu=0., scale=5.)\) noise:
>>> aggregate = 0.
>>> print("noisy aggregate:", base_lap(aggregate))
noisy aggregate: ...
If we are using base_lap
on its own, we must know the sensitivity of
aggregate
(i.e. how much the aggregate can differ on two adjacent
datasets) to determine epsilon. In this case, we know base_lap
has
an absolute distance input metric, so the sensitivity should represent
the greatest possible absolute distance between aggregates on adjacent
datasets.
>>> absolute_distance = 10.
>>> print("epsilon:", base_lap.map(d_in=absolute_distance))
epsilon: 2.0
This tells us that when the sensitivity is 10
, and the noise scale
is 5
, the epsilon consumption of a release is 2
.
Transformation Example: Sum#
We package computations with bounded stability into transformations.
A transformation that computes the sum of a vector dataset contains a very similar set of six elements:
We first define the function \(f(\cdot)\), that computes the sum of some argument \(x\).
\[f(x) = \sum x_i\]
\(f(\cdot)\) is only well-defined for any vector input of a specific type. Each element must be bounded between some lower bound
L
and upper boundU
. Thus the input domain is of typevector_domain(atom_domain(T=f64))
with elements restricted betweenL
andU
.The output domain consists of any single finite
f64
scalar:atom_domain(T=f64)
.The sum transformation has a stability guarantee in terms of sensitivity. This guarantee is represented by a stability map, which is a function that computes the stability \(d_{out}\) for any choice of dataset distance \(d_{in}\). In this case \(d_{out}\) is in terms of the sensitivity.
\[map(d_{in}) = d_{in} \cdot \max(|L|, U) \le d_{out}\]
This map only promises a sensitivity of \(d_{out}\) under the assumption that neighboring datasets differ by no more than some quantity \(d_{in}\) under the symmetric distance input metric
symmetric_distance()
.The sensitivity is computed with respect to the absolute distance. This gives units to the output (\(d_{out}\)) via the output metric
absolute_distance(T=f64)
.
make_sum
constructs the equivalent of the sum transformation
described above. It is important to note that since the bounds are
float, the resulting transformation is calibrated to work for
floating-point numbers. You will need to be careful and intentional
about the types you use. Since it returns a transformation, you can find
it under dp.t
:
>>> # call the constructor to produce the transformation `bounded_sum`
>>> # notice that `make_sum` expects an input domain consisting of bounded data:
>>> input_domain = dp.vector_domain(dp.atom_domain(bounds=(0., 5.)))
>>> bounded_sum = dp.t.make_sum(input_domain, dp.symmetric_distance())
According to the documentation, this transformation expects a vector of
data with non-null elements bounded between 0.
and 5.
. We now
invoke the transformation on some mock dataset that satisfies this
constraint. Remember that since this component is a transformation, and
not a measurement, the resulting output is not differentially private.
>>> # under the condition that the input data is a member of the input domain...
>>> bounded_mock_dataset = [1.3, 3.8, 0., 5.]
>>> # ...the exact sum is:
>>> bounded_sum(bounded_mock_dataset)
10.1
It can help to understand a simple example of how a stability map works, but going forward you don’t need to understand why the maps give the numbers they give in order to use the library.
The stability argument for this transformation’s advertised sensitivity goes roughly as follows:
If the input data consists of numbers bounded between 0. and 5.,then the addition or removal of any one row can influence the sum by \(max(|0.|, 5.)\).In addition, if one individual may contribute up to k rows,then the sensitivity should further be multiplied by k.
In practice, the calculated sensitivity may be larger under certain conditions to account for the inexactness of arithmetic on finite data types.
>>> # under the condition that one individual may contribute up to 2 records to `bounded_mock_dataset`...
>>> max_contributions = 2
>>> # ...then the sensitivity, expressed in terms of the absolute distance, is:
>>> bounded_sum.map(d_in=max_contributions)
10.0...
As we would expect, the sensitivity is roughly 2 * max(|0.|, 5.)
.
Transformation Example: Clamp#
The sum transformation has an input domain of vectors with bounded elements. We now construct a transformation that clamps/clips each element to a given set of bounds.
Instead of listing the components of a clamp transformation as I’ve done
above, going forward you can check the **Supporting Elements**
section of the relevant API documentation entry:
>>> help(dp.t.make_clamp)
Help on function make_clamp in module opendp.transformations:
...
Documentation for specific types may be found behind the following links:
>>> input_domain = dp.vector_domain(dp.atom_domain(T=float))
>>> input_metric = dp.symmetric_distance()
>>> # call the constructor to produce the transformation `clamp`
>>> clamp = dp.t.make_clamp(input_domain, input_metric, bounds=(0., 5.))
>>> # `clamp` expects vectors of non-null, unbounded elements
>>> mock_dataset = [1.3, 7.8, -2.5, 7.0]
>>> # `clamp` emits data that is suitable for `bounded_sum`
>>> clamp(mock_dataset)
[1.3, 5.0, 0.0, 5.0]
According to the API documentation, the input and output metric is set by the user. We passed in a symmetric distance metric. Therefore, the stability map accepts a dataset distance describing the maximum number of contributions an individual may make, and emits the same.
The stability argument for the clamp transformation is very simple:
If an individual may influence at most k records in a dataset, then after clamping each element,an individual may still influence at most k records in a dataset.
>>> # dataset distance in... dataset distance out
>>> clamp.map(max_contributions)
2
Chaining#
The OpenDP library supports chaining a transformation with a transformation to produce a compound transformation, or a transformation with a measurement to produce a compound measurement.
When any two compatible computations are chained, all six components of each primitive are used to construct the new primitive.
A measurement produced from chaining a transformation with a measurement contains the same set of six elements as in previous examples:
A function \(f(\cdot)\). When you chain, the output domain of the transformation must match the input domain of the measurement.
\[f(x) = measurement(transformation(x))\]
The input domain from the transformation.
The output domain from the measurement.
A privacy_map \(map(\cdot)\). When you chain, the output metric of the transformation must match the input metric of the measurement.
\[map(d_{in}) = measurement.map(transformation.map(d_{in}))\]
The input metric from the transformation.
The output measure from the measurement.
A similar logic is used when chaining a transformation with a transformation.
We know that the
output domain of
bounded_sum
matches the input domain ofbase_lap
, and theoutput metric of
bounded_sum
matches the input metric ofbase_lap
.
The same holds for clamp
and bounded_sum
. Therefore, we can
chain all of these primitives to form a new compound measurement:
>>> dp_sum = clamp >> bounded_sum >> base_lap
>>> # compute the DP sum of a dataset of bounded elements
>>> print("DP sum:", dp_sum(mock_dataset))
DP sum: ...
>>> # evaluate the privacy loss of the dp_sum, when an individual can contribute at most 2 records
>>> print("epsilon:", dp_sum.map(d_in=max_contributions))
epsilon: ...
Retrospective#
Now that you have a more thorough understanding of what’s going on, we can breeze through an entire release:
>>> # establish public info
>>> max_contributions = 2
>>> bounds = (0., 5.)
>>> # construct the measurement
>>> dp_sum = (
... dp.t.make_clamp(dp.vector_domain(dp.atom_domain(T=float)), dp.symmetric_distance(), bounds) >>
... dp.t.make_sum(dp.vector_domain(dp.atom_domain(bounds=bounds)), dp.symmetric_distance()) >>
... dp.m.make_laplace(dp.atom_domain(T=float), dp.absolute_distance(T=float), 5.)
... )
>>> # evaluate the privacy expenditure and make a DP release
>>> mock_dataset = [0.7, -0.3, 1., -1.]
>>> print("epsilon:", dp_sum.map(max_contributions))
epsilon: ...
>>> print("DP sum release:", dp_sum(mock_dataset))
DP sum release: ...
Partial Constructors#
You may notice some redundancy in the code for dp_sum
above: The
output domain of a transformation will always match the input of its
successor. We can make this shorter by using then_*
constructors:
These are paired with make_*
constructors, but delay application of
the input_domain
and input_metric
arguments. We can rewrite
dp_sum
in an equivalent but more concise form:
>>> dp_sum = (
... (input_domain, input_metric) >>
... dp.t.then_clamp((0., 5.)) >>
... dp.t.then_sum() >>
... dp.m.then_laplace(5.)
... )
You’ll notice that the start of the chain is special: We provide a tuple
to specify the input_domain
and input_metric
for then_clamp
.