Getting Started#
This notebook is an introduction to programming in the OpenDP Library.
Before we get started, the notebook A Framework to Understand DP provides useful background. It explains
* basic terminology like sensitivity and epsilon
* how we define transformations, measurements, stability and privacy
Any constructors that have not completed the proof-writing and vetting process may still be accessed if you opt-in to “contrib”. Please contact us if you are interested in proof-writing. Thank you!
[2]:
import opendp.prelude as dp
dp.enable_features("contrib")
The Laplace Mechanism#
The Laplace mechanism is a ubiquitous algorithm in the DP ecosystem that is typically used to privatize an aggregate, like a sum or mean.
An instance of the Laplace mechanism is captured by a measurement containing the following five elements:
We first define the function \(f(\cdot)\) that applies the Laplace mechanism to some argument \(x\). This function simply samples from the Laplace distribution centered at \(x\), with a fixed noise scale.
\[f(x) = Laplace(\mu=x, b=scale)\]
Importantly, \(f(\cdot)\) is only well-defined for finite float inputs. This set of permitted inputs is described by the input domain atom_domain(T=f64).
The Laplace mechanism has a privacy guarantee in terms of epsilon. This guarantee is represented by a privacy map, a function that computes the privacy loss \(\epsilon\) for any choice of sensitivity \(\Delta\).
\[map(\Delta) = \Delta / scale \le \epsilon\]
This map only promises that the privacy loss will be at most \(\epsilon\) if inputs from any two neighboring datasets may differ by no more than some quantity \(\Delta\) under the absolute distance input metric absolute_distance(T=f64).
We similarly describe units on the output (\(\epsilon\)) via the output measure max_divergence(T=f64).
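Conceptually, both the function and the privacy map are small pieces of logic. The sketch below is only an illustration of the math above, using numpy's Laplace sampler; it is not how OpenDP implements the mechanism internally, and the constructor introduced next packages the vetted versions of these pieces.
import numpy as np

scale = 5.

# f(x): sample from a Laplace distribution centered at the aggregate x (illustration only)
def f(x: float) -> float:
    return np.random.laplace(loc=x, scale=scale)

# privacy map: privacy loss when inputs may differ by at most d_in (the sensitivity)
def privacy_map(d_in: float) -> float:
    return d_in / scale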
The OpenDP Library consists of constructor functions that can be called with simple arguments and always return valid measurements. The make_laplace constructor function returns the equivalent of the Laplace measurement described above. Since it returns a measurement, you can find it under dp.m:
[2]:
# call the constructor to produce the measurement `base_lap`
base_lap = dp.m.make_laplace(
    dp.atom_domain(T=float),
    dp.absolute_distance(T=float),
    scale=5.
)
The supporting elements on this measurement match those described above:
[3]:
print("input domain: ", base_lap.input_domain)
print("input metric: ", base_lap.input_metric)
print("output measure:", base_lap.output_measure)
input domain: AtomDomain(T=f64)
input metric: AbsoluteDistance(f64)
output measure: MaxDivergence(f64)
We now invoke the measurement on some aggregate 0., to sample \(Laplace(\mu=0., scale=5.)\) noise:
[4]:
aggregate = 0.
print("noisy aggregate:", base_lap(aggregate))
noisy aggregate: -3.462086040297409
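Because the measurement is randomized, each invocation draws a fresh noise sample; running it a few more times gives different results (outputs will vary):
# three more independent draws of Laplace(mu=0., scale=5.) noise
print([base_lap(aggregate) for _ in range(3)])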
If we are using base_lap on its own, we must know the sensitivity of aggregate (i.e. how much the aggregate can differ on two adjacent datasets) to determine epsilon. In this case, we know base_lap has an absolute distance input metric, so the sensitivity should represent the greatest possible absolute distance between aggregates on adjacent datasets.
[5]:
absolute_distance = 10.
print("epsilon:", base_lap.map(d_in=absolute_distance))
epsilon: 2.0
This tells us that when the sensitivity is 10, and the noise scale is 5, the epsilon consumption of a release is 2.
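This is just the privacy map formula from above; a quick check by hand, using the scale of 5 chosen earlier, matches the reported epsilon.
# sensitivity / scale = 10. / 5. = 2.0, matching base_lap.map(d_in=10.) above
print("epsilon by hand:", absolute_distance / 5.)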
Transformation Example: Sum#
We package computations with bounded stability into transformations.
A transformation that computes the sum of a vector dataset contains a very similar set of six elements:
We first define the function \(f(\cdot)\) that computes the sum of some argument \(x\).
\[f(x) = \sum x_i\]
\(f(\cdot)\) is only well-defined for vector inputs of a specific type. Each element must be bounded between some lower bound L and upper bound U. Thus the input domain is of type vector_domain(atom_domain(T=f64)) with elements restricted between L and U.
The output domain consists of any single finite f64 scalar: atom_domain(T=f64).
The sum transformation has a stability guarantee in terms of sensitivity. This guarantee is represented by a stability map, which is a function that computes the stability \(d_{out}\) for any choice of dataset distance \(d_{in}\). In this case \(d_{out}\) is in terms of the sensitivity.
\[map(d_{in}) = d_{in} \cdot \max(|L|, U) \le d_{out}\]
This map only promises a sensitivity of \(d_{out}\) under the assumption that neighboring datasets differ by no more than some quantity \(d_{in}\) under the symmetric distance input metric symmetric_distance().
The sensitivity is computed with respect to the absolute distance. This gives units to the output (\(d_{out}\)) via the output metric absolute_distance(T=f64).
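For intuition only, here is a rough plain-Python sketch (hypothetical, not the library's code) of just the function and stability map described above; the real constructor, used next, also accounts for the inexactness of floating-point arithmetic.
def make_sum_sketch(L, U):
    # the function: sum the elements of the vector
    def f(x):
        return sum(x)
    # the stability map: d_in * max(|L|, U)
    def stability_map(d_in):
        return d_in * max(abs(L), U)
    return f, stability_map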
make_sum constructs the equivalent of the sum transformation described above. It is important to note that since the bounds are float, the resulting transformation is calibrated to work for floating-point numbers. You will need to be careful and intentional about the types you use. Since it returns a transformation, you can find it under dp.t:
[6]:
# call the constructor to produce the transformation `bounded_sum`
# notice that `make_sum` expects an input domain consisting of bounded data:
input_domain = dp.vector_domain(dp.atom_domain(bounds=(0., 5.)))
bounded_sum = dp.t.make_sum(input_domain, dp.symmetric_distance())
According to the documentation, this transformation expects a vector of data with non-null elements bounded between 0. and 5. We now invoke the transformation on some mock dataset that satisfies this constraint. Remember that since this component is a transformation, and not a measurement, the resulting output is not differentially private.
[7]:
# under the condition that the input data is a member of the input domain...
bounded_mock_dataset = [1.3, 3.8, 0., 5.]
# ...the exact sum is:
bounded_sum(bounded_mock_dataset)
[7]:
10.1
It can help to walk through a simple example of how a stability map works, but going forward you don’t need to understand why the maps give the numbers they do in order to use the library.
The stability argument for this transformation’s advertised sensitivity goes roughly as follows:
If the input data consists of numbers bounded between 0. and 5., then the addition or removal of any one row can influence the sum by \(\max(|0.|, 5.)\). In addition, if one individual may contribute up to k rows, then the sensitivity should further be multiplied by k.
In practice, the calculated sensitivity may be larger under certain conditions to account for the inexactness of arithmetic on finite data types.
[8]:
# under the condition that one individual may contribute up to 2 records to `bounded_mock_dataset`...
max_contributions = 2
# ...then the sensitivity, expressed in terms of the absolute distance, is:
bounded_sum.map(d_in=max_contributions)
[8]:
10.00000004656613
As we would expect, the sensitivity is roughly 2 * max(|0.|, 5.).
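The small excess over 10 is the conservative correction for floating-point rounding mentioned above; otherwise the hand computation and the library's map agree:
print("by hand:", max_contributions * max(abs(0.), 5.))    # 10.0
print("library:", bounded_sum.map(d_in=max_contributions)) # 10.00000004656613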
Transformation Example: Clamp#
The sum transformation has an input domain of vectors with bounded elements. We now construct a transformation that clamps/clips each element to a given set of bounds.
Instead of listing the components of a clamp transformation as I’ve done above, going forward you can check the **Supporting Elements** section of the relevant API documentation entry:
[3]:
help(dp.t.make_clamp)
Help on function make_clamp in module opendp.transformations:
make_clamp(input_domain: opendp.mod.Domain, input_metric: opendp.mod.Metric, bounds: Tuple[Any, Any]) -> opendp.mod.Transformation
Make a Transformation that clamps numeric data in `Vec<TA>` to `bounds`.
If datum is less than lower, let datum be lower.
If datum is greater than upper, let datum be upper.
[make_clamp in Rust documentation.](https://docs.rs/opendp/latest/opendp/transformations/fn.make_clamp.html)
**Supporting Elements:**
* Input Domain: `VectorDomain<AtomDomain<TA>>`
* Output Domain: `VectorDomain<AtomDomain<TA>>`
* Input Metric: `M`
* Output Metric: `M`
**Proof Definition:**
[(Proof Document)](https://docs.opendp.org/en/latest/proofs/rust/src/transformations/clamp/make_clamp.pdf)
:param input_domain: Domain of input data.
:type input_domain: Domain
:param input_metric: Metric on input domain.
:type input_metric: Metric
:param bounds: Tuple of inclusive lower and upper bounds.
:type bounds: Tuple[Any, Any]
:rtype: Transformation
:raises TypeError: if an argument's type differs from the expected type
:raises UnknownTypeException: if a type argument fails to parse
:raises OpenDPException: packaged error from the core OpenDP library
[10]:
input_domain = dp.vector_domain(dp.atom_domain(T=float))
input_metric = dp.symmetric_distance()
# call the constructor to produce the transformation `clamp`
clamp = dp.t.make_clamp(input_domain, input_metric, bounds=(0., 5.))
# `clamp` expects vectors of non-null, unbounded elements
mock_dataset = [1.3, 7.8, -2.5, 7.0]
# `clamp` emits data that is suitable for `bounded_sum`
clamp(mock_dataset)
[10]:
[1.3, 5.0, 0.0, 5.0]
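Since the clamped output satisfies the bounds that bounded_sum expects, the two steps can already be composed by hand; chaining, introduced below, automates exactly this kind of composition.
# clamp each element into [0., 5.], then sum the clamped values
bounded_sum(clamp(mock_dataset))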
According to the API documentation, the input and output metric is set by the user. We passed in a symmetric distance metric. Therefore, the stability map accepts a dataset distance describing the maximum number of contributions an individual may make, and emits the same.
The stability argument for the clamp transformation is very simple:
If an individual may influence at most k records in a dataset, then after clamping each element, an individual may still influence at most k records in a dataset.
[11]:
# dataset distance in... dataset distance out
clamp.map(max_contributions)
[11]:
2
Chaining#
The OpenDP library supports chaining a transformation with a transformation to produce a compound transformation, or a transformation with a measurement to produce a compound measurement.
When any two compatible computations are chained, all six components of each primitive are used to construct the new primitive.
A measurement produced from chaining a transformation with a measurement contains the same set of six elements as in previous examples:
A function \(f(\cdot)\). When you chain, the output domain of the transformation must match the input domain of the measurement.
\[f(x) = measurement(transformation(x))\]
The input domain from the transformation.
The output domain from the measurement.
A privacy_map \(map(\cdot)\). When you chain, the output metric of the transformation must match the input metric of the measurement.
\[map(d_{in}) = measurement.map(transformation.map(d_{in}))\]
The input metric from the transformation.
The output measure from the measurement.
A similar logic is used when chaining a transformation with a transformation.
We know that the output domain of bounded_sum matches the input domain of base_lap, and the output metric of bounded_sum matches the input metric of base_lap. The same holds for clamp and bounded_sum. Therefore, we can chain all of these primitives to form a new compound measurement:
[12]:
dp_sum = clamp >> bounded_sum >> base_lap
# compute the DP sum of a dataset of bounded elements
print("DP sum:", dp_sum(mock_dataset))
# evaluate the privacy loss of the dp_sum, when an individual can contribute at most 2 records
print("epsilon:", dp_sum.map(d_in=max_contributions))
DP sum: 18.552569770564993
epsilon: 2.000000009313226
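The compound privacy map is simply the composition of the intermediate maps, so the epsilon above can be reproduced by hand:
# same value as dp_sum.map(d_in=max_contributions)
print(base_lap.map(bounded_sum.map(clamp.map(max_contributions))))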
Retrospective#
Now that you have a more thorough understanding of what’s going on, we can breeze through an entire release:
[13]:
# establish public info
max_contributions = 2
bounds = (0., 5.)
# construct the measurement
dp_sum = (
    dp.t.make_clamp(dp.vector_domain(dp.atom_domain(T=float)), dp.symmetric_distance(), bounds) >>
    dp.t.make_sum(dp.vector_domain(dp.atom_domain(bounds=bounds)), dp.symmetric_distance()) >>
    dp.m.make_laplace(dp.atom_domain(T=float), dp.absolute_distance(T=float), 5.)
)
# evaluate the privacy expenditure and make a DP release
mock_dataset = [0.7, -0.3, 1., -1.]
print("epsilon:", dp_sum.map(max_contributions))
print("DP sum release:", dp_sum(mock_dataset))
epsilon: 2.000000009313226
DP sum release: -5.612579341340162
Partial Constructors#
You may notice some redundancy in the code for dp_sum above: the output domain of a transformation will always match the input domain of its successor. We can make this shorter by using then_* constructors: these are paired with make_* constructors, but delay application of the input_domain and input_metric arguments. We can rewrite dp_sum in an equivalent but more concise form:
[ ]:
dp_sum = (
    (input_domain, input_metric) >>
    dp.t.then_clamp((0., 5.)) >>
    dp.t.then_sum() >>
    dp.m.then_laplace(5.)
)
You’ll notice that the start of the chain is special: we provide a tuple to specify the input_domain and input_metric for then_clamp.
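As a quick check that the two spellings are interchangeable, the concise chain reports the same privacy expenditure as the version built from make_* constructors:
print("epsilon:", dp_sum.map(max_contributions))  # 2.000000009313226, as before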
The next major sections of the documentation each cover a module where you can find constructors:
* opendp.transformations
* opendp.measurements
* opendp.combinators