Supporting Elements#
This section builds on the Core Structures documentation to expand on the constituent pieces of Measurements and Transformations.
Function#
As one would expect, all data processing is handled via a function. The function member stored in a Transformation or Measurement struct is a straightforward representation of an idealized mathematical function. A mathematical function is a binary relation between two sets that associates each value in the input set with a value in the output set. In OpenDP, we capture these sets with domains…
Domains#
A domain describes the set of all possible input values of a function, or all possible output values of a function.
Two domains (input_domain and output_domain) are bundled within each Transformation or Measurement to describe all possible inputs and outputs of the function.
Some common domains are:
- AllDomain<T>: The set of all non-null values in the type T. For example, AllDomain<u8> describes the set of all possible unsigned 8-bit integers: {0, 1, 2, 3, ..., 255}.
- BoundedDomain<T>: The set of all non-null values in the type T, bounded between some L and U. For example, BoundedDomain<i32> between -2 and 2: {-2, -1, 0, 1, 2}.
- VectorDomain<D>: The set of all vectors, where each element of the vector is a member of domain D. For example, VectorDomain<AllDomain<bool>> describes the set of all boolean vectors: {[True], [False], [True, True], [True, False], ...}.
- SizedDomain<D>: The set of all values in the domain D that have a specific size. For example, SizedDomain<VectorDomain<AllDomain<bool>>> of size 2 describes the set of boolean vectors: {[True, True], [True, False], [False, True], [False, False]}.
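To make the set-membership reading concrete, here is a minimal pure-Python sketch that models each domain as a membership predicate. The helper names (all_domain_bool, bounded_domain, vector_domain, sized_domain) are hypothetical and are not part of the OpenDP API; OpenDP domains are type-level descriptors, and this sketch only mimics what they describe.

```python
from typing import Any, Callable

# A "domain" here is just a predicate answering: is this value a member?
# These helpers are illustrative only and are NOT part of the OpenDP API.
Domain = Callable[[Any], bool]

def all_domain_bool() -> Domain:
    """Rough analogue of AllDomain<bool>: every boolean value."""
    return lambda v: isinstance(v, bool)

def bounded_domain(lower: int, upper: int) -> Domain:
    """Rough analogue of BoundedDomain<T>: integers in [lower, upper]."""
    # type(v) is int (rather than isinstance) so that booleans are excluded.
    return lambda v: type(v) is int and lower <= v <= upper

def vector_domain(element_domain: Domain) -> Domain:
    """Rough analogue of VectorDomain<D>: lists whose elements are all in D."""
    return lambda vs: isinstance(vs, list) and all(element_domain(v) for v in vs)

def sized_domain(inner: Domain, size: int) -> Domain:
    """Rough analogue of SizedDomain<D>: members of D with exactly `size` elements."""
    return lambda vs: inner(vs) and len(vs) == size

bounded_vectors = vector_domain(bounded_domain(0, 1))
print(bounded_vectors([0, 1, 1]))  # True
print(bounded_vectors([0, 2]))     # False: 2 is outside the bounds

bool_pairs = sized_domain(vector_domain(all_domain_bool()), 2)
print(bool_pairs([True, False]))   # True
print(bool_pairs([True]))          # False: wrong size
```

Wrapping one predicate in another mirrors how the domain descriptors nest, as in SizedDomain<VectorDomain<AllDomain<bool>>>.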
In many cases, you specify some properties of the underlying domain and the constructor chooses the rest automatically.
Let’s look at the Transformation returned from make_bounded_sum(bounds=(0, 1)).
The input domain has type VectorDomain<BoundedDomain<i32>>, read as “the set of all vectors of 32-bit signed integers bounded between 0 and 1.”
The bounds argument to the constructor provides L and U, and since TIA (atomic input type) is not passed, TIA is inferred from the type of the public bounds.
The output domain is simply AllDomain<i32>, or “the set of all 32-bit signed integers.”
These domains serve two purposes:
- The relation depends on the input and output domain in its proof to restrict the set of neighboring datasets or distributions. An example is the relation for opendp.transformations.make_sized_bounded_sum(), which makes use of a SizedDomain domain descriptor to more tightly bound the sensitivity.
- Combinators also use domains to ensure the output is well-defined. For instance, chainer constructors check that intermediate domains are equivalent to guarantee that the output of the interior function is always a valid input to the exterior function.
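The domain check performed when chaining can be sketched in a few lines of plain Python. Everything here is an assumption made for illustration: ToyTransformation and chain are hypothetical names, not the OpenDP struct or its chainer constructors, and domains are reduced to descriptor strings that omit the bounds.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class ToyTransformation:
    """Hypothetical stand-in for a Transformation; not the OpenDP struct."""
    input_domain: str
    output_domain: str
    function: Callable

def chain(outer: ToyTransformation, inner: ToyTransformation) -> ToyTransformation:
    """Compose outer(inner(x)), refusing to chain if the domains do not line up."""
    if inner.output_domain != outer.input_domain:
        raise ValueError(
            f"domain mismatch: {inner.output_domain} != {outer.input_domain}"
        )
    return ToyTransformation(
        inner.input_domain,
        outer.output_domain,
        lambda x: outer.function(inner.function(x)),
    )

toy_clamp = ToyTransformation(
    "VectorDomain<AllDomain<i32>>",
    "VectorDomain<BoundedDomain<i32>>",
    lambda xs: [min(max(x, 0), 1) for x in xs],  # clamp each element to [0, 1]
)
toy_sum = ToyTransformation(
    "VectorDomain<BoundedDomain<i32>>",
    "AllDomain<i32>",
    sum,
)

clamped_sum = chain(toy_sum, toy_clamp)
print(clamped_sum.function([-3, 0, 1, 7]))  # 2
```

Chaining in the opposite order, chain(toy_clamp, toy_sum), raises a ValueError because the sum emits a scalar while the clamp expects a vector.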
Metrics#
A metric is a function that computes the distance between two elements of a set.
A concrete example of a metric in OpenDP is SymmetricDistance, or “the symmetric distance metric |A △ B| = |(A\B) ∪ (B\A)|.”
It counts the fewest additions or removals needed to convert one dataset A into another dataset B.
Each metric is bundled together with a domain, and A and B are members of that domain.
Since the symmetric distance metric is often paired with a VectorDomain<D>, A and B are often vectors.
In practice, if we had a dataset where each user can influence at most k records, we would say that the symmetric distance is bounded by k, so d_in=k.
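As an illustration of the formula only, the symmetric distance between two small datasets can be computed by treating them as multisets. The function below is a hypothetical helper, not an OpenDP call; the library itself never evaluates the metric on data.

```python
from collections import Counter

def symmetric_distance(a, b):
    """Size of the multiset symmetric difference |A △ B|."""
    count_a, count_b = Counter(a), Counter(b)
    # Counter subtraction keeps only positive counts, giving A\B and B\A.
    only_in_a = count_a - count_b
    only_in_b = count_b - count_a
    return sum(only_in_a.values()) + sum(only_in_b.values())

# One record removed (1), one duplicate removed (2), one record added (4).
print(symmetric_distance([1, 2, 2, 3], [2, 3, 4]))  # 3

# A user who contributed k = 2 records is removed from the dataset.
print(symmetric_distance(["alice", "alice", "bob"], ["bob"]))  # 2
```

The second call shows why d_in=k is the natural choice when one individual can influence at most k records.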
Another example metric is AbsoluteDistance<f64>.
This can be read as “the absolute distance metric |A - B|, where distances are expressed in 64-bit floats.”
This metric is used to represent global sensitivities (an upper bound on how much an aggregated value can change if you were to perturb an individual in the original dataset).
In practice, most users will not need to provide global sensitivities to privacy relations, because they are an intermediate distance bound encountered while relating dataset distances and privacy distances.
However, some constructors accept a metric argument that specifies the metric in which sensitivities are expressed.
Measures#
In OpenDP, a measure is a function for measuring the distance between probability distributions.
A concrete example is MaxDivergence<f64>, read as “the max divergence measure where numbers are expressed in terms of 64-bit floats.”
The max divergence measure has distances that correspond to epsilon in the pure definition of differential privacy.
Another example is SmoothedMaxDivergence<f64>.
The smoothed max divergence measure corresponds to approximate differential privacy, where distances are (epsilon, delta) tuples.
Every Measurement (see listing) contains an output_measure, and compositors are always typed by a Measure.
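For intuition about what a max divergence distance means, the sketch below computes it for two small discrete distributions. The randomized-response example and the max_divergence helper are assumptions introduced for illustration; OpenDP does not evaluate measures this way.

```python
import math

def max_divergence(p, q):
    """max over outcomes of ln(p[o] / q[o]), for discrete distributions given as dicts."""
    worst = -math.inf
    for outcome, p_o in p.items():
        if p_o == 0:
            continue  # outcomes that p never produces do not contribute
        q_o = q.get(outcome, 0.0)
        if q_o == 0:
            return math.inf  # p can produce an outcome that q cannot
        worst = max(worst, math.log(p_o / q_o))
    return worst

# Randomized response that reports the true bit with probability 0.75.
output_if_input_is_0 = {0: 0.75, 1: 0.25}
output_if_input_is_1 = {0: 0.25, 1: 0.75}

# Pure DP bounds the divergence in both directions.
epsilon = max(
    max_divergence(output_if_input_is_0, output_if_input_is_1),
    max_divergence(output_if_input_is_1, output_if_input_is_0),
)
print(math.isclose(epsilon, math.log(3)))  # True
```

In this toy mechanism the distance between the two output distributions is ln(3), which is the epsilon of the pure differential privacy guarantee.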
Relations#
We assert the privacy properties of a Transformation or Measurement’s function via a relation.
Relations accept a d_in and a d_out and return a boolean.
There are a couple of equivalent interpretations for when a relation returns True:
- All potential input perturbations do not significantly influence the output.
- The transformation or measurement is (d_in, d_out)-close.
What does (d_in, d_out)-close mean?
If a measurement is (d_in, d_out)-close, then the output is d_out-DP when the input is changed by at most d_in.
If a transformation is (d_in, d_out)-close, then the output can change by at most d_out when the input is changed by at most d_in.
What are d_in and d_out?
d_in and d_out are distances in terms of the input and output metric or measure.
Refer to Distances below for more details.
This should be enough rope to work with, but let’s still touch quickly on the mathematical side.
Refer to the programming framework paper itself if you want a deeper understanding.
Consider d_X the input metric, d_Y the output metric or measure, and f the function in the Transformation or Measurement.
A slightly more mathematical way to express this is:
If the relation passes, then it tells you that, for all x, x' in the input domain:
- if d_X(x, x') <= d_in (if neighboring datasets are at most d_in-close)
- then d_Y(f(x), f(x')) <= d_out (then the distance between function outputs is no greater than d_out)
Notice that if the relation passes at d_out, it will pass for any value greater than d_out.
This is an incredibly useful observation, as we will see in the Parameter Search section.
Putting this to practice, the following example checks the stability relation on a clamp transformation.
>>> from opendp.transformations import make_clamp
>>> clamp = make_clamp(bounds=(1, 10))
...
>>> # The maximum number of records that any one individual may influence in your dataset
>>> in_symmetric_distance = 3
>>> # clamp is a 1-stable transformation, so this should pass for any d_out >= 3
>>> assert clamp.check(d_in=in_symmetric_distance, d_out=4)
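The definition of (d_in, d_out)-closeness can also be checked by brute force on a toy input domain. The sketch below does not use OpenDP: symmetric_distance, toy_clamp, and is_close are hypothetical helpers, and the domain is restricted to short lists over three values so that every pair of datasets can be enumerated.

```python
from collections import Counter
from itertools import product

def symmetric_distance(a, b):
    """Size of the multiset symmetric difference |A △ B|."""
    count_a, count_b = Counter(a), Counter(b)
    return sum(((count_a - count_b) + (count_b - count_a)).values())

def toy_clamp(data, lower=1, upper=10):
    """Clamp every record to [lower, upper]."""
    return [min(max(x, lower), upper) for x in data]

# Toy input domain: every list of length 0 to 3 drawn from three values.
values = [0, 5, 20]
datasets = [list(d) for n in range(4) for d in product(values, repeat=n)]

def is_close(d_in, d_out):
    """Check d_Y(f(x), f(x')) <= d_out for every pair with d_X(x, x') <= d_in."""
    return all(
        symmetric_distance(toy_clamp(x), toy_clamp(y)) <= d_out
        for x in datasets
        for y in datasets
        if symmetric_distance(x, y) <= d_in
    )

print(is_close(d_in=3, d_out=3))  # True: clamping never increases the distance
print(is_close(d_in=3, d_out=4))  # True: any larger d_out also passes
print(is_close(d_in=3, d_out=2))  # False: [] and [5, 5, 5] stay 3 apart
```

An exhaustive search like this is only feasible on toy domains; the relation attached to a Transformation encodes the same guarantee as a proof that holds over the whole input domain.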
Maps#
A map is a function that takes some d_in and returns the smallest d_out for which the transformation or measurement is (d_in, d_out)-close.
Maps are a useful shorthand to find privacy properties directly:
>>> # reusing the prior clamp transformation
>>> clamp.map(d_in=3)
3
The relation's check predicate simply compares the output of the map with d_out: d_out >= map(d_in).
For a more thorough understanding of maps, please read the relations section.
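The link between a map and its relation can be sketched without OpenDP. The names make_toy_sum_map and make_check are hypothetical, and the bound d_in * max(|L|, |U|) is assumed here as a standard sensitivity bound for a sum over data clamped to [L, U] when neighboring datasets differ by added or removed records; it is not claimed to be the exact formula used by any OpenDP constructor.

```python
def make_toy_sum_map(lower, upper):
    """Map for a toy bounded sum: each added or removed record moves the sum
    by at most max(|lower|, |upper|)."""
    per_record = max(abs(lower), abs(upper))
    return lambda d_in: d_in * per_record

def make_check(map_fn):
    """Build the relation from the map: pass whenever d_out >= map(d_in)."""
    return lambda d_in, d_out: d_out >= map_fn(d_in)

toy_map = make_toy_sum_map(0, 10)
toy_check = make_check(toy_map)

print(toy_map(3))        # 30
print(toy_check(3, 30))  # True: exactly the mapped distance
print(toy_check(3, 31))  # True: looser bounds always pass
print(toy_check(3, 29))  # False: tighter than the map allows
```

Because the relation is derived from the map by a single comparison, it is automatically monotone in d_out.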
Distances#
You can determine what units d_in and d_out are expressed in based on the input_metric, and output_metric or output_measure.
Follow the links into the example metrics and measures to get more detail on what the distances mean for that kind of metric or measure.
On Transformations, the input_metric will be a dataset metric like SymmetricDistance.
The output_metric will either be some dataset metric (on dataset transformations) or some kind of global sensitivity metric like AbsoluteDistance (on aggregations).
The input_metric of Measurements is initially only some kind of global sensitivity metric.
However, once you chain the Measurement with a Transformation, the resulting Measurement will have whatever input_metric was on the Transformation.
The output_measure of Measurements is some kind of privacy measure like MaxDivergence or SmoothedMaxDivergence.
It is critical that you choose the correct d_in for the relation, whereas you can use binary search utilities to find the tightest d_out.
Practically speaking, the smaller the d_out, the tighter your analysis will be.
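Because a relation that passes at some d_out also passes at every larger d_out, the tightest d_out can be located by bisection. The function below is a hypothetical, hand-written sketch of that idea and not OpenDP's own binary search utility; toy_check is an assumed relation of the form d_out >= 3 * d_in.

```python
def find_tightest_d_out(check, d_in, lower=0.0, upper=1e6, tolerance=1e-6):
    """Bisect for (approximately) the smallest d_out in [lower, upper] that passes.

    Assumes check is monotone: once it passes at some d_out,
    it passes at every larger d_out.
    """
    if not check(d_in, upper):
        raise ValueError("relation does not pass even at the upper bound")
    while upper - lower > tolerance:
        mid = (lower + upper) / 2
        if check(d_in, mid):
            upper = mid  # mid passes, so the tightest value is at or below mid
        else:
            lower = mid  # mid fails, so the tightest value is above mid
    return upper  # the relation is known to pass at the returned value

# Assumed toy relation: the output distance is three times the input distance.
toy_check = lambda d_in, d_out: d_out >= 3 * d_in

d_out = find_tightest_d_out(toy_check, d_in=2)
print(round(d_out, 3))  # 6.0
```

Returning the upper end of the final interval keeps the answer on the passing side of the relation, so rounding error can only make the reported bound slightly looser, never invalid.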
You might find it surprising that metrics and measures are never actually evaluated! The framework does not evaluate these because it only needs to relate a user-provided input distance to another user-provided output distance. Even the user should not directly compute input and output distances: they are solved-for, bisected, or even contextual.
Be careful: even a dataset query to determine the greatest number of contributions made by any one individual can itself be private information.