Typical Workflow
A differentially private analysis in OpenDP typically has the following steps:
1. Identify the unit of privacy
2. Set privacy loss parameters
3. Collect public information
4. Mediate access to data
5. Submit DP queries
We’ll illustrate these steps with a differentially private analysis of a teacher survey, a tabular data set. The raw data consists of survey responses from teachers in primary and secondary schools in an unspecified U.S. state.
1. Identify the Unit of Privacy
The first step in a differentially private analysis is to determine what you are protecting: the unit of privacy.
Releases on the teacher survey should conceal the addition or removal of any one teacher’s data, and each teacher contributes at most one row to the data set, so the unit of privacy corresponds to one row contribution.
Context API (Python):

>>> import opendp.prelude as dp
>>> dp.enable_features("contrib")

>>> privacy_unit = dp.unit_of(contributions=1)
>>> input_metric, d_in = privacy_unit
Framework API (Python):

>>> import opendp.prelude as dp
>>> dp.enable_features("contrib")

>>> d_in = 1  # neighboring data set distance is at most d_in...
>>> input_metric = dp.symmetric_distance()  # ...in terms of additions/removals
R:

library(opendp)
enable_features("contrib")

d_in <- 1L # neighboring data set distance is at most d_in...
input_metric <- symmetric_distance() # ...in terms of additions/removals
The privacy unit specifies how distances are computed between two data sets (input_metric), and how large that distance can be (d_in).
Broadly speaking, differential privacy can be applied to any medium of data for which you can define a unit of privacy. In other contexts, the unit of privacy may correspond to multiple rows, a user ID, or nodes or edges in a graph.
The unit of privacy may also be more general or more precise than a single individual:
- more general: the unit of privacy is an entire household, or a company
- more precise: the unit of privacy is a person-month, or a device
It is highly recommended to choose a unit of privacy that is at least as general as an individual.
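For example, if each individual could contribute up to three rows to the data set, the same helper expresses that directly. A minimal sketch; the bound of 3 is a hypothetical choice, not drawn from the survey:

>>> # unit of privacy: up to 3 row contributions per individual
>>> privacy_unit_3 = dp.unit_of(contributions=3)
>>> _, d_in_3 = privacy_unit_3
>>> d_in_3
3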
2. Set Privacy Loss Parameters
Next, you should determine what level of privacy protection to provide to your units of privacy. This choice may be governed by a variety of factors, such as the amount of harm that individuals could experience if their data were revealed, and your ethical and legal obligations as a data custodian.
The level of privacy afforded to units of privacy in a data set is quantified by privacy loss parameters. Under pure differential privacy, there is a single privacy-loss parameter, typically denoted epsilon (ε). Epsilon is a non-negative number, where larger values afford less privacy. Epsilon can be viewed as a proxy for the worst-case risk to a unit of privacy. It is customary to refer to a data release with such bounded risk as epsilon-differentially private (ε-DP).
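Concretely, a mechanism M satisfies ε-DP if, for any two data sets x and x′ that differ by one unit of privacy and any set of outcomes S, Pr[M(x) ∈ S] ≤ exp(ε) · Pr[M(x′) ∈ S]. The max-divergence measure used below quantifies exactly this bound on the distance between output distributions.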
A common rule of thumb is to limit ε to 1.0, but the appropriate limit varies with the considerations mentioned above. See Hsu et al. for a more elaborate discussion on setting epsilon.
Context API (Python):

>>> privacy_loss = dp.loss_of(epsilon=1.)
>>> privacy_measure, d_out = privacy_loss
Framework API (Python):

>>> d_out = 1.  # output distributions have distance at most d_out (ε)...
>>> privacy_measure = dp.max_divergence(T=float)  # ...in terms of pure-DP
R:

d_out <- 1.0 # output distributions have distance at most d_out (ε)...
privacy_measure <- max_divergence(.T = f64) # ...in terms of pure-DP
The privacy loss specifies how distances are measured between distributions (privacy_measure), and how large that distance can be (d_out).
3. Collect Public Information
The next step is to identify public information about the data set:
- information that is invariant across all potential input data sets (this may include column names and per-column categories)
- information that is publicly available from other sources
- information from prior DP releases
This is the same under either API.
Python:

>>> col_names = [
...     "name", "sex", "age", "maritalStatus", "hasChildren", "highestEducationLevel",
...     "sourceOfStress", "smoker", "optimism", "lifeSatisfaction", "selfEsteem"
... ]
R:

col_names <- c(
  "name", "sex", "age", "maritalStatus", "hasChildren", "highestEducationLevel",
  "sourceOfStress", "smoker", "optimism", "lifeSatisfaction", "selfEsteem"
)
In this case (and in most cases), we consider column names public and invariant because they were not chosen in response to the data; they were fixed before the data was collected.
A data invariant is information about your data set that you explicitly choose not to protect, typically because it is already public or non-sensitive. Be careful: if an invariant does contain sensitive information, you risk violating the privacy of individuals in your data set.
On the other hand, using public information significantly improves the utility of your results.
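Public information can also include plausible, data-independent bounds on numeric columns. The age bounds used when clamping in step 5 are a best guess from public knowledge and can be fixed now (the specific bounds are an assumption, not derived from the data):

>>> # public, data-independent bounds on teacher age, fixed before touching the data
>>> age_bounds = (18.0, 70.0)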
4. Mediate Access to Data
Ideally, you have not yet accessed the sensitive data set; this step is the only point in the process where it is touched. To ensure that your specified differential privacy protections are maintained, the OpenDP Library should mediate all access to the sensitive data set. In Python, use the Context API to mediate access.
Context API (Python):

>>> import urllib.request
>>> data_url = "https://raw.githubusercontent.com/opendp/opendp/sydney/teacher_survey.csv"
>>> with urllib.request.urlopen(data_url) as data_req:
...     data = data_req.read().decode('utf-8')

>>> context = dp.Context.compositor(
...     data=data,
...     privacy_unit=privacy_unit,
...     privacy_loss=privacy_loss,
...     split_evenly_over=3
... )
Since the privacy-loss budget is at most ε = 1 and we are splitting it evenly among three queries, each query will be calibrated to satisfy ε = 1/3.
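If some queries deserve more of the budget than others, the compositor can also split unevenly. A sketch, assuming your OpenDP version supports the split_by_weights keyword:

>>> context_weighted = dp.Context.compositor(
...     data=data,
...     privacy_unit=privacy_unit,
...     privacy_loss=privacy_loss,
...     split_by_weights=[2, 1, 1],  # the first query gets half of the total budget
... )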
Framework API (Python):

>>> import urllib.request
>>> data_url = "https://raw.githubusercontent.com/opendp/opendp/sydney/teacher_survey.csv"
>>> with urllib.request.urlopen(data_url) as data_req:
...     data = data_req.read().decode('utf-8')

>>> m_sc = dp.c.make_sequential_composition(
...     # data set is a single string, with rows separated by linebreaks
...     input_domain=dp.atom_domain(T=str),
...     input_metric=input_metric,
...     output_measure=privacy_measure,
...     d_in=d_in,
...     d_mids=[d_out / 3] * 3,
... )
>>> # Call measurement with data to create a queryable:
>>> qbl_sc = m_sc(data)
make_sequential_composition creates a sequential composition measurement (this is what dp.Context.compositor builds for you in the Context API). You can now submit up to three queries to qbl_sc, in the form of measurements.
R:

temp_file <- "teacher_survey.csv"
download.file("https://raw.githubusercontent.com/opendp/opendp/sydney/teacher_survey.csv", temp_file)
data_string <- paste(readLines(temp_file), collapse = "\n")
file.remove(temp_file)

m_sc <- make_sequential_composition(
  input_domain = atom_domain(.T = String),
  input_metric = input_metric,
  output_measure = privacy_measure,
  d_in = d_in,
  d_mids = rep(d_out / 3L, 3L)
)

# Call measurement with data to create a queryable:
qbl_sc <- m_sc(arg = data_string) # Different from Python, which does not require "arg".
5. Submit DP Queries
You can now create differentially private releases. Here’s a differentially private count. Adding or removing one teacher changes a count by at most one, so the count has sensitivity 1, and at ε = 1/3 the Laplace noise scale is calibrated to sensitivity/ε = 3.

Context API (Python):
>>> count_query = (
... context.query()
... .split_dataframe(",", col_names=col_names)
... .select_column("age", str) # temporary until OpenDP 0.10 (Polars dataframe)
... .count()
... .laplace()
... )
>>> scale = count_query.param()  # the noise scale this release will use
>>> scale
3.0000000000000004
>>> # alpha=0.05 yields the half-width of a 95% confidence interval for the release
>>> accuracy = dp.discrete_laplacian_scale_to_accuracy(scale=scale, alpha=0.05)
>>> accuracy
9.445721638273584
>>> dp_count = count_query.release()
>>> # the true count lies within ±accuracy of dp_count with 95% probability
>>> interval = (dp_count - accuracy, dp_count + accuracy)
Framework API (Python):

>>> count_transformation = (
...     dp.t.make_split_dataframe(",", col_names=col_names)
...     >> dp.t.make_select_column("age", str)
...     >> dp.t.then_count()
... )
>>> # map the data set distance d_in through the transformation to get the sensitivity
>>> count_sensitivity = count_transformation.map(d_in)
>>> count_sensitivity
1
>>> # search for the smallest noise scale for which the chain satisfies ε = 1/3 at d_in
>>> count_measurement = dp.binary_search_chain(
...     lambda scale: count_transformation >> dp.m.then_laplace(scale), d_in, d_out / 3
... )
>>> dp_count = qbl_sc(count_measurement)
R:

count_transformation <- (
make_split_dataframe(",", col_names = col_names)
|> then_select_column("age", String) # Different from Python, which uses "make_".
|> then_count()
)
count_sensitivity <- count_transformation(d_in = d_in) # Different from Python, which uses ".map".
cat("count_sensitivity:", count_sensitivity, "\n")
# 1
count_measurement <- binary_search_chain(
function(scale) count_transformation |> then_laplace(scale), d_in, d_out / 3L
)
dp_count <- qbl_sc(query = count_measurement) # Different from Python, which does not require "query".
cat("dp_count:", dp_count, "\n")
Here’s a differentially private mean. The mean requires a data set of known size, so the pipeline resizes the data to the DP count released above.

Context API (Python):
>>> mean_query = (
... context.query()
... .split_dataframe(",", col_names=col_names)
... .select_column("age", str)
... .cast_default(float)
... .clamp((18.0, 70.0)) # a best-guess based on public information
... # Explanation for `constant=42`:
... # since dp_count may be larger than the true size,
... # imputed rows will be given an age of 42.0
... # (also a best guess based on public information)
... .resize(size=dp_count, constant=42.0)
... .mean()
... .laplace()
... )
>>> dp_mean = mean_query.release()
Framework API (Python):

>>> mean_transformation = (
... dp.t.make_split_dataframe(",", col_names=col_names) >>
... dp.t.make_select_column("age", str) >>
... dp.t.then_cast_default(float) >>
... dp.t.then_clamp((18.0, 70.0)) >> # a best-guess based on public information
... dp.t.then_resize(size=dp_count, constant=42.0) >>
... dp.t.then_mean()
... )
>>> mean_measurement = dp.binary_search_chain(
... lambda scale: mean_transformation >> dp.m.then_laplace(scale), d_in, d_out / 3
... )
>>> dp_mean = qbl_sc(mean_measurement)
R:

mean_transformation <- (
make_split_dataframe(",", col_names = col_names)
|> then_select_column("age", String)
|> then_cast_default(f64) # Different from Python, which just uses "float".
|> then_clamp(c(18.0, 70.0)) # a best-guess based on public information
|> then_resize(size = dp_count, constant = 42.0)
|> then_mean()
)
mean_measurement <- binary_search_chain(
function(scale) mean_transformation |> then_laplace(scale), d_in, d_out / 3L
)
dp_mean <- qbl_sc(query = mean_measurement) # Different from Python, which does not require "query".
cat("dp_mean:", dp_mean, "\n")
Other features
The OpenDP Library supports more statistics, like the variance, various ways to compute histograms and quantiles, and PCA. The library also supports other mechanisms: the Gaussian mechanism, which provides tighter privacy accounting when releasing a large number of queries; the thresholded Laplace mechanism, for releasing counts on data sets with unknown key sets; and variations of randomized response.
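As one example, a standalone Gaussian measurement can be constructed with the Framework API. A minimal sketch for a single scalar release; the scale of 1.0 is an arbitrary choice, and exact domain requirements may vary by library version:

>>> # Gaussian noise on one float; sensitivity is measured by absolute distance,
>>> # and privacy is accounted in zero-concentrated DP (ρ)
>>> m_gauss = dp.m.make_gaussian(
...     dp.atom_domain(T=float), dp.absolute_distance(T=float), scale=1.0
... )
>>> m_gauss.map(1.0)  # ρ = d_in² / (2·scale²) for a sensitivity-1 input
0.5
>>> noisy = m_gauss(100.0)  # a draw centered at 100.0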