Example Data Analysis#
In this notebook we use OpenDP to conduct a brief data analysis of a PUMS microdata sample from the US Census Bureau. The data consists of demographic information of 1000 individuals. Each individual in the dataset contributes at most one row.
[33]:
# the greatest number of records that any one individual can influence in the dataset
max_influence = 1
We now establish what is considered “public knowledge” about the data.
[34]:
# establish public information
col_names = ["age", "sex", "educ", "race", "income", "married"]
# we can also reasonably intuit that age and income will be numeric,
# as well as bounds for them, without looking at the data
age_bounds = (0, 100)
income_bounds = (0, 150_000)
The vetting process is currently underway for the code in the OpenDP Library. Any constructors that have not been vetted may still be accessed if you opt-in to “contrib”.
[35]:
from opendp.mod import enable_features
enable_features('contrib')
Working with CSV data#
Let’s examine how we can process csv data with Transformations.
I’m going to pull a few constructors from the Dataframes section in the user guide.
We start with make_split_dataframe
to parse one large string containing all the csv data into a dataframe. make_split_dataframe
expects us to pass column names, which we can grab out of the public information. make_select_column
will then index into the dataframe to pull out a column where the elements have a given type TOA
. The TOA
argument won’t cast for you; the casting comes later!
[36]:
from opendp.transformations import make_split_dataframe, make_select_column
income_preprocessor = (
# Convert data into a dataframe where columns are of type Vec<str>
make_split_dataframe(separator=",", col_names=col_names) >>
# Selects a column of df, Vec<str>
make_select_column(key="income", TOA=str)
)
For the sake of exposition, we’re going to go ahead and load up the data.
[37]:
import os
data_path = os.path.join('..', 'data', 'PUMS_california_demographics_1000', 'data.csv')
with open(data_path) as input_file:
data = input_file.read()
print('\n'.join(data.split('\n')[:6]))
59,1,9,1,0,1
31,0,1,3,17000,0
36,1,11,1,0,1
54,1,11,1,9100,1
39,0,5,3,37000,0
34,0,9,1,0,1
As we can see from the first few rows, it is intentional that there are no column names in the data. If your data has column names, you will want to strip them out before passing data into your function.
Now if you run the transformation on the data, you will get a list of incomes as strings. I’ve limited the output to just the first few income values.
[38]:
transformed = income_preprocessor(data)
print(type(transformed))
print(transformed[:6])
<class 'list'>
['0', '17000', '0', '9100', '37000', '0']
Casting#
Income doesn’t make sense as a string for our purposes, so we can just extend the previous preprocessor to also cast and impute.
[39]:
from opendp.transformations import make_cast, make_impute_constant
# make a transformation that casts from a vector of strings to a vector of ints
cast_str_int = (
# Cast Vec<str> to Vec<Option<int>>
make_cast(TIA=str, TOA=int) >>
# Replace any elements that failed to parse with 0, emitting a Vec<int>
make_impute_constant(0)
)
# replace the previous preprocessor: extend it with the caster
income_preprocessor = income_preprocessor >> cast_str_int
print(income_preprocessor(data)[:6])
[0, 17000, 0, 9100, 37000, 0]
Great! Now we have integer income data from our CSV. A quick aside, keep in mind that we can invoke transformations on almost anything to do some testing. For example, we still have a handle on cast_str_int
, don’t we?
[40]:
cast_str_int(["null", "1.", "2"])
[40]:
[0, 0, 2]
Private Count#
Time to compute our first aggregate statistic. Suppose we want to know the number of records in the dataset.
We can use the list of aggregators in the Transformation Constructors section of the user guide to find make_count
.
[41]:
from opendp.transformations import make_count
count = income_preprocessor >> make_count(TIA=int)
# NOT a DP release!
count_response = count(data)
Be careful! count
is still only a transformation, so the output in count_response
is not a differentially private release. You will need to chain with a measurement to create a differentially private release.
We use make_base_discrete_laplace
below, because it has an integer support. Notice that we now import from opendp.measurements
, and the resulting type(dp_count)
is Measurement
. This tells us that the output will be a differentially private release.
[42]:
from opendp.measurements import make_base_discrete_laplace
dp_count = count >> make_base_discrete_laplace(scale=1.)
In any realistic situation, you would likely want to estimate the budget utilization before you make a release. Use a search utility to quantify the privacy expenditure of this release.
[43]:
from opendp.mod import binary_search
# estimate the budget...
epsilon = binary_search(
lambda eps: dp_count.check(d_in=max_influence, d_out=eps),
bounds=(0., 100.))
print("DP count budget:", epsilon)
# ...and then release
count_release = dp_count(data)
print("DP count:", count_release)
DP count budget: 1.0
DP count: 999
Private Sum#
Suppose we want to know the total income of our dataset. First, take a look at the list of aggregators. make_bounded_sum
seems to meet our requirements. As indicated by the function’s API documentation, it expects bounded data, so we’ll also need to chain the transformation from make_clamp
with the income_preprocessor
.
[44]:
from opendp.transformations import make_clamp, make_bounded_sum
bounded_income_sum = (
income_preprocessor >>
# Clamp income values
make_clamp(bounds=income_bounds) >>
# These bounds must be identical to the clamp bounds, otherwise chaining will fail
make_bounded_sum(bounds=income_bounds)
)
In this example, instead of just passing a scale into make_base_discrete_laplace
, I want whatever scale will make my measurement 1-epsilon DP. Again, I can use a search utility to find such a scale.
[45]:
from opendp.mod import binary_search_param
discovered_scale = binary_search_param(
lambda s: bounded_income_sum >> make_base_discrete_laplace(scale=s),
d_in=max_influence,
d_out=1.)
dp_sum = bounded_income_sum >> make_base_discrete_laplace(scale=discovered_scale)
Or more succinctly…
[46]:
from opendp.mod import binary_search_chain
dp_sum = binary_search_chain(
lambda s: bounded_income_sum >> make_base_discrete_laplace(scale=s),
d_in=max_influence,
d_out=1.)
# ...and make our 1-epsilon DP release
print("DP sum:", dp_sum(data))
DP sum: 30162185
Private Mean#
We may be more interested in the mean age. Just like before, we reference the docs to find make_sized_bounded_mean
. The name of the constructor indicates that it expects sized, bounded data, and the docstring points us toward preprocessors we can use.
Sized data is data that has a known number of rows. The constructor enforces this requirement because knowledge of the dataset size was necessary to establish the privacy proof.
Luckily, we’ve already made a DP release of the number of rows in the dataset, which we can reuse as an argument here.
Putting the previous sections together, our bounded mean age is:
[47]:
from opendp.transformations import make_cast_default, make_bounded_resize, make_sized_bounded_mean
from opendp.mod import OpenDPException
try:
mean_age_preprocessor = (
# Convert data into a dataframe of string columns
make_split_dataframe(separator=",", col_names=col_names) >>
# Selects a column of df, Vec<str>
make_select_column(key="age", TOA=str) >>
# Cast the column as Vec<float>, and fill nulls with the default value, 0.
make_cast_default(TIA=str, TOA=float) >>
# Clamp age values
make_clamp(bounds=age_bounds)
)
except OpenDPException as err:
assert err.message.startswith("Intermediate domains don't match.")
print(err.message)
Intermediate domains don't match. See https://github.com/opendp/opendp/discussions/297
output_domain: VectorDomain(AllDomain(f64))
input_domain: VectorDomain(AllDomain(i32))
Wait a second! The intermediate domains don’t match? In this case, we casted to float-valued data, but make_clamp
was built with integer-valued bounds, so the clamp is expecting integer data. Therefore, the output of the cast is not a valid input to the clamp. We can fix this by adjusting the bounds and trying again.
[48]:
float_age_bounds = tuple(map(float, age_bounds))
mean_age_preprocessor = (
# Convert data into a dataframe of string columns
make_split_dataframe(separator=",", col_names=col_names) >>
# Selects a column of df, Vec<str>
make_select_column(key="age", TOA=str) >>
# Cast the column as Vec<float>, and fill nulls with the default value, 0.
make_cast_default(TIA=str, TOA=float) >>
# Clamp age values
make_clamp(bounds=float_age_bounds) >>
# Resize the dataset to length `count_release`.
# If there are fewer than `count_release` rows in the data, fill with a constant of 20.
# If there are more than `count_release` rows in the data, only keep `count_release` rows
make_bounded_resize(size=count_release, bounds=float_age_bounds, constant=20.) >>
# Compute the mean
make_sized_bounded_mean(size=count_release, bounds=float_age_bounds)
)
I stopped just short of chaining with a measurement because we’re working with float data. There are some extra considerations to take in mind with floating-point data, which are covered in the user guide.
With the assumption that you understand the ramifications, I’ll go ahead and finish the query.
[49]:
from opendp.measurements import make_base_laplace
from opendp.mod import enable_features
enable_features("floating-point")
# add laplace noise
dp_mean = mean_age_preprocessor >> make_base_laplace(scale=1.0)
mean_release = dp_mean(data)
print("DP mean:", mean_release)
DP mean: 45.10625466380808
Depending on your use-case, you may find greater utility separately releasing a DP sum and a DP count, and then postprocessing them into the mean. In the above mean example, you could even take advantage of this to avoid using floating-point numbers.
Zero-Concentrated Differential Privacy#
In this example, I chain with the gaussian mechanism instead, with a budget of .05 rho.
[50]:
from opendp.measurements import make_base_gaussian
variance = (
# Convert data into a dataframe of string columns
make_split_dataframe(separator=",", col_names=col_names) >>
# Selects a column of df, Vec<str>
make_select_column(key="age", TOA=str) >>
# Cast the column as Vec<float>, and fill nulls with the default value, 0.
make_cast_default(TIA=str, TOA=float) >>
# Clamp age values
make_clamp(bounds=float_age_bounds) >>
# Resize the dataset to length `count_release`.
make_bounded_resize(size=count_release, bounds=float_age_bounds, constant=20.) >>
# Compute the mean
make_sized_bounded_mean(size=count_release, bounds=float_age_bounds)
)
dp_variance = binary_search_chain(
lambda s: variance >> make_base_gaussian(scale=s),
d_in=max_influence, d_out=.05)
print("DP variance:", dp_variance(data))
DP variance: 44.5986145355549
Measure Casting#
In the previous example, we have a privacy parameter in terms of rho. We can use make_zCDP_to_approxDP
to convert the distance type to an ε(δ) pareto-curve.
[51]:
from opendp.combinators import make_fix_delta, make_zCDP_to_approxDP
app_dp_variance = make_zCDP_to_approxDP(dp_variance)
# evaluate the privacy map to get a curve
curve = app_dp_variance.map(max_influence)
# solve for epsilon when delta is fixed
curve.epsilon(delta=1e-7)
[51]:
1.6194085342284823
We can use make_fix_delta
to emit (ε, δ) pairs instead:
[52]:
from opendp.combinators import make_fix_delta
fixed_app_dp_variance = make_fix_delta(app_dp_variance, delta=1e-7)
fixed_app_dp_variance.map(max_influence)
[52]:
(1.6194085342284823, 1e-07)
This can be used in conjunction with the binary search utilities to solve for a scale parameter:
[53]:
budget = (1., 1e-7)
def make_dp_variance(scale):
adp = make_zCDP_to_approxDP(make_base_gaussian(scale))
return variance >> make_fix_delta(adp, delta=budget[1])
dp_variance_lte_budget = binary_search_chain(
make_dp_variance,
d_in=max_influence, d_out=budget)
# we know this measurement is calibrated to be lte budget
assert dp_variance_lte_budget.check(max_influence, budget)
Composition#
We can compose multiple measurements into one measurement with make_basic_composition
:
[54]:
from opendp.combinators import make_basic_composition
composed = make_basic_composition([dp_sum, dp_mean])
composed(data)
[54]:
[30292669, 44.980434787860254]
In order to compose, all measurements must share the same input domain, input metric and output measure. We can still use the privacy map to see the epsilon usage of this new measurement:
[56]:
composed.map(max_influence)
[56]:
1.100100100100557
Population Amplification#
Another type of combinator is an amplifier. In this example I’ll apply the amplifier to a dp variance estimator:
[55]:
from opendp.transformations import make_sized_bounded_variance
from opendp.combinators import make_population_amplification
variance = make_sized_bounded_variance(size=count_release, bounds=float_age_bounds)
dp_variance = binary_search_chain(
lambda s: variance >> make_base_laplace(scale=s),
d_in=max_influence,
d_out=1.
)
# requires a looser trust model, as the population size can be set arbitrarily
enable_features("honest-but-curious")
make_population_amplification(dp_variance, 100_000).map(1)
[55]:
0.01701997053694077
You’ll notice that we found a dp variance estimator that was 1 epsilon-DP, but after amplification, it now uses a much smaller epsilon. We are taking advantage of the knowledge that the dataset was a simple sample from a larger population with at least 100,000 individuals.