Privatizing Histograms#
Sometimes we want to release the counts of individual outcomes in a dataset. When plotted, this makes a histogram.
The library currently has two approaches:
Known category set
make_count_by_categories
Unknown category set
make_count_by
The next code block imports handles boilerplate: imports, data loading, plotting.
[ ]:
%[ -e data.csv ] || wget https://raw.githubusercontent.com/opendp/opendp/main/docs/source/data/PUMS_california_demographics_1000/data.csv
[1]:
import opendp.prelude as dp
dp.enable_features("contrib", "floating-point")
max_influence = 1
budget = (1., 1e-8)
# public information
col_names = ["age", "sex", "educ", "race", "income", "married"]
size = 1000
with open('data.csv') as input_data:
data = input_data.read()
def plot_histogram(sensitive_counts, released_counts):
"""Plot a histogram that compares true data against released data"""
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
fig = plt.figure()
ax = fig.add_axes([1,1,1,1])
plt.ylim([0,225])
tick_spacing = 1.
ax.xaxis.set_major_locator(ticker.MultipleLocator(tick_spacing))
plt.xlim(0,15)
width = .4
ax.bar(list([x+width for x in range(0, len(sensitive_counts))]), sensitive_counts, width=width, label='True Value')
ax.bar(list([x+2*width for x in range(0, len(released_counts))]), released_counts, width=width, label='DP Value')
ax.legend()
plt.title('Histogram of Education Level')
plt.xlabel('Years of Education')
plt.ylabel('Count')
plt.show()
Private histogram via make_count_by_categories
#
This approach is only applicable if the set of potential values that the data may take on is public information. If this information is not available, then use make_count_by
instead. It typically has greater utility than make_count_by
until the size of the category set is greater than dataset size. In this data, we know that the category set is public information: strings consisting of the numbers between 1 and 20.
The counting aggregator computes a vector of counts in the same order as the input categories. It also includes one extra count at the end of the vector, consisting of the number of elements that were not members of the category set.
You’ll notice that make_base_discrete_laplace
has an additional argument that explicitly sets the type of the domain, D
. It defaults to AtomDomain[int]
which works in situations where the mechanism is noising a scalar. However, in this situation, we are noising a vector of scalars, and thus the appropriate domain is VectorDomain[AtomDomain[int]]
.
[5]:
# public information
categories = list(map(str, range(1, 20)))
histogram = (
dp.t.make_split_dataframe(separator=",", col_names=col_names) >>
dp.t.make_select_column(key="educ", TOA=str) >>
# Compute counts for each of the categories and null
dp.t.then_count_by_categories(categories=categories)
)
noisy_histogram = dp.binary_search_chain(
lambda s: histogram >> dp.m.then_laplace(scale=s),
d_in=max_influence, d_out=budget[0])
sensitive_counts = histogram(data)
released_counts = noisy_histogram(data)
print("Educational level counts:\n", sensitive_counts[:-1])
print("DP Educational level counts:\n", released_counts[:-1])
print("DP estimate for the number of records that were not a member of the category set:", released_counts[-1])
plot_histogram(sensitive_counts, released_counts)
Educational level counts:
[33, 14, 38, 17, 24, 21, 31, 51, 201, 60, 165, 76, 178, 54, 24, 13, 0, 0, 0]
DP Educational level counts:
[34, 13, 38, 18, 24, 22, 31, 51, 201, 59, 166, 76, 178, 54, 24, 13, 0, 0, 3]
DP estimate for the number of records that were not a member of the category set: 1
Private histogram via make_count_by
and make_base_laplace_threshold
\n#
This approach is applicable when the set of categories is unknown or very large. The make_count_by
transformation computes a hashmap containing the count of each unique key, and make_base_laplace_threshold
adds noise to the counts and censors counts less than some threshold.:nbsphinx-math:n
On make_base_laplace_threshold
, the noise scale parameter influences the epsilon parameter of the budget, and the threshold influences the delta parameter in the budget. Any category with a count sufficiently small is censored from the release.
It is sometimes referred to as a “stability histogram” because it only releases counts for “stable” categories that exist in all datasets that are considered “neighboring” to your private dataset.
I start out by defining a function that finds the tightest noise scale and threshold for which the stability histogram is (d_in, d_out)
-close.
[6]:
def make_base_laplace_threshold_budget(
preprocess: dp.Transformation,
d_in, d_out
) -> dp.Measurement:
"""Make a stability histogram that respects a given d_in, d_out."""
def privatize(s, t=1e8):
return preprocess >> dp.m.then_base_laplace_threshold(scale=s, threshold=t)
s = dp.binary_search_param(lambda s: privatize(s=s), d_in, d_out)
t = dp.binary_search_param(lambda t: privatize(s=s, t=t), d_in, d_out)
return privatize(s=s, t=t)
I now use the make_base_laplace_threshold_budget
constructor to release a private histogram on the education data.
[7]:
preprocess = (
dp.t.make_split_dataframe(separator=",", col_names=col_names) >>
dp.t.make_select_column(key="educ", TOA=str) >>
dp.t.then_count_by(MO=dp.L1Distance[float], TV=float)
)
noisy_histogram = make_base_laplace_threshold_budget(
preprocess,
d_in=max_influence, d_out=budget)
sensitive_counts = histogram(data)
released_counts = noisy_histogram(data)
# postprocess to make the results easier to compare
postprocessed_counts = {k: round(v) for k, v in released_counts.items()}
print("Educational level counts:\n", sensitive_counts)
print("DP Educational level counts:\n", postprocessed_counts)
def as_array(data):
return [data.get(k, 0) for k in categories]
plot_histogram(sensitive_counts, as_array(released_counts))
Educational level counts:
[33, 14, 38, 17, 24, 21, 31, 51, 201, 60, 165, 76, 178, 54, 24, 13, 0, 0, 0, 0]
DP Educational level counts:
{'3': 39, '10': 60, '1': 32, '11': 160, '13': 178, '7': 31, '14': 56, '5': 23, '8': 51, '12': 77, '6': 23, '9': 201, '15': 24}