{ "cells": [ { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "# Aggregation: Mean\n", "\n", "Any constructors that have not completed the proof-writing and vetting process may still be accessed if you opt in to \"contrib\".\n", "Please contact us if you are interested in proof-writing. Thank you!" ] }, { "cell_type": "code", "execution_count": 51, "metadata": {}, "outputs": [], "source": [ "from opendp.mod import enable_features\n", "enable_features(\"contrib\")" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Known Dataset Size\n", "The simpler case is when the dataset size is known:" ] }, { "cell_type": "code", "execution_count": 52, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "5.0" ] }, "execution_count": 52, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from opendp.transformations import make_sized_bounded_mean\n", "sb_mean_trans = make_sized_bounded_mean(size=10, bounds=(0., 10.))\n", "sb_mean_trans([5.] * 10)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "The sensitivity of this transformation is the same as in `make_sized_bounded_sum`, but divided by `size`.\n", "\n", "That is, $map(d_{in}) = (d_{in} // 2) \\cdot (U - L) / size$, where $//$ denotes integer division with truncation." 
] }, { "cell_type": "code", "execution_count": 53, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "1.0000000000000169" ] }, "execution_count": 53, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# since we are in the bounded-DP model, d_in should be a multiple of 2,\n", "# because it takes one removal and one addition to change one record\n", "sb_mean_trans.map(2)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Note that this operation does not divide by the length of the input data;\n", "it divides by the `size` parameter passed to the constructor.\n", "As in any other context, the data passed into the function is expected to be a member of the input domain.\n", "In the next cell, a dataset of length 1 is not in the input domain of a transformation built with `size=10`, so the result is the sum divided by 10 rather than the mean of the data." ] }, { "cell_type": "code", "execution_count": 54, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.5" ] }, "execution_count": 54, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sb_mean_trans = make_sized_bounded_mean(size=10, bounds=(0., 10.))\n", "sb_mean_trans([5.])" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Unknown Dataset Size\n", "\n", "There are several approaches for releasing the mean when the dataset size is unknown.\n", "\n", "The first approach is to use the resize transformation.\n", "You first release a DP estimate of the dataset size, and then preprocess the dataset with a resize transformation so that it has exactly that many records." ] }, { "cell_type": "code", "execution_count": 55, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "5.239477071130359" ] }, "execution_count": 55, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from opendp.transformations import make_count, make_clamp, make_bounded_resize\n", "from opendp.measurements import make_base_discrete_laplace, make_base_laplace\n", "\n", "data = [5.] 
* 10\n", "bounds = (0., 10.)\n", "# release a DP count to use as the dataset size\n", "count_meas = make_count(TIA=float) >> make_base_discrete_laplace(1.)\n", "\n", "dp_count = count_meas(data)\n", "\n", "# clamp, resize to exactly dp_count records (imputing `constant` if there are too few),\n", "# then release the mean with Laplace noise\n", "mean_meas = (\n", "    make_clamp(bounds) >>\n", "    make_bounded_resize(dp_count, bounds, constant=5.) >>\n", "    make_sized_bounded_mean(dp_count, bounds) >>\n", "    make_base_laplace(1.)\n", ")\n", "\n", "mean_meas(data)\n" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "The total privacy expenditure is the composition of the `count_meas` and `mean_meas` releases." ] }, { "cell_type": "code", "execution_count": 56, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "2.000000000000017" ] }, "execution_count": 56, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from opendp.combinators import make_basic_composition\n", "make_basic_composition([count_meas, mean_meas]).map(1)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Another approach is to compute the DP sum and DP count, and then postprocess the output." ] }, { "cell_type": "code", "execution_count": 59, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "dp mean: 4.4093953568039455\n", "epsilon: 2.000000009313226\n" ] } ], "source": [ "from opendp.transformations import make_bounded_sum\n", "# compose a DP sum and a DP count into a single measurement\n", "dp_fraction_meas = make_basic_composition([\n", "    make_clamp(bounds) >> make_bounded_sum(bounds) >> make_base_laplace(10.),\n", "    make_count(TIA=float) >> make_base_discrete_laplace(1.)\n", "])\n", "\n", "# postprocess: the DP mean is the ratio of the two releases\n", "dp_sum, dp_count = dp_fraction_meas(data)\n", "print(\"dp mean:\", dp_sum / dp_count)\n", "print(\"epsilon:\", dp_fraction_meas.map(1))" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "The same approaches apply to the variance estimator.\n", "The [Unknown Dataset Size notebook](../../examples/unknown-dataset-size.ipynb) goes into greater detail on the tradeoffs of these approaches." 
] } ], "metadata": { "kernelspec": { "display_name": "psi", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.13" }, "orig_nbformat": 4, "vscode": { "interpreter": { "hash": "3220da548452ac41acb293d0d6efded0f046fab635503eb911c05f743e930f34" } } }, "nbformat": 4, "nbformat_minor": 2 }