{ "cells": [ { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "# First Look at DP\n", "\n", "Differential privacy (DP) is a technique used to release information about a population\n", "in a way that limits the exposure of any one individual's personal information.\n", "\n", "In this notebook, we'll conduct a differentially private analysis on a teacher survey (a tabular dataset).\n", "\n", "The raw data consists of survey responses from teachers in primary and secondary schools in an unspecified U.S. state." ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Why Differential Privacy?\n", "\n", "Protecting the privacy of individuals while sharing information is nontrivial.\n", "Let's say I naively \"anonymized\" the teacher survey by removing each respondent's name." ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "\n", "# load the raw survey responses\n", "df = pd.read_csv(\"../data/teacher_survey/teacher_survey.csv\")\n", "\n", "# assign readable column names\n", "df.columns = ['name',\n", " 'sex',\n", " 'age',\n", " 'maritalStatus',\n", " 'hasChildren',\n", " 'highestEducationLevel',\n", " 'sourceOfStress',\n", " 'smoker',\n", " 'optimism',\n", " 'lifeSatisfaction',\n", " 'selfEsteem']\n", "\n", "# naively \"anonymize\" by removing the name column\n", "del df[\"name\"]" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "It would still be very easy to re-identify individuals via quasi-identifiers:\n", "attributes such as sex and age that are not unique on their own but can single out a person in combination.\n", "Say I were curious about my non-binary co-worker Chris, and I knew their age (27)." ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
|      | sex | age | maritalStatus | smoker |\n",
"| ---- | --- | --- | ------------- | ------ |\n",
"| 2251 | 3 | 27 | 3 | 1 |\n", "