{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Preparing Microdata\n",
    "\n",
    "Data is seldom already in the form you need it in.\n",
    "We use Polars _expressions_ to describe how to build new columns\n",
    "and Polars _contexts_ to describe how those expressions are applied to your data.\n",
    "More information can be found in the [Polars User Guide](https://docs.pola.rs/user-guide/concepts/expressions-and-contexts/#group_by-and-aggregations).\n",
    "\n",
    "This section explains OpenDP's supported contexts for preparing microdata (column addition and filtering)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import polars as pl\n",
    "import opendp.prelude as dp\n",
    "dp.enable_features(\"contrib\")\n",
    "\n",
    "context = dp.Context.compositor(\n",
    "    # Many columns contain mixtures of strings and numbers and cannot be parsed as floats,\n",
    "    # so we'll set `ignore_errors` to true to avoid conversion errors.\n",
    "    data=pl.scan_csv(dp.examples.get_france_lfs_path(), ignore_errors=True),\n",
    "    privacy_unit=dp.unit_of(contributions=36),\n",
    "    privacy_loss=dp.loss_of(epsilon=1.0, delta=1e-7),\n",
    "    split_evenly_over=4,\n",
    "    margins=[dp.polars.Margin(max_partition_length=60_000_000 * 36)]\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "vscode": {
     "languageId": "plaintext"
    }
   },
   "source": [
    "Previous documentation sections cover the `.select` context for aggregation and the `.agg` context for aggregation.\n",
    "OpenDP allows expressions used in the `.select` context and `.agg` context to change the number and order of rows,\n",
    "whereas expressions used in the `.with_columns` context, `.filter` context and `.group_by` context must be row-by-row.\n",
    "\n",
    "## With Columns\n",
    "\n",
    "[[Polars Documentation](https://docs.pola.rs/user-guide/concepts/expressions-and-contexts/#with_columns)]\n",
    "\n",
    "`.with_columns` resolves each passed expression to a column and then adds those columns to the data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div><style>\n",
       ".dataframe > thead > tr,\n",
       ".dataframe > tbody > tr {\n",
       "  text-align: right;\n",
       "  white-space: pre-wrap;\n",
       "}\n",
       "</style>\n",
       "<small>shape: (4, 2)</small><table border=\"1\" class=\"dataframe\"><thead><tr><th>HWUSUAL</th><th>len</th></tr><tr><td>cat</td><td>u32</td></tr></thead><tbody><tr><td>&quot;(0, 20]&quot;</td><td>5795</td></tr><tr><td>&quot;(20, 40]&quot;</td><td>53695</td></tr><tr><td>&quot;(40, 60]&quot;</td><td>15884</td></tr><tr><td>&quot;(98, inf]&quot;</td><td>120157</td></tr></tbody></table></div>"
      ],
      "text/plain": [
       "shape: (4, 2)\n",
       "┌───────────┬────────┐\n",
       "│ HWUSUAL   ┆ len    │\n",
       "│ ---       ┆ ---    │\n",
       "│ cat       ┆ u32    │\n",
       "╞═══════════╪════════╡\n",
       "│ (0, 20]   ┆ 5795   │\n",
       "│ (20, 40]  ┆ 53695  │\n",
       "│ (40, 60]  ┆ 15884  │\n",
       "│ (98, inf] ┆ 120157 │\n",
       "└───────────┴────────┘"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "query_hwusual_binned = (\n",
    "    context.query()\n",
    "    # shadows the usual work hours \"HWUSUAL\" column with binned data\n",
    "    .with_columns(pl.col.HWUSUAL.cut(breaks=[0, 20, 40, 60, 80, 98]))\n",
    "    .group_by(pl.col.HWUSUAL)\n",
    "    .agg(dp.len())\n",
    ")\n",
    "query_hwusual_binned.release().collect().sort(\"HWUSUAL\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To ensure that the privacy unit remains meaningful, expressions passed into `.with_columns` must be row-by-row, \n",
    "meaning that the expression could be represented as a function applied to each row in the data.\n",
    "The row-by-row property implies that the expression doesn't break the alignment between individual contributions in the data \n",
    "and their individual contributions in the new constructed columns.\n",
    "\n",
    "Another consideration is that any new columns added by `.with_columns` do not (currently) have margin descriptors.\n",
    "For instance, in the above query, any margin descriptors related to `HWUSUAL` would no longer apply to the new, shadowing, \n",
    "`HWUSUAL` column after `.with_columns`."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Select\n",
    "\n",
    "[[Polars Documentation](https://docs.pola.rs/user-guide/concepts/expressions-and-contexts/#select)]\n",
    "\n",
    "`.select` resolves each passed expression to a column and then returns those columns.\n",
    "The behavior is the same as `.with_columns`, but only the columns specified in expressions will remain."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Group By\n",
    "\n",
    "[Polars Documentation](https://docs.pola.rs/user-guide/concepts/expressions-and-contexts/#group_by-and-aggregations)\n",
    "\n",
    "`.group_by` also resolves each passed expression to a column, and then groups on those columns.\n",
    "Just like `.select` and `.with_columns`, these expressions must be row-by-row."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div><style>\n",
       ".dataframe > thead > tr,\n",
       ".dataframe > tbody > tr {\n",
       "  text-align: right;\n",
       "  white-space: pre-wrap;\n",
       "}\n",
       "</style>\n",
       "<small>shape: (4, 2)</small><table border=\"1\" class=\"dataframe\"><thead><tr><th>HWUSUAL</th><th>len</th></tr><tr><td>cat</td><td>u32</td></tr></thead><tbody><tr><td>&quot;(0, 20]&quot;</td><td>6318</td></tr><tr><td>&quot;(20, 40]&quot;</td><td>54107</td></tr><tr><td>&quot;(40, 60]&quot;</td><td>15034</td></tr><tr><td>&quot;(98, inf]&quot;</td><td>120217</td></tr></tbody></table></div>"
      ],
      "text/plain": [
       "shape: (4, 2)\n",
       "┌───────────┬────────┐\n",
       "│ HWUSUAL   ┆ len    │\n",
       "│ ---       ┆ ---    │\n",
       "│ cat       ┆ u32    │\n",
       "╞═══════════╪════════╡\n",
       "│ (0, 20]   ┆ 6318   │\n",
       "│ (20, 40]  ┆ 54107  │\n",
       "│ (40, 60]  ┆ 15034  │\n",
       "│ (98, inf] ┆ 120217 │\n",
       "└───────────┴────────┘"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "query_hwusual_binned = (\n",
    "    context.query()\n",
    "    .group_by(pl.col.HWUSUAL.cut([0, 20, 40, 60, 80, 98]))\n",
    "    .agg(dp.len())\n",
    ")\n",
    "query_hwusual_binned.release().collect().sort(\"HWUSUAL\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This is the same query as shown above, but with the binning moved into the group by context."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Filter\n",
    "\n",
    "[[Polars Documentation](https://docs.pola.rs/user-guide/concepts/expressions-and-contexts/#filter)]\n",
    "\n",
    "`.filter` uses row-by-row expressions of booleans to mask rows."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div><style>\n",
       ".dataframe > thead > tr,\n",
       ".dataframe > tbody > tr {\n",
       "  text-align: right;\n",
       "  white-space: pre-wrap;\n",
       "}\n",
       "</style>\n",
       "<small>shape: (1, 1)</small><table border=\"1\" class=\"dataframe\"><thead><tr><th>HWUSUAL</th></tr><tr><td>i64</td></tr></thead><tbody><tr><td>12535165</td></tr></tbody></table></div>"
      ],
      "text/plain": [
       "shape: (1, 1)\n",
       "┌──────────┐\n",
       "│ HWUSUAL  │\n",
       "│ ---      │\n",
       "│ i64      │\n",
       "╞══════════╡\n",
       "│ 12535165 │\n",
       "└──────────┘"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "query_total_hours_worked = (\n",
    "    context.query()\n",
    "    .with_columns(pl.col.HWUSUAL.cast(int).fill_null(0))\n",
    "    .filter(pl.col.HWUSUAL > 0)\n",
    "    .select(pl.col.HWUSUAL.dp.sum((0, 80)))\n",
    ")\n",
    "query_total_hours_worked.release().collect()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Filtering discards _all_ `public_info` invariants about the partition keys and partition sizes.\n",
    "Margin descriptors are considered applicable for the input dataset, \n",
    "so a data-dependent filtering renders these invariants invalid.\n",
    "\n",
    "Otherwise, filtering preserves all other margin descriptors, \n",
    "because filtering only ever removes rows."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": ".venv",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.13.0"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}