{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# String\n", "\n", "[[Polars Documentation](https://docs.pola.rs/api/python/stable/reference/expressions/string.html)]\n", "\n", "In the string module, OpenDP currently only supports parsing to temporal data types." ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "import polars as pl\n", "import opendp.prelude as dp\n", "dp.enable_features(\"contrib\")\n", "# Fetch and unpack the data. \n", "![ -e ../sample_FR_LFS.csv ] || ( curl 'https://github.com/opendp/dp-test-datasets/blob/main/data/sample_FR_LFS.csv.zip?raw=true' --location --output sample_FR_LFS.csv.zip; unzip sample_FR_LFS.csv.zip -d ../ )\n", "\n", "context = dp.Context.compositor(\n", " # Many columns contain mixtures of strings and numbers and cannot be parsed as floats,\n", " # so we'll set `ignore_errors` to true to avoid conversion errors.\n", " data=pl.scan_csv(\"../sample_FR_LFS.csv\", ignore_errors=True),\n", " privacy_unit=dp.unit_of(contributions=36),\n", " privacy_loss=dp.loss_of(epsilon=1.0, delta=1e-7),\n", " split_evenly_over=2,\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Strptime, To Date, To Datetime, To Time\n", "\n", "Dates can be parsed from strings via `.str.strptime`, and its variants `.str.to_date`, `.str.to_datetime`, and `.str.to_time`." ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "shape: (10, 2)
YEARlen
dateu32
2004-01-0116510
2005-01-0116448
2006-01-0116108
2007-01-0116802
2008-01-0116757
2009-01-0119846
2010-01-0124061
2011-01-0124842
2012-01-0124834
2013-01-0123316
" ], "text/plain": [ "shape: (10, 2)\n", "┌────────────┬───────┐\n", "│ YEAR ┆ len │\n", "│ --- ┆ --- │\n", "│ date ┆ u32 │\n", "╞════════════╪═══════╡\n", "│ 2004-01-01 ┆ 16510 │\n", "│ 2005-01-01 ┆ 16448 │\n", "│ 2006-01-01 ┆ 16108 │\n", "│ 2007-01-01 ┆ 16802 │\n", "│ 2008-01-01 ┆ 16757 │\n", "│ 2009-01-01 ┆ 19846 │\n", "│ 2010-01-01 ┆ 24061 │\n", "│ 2011-01-01 ┆ 24842 │\n", "│ 2012-01-01 ┆ 24834 │\n", "│ 2013-01-01 ┆ 23316 │\n", "└────────────┴───────┘" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "query = (\n", " context.query()\n", " .with_columns(pl.col.YEAR.cast(str).str.to_date(format=r\"%Y\"))\n", " .group_by(\"YEAR\")\n", " .agg(dp.len())\n", ")\n", "query.release().collect().sort(\"YEAR\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "While Polars supports automatic inference of the datetime format from reading the data,\n", "doing so can lead to situations where the data-dependent inferred format changes or cannot be inferred upon the addition or removal of a single individual, resulting in an unstable computation.\n", "For this reason, the `format` argument is required.\n", "\n", "OpenDP also does not allow parsing strings into nanosecond datetimes, \n", "because the underlying implementation throws data-dependent errors (not private) [for certain inputs](https://github.com/pola-rs/polars/issues/19928)." ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "query = (\n", " context.query()\n", " .with_columns(pl.col.YEAR.cast(str).str.to_datetime(format=r\"%Y\", time_unit=\"ns\"))\n", " .group_by(\"YEAR\")\n", " .agg(dp.len())\n", ")\n", "try:\n", " query.release()\n", " assert False, \"unreachable!\"\n", "except dp.OpenDPException as err:\n", " assert \"Nanoseconds are not currently supported\" in str(err)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Parsed data can then be manipulated with [temporal expressions](temporal.ipynb)." ] } ], "metadata": { "kernelspec": { "display_name": ".venv", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.13.0" } }, "nbformat": 4, "nbformat_minor": 2 }