{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# String\n", "\n", "[[Polars Documentation](https://docs.pola.rs/api/python/stable/reference/expressions/string.html)]\n", "\n", "In the string module, OpenDP currently only supports parsing to temporal data types." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import polars as pl\n", "import opendp.prelude as dp\n", "dp.enable_features(\"contrib\")\n", "\n", "context = dp.Context.compositor(\n", " # Many columns contain mixtures of strings and numbers and cannot be parsed as floats,\n", " # so we'll set `ignore_errors` to true to avoid conversion errors.\n", " data=pl.scan_csv(dp.examples.get_france_lfs_path(), ignore_errors=True),\n", " privacy_unit=dp.unit_of(contributions=36),\n", " privacy_loss=dp.loss_of(epsilon=1.0, delta=1e-7),\n", " split_evenly_over=2,\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Strptime, To Date, To Datetime, To Time\n", "\n", "Dates can be parsed from strings via `.str.strptime`, and its variants `.str.to_date`, `.str.to_datetime`, and `.str.to_time`." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "shape: (9, 2)
YEARlen
dateu32
2005-01-01342193
2006-01-01339683
2007-01-01350429
2008-01-01348574
2009-01-01416966
2010-01-01500385
2011-01-01517166
2012-01-01515460
2013-01-01480615
" ], "text/plain": [ "shape: (9, 2)\n", "┌────────────┬────────┐\n", "│ YEAR ┆ len │\n", "│ --- ┆ --- │\n", "│ date ┆ u32 │\n", "╞════════════╪════════╡\n", "│ 2005-01-01 ┆ 342193 │\n", "│ 2006-01-01 ┆ 339683 │\n", "│ 2007-01-01 ┆ 350429 │\n", "│ 2008-01-01 ┆ 348574 │\n", "│ 2009-01-01 ┆ 416966 │\n", "│ 2010-01-01 ┆ 500385 │\n", "│ 2011-01-01 ┆ 517166 │\n", "│ 2012-01-01 ┆ 515460 │\n", "│ 2013-01-01 ┆ 480615 │\n", "└────────────┴────────┘" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "query = (\n", " context.query()\n", " .with_columns(pl.col.YEAR.cast(str).str.to_date(format=r\"%Y\"))\n", " .group_by(\"YEAR\")\n", " .agg(dp.len())\n", ")\n", "query.release().collect().sort(\"YEAR\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "While Polars supports automatic inference of the datetime format from reading the data,\n", "doing so can lead to situations where the data-dependent inferred format changes or cannot be inferred upon the addition or removal of a single individual, resulting in an unstable computation.\n", "For this reason, the `format` argument is required.\n", "\n", "OpenDP also does not allow parsing strings into nanosecond datetimes, \n", "because the underlying implementation throws data-dependent errors (not private) [for certain inputs](https://github.com/pola-rs/polars/issues/19928)." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "query = (\n", " context.query()\n", " .with_columns(pl.col.YEAR.cast(str).str.to_datetime(format=r\"%Y\", time_unit=\"ns\"))\n", " .group_by(\"YEAR\")\n", " .agg(dp.len())\n", ")\n", "try:\n", " query.release()\n", " assert False, \"unreachable!\"\n", "except dp.OpenDPException as err:\n", " assert \"Nanoseconds are not currently supported\" in str(err)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Parsed data can then be manipulated with [temporal expressions](temporal.ipynb)." ] } ], "metadata": { "kernelspec": { "display_name": ".venv", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.13.2" } }, "nbformat": 4, "nbformat_minor": 2 }