String
In the string module, OpenDP currently only supports parsing to temporal data types.
[10]:
import polars as pl
import opendp.prelude as dp
dp.enable_features("contrib")
# Fetch and unpack the data.
![ -e ../sample_FR_LFS.csv ] || ( curl 'https://github.com/opendp/dp-test-datasets/blob/main/data/sample_FR_LFS.csv.zip?raw=true' --location --output sample_FR_LFS.csv.zip; unzip sample_FR_LFS.csv.zip -d ../ )
context = dp.Context.compositor(
    # Many columns contain mixtures of strings and numbers and cannot be parsed as floats,
    # so we'll set `ignore_errors` to true to avoid conversion errors.
    data=pl.scan_csv("../sample_FR_LFS.csv", ignore_errors=True),
    privacy_unit=dp.unit_of(contributions=36),
    privacy_loss=dp.loss_of(epsilon=1.0, delta=1e-7),
    split_evenly_over=2,
)
Strptime, To Date, To Datetime, To Time
Dates, datetimes, and times can be parsed from strings via .str.strptime and its variants .str.to_date, .str.to_datetime, and .str.to_time.
[11]:
query = (
    context.query()
    .with_columns(pl.col.YEAR.cast(str).str.to_date(format=r"%Y"))
    .group_by("YEAR")
    .agg(dp.len())
)
query.release().collect().sort("YEAR")
[11]:
| YEAR | len |
|---|---|
| date | u32 |
| 2004-01-01 | 16510 |
| 2005-01-01 | 16448 |
| 2006-01-01 | 16108 |
| 2007-01-01 | 16802 |
| 2008-01-01 | 16757 |
| 2009-01-01 | 19846 |
| 2010-01-01 | 24061 |
| 2011-01-01 | 24842 |
| 2012-01-01 | 24834 |
| 2013-01-01 | 23316 |
Polars can infer the datetime format automatically from the data, but the inferred format is data-dependent: adding or removing a single individual can change the inferred format, or make inference fail altogether, which makes the computation unstable. For this reason, OpenDP requires the format argument.
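The instability can be illustrated with a toy format-inference routine written against the Python standard library. This is only a sketch: the candidate list and the infer_format helper are hypothetical and do not reflect how Polars actually infers formats. It shows that one additional record can change the inferred format, and with it the parsed value of every other record.

```python
from datetime import datetime

# Hypothetical candidate formats; Polars' real inference logic differs.
CANDIDATE_FORMATS = ["%d/%m/%Y", "%m/%d/%Y"]


def infer_format(values):
    """Return the first candidate format that parses every value, else None."""
    for fmt in CANDIDATE_FORMATS:
        try:
            for value in values:
                datetime.strptime(value, fmt)
        except ValueError:
            continue  # at least one value does not fit this format
        return fmt
    return None


data = ["01/02/2004", "03/04/2005"]
neighbor = data + ["12/25/2006"]  # the same data plus one more individual

fmt_a = infer_format(data)      # day-first format fits both records
fmt_b = infer_format(neighbor)  # the new record only fits month-first

# The original two records are parsed differently under each inferred format.
parsed_a = [datetime.strptime(v, fmt_a).date() for v in data]
parsed_b = [datetime.strptime(v, fmt_b).date() for v in data]

print(fmt_a, parsed_a)
print(fmt_b, parsed_b)
```

Because a single record changes how all of the other records are interpreted, no bound on the effect of one individual holds; fixing the format up front avoids this.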
OpenDP also does not allow parsing strings into nanosecond datetimes, because the underlying implementation raises data-dependent errors for certain inputs, and such errors are not private.
[12]:
query = (
    context.query()
    .with_columns(pl.col.YEAR.cast(str).str.to_datetime(format=r"%Y", time_unit="ns"))
    .group_by("YEAR")
    .agg(dp.len())
)
try:
    # The release is expected to be rejected because of time_unit="ns".
    query.release()
    assert False, "unreachable!"
except dp.OpenDPException as err:
    assert "Nanoseconds are not currently supported" in str(err)
Parsed data can then be manipulated with temporal expressions.