
String

[Polars Documentation]

In the string module, OpenDP currently only supports parsing to temporal data types.

[10]:
import polars as pl
import opendp.prelude as dp
dp.enable_features("contrib")
# Fetch and unpack the data.
![ -e ../sample_FR_LFS.csv ] || ( curl 'https://github.com/opendp/dp-test-datasets/blob/main/data/sample_FR_LFS.csv.zip?raw=true' --location --output sample_FR_LFS.csv.zip; unzip sample_FR_LFS.csv.zip -d ../ )

context = dp.Context.compositor(
    # Many columns contain mixtures of strings and numbers and cannot be parsed as floats,
    # so we'll set `ignore_errors` to true to avoid conversion errors.
    data=pl.scan_csv("../sample_FR_LFS.csv", ignore_errors=True),
    privacy_unit=dp.unit_of(contributions=36),
    privacy_loss=dp.loss_of(epsilon=1.0, delta=1e-7),
    split_evenly_over=2,
)

Strptime, To Date, To Datetime, To Time

Strings can be parsed into temporal types via .str.strptime and its variants .str.to_date, .str.to_datetime, and .str.to_time.

[11]:
query = (
    context.query()
    .with_columns(pl.col.YEAR.cast(str).str.to_date(format=r"%Y"))
    .group_by("YEAR")
    .agg(dp.len())
)
query.release().collect().sort("YEAR")
[11]:
shape: (10, 2)
YEAR        len
date        u32
2004-01-01  16510
2005-01-01  16448
2006-01-01  16108
2007-01-01  16802
2008-01-01  16757
2009-01-01  19846
2010-01-01  24061
2011-01-01  24842
2012-01-01  24834
2013-01-01  23316
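
The January 1 dates in the output follow from the format string: `%Y` supplies only a year, so the parser fills in the first month and day. A minimal sketch of the same convention, using Python's standard-library `datetime.strptime` as a stand-in (an assumption for illustration; Polars performs its own parsing, and the `parse_year` helper is hypothetical):

```python
from datetime import date, datetime


def parse_year(text: str) -> date:
    """Stand-in for .str.to_date(format="%Y"), using stdlib parsing rather than Polars."""
    # "%Y" carries no month or day, so strptime defaults both to 1.
    return datetime.strptime(text, "%Y").date()


years = ["2004", "2005", "2013"]
parsed = [parse_year(y) for y in years]
print(parsed[0].isoformat())
```

Every parsed value lands on the first day of its year, which is why grouping by the parsed column yields one group per year.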

Polars can infer the datetime format automatically from the data, but the inferred format is data-dependent: adding or removing a single individual can change it, or make inference fail altogether, which makes the computation unstable. For this reason, OpenDP requires the format argument.
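
The instability can be illustrated with a toy inference routine. This is a sketch under stated assumptions: `infer_format` and `CANDIDATE_FORMATS` are hypothetical, built on the Python standard library, and do not reproduce Polars' actual inference logic. They only show how a format chosen by inspecting the data can change when one record is added.

```python
from datetime import datetime

# Hypothetical candidate list; real inference engines try many more patterns.
CANDIDATE_FORMATS = ["%d/%m/%Y", "%m/%d/%Y"]


def infer_format(values):
    """Return the first candidate format that parses every value, else None."""
    for fmt in CANDIDATE_FORMATS:
        try:
            for value in values:
                datetime.strptime(value, fmt)
        except ValueError:
            continue  # this candidate fails on some value; try the next one
        return fmt
    return None


base = ["01/02/2004", "03/04/2005"]
neighbor = base + ["12/25/2006"]         # one extra individual
conflicting = neighbor + ["25/12/2006"]  # one more, incompatible with both formats

format_base = infer_format(base)                # day-first format
format_neighbor = infer_format(neighbor)        # month-first format
format_conflicting = infer_format(conflicting)  # inference fails

# The same string is read as a different date in the two neighboring datasets.
date_in_base = datetime.strptime(base[0], format_base).date()
date_in_neighbor = datetime.strptime(base[0], format_neighbor).date()
```

A single added record flips how every existing row is interpreted, so the effect of one individual on the output is not bounded. Fixing the format up front removes that dependence on the data.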

OpenDP also does not allow parsing strings into nanosecond datetimes, because the underlying implementation raises data-dependent errors for certain inputs, and such errors are not private.
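
One plausible source of such input-dependent failures, offered here as an assumption rather than the documented cause, is the limited range of a signed 64-bit nanosecond timestamp, which spans roughly 292 years on either side of 1970. The sketch below uses only the Python standard library, and the `fits_in_ns` helper is hypothetical.

```python
from datetime import datetime, timedelta, timezone

I64_MAX = 2**63 - 1
I64_MIN = -(2**63)
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

# Latest instant representable as signed 64-bit nanoseconds since the epoch
# (truncated to microseconds, the finest resolution of stdlib datetime).
max_ns_instant = EPOCH + timedelta(microseconds=I64_MAX // 1_000)


def fits_in_ns(year: int) -> bool:
    """Check whether January 1 of `year` fits in an i64 nanosecond timestamp."""
    instant = datetime(year, 1, 1, tzinfo=timezone.utc)
    microseconds = (instant - EPOCH) // timedelta(microseconds=1)
    return I64_MIN <= microseconds * 1_000 <= I64_MAX


print(max_ns_instant.year)
print(fits_in_ns(2013), fits_in_ns(2300))
```

Whether a given string can be represented therefore depends on its value, so a failure would reveal something about the data. Coarser time units cover a much wider range.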

[12]:
query = (
    context.query()
    .with_columns(pl.col.YEAR.cast(str).str.to_datetime(format=r"%Y", time_unit="ns"))
    .group_by("YEAR")
    .agg(dp.len())
)
try:
    query.release()
    assert False, "unreachable!"
except dp.OpenDPException as err:
    assert "Nanoseconds are not currently supported" in str(err)

Parsed data can then be manipulated with temporal expressions.