Data Types#

A column/series in Polars contains contiguous data backed by Arrow arrays, as well as a validity bitmap to indicate null values. This means all data types have a means to represent nullity.

OpenDP assumes that any column may contain nulls. Use .fill_null to impute them, because certain aggregations require data that is free of nulls.

[1]:
import polars as pl
import opendp.prelude as dp

dp.enable_features("contrib")
![ -e sample_FR_LFS.csv ] || ( curl 'https://github.com/opendp/dp-test-datasets/blob/main/data/sample_FR_LFS.csv.zip?raw=true' --location --output sample_FR_LFS.csv.zip; unzip sample_FR_LFS.csv.zip )

Boolean#

Boolean is the simplest data type and has no additional domain descriptors.

Integer#

OpenDP Polars supports the UInt32, UInt64, Int8, Int16, Int32 and Int64 integer data types. UInt8 and UInt16 are not supported.

OpenDP tracks lower and upper bounds of numeric data types. These bounds can be acquired from .clip or set in the input domain (although this is not recommended). Bounds can be lost upon further data processing, and certain aggregations require them.

Float#

OpenDP Polars supports Float32 and Float64 float data types.

In addition to bounds, OpenDP also tracks the potential presence of NaN values. OpenDP assumes that float data may contain NaNs. Use .fill_nan to impute them, because certain aggregations require data that is free of NaNs.

This means float aggregations typically require preprocessing with both .fill_null and .fill_nan, to impute both nulls and NaNs.

String#

Strings have no domain descriptors. Of all the data types, they take up the most space in memory and are the slowest to work with.

Categorical#

Categorical data appears to be string data, but its underlying data representation is pl.UInt32 indices into an array of string labels. This results in much lower memory usage and a faster runtime. These integer indices can be retrieved via the .to_physical expression.

Unfortunately, there are two limitations to keep in mind:

  • OpenDP forbids expressions that may add or remove categories, because this triggers a data-dependent categorical remapping warning. Side-effects like this do not satisfy differential privacy. This means OpenDP rejects the use of categorical data in, for example, .fill_null and binary expressions.

  • The encoding of categories typically assigns indices according to the order of records in the data. Since revealing row ordering does not satisfy differential privacy, OpenDP only allows categorical group-by columns when the encoding is data-independent. The .cut expression, for example, has a data-independent encoding.

The following code shows the latter limitation in practice:

[2]:
context_categorical = dp.Context.compositor(
    data=pl.LazyFrame([pl.Series("categorical", ["A", "B", "C"] * 1000, dtype=pl.Categorical)]),
    privacy_unit=dp.unit_of(contributions=1),
    privacy_loss=dp.loss_of(epsilon=1.0, delta=1e-7),
    split_evenly_over=1,
)

query = (
    context_categorical.query()
    .group_by("categorical")
    .agg(dp.len())
)
try:
    query.release()
    assert False, "unreachable, should have raised"
except dp.OpenDPException as err:
    # the error we would expect to get:
    assert "Categories are data-dependent" in str(err)

Temporal#

OpenDP supports three kinds of temporal data types: pl.Date, pl.Datetime and pl.Time. Datetimes may also store time zone information, which is considered part of the data domain (all datetimes in a column must share the same time zone). Internally, datetimes may represent time in units of nanoseconds, microseconds or milliseconds.

[3]:
from datetime import time, datetime, date

# data ingest with different kinds of temporal data
context = dp.Context.compositor(
    data=pl.LazyFrame({
        "time":     [time(12, 30), time(1, 0), time(23, 10)] * 1000,
        "datetime": [datetime(2000, 1, 1, hour=12), datetime(2020, 1, 1, hour=12)] * 1500,
        "date":     [date(2000, 1, 1), date(2010, 1, 1), date(2020, 1, 1)] * 1000,
    }),
    privacy_unit=dp.unit_of(contributions=1),
    privacy_loss=dp.loss_of(epsilon=1.0, delta=1e-7),
    split_evenly_over=1,
)

# releasing a private histogram with common times
query = context.query().group_by("time").agg(dp.len())
query.release().collect()
[3]:
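shape: (3, 2)
┌──────────┬──────┐
│ time     ┆ len  │
│ ---      ┆ ---  │
│ time     ┆ u32  │
╞══════════╪══════╡
│ 12:30:00 ┆ 1003 │
│ 01:00:00 ┆ 1000 │
│ 23:10:00 ┆ 1000 │
└──────────┴──────┘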

Refer to the expression documentation for ways to parse strings into temporal data and manipulate temporal data with methods in the .dt module.