Polars vs. OpenDP
We’ll assume you have created an opendp.context.Context named context.
OpenDP Polars differs from typical Polars in these ways:
How you specify the data. Instead of directly manipulating the data (a LazyFrame), you now manipulate an opendp.extras.polars.LazyFrameQuery returned by context.query(). You can think of context.query() as a mock for the real data (although in reality, a LazyFrameQuery is an empty LazyFrame with some extra methods).

>>> #                                 /‾‾‾‾‾‾‾‾‾‾‾‾‾\
>>> query: dp.polars.LazyFrameQuery = context.query().select(dp.len())
How you construct the query. OpenDP extends the Polars API to include differentially private methods and statistics.

LazyFrame (now LazyFrameQuery) has additional methods, like .summarize and .release.

>>> #    /‾‾‾‾‾‾‾‾‾‾\
>>> query.summarize()
shape: (1, 4)
┌────────┬──────────────┬─────────────────┬───────┐
│ column ┆ aggregate    ┆ distribution    ┆ scale │
│ ---    ┆ ---          ┆ ---             ┆ ---   │
│ str    ┆ str          ┆ str             ┆ f64   │
╞════════╪══════════════╪═════════════════╪═══════╡
│ len    ┆ Frame Length ┆ Integer Laplace ┆ 10.0  │
└────────┴──────────────┴─────────────────┴───────┘
Expressions also have an additional namespace .dp with methods from opendp.extras.polars.DPExpr.

>>> candidates = list(range(0, 100, 10))
>>> _ = context.query().select(
...     #                            /‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾\
...     pl.col("income").fill_null(0).dp.median(candidates)
... )
How you run the query. When used from OpenDP, you must first call .release() before executing the computation with .collect(). .release() accounts for the privacy loss of releasing the query, and updates the privacy budget.

>>> #    /‾‾‾‾‾‾‾‾\
>>> query.release().collect()
shape: (1, 1)
┌─────┐
│ len │
│ --- │
│ u32 │
╞═════╡
│ ... │
└─────┘
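To make the ordering concrete, here is a toy sketch in plain Python of why the budget is charged before any result is handed back. The ToyContext class, its release method, and the per-query epsilon cost are hypothetical stand-ins invented for illustration. They are not OpenDP's API, and OpenDP's real accounting is more involved.

```python
class ToyContext:
    """Hypothetical stand-in for a privacy accountant (not OpenDP's API)."""

    def __init__(self, epsilon_budget: float) -> None:
        self.remaining = epsilon_budget

    def release(self, epsilon_cost: float) -> str:
        # Refuse the release outright if it would overspend the budget.
        if epsilon_cost > self.remaining:
            raise ValueError("privacy budget exhausted")
        # Charge the budget first; only then hand back something collectable.
        self.remaining -= epsilon_cost
        return "ok to collect"


toy = ToyContext(epsilon_budget=1.0)
toy.release(epsilon_cost=0.5)
print(toy.remaining)  # 0.5
```

In this sketch a query that would exceed the remaining budget fails before any computation runs, and a rejected release leaves the budget unchanged.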
What queries are allowed. OpenDP only makes guarantees about query plans and expressions it knows about. OpenDP therefore acts somewhat like an allow-list of valid query plans.
To satisfy differential privacy, there are also cases where OpenDP must change the arguments to a Polars expression. Most commonly this is to ensure that failures don’t raise data-dependent errors. OpenDP may also make arguments mandatory (for example, format strings in temporal parsing), or disallow the use of expressions on certain data types (for example, imputation on categorical data).
These changes in behavior, and the reasoning behind them, are discussed in Expressions.
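As a rough illustration of why data-dependent errors matter, consider the plain-Python sketch below. It uses neither OpenDP nor Polars, and parse_strict and parse_lenient are hypothetical helpers invented for this example. A parser that raises on malformed input reveals, through the failure itself, that some record was malformed. A parser that maps bad values to a missing marker never raises, whatever the data contains.

```python
from typing import Optional


def parse_strict(values: list[str]) -> list[int]:
    # Raises ValueError if any record is malformed, so whether the query
    # fails depends on the private data.
    return [int(v) for v in values]


def parse_lenient(values: list[str]) -> list[Optional[int]]:
    # Never raises: malformed records become None, which can be imputed later.
    out: list[Optional[int]] = []
    for v in values:
        try:
            out.append(int(v))
        except ValueError:
            out.append(None)
    return out


records = ["10", "20", "oops"]
print(parse_lenient(records))  # [10, 20, None]
```

The lenient variant trades an error for a missing value, which is the general shape of the argument changes described above: the failure mode no longer depends on the contents of the dataset.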