Polars vs. OpenDP

We’ll assume you have created an opendp.context.Context named context. OpenDP Polars differs from typical Polars in these ways:

  1. How you specify the data. Instead of directly manipulating the data (a LazyFrame), you manipulate an opendp.extras.polars.LazyFrameQuery returned by context.query(). You can think of context.query() as a mock for the real data (in reality, a LazyFrameQuery is an empty LazyFrame with some extra methods).

    >>> #                                 /‾‾‾‾‾‾‾‾‾‾‾‾‾\
    >>> query: dp.polars.LazyFrameQuery = context.query().select(dp.len())
    
  2. How you construct the query. OpenDP extends the Polars API to include differentially private methods and statistics. LazyFrame (now LazyFrameQuery) has additional methods, like .summarize and .release.

    >>> #    /‾‾‾‾‾‾‾‾‾‾\
    >>> query.summarize()
    shape: (1, 4)
    ┌────────┬──────────────┬─────────────────┬───────┐
    │ column ┆ aggregate    ┆ distribution    ┆ scale │
    │ ---    ┆ ---          ┆ ---             ┆ ---   │
    │ str    ┆ str          ┆ str             ┆ f64   │
    ╞════════╪══════════════╪═════════════════╪═══════╡
    │ len    ┆ Frame Length ┆ Integer Laplace ┆ 10.0  │
    └────────┴──────────────┴─────────────────┴───────┘
    

    Expressions also have an additional namespace .dp with methods from opendp.extras.polars.DPExpr.

    >>> candidates = list(range(0, 100, 10))
    >>> _ = context.query().select(
    ...     #                            /‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾\
    ...     pl.col("income").fill_null(0).dp.median(candidates)
    ... )
    
  3. How you run the query. In OpenDP, you must call .release() before executing the computation with .collect(). .release() accounts for the privacy loss of releasing the query and updates the privacy budget.

    >>> #    /‾‾‾‾‾‾‾‾\
    >>> query.release().collect()
    shape: (1, 1)
    ┌─────┐
    │ len │
    │ --- │
    │ u32 │
    ╞═════╡
    │ ... │
    └─────┘
    
  4. What queries are allowed. OpenDP only makes guarantees about query plans and expressions it knows about. Therefore OpenDP is somewhat like an allow-list on valid query plans.

    To satisfy differential privacy, there are also cases where OpenDP must change the arguments to a Polars expression. Most commonly this is to ensure that failures don’t raise data-dependent errors. OpenDP may also make arguments mandatory (for example, format strings in temporal parsing), or disallow the use of expressions on certain data types (for example, imputation on categorical data).

    These changes in behavior, and the reasoning behind them, are discussed in Expressions.
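
To make the accounting described in item 3 concrete, here is a toy model in plain Python. It is only a sketch of the idea that each release is charged against a finite budget. ToyBudget, num_queries, and the even split are hypothetical and are not part of the OpenDP API, whose actual accounting may work differently.

```python
class ToyBudget:
    """Hypothetical stand-in for a privacy-budget accountant (not an OpenDP class)."""

    def __init__(self, epsilon: float, num_queries: int):
        # Assumption: the total budget is split evenly across a fixed number of queries.
        self.per_query = epsilon / num_queries
        self.remaining_queries = num_queries

    def release(self) -> float:
        """Charge one query's share, refusing once the budget is exhausted."""
        if self.remaining_queries == 0:
            raise RuntimeError("privacy budget exhausted")
        self.remaining_queries -= 1
        return self.per_query


budget = ToyBudget(epsilon=1.0, num_queries=2)
spent = [budget.release(), budget.release()]  # each release is charged 0.5
```

In this sketch a third call to budget.release() raises an error, which mirrors why the release step sits between building a query and collecting its result.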
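
The allow-list idea in item 4 can be sketched the same way. The example below is a toy validator in plain Python and does not reflect how OpenDP inspects query plans. ALLOWED_OPS, check_plan, and the tuple encoding of a plan are all hypothetical, and the set of operations listed is illustrative rather than OpenDP's real list.

```python
# Hypothetical allow-list; the expressions OpenDP actually supports differ.
ALLOWED_OPS = {"col", "fill_null", "dp.median", "len"}


def check_plan(plan: tuple) -> None:
    """Raise ValueError if any operation in the nested plan is not allow-listed.

    A plan is a tuple (op_name, *args); any argument that is a tuple is a sub-plan.
    """
    op, *args = plan
    if op not in ALLOWED_OPS:
        raise ValueError(f"operation {op!r} is not supported")
    for arg in args:
        if isinstance(arg, tuple):
            check_plan(arg)


# Toy encoding of pl.col("income").fill_null(0).dp.median(candidates).
supported = ("dp.median", ("fill_null", ("col", "income"), 0), list(range(0, 100, 10)))
check_plan(supported)  # passes silently

# An operation outside the allow-list is rejected before any data is touched.
unsupported = ("sort", ("col", "income"))
try:
    check_plan(unsupported)
    rejected = False
except ValueError:
    rejected = True
```

Because the check looks only at the plan and never at the data, the error it raises is data-independent, which is the property the text asks of failures.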