Mastering Polars Data Types and Missing Values
Polars is a lightning-fast DataFrame library written in Rust, with first-class Python bindings. One of its strengths lies in its strict and expressive type system. Understanding how Polars handles data types, especially nested ones, and how it manages missing data is crucial for writing efficient and bug-free data pipelines.
In this post, we’ll explore Polars’ core data structures, dive into nested types like Arrays and Lists, and demystify the difference between null and NaN.
Core Data Structures
Polars is built around three main structures:
- Series: A named, one-dimensional array of data. All elements in a Series must have the same data type.
- DataFrame: A two-dimensional table consisting of multiple Series (columns).
- LazyFrame: A representation of a query plan. Operations on a LazyFrame are not executed immediately. Instead, Polars optimizes the entire plan before execution, leading to significant performance gains.
Nested Data Types
Polars supports complex nested data types, allowing you to work with hierarchical data efficiently.
1. Polars Array (pl.Array)
The Array type represents fixed-size lists. Every element in an Array column must have the exact same number of items. This constraint allows Polars to store the data more efficiently in memory compared to variable-length lists.
```python
import polars as pl

coordinates = pl.DataFrame(
    [
        pl.Series("point2d", [[1, 3], [2, 3]]),
    ],
    schema={
        "point2d": pl.Array(shape=2, inner=pl.Int64),
    },
)
```
2. Polars List (pl.List)
The List type is more flexible, allowing for variable-length arrays. This is ideal for data like “daily temperature readings” where some days might have more readings than others.
```python
weather_readings = pl.DataFrame({
    "temperature": [[72.5, 75.0, 77.3], [68.0, 70.2]],
})
```
3. Polars Struct (pl.Struct)
A Struct is essentially a dictionary nested within a cell. It contains named fields, each with its own data type.
```python
rating_series = pl.Series(
    "rating",
    [
        {"Movies": "Cars", "Theater": "NE", "Avg_rating": 4.5},
        {"Movies": "Toy Story", "Theater": "ME", "Avg_rating": 4.9},
    ],
)
```
Handling Missing Data: null vs NaN
A common source of confusion in data analysis is the difference between “missing” and “undefined” data. Polars makes a clear distinction:
- null: Represents missing data. It indicates that the value is absent. This concept applies to all data types (Integers, Strings, Lists, etc.).
- NaN (Not a Number): A special floating-point value defined by the IEEE 754 standard. It represents an undefined or unrepresentable result, such as 0/0 or sqrt(-1). It only applies to floating-point columns.
Key Differences
- Aggregations: null values are typically ignored (e.g., mean() computes the mean of the non-null values). NaN values propagate: if a column contains a NaN, its sum or mean will also be NaN.
- Filling: Use fill_null() to handle missing data and fill_nan() to handle floating-point anomalies.
```python
import numpy as np

df = pl.DataFrame({
    "value": [1.0, np.nan, None, 4.0]
})

# Check which values are NaN and which are null
df.with_columns(
    is_nan=pl.col("value").is_nan(),
    is_null=pl.col("value").is_null(),
)
```
Data Type Conversion
Converting between data types is done using the .cast() method.
Strictness
By default, Polars is strict. If a cast is ambiguous or would result in data loss (that isn’t explicitly allowed), it will raise an error. For example, trying to cast the string “abc” to an Integer will fail.
You can relax this behavior by setting strict=False. In this mode, values that cannot be converted are replaced with null.
```python
df = pl.DataFrame({"val": ["1", "2", "a"]})

# This replaces "a" with null instead of raising an error
df.select(pl.col("val").cast(pl.Int64, strict=False))
```
Conclusion
Polars’ strict type system and explicit handling of missing values are designed to prevent silent bugs in your data pipelines. By understanding the nuances of Array vs List and null vs NaN, you can leverage the full power of Polars for your data engineering tasks.