Polars Series

From Pandas to Polars: A Paradigm Shift in DataFrame Processing

Learn the fundamental differences between Pandas and Polars, understand core concepts unique to Polars, and get productive with minimal friction.

Welcome to the first installment of our Polars blog series! If you’ve spent years mastering Pandas and are curious about what makes Polars the talk of the data community, this post is your gateway. We’ll explore the fundamental differences between these libraries, understand the core concepts unique to Polars, and get you productive with minimal friction.

The bottom line is straightforward: Polars is not just a faster Pandas—it’s a fundamentally different approach to DataFrame operations that prioritizes performance, parallelism, and a declarative expression-based paradigm.


Why Polars? The Performance Promise

Before diving into syntax, let’s understand why Polars exists. Pandas, despite being the workhorse of Python data analysis, carries inherent limitations: it’s single-threaded, executes every operation eagerly, and as a rule of thumb needs 5–10× the dataset size in RAM for typical operations.

Polars addresses these limitations through:

  • Rust Foundation: Written in Rust, achieving C/C++ level performance without Python’s runtime overhead
  • Parallelization by Default: Utilizes all CPU cores automatically—no configuration needed
  • Apache Arrow Memory Model: More efficient than NumPy arrays, especially for string and categorical data
  • Query Optimization: Analyzes your entire query plan before execution to eliminate redundant work

Pandas Limitation | Polars Solution
Single-threaded execution | Automatic multi-core parallelization
Eager execution only | Both eager and lazy evaluation modes
High memory overhead (5–10× dataset size) | Lower memory footprint (2–4× dataset size)
Index-based row access | Index-free design for simpler data manipulation
Type coercion on missing data | Strict type system with explicit handling

Concept 1: The Expression-Based Paradigm

This is the most fundamental shift you’ll encounter. In Pandas, you typically manipulate columns directly through assignment. In Polars, you describe what you want through expressions that execute inside contexts.

An expression in Polars is a lazy representation of a data transformation—it doesn’t do anything until placed within a context.

Code Comparison:

1. Setup (Mandatory for reproducibility):

import pandas as pd
import polars as pl

# Create identical DataFrames
data = {'name': ['Alice', 'Bob', 'Charlie'], 'score': [85, 92, 78]}
pd_df = pd.DataFrame(data)
pl_df = pl.DataFrame(data)

2. Pandas Code:

# Adding a new column - direct assignment
pd_df['score_doubled'] = pd_df['score'] * 2
pd_df['passed'] = pd_df['score'] >= 80

3. Polars Code:

# Adding new columns - expression-based approach
pl_df = pl_df.with_columns(
    (pl.col("score") * 2).alias("score_doubled"),
    (pl.col("score") >= 80).alias("passed")
)

The key takeaway is that pl.col("score") is an expression object that only evaluates when passed to a context like .with_columns(). Multiple expressions within the same context execute in parallel automatically.
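
To make this concrete, here is a minimal sketch (reusing pl_df from the setup above): an expression can be stored in a variable and inspected, but nothing is computed until it is handed to a context.

# An expression is only a description of a transformation
doubled = (pl.col("score") * 2).alias("score_doubled")
print(type(doubled))                 # a Polars Expr object - nothing computed yet

# The same expression is evaluated only inside a context
print(pl_df.select(doubled))         # evaluated here, in the select context
print(pl_df.with_columns(doubled))   # evaluated here, in the with_columns context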


Concept 2: The Four Essential Contexts

Contexts are methods that accept expressions and apply them to your data. Mastering these four contexts covers 90% of your data manipulation needs:

Pandas Approach | Polars Context | Purpose
df[['col1', 'col2']] or df.loc[:, cols] | .select() | Choose and transform specific columns
df['new_col'] = ... | .with_columns() | Add or modify columns while keeping all existing ones
df[df['col'] > value] | .filter() | Select rows based on conditions
df.groupby().agg() | .group_by().agg() | Aggregate data by groups

Code Comparison:

1. Setup:

import pandas as pd
import polars as pl

data = {
    'department': ['Sales', 'Sales', 'Engineering', 'Engineering'],
    'employee': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'salary': [50000, 60000, 75000, 80000]
}
pd_df = pd.DataFrame(data)
pl_df = pl.DataFrame(data)

2. Pandas Code:

# Select columns
result = pd_df[['department', 'salary']]

# Filter rows
result = pd_df[pd_df['salary'] > 55000]

# Group and aggregate
result = pd_df.groupby('department')['salary'].agg(['mean', 'max']).reset_index()

3. Polars Code:

# Select columns
result = pl_df.select("department", "salary")
# Or with expressions: pl_df.select(pl.col("department"), pl.col("salary"))

# Filter rows
result = pl_df.filter(pl.col("salary") > 55000)

# Group and aggregate
result = pl_df.group_by("department").agg(
    pl.col("salary").mean().alias("salary_mean"),
    pl.col("salary").max().alias("salary_max")
)

The main takeaway is that in Polars, aggregations are always explicit expressions inside .agg(). Each expression can be parallelized independently, and because an aggregation keeps the name of its source column, use .alias() to give your computed columns distinct names; here both results would otherwise be called salary and collide.
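
As a small illustration of the naming behavior (a sketch reusing pl_df from the setup above), an aggregation keeps the name of its source column unless renamed:

# Without .alias(), the output column keeps the source column's name ("salary")
print(pl_df.group_by("department").agg(pl.col("salary").mean()))

# Two un-aliased aggregations of the same column would collide:
# pl_df.group_by("department").agg(pl.col("salary").mean(), pl.col("salary").max())  # duplicate-name error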


Concept 3: No More Index—And That’s a Good Thing

One of the most liberating aspects of Polars is the deliberate absence of a row index. In Pandas, the index often creates confusion—should you use .loc or .iloc? What happens to the index after a merge? Polars eliminates this complexity entirely.

Pandas Concept | Polars Equivalent
df.set_index('col') | Not needed—use columns directly
df.loc[index_value] | df.filter(pl.col("col") == value)
df.iloc[5] | df.row(5) or df[5] (eager only)
df.reset_index() | Not needed—no index to reset

Code Comparison:

1. Setup:

import pandas as pd
import polars as pl

data = {'id': [101, 102, 103], 'value': [10, 20, 30]}
pd_df = pd.DataFrame(data)
pl_df = pl.DataFrame(data)

2. Pandas Code:

# Set index and access by index value
pd_df_indexed = pd_df.set_index('id')
result = pd_df_indexed.loc[102]  # Returns a Series

3. Polars Code:

# Direct filtering - no index needed
result = pl_df.filter(pl.col("id") == 102)  # Returns a DataFrame

The main takeaway is that Polars treats row selection as a filtering operation. This makes data manipulation more predictable and eliminates the mental overhead of managing index states.
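
When you do need positional access in eager Polars, a handful of methods cover it. A quick sketch, reusing pl_df from the setup above:

# Positional access without an index (eager DataFrames only)
print(pl_df.row(1))        # second row as a tuple: (102, 20)
print(pl_df[1])            # second row as a one-row DataFrame
print(pl_df.slice(1, 2))   # two rows starting at position 1
print(pl_df.head(2))       # first two rows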


Concept 4: Strict Data Types—No Silent Conversions

Polars is strict about data types. Unlike Pandas, which might silently convert an integer column to float when you introduce NaN values, Polars maintains type integrity.

Pandas Behavior | Polars Behavior
Integer column becomes float with NaN | Integer column stays integer; nulls are null
Implicit type coercion in operations | Explicit casting required
Mixed types allowed in object columns | Homogeneous types enforced per column

Code Comparison:

1. Setup:

import pandas as pd
import polars as pl
import numpy as np

# Pandas with NaN
pd_df = pd.DataFrame({'values': [1, 2, np.nan, 4]})
print(pd_df.dtypes)  # float64 - silently converted!

# Polars with null
pl_df = pl.DataFrame({'values': [1, 2, None, 4]})
print(pl_df.schema)  # {'values': Int64} - stays integer!

The main takeaway is that Polars’ strict type system catches potential bugs early and ensures predictable behavior. When you need type conversion, use .cast() explicitly.
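
A minimal sketch of explicit handling, reusing pl_df from the block above: convert types with .cast() and resolve nulls with .fill_null(), both as visible steps in the pipeline.

# Explicit conversion and null handling - nothing happens silently
pl_df = pl_df.with_columns(
    pl.col("values").cast(pl.Float64).alias("values_float"),  # explicit cast to float
    pl.col("values").fill_null(0).alias("values_filled")      # nulls replaced explicitly
)
print(pl_df.schema)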


Concept 5: Lazy Evaluation—Your Secret Weapon

This is where Polars truly shines. While Pandas executes each operation immediately (eager evaluation), Polars offers a lazy API that builds a query plan first, optimizes it, and only executes when you call .collect().

Pandas (Eager) | Polars Lazy API
pd.read_csv() | pl.scan_csv() → returns LazyFrame
Executes immediately | Builds query plan, executes on .collect()
No optimization possible | Predicate pushdown, projection pushdown, query rewriting
Must load entire dataset | Can process larger-than-memory data with streaming

Code Comparison:

1. Pandas Code (Eager):

import pandas as pd

# Every operation executes immediately
df = pd.read_csv("large_file.csv")        # Loads entire file
df = df[df['status'] == 'active']          # Filters after loading everything
df = df[['id', 'name', 'value']]           # Selects after filtering
result = df.groupby('name')['value'].sum()

2. Polars Code (Lazy):

import polars as pl

# Build query plan - nothing executes yet
query = (
    pl.scan_csv("large_file.csv")          # Lazy scan - no data loaded
    .filter(pl.col("status") == "active")  # Will push filter to scan level
    .select("id", "name", "value")         # Will only read these columns
    .group_by("name")
    .agg(pl.col("value").sum())
)

# Execute the optimized query
result = query.collect()

The lazy API enables powerful optimizations:

  • Predicate Pushdown: Filters are applied during file reading, not after loading everything into memory
  • Projection Pushdown: Only necessary columns are read from disk
  • Common Subexpression Elimination: Duplicate computations are identified and executed once
  • Query Rewriting: Operations are reordered for efficiency

You can inspect the query plan before execution:

# See the optimized plan
print(query.explain())

The main takeaway is to default to lazy mode (pl.scan_csv(), .lazy()) for any non-trivial data pipeline. The query optimizer often produces 2–10× performance improvements by eliminating unnecessary work.
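
For datasets that do not fit in memory, the same lazy query can run in streaming mode, which processes the data in batches. A hedged sketch (the exact argument has changed across Polars versions; older releases use collect(streaming=True), newer ones accept an engine argument):

# Execute the optimized query in streaming mode (version-dependent spelling)
result = query.collect(streaming=True)        # older API
# result = query.collect(engine="streaming")  # newer API

# Or stream results straight to disk without materializing them in memory
# query.sink_parquet("aggregated.parquet")    # illustrative output path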


Concept 6: Conditional Logic with When-Then-Otherwise

Pandas users often reach for np.where() or df.apply() for conditional logic. Polars provides a more readable and performant pattern: pl.when().then().otherwise().

Pandas Approach | Polars Equivalent
np.where(condition, value_if_true, value_if_false) | pl.when(condition).then(value_if_true).otherwise(value_if_false)
df['col'].apply(lambda x: ...) | Chain multiple .when().then() clauses

Code Comparison:

1. Setup:

import pandas as pd
import polars as pl
import numpy as np

data = {'score': [45, 65, 85, 92, 55]}
pd_df = pd.DataFrame(data)
pl_df = pl.DataFrame(data)

2. Pandas Code:

# Using np.where for simple condition
pd_df['grade'] = np.where(pd_df['score'] >= 60, 'Pass', 'Fail')

# Using nested np.where for multiple conditions
pd_df['letter_grade'] = np.where(
    pd_df['score'] >= 90, 'A',
    np.where(pd_df['score'] >= 80, 'B',
    np.where(pd_df['score'] >= 70, 'C',
    np.where(pd_df['score'] >= 60, 'D', 'F'))))

3. Polars Code:

# Readable chained conditions
pl_df = pl_df.with_columns(
    pl.when(pl.col("score") >= 60)
      .then(pl.lit("Pass"))
      .otherwise(pl.lit("Fail"))
      .alias("grade"),
    
    pl.when(pl.col("score") >= 90).then(pl.lit("A"))
      .when(pl.col("score") >= 80).then(pl.lit("B"))
      .when(pl.col("score") >= 70).then(pl.lit("C"))
      .when(pl.col("score") >= 60).then(pl.lit("D"))
      .otherwise(pl.lit("F"))
      .alias("letter_grade")
)

The main takeaway is that when-then-otherwise chains read like natural language and are fully parallelized. Use pl.lit() to wrap literal values in expressions.
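
Conditions inside when() are ordinary boolean expressions, so they combine with & and |, with parentheses around each comparison. A short sketch reusing pl_df from the setup above:

# Combine conditions with & (and) and | (or)
pl_df = pl_df.with_columns(
    pl.when((pl.col("score") >= 60) & (pl.col("score") < 70))
      .then(pl.lit("borderline pass"))
      .otherwise(pl.lit("clear result"))
      .alias("pass_margin")
)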


Concept 7: Window Functions with .over()

Window functions in Pandas typically require groupby().transform(). Polars uses the more intuitive .over() modifier on any expression.

Pandas Approach | Polars Equivalent
df.groupby('col')['val'].transform('mean') | pl.col("val").mean().over("col")
df.groupby('col')['val'].transform(lambda x: x - x.mean()) | (pl.col("val") - pl.col("val").mean()).over("col")

Code Comparison:

1. Setup:

import pandas as pd
import polars as pl

data = {
    'department': ['Sales', 'Sales', 'Engineering', 'Engineering', 'Sales'],
    'employee': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'],
    'salary': [50000, 60000, 75000, 80000, 55000]
}
pd_df = pd.DataFrame(data)
pl_df = pl.DataFrame(data)

2. Pandas Code:

# Add department average salary as a new column
pd_df['dept_avg_salary'] = pd_df.groupby('department')['salary'].transform('mean')

# Calculate deviation from department mean
pd_df['salary_deviation'] = pd_df['salary'] - pd_df.groupby('department')['salary'].transform('mean')

3. Polars Code:

# Window functions with .over()
pl_df = pl_df.with_columns(
    pl.col("salary").mean().over("department").alias("dept_avg_salary"),
    (pl.col("salary") - pl.col("salary").mean().over("department")).alias("salary_deviation")
)

The main takeaway is that .over() is Polars’ window function mechanism. It keeps all rows (unlike .group_by().agg()) and broadcasts the aggregation result back to each row within its partition.
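
The same mechanism works for any expression, not just means. For example, a sketch of ranking salaries within each department (reusing pl_df from the setup above):

# Rank salaries within each department (1 = highest)
pl_df = pl_df.with_columns(
    pl.col("salary").rank(method="dense", descending=True)
      .over("department")
      .alias("dept_salary_rank")
)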


Concept 8: Avoid apply() and map_elements() When Possible

In Pandas, .apply() is the escape hatch for custom logic. In Polars, map_elements() exists but should be avoided when possible—it’s slow because it breaks out of Polars’ optimized execution engine.

Pandas Pattern | Polars Best Practice
df['col'].apply(str.upper) | pl.col("col").str.to_uppercase()
df['col'].apply(lambda x: x ** 2) | pl.col("col").pow(2) or pl.col("col") ** 2
df['col'].apply(custom_function) | Use expressions; map_elements() as last resort

Code Comparison:

1. Setup:

import pandas as pd
import polars as pl

data = {'text': ['hello', 'world', 'polars'], 'value': [1, 2, 3]}
pd_df = pd.DataFrame(data)
pl_df = pl.DataFrame(data)

2. Pandas Code (using apply):

pd_df['text_upper'] = pd_df['text'].apply(str.upper)
pd_df['value_squared'] = pd_df['value'].apply(lambda x: x ** 2)

3. Polars Code (using native expressions):

# Preferred: Use native expression methods
pl_df = pl_df.with_columns(
    pl.col("text").str.to_uppercase().alias("text_upper"),
    (pl.col("value") ** 2).alias("value_squared")
)

# Last resort only: map_elements for truly custom logic
# pl_df = pl_df.with_columns(
#     pl.col("value").map_elements(custom_func, return_dtype=pl.Int64).alias("result")
# )

The main takeaway is to explore Polars’ rich expression API (string methods, list methods, datetime methods, etc.) before resorting to map_elements(). The expression API is fully parallelized; map_elements() runs on a single thread.
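
The other namespaces follow the same pattern as .str. A brief, self-contained sketch of the datetime (.dt) and list (.list) namespaces:

import polars as pl
from datetime import date

df = pl.DataFrame({
    "when": [date(2024, 1, 15), date(2024, 6, 1)],
    "tags": [["a", "b"], ["c"]],
})

df = df.with_columns(
    pl.col("when").dt.year().alias("year"),      # datetime namespace
    pl.col("tags").list.len().alias("n_tags"),   # list namespace
)
print(df)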


Concept 9: Method Chaining for Readable Pipelines

Both Pandas and Polars support method chaining, but Polars’ design makes it especially natural. Every operation returns a new DataFrame/LazyFrame, making immutable, functional-style pipelines the default.

Code Comparison:

1. Pandas Code:

import pandas as pd

result = (
    pd.read_csv("sales.csv")
    .query("region == 'North'")
    .assign(revenue_per_unit=lambda df: df['revenue'] / df['units'])
    .groupby('product')
    .agg({'revenue': 'sum', 'units': 'sum'})
    .reset_index()
    .sort_values('revenue', ascending=False)
)

2. Polars Code:

import polars as pl

result = (
    pl.scan_csv("sales.csv")
    .filter(pl.col("region") == "North")
    .with_columns(
        (pl.col("revenue") / pl.col("units")).alias("revenue_per_unit")
    )
    .group_by("product")
    .agg(
        pl.col("revenue").sum(),
        pl.col("units").sum()
    )
    .sort("revenue", descending=True)
    .collect()
)

The main takeaway is that Polars encourages building complete query pipelines before execution. Combined with lazy evaluation, this allows the query optimizer to see your entire transformation and optimize accordingly.


Quick Reference: Pandas to Polars Translation Table

Operation | Pandas | Polars
Read CSV (eager) | pd.read_csv() | pl.read_csv()
Read CSV (lazy) | N/A | pl.scan_csv()
Select columns | df[['a', 'b']] | df.select("a", "b")
Add/modify columns | df['new'] = expr | df.with_columns(expr.alias("new"))
Filter rows | df[df['a'] > 5] | df.filter(pl.col("a") > 5)
Group + aggregate | df.groupby('a').agg({'b': 'sum'}) | df.group_by("a").agg(pl.col("b").sum())
Sort | df.sort_values('a') | df.sort("a")
Rename columns | df.rename(columns={'a': 'b'}) | df.rename({"a": "b"})
Drop columns | df.drop(columns=['a']) | df.drop("a")
Join | pd.merge(df1, df2, on='key') | df1.join(df2, on="key")
Null check | df['a'].isna() | pl.col("a").is_null()
Fill nulls | df['a'].fillna(0) | pl.col("a").fill_null(0)

What’s Next in This Series

This introductory post covered the foundational concepts that differentiate Polars from Pandas. In upcoming posts, we’ll dive deeper into:

  • Session 2: Advanced expressions—string manipulation, datetime operations, and list columns
  • Session 3: Joins, concatenation, and data reshaping (melt, pivot, unpivot)
  • Session 4: Performance optimization—lazy queries, streaming, and memory management
  • Session 5: Real-world data pipelines—combining everything for production workflows

The key paradigm shift to internalize is this: In Polars, think in expressions inside contexts, default to lazy evaluation, and let the query optimizer handle the performance.


Getting Started

Install Polars and start experimenting:

pip install polars

import polars as pl

# Your first Polars DataFrame
df = pl.DataFrame({
    "name": ["Alice", "Bob", "Charlie"],
    "age": [25, 30, 35],
    "city": ["NYC", "LA", "Chicago"]
})

# Your first expression pipeline
result = df.with_columns(
    (pl.col("age") + 5).alias("age_in_5_years"),
    pl.col("city").str.to_uppercase().alias("city_upper")
).filter(
    pl.col("age") > 26
)

print(result)

Welcome to the world of blazingly fast DataFrames. The learning curve is worth it.
