Advanced Pandas GroupBy and Window Functions for Data Engineering and Analytics
Session Outline: Advanced GroupBy and Window Functions in Pandas
1. Introduction to GroupBy and Its Role in Data Engineering (5 minutes)
- Overview: Recap of `GroupBy` as a split-apply-combine paradigm, as defined by Wes McKinney in Python for Data Analysis.
- Data Engineering Context: Why `GroupBy` is critical for ETL pipelines, data aggregation, and analytics workflows.
- Key Insights from Experts:
  - Wes McKinney emphasizes `GroupBy` for flexible data aggregation and transformation.
  - Matt Harrison advocates for chaining operations and leveraging `GroupBy` for concise, readable code (Effective Pandas).
- Use cases:
  - Aggregating transactional data for reporting.
  - Feature engineering for machine learning.
  - Time-series analysis and cohort analysis.
2. Core GroupBy Operations and Best Practices (10 minutes)
- Basic syntax: `df.groupby('column')`
- Aggregation functions: `mean()`, `sum()`, `count()`, `min()`, `max()`, `std()`
- Grouping by single/multiple columns, index levels, and functions.
- Best Practices (Matt Harrison):
  - Use method chaining for clarity.
  - Leverage `agg()` for multiple aggregations.
  - Understand the difference between `transform()` and `apply()` (illustrated in the sketch after this list).
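To make these practices concrete, here is a minimal, self-contained sketch (the `region`/`sales` column names are illustrative, not tied to any dataset in this session) showing method chaining, `agg()` with multiple aggregations, and the `transform()`/`apply()` distinction:

```python
import pandas as pd

df = pd.DataFrame({
    'region': ['North', 'North', 'South', 'South'],
    'sales': [100, 150, 200, 250],
})

# Method chaining with agg() and named aggregations
summary = (
    df.groupby('region')
      .agg(total_sales=('sales', 'sum'), avg_sales=('sales', 'mean'))
      .reset_index()
)

# transform() returns a result aligned with the original rows...
df['region_share'] = df['sales'] / df.groupby('region')['sales'].transform('sum')

# ...while apply() returns one (possibly reshaped) result per group
top_rows = df.groupby('region', group_keys=False).apply(lambda g: g.nlargest(1, 'sales'))
```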
3. Advanced GroupBy Techniques (15 minutes)
- Custom Aggregation Functions:
  ```python
  def custom_agg(x):
      return pd.Series({
          'mean': x.mean(),
          'std': x.std(),
          'count': x.count(),
      })

  df.groupby('group').agg(custom_agg)
  ```
- Filtering Groups: `filter()`
- Transformation: `transform()` for group-wise operations that return data aligned with the original DataFrame.
- Apply: `apply()` for more complex operations.
- Resampling Time Series Data: `df.resample('M').mean()` (a combined sketch of these techniques follows this list).
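The snippet below is a compact sketch (synthetic data; the `group`, `value`, and `date` names are placeholders) that puts `filter()`, `transform()`, and resampling together:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    'date': pd.date_range('2023-01-01', periods=60, freq='D'),
    'group': rng.choice(['a', 'b', 'c'], 60),
    'value': rng.integers(1, 100, 60),
})

# filter(): keep only the groups whose mean value exceeds a threshold
busy = df.groupby('group').filter(lambda g: g['value'].mean() > 40)

# transform(): z-score each value within its group, aligned to the original rows
df['zscore'] = df.groupby('group')['value'].transform(lambda s: (s - s.mean()) / s.std())

# resample(): monthly mean per group; time-based resampling needs a DatetimeIndex
monthly = df.set_index('date').groupby('group')['value'].resample('M').mean()
```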
4. Window Functions: Rolling, Expanding, and EWM (15 minutes)
- Rolling Windows: `df['rolling_avg'] = df['value'].rolling(window=3).mean()`
- Expanding Windows: `df['expanding_sum'] = df['value'].expanding().sum()`
- Exponentially Weighted Windows: `df['ewm'] = df['value'].ewm(span=3).mean()`
- Custom Window Functions (a runnable sketch of all four window types follows this list):
  ```python
  def custom_window(x):
      return x[-1] - x[0]

  # raw=True passes each window as a NumPy array, so positional indexing works
  df['window_diff'] = df['value'].rolling(window=3).apply(custom_window, raw=True)
  ```
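As a quick, runnable sketch of the window types above (using a toy five-point series rather than any particular dataset):

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5],
              index=pd.date_range('2023-01-01', periods=5, freq='D'))

rolling_avg = s.rolling(window=3).mean()    # fixed-size trailing window
expanding_sum = s.expanding().sum()         # everything up to each point
ewm_mean = s.ewm(span=3).mean()             # exponentially decaying weights

# Custom window function; raw=True passes a NumPy array so x[-1] and x[0] index positionally
window_diff = s.rolling(window=3).apply(lambda x: x[-1] - x[0], raw=True)
```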
5. Performance Considerations (5 minutes)
- Vectorized Operations: Always prefer built-in methods over `apply()` when possible.
- Categorical Data: Convert string columns to categories for memory efficiency.
- Numba and Cython: For performance-critical sections.
- Dask: For out-of-core computations on large datasets. (A brief sketch of the first two points follows this list.)
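A brief illustrative sketch of the first two points (the data is random; `%timeit` each line in a notebook to compare the two aggregations):

```python
import numpy as np
import pandas as pd

n = 1_000_000
df = pd.DataFrame({
    'region': np.random.choice(['North', 'South', 'East', 'West'], n),
    'sales': np.random.randint(100, 1000, n),
})

# Categorical dtype stores each distinct string once, cutting memory use
df['region'] = df['region'].astype('category')

# Prefer the vectorized built-in aggregation over a Python-level apply
fast = df.groupby('region', observed=True)['sales'].mean()
slow = df.groupby('region', observed=True)['sales'].apply(lambda s: s.mean())
```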
6. Real-world Examples (10 minutes)
- Customer Segmentation:
  ```python
  customer_stats = df.groupby('customer_id').agg({
      'purchase_amount': ['sum', 'mean', 'count'],
      'purchase_date': ['min', 'max'],
  })
  ```
- Time-series Analysis:
  ```python
  # 7-day rolling average per product (time-based windows require a DatetimeIndex)
  df['7d_avg'] = df.groupby('product_id')['sales'].transform(
      lambda x: x.rolling('7D').mean()
  )
  ```
- Cohort Analysis (a sketch of deriving the `cohort` and `period` labels follows this list):
  ```python
  cohort = df.groupby(['cohort', 'period']).agg({
      'user_id': 'nunique',
      'revenue': 'sum',
  })
  ```
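The cohort example assumes `cohort` and `period` columns already exist. A minimal sketch of one common way to derive them, assuming a hypothetical transactions table with `user_id`, `purchase_date`, and `revenue` columns:

```python
import pandas as pd

transactions = pd.DataFrame({
    'user_id': [1, 1, 2, 2, 3],
    'purchase_date': pd.to_datetime(
        ['2023-01-05', '2023-02-10', '2023-01-20', '2023-03-02', '2023-02-15']),
    'revenue': [50, 30, 20, 70, 40],
})

# Cohort = month of each user's first purchase
transactions['cohort'] = (
    transactions.groupby('user_id')['purchase_date'].transform('min').dt.to_period('M')
)
# Period = number of months elapsed since the cohort month
transactions['period'] = (
    transactions['purchase_date'].dt.to_period('M') - transactions['cohort']
).apply(lambda offset: offset.n)

cohort = transactions.groupby(['cohort', 'period']).agg(
    users=('user_id', 'nunique'),
    revenue=('revenue', 'sum'),
)
```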
7. Common Pitfalls and How to Avoid Them (5 minutes)
- SettingWithCopyWarning: Use `.copy()` when creating new DataFrames from slices.
- Memory Usage: Be mindful of group sizes.
- Performance Bottlenecks: Profile your code with `%timeit`.
- Missing Data: Handle `NaN` values before grouping (a brief sketch follows this list).
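A short sketch of two of these pitfalls (the tiny `region`/`sales` frame is made up purely for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'region': ['North', 'North', None, 'South'],
    'sales': [100.0, 200.0, 150.0, np.nan],
})

# Missing data: groupby silently drops NaN keys and skips NaN values in
# aggregations, so decide explicitly how to handle them first.
clean = df.dropna(subset=['region']).copy()   # .copy() avoids SettingWithCopyWarning
clean['sales'] = clean['sales'].fillna(0)

totals = clean.groupby('region')['sales'].sum()
```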
8. Hands-on Exercise (20 minutes)
- Dataset: Sample sales data with date, product, region, and sales amount.
- Tasks (a possible solution sketch follows this list):
  - Calculate total sales by region and product.
  - Find the 7-day rolling average of sales.
  - Identify the top-performing product in each region.
  - Calculate the month-over-month growth rate.
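One possible solution sketch, assuming the sample `df` built in the Example Code Snippets section below (columns `Date`, `Product`, `Region`, `Sales`):

```python
# 1. Total sales by region and product
totals = df.groupby(['Region', 'Product'])['Sales'].sum()

# 2. 7-day rolling average of sales (time-based windows need a DatetimeIndex)
daily = df.sort_values('Date').set_index('Date')
daily['Rolling_7D'] = daily['Sales'].rolling('7D').mean()

# 3. Top-performing product in each region
top_products = (
    df.groupby(['Region', 'Product'])['Sales'].sum()
      .groupby(level='Region').idxmax()
)

# 4. Month-over-month growth rate
monthly = df.set_index('Date')['Sales'].resample('M').sum()
mom_growth = monthly.pct_change()
```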
9. Q&A and Wrap-up (5 minutes)
- Key Takeaways:
  - `GroupBy` is a powerful tool for data aggregation and transformation.
  - Window functions enable sophisticated time-series analysis.
  - Always consider performance implications.
- Resources:
  - Python for Data Analysis by Wes McKinney
  - Effective Pandas by Matt Harrison
  - Pandas documentation on GroupBy and Window Functions
Example Code Snippets
```python
import numpy as np
import pandas as pd

# Sample data
data = {
    'Date': pd.date_range(start='2023-01-01', periods=100, freq='D'),
    'Product': np.random.choice(['A', 'B', 'C'], 100),
    'Region': np.random.choice(['North', 'South', 'East', 'West'], 100),
    'Sales': np.random.randint(100, 1000, 100)
}
df = pd.DataFrame(data)

# Group by multiple columns
grouped = df.groupby(['Region', 'Product'])['Sales'].agg(['sum', 'mean', 'count'])

# Rolling window calculation (time-based windows need a DatetimeIndex)
daily = df.sort_values('Date').set_index('Date')
daily['Rolling_7D'] = daily.groupby('Product')['Sales'].transform(
    lambda x: x.rolling('7D').mean()
)

# Pivot table for visualization
pivot = pd.pivot_table(
    df,
    values='Sales',
    index='Date',
    columns='Region',
    aggfunc='sum'
)
```
This session will provide attendees with practical, hands-on experience with advanced Pandas operations, equipping them with the skills needed to tackle complex data manipulation tasks in their projects.