Pyspark Series

PySpark Datetime Cheatsheet — For Pandas Users

A side-by-side Pandas vs PySpark reference for every common datetime operation: casting, extracting, filtering, arithmetic, date spines, and common gotchas.

PySpark Datetime Cheatsheet — For Pandas Users

All examples use:

from pyspark.sql import functions as F
import pandas as pd

1. Casting Strings to Dates / Timestamps

Pandas

df["CreatedDate"] = pd.to_datetime(df["CreatedDate"])
df["date_only"]   = pd.to_datetime(df["date_only"]).dt.date

Spark

# Full timestamp
df = df.withColumn("CreatedDate", F.to_timestamp("CreatedDate"))

# With custom format
df = df.withColumn("CreatedDate", F.to_timestamp("CreatedDate", "yyyy-MM-dd HH:mm:ss"))

# Date only (no time part)
df = df.withColumn("date_only", F.to_date("date_only", "yyyy-MM-dd"))

Note: Spark uses Java date patterns (yyyy, MM, dd, HH, mm, ss) not Python’s %Y-%m-%d.


2. Extracting Date Parts

Pandas

df["year"]  = df["CreatedDate"].dt.year
df["month"] = df["CreatedDate"].dt.month
df["day"]   = df["CreatedDate"].dt.day
df["hour"]  = df["CreatedDate"].dt.hour
df["dow"]   = df["CreatedDate"].dt.dayofweek   # 0=Monday

Spark

df = (df
    .withColumn("year",    F.year("CreatedDate"))
    .withColumn("month",   F.month("CreatedDate"))
    .withColumn("day",     F.dayofmonth("CreatedDate"))
    .withColumn("hour",    F.hour("CreatedDate"))
    .withColumn("dow",     F.dayofweek("CreatedDate"))    # 1=Sunday, 7=Saturday
    .withColumn("weeknum", F.weekofyear("CreatedDate"))
    .withColumn("quarter", F.quarter("CreatedDate"))
)

3. Filtering by Date

Pandas

df = df[df["CreatedDate"] >= "2024-01-01"]
df = df[(df["CreatedDate"] >= "2024-01-01") & (df["CreatedDate"] < "2025-01-01")]

Spark (3 equivalent styles)

# Style 1: string comparison (works for date and timestamp columns)
df = df.filter(F.col("CreatedDate") >= "2024-01-01")

# Style 2: .between() — inclusive on both ends
df = df.filter(F.col("CreatedDate").between("2024-01-01", "2024-12-31"))

# Style 3: literal date
df = df.filter(F.col("CreatedDate") >= F.lit("2024-01-01").cast("date"))

# Combined range
df = df.filter(
    (F.col("CreatedDate") >= "2024-01-01") &
    (F.col("CreatedDate") <  "2025-01-01")
)

Tip: Spark string comparisons work correctly on DateType and TimestampType columns when the string is in yyyy-MM-dd format.


4. Filter by Year / Month (like .dt.year)

Pandas

df = df[df["CreatedDate"].dt.year == 2024]
df = df[df["CreatedDate"].dt.month == 3]

Spark

df = df.filter(F.year("CreatedDate") == 2024)
df = df.filter(F.month("CreatedDate") == 3)

# Year and month together
df = df.filter(
    (F.year("CreatedDate") == 2024) &
    (F.month("CreatedDate") == 3)
)

5. Date Arithmetic

Pandas

df["next_week"]  = df["CreatedDate"] + pd.Timedelta(days=7)
df["prev_month"] = df["CreatedDate"] - pd.DateOffset(months=1)
df["next_year"]  = df["CreatedDate"] + pd.DateOffset(years=1)

Spark

# Add / subtract days
df = df.withColumn("next_week",  F.date_add("CreatedDate", 7))
df = df.withColumn("prev_week",  F.date_sub("CreatedDate", 7))

# Add / subtract months
df = df.withColumn("next_month", F.add_months("CreatedDate",  1))
df = df.withColumn("prev_month", F.add_months("CreatedDate", -1))

# Add years (no direct function — use add_months with 12)
df = df.withColumn("next_year",  F.add_months("CreatedDate", 12))
df = df.withColumn("prev_year",  F.add_months("CreatedDate", -12))

6. Difference Between Two Dates

Pandas

df["days_diff"]   = (df["ModifiedDate"] - df["CreatedDate"]).dt.days
df["months_diff"] = (df["ModifiedDate"].dt.to_period("M") - df["CreatedDate"].dt.to_period("M")).n

Spark

# Difference in days
df = df.withColumn(
    "days_diff",
    F.datediff(F.col("ModifiedDate"), F.col("CreatedDate"))
)

# Difference in months (decimal — e.g., 3.94 months)
df = df.withColumn(
    "months_diff",
    F.months_between(F.col("ModifiedDate"), F.col("CreatedDate"))
)

# Difference in whole months (truncate to month start first)
df = df.withColumn(
    "whole_months",
    F.months_between(
        F.trunc("ModifiedDate", "month"),
        F.trunc("CreatedDate",  "month")
    ).cast("int")
)

# Difference in years
df = df.withColumn(
    "years_diff",
    (F.months_between("ModifiedDate", "CreatedDate") / 12).cast("int")
)

7. First Day / Last Day of Month

Pandas

df["first_day"] = df["CreatedDate"].dt.to_period("M").dt.to_timestamp()
df["last_day"]  = df["CreatedDate"] + pd.offsets.MonthEnd(0)

Spark

# First day of month
df = df.withColumn("first_day", F.trunc("CreatedDate", "month"))

# Last day of month
df = df.withColumn("last_day", F.last_day("CreatedDate"))

# First day of year
df = df.withColumn("first_day_of_year", F.trunc("CreatedDate", "year"))

# First day of quarter (use date_trunc)
df = df.withColumn("first_day_of_quarter", F.date_trunc("quarter", "CreatedDate"))

trunc() = date-level truncation, returns DateType. date_trunc() = timestamp-level truncation, returns TimestampType. Supports "year", "quarter", "month", "week", "day", "hour", "minute", "second".


8. Current Date and Time

Pandas

from datetime import datetime, date
today = date.today()
now   = datetime.now()

Spark

df = df.withColumn("today",    F.current_date())
df = df.withColumn("now",      F.current_timestamp())
df = df.withColumn("days_old", F.datediff(F.current_date(), F.col("CreatedDate")))

9. Conditional Date Replacement (when / otherwise)

This is the Spark equivalent of pandas’ np.where or df.loc[condition, col] = value.

Pandas

import numpy as np
df["CreatedDate"] = np.where(
    df["CreatedDate"].dt.year < 2024,
    pd.Timestamp("2024-01-01"),
    df["CreatedDate"]
)

Spark — F.when().otherwise()

df = df.withColumn(
    "CreatedDate",
    F.when(
        F.year("CreatedDate") < 2024,
        F.lit("2024-01-01").cast("date")
    ).otherwise(F.col("CreatedDate"))
)

# Cap a date — if modified date > today, set to today
df = df.withColumn(
    "ModifiedDate",
    F.when(
        F.col("ModifiedDate") > F.current_date(),
        F.current_date()
    ).otherwise(F.col("ModifiedDate"))
)

# Null handling — if CreatedDate is null, use ModifiedDate
df = df.withColumn(
    "EffectiveDate",
    F.when(F.col("CreatedDate").isNull(), F.col("ModifiedDate"))
     .otherwise(F.col("CreatedDate"))
)

# Multiple conditions — chaining when()
df = df.withColumn(
    "Period",
    F.when(F.year("CreatedDate") < 2022, F.lit("Pre-2022"))
     .when(F.year("CreatedDate") == 2022, F.lit("2022"))
     .when(F.year("CreatedDate") == 2023, F.lit("2023"))
     .otherwise(F.lit("2024+"))
)

10. Truncation Patterns (Delimitation)

A common pattern in data pipelines: snap dates to boundaries.

# Snap all dates before 2024-01-01 to exactly 2024-01-01
df = df.withColumn(
    "CreatedDate",
    F.when(
        F.col("CreatedDate") < F.lit("2024-01-01").cast("date"),
        F.lit("2024-01-01").cast("date")
    ).otherwise(F.col("CreatedDate"))
)

# Snap to start of month
df = df.withColumn("month_start", F.trunc("CreatedDate", "month"))

# Snap to start of quarter
df = df.withColumn("quarter_start", F.date_trunc("quarter", "CreatedDate").cast("date"))

# Snap to start of year
df = df.withColumn("year_start", F.trunc("CreatedDate", "year"))

# Round down to nearest Monday (start of week)
df = df.withColumn("week_start", F.date_trunc("week", "CreatedDate").cast("date"))

11. Generating Date Ranges (Equivalent to pd.date_range)

Pandas

pd.date_range(start="2024-01-01", end="2024-12-31", freq="D")
pd.period_range(start="2024-01", end="2024-12", freq="M")

Spark — sequence() + explode()

sequence() generates an array of dates between two dates. explode() turns each array into rows.

# Daily date spine
date_spine = spark.sql("""
    SELECT explode(sequence(DATE '2024-01-01', DATE '2024-12-31', INTERVAL 1 DAY)) AS date
""")

# Monthly date spine
monthly_spine = spark.sql("""
    SELECT explode(sequence(DATE '2024-01-01', DATE '2024-12-01', INTERVAL 1 MONTH)) AS month_start
""")

# Expand per-row date ranges into individual rows
df_expanded = (
    df
    .withColumn(
        "date_array",
        F.expr("sequence(to_date(CreatedDate), to_date(ModifiedDate), interval 1 day)")
    )
    .withColumn("date", F.explode("date_array"))
    .drop("date_array")
)

12. Months Between Two Dates — Whole Months

df = df.withColumn(
    "months_open",
    F.floor(
        F.months_between(
            F.col("ModifiedDate"),
            F.col("CreatedDate")
        )
    ).cast("int")
)

# Whole-calendar-months only
df = df.withColumn(
    "full_months",
    F.months_between(
        F.trunc("ModifiedDate", "month"),
        F.trunc("CreatedDate",  "month")
    ).cast("int")
)

13. Quick Reference Table

GoalPandasSpark
String → datetimepd.to_datetime(col)F.to_timestamp("col")
String → date only.dt.dateF.to_date("col")
Extract year.dt.yearF.year("col")
Extract month.dt.monthF.month("col")
Extract day.dt.dayF.dayofmonth("col")
Filter by datedf[col >= "2024-01-01"].filter(F.col("col") >= "2024-01-01")
Filter range(col >= a) & (col <= b)F.col("col").between(a, b)
Add days+ pd.Timedelta(days=7)F.date_add("col", 7)
Subtract days- pd.Timedelta(days=7)F.date_sub("col", 7)
Add months+ pd.DateOffset(months=1)F.add_months("col", 1)
Days between(a - b).dt.daysF.datediff(a, b)
Months betweenperiod subtractionF.months_between(a, b)
First of month.dt.to_period("M").dt.to_timestamp()F.trunc("col", "month")
Last of month+ pd.offsets.MonthEnd(0)F.last_day("col")
First of yearcustomF.trunc("col", "year")
First of quartercustomF.date_trunc("quarter", "col")
Start of weekcustomF.date_trunc("week", "col")
Todaydate.today()F.current_date()
Nowdatetime.now()F.current_timestamp()
Conditional replacenp.where(cond, val, col)F.when(cond, val).otherwise(col)
Date spine (daily)pd.date_range(...)sequence(start, end, interval 1 day) + explode()
Monthly spinepd.period_range(..., freq="M")sequence(start, end, interval 1 month) + explode()

14. Common Gotchas

1. String date comparisons work only if the column is already DateType or TimestampType.

If your column is still a string, cast it first:

df = df.withColumn("CreatedDate", F.to_date("CreatedDate"))
df = df.filter(F.col("CreatedDate") >= "2024-01-01")

2. F.lit() for date literals inside when().

F.lit("2024-01-01").cast("date")   # correct
F.lit(date(2024, 1, 1))           # also works

3. months_between returns a decimal.

Use F.floor() or .cast("int") to get whole months.

4. date_trunc returns TimestampType, trunc returns DateType.

F.trunc("col", "month")           # → DateType   2024-04-01
F.date_trunc("month", "col")      # → TimestampType  2024-04-01 00:00:00

Add .cast("date") after date_trunc if you need a plain date.

5. Java datetime format strings (not Python’s strftime).

PythonSpark
%Yyyyy
%mMM
%ddd
%HHH
%Mmm
%Sss