Subsetting a DataFrame in Pandas

Master the art of data filtering with practical examples using the Iris dataset. Learn different techniques to filter and subset pandas DataFrames efficiently for better data analysis workflows.

Importing Packages and Datasets

import pandas as pd

# Fetch the Iris data as CSV from the UCI repository, supplying column names explicitly
data = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data",
                   header=None,
                   index_col=False,
                   names=['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'iris_class'])

# Unique classes in the Iris dataset
data.iris_class.unique()

Output:

array(['Iris-setosa', 'Iris-versicolor', 'Iris-virginica'], dtype=object)

Subsetting Techniques

Basic Filtering

data_setosa = data[data.iris_class == 'Iris-setosa']
data_versicolor = data[data.iris_class == 'Iris-versicolor']
data_virginica = data[data.iris_class == 'Iris-virginica']

Key Insights

Now we can look at the descriptive statistics summary for each subset and draw some inferences (a quick verification sketch follows this list):

  • Each subset is of the same size (50 records)
  • Average sepal length and petal length are lowest in setosa and highest in virginica
  • This demonstrates clear species differentiation in the dataset
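
As a quick check on these points, here is one way to verify them directly, a minimal sketch that reuses the data DataFrame loaded above (no columns or names beyond those already defined are assumed):

# Each species should account for exactly 50 records
print(data.iris_class.value_counts())

# Mean sepal and petal length per species, side by side
print(data.groupby('iris_class')[['sepal_length', 'petal_length']].mean())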

Descriptive Statistics

Iris Setosa

data_setosa.describe().T
Metric        count   mean       std  min    25%  50%    75%  max
sepal_length   50.0  5.006  0.352490  4.3  4.800  5.0  5.200  5.8
sepal_width    50.0  3.418  0.381024  2.3  3.125  3.4  3.675  4.4
petal_length   50.0  1.464  0.173511  1.0  1.400  1.5  1.575  1.9
petal_width    50.0  0.244  0.107210  0.1  0.200  0.2  0.300  0.6

Iris Versicolor

data_versicolor.describe().T
Metric        count   mean       std  min    25%   50%  75%  max
sepal_length   50.0  5.936  0.516171  4.9  5.600  5.90  6.3  7.0
sepal_width    50.0  2.770  0.313798  2.0  2.525  2.80  3.0  3.4
petal_length   50.0  4.260  0.469911  3.0  4.000  4.35  4.6  5.1
petal_width    50.0  1.326  0.197753  1.0  1.200  1.30  1.5  1.8

Iris Virginica

data_virginica.describe().T
Metric        count   mean       std  min    25%   50%    75%  max
sepal_length   50.0  6.588  0.635880  4.9  6.225  6.50  6.900  7.9
sepal_width    50.0  2.974  0.322497  2.2  2.800  3.00  3.175  3.8
petal_length   50.0  5.552  0.551895  4.5  5.100  5.55  5.875  6.9
petal_width    50.0  2.026  0.274650  1.4  1.800  2.00  2.300  2.5

Advanced Filtering Patterns

Multiple Conditions

# Filter with multiple conditions using boolean operators
large_setosa = data[(data.iris_class == 'Iris-setosa') & (data.sepal_length > 5.0)]

# Using query method (more readable for complex conditions)
large_setosa_query = data.query("iris_class == 'Iris-setosa' and sepal_length > 5.0")

# Multiple OR conditions
large_flowers = data[(data.sepal_length > 6.5) | (data.petal_length > 5.0)]

Using .loc and .iloc for Advanced Selection

# Label-based selection with .loc
setosa_sepal_data = data.loc[data.iris_class == 'Iris-setosa', ['sepal_length', 'sepal_width']]

# Position-based selection with .iloc
first_10_rows = data.iloc[:10, :]

# Combining boolean indexing with .loc
large_petals = data.loc[(data.petal_length > 4.0) & (data.petal_width > 1.5)]

String-based Filtering

# Filter by string patterns
setosa_variants = data[data.iris_class.str.contains('setosa')]

# Case-insensitive filtering
setosa_case_insensitive = data[data.iris_class.str.lower().str.contains('setosa')]

# Filter by string length
long_names = data[data.iris_class.str.len() > 12]

Numerical Range Filtering

# Filter by numerical ranges
medium_sepal_length = data[data.sepal_length.between(5.0, 6.0)]

# Using quantiles for filtering
q75 = data.sepal_length.quantile(0.75)
top_quartile = data[data.sepal_length >= q75]

# Filter outliers using IQR method
Q1 = data.sepal_length.quantile(0.25)
Q3 = data.sepal_length.quantile(0.75)
IQR = Q3 - Q1
outliers = data[(data.sepal_length < (Q1 - 1.5 * IQR)) | 
                (data.sepal_length > (Q3 + 1.5 * IQR))]

Performance Tips and Best Practices

1. Use Vectorized Operations

# Good: Vectorized operation
filtered_data = data[data.sepal_length > 5.0]

# Avoid: Loop-based filtering (slow)
# filtered_rows = []
# for idx, row in data.iterrows():
#     if row['sepal_length'] > 5.0:
#         filtered_rows.append(row)

2. Chain Conditions Efficiently

# Efficient chaining with parentheses
complex_filter = data[
    (data.iris_class == 'Iris-setosa') & 
    (data.sepal_length > 5.0) & 
    (data.petal_width < 0.3)
]

# Use query() for very complex conditions
complex_query = data.query(
    "iris_class == 'Iris-setosa' and sepal_length > 5.0 and petal_width < 0.3"
)

3. Memory-Efficient Filtering

# For large datasets, consider using categorical data types
data_categorical = data.copy()
data_categorical['iris_class'] = data_categorical['iris_class'].astype('category')

# This reduces memory usage for string columns with repeated values
print(f"Original memory usage: {data.memory_usage(deep=True).sum()} bytes")
print(f"Categorical memory usage: {data_categorical.memory_usage(deep=True).sum()} bytes")

4. Index-based Filtering for Performance

# Set index for faster filtering on frequently used columns
data_indexed = data.set_index('iris_class')

# Filtering by iris_class is now a label lookup, which is typically faster on large datasets
setosa_indexed = data_indexed.loc['Iris-setosa']

Common Filtering Patterns

Filter by Multiple Values

# Filter by multiple specific values
species_of_interest = ['Iris-setosa', 'Iris-virginica']
filtered_species = data[data.iris_class.isin(species_of_interest)]

# Exclude specific values
not_versicolor = data[~data.iris_class.isin(['Iris-versicolor'])]

Filter by Missing Values

# Filter rows with missing values
has_missing = data[data.isnull().any(axis=1)]

# Filter rows without missing values
complete_rows = data[data.notnull().all(axis=1)]

# Filter specific column for missing values
missing_sepal_length = data[data.sepal_length.isnull()]

Conditional Replacement During Filtering

# Create filtered copy with conditional modifications
filtered_modified = data[data.iris_class == 'Iris-setosa'].copy()
filtered_modified.loc[filtered_modified.sepal_length > 5.0, 'size_category'] = 'large'
filtered_modified.loc[filtered_modified.sepal_length <= 5.0, 'size_category'] = 'small'

Quick Reference: Filtering Cheat Sheet

Operation                Syntax                                           Use Case
Single condition         df[df['col'] > value]                            Basic filtering
Multiple AND conditions  df[(df['col1'] > val1) & (df['col2'] < val2)]    Complex filtering
Multiple OR conditions   df[(df['col1'] > val1) | (df['col2'] < val2)]    Alternative conditions
String contains          df[df['col'].str.contains('pattern')]            Text pattern matching
Value in list            df[df['col'].isin([val1, val2])]                 Multiple value matching
Between values           df[df['col'].between(min_val, max_val)]          Range filtering
Query method             df.query("col > value and col2 < value2")        Readable complex conditions
Label-based              df.loc[condition, columns]                       Specific rows and columns

This subsetting technique is fundamental to many other pandas operations:

  • Data Loading Patterns - Filter data during or after loading
  • Indexing and Sorting - Combine with sorting for ordered subsets
  • Join Operations - Filter before joins for better performance
  • Groupby Operations - Subset data before grouping for targeted analysis (see the sketch after this list)
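
As one illustration of that last point, here is a minimal sketch (reusing the data DataFrame and column names defined earlier) of filtering before a groupby, so that only the species of interest are aggregated:

# Keep only the species of interest, then group the smaller frame
subset = data[data.iris_class.isin(['Iris-setosa', 'Iris-virginica'])]
print(subset.groupby('iris_class')['petal_length'].agg(['mean', 'max']))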

Master these filtering patterns and you’ll handle 90% of data subsetting challenges efficiently!