Subsetting a DataFrame in Pandas

Learn different techniques to filter and subset pandas DataFrames efficiently

Importing Packages and Datasets

import pandas as pd

# Load the iris data from the UCI repository; the file has no header row, so column names are passed explicitly
data = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data",
                   header=None,
                   index_col=False,
                   names=['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'iris_class'])

# List the distinct class labels in the iris dataset
data.iris_class.unique()

Output:

array(['Iris-setosa', 'Iris-versicolor', 'Iris-virginica'], dtype=object)

Subsetting Techniques

Basic Filtering

data_setosa = data[data.iris_class == 'Iris-setosa']
data_versicolor = data[data.iris_class == 'Iris-versicolor']
data_virginica = data[data.iris_class == 'Iris-virginica']
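The bracket filter above works in two steps: the comparison produces a boolean Series, and indexing with that Series keeps the rows where it is True. The following is a minimal sketch of those two steps. It assumes a small hand-made frame named toy (hypothetical, with illustrative values) in place of the downloaded data so that it runs without network access.

```python
import pandas as pd

# Hypothetical miniature stand-in for the iris data (values are illustrative).
toy = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 7.0, 6.4, 6.3, 5.8],
    "iris_class": ["Iris-setosa", "Iris-setosa",
                   "Iris-versicolor", "Iris-versicolor",
                   "Iris-virginica", "Iris-virginica"],
})

# Step 1: the comparison is evaluated element-wise and yields a boolean Series.
mask = toy["iris_class"] == "Iris-versicolor"

# Step 2: indexing with the mask keeps only the rows where it is True.
toy_versicolor = toy[mask]

# The subset keeps the original row labels (2 and 3 here).
# reset_index(drop=True) renumbers them from 0 if that is preferred.
toy_versicolor_reset = toy_versicolor.reset_index(drop=True)
```

Bracket access (toy["iris_class"]) is used here instead of attribute access (toy.iris_class) because it also works for column names that contain spaces or clash with DataFrame method names.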

Key Insights

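The descriptive statistics for each subset, shown in the next section, support a few inferences:

  • Each subset is the same size (50 records)
  • Mean sepal length and mean petal length are both lowest in setosa and highest in virginica
  • The classes separate most clearly on petal length: the mean is 1.464 for setosa, 4.260 for versicolor, and 5.552 for virginica, and the largest setosa value (1.9) is below the smallest versicolor value (3.0)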
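The size and mean comparisons listed above can also be computed in one pass with groupby rather than by building one subset per class. The following is a minimal sketch, assuming a hypothetical six-row frame named toy with made-up measurements. On the real data the same calls would be applied to data.

```python
import pandas as pd

# Hypothetical miniature stand-in for the iris data (values are illustrative).
toy = pd.DataFrame({
    "sepal_length": [5.0, 5.2, 5.9, 6.1, 6.5, 6.7],
    "petal_length": [1.4, 1.6, 4.2, 4.4, 5.5, 5.7],
    "iris_class": ["Iris-setosa", "Iris-setosa",
                   "Iris-versicolor", "Iris-versicolor",
                   "Iris-virginica", "Iris-virginica"],
})

# Number of rows per class.
class_sizes = toy.groupby("iris_class").size()

# Mean of the selected columns per class, one row per class.
class_means = toy.groupby("iris_class")[["sepal_length", "petal_length"]].mean()

# Labels of the classes with the smallest and largest mean petal length.
lowest = class_means["petal_length"].idxmin()
highest = class_means["petal_length"].idxmax()

print(class_sizes)
print(class_means)
print(lowest, highest)
```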

Descriptive Statistics

Iris Setosa

data_setosa.describe().T
Metric        count   mean       std  min    25%  50%    75%  max
sepal_length   50.0  5.006  0.352490  4.3  4.800  5.0  5.200  5.8
sepal_width    50.0  3.418  0.381024  2.3  3.125  3.4  3.675  4.4
petal_length   50.0  1.464  0.173511  1.0  1.400  1.5  1.575  1.9
petal_width    50.0  0.244  0.107210  0.1  0.200  0.2  0.300  0.6

Iris Versicolor

data_versicolor.describe().T
Metric        count   mean       std  min    25%   50%  75%  max
sepal_length   50.0  5.936  0.516171  4.9  5.600  5.90  6.3  7.0
sepal_width    50.0  2.770  0.313798  2.0  2.525  2.80  3.0  3.4
petal_length   50.0  4.260  0.469911  3.0  4.000  4.35  4.6  5.1
petal_width    50.0  1.326  0.197753  1.0  1.200  1.30  1.5  1.8

Iris Virginica

data_virginica.describe().T
Metric        count   mean       std  min    25%   50%    75%  max
sepal_length   50.0  6.588  0.635880  4.9  6.225  6.50  6.900  7.9
sepal_width    50.0  2.974  0.322497  2.2  2.800  3.00  3.175  3.8
petal_length   50.0  5.552  0.551895  4.5  5.100  5.55  5.875  6.9
petal_width    50.0  2.026  0.274650  1.4  1.800  2.00  2.300  2.5

Advanced Filtering Patterns

Multiple Conditions

# Filter with multiple conditions
large_setosa = data[(data.iris_class == 'Iris-setosa') & (data.sepal_length > 5.0)]

# Using query method (more readable)
large_setosa_query = data.query("iris_class == 'Iris-setosa' and sepal_length > 5.0")
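The same pattern extends to OR, NOT, and membership tests. The sketch below assumes a hypothetical six-row frame named toy with illustrative values; the variable names wanted and threshold are also made up for the example. Each condition is wrapped in parentheses because & and | bind more tightly than comparison operators such as == and >.

```python
import pandas as pd

# Hypothetical miniature stand-in for the iris data (values are illustrative).
toy = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 7.0, 5.5, 6.3, 5.8],
    "iris_class": ["Iris-setosa", "Iris-setosa",
                   "Iris-versicolor", "Iris-versicolor",
                   "Iris-virginica", "Iris-virginica"],
})

# OR: rows that are setosa or have a sepal longer than 6.5.
setosa_or_long = toy[(toy["iris_class"] == "Iris-setosa") | (toy["sepal_length"] > 6.5)]

# NOT: ~ inverts a boolean mask, keeping every row that is not setosa.
not_setosa = toy[~(toy["iris_class"] == "Iris-setosa")]

# Membership: isin() replaces a chain of == comparisons joined with |.
wanted = ["Iris-versicolor", "Iris-virginica"]
two_classes = toy[toy["iris_class"].isin(wanted)]

# query() can refer to a Python variable by prefixing its name with @.
threshold = 5.0
large_setosa_toy = toy.query("iris_class == 'Iris-setosa' and sepal_length > @threshold")
```

Here not_setosa and two_classes select the same rows, which shows that a negated equality and an isin() test can be interchangeable when the set of labels is known.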

Performance Tips

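  1. Use vectorized boolean masks instead of Python loops over rows
  2. Combine conditions with & (and) and | (or), wrapping each condition in parentheses
  3. Use .query() for complex conditions (more readable)
  4. Consider .loc[] for label-based indexing, which can select rows and columns in one step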
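Two of the tips above can be sketched side by side: selecting rows and columns together with .loc[], and replacing a row-by-row loop with a vectorized mask. The example assumes a hypothetical six-row frame named toy with illustrative values. The loop version is included only for comparison and is not a recommended pattern.

```python
import pandas as pd

# Hypothetical miniature stand-in for the iris data (values are illustrative).
toy = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 7.0, 6.4, 6.3, 5.8],
    "petal_length": [1.4, 1.4, 4.7, 4.5, 6.0, 5.1],
    "iris_class": ["Iris-setosa", "Iris-setosa",
                   "Iris-versicolor", "Iris-versicolor",
                   "Iris-virginica", "Iris-virginica"],
})

# .loc[] takes a row selector and a column selector in one call.
long_sepals = toy.loc[toy["sepal_length"] > 6.0, ["sepal_length", "iris_class"]]

# Vectorized: the comparison runs over the whole column at once.
vectorized = toy[toy["sepal_length"] > 6.0]

# Loop equivalent: iterrows() visits one row at a time in Python,
# which is typically much slower on large frames.
kept_labels = [label for label, row in toy.iterrows() if row["sepal_length"] > 6.0]
looped = toy.loc[kept_labels]

# Both approaches select the same rows.
print(vectorized.equals(looped))
```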