Subsetting a dataframe in pandas

  05 Jan 2019
  python

Importing packages and datasets

import pandas as pd
# Fetch the data from the URL as CSV, setting the relevant read_csv parameters explicitly
data = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data",
                   header = None,
                   index_col = False,
                   names = ['sepal_length','sepal_width','petal_length','petal_width','iris_class'])
# Unique classes in the iris dataset
data.iris_class.unique()

array(['Iris-setosa', 'Iris-versicolor', 'Iris-virginica'], dtype=object)
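
The iris.data file has no header row, which is why the call above passes header=None together with names. The following is a minimal sketch of that behaviour on an in-memory string instead of the URL; the two rows and the variables raw, columns, df and df_wrong are illustrative assumptions, not part of the original post.

```python
import io

import pandas as pd

# Two illustrative rows in the same headerless, comma-separated layout as iris.data
raw = "5.1,3.5,1.4,0.2,Iris-setosa\n7.0,3.2,4.7,1.4,Iris-versicolor\n"
columns = ["sepal_length", "sepal_width", "petal_length", "petal_width", "iris_class"]

# header=None tells pandas the first line is data; names supplies the column labels
df = pd.read_csv(io.StringIO(raw), header=None, names=columns)

# Without header=None the first data row is consumed as column labels
df_wrong = pd.read_csv(io.StringIO(raw))

print(df.shape)        # both rows are kept
print(df_wrong.shape)  # one row is lost to the header
```

Comparing the two shapes is a quick way to confirm that no observation was silently turned into a header.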

Subsetting

data_setosa = data[data.iris_class == 'Iris-setosa']
data_versicolor = data[data.iris_class == 'Iris-versicolor']
data_virginica = data[data.iris_class == 'Iris-virginica']
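
The three assignments above all use a boolean mask. As a hedged sketch, the same selection can also be written with .loc or query, and groupby can split every class in one pass. The toy frame below is an assumption standing in for the real data so that the example runs without a network connection.

```python
import pandas as pd

# Toy stand-in for the iris data (values are illustrative, not the real dataset)
toy = pd.DataFrame({
    "petal_length": [1.4, 1.3, 4.7, 4.5, 6.0, 5.1],
    "iris_class": ["Iris-setosa", "Iris-setosa",
                   "Iris-versicolor", "Iris-versicolor",
                   "Iris-virginica", "Iris-virginica"],
})

# The comparison yields one True/False value per row
mask = toy["iris_class"] == "Iris-setosa"

# Three equivalent ways to keep only the rows where the mask is True
by_mask = toy[mask]
by_loc = toy.loc[mask]
by_query = toy.query("iris_class == 'Iris-setosa'")

# Split every class at once into a dict of sub-frames keyed by class name
subsets = {name: group for name, group in toy.groupby("iris_class")}

print(by_mask)
print(sorted(subsets))
```

Each subset keeps the original row index; call reset_index(drop=True) on a subset if a fresh 0-based index is preferred.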

'''
Now we can look at the descriptive statistics for each subset and draw inferences such as:
* Each subset has the same size: 50 rows
* Mean sepal length and mean petal length are lowest in setosa and highest in virginica
'''

data_setosa.describe().T
              count   mean       std  min    25%  50%    75%  max
sepal_length   50.0  5.006  0.352490  4.3  4.800  5.0  5.200  5.8
sepal_width    50.0  3.418  0.381024  2.3  3.125  3.4  3.675  4.4
petal_length   50.0  1.464  0.173511  1.0  1.400  1.5  1.575  1.9
petal_width    50.0  0.244  0.107210  0.1  0.200  0.2  0.300  0.6

data_versicolor.describe().T
              count   mean       std  min    25%   50%  75%  max
sepal_length   50.0  5.936  0.516171  4.9  5.600  5.90  6.3  7.0
sepal_width    50.0  2.770  0.313798  2.0  2.525  2.80  3.0  3.4
petal_length   50.0  4.260  0.469911  3.0  4.000  4.35  4.6  5.1
petal_width    50.0  1.326  0.197753  1.0  1.200  1.30  1.5  1.8

data_virginica.describe().T
              count   mean       std  min    25%   50%    75%  max
sepal_length   50.0  6.588  0.635880  4.9  6.225  6.50  6.900  7.9
sepal_width    50.0  2.974  0.322497  2.2  2.800  3.00  3.175  3.8
petal_length   50.0  5.552  0.551895  4.5  5.100  5.55  5.875  6.9
petal_width    50.0  2.026  0.274650  1.4  1.800  2.00  2.300  2.5
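
The inferences drawn from the three tables (equal subset sizes, and which class has the lowest and highest mean) can also be checked without creating separate subsets. This is a minimal sketch using groupby on a toy frame; the values and the names toy, sizes, means and summary are illustrative assumptions, and on the real data the same calls would be applied to data.

```python
import pandas as pd

# Toy stand-in for the iris data (values are illustrative, not the real dataset)
toy = pd.DataFrame({
    "petal_length": [1.4, 1.3, 4.7, 4.5, 6.0, 5.1],
    "iris_class": ["Iris-setosa", "Iris-setosa",
                   "Iris-versicolor", "Iris-versicolor",
                   "Iris-virginica", "Iris-virginica"],
})

grouped = toy.groupby("iris_class")

# Number of rows per class, to check that the subsets are equally sized
sizes = grouped.size()

# Mean petal length per class, and the classes with the smallest and largest mean
means = grouped["petal_length"].mean()
lowest = means.idxmin()
highest = means.idxmax()

# One describe() table with a row per class instead of three separate calls
summary = grouped["petal_length"].describe()

print(sizes)
print(lowest, highest)
print(summary)
```

Keeping the classes in one grouped table makes side-by-side comparison easier than reading three separate describe() outputs.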