Subsetting a dataframe in pandas

Importing packages and datasets

import pandas as pd
# Fetch the data from the URL as CSV, setting the relevant parameters explicitly
data = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data",
                   header = None,
                   index_col = False,
                   names = ['sepal_length','sepal_width','petal_length','petal_width','iris_class'])
# Unique classes in the iris dataset
data.iris_class.unique()

array(['Iris-setosa', 'Iris-versicolor', 'Iris-virginica'], dtype=object)

Subsetting

data_setosa = data[data.iris_class == 'Iris-setosa']
data_versicolor = data[data.iris_class == 'Iris-versicolor']
data_virginica = data[data.iris_class == 'Iris-virginica']
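Each line above works by building a boolean mask and indexing the dataframe with it. A minimal sketch of that mechanism follows, using a tiny hand-made frame named `toy` (a hypothetical stand-in with illustrative values, not the real iris data) so it runs without a download:

```python
import pandas as pd

# Hypothetical stand-in for the iris data; the values are illustrative only
toy = pd.DataFrame({
    "petal_length": [1.4, 4.7, 6.0, 1.3],
    "iris_class": ["Iris-setosa", "Iris-versicolor",
                   "Iris-virginica", "Iris-setosa"],
})

# The comparison returns a boolean Series with one True/False per row
mask = toy["iris_class"] == "Iris-setosa"

# Indexing with the mask keeps only the rows where it is True
toy_setosa = toy[mask]

# .loc accepts the same mask and can pick columns at the same time
setosa_petals = toy.loc[mask, "petal_length"]

print(mask.tolist())
print(toy_setosa)
```

Note that the subset keeps the original row labels (here 0 and 3); calling `reset_index(drop=True)` on it renumbers the rows if that is preferred.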

'''
Now we can look at the descriptive statistics for each subset and draw inferences such as the following:
* Each subset has the same size, i.e., 50 rows
* Mean sepal length and mean petal length are both lowest in setosa and highest in virginica
'''

data_setosa.describe().T

	count	mean	std	min	25%	50%	75%	max
sepal_length	50.0	5.006	0.352490	4.3	4.800	5.0	5.200	5.8
sepal_width	50.0	3.418	0.381024	2.3	3.125	3.4	3.675	4.4
petal_length	50.0	1.464	0.173511	1.0	1.400	1.5	1.575	1.9
petal_width	50.0	0.244	0.107210	0.1	0.200	0.2	0.300	0.6

data_versicolor.describe().T

	count	mean	std	min	25%	50%	75%	max
sepal_length	50.0	5.936	0.516171	4.9	5.600	5.90	6.3	7.0
sepal_width	50.0	2.770	0.313798	2.0	2.525	2.80	3.0	3.4
petal_length	50.0	4.260	0.469911	3.0	4.000	4.35	4.6	5.1
petal_width	50.0	1.326	0.197753	1.0	1.200	1.30	1.5	1.8

data_virginica.describe().T

	count	mean	std	min	25%	50%	75%	max
sepal_length	50.0	6.588	0.635880	4.9	6.225	6.50	6.900	7.9
sepal_width	50.0	2.974	0.322497	2.2	2.800	3.00	3.175	3.8
petal_length	50.0	5.552	0.551895	4.5	5.100	5.55	5.875	6.9
petal_width	50.0	2.026	0.274650	1.4	1.800	2.00	2.300	2.5
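Comparing the three tables by eye works, but the same per-class comparison can also be made in one step with `groupby`. This is an alternative to the separate subsets used in this post, not a replacement for them. The sketch below again uses a small hypothetical frame named `toy` with made-up values:

```python
import pandas as pd

# Hypothetical stand-in for the iris data; the values are illustrative only
toy = pd.DataFrame({
    "sepal_length": [5.0, 5.2, 6.0, 6.4, 6.5, 7.1],
    "iris_class": ["Iris-setosa", "Iris-setosa",
                   "Iris-versicolor", "Iris-versicolor",
                   "Iris-virginica", "Iris-virginica"],
})

# Number of rows in each class
class_sizes = toy.groupby("iris_class").size()

# Mean sepal length per class, indexed by class name
class_means = toy.groupby("iris_class")["sepal_length"].mean()

# Classes with the lowest and highest mean
lowest = class_means.idxmin()
highest = class_means.idxmax()

print(class_sizes)
print(class_means)
print(lowest, highest)
```

On the real dataset, `data.groupby('iris_class').describe()` should give the same statistics as the three `describe()` calls, stacked into a single table.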
