There are multiple ways in pandas by which a dataframe can be indexed i.e, selecting particular set of rows and columns from a dataframe. For a detailed description over this topic, once can refer official pandas documentation - Indexing and Selecting Data
We’ll discuss the following -
- Integer Based Indexing - iloc
- Label Based Indexing - loc
- Setting Values using loc
- Sorting using indexes via argsort
Let’s begin with loading a sample dataset and required python packages.
import pandas as pd
import numpy as np
#Fetching data from url as csv by mentioning values of various paramters
data = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data",
header = None,
index_col = False,
names = ['sepal_length','sepal_width','petal_length','petal_width','iris_class'])
# visualising first five rows of sample dataset (Iris)
data.head()
The sample first five rows of data looks like (can be viewed using data.head())-
sepal_length | sepal_width | petal_length | petal_width | iris_class | |
---|---|---|---|---|---|
0 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
1 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
2 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
3 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
4 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |
Integer based indexing using iloc
To select some fixed no. of column and a fixed no. of rows from this data, one way is to achieve it by using iloc operation. The first part of indexing will be for rows and another will be columns (indexes starting from 0 to total no. of rows/columns).
For example, first 10 rows for last three columns can be extracted by -
to pass a range : can be used while indexing [start:end], start being inclusive and end being exclusive
data.iloc[0:10,2:5]
petal_length | petal_width | iris_class | |
---|---|---|---|
0 | 1.4 | 0.2 | Iris-setosa |
1 | 1.4 | 0.2 | Iris-setosa |
2 | 1.3 | 0.2 | Iris-setosa |
3 | 1.5 | 0.2 | Iris-setosa |
4 | 1.4 | 0.2 | Iris-setosa |
5 | 1.7 | 0.4 | Iris-setosa |
6 | 1.4 | 0.3 | Iris-setosa |
7 | 1.5 | 0.2 | Iris-setosa |
8 | 1.4 | 0.2 | Iris-setosa |
9 | 1.5 | 0.1 | Iris-setosa |
Similarly, specific rows and columns can be extracted using indexes of the corresponding elements.
The following command, will return 1st and 2nd row of 2nd and 4th column.
data.iloc[[0,1],[1,3]]
sepal_width | petal_width | |
---|---|---|
0 | 3.5 | 0.2 |
1 | 3.0 | 0.2 |
Another example, to extracting first 10 rows and all columns but first two.
data.iloc[:10,2:]
petal_length | petal_width | iris_class | |
---|---|---|---|
0 | 1.4 | 0.2 | Iris-setosa |
1 | 1.4 | 0.2 | Iris-setosa |
2 | 1.3 | 0.2 | Iris-setosa |
3 | 1.5 | 0.2 | Iris-setosa |
4 | 1.4 | 0.2 | Iris-setosa |
5 | 1.7 | 0.4 | Iris-setosa |
6 | 1.4 | 0.3 | Iris-setosa |
7 | 1.5 | 0.2 | Iris-setosa |
8 | 1.4 | 0.2 | Iris-setosa |
9 | 1.5 | 0.1 | Iris-setosa |
Labels based indexing using loc
To index a dataframe based on column names, loc can be used.
For example, to get all the columns between petal_length till iris class and records from 2nd to 10th, can be extracted by using -
data.loc[2:10,'petal_length':'iris_class']
petal_length | petal_width | iris_class | |
---|---|---|---|
2 | 1.3 | 0.2 | Iris-setosa |
3 | 1.5 | 0.2 | Iris-setosa |
4 | 1.4 | 0.2 | Iris-setosa |
5 | 1.7 | 0.4 | Iris-setosa |
6 | 1.4 | 0.3 | Iris-setosa |
7 | 1.5 | 0.2 | Iris-setosa |
8 | 1.4 | 0.2 | Iris-setosa |
9 | 1.5 | 0.1 | Iris-setosa |
10 | 1.5 | 0.2 | Iris-setosa |
Similarly, specific column names can be passed in a list for which we want to fetch the data.
data.loc[2:10,['petal_length','iris_class']]
petal_length | iris_class | |
---|---|---|
2 | 1.3 | Iris-setosa |
3 | 1.5 | Iris-setosa |
4 | 1.4 | Iris-setosa |
5 | 1.7 | Iris-setosa |
6 | 1.4 | Iris-setosa |
7 | 1.5 | Iris-setosa |
8 | 1.4 | Iris-setosa |
9 | 1.5 | Iris-setosa |
10 | 1.5 | Iris-setosa |
Setting values using loc
.loc can be used for setting values of particular records based on some predefined filter queries.
For example, there is a need to set petal_length of for first 11 records, except first two equals to 30 can be achieved by -
data.loc[2:10,'petal_length'] = 30
# Current state of those records -
data.loc[2:10,['petal_length']]
petal_length | |
---|---|
2 | 1.3 |
3 | 1.5 |
4 | 1.4 |
5 | 1.7 |
6 | 1.4 |
7 | 1.5 |
8 | 1.4 |
9 | 1.5 |
10 | 1.5 |
data.loc[2:10,'petal_length'] = 30
data.loc[2:10]
sepal_length | sepal_width | petal_length | petal_width | iris_class | |
---|---|---|---|---|---|
2 | 4.7 | 3.2 | 30.0 | 0.2 | Iris-setosa |
3 | 4.6 | 3.1 | 30.0 | 0.2 | Iris-setosa |
4 | 5.0 | 3.6 | 30.0 | 0.2 | Iris-setosa |
5 | 5.4 | 3.9 | 30.0 | 0.4 | Iris-setosa |
6 | 4.6 | 3.4 | 30.0 | 0.3 | Iris-setosa |
7 | 5.0 | 3.4 | 30.0 | 0.2 | Iris-setosa |
8 | 4.4 | 2.9 | 30.0 | 0.2 | Iris-setosa |
9 | 4.9 | 3.1 | 30.0 | 0.1 | Iris-setosa |
10 | 5.4 | 3.7 | 30.0 | 0.2 | Iris-setosa |
Another example of using .loc for setting values, make all records equal to null where ever sepal_width is greater than 3
# current state of data
data.head()
sepal_length | sepal_width | petal_length | petal_width | iris_class | |
---|---|---|---|---|---|
0 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
1 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
2 | 4.7 | 3.2 | 30.0 | 0.2 | Iris-setosa |
3 | 4.6 | 3.1 | 30.0 | 0.2 | Iris-setosa |
4 | 5.0 | 3.6 | 30.0 | 0.2 | Iris-setosa |
data.loc[data['sepal_width']>3] = np.nan
data.head()
sepal_length | sepal_width | petal_length | petal_width | iris_class | |
---|---|---|---|---|---|
0 | NaN | NaN | NaN | NaN | NaN |
1 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
2 | NaN | NaN | NaN | NaN | NaN |
3 | NaN | NaN | NaN | NaN | NaN |
4 | NaN | NaN | NaN | NaN | NaN |
Sorting
Sorting a dataframe in python can be done in multiple ways. We’ll be looking two of those here -
- sort_values
- argsort
Let’s begin with reloading the dataset again and have a look at first six rows of sepal_width column.
#Fetching data from url as csv by mentioning values of various paramters
data = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data",
header = None,
index_col = False,
names = ['sepal_length','sepal_width','petal_length','petal_width','iris_class'])
sample_data = data.iloc[0:6,[1]]
sample_data
sepal_width | |
---|---|
0 | 3.5 |
1 | 3.0 |
2 | 3.2 |
3 | 3.1 |
4 | 3.6 |
5 | 3.9 |
Using sort_values
Simples way to sort a dataframe can be done using sort_values function of pandas dataframe, which take the column name argument on which the sorting is to be done.
sample_data.sort_values(by='sepal_width')
sepal_width | |
---|---|
1 | 3.0 |
3 | 3.1 |
2 | 3.2 |
0 | 3.5 |
4 | 3.6 |
5 | 3.9 |
Looking at the indexes of the sorted values, gives us the following results.
sample_data.sort_values(by='sepal_width').index.values
array([1, 3, 2, 0, 4, 5], dtype=int64)
Using argsort
Another way to sort a dataframe can be acheived by using argsort, which basically returns the list of indexes which will sort the values. That list of indexes can be passed to .iloc indexing and the output will return an sorted column.
To read more about argsort, please follow - https://docs.scipy.org/doc/numpy/reference/generated/numpy.argsort.html
sample_data['sepal_width'].values.argsort(axis=0)
array([1, 3, 2, 0, 4, 5], dtype=int64)
sample_data.iloc[sample_data['sepal_width'].values.argsort(axis=0)]
sepal_width | |
---|---|
1 | 3.0 |
3 | 3.1 |
2 | 3.2 |
0 | 3.5 |
4 | 3.6 |
5 | 3.9 |
Notebook Link - Indexing and Sorting a dataframe using iloc and loc