Working with multiple data frames often involves joining two or more tables to in bring out more no. of columns from another table by joining on some sort of relationship which exists within a table or appending two tables which is adding one or more table over another table with keeping the same order of columns.
Example of append data -> monthly files of revenue sheets of a company and wee need at end of the year to be clubbed into single table.
Example of merging -> multiple files regarding employee education, compensation, performance all linked to each other in some identifier in each one of them which maps to employee master table and for doing analysis we need data from each of these tables in the same which can be achieved by merging.
We’ll look out for merging/joining two tables now and later will discuss the possibilities around appending to tables using pandas.
To begin with let’s get create some dummy datasets.
import pandas as pd
states_codes = pd.DataFrame({'State': ['Haryana', 'Punjab', 'Rajasthan', 'Uttar Pradesh', 'Madhya Pradesh'],
'Code': ['HR', 'PB', 'RJ', 'UP', 'MP']})
states_area = pd.DataFrame({'State': ['Haryana', 'Punjab', 'Uttar Pradesh', 'Bihar'],
'Area_InSquareKM': [44212, 50362, 243290, 94165]})
# a dummy data file
states_literacyrate = pd.read_csv('literacy.csv')
| State | Code |
---|
0 | Haryana | HR |
---|
1 | Punjab | PB |
---|
2 | Rajasthan | RJ |
---|
3 | Uttar Pradesh | UP |
---|
4 | Madhya Pradesh | MP |
---|
| State | Area_InSquareKM |
---|
0 | Haryana | 44212 |
---|
1 | Punjab | 50362 |
---|
2 | Uttar Pradesh | 243290 |
---|
3 | Bihar | 94165 |
---|
| State | Year | Literacy Rate |
---|
0 | Haryana | 2011 | 76.64 |
---|
1 | Rajasthan | 2011 | 67.06 |
---|
2 | Uttar Pradesh | 2011 | 69.72 |
---|
3 | Haryana | 2001 | 67.91 |
---|
4 | Uttar Pradesh | 2001 | 56.27 |
---|
5 | Rajasthan | 2001 | 60.41 |
---|
As you can see State column repeats across all three tables, meaning to say that in case it is required to pull out from these three table
area_codes_inner = pd.merge(states_area,states_codes)
area_codes_inner
| State | Area_InSquareKM | Code |
---|
0 | Haryana | 44212 | HR |
---|
1 | Punjab | 50362 | PB |
---|
2 | Uttar Pradesh | 243290 | UP |
---|
results_inner = pd.merge(area_codes_inner,states_literacyrate)
results_inner
| State | Area_InSquareKM | Code | Year | Literacy Rate |
---|
0 | Haryana | 44212 | HR | 2011 | 76.64 |
---|
1 | Haryana | 44212 | HR | 2001 | 67.91 |
---|
2 | Uttar Pradesh | 243290 | UP | 2011 | 69.72 |
---|
3 | Uttar Pradesh | 243290 | UP | 2001 | 56.27 |
---|
area_codes_left = pd.merge(states_area,states_codes,how ='left')
area_codes_left
| State | Area_InSquareKM | Code |
---|
0 | Haryana | 44212 | HR |
---|
1 | Punjab | 50362 | PB |
---|
2 | Uttar Pradesh | 243290 | UP |
---|
3 | Bihar | 94165 | NaN |
---|
results_left = pd.merge(area_codes_left,states_literacyrate,how ='left')
results_left
| State | Area_InSquareKM | Code | Year | Literacy Rate |
---|
0 | Haryana | 44212 | HR | 2011.0 | 76.64 |
---|
1 | Haryana | 44212 | HR | 2001.0 | 67.91 |
---|
2 | Punjab | 50362 | PB | NaN | NaN |
---|
3 | Uttar Pradesh | 243290 | UP | 2011.0 | 69.72 |
---|
4 | Uttar Pradesh | 243290 | UP | 2001.0 | 56.27 |
---|
5 | Bihar | 94165 | NaN | NaN | NaN |
---|
area_codes_right = pd.merge(states_area,states_codes,how ='right')
area_codes_right
| State | Area_InSquareKM | Code |
---|
0 | Haryana | 44212.0 | HR |
---|
1 | Punjab | 50362.0 | PB |
---|
2 | Uttar Pradesh | 243290.0 | UP |
---|
3 | Rajasthan | NaN | RJ |
---|
4 | Madhya Pradesh | NaN | MP |
---|
results_right = pd.merge(area_codes_right,states_literacyrate,how ='right')
results_right
| State | Area_InSquareKM | Code | Year | Literacy Rate |
---|
0 | Haryana | 44212.0 | HR | 2011 | 76.64 |
---|
1 | Haryana | 44212.0 | HR | 2001 | 67.91 |
---|
2 | Uttar Pradesh | 243290.0 | UP | 2011 | 69.72 |
---|
3 | Uttar Pradesh | 243290.0 | UP | 2001 | 56.27 |
---|
4 | Rajasthan | NaN | RJ | 2011 | 67.06 |
---|
5 | Rajasthan | NaN | RJ | 2001 | 60.41 |
---|
area_codes_outer = pd.merge(states_area,states_codes, how = 'outer')
area_codes_outer
| State | Area_InSquareKM | Code |
---|
0 | Haryana | 44212.0 | HR |
---|
1 | Punjab | 50362.0 | PB |
---|
2 | Uttar Pradesh | 243290.0 | UP |
---|
3 | Bihar | 94165.0 | NaN |
---|
4 | Rajasthan | NaN | RJ |
---|
5 | Madhya Pradesh | NaN | MP |
---|
results_outer = pd.merge(area_codes_outer,states_literacyrate,how ='outer')
results_outer
| State | Area_InSquareKM | Code | Year | Literacy Rate |
---|
0 | Haryana | 44212.0 | HR | 2011.0 | 76.64 |
---|
1 | Haryana | 44212.0 | HR | 2001.0 | 67.91 |
---|
2 | Punjab | 50362.0 | PB | NaN | NaN |
---|
3 | Uttar Pradesh | 243290.0 | UP | 2011.0 | 69.72 |
---|
4 | Uttar Pradesh | 243290.0 | UP | 2001.0 | 56.27 |
---|
5 | Bihar | 94165.0 | NaN | NaN | NaN |
---|
6 | Rajasthan | NaN | RJ | 2011.0 | 67.06 |
---|
7 | Rajasthan | NaN | RJ | 2001.0 | 60.41 |
---|
8 | Madhya Pradesh | NaN | MP | NaN | NaN |
---|
other options in merge function are - left_on right_on on left_index right_index sort
a detailed description is can be found here - https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html
| State | Area_InSquareKM |
---|
0 | Haryana | 44212 |
---|
1 | Punjab | 50362 |
---|
2 | Uttar Pradesh | 243290 |
---|
3 | Bihar | 94165 |
---|
| State | Code |
---|
0 | Haryana | HR |
---|
1 | Punjab | PB |
---|
2 | Rajasthan | RJ |
---|
3 | Uttar Pradesh | UP |
---|
4 | Madhya Pradesh | MP |
---|
– .merge : For column(s)-on-columns(s) operations – .join : Join DataFrames using their indexes., if need to be on specific keys, then set keys to be the index
states_area.join(states_literacyrate,lsuffix='_FromArea',rsuffix='_FromLiteracy', how='outer')
| State_FromArea | Area_InSquareKM | State_FromLiteracy | Year | Literacy Rate |
---|
0 | Haryana | 44212.0 | Haryana | 2011 | 76.64 |
---|
1 | Punjab | 50362.0 | Rajasthan | 2011 | 67.06 |
---|
2 | Uttar Pradesh | 243290.0 | Uttar Pradesh | 2011 | 69.72 |
---|
3 | Bihar | 94165.0 | Haryana | 2001 | 67.91 |
---|
4 | NaN | NaN | Uttar Pradesh | 2001 | 56.27 |
---|
5 | NaN | NaN | Rajasthan | 2001 | 60.41 |
---|
pd.merge(states_area,states_literacyrate, left_index=True, right_index=True, how='outer')
| State_x | Area_InSquareKM | State_y | Year | Literacy Rate |
---|
0 | Haryana | 44212.0 | Haryana | 2011 | 76.64 |
---|
1 | Punjab | 50362.0 | Rajasthan | 2011 | 67.06 |
---|
2 | Uttar Pradesh | 243290.0 | Uttar Pradesh | 2011 | 69.72 |
---|
3 | Bihar | 94165.0 | Haryana | 2001 | 67.91 |
---|
4 | NaN | NaN | Uttar Pradesh | 2001 | 56.27 |
---|
5 | NaN | NaN | Rajasthan | 2001 | 60.41 |
---|
## merge on multiple columns
city_states_literacy = pd.read_csv('City_States_Literacy.csv')
city_states_area = pd.read_csv('City_States_Area.csv')
| State | City | Area_InSquareKM |
---|
0 | Haryana | 1 | 983 |
---|
1 | Haryana | 2 | 886 |
---|
2 | Rajasthan | 1 | 881 |
---|
3 | Rajasthan | 2 | 980 |
---|
4 | Rajasthan | 3 | 895 |
---|
5 | Rajasthan | 4 | 1010 |
---|
6 | Rajasthan | 5 | 1075 |
---|
7 | Bihar | 1 | 802 |
---|
8 | Bihar | 2 | 859 |
---|
9 | Bihar | 3 | 1020 |
---|
10 | Uttar Pradesh | 4 | 945 |
---|
11 | Uttar Pradesh | 5 | 787 |
---|
12 | Madhya Pradhes | 6 | 983 |
---|
results_inner = pd.merge(city_states_literacy,city_states_area, left_on=['State','City'], right_on=['State','City'])
results_inner
| State | City | Literacy Rate | Area_InSquareKM |
---|
0 | Haryana | 1 | 74.78 | 983 |
---|
1 | Haryana | 2 | 63.80 | 886 |
---|
2 | Rajasthan | 1 | 61.68 | 881 |
---|
3 | Rajasthan | 2 | 68.61 | 980 |
---|
4 | Rajasthan | 3 | 70.78 | 895 |
---|
5 | Rajasthan | 4 | 75.80 | 1010 |
---|
6 | Rajasthan | 5 | 80.65 | 1075 |
---|
7 | Uttar Pradesh | 4 | 73.77 | 945 |
---|
8 | Uttar Pradesh | 5 | 60.61 | 787 |
---|
pd.merge(city_states_literacy,city_states_area, left_on=['State','City'], right_on=['State','City'])
| State | City | Literacy Rate | Area_InSquareKM |
---|
0 | Haryana | 1 | 74.78 | 983 |
---|
1 | Haryana | 2 | 63.80 | 886 |
---|
2 | Rajasthan | 1 | 61.68 | 881 |
---|
3 | Rajasthan | 2 | 68.61 | 980 |
---|
4 | Rajasthan | 3 | 70.78 | 895 |
---|
5 | Rajasthan | 4 | 75.80 | 1010 |
---|
6 | Rajasthan | 5 | 80.65 | 1075 |
---|
7 | Uttar Pradesh | 4 | 73.77 | 945 |
---|
8 | Uttar Pradesh | 5 | 60.61 | 787 |
---|