Join, Merge, Append and Concatenate

  25 Mar 2019
  python

Working with multiple data frames often involves joining two or more tables to in bring out more no. of columns from another table by joining on some sort of relationship which exists within a table or appending two tables which is adding one or more table over another table with keeping the same order of columns.


Example of append data -> monthly files of revenue sheets of a company and wee need at end of the year to be clubbed into single table.

alt text      alt text

alt text

alt text


Example of merging -> multiple files regarding employee education, compensation, performance all linked to each other in some identifier in each one of them which maps to employee master table and for doing analysis we need data from each of these tables in the same which can be achieved by merging.


We’ll look out for merging/joining two tables now and later will discuss the possibilities around appending to tables using pandas.
To begin with let’s get create some dummy datasets.

import pandas as pd
states_codes = pd.DataFrame({'State': ['Haryana', 'Punjab', 'Rajasthan', 'Uttar Pradesh', 'Madhya Pradesh'],
                        'Code': ['HR', 'PB', 'RJ', 'UP', 'MP']})
states_area = pd.DataFrame({'State': ['Haryana', 'Punjab',  'Uttar Pradesh', 'Bihar'],
                        'Area_InSquareKM': [44212, 50362, 243290, 94165]})
# a dummy data file
states_literacyrate = pd.read_csv('literacy.csv')
states_codes
StateCode
0HaryanaHR
1PunjabPB
2RajasthanRJ
3Uttar PradeshUP
4Madhya PradeshMP
states_area
StateArea_InSquareKM
0Haryana44212
1Punjab50362
2Uttar Pradesh243290
3Bihar94165
states_literacyrate
StateYearLiteracy Rate
0Haryana201176.64
1Rajasthan201167.06
2Uttar Pradesh201169.72
3Haryana200167.91
4Uttar Pradesh200156.27
5Rajasthan200160.41

As you can see State column repeats across all three tables, meaning to say that in case it is required to pull out from these three table

# Merge vs Join
area_codes_inner = pd.merge(states_area,states_codes)
area_codes_inner
StateArea_InSquareKMCode
0Haryana44212HR
1Punjab50362PB
2Uttar Pradesh243290UP
results_inner = pd.merge(area_codes_inner,states_literacyrate)
results_inner
StateArea_InSquareKMCodeYearLiteracy Rate
0Haryana44212HR201176.64
1Haryana44212HR200167.91
2Uttar Pradesh243290UP201169.72
3Uttar Pradesh243290UP200156.27
area_codes_left = pd.merge(states_area,states_codes,how ='left')
area_codes_left
StateArea_InSquareKMCode
0Haryana44212HR
1Punjab50362PB
2Uttar Pradesh243290UP
3Bihar94165NaN
results_left = pd.merge(area_codes_left,states_literacyrate,how ='left')
results_left
StateArea_InSquareKMCodeYearLiteracy Rate
0Haryana44212HR2011.076.64
1Haryana44212HR2001.067.91
2Punjab50362PBNaNNaN
3Uttar Pradesh243290UP2011.069.72
4Uttar Pradesh243290UP2001.056.27
5Bihar94165NaNNaNNaN
area_codes_right = pd.merge(states_area,states_codes,how ='right')
area_codes_right
StateArea_InSquareKMCode
0Haryana44212.0HR
1Punjab50362.0PB
2Uttar Pradesh243290.0UP
3RajasthanNaNRJ
4Madhya PradeshNaNMP
results_right = pd.merge(area_codes_right,states_literacyrate,how ='right')
results_right
StateArea_InSquareKMCodeYearLiteracy Rate
0Haryana44212.0HR201176.64
1Haryana44212.0HR200167.91
2Uttar Pradesh243290.0UP201169.72
3Uttar Pradesh243290.0UP200156.27
4RajasthanNaNRJ201167.06
5RajasthanNaNRJ200160.41
area_codes_outer = pd.merge(states_area,states_codes, how = 'outer')
area_codes_outer
StateArea_InSquareKMCode
0Haryana44212.0HR
1Punjab50362.0PB
2Uttar Pradesh243290.0UP
3Bihar94165.0NaN
4RajasthanNaNRJ
5Madhya PradeshNaNMP
results_outer = pd.merge(area_codes_outer,states_literacyrate,how ='outer')
results_outer
StateArea_InSquareKMCodeYearLiteracy Rate
0Haryana44212.0HR2011.076.64
1Haryana44212.0HR2001.067.91
2Punjab50362.0PBNaNNaN
3Uttar Pradesh243290.0UP2011.069.72
4Uttar Pradesh243290.0UP2001.056.27
5Bihar94165.0NaNNaNNaN
6RajasthanNaNRJ2011.067.06
7RajasthanNaNRJ2001.060.41
8Madhya PradeshNaNMPNaNNaN

other options in merge function are - left_on right_on on left_index right_index sort

a detailed description is can be found here - https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html

states_area
StateArea_InSquareKM
0Haryana44212
1Punjab50362
2Uttar Pradesh243290
3Bihar94165
states_codes
StateCode
0HaryanaHR
1PunjabPB
2RajasthanRJ
3Uttar PradeshUP
4Madhya PradeshMP

– .merge : For column(s)-on-columns(s) operations – .join : Join DataFrames using their indexes., if need to be on specific keys, then set keys to be the index

states_area.join(states_literacyrate,lsuffix='_FromArea',rsuffix='_FromLiteracy', how='outer')
State_FromAreaArea_InSquareKMState_FromLiteracyYearLiteracy Rate
0Haryana44212.0Haryana201176.64
1Punjab50362.0Rajasthan201167.06
2Uttar Pradesh243290.0Uttar Pradesh201169.72
3Bihar94165.0Haryana200167.91
4NaNNaNUttar Pradesh200156.27
5NaNNaNRajasthan200160.41
pd.merge(states_area,states_literacyrate, left_index=True, right_index=True, how='outer')
State_xArea_InSquareKMState_yYearLiteracy Rate
0Haryana44212.0Haryana201176.64
1Punjab50362.0Rajasthan201167.06
2Uttar Pradesh243290.0Uttar Pradesh201169.72
3Bihar94165.0Haryana200167.91
4NaNNaNUttar Pradesh200156.27
5NaNNaNRajasthan200160.41
## merge on multiple columns
city_states_literacy = pd.read_csv('City_States_Literacy.csv')
city_states_area = pd.read_csv('City_States_Area.csv')
city_states_area
StateCityArea_InSquareKM
0Haryana1983
1Haryana2886
2Rajasthan1881
3Rajasthan2980
4Rajasthan3895
5Rajasthan41010
6Rajasthan51075
7Bihar1802
8Bihar2859
9Bihar31020
10Uttar Pradesh4945
11Uttar Pradesh5787
12Madhya Pradhes6983
results_inner = pd.merge(city_states_literacy,city_states_area, left_on=['State','City'], right_on=['State','City'])
results_inner
StateCityLiteracy RateArea_InSquareKM
0Haryana174.78983
1Haryana263.80886
2Rajasthan161.68881
3Rajasthan268.61980
4Rajasthan370.78895
5Rajasthan475.801010
6Rajasthan580.651075
7Uttar Pradesh473.77945
8Uttar Pradesh560.61787
pd.merge(city_states_literacy,city_states_area, left_on=['State','City'], right_on=['State','City'])
StateCityLiteracy RateArea_InSquareKM
0Haryana174.78983
1Haryana263.80886
2Rajasthan161.68881
3Rajasthan268.61980
4Rajasthan370.78895
5Rajasthan475.801010
6Rajasthan580.651075
7Uttar Pradesh473.77945
8Uttar Pradesh560.61787