Join, Merge, Append and Concatenate

Table of Contents

Working with multiple data frames often involves joining two or more tables to in bring out more no. of columns from another table by joining on some sort of relationship which exists within a table or appending two tables which is adding one or more table over another table with keeping the same order of columns.


Example of append data -> monthly files of revenue sheets of a company and wee need at end of the year to be clubbed into single table.

alt text      alt text

alt text

alt text


Example of merging -> multiple files regarding employee education, compensation, performance all linked to each other in some identifier in each one of them which maps to employee master table and for doing analysis we need data from each of these tables in the same which can be achieved by merging.


We’ll look out for merging/joining two tables now and later will discuss the possibilities around appending to tables using pandas.
To begin with let’s get create some dummy datasets.

import pandas as pd
states_codes = pd.DataFrame({'State': ['Haryana', 'Punjab', 'Rajasthan', 'Uttar Pradesh', 'Madhya Pradesh'],
                        'Code': ['HR', 'PB', 'RJ', 'UP', 'MP']})
states_area = pd.DataFrame({'State': ['Haryana', 'Punjab',  'Uttar Pradesh', 'Bihar'],
                        'Area_InSquareKM': [44212, 50362, 243290, 94165]})
# a dummy data file
states_literacyrate = pd.read_csv('literacy.csv')
states_codes
StateCode
0HaryanaHR
1PunjabPB
2RajasthanRJ
3Uttar PradeshUP
4Madhya PradeshMP
```python states_area ```
StateArea_InSquareKM
0Haryana44212
1Punjab50362
2Uttar Pradesh243290
3Bihar94165
```python states_literacyrate ```
StateYearLiteracy Rate
0Haryana201176.64
1Rajasthan201167.06
2Uttar Pradesh201169.72
3Haryana200167.91
4Uttar Pradesh200156.27
5Rajasthan200160.41
As you can see State column repeats across all three tables, meaning to say that in case it is required to pull out from these three table
# Merge vs Join
area_codes_inner = pd.merge(states_area,states_codes)
area_codes_inner
StateArea_InSquareKMCode
0Haryana44212HR
1Punjab50362PB
2Uttar Pradesh243290UP
```python results_inner = pd.merge(area_codes_inner,states_literacyrate) results_inner ```
StateArea_InSquareKMCodeYearLiteracy Rate
0Haryana44212HR201176.64
1Haryana44212HR200167.91
2Uttar Pradesh243290UP201169.72
3Uttar Pradesh243290UP200156.27
```python area_codes_left = pd.merge(states_area,states_codes,how ='left') area_codes_left ```
StateArea_InSquareKMCode
0Haryana44212HR
1Punjab50362PB
2Uttar Pradesh243290UP
3Bihar94165NaN
```python results_left = pd.merge(area_codes_left,states_literacyrate,how ='left') results_left ```
StateArea_InSquareKMCodeYearLiteracy Rate
0Haryana44212HR2011.076.64
1Haryana44212HR2001.067.91
2Punjab50362PBNaNNaN
3Uttar Pradesh243290UP2011.069.72
4Uttar Pradesh243290UP2001.056.27
5Bihar94165NaNNaNNaN
```python area_codes_right = pd.merge(states_area,states_codes,how ='right') area_codes_right ```
StateArea_InSquareKMCode
0Haryana44212.0HR
1Punjab50362.0PB
2Uttar Pradesh243290.0UP
3RajasthanNaNRJ
4Madhya PradeshNaNMP
```python results_right = pd.merge(area_codes_right,states_literacyrate,how ='right') results_right ```
StateArea_InSquareKMCodeYearLiteracy Rate
0Haryana44212.0HR201176.64
1Haryana44212.0HR200167.91
2Uttar Pradesh243290.0UP201169.72
3Uttar Pradesh243290.0UP200156.27
4RajasthanNaNRJ201167.06
5RajasthanNaNRJ200160.41
```python area_codes_outer = pd.merge(states_area,states_codes, how = 'outer') area_codes_outer ```
StateArea_InSquareKMCode
0Haryana44212.0HR
1Punjab50362.0PB
2Uttar Pradesh243290.0UP
3Bihar94165.0NaN
4RajasthanNaNRJ
5Madhya PradeshNaNMP
```python results_outer = pd.merge(area_codes_outer,states_literacyrate,how ='outer') results_outer ```
StateArea_InSquareKMCodeYearLiteracy Rate
0Haryana44212.0HR2011.076.64
1Haryana44212.0HR2001.067.91
2Punjab50362.0PBNaNNaN
3Uttar Pradesh243290.0UP2011.069.72
4Uttar Pradesh243290.0UP2001.056.27
5Bihar94165.0NaNNaNNaN
6RajasthanNaNRJ2011.067.06
7RajasthanNaNRJ2001.060.41
8Madhya PradeshNaNMPNaNNaN
other options in merge function are - left_on right_on on left_index right_index sort

a detailed description is can be found here - https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html

states_area
StateArea_InSquareKM
0Haryana44212
1Punjab50362
2Uttar Pradesh243290
3Bihar94165
```python states_codes ```
StateCode
0HaryanaHR
1PunjabPB
2RajasthanRJ
3Uttar PradeshUP
4Madhya PradeshMP
-- .merge : For column(s)-on-columns(s) operations -- .join : Join DataFrames using their indexes., if need to be on specific keys, then set keys to be the index
states_area.join(states_literacyrate,lsuffix='_FromArea',rsuffix='_FromLiteracy', how='outer')
State_FromAreaArea_InSquareKMState_FromLiteracyYearLiteracy Rate
0Haryana44212.0Haryana201176.64
1Punjab50362.0Rajasthan201167.06
2Uttar Pradesh243290.0Uttar Pradesh201169.72
3Bihar94165.0Haryana200167.91
4NaNNaNUttar Pradesh200156.27
5NaNNaNRajasthan200160.41
```python pd.merge(states_area,states_literacyrate, left_index=True, right_index=True, how='outer') ```
State_xArea_InSquareKMState_yYearLiteracy Rate
0Haryana44212.0Haryana201176.64
1Punjab50362.0Rajasthan201167.06
2Uttar Pradesh243290.0Uttar Pradesh201169.72
3Bihar94165.0Haryana200167.91
4NaNNaNUttar Pradesh200156.27
5NaNNaNRajasthan200160.41
```python ## merge on multiple columns ```
city_states_literacy = pd.read_csv('City_States_Literacy.csv')
city_states_area = pd.read_csv('City_States_Area.csv')
city_states_area
StateCityArea_InSquareKM
0Haryana1983
1Haryana2886
2Rajasthan1881
3Rajasthan2980
4Rajasthan3895
5Rajasthan41010
6Rajasthan51075
7Bihar1802
8Bihar2859
9Bihar31020
10Uttar Pradesh4945
11Uttar Pradesh5787
12Madhya Pradhes6983
```python results_inner = pd.merge(city_states_literacy,city_states_area, left_on=['State','City'], right_on=['State','City']) results_inner ```
StateCityLiteracy RateArea_InSquareKM
0Haryana174.78983
1Haryana263.80886
2Rajasthan161.68881
3Rajasthan268.61980
4Rajasthan370.78895
5Rajasthan475.801010
6Rajasthan580.651075
7Uttar Pradesh473.77945
8Uttar Pradesh560.61787
```python pd.merge(city_states_literacy,city_states_area, left_on=['State','City'], right_on=['State','City']) ```
StateCityLiteracy RateArea_InSquareKM
0Haryana174.78983
1Haryana263.80886
2Rajasthan161.68881
3Rajasthan268.61980
4Rajasthan370.78895
5Rajasthan475.801010
6Rajasthan580.651075
7Uttar Pradesh473.77945
8Uttar Pradesh560.61787
---

Related Posts

Read data into pandas

December 23, 2018

Read More
Subsetting a dataframe in pandas

January 5, 2019

Read More
Connect to azure datalake store using python

December 20, 2018

Read More