Part 0 - Plotting Using Seaborn - Data Preparation

  20 Aug 2019
  python, visualisation

Import Preliminaries and datasets

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.pylab as plb
import warnings
warnings.filterwarnings('ignore')

test_scores = pd.read_csv("Data/Test scores.csv", parse_dates=['Test taken date'])
test_master = pd.read_csv("Data/Test master.csv")
test_participant = pd.read_csv("Data/Audience summary.csv")
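
The `parse_dates` argument matters because the `.dt` accessor used later only works on datetime columns. As a minimal sketch, the snippet below reads an in-memory CSV instead of the real file. The two rows are copied from the `test_scores.head()` preview shown further down; the `sample_csv` and `sample` names are assumptions made for illustration.

```python
import io
import pandas as pd

# In-memory stand-in for "Data/Test scores.csv" (two rows from the preview)
sample_csv = io.StringIO(
    "Participant identifier,Test Name,Test taken date,Track,Designation,Score\n"
    "37MCTM,If conditional,2018-11-23,Engineering,Lead,18\n"
    "37MCTM,Tenses,2018-11-13,Engineering,Lead,12\n"
)

# parse_dates converts the named column from strings to datetime64
sample = pd.read_csv(sample_csv, parse_dates=["Test taken date"])

# True when the column can be used with the .dt accessor
is_datetime = pd.api.types.is_datetime64_any_dtype(sample["Test taken date"])
print(sample.dtypes)
```

If `is_datetime` is `False` for the real file, the date strings are probably in a format pandas could not parse.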

We have three datasets:

Test Scores Dataset

This contains the score of each participant in each test they took.

test_scores.head()
   Participant identifier  Test Name                    Test taken date  Track        Designation  Score
0  37MCTM                  If conditional               2018-11-23       Engineering  Lead         18
1  37MCTM                  Determiners and Quantifiers  2018-11-23       Engineering  Lead         28
2  37MCTM                  Modals                       2018-11-23       Engineering  Lead         22
3  37MCTM                  Tenses                       2018-11-13       Engineering  Lead         12
4  37MCTM                  Pronouns                     2018-11-13       Engineering  Lead         15

Test Master

This contains additional details about each test: the number of questions, the complexity and the marks per question.

test_master
    Test name                    No. of questions  Complexity  Marks per question
0   Articles-New                 15                Easy        1
1   Tenses                       15                Easy        1
2   Pronouns                     15                Easy        1
3   Articles                     15                Easy        1
4   Conjuctions                  15                Easy        1
5   Adjective & Adverb           15                Easy        1
6   Active and passive voice     15                Medium      2
7   Puctuations                  15                Medium      2
8   If conditional               15                Medium      2
9   Determiners and Quantifiers  15                Medium      2
10  Modals                       15                Medium      2
11  Prepositions                 15                Medium      2
12  Comprehension                10                Difficult   3
13  Confusing words              15                Difficult   3
14  Synonyms & Antonyms          15                Difficult   3
15  Vocabulary                   15                Difficult   3
16  Capitalization               15                Difficult   3

Test Participants

This contains summary details about the participants, broken down by designation and track.

test_participant
   Designation                   Engineering  Quality Assurance  Support
0  Associate                     1400         250.0              220
1  Lead                          1800         400.0              100
2  Manager                       300          60.0               70
3  Consultant                    200          NaN                10
4  Associate Director and above  600          5.0                32

We will derive additional columns from the datasets so that the scores are easier to analyse and compare across multiple factors:

  • Weekday of the test taken date
  • Week number of the test taken date
  • Month of the test taken date
  • Maximum score obtainable in each test
  • Percentage of marks obtained by each participant
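
The three date-based metrics in the list can be checked on a couple of dates before applying them to the full dataset. This is a minimal sketch, assuming pandas 1.1 or later (for `.dt.isocalendar()`). The two dates come from the `test_scores.head()` preview, the offset of 42 is the one used in this post, and the `dates` frame is a hypothetical name used only here.

```python
import pandas as pd

# Two test dates taken from the test_scores preview
dates = pd.DataFrame(
    {"Test taken date": pd.to_datetime(["2018-11-13", "2018-11-23"])}
)

dates["weekday_name"] = dates["Test taken date"].dt.day_name()
dates["month"] = dates["Test taken date"].dt.month_name()

# ISO week numbers are 46 and 47 here; subtracting 42 maps ISO week 43 to week 1
iso_week = dates["Test taken date"].dt.isocalendar().week.astype("Int64")
dates["week"] = iso_week - 42

print(dates)
```

The cast to `Int64` is a design choice rather than a requirement: `isocalendar()` returns an unsigned integer column, and a signed nullable type avoids surprises if a date ever falls before week 43.
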
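# Derive weekday, month and week number from the test date
# (.dt.weekday_name was removed in pandas 1.0; use .dt.day_name() instead)
test_scores['weekday_name'] = test_scores['Test taken date'].dt.day_name()
test_scores['month'] = test_scores['Test taken date'].dt.month_name()
# .dt.week was removed in pandas 2.0; subtract 42 so that week numbers start from 1
test_scores['week'] = test_scores['Test taken date'].dt.isocalendar().week.astype('Int64') - 42
# Maximum score obtainable in each test
test_master['maximum_score'] = test_master['No. of questions'] * test_master['Marks per question']
# Attach the test details to every score row
test_scores = pd.merge(test_scores, test_master, left_on="Test Name", right_on="Test name", how="left")
cols = ['Participant identifier', 'Test Name', 'Track', 'Designation', 'Score',
        'weekday_name', 'month', 'week', 'Complexity', 'maximum_score']
test_scores = test_scores[cols].copy()  # copy so that adding a column does not raise SettingWithCopyWarning
test_scores['Percent'] = (test_scores['Score'] / test_scores['maximum_score'] * 100).round(2)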
test_scores.head()
   Participant identifier  Test Name                    Track        Designation  Score  weekday_name  month     week  Complexity  maximum_score  Percent
0  37MCTM                  If conditional               Engineering  Lead         18     Friday        November  5     Medium      30             60.00
1  37MCTM                  Determiners and Quantifiers  Engineering  Lead         28     Friday        November  5     Medium      30             93.33
2  37MCTM                  Modals                       Engineering  Lead         22     Friday        November  5     Medium      30             73.33
3  37MCTM                  Tenses                       Engineering  Lead         12     Tuesday       November  4     Easy        15             80.00
4  37MCTM                  Pronouns                     Engineering  Lead         15     Tuesday       November  4     Easy        15             100.00
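
A left merge keeps every score row even when its test name has no match in the test master, in which case `Complexity`, `maximum_score` and `Percent` end up as NaN. The join is on free-text test names, so it may be worth checking for such rows. The sketch below uses pandas' `indicator=True` option on toy frames. The `scores` and `master` frames and their values are illustrative assumptions, not the real data, and the mismatched spelling is made up to trigger the check.

```python
import pandas as pd

# Toy stand-ins for test_scores and test_master (illustrative values only)
scores = pd.DataFrame({
    "Test Name": ["Tenses", "Modals", "Conjunctions"],  # last name has no match
    "Score": [12, 22, 10],
})
master = pd.DataFrame({
    "Test name": ["Tenses", "Modals", "Conjuctions"],
    "No. of questions": [15, 15, 15],
    "Marks per question": [1, 2, 1],
})
master["maximum_score"] = master["No. of questions"] * master["Marks per question"]

# indicator=True adds a "_merge" column saying where each row was found
merged = pd.merge(scores, master, left_on="Test Name", right_on="Test name",
                  how="left", indicator=True)

# Rows present only in the left frame had no matching test in the master
unmatched = merged.loc[merged["_merge"] == "left_only", "Test Name"].tolist()

merged["Percent"] = (merged["Score"] / merged["maximum_score"] * 100).round(2)
print(unmatched)
print(merged[["Test Name", "Score", "maximum_score", "Percent"]])
```

An empty `unmatched` list on the real data means every score row found its test details.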

Now we are ready to visualise this data for better analysis.
The first post in the series is Part 1 - Plotting Using Seaborn - Violin, Box and Line Plot.