Part 0 - Plotting Using Seaborn - Data Preparation

Import Preliminaries and datasets

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.pylab as plb
import warnings
warnings.filterwarnings('ignore')

test_scores = pd.read_csv("Data/Test scores.csv", parse_dates=['Test taken date'])
test_master = pd.read_csv("Data/Test master.csv")
test_participant = pd.read_csv("Data/Audience summary.csv")

We have three datasets, namely -

Test Scores Dataset

This contains scores of each particpant in the test they appeared.

test_scores.head()

	Participant identifier	Test Name	Test taken date	Track	Designation	Score
0	37MCTM	If conditional	2018-11-23	Engineering	Lead	18
1	37MCTM	Determiners and Quantifiers	2018-11-23	Engineering	Lead	28
2	37MCTM	Modals	2018-11-23	Engineering	Lead	22
3	37MCTM	Tenses	2018-11-13	Engineering	Lead	12
4	37MCTM	Pronouns	2018-11-13	Engineering	Lead	15

Test Master

This is about the other details associated with each test.

test_master

	Test name	No. of questions	Complexity	Marks per question
0	Articles-New	15	Easy	1
1	Tenses	15	Easy	1
2	Pronouns	15	Easy	1
3	Articles	15	Easy	1
4	Conjuctions	15	Easy	1
5	Adjective & Adverb	15	Easy	1
6	Active and passive voice	15	Medium	2
7	Puctuations	15	Medium	2
8	If conditional	15	Medium	2
9	Determiners and Quantifiers	15	Medium	2
10	Modals	15	Medium	2
11	Prepositions	15	Medium	2
12	Comprehension	10	Difficult	3
13	Confusing words	15	Difficult	3
14	Synonyms & Antonyms	15	Difficult	3
15	Vocabulary	15	Difficult	3
16	Capitalization	15	Difficult	3

Test Participants

This is abouth the other details associated with the pariticipants.

test_participant

	Designation	Engineering	Quality Assurance	Support
0	Associate	1400	250.0	220
1	Lead	1800	400.0	100
2	Manager	300	60.0	70
3	Consultant	200	NaN	10
4	Associate Director and above	600	5.0	32

We will create more metrics in the dataset provided so that it would be easy to analyse and compare across multiple factors, like -

Weekday
Week No.
Month of the test taken date
Maximum Score can be obtained
Percentage of marks obtained by the participants

test_scores['weekday_name']  = test_scores['Test taken date'].dt.weekday_name
test_scores['month']  = test_scores['Test taken date'].dt.month_name() 
test_scores['week']  = test_scores['Test taken date'].dt.week-42 # to get number from 1 
test_master['maximum_score'] = test_master['No. of questions'] * test_master['Marks per question']
test_scores = pd.merge(test_scores,test_master,left_on="Test Name", right_on="Test name", how = "left")
cols = ['Participant identifier', 'Test Name', 'Track','Designation', 'Score', 
        'weekday_name', 'month', 'week','Complexity', 'maximum_score']
test_scores = test_scores[cols]
test_scores['Percent'] = round((test_scores['Score']/test_scores['maximum_score'])*100,2)

test_scores.head()

	Participant identifier	Test Name	Track	Designation	Score	weekday_name	month	week	Complexity	maximum_score	Percent
0	37MCTM	If conditional	Engineering	Lead	18	Friday	November	5	Medium	30	60.00
1	37MCTM	Determiners and Quantifiers	Engineering	Lead	28	Friday	November	5	Medium	30	93.33
2	37MCTM	Modals	Engineering	Lead	22	Friday	November	5	Medium	30	73.33
3	37MCTM	Tenses	Engineering	Lead	12	Tuesday	November	4	Easy	15	80.00
4	37MCTM	Pronouns	Engineering	Lead	15	Tuesday	November	4	Easy	15	100.00

Now we are ready to visualise this data for better analysis.
The first post in the series is - Part 1 - Plotting Using Seaborn - Violin, Box and Line Plot

Import Preliminaries and datasets

Test Scores Dataset

Test Master

Test Participants

Related Posts