Titanic EDA

September 27, 2022

The sinking of the Titanic is one of the most notorious shipwrecks in history. On April 15, 1912, during her maiden voyage, the widely considered “unsinkable” RMS Titanic sank after colliding with an iceberg. Unfortunately, there were not enough lifeboats for everyone onboard, resulting in the death of 1502 out of 2224 passengers and crew. 

While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others.

 

Check this link to see my original code on kaggle.com.

 

 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

import matplotlib.pyplot as plt
plt.style.use("seaborn-pastel")

import seaborn as sns

from collections import Counter

import warnings
warnings.filterwarnings("ignore")

 

1. LOAD AND CHECK DATA

train_df = pd.read_csv("/kaggle/input/titanic/train.csv")
test_df = pd.read_csv("/kaggle/input/titanic/test.csv")
test_PassengerID = test_df["PassengerId"]
train_df.columns

out: Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'], dtype='object')

train_df.head()
train_df.describe()

 

2. VARIABLE DESCRIPTION

PassengerId: unique id number for each passenger
Survived: takes the value 0 or 1; 0 means the passenger died, 1 means the passenger survived
Pclass: passenger class
Name: name
Sex: gender of passenger
Age: age of passenger
SibSp: number of siblings/spouses
Parch: number of parents/children
Ticket: ticket number
Fare: price of ticket
Cabin: cabin category
Embarked: port where the passenger embarked (C = Cherbourg, Q = Queenstown, S = Southampton)
 

train_df.info()

  • float64(2) - Fare and Age
  • int64(5) - PassengerId, Survived, Pclass, SibSp, Parch
  • object(5) - Name, Sex, Ticket, Cabin, Embarked
     

 

2.1. Univariate Variable Analysis

- Categorical Variable: Survived, Sex, Pclass, Embarked, Cabin, Name, Ticket, SibSp and Parch

- Numerical Variable: Age, PassengerId, Fare

 

2.1.1. Categorical Variable

def bar_plot(variable):
    """
        input: variable ex:"Sex"
        output: bar plot & value count
    """
    #get features
    var = train_df[variable]
    #count number of categorical variable(value/sample)
    varValue = var.value_counts()
    
    #visualize
    plt.figure(figsize = (9,3))
    plt.bar(varValue.index, varValue)
    plt.xticks(varValue.index, varValue.index.values)
    plt.ylabel("Frequency")
    plt.title(variable)
    plt.show()
    print("{}: \n {}".format(variable, varValue))
category1 = ["Survived", "Sex", "Pclass", "Embarked", "SibSp", "Parch"]
for c in category1:
    bar_plot(c)

 

  

 

The Survived bar plot shows that, of the 891 passengers, 549 did not survive and 342 survived. The two groups are not equal in size, so the dataset is not balanced with respect to Survived.

Of the 891 passengers, 577 were men and 314 were women, so the distribution is uneven. From the male to female ratio, roughly two thirds of the passengers (about 65%) were male.
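The shares behind these counts can be checked directly. The following is a minimal sketch that uses only the counts quoted in the text; the names sex_counts and survived_counts are illustrative. On the real DataFrame, train_df["Sex"].value_counts(normalize=True) should give the same shares.

```python
import pandas as pd

# Counts quoted in the text for the 891 training passengers.
sex_counts = pd.Series({"male": 577, "female": 314})
survived_counts = pd.Series({0: 549, 1: 342})

# Dividing each count by the total turns counts into shares.
sex_share = sex_counts / sex_counts.sum()
survived_share = survived_counts / survived_counts.sum()

print(sex_share.round(3))
print(survived_share.round(3))
```

A share far from 0.5 on Survived is what makes the dataset imbalanced for a binary target.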

There are three classes in Pclass data. There are 216 passengers in the first class, 184 in the second class and 491 in the third class.

The relationship between Embarked and Pclass will be examined.
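One way to look at the relationship between Embarked and Pclass is a cross-tabulation. The sketch below uses a small made-up table (toy is hypothetical, not the Titanic data) to show the shape of the result; with the real data, the two columns of train_df would be passed instead.

```python
import pandas as pd

# Hypothetical toy rows standing in for train_df[["Embarked", "Pclass"]].
toy = pd.DataFrame({
    "Embarked": ["S", "S", "C", "C", "Q", "S", "C", "Q"],
    "Pclass":   [3,   1,   1,   1,   3,   2,   2,   3],
})

# Rows are ports, columns are classes, cells are passenger counts.
port_by_class = pd.crosstab(toy["Embarked"], toy["Pclass"])
print(port_by_class)

# Row-normalised version: the class mix within each port.
port_by_class_share = pd.crosstab(toy["Embarked"], toy["Pclass"], normalize="index")
print(port_by_class_share.round(2))
```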

 

category2 = ["Cabin", "Name", "Ticket"]
for c in category2:
    print("{} \n".format(train_df[c].value_counts()))

2.1.2. Numerical Variable

def plot_histogram(variable):
    plt.figure(figsize= (9,3))
    plt.hist(train_df[variable], bins=50)
    plt.xlabel(variable)
    plt.ylabel("Frequency")
    plt.title("{} distribution with hist".format(variable))
    plt.show()
numericVar=["Fare", "Age", "PassengerId"]
for n in numericVar:
    plot_histogram(n)

 

In the Fare histogram, most fares are below 100. A few passengers paid around 500, which suggests they were either rich or paid for someone else's ticket as well.

In the Age distribution, most passengers are between 20 and 30 years old. The number of young children is also quite high.

 

3. BASIC DATA ANALYSIS

  • Pclass vs Survived
  • Sex vs Survived
  • SibSp vs Survived
  • Parch vs Survived

 

# pclass vs survived

train_df[["Pclass", "Survived"]].groupby(["Pclass"], as_index = False).mean().sort_values(by="Survived", ascending = False)
   Pclass  Survived
0       1  0.629630
1       2  0.472826
2       3  0.242363

Passengers who bought first class tickets were more likely to survive than those in the other classes: their survival rate is 63%. The rate is 47% for second class and 24% for third class.

 

# Sex vs survived

train_df[["Sex", "Survived"]].groupby(["Sex"], as_index = False).mean().sort_values(by="Survived", ascending = False)
      Sex  Survived
0  female  0.742038
1    male  0.188908

 

# SibSp vs survived

train_df[["SibSp", "Survived"]].groupby(["SibSp"], as_index = False).mean().sort_values(by="Survived", ascending = False)
   SibSp  Survived
1      1  0.535885
2      2  0.464286
0      0  0.345395
3      3  0.250000
4      4  0.166667
5      5  0.000000
6      8  0.000000

Passengers with exactly one sibling or spouse aboard have the highest survival rate, 53.6%. Those with none are in third place with 34.5%. The survival rate is very low for passengers who have more than two siblings or spouses with them.

 

# Parch vs survived

train_df[["Parch", "Survived"]].groupby(["Parch"], as_index = False).mean().sort_values(by="Survived", ascending = False)
   Parch  Survived
3      3  0.600000
1      1  0.550847
2      2  0.500000
0      0  0.343658
5      5  0.200000
4      4  0.000000
6      6  0.000000

Passengers with one to three parents or children aboard have a survival rate of roughly 50% to 60%, compared with 34% for those with none. However, as this number increases beyond three, the survival rate decreases sharply.
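A group mean says nothing about how many passengers are behind it, and rates for rare values such as large families can rest on a handful of rows. A minimal sketch on made-up data (toy is hypothetical) reports the group size next to the rate; the same aggregation could be applied to train_df.

```python
import pandas as pd

# Hypothetical toy rows standing in for train_df[["Parch", "Survived"]].
toy = pd.DataFrame({
    "Parch":    [0, 0, 0, 0, 1, 1, 2, 4],
    "Survived": [0, 1, 0, 0, 1, 1, 1, 0],
})

# Survival rate and group size per Parch value, highest rate first.
summary = (
    toy.groupby("Parch")["Survived"]
       .agg(rate="mean", n="count")
       .sort_values("rate", ascending=False)
)
print(summary)
```

Reading rate and n together makes it easier to tell a stable pattern from one driven by one or two passengers.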

 

4. OUTLIER DETECTION

def detect_outliers(df, features):
    outlier_indexes = []

    for c in features:
        # 1st quartile (NaN if the column has missing values)
        Q1 = np.percentile(df[c], 25)

        # 3rd quartile
        Q3 = np.percentile(df[c], 75)

        # IQR: 3rd quartile minus 1st quartile, so it is never negative
        IQR = Q3 - Q1

        # outlier step
        outlier_step = IQR * 1.5

        # detect outliers and their indexes
        outlier_list_col = df[(df[c] < Q1 - outlier_step) | (df[c] > Q3 + outlier_step)].index

        # store indexes
        outlier_indexes.extend(outlier_list_col)

    # keep only the rows flagged as outliers in more than two features
    outlier_indexes = Counter(outlier_indexes)
    multiple_outliers = [i for i, v in outlier_indexes.items() if v > 2]

    return multiple_outliers

train_df.loc[detect_outliers(train_df, ["Age", "SibSp", "Parch", "Fare"])]

# drop outliers
train_df = train_df.drop(detect_outliers(train_df, ["Age", "SibSp", "Parch", "Fare"]), axis=0).reset_index(drop=True)
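To make the IQR rule concrete, here is a small worked example on a made-up array (values is hypothetical, not taken from the Titanic data). It is only a sketch of the per-column step; it does not include the counting of rows flagged in several features.

```python
import numpy as np

# Hypothetical sample with one obviously extreme value.
values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 100], dtype=float)

q1 = np.percentile(values, 25)   # 1st quartile
q3 = np.percentile(values, 75)   # 3rd quartile
iqr = q3 - q1                    # interquartile range, never negative
step = 1.5 * iqr

# Anything outside [q1 - step, q3 + step] counts as an outlier.
lower, upper = q1 - step, q3 + step
outliers = values[(values < lower) | (values > upper)]

print(q1, q3, iqr)
print(lower, upper)
print(outliers)
```

If the subtraction is written the other way round, the step becomes negative, the two fences cross, and almost every row ends up flagged.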

 

4. MISSING VALUE

  • find missing value
  • fill missing value
train_df_len = len(train_df)
train_df = pd.concat([train_df, test_df], axis = 0).reset_index(drop = True)
train_df.head()

 

4.1. Find Missing Value

#finding which columns contain null
train_df.columns[train_df.isnull().any()]

Index(['Survived', 'Age', 'Fare', 'Cabin', 'Embarked'], dtype='object')

#how many nulls that those columns have
train_df.isnull().sum()

PassengerId      0 
Survived       418 
Pclass           0 
Name             0 
Sex              0 
Age            243 
SibSp            0 
Parch            0 
Ticket           0 
Fare             1 
Cabin          864 
Embarked         2 
dtype: int64

 

4.2. Fill Missing Value

  1. Embarked has 2 missing values
  2. Fare has only 1
train_df[train_df["Embarked"].isnull()]

 

     PassengerId  Survived  Pclass                                       Name     Sex   Age  SibSp  Parch  Ticket  Fare Cabin Embarked
48            62       1.0       1                        Icard, Miss. Amelie  female  38.0      0      0  113572  80.0   B28      NaN
633          830       1.0       1  Stone, Mrs. George Nelson (Martha Evelyn)  female  62.0      0      0  113572  80.0   B28      NaN

The missing Embarked values can be filled by comparison with other features. For example, we can look at where first class passengers boarded, or at the fare paid, and fill the values accordingly.

 

train_df.boxplot(column="Fare", by="Embarked")
plt.show()

Both passengers with a missing Embarked value paid a fare of 80, and the median fare of those who embarked at port C is closer to 80 than the median of the other ports. Hence the missing values will be filled with C.
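The same comparison can be made numerically instead of by reading the boxplot. The sketch below uses made-up fares (toy and target_fare are hypothetical names, and the numbers are not the real Titanic fares) to show the idea: compute the median fare per port and pick the port whose median is closest to the fare of the rows being filled.

```python
import pandas as pd

# Hypothetical toy rows standing in for train_df[["Embarked", "Fare"]].
toy = pd.DataFrame({
    "Embarked": ["C", "C", "C", "Q", "Q", "S", "S", "S"],
    "Fare":     [60.0, 80.0, 90.0, 7.0, 9.0, 8.0, 13.0, 26.0],
})

# Median fare per port (groupby skips rows where Embarked is missing).
median_fare = toy.groupby("Embarked")["Fare"].median()
print(median_fare)

# Fare paid by the passengers whose port is unknown.
target_fare = 80.0

# Port whose median fare is closest to the target fare.
closest_port = (median_fare - target_fare).abs().idxmin()
print(closest_port)
```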

train_df["Embarked"] = train_df["Embarked"].fillna("C")
train_df[train_df["Embarked"].isnull()]
train_df[train_df["Fare"].isnull()]
train_df["Fare"] = train_df["Fare"].fillna(np.mean(train_df[train_df["Pclass"] == 3]["Fare"]))
train_df[train_df["Fare"].isnull()]
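The Fare fill above uses the mean fare of third class passengers. A slightly more general variant, sketched here on made-up data (toy is hypothetical), fills each missing fare with the mean of that row's own class, so the class does not have to be hard-coded. This is an alternative illustration, not the code used in the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows; the last third class fare is missing.
toy = pd.DataFrame({
    "Pclass": [1, 1, 3, 3, 3],
    "Fare":   [80.0, 60.0, 8.0, 10.0, np.nan],
})

# Per-row mean fare of the row's class; NaN values are skipped by the mean.
class_mean = toy.groupby("Pclass")["Fare"].transform("mean")

# Fill only the missing entries, leaving known fares untouched.
toy["Fare"] = toy["Fare"].fillna(class_mean)
print(toy)
```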