Titanic EDA

September 27, 2022 — 976 read

The sinking of the Titanic is one of the most notorious shipwrecks in history. On April 15, 1912, during her maiden voyage, the widely considered “unsinkable” RMS Titanic sank after colliding with an iceberg. Unfortunately, there were not enough lifeboats for everyone onboard, resulting in the death of 1502 out of 2224 passengers and crew.

While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others.

check this link to see my original codes on kaggle.com.

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

import matplotlib.pyplot as plt
plt.style.use("seaborn-pastel")

import seaborn as sns

from collections import Counter

import warnings
warnings.filterwarnings("ignore")

1. LOAD AND CHECK DATA

train_df = pd.read_csv("/kaggle/input/titanic/train.csv")
test_df = pd.read_csv("/kaggle/input/titanic/test.csv")
test_PassengerID = test_df["PassengerId"]

train_df.columns

out: Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'], dtype='object')

train_df.head()

train_df.describe()

2. VARIABLE DESCRIPTION

PassengerId: unique id number for each passenger
Survived: there are 0 and 1, if variable is 0 so passenger was death, else 1 so survived.
Pclass: passenger class
Name: name
Sex: gender of passenger
Age: age of passenger
SibSp: number of siblings/spouses
Parch: number of parents/children
Ticket: number of ticket
Fare: price of ticket
Cabin: the category of cabinet
Embarke: ports where passenger embarked (C: Cherbourg, Q:Queenstown, S=Southampton)

train_df.info()

float64(2) - fare and age
int64(5) - PassengerId, Survived, Pclass, SibSp, Parch
object(5) - name, sex, ticket, cabin, embarked

2.1. Univariate Variable Analysis

- Categorical Variable: Survived, Sex, Pclass, Embarked, Cabin, Name, Ticket, SibSp and Parch

- Numerical Variable: Age, PassengerId, Fare

2.1.1. Categorical Variable

def bar_plot(variable):
    """
        input: variable ex:"Sex"
        output: bar plot & value count
    """
    #get features
    var = train_df[variable]
    #count number of categorical variable(value/sample)
    varValue = var.value_counts()
    
    #visualize
    plt.figure(figsize = (9,3))
    plt.bar(varValue.index, varValue)
    plt.xticks(varValue.index, varValue.index.values)
    plt.ylabel("Frequency")
    plt.title(variable)
    plt.show()
    print("{}: \n {}".format(variable, varValue))

category1 = ["Survived", "Sex", "Pclass", "Embarked", "SibSp", "Parch"]
for c in category1:
    bar_plot(c)

In Survived histogram, it is obvious that 549 passenger did not survive, at the same time 342 people survived. Of the 551 passengers, 549 did not survive and 342 survived. Of the 551 passengers, 549 did not survive and 342 survived. These are not half, so the survive dataset is not a balanced dataset.

Of the 891 passengers, 577 were men and 314 were women. This data set has an uneven distribution. By looking at this data set, one can have the idea that a passenger is an average of 1/3 male by looking at the male to female ratio.

There are three classes in Pclass data. There are 216 passengers in the first class, 184 in the second class and 491 in the third class.

The relationship between Embarked and Pclass will be examined.

category2 = ["Cabin", "Name", "Ticket"]
for c in category2:
    print("{} \n".format(train_df[c].value_counts()))

2.1.2. Numerical Variable

def plot_histogram(variable):
    plt.figure(figsize= (9,3))
    plt.hist(train_df[variable], bins=50)
    plt.xlabel(variable)
    plt.ylabel("Frequency")
    plt.title("{} distribution with hist".format(variable))
    plt.show()

numericVar=["Fare", "Age", "PassengerId"]
for n in numericVar:
    plot_histogram(n)

In Fare distribution with his, the price can be seen above 100 that is mostly paid. There is an option for who paid 500 means he/she wa rich or paid someone else ticket.

In age distribution, the consistency is stayed ages that are in between 20 and 30. Also, the number of kids are also quite high.

3. BASIC DATA ANALYSIS

Pclass vs Survived
Sex vs Survived
SibSp vs Survived
Parch vs Survived

# pclass vs survived

train_df[["Pclass", "Survived"]].groupby(["Pclass"], as_index = False).mean().sort_values(by="Survived", ascending = False)

	Pclass	Survived
0	1	0.629630
1	2	0.472826
2	3	0.242363

Passengers who bought first class tickets were more likely to survive than other class. That is %62. The second class percent is %47, the third class is %24.

# Sex vs survived

train_df[["Sex", "Survived"]].groupby(["Sex"], as_index = False).mean().sort_values(by="Survived", ascending = False)

	Sex	Survived
0	female	0.742038
1	male	0.188908

# SibSp vs survived

train_df[["SibSp", "Survived"]].groupby(["SibSp"], as_index = False).mean().sort_values(by="Survived", ascending = False)

	SibSp	Survived
1	1	0.535885
2	2	0.464286
0	0	0.345395
3	3	0.250000
4	4	0.166667
5	5	0.000000
6	8	0.000000

Survival rate of those with a sibling with them is the highest with 53.6%. The odds of survival for those with no one are in the third place with 34%. Survival rate is very low for those who have more than two people with them.

# Parch vs survived

train_df[["Parch", "Survived"]].groupby(["Parch"], as_index = False).mean().sort_values(by="Survived", ascending = False)

	Parch	Survived
3	3	0.600000
1	1	0.550847
2	2	0.500000
0	0	0.343658
5	5	0.200000
4	4	0.000000
6	6	0.000000

If there is a family or a child with them, that is, 3 people in total, the probability of survival is around 50%. However, as this number increases, the survival rate decreases.

4. OUTLINER DETECTION

def detect_outliers(df,features):
    outliner_indexes = []
    
    for c in features:
        #1st quartile
        Q1 = np.percentile(df[c], 25)
        
        #3rd quartile
        Q3 = np.percentile(df[c], 75)
        
        #IQR
        IQR = Q1 - Q3
        
        #Outliner ster
        outliner_step = IQR * 1.5
        
        #Detect outliner and their indexes
        outliner_list_col = df[(df[c] < Q1 - outliner_step) | (df[c] > Q3 + outliner_step)].index
        
        #store indexes
        outliner_indexes.extend(outliner_list_col)
    
    outliner_indexes = Counter(outliner_indexes)
    multiple_outliers = list(i for i, v in outliner_indexes.items() if v > 2)
    
    return multiple_outliers

train_df.loc[detect_outliers(train_df, ["Age", "SibSp", "Parch", "Fare"])]

#drop outliners
train_df = train_df.drop(detect_outliers(train_df, ["Age", "SibSp", "Parch", "Fare"]), axis=0).reset_index ( drop = True)

4. MISSING VALUE

find missing value
fill missing value

train_df_len = len(train_df)
train_df = pd.concat([train_df, test_df], axis = 0).reset_index(drop = True)

train_df.head()

4.1. Find Missing Value

#finding which columns contain null
train_df.columns[train_df.isnull().any()]

Index(['Survived', 'Age', 'Fare', 'Cabin', 'Embarked'], dtype='object')

#how many nulls that those columns have
train_df.isnull().sum()

PassengerId 0
Survived 418
Pclass 0
Name 0
Sex 0
Age 243
SibSp 0
Parch 0
Ticket 0
Fare 1
Cabin 864
Embarked 2
dtype: int64

4.2. Fill Missing Value

embarked has 2 missing value
Fare has only 1

train_df[train_df["Embarked"].isnull()]

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
48	62	1.0	1	Icard, Miss. Amelie	female	38.0	0	0	113572	80.0	B28	NaN
633	830	1.0	1	Stone, Mrs. George Nelson (Martha Evelyn)	female	62.0	0	0	113572	80.0	B28	NaN

Embarked can be charged by comparison for filling. For example, you can see where the first class passengers are getting on, or fare and fill them accordingly.

train_df.boxplot(column="Fare", by="Embarked")
plt.show

The median of those who embark from port C is closer to eighty than the others, because the fare for empty ones is 80. hence the blank data will be filled as C.

train_df["Embarked"] = train_df["Embarked"].fillna("C")
train_df[train_df["Embarked"].isnull()]

train_df[train_df["Fare"].isnull()]

train_df["Fare"] = train_df["Fare"].fillna(np.mean(train_df[train_df["Pclass"] == 3]["Fare"]))

train_df[train_df["Fare"].isnull()]