As many of you are aware, Kaggle is one of the most sought-after data science platforms: it hosts competitions that teach machine learning concepts, and it also offers monetary prizes for solving real-life problems.
I recently participated in the Titanic competition to predict survival in the test cohort. Technically speaking, I learnt nothing new from this exercise: I implemented the same machine learning models, obtained an AUC, and assessed the performance of the various models. I will walk you through the code I wrote for this competition. What I did learn, however, is the significance of data cleaning and feature engineering.
Spend some time understanding and cleaning your dataset.
I can’t emphasize enough the importance and impact of a clean dataset.
* What are the features that represent this dataset?
A detailed explanation of this dataset and its features can be found at https://www.kaggle.com/c/titanic/data
* What features are numerical/categorical?
Numerical – Pclass, Age, SibSp, Parch, Fare
Categorical – Sex, Embarked
* What is the ratio of dataset samples to features? You do not want to overfit your model. I have noticed that real-life datasets (unlike the ones I extract in my lab) usually have a good ratio of samples to features, so feature selection is not a major concern here.
* Real-life data almost always has missing values that need to be taken care of. Imputation is necessary to fill in those missing values, and please keep in mind that the way you impute numerical features is different from the way you impute categorical features.
I worked extensively on cleaning up the feature matrix. A quick look told me I didn’t want to pursue the features ‘Name’, ‘Ticket’ and ‘Cabin’. I filled the missing age values with the mean of the feature column, and I tackled the missing categorical values of ‘Embarked’ by substituting the most frequently occurring value.
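The two imputation strategies described above (mean for a numerical column, mode for a categorical one) can be sketched on a tiny toy frame; the values here are illustrative, not rows from the actual dataset:

```python
import pandas as pd

df = pd.DataFrame({
    "Age": [22.0, None, 35.0, 28.0],
    "Embarked": ["S", "C", None, "S"],
})

# Numerical feature: fill missing ages with the column mean
df["Age"] = df["Age"].fillna(df["Age"].mean())

# Categorical feature: fill missing ports with the most frequent value
df["Embarked"] = df["Embarked"].fillna(df["Embarked"].mode()[0])
```

After this, the missing age becomes the mean of the observed ages and the missing port becomes "S", the mode of the column.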
* A quick correlation analysis within the training set will tell you which features are worth pursuing and cleaning up.
From this analysis, we can estimate that survival was most favorable for a wealthy female passenger.
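One quick way to see this pattern is a group-by over sex and passenger class; the rows below are a made-up stand-in for the training set, shaped so that higher-class females survive:

```python
import pandas as pd

# Toy rows in the shape of the Titanic training set (illustrative values only)
train = pd.DataFrame({
    "Survived": [1, 1, 0, 0, 1, 0],
    "Pclass":   [1, 1, 3, 3, 2, 3],
    "Sex":      ["female", "female", "male", "male", "female", "male"],
})

# Mean survival rate per (Sex, Pclass) group
rates = train.groupby(["Sex", "Pclass"])["Survived"].mean()
```

On the real training data, the same group-by shows first-class women with the highest survival rate and third-class men with the lowest.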
# -*- coding: utf-8 -*-
# @author: nihabeig (forked from David Retana, Kaggle)
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.ensemble import RandomForestClassifier

# Load the labels, then stack train and test so both get identical preprocessing
dataset_train = pd.read_csv("~/Kaggle/Titanic/train.csv")
y = dataset_train['Survived']
dataset_train = dataset_train.drop('Survived', axis=1)
dataset_test = pd.read_csv("~/Kaggle/Titanic/test.csv")
dataset = pd.concat([dataset_train, dataset_test], axis=0, ignore_index=True)

numerical_columns = ['Pclass', 'Age', 'SibSp', 'Parch', 'Fare']
categorical_columns = ['Sex', 'Embarked']

# Numerical features: impute with the median, then scale to [0, 1]
dataset_numerical = dataset[numerical_columns]
dataset_numerical = dataset_numerical.fillna(dataset_numerical.median())
scaler = MinMaxScaler(feature_range=(0, 1), copy=True)
dataset_numerical = pd.DataFrame(scaler.fit_transform(dataset_numerical),
                                 columns=numerical_columns)

# Categorical features: impute with the most frequent value, then one-hot encode
dataset_categorical = dataset[categorical_columns]
dataset_categorical = dataset_categorical.fillna(dataset_categorical.mode().iloc[0])
dataset_cat1 = pd.get_dummies(dataset_categorical, columns=["Sex", "Embarked"])

# Reassemble the feature matrix and split it back into labelled and unlabelled rows
features = pd.concat([dataset_numerical, dataset_cat1], axis=1)
X = features.iloc[:len(dataset_train)]
X_holdout = features.iloc[len(dataset_train):]

# Hold out a third of the labelled rows to assess the model
clf = RandomForestClassifier(n_estimators=1000, random_state=0, max_depth=20)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
clf.fit(X_train, y_train)
prediction_train = clf.predict(X_test)
print(accuracy_score(y_test, prediction_train))
confusion_train = confusion_matrix(y_test, prediction_train)

# Refit on all labelled rows and write the submission file
clf.fit(X, y)
predictions = clf.predict(X_holdout)
result = pd.DataFrame(predictions, index=dataset_test['PassengerId'], columns=['Survived'])
result.to_csv("~/Kaggle/Titanic/submission.csv")
My training AUC using the RF classifier was 0.847; when I submitted on the test set, I achieved an AUC of 0.7599. Long way to go!
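For the AUC figures mentioned above, note that `roc_auc_score` wants class probabilities rather than hard labels. A minimal, self-contained sketch (using synthetic data in place of the engineered Titanic features, which is an assumption for illustration):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

# Synthetic stand-in for the engineered feature matrix (assumption)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + rng.normal(scale=0.5, size=200) > 0).astype(int)

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# AUC is computed from the positive-class probability, not from predict()
auc = roc_auc_score(y, clf.predict_proba(X)[:, 1])
```

Computing AUC on the training rows, as here, will be optimistic; the held-out split in the code above gives a fairer estimate.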
While you are at it, please check out this amazing Tableau dashboard created for the training dataset of this exercise!