- Sat 22 July 2017
- Data Science
- M Hendra Herviawan
- #R
set.seed(1111)
library('dplyr')
library('readr')
library('pander')
library('ggplot2')
library('ggthemes')
library('VIM')
library('caret')
library('rpart')
library('rpart.plot')
library('glmnet')
library('xgboost')
# Dataset downloaded from https://www.kaggle.com/c/titanic/data
train <- read_csv('_dataset//train.csv')
test <- read_csv('_dataset//test.csv')
train[1:5, ]
##
## -------------------------------------------------------------------------------
## PassengerId Survived Pclass Name Sex Age
## ------------- ---------- -------- ------------------------------ -------- -----
## 1 0 3 Braund, Mr. Owen Harris male 22
##
## 2 1 1 Cumings, Mrs. John Bradley female 38
## (Florence Briggs Thayer)
##
## 3 1 3 Heikkinen, Miss. Laina female 26
##
## 4 1 1 Futrelle, Mrs. Jacques Heath female 35
## (Lily May Peel)
##
## 5 0 3 Allen, Mr. William Henry male 35
## -------------------------------------------------------------------------------
##
## Table: Table continues below
##
##
## -------------------------------------------------------------
## SibSp Parch Ticket Fare Cabin Embarked
## ------- ------- ------------------ ------- ------- ----------
## 1 0 A/5 21171 7.25 NA S
##
## 1 0 PC 17599 71.28 C85 C
##
## 0 0 STON/O2. 3101282 7.925 NA S
##
## 1 0 113803 53.1 C123 S
##
## 0 0 373450 8.05 NA S
## -------------------------------------------------------------
1. Can you explain what are the characteristics of this analysis?
It is a type of supervised learning, a method of machine learning in which the categories are predefined and new observations are assigned to one of those categories. When there are only two categories the problem is known as binary classification. Binary classification divides the data into two groups (Survived vs. Not Survived); for example, whether or not a customer buys a particular product (Yes/No), based on independent variables such as gender, age, location, etc.
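For example, the target in this dataset takes only two values; a quick check on the training data (not part of the original snippets):
# Survived has two classes: 0 (did not survive) and 1 (survived);
# the counts of 549 and 342 also appear in the summary further below.
table(train$Survived)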
2. What information in the data can be considered as features?
In general we can categorize features as structured vs. unstructured. Structured data is comprised of clearly defined data types (numeric, string, category, etc.) whose pattern makes them easy to understand, while unstructured data ("everything else") is comprised of data that is usually not as easily understood and needs preprocessing before it becomes useful.
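The summary below uses a combine data frame that is not built in the snippets above; a minimal sketch of how it is presumably constructed (an assumption, consistent with the factor columns shown in the output):
# Assumption: 'combine' is the row-bind of train and test, with categorical
# columns converted to factors (test has no Survived column, so it becomes NA).
combine <- bind_rows(train, test) %>%
  mutate(Survived = factor(Survived),
         Pclass   = factor(Pclass),
         Sex      = factor(Sex),
         Embarked = factor(Embarked))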
summary(combine)
## PassengerId Survived Pclass Name Sex
## Min. : 1 0 :549 1:323 Length:1309 female:466
## 1st Qu.: 328 1 :342 2:277 Class :character male :843
## Median : 655 NA's:418 3:709 Mode :character
## Mean : 655
## 3rd Qu.: 982
## Max. :1309
##
## Age SibSp Parch Ticket
## Min. : 0.17 Min. :0.0000 Min. :0.000 Length:1309
## 1st Qu.:21.00 1st Qu.:0.0000 1st Qu.:0.000 Class :character
## Median :28.00 Median :0.0000 Median :0.000 Mode :character
## Mean :29.88 Mean :0.4989 Mean :0.385
## 3rd Qu.:39.00 3rd Qu.:1.0000 3rd Qu.:0.000
## Max. :80.00 Max. :8.0000 Max. :9.000
## NA's :263
## Fare Cabin Embarked
## Min. : 0.000 Length:1309 C :270
## 1st Qu.: 7.896 Class :character Q :123
## Median : 14.454 Mode :character S :914
## Mean : 33.295 NA's: 2
## 3rd Qu.: 31.275
## Max. :512.329
## NA's :1
In the Titanic dataset, PassengerId, Name, and Ticket can be considered unstructured because we need preprocessing to understand the meaning behind the sequence of digits/characters.
To help examine feature importance in the structured data we can fit a decision tree. The importance score indicates how useful or valuable each feature was in the construction of the tree.
# Use a decision tree to get feature importance
x <- select(train, -c(PassengerId, Name, Ticket, Cabin))
Model_DT <- rpart(Survived ~ ., data = x, method = "class")
rpart.plot(Model_DT, extra = 3, fallen.leaves = TRUE)
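Besides the plot, the fitted rpart object also exposes the raw importance scores described above (a small addition, not in the original):
# Named numeric vector of importance scores, highest first.
Model_DT$variable.importance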
3. Do you need to create any new features?
Looking at the decision tree above, it appears that Sex, Age & Pclass are the three most important features. When building the decision tree we can only use the structured features, a limitation of the decision tree algorithm; text fields such as Name must be transformed first.
head(combine$Name)
## [1] "Braund, Mr. Owen Harris"
## [2] "Cumings, Mrs. John Bradley (Florence Briggs Thayer)"
## [3] "Heikkinen, Miss. Laina"
## [4] "Futrelle, Mrs. Jacques Heath (Lily May Peel)"
## [5] "Allen, Mr. William Henry"
## [6] "Moran, Mr. James"
If we examine the Name feature closely, it follows the pattern "Last name, Title. First name". We can use the title as a categorical feature because the title encodes sex, age & Pclass/wealth, the three most important features according to the decision tree.
names <- combine$Name
Title <- gsub("^.*, (.*?)\\..*$", "\\1", names)
combine$Title <- Title
combine$Title[combine$Title == 'Mlle'] <- 'Miss'
combine$Title[combine$Title == 'Ms'] <- 'Miss'
combine$Title[combine$Title == 'Mme'] <- 'Mrs'
combine$Title[combine$Title == 'Lady'] <- 'Miss'
combine$Title[combine$Title == 'Dona'] <- 'Miss'
# Titles with very low cell counts to be combined to "others" level
others_Title <- c('the Countess','Capt', 'Col', 'Don',
'Dr', 'Major', 'Rev', 'Sir', 'Jonkheer')
combine$Title[combine$Title %in% others_Title] <- 'Others'
combine$Title <- factor(combine$Title)
#table(combine$Title)
Let's check which of Mr, Master, Miss, etc. has the better survival rate.
ggplot(combine[1:891,],aes(x = Title,fill=factor(Survived))) +
geom_bar() +
ggtitle("Title vs Survival rate")+
xlab("Title") +
ylab("Total Count") +
labs(fill = "Survived")
4. Do you need to scale any of the features?
We need to scale or normalize features when using algorithms that are sensitive to feature magnitude or data skewness, such as neural networks / deep learning. In this assignment we do not need to scale the data because we will use algorithms based on decision trees, which are insensitive to monotonic transformations of the features.
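For reference, if scaling were needed, caret's preProcess could center and scale the numeric columns; a minimal sketch (not applied in this analysis):
# Sketch only: center and scale numeric features with caret.
# Age is left out here because it still contains missing values at this point.
num_cols     <- c("SibSp", "Parch", "Fare")
prep         <- preProcess(train[, num_cols], method = c("center", "scale"))
train_scaled <- predict(prep, train[, num_cols])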
5. Please recap the missing values in the dataset. What will you do with the missing data?
aggr(combine, prop = FALSE, combined = TRUE, numbers = TRUE, sortVars = TRUE, sortCombs = TRUE)
##
## Variables sorted by number of missings:
## Variable Count
## Cabin 1014
## Survived 418
## Age 263
## Embarked 2
## Fare 1
## PassengerId 0
## Pclass 0
## Name 0
## Sex 0
## SibSp 0
## Parch 0
## Ticket 0
## Title 0
5.1 Cabin
We will drop this feature because 1,014 of the 1,309 values are missing.
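The removal itself does not appear in the original snippets (Cabin still shows up in the listings below, so it is presumably dropped just before modelling); a one-line sketch with dplyr:
# Sketch: drop the mostly-missing Cabin column before modelling.
combine <- select(combine, -Cabin)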
5.2 Embarked
filter(combine, is.na(Embarked))
##
## ----------------------------------------------------------------------------
## PassengerId Survived Pclass Name Sex Age
## ------------- ---------- -------- --------------------------- -------- -----
## 62 1 1 Icard, Miss. Amelie female 38
##
## 830 1 1 Stone, Mrs. George Nelson female 62
## (Martha Evelyn)
## ----------------------------------------------------------------------------
##
## Table: Table continues below
##
##
## ----------------------------------------------------------
## SibSp Parch Ticket Fare Cabin Embarked Title
## ------- ------- -------- ------ ------- ---------- -------
## 0 0 113572 80 B28 NA Miss
##
## 0 0 113572 80 B28 NA Mrs
## ----------------------------------------------------------
Passengers 62 & 830 have the same Ticket, Fare and Cabin number, so they most likely embarked at the same port. We can compare their fare with the median Fare per embarkation port to fill this missing value.
# Exclude the two passengers with missing Embarked
embark_fare <- combine %>%
filter(PassengerId != 62 & PassengerId != 830)
# Use ggplot2 to visualize embarkment, passenger class, & median fare
ggplot(embark_fare, aes(x = Embarked, y = Fare, fill = factor(Pclass))) +
geom_boxplot() +
geom_hline(aes(yintercept=80),
colour='red', linetype='dashed', lwd=2) +
scale_y_continuous() +
theme_few()
## Warning: Removed 1 rows containing non-finite values (stat_boxplot).
Since their fare was $80 and they travelled first class, they most likely embarked from 'C'.
combine$Embarked[c(62, 830)] <- 'C'
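As a cross-check (not part of the original), the median first-class fare per embarkation port can be computed directly:
# Median fare paid by first-class passengers at each port of embarkation.
combine %>%
  filter(Pclass == 1, !is.na(Embarked)) %>%
  group_by(Embarked) %>%
  summarise(median_fare = median(Fare, na.rm = TRUE))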
5.3 Fare
We will use the median Fare grouped by Pclass to fill the missing Fare, since Fare and Pclass have a strong relationship.
filter(combine, is.na(Fare))
##
## ----------------------------------------------------------------------------
## PassengerId Survived Pclass Name Sex Age SibSp
## ------------- ---------- -------- -------------------- ------ ------ -------
## 1044 NA 3 Storey, Mr. Thomas male 60.5 0
## ----------------------------------------------------------------------------
##
## Table: Table continues below
##
##
## --------------------------------------------------
## Parch Ticket Fare Cabin Embarked Title
## ------- -------- ------ ------- ---------- -------
## 0 3701 NA NA S Mr
## --------------------------------------------------
med_fare_3 <- combine %>%
filter(!is.na(Fare)) %>%
group_by(Pclass) %>%
summarise(med_fare = median(Fare)) %>%
filter(Pclass == 3) %>%
.$med_fare
combine <- combine %>%
mutate(Fare = case_when(
is.na(Fare) ~ med_fare_3,
TRUE ~ Fare
))
5.4 Age
We will use the median Age of the other passengers grouped by Title, since Age has a strong relationship with Title.
group_by(combine, Title) %>%
summarise(median = median(Age, na.rm = TRUE))
##
## -----------------
## Title median
## -------- --------
## Master 4
##
## Miss 22
##
## Mr 29
##
## Mrs 35
##
## Others 48
## -----------------
combine <- combine %>%
group_by(Title) %>%
mutate(Age = ifelse(is.na(Age), round(median(Age, na.rm = TRUE), 1), Age))
6. What are the first three algorithms that you will use to create a prediction model for this dataset?
- Decision Tree
- Logistic Regression
- XGBoost
6.1 Decision Tree
A benefit of decision-tree methods (and tree ensembles such as gradient boosting) is that they automatically provide estimates of feature importance from the trained predictive model.
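The models below are fit on a train_val / test_val split that is not constructed in the snippets above; a minimal sketch of how it is presumably created, assuming the engineered features from combine and a roughly 80/20 partition (both assumptions):
# Sketch: take the labelled rows, keep the modelling features, and split ~80/20.
train_data <- combine[1:891, ] %>%
  ungroup() %>%
  select(Survived, Pclass, Sex, Age, SibSp, Parch, Fare, Embarked, Title)
idx       <- as.vector(createDataPartition(train_data$Survived, p = 0.8, list = FALSE))
train_val <- train_data[idx, ]
test_val  <- train_data[-idx, ]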
Model_DT <- rpart(Survived ~ ., data = train_val, method = "class")
rpart.plot(Model_DT, extra = 3, fallen.leaves = TRUE)
PRE_TDT <- predict(Model_DT, newdata = train_val, type = "class")
#confusionMatrix(PRE_TDT,train_val$Survived)
confusionMatrix(PRE_TDT,train_val$Survived)$overall[1]
## Accuracy
## 0.8417367
6.2 Logistic Regression
We can also use logistic regression to understand the functional relationship between the independent variables and the dependent variable, and to try to understand what might cause the probability of the dependent variable to change.
log.mod <- glm(Survived ~ ., family = binomial(link=logit),
data = train_val)
###Check the summary
summary(log.mod)
##
## Call:
## glm(formula = Survived ~ ., family = binomial(link = logit),
## data = train_val)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.4137 -0.5832 -0.3970 0.5420 2.4913
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 18.084814 535.412123 0.034 0.9731
## Pclass2 -0.807673 0.357734 -2.258 0.0240 *
## Pclass3 -1.919969 0.356317 -5.388 7.11e-08 ***
## Sexmale -14.012905 535.411648 -0.026 0.9791
## Age -0.026667 0.010885 -2.450 0.0143 *
## SibSp -0.567958 0.142798 -3.977 6.97e-05 ***
## Parch -0.332783 0.146378 -2.273 0.0230 *
## Fare 0.006090 0.003488 1.746 0.0808 .
## EmbarkedQ 0.041227 0.443593 0.093 0.9260
## EmbarkedS -0.366009 0.271234 -1.349 0.1772
## TitleMiss -14.725697 535.411961 -0.028 0.9781
## TitleMr -3.482775 0.622842 -5.592 2.25e-08 ***
## TitleMrs -13.884327 535.412100 -0.026 0.9793
## TitleOthers -3.799501 0.933596 -4.070 4.71e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 950.86 on 713 degrees of freedom
## Residual deviance: 593.40 on 700 degrees of freedom
## AIC: 621.4
##
## Number of Fisher Scoring iterations: 12
test.probs <- predict(log.mod, newdata=test_val,type = "response")
table(test_val$Survived,test.probs>0.5)
##
## FALSE TRUE
## 0 98 11
## 1 16 52
print ("Accuracy: "); (96+49) / (96+13+19+49)
## [1] "Accuracy: "
## [1] 0.819209
6.3 XGBoost
XGBoost is used in a number of winning Kaggle solutions. It employs a number of tricks that make it faster and more accurate than traditional gradient boosting. Like other tree ensembles, XGBoost can automatically provide estimates of feature importance from the trained predictive model.
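The call below references dtrain and dtest objects that are not built above; a minimal sketch of how they might be constructed from the same split, assuming factors are encoded as integer codes via data.matrix (an assumption):
# Sketch: xgboost needs numeric matrices; data.matrix turns factors into integer codes.
dtrain <- xgb.DMatrix(data  = data.matrix(select(train_val, -Survived)),
                      label = as.numeric(as.character(train_val$Survived)))
dtest  <- xgb.DMatrix(data  = data.matrix(select(test_val, -Survived)),
                      label = as.numeric(as.character(test_val$Survived)))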
# Params for xgboost
param <- list(booster = "gbtree",
#eval_metric = "auc",
objective = "binary:logistic",
eta = .11,
gamma = 1,
max_depth = 6,
min_child_weight = 1,
subsample = .7,
colsample_bytree = .7)
xgb_model <- xgb.train(data = dtrain,
params = param,
watchlist = list(train = dtrain, test = dtest),
nrounds = 72,
verbose = 1,
print_every_n = 9)
## [1] train-error:0.142857 test-error:0.169492
## [10] train-error:0.121849 test-error:0.158192
## [19] train-error:0.116246 test-error:0.175141
## [28] train-error:0.105042 test-error:0.158192
## [37] train-error:0.099440 test-error:0.169492
## [46] train-error:0.100840 test-error:0.158192
## [55] train-error:0.098039 test-error:0.163842
## [64] train-error:0.096639 test-error:0.152542
## [72] train-error:0.093838 test-error:0.163842
# Get the feature real names
names <- dimnames(select(train_val, -c(Survived)))[[2]]
# Compute feature importance matrix
importance_matrix <- xgb.importance(names, model = xgb_model)
# Plotting
xgb.plot.importance(importance_matrix)
7. Do you notice any outliers or noise in the data?
There are no real outliers in this dataset. One passenger paid about $500 for a ticket, but this still looks plausible because he/she travelled in 1st class. We can use a box plot to look for outliers.
ggplot(train, aes(factor(Pclass), Fare, colour = factor(Survived))) +
  geom_boxplot()
8. How do you measure the quality of your prediction?
A binary classification model predicts the probability that the target variable is Yes/No. To evaluate such a model we use the confusion matrix, also called the classification or coincidence matrix. With the help of a confusion matrix, we can calculate important performance measures such as accuracy, sensitivity, and specificity.
confusionMatrix(PRE_TDT,train_val$Survived)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 403 76
## 1 37 198
##
## Accuracy : 0.8417
## 95% CI : (0.8129, 0.8678)
## No Information Rate : 0.6162
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.6562
## Mcnemar's Test P-Value : 0.0003506
##
## Sensitivity : 0.9159
## Specificity : 0.7226
## Pos Pred Value : 0.8413
## Neg Pred Value : 0.8426
## Prevalence : 0.6162
## Detection Rate : 0.5644
## Detection Prevalence : 0.6709
## Balanced Accuracy : 0.8193
##
## 'Positive' Class : 0
##
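The headline measures follow directly from the matrix above (the 'positive' class is 0); as a quick arithmetic check:
(403 + 198) / (403 + 76 + 37 + 198)  # Accuracy    = 0.8417
403 / (403 + 37)                     # Sensitivity = 0.9159 (actual 0s predicted as 0)
198 / (76 + 198)                     # Specificity = 0.7226 (actual 1s predicted as 1)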
9. Are there any insights you get when you explore the data?
If you were a man travelling in 3rd class, you most probably did not survive.
ggplot(combine[1:891,], aes(x = Sex, fill = factor(Survived))) +
geom_bar() +
facet_wrap(~Pclass) +
ggtitle("sex, pclass, and survival") +
xlab("Sex") +
ylab("Total Count") +
labs(fill = "Survived")