Surviving the Titanic Disaster

Posted on February 08, 2018

Titanic: Machine Learning from Disaster is a Kaggle competition about the sinking of the infamous RMS Titanic. The challenge is to predict which passengers survived the disaster.

You can download the training data from here and import the dataset into Stata as a CSV file.

. import delimited train.csv
(12 vars, 891 obs)

From the 12 variables in the dataset I decide to use only 7 of them: survived (binary outcome variable), sex, age, pclass (ticket class), sibsp (number of siblings/spouses), parch (number of parents/children), and fare (passenger fare). I need all of my variables to be numeric, so I encode the string variable sex into a binary variable, bsex.

. egen bsex = group(sex)

Although the training dataset includes a total of 891 passengers, only 714 records are complete. Hereafter I simply discard the incomplete records. However, a more comprehensive approach would try to impute the missing values and make full use of the available data. For maximum performance, one would use all of the variables, or even construct additional, specially designed features to capture more subtle interactions. My goal, however, is different.
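If you want to verify the number of complete records yourself, Stata's misstable command, or a simple count over the variables actually used, will do; a minimal sketch (output omitted):

. misstable summarize survived bsex age pclass sibsp parch fare
. count if !missing(survived, bsex, age, pclass, sibsp, parch, fare)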

I want to fit and compare two models for predicting passenger survival: a logistic regression model and a multilayer perceptron. It is a comparison of a standard statistical methodology, logistic regression, with a very different analytical tool that lies outside mainstream statistics. Each has its pros and cons.

I use about two-thirds (first 600 records) of the dataset for training and one-third for validation. Then I compare the prediction performance of the two models.
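Because the split is simply by row order, it is worth confirming that the survival rate is similar in the two parts; a quick sketch, again with output omitted:

. summarize survived in 1/600
. summarize survived in 601/891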

A logistic regression model

Let's first consider a logistic regression model with outcome survived and predictor variables bsex, age, pclass, sibsp, parch, and fare. Fitting the model with the logit command is straightforward.

. logit survived bsex age pclass sibsp parch fare in 1/600, or

Iteration 0:   log likelihood = -321.42313  
Iteration 1:   log likelihood = -221.62832  
Iteration 2:   log likelihood = -220.06392  
Iteration 3:   log likelihood = -220.05786  
Iteration 4:   log likelihood = -220.05786  

Logistic regression                             Number of obs     =        474
                                                LR chi2(6)        =     202.73
                                                Prob > chi2       =     0.0000
Log likelihood = -220.05786                     Pseudo R2         =     0.3154

------------------------------------------------------------------------------
    survived | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        bsex |   .0738699   .0188864   -10.19   0.000     .0447549    .1219256
         age |   .9633228   .0094899    -3.79   0.000     .9449014    .9821034
      pclass |   .3360143   .0675789    -5.42   0.000     .2265505    .4983685
       sibsp |   .6957472   .1035633    -2.44   0.015      .519695    .9314389
       parch |   1.029911   .1696705     0.18   0.858     .7457102    1.422424
        fare |   .9984295   .0032647    -0.48   0.631     .9920513    1.004849
       _cons |    1872.28   1634.887     8.63   0.000     338.1398    10366.82
------------------------------------------------------------------------------
Note: _cons estimates baseline odds.

The results are unsurprising. Older age, having a sibling or spouse aboard, traveling in a lower class, and being a man all adversely affect your chances of survival. On the other hand, having a parent or child aboard appears to increase your chances, although that effect is not statistically significant. Sex and ticket class have the most impact. For example, the odds of survival of men vs. women are only about 7 to 100. Among the clear advantages of the logistic regression model are the interpretability of the regression coefficients and the rigorous assessment of their statistical significance through z-test p-values.
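To translate the odds ratios into predicted probabilities for concrete passenger profiles, one could follow the fit with the margins command; a minimal sketch, where the covariate values are purely illustrative and bsex = 1/2 is the female/male coding produced by egen group():

. margins, at(bsex=(1 2) pclass=(1 3) age=30 sibsp=0 parch=0 fare=32)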

The next step is to make predictions based on the just-fitted model. The predict command returns the predicted probability of survival, which I then round to a 0/1 classification. I assess the prediction performance on the last third of the dataset, observations 601 to 891, which was not used for training, and use prediction accuracy as the performance criterion.

. predict logitpred
. replace logitpred = floor(0.5+logitpred)
. generate match = logitpred == survived
. summarize match in 601/891

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
       match |        291    .6597938    .4745945          0          1

The validation accuracy of the logistic regression model is only about 66%, which is not great. In fact, had we simply predicted that all passengers perished, we would have been right in 64% of the cases. So, despite all of the insights it provides, the logistic model has little to offer in terms of effective prediction.
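Both figures are easy to check: the all-perished baseline is just the share of non-survivors in the validation range, and a cross-tabulation of actual against predicted outcomes shows where the model errs. A minimal sketch, output omitted:

. count if survived == 0 in 601/891
. display r(N)/291
. tabulate survived logitpred in 601/891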

A neural network predictive model

Let's now consider a multilayer perceptron with 2 hidden layers as a simple neural network predictive model and see how it compares to the logistic regression model. I specify the model using the mlp2 command introduced here and let each of the 2 hidden layers have 10 neurons. The only additional option I supply is lrate(0.5), which increases the optimizer's default learning rate.

. set seed 12345
. mlp2 fit survived bsex age pclass sibsp parch fare in 1/600, layer1(10) layer2(10) lrate(0.5)

------------------------------------------------------------------------------
Multilayer perceptron                              input variables =        6
                                                   layer1 neurons  =       10
                                                   layer2 neurons  =       10
Loss: softmax                                      output levels   =        2

Optimizer: sgd                                     batch size      =       50
                                                   max epochs      =      100
                                                   loss tolerance  =    .0001
                                                   learning rate   =       .5

Training ended:                                    epochs          =      100
                                                   start loss      =  .579942
                                                   end loss        =  .355872
------------------------------------------------------------------------------

After 100 optimization epochs, the default, the loss has decreased from about 0.58 to 0.36. Beyond the fact that some learning took place, this gives us little indication of the quality of the model fit. I chose hidden layers of this size specifically to keep the network small and avoid severe overfitting.
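A rough check for overfitting is to compare the model's accuracy on the training range with its accuracy on the validation range, reusing the mlp2 predict syntax shown below; a sketch, where trainpred is just an illustrative variable name and the output is omitted:

. mlp2 predict in 1/600, genvar(trainpred)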

Again, I use the last third of the dataset to validate the prediction performance of the model.

. mlp2 predict in 601/891, genvar(ypred)

Prediction accuracy: .8666666666666667

We achieve an accuracy of about 87%, which is notably better than the performance of the logistic regression model. To be fair, with 202 parameters (6 × 10 + 10 weights and biases into the first hidden layer, 10 × 10 + 10 into the second, and 10 × 2 + 2 into the output layer), the neural network is a substantially larger model than the 7-parameter logistic regression.

On the downside, the neural network tells us very little about predictor importance and statistical significance. Apart from the empirical evidence, there is no guarantee that the model is systematically better than logistic regression, which rules out its use in applications where statistical rigor is required.