Getting started: Kaggle's Titanic Competition

Get the data

The goal of the competition is to predict which passengers survived. The available data comes in two parts. The first part, train.csv, describes a set of passengers through attributes such as sex, cabin, age and class, together with whether each of them survived. The survival information is included because this data set is used to train a model that learns to decide whether a passenger survives. The second part, test.csv, describes another set of passengers, this time without the survival information; each passenger is identified only by an identification number. This data set is used to make predictions. Those predictions should follow the format of the provided gendermodel.csv.

We should also take a look at the data description provided on the competition page:

VARIABLE DESCRIPTIONS:
survival        Survival
                (0 = No; 1 = Yes)
pclass          Passenger Class
                (1 = 1st; 2 = 2nd; 3 = 3rd)
name            Name
sex             Sex
age             Age
sibsp           Number of Siblings/Spouses Aboard
parch           Number of Parents/Children Aboard
ticket          Ticket Number
fare            Passenger Fare
cabin           Cabin
embarked        Port of Embarkation
                (C = Cherbourg; Q = Queenstown; S = Southampton)

SPECIAL NOTES:
Pclass is a proxy for socio-economic status (SES)
 1st ~ Upper; 2nd ~ Middle; 3rd ~ Lower

Age is in Years; Fractional if Age less than One (1)
 If the Age is Estimated, it is in the form xx.5

With respect to the family relation variables (i.e. sibsp and parch)
some relations were ignored.  The following are the definitions used
for sibsp and parch.

Sibling:  Brother, Sister, Stepbrother, or Stepsister of Passenger Aboard Titanic
Spouse:   Husband or Wife of Passenger Aboard Titanic (Mistresses and Fiances Ignored)
Parent:   Mother or Father of Passenger Aboard Titanic
Child:    Son, Daughter, Stepson, or Stepdaughter of Passenger Aboard Titanic

Other family relatives excluded from this study include cousins,
nephews/nieces, aunts/uncles, and in-laws.  Some children travelled
only with a nanny, therefore parch=0 for them.  As well, some
travelled with very close friends or neighbors in a village, however,
the definitions do not support such relations.

The first step in our adventure is to download those three data files in csv format. You can get them from the data section of the competition. Let's suppose you downloaded them into a local folder. We will call it the data folder, although it can have any name you like.
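
The code snippets later in this tutorial build file names as root + "train.csv", so root is assumed to be a string holding the path of the data folder, ending with a separator. A minimal sketch of preparing such a value with the standard library follows; the class name DataFolder, the helper asRoot and the folder name "data" are illustrative choices, not part of rapaio.

```java
import java.io.File;

final class DataFolder {

    // Returns the folder path with a trailing separator appended if it is missing,
    // so that root + "train.csv" yields a usable file path.
    static String asRoot(String folder) {
        return folder.endsWith("/") || folder.endsWith(File.separator)
                ? folder
                : folder + "/";
    }

    public static void main(String[] args) {
        String root = asRoot("data"); // "data" is a placeholder folder name
        for (String name : new String[]{"train.csv", "test.csv", "gendermodel.csv"}) {
            File f = new File(root + name);
            // Report whether each expected file is present in the data folder.
            System.out.println(f.getPath() + (f.isFile() ? " found" : " missing"));
        }
    }
}
```

Running the check before parsing makes a wrong folder name show up immediately, instead of as a parsing error later.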

Read train data from csv file

Because the data is small, we can load all of it into memory without problems.

Let's see how to load the data into memory. In rapaio, data sets are loaded as frames (rapaio.data.Frame). A frame is basically tabular data, with a column for each variable (feature) and a row for each instance (in our case, each passenger).

A first attempt at loading the train data set and inspecting the result is the following:

new Csv().read(root + "train.csv").printSummary();

Here the csv reader was instantiated, a frame was loaded from the train.csv file, and a method was called to print a summary of the loaded frame.

Frame Summary
=============
* rowCount: 891
* complete: 889/891
* varCount: 12
* varNames: 

 0. PassengerId : idx | 4.   Sex : nom |  8.   Ticket : nom |
 1.    Survived : bin | 5.   Age : nom |  9.     Fare : num |
 2.      Pclass : idx | 6. SibSp : idx | 10.    Cabin : nom |
 3.        Name : nom | 7. Parch : idx | 11. Embarked : nom |

      PassengerId    Survived           Pclass 
   Min. :   1.000     0 : 549     Min. : 1.000 
1st Qu. : 223.500     1 : 342  1st Qu. : 2.000 
 Median : 446.000  NA's :   0   Median : 3.000 
   Mean : 446.000                 Mean : 2.309 
2nd Qu. : 668.500              2nd Qu. : 3.000 
   Max. : 891.000                 Max. : 3.000 

                                                       Name           Sex            Age 
                            "Braund, Mr. Owen Harris" :   1    male : 577          : 177 
"Cumings, Mrs. John Bradley (Florence Briggs Thayer)" :   1  female : 314       24 :  30 
                             "Heikkinen, Miss. Laina" :   1                     22 :  27 
       "Futrelle, Mrs. Jacques Heath (Lily May Peel)" :   1                     18 :  26 
                           "Allen, Mr. William Henry" :   1                     28 :  25 
                                   "Moran, Mr. James" :   1                     19 :  25 
                                              (Other) : 885                (Other) : 581 
          SibSp            Parch          Ticket               Fare              Cabin 
   Min. : 0.000     Min. : 0.000    347082 :   7     Min. :   0.000              : 687 
1st Qu. : 0.000  1st Qu. : 0.000      1601 :   7  1st Qu. :   7.910           G6 :   4 
 Median : 0.000   Median : 0.000  CA. 2343 :   7   Median :  14.454  C23 C25 C27 :   4 
   Mean : 0.523     Mean : 0.382   3101295 :   6     Mean :  32.204      B96 B98 :   4 
2nd Qu. : 1.000  2nd Qu. : 0.000   CA 2144 :   6  2nd Qu. :  31.000          F33 :   3 
   Max. : 8.000     Max. : 6.000    347088 :   6     Max. : 512.329         E101 :   3 
                                   (Other) : 852                         (Other) : 186 
  Embarked 
   S : 644 
   C : 168 
   Q :  77 
NA's :   2

How can we interpret the output of the frame's summary?

  • We loaded a frame which has 891 rows and 12 columns (variables)
  • Of all the rows, 889 are complete (no missing data)
  • The names of the variables are listed, together with their types
  • A data summary for the frame follows: a six-number summary for numeric variables, and the most frequent levels for nominal variables
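
The "complete" line of the summary counts the rows in which no value is missing. A rough sketch of such a count in plain Java is given below. It is independent of rapaio's actual implementation; the class CompleteRows and the tiny sample table are made up for illustration, and the strings that count as missing are passed in explicitly.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class CompleteRows {

    // Counts the rows in which no cell equals one of the missing-value strings.
    static long count(List<String[]> rows, String... naValues) {
        Set<String> na = new HashSet<>(Arrays.asList(naValues));
        return rows.stream()
                .filter(row -> Arrays.stream(row).noneMatch(na::contains))
                .count();
    }

    public static void main(String[] args) {
        // Hypothetical rows with columns Age, Cabin, Embarked.
        List<String[]> rows = Arrays.asList(
                new String[]{"22", "", "S"},
                new String[]{"38", "C85", "C"},
                new String[]{"", "", "?"});
        System.out.println(count(rows, "?")); // 2: only "?" is treated as missing
        System.out.println(count(rows, ""));  // 1: empty strings are treated as missing
    }
}
```

The same table yields different counts depending on which strings are declared missing, which is why the choice of missing-value markers matters when reading this line of the summary.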

Let's inspect each variable and see how it fits our needs.

PassengerId

The type of this variable is index (integer values). This field looks like an identifier for the passenger, so from our point of view its ordering is not needed. What we can do, although it is not required, is change the field type to nominal. In any case, we do not need this field for learning: it should be unique for each instance, so it has no predictive power. We will ignore it for now.
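
Whether a column behaves like an identifier can be checked by comparing the number of distinct values with the number of rows. A small sketch in plain Java, with a hypothetical helper name and made-up sample values:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;

final class IdentifierCheck {

    // True when every value occurs exactly once, as expected for an identifier.
    static boolean allDistinct(List<String> values) {
        return new HashSet<>(values).size() == values.size();
    }

    public static void main(String[] args) {
        System.out.println(allDistinct(Arrays.asList("1", "2", "3")));           // true
        System.out.println(allDistinct(Arrays.asList("male", "female", "male"))); // false
    }
}
```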

Survived

This is our target variable. It is parsed as binary, but since we do classification, we will change its type to nominal. We do that directly at csv parsing time, by indicating that we want Survived parsed as a nominal variable:

new Csv()
        .withTypes(VarType.NOMINAL, "Survived")
        .read(root + "train.csv")
        .printSummary();

The output becomes

Frame Summary
=============
* rowCount: 891
* complete: 889/891
* varCount: 12
* varNames: 

 0. PassengerId : idx | 4.   Sex : nom |  8.   Ticket : nom |
 1.    Survived : nom | 5.   Age : nom |  9.     Fare : num |
 2.      Pclass : idx | 6. SibSp : idx | 10.    Cabin : nom |
 3.        Name : nom | 7. Parch : idx | 11. Embarked : nom |

      PassengerId  Survived           Pclass 
   Min. :   1.000   0 : 549     Min. : 1.000 
1st Qu. : 223.500   1 : 342  1st Qu. : 2.000 
 Median : 446.000             Median : 3.000 
   Mean : 446.000               Mean : 2.309 
2nd Qu. : 668.500            2nd Qu. : 3.000 
   Max. : 891.000               Max. : 3.000 

                                                       Name           Sex            Age 
                            "Braund, Mr. Owen Harris" :   1    male : 577          : 177 
"Cumings, Mrs. John Bradley (Florence Briggs Thayer)" :   1  female : 314       24 :  30 
                             "Heikkinen, Miss. Laina" :   1                     22 :  27 
       "Futrelle, Mrs. Jacques Heath (Lily May Peel)" :   1                     18 :  26 
                           "Allen, Mr. William Henry" :   1                     28 :  25 
                                   "Moran, Mr. James" :   1                     19 :  25 
                                              (Other) : 885                (Other) : 581 
          SibSp            Parch          Ticket               Fare              Cabin 
   Min. : 0.000     Min. : 0.000    347082 :   7     Min. :   0.000              : 687 
1st Qu. : 0.000  1st Qu. : 0.000      1601 :   7  1st Qu. :   7.910           G6 :   4 
 Median : 0.000   Median : 0.000  CA. 2343 :   7   Median :  14.454  C23 C25 C27 :   4 
   Mean : 0.523     Mean : 0.382   3101295 :   6     Mean :  32.204      B96 B98 :   4 
2nd Qu. : 1.000  2nd Qu. : 0.000   CA 2144 :   6  2nd Qu. :  31.000          F33 :   3 
   Max. : 8.000     Max. : 6.000    347088 :   6     Max. : 512.329         E101 :   3 
                                   (Other) : 852                         (Other) : 186 
  Embarked 
   S : 644 
   C : 168 
   Q :  77 
NA's :   2

Notice how the type of the Survived variable changed to nominal.

Pclass

This variable has index type. We can keep it as it is, or we can change it to nominal. Both ways can be useful. Parsed as index, the values carry an order: class 1 is lower than class 2, and class 2 lies between classes 1 and 3. Parsed as nominal, the ordering is not used. Let's choose nominal for now, considering that 1, 2 and 3 are just labels for the type of ticket, with no other meaning attached. We proceed in the same way:

new Csv()
        .withTypes(VarType.NOMINAL, "Survived", "Pclass")
        .read(root + "train.csv")
        .printSummary();

Notice that we appended the variable name after Survived. This is possible because the withTypes method takes a type followed by a variable-length list of strings (varargs) holding the names of the variables.
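
The varargs pattern described above can be sketched in plain Java. This is not rapaio's implementation; the enum Kind and the class TypeHints are hypothetical stand-ins that only show how one type followed by any number of names can be collected.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class TypeHints {

    // Hypothetical stand-in for a variable type enumeration.
    enum Kind { NOMINAL, INDEX, NUMERIC }

    private final Map<String, Kind> hints = new LinkedHashMap<>();

    // Accepts one type followed by any number of variable names.
    TypeHints withTypes(Kind kind, String... names) {
        for (String name : names) {
            hints.put(name, kind);
        }
        return this; // returning this allows calls to be chained
    }

    Map<String, Kind> hints() {
        return hints;
    }

    public static void main(String[] args) {
        TypeHints h = new TypeHints()
                .withTypes(Kind.NOMINAL, "Survived", "Pclass")
                .withTypes(Kind.NUMERIC, "Fare");
        System.out.println(h.hints()); // {Survived=NOMINAL, Pclass=NOMINAL, Fare=NUMERIC}
    }
}
```

With this shape, adding one more variable to an existing type only means appending its name to the argument list.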

Name

These are the passenger names, and the values are unique. As it stands, this field has no predictive power. We keep it as it is. Note that it contains valuable information, but not in this direct form.
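
One possible reading of that remark, which the text does not spell out, is that the names follow a "Surname, Title. Given names" pattern, so an honorific could be pulled out of each value. A minimal sketch under that assumption, using only java.util.regex; the class and method names are illustrative.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class NameTitle {

    // Captures the text between the first comma and the following period, e.g. "Mr".
    private static final Pattern TITLE = Pattern.compile(",\\s*([^.,]+)\\.");

    static Optional<String> extract(String name) {
        Matcher m = TITLE.matcher(name);
        return m.find() ? Optional.of(m.group(1).trim()) : Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(extract("Braund, Mr. Owen Harris")); // Optional[Mr]
        System.out.println(extract("Heikkinen, Miss. Laina"));  // Optional[Miss]
        System.out.println(extract("no title here"));           // Optional.empty
    }
}
```

Unlike the full name, such a derived value repeats across passengers, which is what makes it usable as a nominal variable.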

Sex

This field specifies the gender of the passenger. We have males and females.

Age

This field specifies the age of a passenger. We would expect this variable to be parsed as numeric, or at least as index, but it is nominal. Why did that happen? Notice that the values look like numbers. But the first value (the most frequent one, with 177 instances) has nothing specified. The reason the variable is nominal has to do with how Csv parsing handles missing values. By default, the csv parser considers only the string "?" to be a missing value. But the most frequent value in this field is the empty string "". This means that the empty string is not considered a missing value. Because the empty string cannot be parsed as a number, the variable is promoted to nominal.

We can customize the missing value handling by specifying the valid strings for that purpose. We use .withNAValues(String...naValues) to tell the parser all the strings which denote missing values. In our case we want just the empty string to be a missing value. When the parser finds an empty string, it sets the variable value as missing. It does not promote the variable to nominal, since a missing value is a legal value.

new Csv()
        .withNAValues("")
        .withTypes(VarType.NOMINAL, "Survived", "Pclass")
        .read(root + "train.csv")
        .printSummary();

Frame Summary
=============
* rowCount: 891
* complete: 183/891
* varCount: 12
* varNames: 

 0. PassengerId : idx | 4.   Sex : nom |  8.   Ticket : nom |
 1.    Survived : nom | 5.   Age : num |  9.     Fare : num |
 2.      Pclass : nom | 6. SibSp : idx | 10.    Cabin : nom |
 3.        Name : nom | 7. Parch : idx | 11. Embarked : nom |

      PassengerId  Survived   Pclass 
   Min. :   1.000   0 : 549  3 : 491 
1st Qu. : 223.500   1 : 342  1 : 216 
 Median : 446.000            2 : 184 
   Mean : 446.000                    
2nd Qu. : 668.500                    
   Max. : 891.000                    

                                                       Name           Sex 
                            "Braund, Mr. Owen Harris" :   1    male : 577 
"Cumings, Mrs. John Bradley (Florence Briggs Thayer)" :   1  female : 314 
                             "Heikkinen, Miss. Laina" :   1               
       "Futrelle, Mrs. Jacques Heath (Lily May Peel)" :   1               
                           "Allen, Mr. William Henry" :   1               
                                   "Moran, Mr. James" :   1               
                                              (Other) : 885               
             Age            SibSp            Parch          Ticket               Fare 
   Min. :  0.420     Min. : 0.000     Min. : 0.000    347082 :   7     Min. :   0.000 
1st Qu. : 20.125  1st Qu. : 0.000  1st Qu. : 0.000      1601 :   7  1st Qu. :   7.910 
 Median : 28.000   Median : 0.000   Median : 0.000  CA. 2343 :   7   Median :  14.454 
   Mean : 29.699     Mean : 0.523     Mean : 0.382   3101295 :   6     Mean :  32.204 
2nd Qu. : 38.000  2nd Qu. : 1.000  2nd Qu. : 0.000   CA 2144 :   6  2nd Qu. :  31.000 
   Max. : 80.000     Max. : 8.000     Max. : 6.000    347088 :   6     Max. : 512.329 
   NA's :    177                                     (Other) : 852                    
            Cabin    Embarked 
         G6 :   4     S : 644 
C23 C25 C27 :   4     C : 168 
    B96 B98 :   4     Q :  77 
        F33 :   3  NA's :   2 
       E101 :   3             
    (Other) : 183             
       NA's : 687

Notice what happened: the Age field is now numeric and it contains missing values.
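
The promotion rule described in this section can be imitated in a few lines of plain Java. The sketch below is not rapaio's parser; the enum and method names are hypothetical, and it only assumes the behaviour described in the text: values declared as missing are skipped, and the column falls back to nominal as soon as a remaining value cannot be parsed as a number.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class TypeInference {

    // Hypothetical type names, ordered from narrowest to widest.
    enum Kind { INDEX, NUMERIC, NOMINAL }

    // Picks the narrowest type that can hold all non-missing values.
    static Kind infer(List<String> values, String... naValues) {
        Set<String> na = new HashSet<>(Arrays.asList(naValues));
        Kind kind = Kind.INDEX;
        for (String v : values) {
            if (na.contains(v)) {
                continue; // a missing value is legal for every type
            }
            if (kind == Kind.INDEX && !isInteger(v)) {
                kind = Kind.NUMERIC; // not an integer, try a wider type
            }
            if (kind == Kind.NUMERIC && !isDouble(v)) {
                return Kind.NOMINAL; // not a number at all
            }
        }
        return kind;
    }

    private static boolean isInteger(String v) {
        try { Integer.parseInt(v); return true; } catch (NumberFormatException e) { return false; }
    }

    private static boolean isDouble(String v) {
        try { Double.parseDouble(v); return true; } catch (NumberFormatException e) { return false; }
    }

    public static void main(String[] args) {
        List<String> age = Arrays.asList("22", "", "0.42", "38");
        System.out.println(infer(age, "?")); // NOMINAL: "" is an ordinary, non-numeric value
        System.out.println(infer(age, ""));  // NUMERIC: "" is skipped as missing
    }
}
```

The only difference between the two calls is the list of missing-value strings, which mirrors the effect of the NA option on the Age variable.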

SibSp

Its meaning is "siblings/spouses". It is parsed as index, which is natural; counting something like a "quarter of a wife" would take a rather morbid imagination.

Parch

Its meaning is "parents/children". It is naturally parsed as index.

Ticket

This is the code of the ticket. Members of a family probably share the same ticket, which would explain why the frequencies go up to 7. This field is nominal. Used directly, it has low predictive power. It may contain valuable information, but in its raw form it would not help much.
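
The per-level frequencies shown in the summary can be reproduced with a simple grouping. A sketch in plain Java using a handful of ticket codes as sample input; the class name is illustrative and the counts below belong to the sample, not to the real data set.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;
import java.util.stream.Collectors;

final class LevelCounts {

    // Maps each distinct value to the number of times it occurs, sorted by value.
    static Map<String, Long> count(List<String> values) {
        return values.stream()
                .collect(Collectors.groupingBy(Function.identity(), TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        List<String> tickets = Arrays.asList("347082", "1601", "347082", "113803", "347082");
        System.out.println(count(tickets)); // {113803=1, 1601=1, 347082=3}
    }
}
```

Codes with a count above one are the candidates for passengers travelling together on a shared ticket.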

Fare

This is the price of the passenger's fare and should be numeric, as it is.

Cabin

Code of the passenger's cabin, parsed as nominal. The same notes apply as for the Ticket variable.

Embarked

Code for the embarking city, which could be: C = Cherbourg, Q = Queenstown, S = Southampton. It's parsed as nominal and has missing values.

If we are content with our parsing, we load the data into a frame for later use:

Frame train = new Csv()
        .withNAValues("")
        .withTypes(VarType.NOMINAL, "Survived", "Pclass")
        .read(root + "train.csv");
train.printSummary();

Read test data from csv file

Once we have a training frame, we can also load the test data. We do that to take a look at the frame, and because the data is small there is no memory or time cost associated with it. To avoid repeating the csv options and to get identical levels for nominal variables, we parse the data set in a different way: we specify the variable types through a frame template:

Frame test = new Csv()
        .withNAValues("")
        .withTemplate(train)
        .read(root + "test.csv");

Instead of specifying the preferred variable types again, we use the train frame as a template for variable types. This also has the side effect that the encoding of categorical variables is identical.

Note that the test frame no longer has the Survived variable. This is correct, since this is what we have to predict. Note also that the types of the remaining variables are the same as in the training data set.
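
The remark about identical encoding can be illustrated without rapaio. In the sketch below, a nominal variable is stored as integer codes that index into a list of labels; the class LevelDictionary is hypothetical. Reusing the training dictionary for the test values keeps the code of each label the same, whereas building a fresh dictionary from the test values alone may not.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class LevelDictionary {

    private final List<String> levels = new ArrayList<>();

    // Returns the code of a label, registering the label on first use.
    int encode(String label) {
        int code = levels.indexOf(label);
        if (code < 0) {
            levels.add(label);
            code = levels.size() - 1;
        }
        return code;
    }

    // Encodes the labels in order, using and extending this dictionary.
    int[] encodeAll(List<String> labels) {
        int[] codes = new int[labels.size()];
        for (int i = 0; i < codes.length; i++) {
            codes[i] = encode(labels.get(i));
        }
        return codes;
    }

    public static void main(String[] args) {
        List<String> trainSex = Arrays.asList("male", "female", "female");
        List<String> testSex = Arrays.asList("female", "male");

        // Dictionary built on the training values, then reused for the test values.
        LevelDictionary shared = new LevelDictionary();
        shared.encodeAll(trainSex);
        System.out.println(Arrays.toString(shared.encodeAll(testSex)));                // [1, 0]

        // Dictionary built on the test values alone assigns different codes.
        System.out.println(Arrays.toString(new LevelDictionary().encodeAll(testSex))); // [0, 1]
    }
}
```

A model trained on the codes of the first dictionary would misread the codes of the second one, which is why sharing the levels between the two frames matters.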
