Getting started: Kaggle's Titanic Competition
Get the data
The goal of the competition is to predict which passengers survived. The available data comes in two parts. The first is train.csv, a data set that records what happened to some of the passengers, together with related information such as sex, cabin, age and class. It includes the survival outcome because it is used to train a model that learns to decide whether a passenger survives. The second is test.csv, which describes another set of passengers, this time without the survival outcome. Each of them does, however, have an identification number. This data set is used to make predictions, and those predictions should follow the format of the provided gendermodel.csv.
We also have to look at the data description provided on the competition's page:
VARIABLE DESCRIPTIONS:
survival        Survival
                (0 = No; 1 = Yes)
pclass          Passenger Class
                (1 = 1st; 2 = 2nd; 3 = 3rd)
name            Name
sex             Sex
age             Age
sibsp           Number of Siblings/Spouses Aboard
parch           Number of Parents/Children Aboard
ticket          Ticket Number
fare            Passenger Fare
cabin           Cabin
embarked        Port of Embarkation
                (C = Cherbourg; Q = Queenstown; S = Southampton)
SPECIAL NOTES:
Pclass is a proxy for socio-economic status (SES)
1st ~ Upper; 2nd ~ Middle; 3rd ~ Lower
Age is in Years; Fractional if Age less than One (1)
If the Age is Estimated, it is in the form xx.5
With respect to the family relation variables (i.e. sibsp and parch)
some relations were ignored. The following are the definitions used
for sibsp and parch.
Sibling: Brother, Sister, Stepbrother, or Stepsister of Passenger Aboard Titanic
Spouse: Husband or Wife of Passenger Aboard Titanic (Mistresses and Fiances Ignored)
Parent: Mother or Father of Passenger Aboard Titanic
Child: Son, Daughter, Stepson, or Stepdaughter of Passenger Aboard Titanic
Other family relatives excluded from this study include cousins,
nephews/nieces, aunts/uncles, and in-laws. Some children travelled
only with a nanny, therefore parch=0 for them. As well, some
travelled with very close friends or neighbors in a village, however,
the definitions do not support such relations.
The first step in our adventure is to download those three data files in CSV format. You can do so from the data section of the competition. Let's suppose you downloaded them into a local folder. We will call it the data folder, although it can have any name you like.
Read train data from csv file
Because the data set is small, we can load all of it into memory without problems.
Let's see how to do that. In rapaio, data sets are loaded as frames (rapaio.data.Frame). A frame is essentially tabular data, with one column for each variable (feature) and one row for each instance (in our case, each passenger).
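To make the idea of a frame more concrete, here is a minimal plain-Java sketch of a column-oriented table built from CSV text. It is only an illustration: TinyFrame, splitLine and parse are hypothetical names introduced here, not part of rapaio, and the parser ignores details such as escaped quotes. The two sample rows are illustrative.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; this is not rapaio's Frame implementation.
final class TinyFrame {
    // One list of raw string values per variable, keyed by variable name in file order.
    final Map<String, List<String>> columns = new LinkedHashMap<>();
    int rowCount = 0;

    // Splits one CSV line on commas, honoring double-quoted fields
    // such as "Braund, Mr. Owen Harris". Escaped quotes are not handled.
    static List<String> splitLine(String line) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        for (char c : line.toCharArray()) {
            if (c == '"') {
                inQuotes = !inQuotes;
            } else if (c == ',' && !inQuotes) {
                fields.add(current.toString());
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        fields.add(current.toString());
        return fields;
    }

    // Builds a frame from CSV text whose first line is the header.
    static TinyFrame parse(String csv) {
        TinyFrame frame = new TinyFrame();
        String[] lines = csv.split("\n");
        List<String> header = splitLine(lines[0]);
        for (String name : header) {
            frame.columns.put(name, new ArrayList<>());
        }
        for (int i = 1; i < lines.length; i++) {
            List<String> fields = splitLine(lines[i]);
            for (int j = 0; j < header.size(); j++) {
                // Short rows are padded with empty strings.
                String value = j < fields.size() ? fields.get(j) : "";
                frame.columns.get(header.get(j)).add(value);
            }
            frame.rowCount++;
        }
        return frame;
    }

    public static void main(String[] args) {
        String csv = "PassengerId,Survived,Name,Age\n"
                + "1,0,\"Braund, Mr. Owen Harris\",22\n"
                + "6,0,\"Moran, Mr. James\",\n";
        TinyFrame frame = parse(csv);
        System.out.println(frame.rowCount + " rows, " + frame.columns.size() + " variables");
        System.out.println(frame.columns.get("Name"));
    }
}
```

Each variable lives in its own column, which is what makes per-variable types and summaries natural to compute.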
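A first attempt at loading the training data set and inspecting the result looks like this (root is assumed to be a string holding the path to the data folder, ending with a path separator):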
new Csv().read(root + "train.csv").printSummary();
What happened is that a CSV reader was instantiated, a data frame was loaded from the train.csv file, and a method was called to print a summary of the loaded frame.
Frame Summary
=============
* rowCount: 891
* complete: 889/891
* varCount: 12
* varNames:
0. PassengerId : idx | 4. Sex : nom | 8. Ticket : nom |
1. Survived : bin | 5. Age : nom | 9. Fare : num |
2. Pclass : idx | 6. SibSp : idx | 10. Cabin : nom |
3. Name : nom | 7. Parch : idx | 11. Embarked : nom |
PassengerId Survived Pclass
Min. : 1.000 0 : 549 Min. : 1.000
1st Qu. : 223.500 1 : 342 1st Qu. : 2.000
Median : 446.000 NA's : 0 Median : 3.000
Mean : 446.000 Mean : 2.309
2nd Qu. : 668.500 2nd Qu. : 3.000
Max. : 891.000 Max. : 3.000
Name Sex Age
"Braund, Mr. Owen Harris" : 1 male : 577 : 177
"Cumings, Mrs. John Bradley (Florence Briggs Thayer)" : 1 female : 314 24 : 30
"Heikkinen, Miss. Laina" : 1 22 : 27
"Futrelle, Mrs. Jacques Heath (Lily May Peel)" : 1 18 : 26
"Allen, Mr. William Henry" : 1 28 : 25
"Moran, Mr. James" : 1 19 : 25
(Other) : 885 (Other) : 581
SibSp Parch Ticket Fare Cabin
Min. : 0.000 Min. : 0.000 347082 : 7 Min. : 0.000 : 687
1st Qu. : 0.000 1st Qu. : 0.000 1601 : 7 1st Qu. : 7.910 G6 : 4
Median : 0.000 Median : 0.000 CA. 2343 : 7 Median : 14.454 C23 C25 C27 : 4
Mean : 0.523 Mean : 0.382 3101295 : 6 Mean : 32.204 B96 B98 : 4
2nd Qu. : 1.000 2nd Qu. : 0.000 CA 2144 : 6 2nd Qu. : 31.000 F33 : 3
Max. : 8.000 Max. : 6.000 347088 : 6 Max. : 512.329 E101 : 3
(Other) : 852 (Other) : 186
Embarked
S : 644
C : 168
Q : 77
NA's : 2
How can we interpret the output of the frame's summary?
- We loaded a frame which has 891 rows and 12 columns (variables)
- Of all the rows, 889 are complete (no missing data)
- The names of the variables are listed, together with their types
- A data summary for the frame follows: a six-number summary for numeric variables and the most frequent levels for nominal variables
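The quantities in the list above can be sketched in plain Java. The following is only an illustration under stated assumptions: SummarySketch and its methods are hypothetical names, missing values are represented by Double.NaN, and only part of the numeric summary (minimum, median, mean, maximum) is computed. It is not how rapaio computes its summary.

```java
import java.util.Arrays;

// Hypothetical illustration; Double.NaN stands for a missing value.
final class SummarySketch {
    // Counts the rows in which no cell is missing.
    static int countComplete(double[][] rows) {
        int complete = 0;
        for (double[] row : rows) {
            boolean ok = true;
            for (double v : row) {
                if (Double.isNaN(v)) {
                    ok = false;
                    break;
                }
            }
            if (ok) {
                complete++;
            }
        }
        return complete;
    }

    // Returns {min, median, mean, max} over the non-missing values,
    // or four NaN values when nothing is present.
    static double[] summary(double[] values) {
        double[] present = Arrays.stream(values)
                .filter(v -> !Double.isNaN(v))
                .sorted()
                .toArray();
        int n = present.length;
        if (n == 0) {
            return new double[]{Double.NaN, Double.NaN, Double.NaN, Double.NaN};
        }
        double median = n % 2 == 1
                ? present[n / 2]
                : (present[n / 2 - 1] + present[n / 2]) / 2.0;
        double mean = Arrays.stream(present).average().orElse(Double.NaN);
        return new double[]{present[0], median, mean, present[n - 1]};
    }

    public static void main(String[] args) {
        double[][] rows = {{22, 7.25}, {Double.NaN, 8.46}, {38, 71.28}};
        System.out.println("complete: " + countComplete(rows) + "/" + rows.length);
        System.out.println(Arrays.toString(summary(new double[]{22, Double.NaN, 38, 26})));
    }
}
```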
Let's inspect each variable and see how it fits our needs.
PassengerId
The type of this variable is index (integer values). The field looks like an identifier for the passenger, so its ordering carries no meaning for us. We could change the field type to nominal, but that is not required. In any case, we do not need this field for learning: it should be unique for each instance, so its predictive power is null. We will ignore it for now.
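A quick way to check the claim that an identifier is unique for each instance is to compare the number of distinct values with the number of rows. The sketch below does that in plain Java. UniqueCheck is a hypothetical helper, not a rapaio class, and the values are illustrative.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;

// Hypothetical helper for illustration only.
final class UniqueCheck {
    // True when every value occurs exactly once, as expected of an identifier column.
    static boolean allUnique(List<String> values) {
        return new HashSet<>(values).size() == values.size();
    }

    public static void main(String[] args) {
        // Identifier-like column: every value is different.
        System.out.println(allUnique(Arrays.asList("1", "2", "3")));
        // Ordinary nominal column: levels repeat.
        System.out.println(allUnique(Arrays.asList("male", "female", "male")));
    }
}
```

A column that passes this check splits the training data into one group per row, which is why it cannot generalize to unseen passengers.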
Survived
This is our target variable. It is parsed as binary, but since we are doing classification we will change its type to nominal. We do that directly during CSV parsing, by indicating that we want Survived parsed as a nominal variable:
new Csv()
        .withTypes(VarType.NOMINAL, "Survived")
        .read(root + "train.csv")
        .printSummary();
The output becomes
Frame Summary
=============
* rowCount: 891
* complete: 889/891
* varCount: 12
* varNames:
0. PassengerId : idx | 4. Sex : nom | 8. Ticket : nom |
1. Survived : nom | 5. Age : nom | 9. Fare : num |
2. Pclass : idx | 6. SibSp : idx | 10. Cabin : nom |
3. Name : nom | 7. Parch : idx | 11. Embarked : nom |
PassengerId Survived Pclass
Min. : 1.000 0 : 549 Min. : 1.000
1st Qu. : 223.500 1 : 342 1st Qu. : 2.000
Median : 446.000 Median : 3.000
Mean : 446.000 Mean : 2.309
2nd Qu. : 668.500 2nd Qu. : 3.000
Max. : 891.000 Max. : 3.000
Name Sex Age
"Braund, Mr. Owen Harris" : 1 male : 577 : 177
"Cumings, Mrs. John Bradley (Florence Briggs Thayer)" : 1 female : 314 24 : 30
"Heikkinen, Miss. Laina" : 1 22 : 27
"Futrelle, Mrs. Jacques Heath (Lily May Peel)" : 1 18 : 26
"Allen, Mr. William Henry" : 1 28 : 25
"Moran, Mr. James" : 1 19 : 25
(Other) : 885 (Other) : 581
SibSp Parch Ticket Fare Cabin
Min. : 0.000 Min. : 0.000 347082 : 7 Min. : 0.000 : 687
1st Qu. : 0.000 1st Qu. : 0.000 1601 : 7 1st Qu. : 7.910 G6 : 4
Median : 0.000 Median : 0.000 CA. 2343 : 7 Median : 14.454 C23 C25 C27 : 4
Mean : 0.523 Mean : 0.382 3101295 : 6 Mean : 32.204 B96 B98 : 4
2nd Qu. : 1.000 2nd Qu. : 0.000 CA 2144 : 6 2nd Qu. : 31.000 F33 : 3
Max. : 8.000 Max. : 6.000 347088 : 6 Max. : 512.329 E101 : 3
(Other) : 852 (Other) : 186
Embarked
S : 644
C : 168
Q : 77
NA's : 2
Notice how the type of the Survived variable changed to nominal.
Pclass
This variable has index type. We can keep it as it is or change it to nominal; both can be useful. Parsed as index, its values carry an order: class 1 is lower than class 2, and class 2 lies between classes 1 and 3. Parsed as nominal, the ordering is not used. Let's choose nominal for now, treating 1, 2 and 3 as mere labels for the type of ticket, with no other meaning attached. We proceed in the same way:
new Csv()
        .withTypes(VarType.NOMINAL, "Survived", "Pclass")
        .read(root + "train.csv")
        .printSummary();
Notice that we appended the variable name after Survived. This is possible because the withTypes method takes a type followed by a variable-length list of strings (varargs) holding the names of the variables.
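For readers less familiar with Java varargs, the call shape can be sketched with a small stand-alone class. CsvOptionsSketch is a hypothetical class that only mimics the signature described above (a type followed by any number of names). It uses plain strings for types and says nothing about how rapaio implements withTypes.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical options holder; it only mimics the call shape, not rapaio's code.
final class CsvOptionsSketch {
    // Requested type for each variable name, in the order the names were given.
    final Map<String, String> typeByName = new LinkedHashMap<>();

    // The String... parameter accepts zero or more names after the type,
    // and returning this allows calls to be chained.
    CsvOptionsSketch withTypes(String type, String... names) {
        for (String name : names) {
            typeByName.put(name, type);
        }
        return this;
    }

    public static void main(String[] args) {
        CsvOptionsSketch options = new CsvOptionsSketch()
                .withTypes("NOMINAL", "Survived", "Pclass");
        System.out.println(options.typeByName);
    }
}
```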
Name
These are the passengers' names, and the values are unique. As it stands, the predictive power of this field is null, so we keep it as it is. Note that it does contain valuable information, just not in this direct form.
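As one possible example of such indirect information, the names shown in the summary all carry a title (Mr., Mrs., Miss.) between the comma and the first period. Treating the title as a useful feature is an assumption made here for illustration and is not something established above. TitleSketch is a hypothetical helper written in plain Java.

```java
// Hypothetical helper; extracting a title is one assumed use of the Name field.
final class TitleSketch {
    // Returns the text between the first comma and the next period,
    // e.g. "Mr" for "Braund, Mr. Owen Harris"; empty string if the pattern is absent.
    static String title(String name) {
        int comma = name.indexOf(',');
        int dot = name.indexOf('.', comma + 1);
        if (comma < 0 || dot < 0) {
            return "";
        }
        return name.substring(comma + 1, dot).trim();
    }

    public static void main(String[] args) {
        System.out.println(title("Braund, Mr. Owen Harris"));
        System.out.println(title("Heikkinen, Miss. Laina"));
    }
}
```

Unlike the raw name, such a derived value repeats across passengers, so a learner could in principle make use of it.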
Sex
This field specifies the gender of the passenger. We have males and females.
Age
This field specifies the age of a passenger. We would expect this variable to be parsed as numeric, or at least index, but it is nominal. Why did that happen? Notice that the values look like numbers, but the first value (the most frequent one, with 177 instances) is empty. The reason the variable is nominal has to do with how the CSV parser handles missing values. By default, it treats only the string "?" as a missing value. The most frequent value in this field, however, is the empty string "", which is therefore not considered missing. Because an empty string cannot be parsed as a number, the variable is promoted to nominal.
We can customize missing value handling by specifying which strings count as missing. We use .withNAValues(String...naValues) to tell the parser all the strings that denote missing values. In our case we want just the empty string. When the parser finds an empty string, it sets the value as missing. It does not promote the variable to nominal, since a missing value is a legal value.
new Csv()
        .withNAValues("")
        .withTypes(VarType.NOMINAL, "Survived", "Pclass")
        .read(root + "train.csv")
        .printSummary();
Frame Summary
=============
* rowCount: 891
* complete: 183/891
* varCount: 12
* varNames:
0. PassengerId : idx | 4. Sex : nom | 8. Ticket : nom |
1. Survived : nom | 5. Age : num | 9. Fare : num |
2. Pclass : nom | 6. SibSp : idx | 10. Cabin : nom |
3. Name : nom | 7. Parch : idx | 11. Embarked : nom |
PassengerId Survived Pclass
Min. : 1.000 0 : 549 3 : 491
1st Qu. : 223.500 1 : 342 1 : 216
Median : 446.000 2 : 184
Mean : 446.000
2nd Qu. : 668.500
Max. : 891.000
Name Sex
"Braund, Mr. Owen Harris" : 1 male : 577
"Cumings, Mrs. John Bradley (Florence Briggs Thayer)" : 1 female : 314
"Heikkinen, Miss. Laina" : 1
"Futrelle, Mrs. Jacques Heath (Lily May Peel)" : 1
"Allen, Mr. William Henry" : 1
"Moran, Mr. James" : 1
(Other) : 885
Age SibSp Parch Ticket Fare
Min. : 0.420 Min. : 0.000 Min. : 0.000 347082 : 7 Min. : 0.000
1st Qu. : 20.125 1st Qu. : 0.000 1st Qu. : 0.000 1601 : 7 1st Qu. : 7.910
Median : 28.000 Median : 0.000 Median : 0.000 CA. 2343 : 7 Median : 14.454
Mean : 29.699 Mean : 0.523 Mean : 0.382 3101295 : 6 Mean : 32.204
2nd Qu. : 38.000 2nd Qu. : 1.000 2nd Qu. : 0.000 CA 2144 : 6 2nd Qu. : 31.000
Max. : 80.000 Max. : 8.000 Max. : 6.000 347088 : 6 Max. : 512.329
NA's : 177 (Other) : 852
Cabin Embarked
G6 : 4 S : 644
C23 C25 C27 : 4 C : 168
B96 B98 : 4 Q : 77
F33 : 3 NA's : 2
E101 : 3
(Other) : 183
NA's : 687
Notice what happened: Age field is now numeric and it contains missing values.
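The promotion rule described above can be sketched in plain Java. This is a simplified illustration, not rapaio's parser: TypeInferenceSketch, Kind and infer are hypothetical names, and the promotion order (index, then numeric, then nominal) is an assumption that matches the behaviour seen in the two summaries.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Set;

// Hypothetical sketch of type promotion; not rapaio's actual parsing code.
final class TypeInferenceSketch {
    enum Kind { INDEX, NUMERIC, NOMINAL }

    static boolean isInt(String s) {
        try {
            Integer.parseInt(s);
            return true;
        } catch (NumberFormatException e) {
            return false;
        }
    }

    static boolean isDouble(String s) {
        try {
            Double.parseDouble(s);
            return true;
        } catch (NumberFormatException e) {
            return false;
        }
    }

    // Starts from the narrowest type and promotes only when a value does not fit.
    static Kind infer(List<String> values, Set<String> naValues) {
        Kind kind = Kind.INDEX;
        for (String v : values) {
            if (naValues.contains(v)) {
                continue; // a missing value is legal and never forces a promotion
            }
            if (kind == Kind.INDEX && isInt(v)) {
                continue;
            }
            if (isDouble(v)) {
                kind = Kind.NUMERIC;
                continue;
            }
            return Kind.NOMINAL; // not a number and not declared missing
        }
        return kind;
    }

    public static void main(String[] args) {
        List<String> age = Arrays.asList("22", "", "38", "0.42");
        // Only "?" counts as missing, so "" is an ordinary non-numeric value.
        System.out.println(infer(age, Collections.singleton("?")));
        // The empty string is declared missing, so the numbers decide the type.
        System.out.println(infer(age, Collections.singleton("")));
    }
}
```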
SibSp
Its meaning is "siblings/spouses". It is parsed as index, which is natural for a count. Only with a rather sick imagination could one consider, for example, "a quarter of a wife".
Parch
Its meaning is "parents/children". It is naturally parsed as index.
Ticket
This is the code of the ticket. A family can probably share the same ticket, which must be why the frequencies go up to 7. This field is nominal. Used directly, it has low predictive power. It may contain valuable information, but in its raw form it would not help much.
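The frequencies shown in the summary are simply counts of how often each level occurs. A minimal plain-Java sketch of that counting follows. LevelCounts is a hypothetical helper, and the short list of ticket codes is illustrative rather than taken from the file.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration only.
final class LevelCounts {
    // Counts how many times each level occurs.
    static Map<String, Integer> count(List<String> values) {
        Map<String, Integer> counts = new HashMap<>();
        for (String v : values) {
            counts.merge(v, 1, Integer::sum);
        }
        return counts;
    }

    // Largest frequency among all levels, or 0 for an empty map.
    static int maxCount(Map<String, Integer> counts) {
        int max = 0;
        for (int c : counts.values()) {
            max = Math.max(max, c);
        }
        return max;
    }

    public static void main(String[] args) {
        List<String> tickets = Arrays.asList("347082", "1601", "347082", "347082", "1601");
        Map<String, Integer> counts = count(tickets);
        System.out.println(counts.get("347082") + " passengers share ticket 347082");
        System.out.println("largest group: " + maxCount(counts));
    }
}
```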
Fare
This is the price of the passenger's fare. It should be numeric, and it is.
Cabin
Code of the passenger's cabin, parsed as nominal. The same notes apply as for the Ticket variable.
Embarked
Code for the embarking city, which could be: C = Cherbourg, Q = Queenstown, S = Southampton. It's parsed as nominal and has missing values.
If we are content with our parsing, we load data into a data frame for later use:
Frame train = new Csv()
        .withNAValues("")
        .withTypes(VarType.NOMINAL, "Survived", "Pclass")
        .read(root + "train.csv");
train.printSummary();
Read test data from csv file
Once we have a training frame, we can also load the test data. We do that to take a look at the frame, and because the data is small there is no memory or time cost to worry about. To avoid repeating the CSV options, and to get identical levels for nominal variables, we parse this data set in a different way: we specify the variable types through a frame template:
Frame test = new Csv()
        .withNAValues("")
        .withTemplate(train)
        .read(root + "test.csv");
Instead of specifying the preferred variable types again, we use the train frame as a template for them. This also has the side effect that the encoding of categorical variables is identical.
Note that the test data no longer has the Survived variable. This is correct, since that is what we have to predict. Note also that the types of the remaining variables are the same as in the training data set.
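Why identical encoding of categorical variables matters can be shown with a small plain-Java sketch. SharedLevels is a hypothetical class, not rapaio's template mechanism: it assigns an integer code to each level in order of first appearance. Reusing the dictionary built on the training values gives test values the same codes, while a fresh dictionary would not. The Embarked values are illustrative.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical level dictionary; it only illustrates the idea of shared encoding.
final class SharedLevels {
    // Level label mapped to its integer code, in order of first appearance.
    private final Map<String, Integer> index = new LinkedHashMap<>();

    // Returns the code of a level, registering the level if it is new.
    int encode(String level) {
        Integer code = index.get(level);
        if (code == null) {
            code = index.size();
            index.put(level, code);
        }
        return code;
    }

    int[] encodeAll(List<String> values) {
        int[] codes = new int[values.size()];
        for (int i = 0; i < codes.length; i++) {
            codes[i] = encode(values.get(i));
        }
        return codes;
    }

    public static void main(String[] args) {
        List<String> trainEmbarked = Arrays.asList("S", "C", "S", "Q");
        List<String> testEmbarked = Arrays.asList("Q", "S");

        SharedLevels shared = new SharedLevels();
        shared.encodeAll(trainEmbarked); // the dictionary is built on the training values
        System.out.println("shared:   " + Arrays.toString(shared.encodeAll(testEmbarked)));

        // A fresh dictionary gives the same labels different codes.
        System.out.println("separate: " + Arrays.toString(new SharedLevels().encodeAll(testEmbarked)));
    }
}
```

With separate dictionaries, a model trained on one set of codes would misread the other, which is the problem the template approach avoids.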