SVM model
SVM (Support Vector Machines) is a nice framework for testing new ideas on various types of problems. The power of SVMs comes from their kernels. A kernel is basically a transformation of the original space generated by the input features into another space, often one with more dimensions. It is like feature engineering packed into a single function.
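As a quick illustration (a minimal sketch in plain Java, not part of the pipeline below): a degree-2 polynomial kernel on two numeric features computes the same value as an ordinary dot product taken in an expanded, higher-dimensional feature space.
// Sketch: K(x, y) = (x . y)^2 equals a dot product in the expanded space
// phi(x) = (x1^2, sqrt(2)*x1*x2, x2^2).
double[] x = {1, 2};
double[] y = {3, 4};

double dot = x[0] * y[0] + x[1] * y[1];
double kernelValue = dot * dot;                     // (x . y)^2 = 121

double[] phiX = {x[0] * x[0], Math.sqrt(2) * x[0] * x[1], x[1] * x[1]};
double[] phiY = {y[0] * y[0], Math.sqrt(2) * y[0] * y[1], y[1] * y[1]};
double expandedDot = phiX[0] * phiY[0] + phiX[1] * phiY[1] + phiX[2] * phiY[2];   // also 121
The SVM only ever evaluates the kernel, so the expanded space is never materialized.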
But SVMs have a practical problem: the features need to be numeric and missing values are not allowed. This is not a constraint of the algorithm itself; one can always build a kernel for nominal features. But the implemented kernels accept only numeric, non-missing values, and it is much simpler to shape our data into this format.
How can we do that?
Data preparation
We can use a filter to impute missing values. The filter we use is imputation with a classifier or imputation with a regression. The logic is the following: train a classifier from a specified set of input features to predict the field which has missing values. The data set inside the filter is restricted to contain only the instances with non-missing target values.
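To make that logic concrete, here is a minimal, self-contained sketch with made-up toy values (this is not the rapaio implementation): the rows where the target is present act as training data, and a simple 1-nearest-neighbour rule on Fare fills in a missing Embarked value. The filters used below do the same thing, only with random forests and several input features.
// Toy illustration of "learn from complete rows, predict the missing ones".
double[] fare     = {7.25, 71.28, 8.05, 80.0, 13.0};
String[] embarked = {"S",  "C",   "S",  null, "C"};

for (int i = 0; i < embarked.length; i++) {
    if (embarked[i] != null) continue;              // nothing to impute here
    int best = -1;
    double bestDist = Double.MAX_VALUE;
    for (int j = 0; j < embarked.length; j++) {
        if (embarked[j] == null) continue;          // train only on complete rows
        double dist = Math.abs(fare[i] - fare[j]);
        if (dist < bestDist) {
            bestDist = dist;
            best = j;
        }
    }
    embarked[i] = embarked[best];                   // the prediction replaces the missing value
}
// embarked is now {"S", "C", "S", "C", "C"}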
After we impute the missing values we encode the nominal features into numeric features. We can accomplish this with yet another filter, named FFOneHotEncoding. What it does is create one numeric feature for each level of the nominal variable. The values of these numeric features are given by an indicator function: 1 if the instance's level equals the level encoded by that numeric feature, 0 otherwise.
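A tiny sketch of what this produces, using toy values: the nominal variable Embarked with levels S, C and Q becomes three indicator columns.
// One-hot encoding by hand: one 0/1 column per level of the nominal variable.
String[] embarked = {"S", "C", "Q", "S"};
String[] levels   = {"S", "C", "Q"};

double[][] encoded = new double[embarked.length][levels.length];
for (int i = 0; i < embarked.length; i++) {
    for (int j = 0; j < levels.length; j++) {
        // indicator function: 1 when the value matches the level, 0 otherwise
        encoded[i][j] = embarked[i].equals(levels[j]) ? 1.0 : 0.0;
    }
}
// encoded rows: {1,0,0}, {0,1,0}, {0,0,1}, {1,0,0}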
After we have numeric variables, it is better to bring all of them into the same range. This is not a requirement for SVMs in general; the point is to give the same weight to all the involved variables. As a side effect it also makes the algorithm run faster, because the convex optimization problem has a smaller chance of containing large, close-to-flat surfaces.
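A minimal sketch of what standardization does to a single variable (this mirrors the effect of FFStandardize; using the sample standard deviation here is my assumption): subtract the mean and divide by the standard deviation, so the result has mean 0 and unit spread.
// Standardize one numeric variable by hand.
double[] fare = {7.25, 71.28, 8.05, 80.0, 13.0};

double mean = 0;
for (double v : fare) mean += v;
mean /= fare.length;

double variance = 0;
for (double v : fare) variance += (v - mean) * (v - mean);
double sd = Math.sqrt(variance / (fare.length - 1));    // sample standard deviation

double[] standardized = new double[fare.length];
for (int i = 0; i < fare.length; i++) {
    standardized[i] = (fare[i] - mean) / sd;             // mean 0, standard deviation 1
}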
Finally, we remove the unused variables from the frame so that it is ready for learning.
FFilter[] inputFilters = new FFilter[]{
new CustomFilter(),
new FFImputeWithRegression(RForest.newRF().withRuns(100),
new VarRange("Age,Pclass,Embarked,Sex,Fare,Title"), "Age"),
new FFImputeByClassifier(CForest.newRF().withRuns(10),
new VarRange("Embarked,Age,Pclass,Sex,Title"), "Embarked"),
new FFImputeByClassifier(CForest.newRF().withRuns(100),
new VarRange("Age,Pclass,Embarked,Sex,Fare,Ticket"), "Ticket"),
new FFImputeByClassifier(CForest.newRF().withRuns(100),
new VarRange("Age,Pclass,Embarked,Sex,Fare,Cabin"), "Cabin"),
new FFOneHotEncoding("Sex,Embarked,Title,Cabin"),
new FFOneHotEncoding("Ticket"),
new FFStandardize("all"),
new FFRemoveVars("PassengerId,Name,SibSp,Parch")
};
train.applyFilters(inputFilters).printSummary();
Frame Summary
=============
* rowCount: 891
* complete: 891/891
* varCount: 41
* varNames:
0. Survived : nom | 14. Ticket.3 : num | 28. Cabin.G : num |
1. Pclass : num | 15. Ticket.2 : num | 29. Cabin.D : num |
2. Sex.male : num | 16. Ticket.C : num | 30. Cabin.A : num |
3. Sex.female : num | 17. Ticket.7 : num | 31. Cabin.B : num |
4. Age : num | 18. Ticket.W : num | 32. Cabin.F : num |
5. Fare : num | 19. Ticket.4 : num | 33. Cabin.T : num |
6. Embarked.S : num | 20. Ticket.F : num | 34. Title.Master : num |
7. Embarked.C : num | 21. Ticket.L : num | 35. Title.Rev : num |
8. Embarked.Q : num | 22. Ticket.9 : num | 36. Title.Mr : num |
9. FamilySize : num | 23. Ticket.6 : num | 37. Title.Miss : num |
10. Ticket.A : num | 24. Ticket.5 : num | 38. Title.Dr : num |
11. Ticket.P : num | 25. Ticket.8 : num | 39. Title.Mrs : num |
12. Ticket.S : num | 26. Cabin.C : num | 40. Title.Military : num |
13. Ticket.1 : num | 27. Cabin.E : num |
Survived Pclass Sex.male Sex.female Age
0 : 549 Min. : -1.565 Min. : -1.355 Min. : -0.737 Min. : -2.781
1 : 342 1st Qu. : -0.369 1st Qu. : -1.355 1st Qu. : -0.737 1st Qu. : -0.336
Median : 0.827 Median : 0.737 Median : -0.737 Median : -0.113
Mean : -0.000 Mean : -0.000 Mean : 0.000 Mean : 0.000
2nd Qu. : 0.827 2nd Qu. : 0.737 2nd Qu. : 1.355 2nd Qu. : 0.483
Max. : 0.827 Max. : 0.737 Max. : 1.355 Max. : 2.929
Fare Embarked.S Embarked.C Embarked.Q FamilySize
Min. : -0.648 Min. : -1.632 Min. : -0.482 Min. : -0.303 Min. : -0.561
1st Qu. : -0.489 1st Qu. : -1.632 1st Qu. : -0.482 1st Qu. : -0.303 1st Qu. : -0.561
Median : -0.357 Median : 0.612 Median : -0.482 Median : -0.303 Median : -0.561
Mean : 0.000 Mean : 0.000 Mean : 0.000 Mean : 0.000 Mean : -0.000
2nd Qu. : -0.024 2nd Qu. : 0.612 2nd Qu. : -0.482 2nd Qu. : -0.303 2nd Qu. : 0.059
Max. : 9.662 Max. : 0.612 Max. : 2.073 Max. : 3.297 Max. : 5.637
Ticket.A Ticket.P Ticket.S Ticket.1 Ticket.3
Min. : -0.139 Min. : -0.280 Min. : -0.251 Min. : -0.433 Min. : -0.767
1st Qu. : -0.139 1st Qu. : -0.280 1st Qu. : -0.251 1st Qu. : -0.433 1st Qu. : -0.767
Median : -0.139 Median : -0.280 Median : -0.251 Median : -0.433 Median : -0.767
Mean : -0.000 Mean : -0.000 Mean : -0.000 Mean : -0.000 Mean : 0.000
2nd Qu. : -0.139 2nd Qu. : -0.280 2nd Qu. : -0.251 2nd Qu. : -0.433 2nd Qu. : 1.303
Max. : 7.166 Max. : 3.563 Max. : 3.974 Max. : 2.305 Max. : 1.303
Ticket.2 Ticket.C Ticket.7 Ticket.W Ticket.4
Min. : -0.508 Min. : -0.261 Min. : -0.101 Min. : -0.101 Min. : -0.106
1st Qu. : -0.508 1st Qu. : -0.261 1st Qu. : -0.101 1st Qu. : -0.101 1st Qu. : -0.106
Median : -0.508 Median : -0.261 Median : -0.101 Median : -0.101 Median : -0.106
Mean : 0.000 Mean : -0.000 Mean : -0.000 Mean : 0.000 Mean : 0.000
2nd Qu. : -0.508 2nd Qu. : -0.261 2nd Qu. : -0.101 2nd Qu. : -0.101 2nd Qu. : -0.106
Max. : 1.966 Max. : 3.823 Max. : 9.894 Max. : 9.894 Max. : 9.381
Ticket.F Ticket.L Ticket.9 Ticket.6 Ticket.5
Min. : -0.082 Min. : -0.067 Min. : 0.000 Min. : -0.067 Min. : -0.034
1st Qu. : -0.082 1st Qu. : -0.067 1st Qu. : 0.000 1st Qu. : -0.067 1st Qu. : -0.034
Median : -0.082 Median : -0.067 Median : 0.000 Median : -0.067 Median : -0.034
Mean : -0.000 Mean : -0.000 Mean : 0.000 Mean : -0.000 Mean : -0.000
2nd Qu. : -0.082 2nd Qu. : -0.067 2nd Qu. : 0.000 2nd Qu. : -0.067 2nd Qu. : -0.034
Max. : 12.138 Max. : 14.883 Max. : 0.000 Max. : 14.883 Max. : 29.816
Ticket.8 Cabin.C Cabin.E Cabin.G Cabin.D
Min. : -0.047 Min. : -0.379 Min. : -0.615 Min. : -0.301 Min. : -0.499
1st Qu. : -0.047 1st Qu. : -0.379 1st Qu. : -0.615 1st Qu. : -0.301 1st Qu. : -0.499
Median : -0.047 Median : -0.379 Median : -0.615 Median : -0.301 Median : -0.499
Mean : -0.000 Mean : -0.000 Mean : 0.000 Mean : -0.000 Mean : 0.000
2nd Qu. : -0.047 2nd Qu. : -0.379 2nd Qu. : 1.623 2nd Qu. : -0.301 2nd Qu. : -0.499
Max. : 21.071 Max. : 2.636 Max. : 1.623 Max. : 3.321 Max. : 2.000
Cabin.A Cabin.B Cabin.F Cabin.T Title.Master
Min. : -0.159 Min. : -0.269 Min. : -0.538 Min. : 0.000 Min. : -0.217
1st Qu. : -0.159 1st Qu. : -0.269 1st Qu. : -0.538 1st Qu. : 0.000 1st Qu. : -0.217
Median : -0.159 Median : -0.269 Median : -0.538 Median : 0.000 Median : -0.217
Mean : 0.000 Mean : -0.000 Mean : 0.000 Mean : 0.000 Mean : 0.000
2nd Qu. : -0.159 2nd Qu. : -0.269 2nd Qu. : -0.538 2nd Qu. : 0.000 2nd Qu. : -0.217
Max. : 6.281 Max. : 3.719 Max. : 1.858 Max. : 0.000 Max. : 4.610
Title.Rev Title.Mr Title.Miss Title.Dr Title.Mrs
Min. : -0.082 Min. : -1.183 Min. : -0.510 Min. : -0.089 Min. : -0.409
1st Qu. : -0.082 1st Qu. : -1.183 1st Qu. : -0.510 1st Qu. : -0.089 1st Qu. : -0.409
Median : -0.082 Median : 0.844 Median : -0.510 Median : -0.089 Median : -0.409
Mean : -0.000 Mean : -0.000 Mean : -0.000 Mean : 0.000 Mean : -0.000
2nd Qu. : -0.082 2nd Qu. : 0.844 2nd Qu. : -0.510 2nd Qu. : -0.089 2nd Qu. : -0.409
Max. : 12.138 Max. : 0.844 Max. : 1.959 Max. : 11.231 Max. : 2.440
Title.Military
Min. : -0.082
1st Qu. : -0.082
Median : -0.082
Mean : 0.000
2nd Qu. : -0.082
Max. : 12.138
There is a lot of output. Notice that we now have a numeric variable for each ticket first letter, each title, each cabin first letter, and so on.
Train a polynomial SVM
A linear kernel is a polynomial kernel of degree 1. We leave the kernel's remaining parameter at its default value.
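For reference, the textbook form of the polynomial kernel with degree d and bias c is given below (the exact parametrization of PolyKernel may differ slightly); with d = 1 it reduces to an ordinary dot product plus a constant, hence the linear kernel.

K(x, y) = (x \cdot y + c)^{d}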
Classifier model = new BinarySMO()
.withInputFilters(inputFilters)
.withC(0.0001)
.withKernel(new PolyKernel(1));
model.train(train, "Survived");
CFit fit = model.fit(test);
new Confusion(train.var("Survived"), model.fit(train).firstClasses()).printSummary();
new Csv().withQuotes(false).write(SolidFrame.wrapOf(
test.var("PassengerId"),
fit.firstClasses().withName("Survived")
), root + "svm1-submit.csv");
cv(train, model);
> Confusion
Ac\Pr | 0 1 | total
----- | - - | -----
0 | >292 257 | 549
1 | 41 >301 | 342
----- | - - | -----
total | 333 558 | 891
Complete cases 891 from 891
Acc: 0.6655443 (Accuracy )
F1: 0.6621315 (F1 score / F-measure)
MCC: 0.4141426 (Matthew correlation coefficient)
Pre: 0.8768769 (Precision)
Rec: 0.5318761 (Recall)
G: 0.6829274 (G-measure)
Cross validation 10-fold
CV fold: 1, acc: 0.666667, mean: 0.666667, se: NaN
CV fold: 2, acc: 0.764045, mean: 0.715356, se: 0.068857
CV fold: 3, acc: 0.775281, mean: 0.735331, se: 0.059730
CV fold: 4, acc: 0.707865, mean: 0.728464, se: 0.050666
CV fold: 5, acc: 0.719101, mean: 0.726592, se: 0.044077
CV fold: 6, acc: 0.696629, mean: 0.721598, se: 0.041278
CV fold: 7, acc: 0.842697, mean: 0.738898, se: 0.059286
CV fold: 8, acc: 0.730337, mean: 0.737828, se: 0.054972
CV fold: 9, acc: 0.752809, mean: 0.739492, se: 0.051663
CV fold:10, acc: 0.808989, mean: 0.746442, se: 0.053437
=================
mean: 0.746442, se: 0.053437
The results are not promising. This is better than random, but it is not enough for our purpose. There are some possible explanations for this result. The first one is that with a linear kernel the transformed space is essentially the same as the original feature space, so a classifier such as a random forest would already have worked well anywhere the linear SVM could. This might not be true in general, but in this case it looks like a good explanation. We need to be more flexible.
To increase the flexibility of the model and to allow features to interact with one another, we change the degree of the polynomial kernel. This time we use degree 3. We also adjust the C parameter to allow for some errors. This parameter is the factor of the slack variables in the regularization constraints of the SVM optimization problem: the bigger its value, the larger the penalty for wrong decisions. If the space were linearly separable, one could theoretically set this value as high as possible. But we know it is not, and we also know that we have plenty of irreducible error. As a consequence, it looks like we should decrease the value of this parameter.
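For reference, C is the coefficient of the slack variables ξ_i in the standard soft-margin SVM objective:

\min_{w, b, \xi} \; \frac{1}{2}\lVert w \rVert^{2} + C \sum_{i} \xi_{i}
\quad \text{subject to} \quad y_{i}\,(w \cdot \phi(x_{i}) + b) \ge 1 - \xi_{i}, \quad \xi_{i} \ge 0

A large C forces the margin constraints to be satisfied almost exactly, while a small C tolerates more violations, which is what we want in the presence of noise.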
Classifier model = new BinarySMO()
.withInputFilters(inputFilters)
.withC(0.0001)
.withKernel(new PolyKernel(3));
model.train(train, "Survived");
CFit fit = model.fit(test);
new Confusion(train.var("Survived"), model.fit(train).firstClasses()).printSummary();
new Csv().withQuotes(false).write(SolidFrame.wrapOf(
test.var("PassengerId"),
fit.firstClasses().withName("Survived")
), root + "svm1-submit.csv");
cv(train, model);
> Confusion
Ac\Pr | 0 1 | total
----- | - - | -----
0 | >472 77 | 549
1 | 48 >294 | 342
----- | - - | -----
total | 520 371 | 891
Complete cases 891 from 891
Acc: 0.8597082 (Accuracy )
F1: 0.8830683 (F1 score / F-measure)
MCC: 0.7097044 (Matthew correlation coefficient)
Pre: 0.9076923 (Precision)
Rec: 0.859745 (Recall)
G: 0.8833934 (G-measure)
Cross validation 10-fold
CV fold: 1, acc: 0.822222, mean: 0.822222, se: NaN
CV fold: 2, acc: 0.786517, mean: 0.804370, se: 0.025248
CV fold: 3, acc: 0.853933, mean: 0.820891, se: 0.033728
CV fold: 4, acc: 0.808989, mean: 0.817915, se: 0.028174
CV fold: 5, acc: 0.797753, mean: 0.813883, se: 0.026012
CV fold: 6, acc: 0.786517, mean: 0.809322, se: 0.025809
CV fold: 7, acc: 0.831461, mean: 0.812484, se: 0.025002
CV fold: 8, acc: 0.876404, mean: 0.820474, se: 0.032350
CV fold: 9, acc: 0.764045, mean: 0.814204, se: 0.035631
CV fold:10, acc: 0.820225, mean: 0.814806, se: 0.033647
=================
mean: 0.814806, se: 0.033647
This time the results are promising. We achieved a training error which is not close to zero, and the cross-validation results are close to what we are aiming for. We should definitely try this classifier.
We also get a better score on the public leaderboard, which is very good. Usually in this competition a score around this level is considered fine, and a somewhat higher one excellent.
Tuning the SVM manually
We can work more on SVMs. One thing which deserves a try is the radial basis function kernel, which is similar to working in an infinite-dimensional space! We tried some RBFKernel approaches, but much better results came from the CauchyKernel. The CauchyKernel works in a similar way to an RBF kernel; the difference, which is sometimes important, is that it is based on a distribution with fatter tails than the Gaussian. This produces a long-distance influence effect. It is reasonable to use here because we know we have noise: a kernel which acts over wider ranges should work better when combined with a small value for the penalty parameter.
After some manual tuning we arrived at the following classifier.
Classifier model = new BinarySMO()
.withInputFilters(inputFilters)
.withC(1)
.withTol(1e-10)
.withKernel(new CauchyKernel(25));
> Confusion
Ac\Pr | 0 1 | total
----- | - - | -----
0 | >520 29 | 549
1 | 104 >238 | 342
----- | - - | -----
total | 624 267 | 891
Complete cases 891 from 891
Acc: 0.8507295 (Accuracy )
F1: 0.8866155 (F1 score / F-measure)
MCC: 0.6826819 (Matthew correlation coefficient)
Pre: 0.8333333 (Precision)
Rec: 0.9471767 (Recall)
G: 0.8884334 (G-measure)
Cross validation 10-fold
CV fold: 1, acc: 0.811111, mean: 0.811111, se: NaN
CV fold: 2, acc: 0.786517, mean: 0.798814, se: 0.017391
CV fold: 3, acc: 0.865169, mean: 0.820932, se: 0.040235
CV fold: 4, acc: 0.797753, mean: 0.815137, se: 0.034836
CV fold: 5, acc: 0.831461, mean: 0.818402, se: 0.031040
CV fold: 6, acc: 0.842697, mean: 0.822451, se: 0.029481
CV fold: 7, acc: 0.820225, mean: 0.822133, se: 0.026926
CV fold: 8, acc: 0.820225, mean: 0.821895, se: 0.024937
CV fold: 9, acc: 0.797753, mean: 0.819212, se: 0.024676
CV fold:10, acc: 0.831461, mean: 0.820437, se: 0.023585
=================
mean: 0.820437, se: 0.023585
This classifier has similar results, but there are reasons to believe it is slightly better than the previous one. The training accuracy is a bit lower, but we know that the training error is not a good estimator. The 10-fold cross-validation accuracy is greater, which is a good sign. A better interpretation is that the gap between the two has shrunk, and that is a good thing. A new submission to Kaggle follows.
We are really, really close to our psychological milestone. Perhaps some more tuning would give further gains; this is true in general. However, the next section presents a better approach which usually provides some gain in accuracy: stacking.