CForest model

Random forests are well known to work well when the irreducible error in the training data is high. This is probably the case for the Titanic data set. We have reasons to believe so, since it was a tragedy: a lot of random and unexpected things happened, despite the bravery and the sacrifice of the crew and others.

Random forests are the invention of Leo Breiman; the first design was a joint effort with Adele Cutler. The base of random forests is bagging (bootstrap aggregation). On top of that, the core of the algorithm is selecting only a random, limited subset of variables at each node.
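To make the idea concrete, here is a minimal conceptual sketch of the training loop in plain Java. It is not the CForest implementation; the Dataset and DecisionTree types and their methods are hypothetical, used only to show bagging plus the random variable subset per node.

import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import java.util.function.Supplier;

// Hypothetical types, only to give the algorithm a shape
interface Dataset { int size(); Dataset sampleWithReplacement(int n, Random rnd); }
interface DecisionTree { void train(Dataset d, int mtry); }

public class RandomForestSketch {
    // Trains `trees` decision trees, each on a bootstrap sample of the data;
    // every tree considers only `mtry` randomly chosen variables at each node.
    public static List<DecisionTree> train(Dataset data, int trees, int mtry,
                                           Supplier<DecisionTree> newTree) {
        Random rnd = new Random();
        List<DecisionTree> forest = new ArrayList<>();
        for (int i = 0; i < trees; i++) {
            // bagging: bootstrap sample with the same size as the original data
            Dataset bootstrap = data.sampleWithReplacement(data.size(), rnd);
            DecisionTree tree = newTree.get();
            tree.train(bootstrap, mtry); // the tree draws `mtry` random candidates at each split
            forest.add(tree);
        }
        return forest; // prediction is done later by majority vote over all trees
    }
}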

We will work with random forests for now. This ensemble is more robust and is capable of obtaining much better results than a single tree. At the same time we will introduce 10-fold cross-validation to check our progress and estimate the error.

In the beginning we will use 10-fold cross-validation to estimate the accuracy on the public leaderboard. We will build a static method which does cross-validation for a single classifier. Note that there are similar constructs in the library, but it is instructive to build one tailored to our needs.

Build a 10-fold cross validation

public static void cv(Frame df, Classifier c) {
    String target = "Survived"; // this is our class
    int folds = 10; // number of folds

    // split our data set into 10 stratified folds
    Mapping[] mappings = buildFolds(df, target, folds);
    NumericVar acc = NumericVar.empty(folds); // variable to store accuracies for each fold

    WS.printf("Cross validation 10-fold\n");
    for (int fold = 0; fold < folds; fold++) {
        Frame train = df.removeRows(mappings[fold]); // training set for this fold
        Frame test = df.mapRows(mappings[fold]); // held-out fold used for testing

        Classifier cc = c.newInstance(); // builds a new instance of classifier
        cc.train(train, target); // train it on this fold's training set

        // build a confusion matrix to compute accuracy
        double foldAcc = new Confusion(
            test.var(target), cc.fit(test).firstClasses()).accuracy();
        acc.setValue(fold, foldAcc); // collect accuracy

        WS.printf("CV fold:%2d, acc: %.6f, mean: %.6f, se: %.6f\n",
                fold + 1,
                foldAcc,
                CoreTools.mean(acc).value(),
                CoreTools.var(acc).sdValue());
    }
    WS.printf("=================\n");
    WS.printf("mean: %.6f, se: %.6f\n\n",
            CoreTools.mean(acc).value(),
            CoreTools.var(acc).sdValue());
}

// builds almost equal folds, with samples stratified by the target field
// the idea is to keep the proportion of each stratum as close as possible across folds
public static Mapping[] buildFolds(Frame df, String target, int folds) {
    Var rows = IntStream.range(0, df.rowCount()).boxed().collect(Index.collector());
    rows = Filters.shuffle(rows); // shuffle all rows
    rows = Filters.refSort(rows, df.var(target).refComparator()); // sort the shuffled rows by target

    // build strata
    Mapping[] strata = new Mapping[folds];
    for (int i = 0; i < folds; i++) {
        strata[i] = Mapping.empty();
    }
    for (int i = 0; i < df.rowCount(); i++) {
        strata[i % folds].add(rows.index(i));
    }
    return strata;
}

We can use this 10-fold cross-validation procedure to test our previous tree classifier in the following way:

cv(train.mapVars("Survived,Sex,Pclass,Embarked"), CTree.newCART());

This will give us the following results:

Cross validation 10-fold
CV fold: 1, acc: 0.811111, mean: 0.811111, se: NaN
CV fold: 2, acc: 0.820225, mean: 0.815668, se: 0.006444
CV fold: 3, acc: 0.786517, mean: 0.805951, se: 0.017436
CV fold: 4, acc: 0.797753, mean: 0.803901, se: 0.014815
CV fold: 5, acc: 0.752809, mean: 0.793683, se: 0.026205
CV fold: 6, acc: 0.887640, mean: 0.809342, se: 0.044952
CV fold: 7, acc: 0.707865, mean: 0.794846, se: 0.056169
CV fold: 8, acc: 0.853933, mean: 0.802232, se: 0.056042
CV fold: 9, acc: 0.910112, mean: 0.814218, se: 0.063571
CV fold:10, acc: 0.786517, mean: 0.811448, se: 0.060572
=================
mean: 0.811448, se: 0.060572

Our first random forest

The name of the random forest implementation is CForest. To build a new ensemble of trees, one has to instantiate it in the following way:

Classifier rf = CForest.newRF();

There are a lot of things that can be customized for a random forest; a short example follows the list. Among them one can change:

  • Number of trees for classification
  • Which kind of weak classifier to use (it can be customized like any other classifier)
  • Number of threads in the pool (if you want to use parallelism)
  • What to do after each running step
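The first two options can be combined by chaining the builder methods that appear later in this chapter (withClassifier and withRuns); the values 200 and 5 below are arbitrary, chosen only for illustration. The thread pool and the running-step hook can be configured in a similar way, but their methods are not shown here.

CForest custom = CForest.newRF()
        .withClassifier(CTree.newCART().withMinCount(5)) // customized weak learner
        .withRuns(200); // number of trees in the ensemble

The resulting object is a regular Classifier, so it can be passed to the cv method just like any other classifier.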

Let's build one and use our new cross-validation procedure to estimate its error.

RandomSource.setSeed(123);
Frame tr = train.mapVars("Survived,Sex,Pclass,Embarked");
CForest rf = CForest.newRF().withRuns(100);
cv(tr, rf);
Cross validation 10-fold
CV fold: 1, acc: 0.833333, mean: 0.833333, se: NaN
CV fold: 2, acc: 0.820225, mean: 0.826779, se: 0.009269
CV fold: 3, acc: 0.808989, mean: 0.820849, se: 0.012184
CV fold: 4, acc: 0.808989, mean: 0.817884, se: 0.011582
CV fold: 5, acc: 0.764045, mean: 0.807116, se: 0.026083
CV fold: 6, acc: 0.797753, mean: 0.805556, se: 0.023641
CV fold: 7, acc: 0.876404, mean: 0.815677, se: 0.034392
CV fold: 8, acc: 0.820225, mean: 0.816245, se: 0.031881
CV fold: 9, acc: 0.797753, mean: 0.814191, se: 0.030453
CV fold:10, acc: 0.786517, mean: 0.811423, se: 0.030015
=================
mean: 0.811423, se: 0.030015

Well, a practically identical output. This is because our variables are already exhausted by the tree. It looks like underfitting; in terms of the bias-variance trade-off, this is high bias. We need to enrich our feature space to improve performance.
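As a short reminder, the bias-variance decomposition for squared loss (stated here in general form, not specific to this data set) makes this reasoning explicit: the expected error splits into a squared bias term, a variance term, and the irreducible noise.

$$
E[(y - \hat{f}(x))^2] = \underbrace{(E[\hat{f}(x)] - f(x))^2}_{\text{bias}^2} + \underbrace{E[(\hat{f}(x) - E[\hat{f}(x)])^2]}_{\text{variance}} + \underbrace{\sigma^2}_{\text{irreducible error}}
$$

Averaging many trees mainly shrinks the variance term and does little for the bias term, which is why enriching the feature space, rather than just growing the forest, is the right move here.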

Let's be direct and test what happens if we use all our directly usable features. This time we will also fit the training data set, to see the distribution of the training error.

RandomSource.setSeed(123);
Frame tr = train.mapVars("Survived,Sex,Pclass,Embarked,Age,Fare,SibSp,Parch");
CForest rf = CForest.newRF().withRuns(100);
cv(tr, rf);

rf.train(tr, "Survived");
CFit fit = rf.fit(test);
new Confusion(tr.var("Survived"), rf.fit(tr).firstClasses()).printSummary();
Cross validation 10-fold
CV fold: 1, acc: 0.844444, mean: 0.844444, se: NaN
CV fold: 2, acc: 0.820225, mean: 0.832335, se: 0.017126
CV fold: 3, acc: 0.808989, mean: 0.824553, se: 0.018120
CV fold: 4, acc: 0.786517, mean: 0.815044, se: 0.024095
CV fold: 5, acc: 0.764045, mean: 0.804844, se: 0.030913
CV fold: 6, acc: 0.808989, mean: 0.805535, se: 0.027701
CV fold: 7, acc: 0.820225, mean: 0.807633, se: 0.025890
CV fold: 8, acc: 0.797753, mean: 0.806398, se: 0.024222
CV fold: 9, acc: 0.853933, mean: 0.811680, se: 0.027649
CV fold:10, acc: 0.808989, mean: 0.811411, se: 0.026081
=================
mean: 0.811411, se: 0.026081

> Confusion

 Ac\Pr |    0    1 | total
 ----- |    -    - | -----
     0 | >540    9 |   549
     1 |   15 >327 |   342
 ----- |    -    - | -----
 total |  555  336 |   891


Complete cases 891 from 891
Acc: 0.973064         (Accuracy )
F1:  0.9782609         (F1 score / F-measure)
MCC: 0.9429616         (Matthew correlation coefficient)
Pre: 0.972973         (Precision)
Rec: 0.9836066         (Recall)
G:   0.9782753         (G-measure)

This time we have a good example of overfitting. Why is that? Look at the confusion matrix on the training set: we fit the training data too well. This data set is well known for its high irreducible error, and there is an explanation for that. During the tragic event a lot of exceptional things happened. For example, I read somewhere that an old lady who had a dog was not allowed to embark with her pet due to regulations; as a consequence she decided not to leave it and chose to die with it. It is close to impossible to learn those kinds of things, even if the information were available.

We should reduce the error somehow. We can try to decrease the overfitting by adding more learners. Let's see if that is enough for our purpose.

RandomSource.setSeed(123);
Frame tr = train.mapVars("Survived,Sex,Pclass,Embarked,Age,Fare,SibSp,Parch");
CForest rf = CForest.newRF().withRuns(500);
cv(tr, rf);

rf.train(tr, "Survived");
CFit fit = rf.fit(test);
new Confusion(tr.var("Survived"), rf.fit(tr).firstClasses()).printSummary();
Cross validation 10-fold
CV fold: 1, acc: 0.844444, mean: 0.844444, se: NaN
CV fold: 2, acc: 0.820225, mean: 0.832335, se: 0.017126
CV fold: 3, acc: 0.797753, mean: 0.820807, se: 0.023351
CV fold: 4, acc: 0.786517, mean: 0.812235, se: 0.025641
CV fold: 5, acc: 0.775281, mean: 0.804844, se: 0.027681
CV fold: 6, acc: 0.808989, mean: 0.805535, se: 0.024816
CV fold: 7, acc: 0.820225, mean: 0.807633, se: 0.023324
CV fold: 8, acc: 0.808989, mean: 0.807803, se: 0.021600
CV fold: 9, acc: 0.865169, mean: 0.814177, se: 0.027819
CV fold:10, acc: 0.797753, mean: 0.812534, se: 0.026737
=================
mean: 0.812534, se: 0.026737

> Confusion

 Ac\Pr |    0    1 | total
 ----- |    -    - | -----
     0 | >540    9 |   549
     1 |   16 >326 |   342
 ----- |    -    - | -----
 total |  556  335 |   891


Complete cases 891 from 891
Acc: 0.9719416         (Accuracy )
F1:  0.9773756         (F1 score / F-measure)
MCC: 0.9405826         (Matthew correlation coefficient)
Pre: 0.971223         (Precision)
Rec: 0.9836066         (Recall)
G:   0.9773952         (G-measure)

This is slightly better than before, but the difference does not look significant. A simple pre-pruning strategy is to limit the number of instances in leaf nodes. We set the minimum count to 3.

RandomSource.setSeed(123);
Frame tr = train.mapVars("Survived,Sex,Pclass,Embarked,Age,Fare,SibSp,Parch");
CForest rf = CForest.newRF()
        .withClassifier(CTree.newCART().withMinCount(3))
        .withRuns(100);
cv(tr, rf);

rf.train(tr, "Survived");
CFit fit = rf.fit(test);
new Confusion(tr.var("Survived"), rf.fit(tr).firstClasses()).printSummary();

Notice that we changed the classifier used by CForest. This is the same type of classifier that the random forest uses by default; we specify it explicitly only because we customized its min count parameter.

Cross validation 10-fold
CV fold: 1, acc: 0.855556, mean: 0.855556, se: NaN
CV fold: 2, acc: 0.853933, mean: 0.854744, se: 0.001148
CV fold: 3, acc: 0.808989, mean: 0.839492, se: 0.026429
CV fold: 4, acc: 0.808989, mean: 0.831866, se: 0.026425
CV fold: 5, acc: 0.820225, mean: 0.829538, se: 0.023470
CV fold: 6, acc: 0.820225, mean: 0.827986, se: 0.021333
CV fold: 7, acc: 0.865169, mean: 0.833298, se: 0.024016
CV fold: 8, acc: 0.808989, mean: 0.830259, se: 0.023838
CV fold: 9, acc: 0.842697, mean: 0.831641, se: 0.022680
CV fold:10, acc: 0.808989, mean: 0.829376, se: 0.022551
=================
mean: 0.829376, se: 0.022551

> Confusion

 Ac\Pr |    0    1 | total
 ----- |    -    - | -----
     0 | >525   24 |   549
     1 |   49 >293 |   342
 ----- |    -    - | -----
 total |  574  317 |   891


Complete cases 891 from 891
Acc: 0.9180696         (Accuracy )
F1:  0.9349955         (F1 score / F-measure)
MCC: 0.8258652         (Matthew correlation coefficient)
Pre: 0.9146341         (Precision)
Rec: 0.9562842         (Recall)
G:   0.9352273         (G-measure)

That had indeed some effect. However, after submitting to the competition we did not see any improvement. We should now look at engineering our features a bit for further improvements.
