Feature engineering

Title feature

It is clear that we can't use the "Name" variable directly. Names are almost unique, which leaves them with very little generalization power: even if we learn that a passenger with a given name survived or not, we can't decide whether another passenger survived using only that new passenger's name.
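As a quick check of how unique the names really are, we can count the distinct values of the "Name" variable. This is a small sketch using the same stream API used later in this section, and it assumes that mapToString() yields a standard stream of strings (the filtering example further below suggests it does):

// count distinct names; a value close to the number of rows (891)
// means the variable is almost unique and generalizes poorly
long distinctNames = train.var("Name").stream()
        .mapToString()
        .distinct()
        .count();
WS.println("distinct names: " + distinctNames);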

Let's inspect some of the values of the "Name" variable.

SolidFrame.wrapOf(train.var("Name")).printLines(20);
                                 Name                          
  [0]                                 "Braund, Mr. Owen Harris"
  [1]     "Cumings, Mrs. John Bradley (Florence Briggs Thayer)"
  [2]                                  "Heikkinen, Miss. Laina"
  [3]            "Futrelle, Mrs. Jacques Heath (Lily May Peel)"
  [4]                                "Allen, Mr. William Henry"
  [5]                                        "Moran, Mr. James"
  [6]                                 "McCarthy, Mr. Timothy J"
  [7]                          "Palsson, Master. Gosta Leonard"
  [8]       "Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)"
  [9]                     "Nasser, Mrs. Nicholas (Adele Achem)"
 [10]                         "Sandstrom, Miss. Marguerite Rut"
 [11]                                "Bonnell, Miss. Elizabeth"
 [12]                          "Saundercock, Mr. William Henry"
 [13]                             "Andersson, Mr. Anders Johan"
 [14]                    "Vestrom, Miss. Hulda Amanda Adolfina"
 [15]                        "Hewlett, Mrs. (Mary D Kingcome) "
 [16]                                    "Rice, Master. Eugene"
 [17]                            "Williams, Mr. Charles Eugene"
 [18] "Vander Planke, Mrs. Julius (Emelia Maria Vandemoortele)"
 [19]                                 "Masselmani, Mrs. Fatima"

We notice that the names contain the title of the individual. This is valuable, but how can we benefit from it? First of all, notice that the format of the string is clear: space + title + dot + space. We can model it with a regular expression, or we can take a simpler but manual path. Intuition tells us that there should not be too many keys.
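For reference, the regular-expression route mentioned above could look like the sketch below (plain Java; the pattern simply encodes the space + title + dot + space format, and the rest of the section takes the manual path instead):

// requires java.util.regex.Pattern and java.util.regex.Matcher
Pattern titlePattern = Pattern.compile(" ([A-Za-z]+)\\. ");
train.getVar("Name").stream()
        .mapToString()
        .map(txt -> {
            // the first group captures the word between " " and ". ", e.g. "Mr"
            Matcher matcher = titlePattern.matcher(txt);
            return matcher.find() ? matcher.group(1) : "?";
        })
        .limit(20)
        .forEach(WS::println);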

We build a set with known keys incrementally. We then filter out the names that contain a known title and print the first twenty of the remaining ones. We already have "Mrs" and "Mr"; let's find the others.

// build incrementally a set with known keys
HashSet<String> keys = new HashSet<>();
keys.add("Mrs");
keys.add("Mr");

// filter out names with known keys
// print first twenty to inspect and see other keys
train.getVar("Name").stream()
        .mapToString()
        .filter(txt -> {
            for(String key : keys)
                if(txt.contains(" " + key + ". "))
                    return false;
            return true;
        })
        .limit(20)
        .forEach(WS::println);
"Heikkinen, Miss. Laina"
"Palsson, Master. Gosta Leonard"
"Sandstrom, Miss. Marguerite Rut"
"Bonnell, Miss. Elizabeth"
"Vestrom, Miss. Hulda Amanda Adolfina"
"Rice, Master. Eugene"
"McGowan, Miss. Anna 'Annie'"
"Palsson, Miss. Torborg Danira"
"O'Dwyer, Miss. Ellen 'Nellie'"
"Uruchurtu, Don. Manuel E"
"Glynn, Miss. Mary Agatha"
"Vander Planke, Miss. Augusta Maria"
"Nicola-Yarred, Miss. Jamila"
"Laroche, Miss. Simonne Marie Anne Andree"
"Devaney, Miss. Margaret Delia"
"O'Driscoll, Miss. Bridget"
"Panula, Master. Juha Niilo"
"Rugg, Miss. Emily"
"West, Miss. Constance Mirium"
"Goodwin, Master. William Frederick"

We narrowed the search and found other titles such as "Miss" and "Master". We arrive at the following set of keys:

HashSet<String> keys = new HashSet<>();
keys.addAll(Arrays.asList(
        "Mrs", "Mme", "Lady", "Countess", "Mr", "Sir",
        "Don", "Ms", "Miss", "Mlle", "Master", "Dr",
        "Col", "Major", "Jonkheer", "Capt", "Rev"));

Nominal title = train.var("Name").stream()
        .mapToString()
        .map(txt -> {
            for(String key : keys)
                if(txt.contains(" " + key + ". "))
                    return key;
            return "?";
        })
        .collect(Nominal.collector());
DVector.fromCount(true, title).printSummary();
     ?      Mr     Mrs    Miss Master   Don   Rev    Dr   Mme    Ms Major  Lady   Sir  Mlle   Col  Capt Countess Jonkheer
     -      --     ---    ---- ------   ---   ---    --   ---    -- -----  ----   ---  ----   ---  ---- -------- --------
 0.000 517.000 125.000 182.000 40.000 1.000 6.000 7.000 1.000 1.000 2.000 1.000 1.000 2.000 2.000 1.000    1.000    1.000

We note that we have exhausted the training data, and this is enough. It is possible that new titles appear in the test data; we will treat those as missing values, which is why we return "?" when no match is found. Another thing to notice is that some of the labels have very few appearances. We will merge those into broader categories.

Another useful facility built into rapaio is filters. There are two types of filters: variable filters and frame filters. The nice part of frame filters is that learning algorithms can use them naturally to perform feature transformations on data. From the learning algorithm's perspective, this kind of filter is called an input filter. It is important to know that input filters transform features before the train phase and also during the fit phase.
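To make this concrete, here is a minimal sketch of the two ways a frame filter can be used. It relies only on calls that appear elsewhere in this section (applyFilters on a frame, withInputFilters on a classifier); treat it as an illustration rather than canonical usage:

// a frame filter instance; the TitleFilter defined below is used as a placeholder
FFilter someFilter = new TitleFilter();

// 1. apply the filter manually and obtain a transformed frame
Frame transformed = train.applyFilters(new FFilter[]{someFilter});

// 2. attach the filter to a classifier as an input filter; it is then applied
//    automatically before training and again when fitting new data
CForest rf = CForest.newRF().withInputFilters(someFilter);
rf.train(train, "Survived");   // the filter transforms the training frame first
CFit fit = rf.fit(test);       // the same transformation is applied to the test frame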

We will build such a frame filter to create a new feature.

/**
 * Frame filter which adds a title variable based on name variable
 */
class TitleFilter implements FFilter {

    private static final long serialVersionUID = -3496753631972757415L;

    private HashMap<String, String[]> replaceMap = new HashMap<>();
    private Function<String, String> titleFun = txt -> {
        for (Map.Entry<String, String[]> e : replaceMap.entrySet()) {
            for (int i = 0; i < e.getValue().length; i++) {
                if (txt.contains(" " + e.getValue()[i] + ". "))
                    return e.getKey();
            }
        }
        return "?";
    };

    @Override
    public void fit(Frame df) {
        replaceMap.put("Mrs", new String[]{"Mrs", "Mme", "Lady", "Countess"});
        replaceMap.put("Mr", new String[]{"Mr", "Sir", "Don", "Ms"});
        replaceMap.put("Miss", new String[]{"Miss", "Mlle"});
        replaceMap.put("Master", new String[]{"Master"});
        replaceMap.put("Dr", new String[]{"Dr"});
        replaceMap.put("Military", new String[]{"Col", "Major", "Jonkheer", "Capt"});
        replaceMap.put("Rev", new String[]{"Rev"});
    }

    @Override
    public Frame apply(Frame df) {
        NominalVar title = NominalVar.empty(0, new ArrayList<>(replaceMap.keySet())).withName("Title");
        df.var("Name").stream().mapToString().forEach(name -> title.addLabel(titleFun.apply(name)));
        return df.bindVars(title);
    }
}

Now let's try a new random forest on the reduced data set, this time with the title feature included.

RandomSource.setSeed(123);
CForest rf = CForest.newRF()
        .withInputFilters(
                new TitleFilter(),
                new FFMapVars("Survived,Sex,Pclass,Embarked,Title")
        )
        .withClassifier(CTree.newCART().withMinCount(3))
        .withRuns(100);
cv(train, rf);

rf.train(train, "Survived");
rf.printSummary();
CFit fit = rf.fit(test);
new Confusion(train.var("Survived"), rf.fit(train).firstClasses()).printSummary();
new Csv().withQuotes(false).write(SolidFrame.wrapOf(
        test.var("PassengerId"),
        fit.firstClasses().withName("Survived")
), root + "rf2-submit.csv");
Cross validation 10-fold
CV fold: 1, acc: 0.833333, mean: 0.833333, se: NaN
CV fold: 2, acc: 0.831461, mean: 0.832397, se: 0.001324
CV fold: 3, acc: 0.797753, mean: 0.820849, se: 0.020024
CV fold: 4, acc: 0.820225, mean: 0.820693, se: 0.016352
CV fold: 5, acc: 0.764045, mean: 0.809363, se: 0.029023
CV fold: 6, acc: 0.820225, mean: 0.811174, se: 0.026335
CV fold: 7, acc: 0.887640, mean: 0.822097, se: 0.037593
CV fold: 8, acc: 0.820225, mean: 0.821863, se: 0.034811
CV fold: 9, acc: 0.820225, mean: 0.821681, se: 0.032567
CV fold:10, acc: 0.831461, mean: 0.822659, se: 0.030860
=================
mean: 0.822659, se: 0.030860

> Confusion

 Ac\Pr |    0    1 | total
 ----- |    -    - | -----
     0 | >520   29 |   549
     1 |  123 >219 |   342
 ----- |    -    - | -----
 total |  643  248 |   891


Complete cases 891 from 891
Acc: 0.8294052         (Accuracy )
F1:  0.8724832         (F1 score / F-measure)
MCC: 0.6375234         (Matthew correlation coefficient)
Pre: 0.8087092         (Precision)
Rec: 0.9471767         (Recall)
G:   0.8752088         (G-measure)

Now that looks definitely better than our previous best classifier. We submit it to Kaggle to see the improvement.

Figure 1.8.5.1 Progress after incorporating the title feature into the input features

Other features

Various authors have published their work on solving this Kaggle competition. The most interesting part of their work is the feature engineering. I develop some of those ideas here to show how one can do this with the library.

Family size

Using the "SibSp" and "Parch" fields directly yields little value for a random forest classifier. Studying these two features, it looks like they can be combined into a single one by summation, which gives us a family size estimate: for example, a passenger with SibSp=1 and Parch=2 gets a family size of 1 + 1 + 2 = 4, counting the passenger as well.

To get an idea of the value of this new feature I used a chi-square independence test. The idea is to check whether the two features taken separately are worth less than their combination.

// convert sibsp and parch to nominal types to be able to use a chi-square test
NominalVar sibsp = NominalVar.from(train.getRowCount(), row -> train.getLabel(row, "SibSp"));
NominalVar parch = NominalVar.from(train.getRowCount(), row -> train.getLabel(row, "Parch"));

// test individually each feature
ChiSquareTest.independence(train.getVar("Survived"), sibsp).printSummary();
ChiSquareTest.independence(train.getVar("Survived"), parch).printSummary();

// build a combined feature by summation, as nominal
NominalVar familySize = NominalVar.from(train.getRowCount(),
    row -> "" + (1 + train.getIndex(row, "SibSp") + train.getIndex(row, "Parch")));

// run the chi-square test on the summation
ChiSquareTest.independence(train.getVar("Survived"), familySize).printSummary();
> ChiSquareTest.independence 

        Pearson’s Chi-squared test 

data:  
           1.0     0.0    3.0    4.0    2.0   5.0   8.0   total
     0  97.000 398.000 12.000 15.000 15.000 5.000 7.000 549.000
     1 112.000 210.000  4.000  3.000 13.000 0.000 0.000 342.000
 total 209.000 608.000 16.000 18.000 28.000 5.000 7.000 891.000


X-squared = 37.2717929, df = 6, p-value = 1.5585810465568173E-6

> ChiSquareTest.independence 

        Pearson’s Chi-squared test 

data:  
           0.0     1.0    2.0   5.0   3.0   4.0   6.0   total
     0 445.000  53.000 40.000 4.000 2.000 4.000 1.000 549.000
     1 233.000  65.000 40.000 1.000 3.000 0.000 0.000 342.000
 total 678.000 118.000 80.000 5.000 5.000 4.000 1.000 891.000


X-squared = 27.9257841, df = 6, p-value = 9.703526421045439E-5

> ChiSquareTest.independence 

        Pearson’s Chi-squared test 

data:  
             2       1      5       3      7      6      4     8    11   total
     0  72.000 374.000 12.000  43.000  8.000 19.000  8.000 6.000 7.000 549.000
     1  89.000 163.000  3.000  59.000  4.000  3.000 21.000 0.000 0.000 342.000
 total 161.000 537.000 15.000 102.000 12.000 22.000 29.000 6.000 7.000 891.000


X-squared = 80.6723134, df = 8, p-value = 3.574918139293004E-14

How can we interpret the results? The tests say that each feature brings value separately: the p-values of the first two tests (about 1.6e-6 and 9.7e-5) can be considered significant, which means there is strong evidence that those features are not independent of the target class. As a conclusion, both features are useful. The last test is made on their summation, and it is even more significant (p-value around 3.6e-14) than the previous two. As a consequence, we can use the summation instead of the two values taken independently.
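If you want to verify the statistic by hand, here is a minimal sketch (plain Java, no library calls) that recomputes Pearson's chi-square from the FamilySize contingency table printed above. The expected count of each cell is rowTotal * colTotal / grandTotal:

// observed counts copied from the FamilySize table printed above
double[][] observed = {
        {72, 374, 12, 43, 8, 19, 8, 6, 7},   // Survived = 0
        {89, 163,  3, 59, 4,  3, 21, 0, 0}   // Survived = 1
};
int rows = observed.length;
int cols = observed[0].length;
double[] rowTotal = new double[rows];
double[] colTotal = new double[cols];
double grandTotal = 0;
for (int i = 0; i < rows; i++) {
    for (int j = 0; j < cols; j++) {
        rowTotal[i] += observed[i][j];
        colTotal[j] += observed[i][j];
        grandTotal += observed[i][j];
    }
}

// X-squared is the sum over all cells of (observed - expected)^2 / expected
double x2 = 0;
for (int i = 0; i < rows; i++) {
    for (int j = 0; j < cols; j++) {
        double expected = rowTotal[i] * colTotal[j] / grandTotal;
        x2 += Math.pow(observed[i][j] - expected, 2) / expected;
    }
}
int df = (rows - 1) * (cols - 1);
System.out.println("X-squared = " + x2 + ", df = " + df);

The degrees of freedom follow the usual (rows - 1) * (columns - 1) rule, which matches the df = 8 reported above.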

Cabin and Ticket

It seems that the cabin and ticket denominations are not useful as they are, for various reasons. First of all, they have many missing values. But a stronger reason is that both have too many levels to provide a solid base for generalization.

If we take only the first letter of each of those two fields, more generalization can happen. This is probably because some location information is encoded there: perhaps the deck, the comfort level, or some auxiliary function. As a conclusion, it is worth a try, so we proceed with it.

We can combine all of these transformations into a single filter or into several filters applied to the data. I chose to write a single custom filter that handles all of them.

class CustomFilter implements FFilter {

    private static final long serialVersionUID = -3496753631972757415L;

    private HashMap<String, String[]> replaceMap = new HashMap<>();
    private Function<String, String> titleFun = txt -> {
        for (Map.Entry<String, String[]> e : replaceMap.entrySet()) {
            for (int i = 0; i < e.getValue().length; i++) {
                if (txt.contains(" " + e.getValue()[i] + ". "))
                    return e.getKey();
            }
        }
        return "?";
    };

    @Override
    public void fit(Frame df) {
        replaceMap.put("Mrs", new String[]{"Mrs", "Mme", "Lady", "Countess"});
        replaceMap.put("Mr", new String[]{"Mr", "Sir", "Don", "Ms"});
        replaceMap.put("Miss", new String[]{"Miss", "Mlle"});
        replaceMap.put("Master", new String[]{"Master"});
        replaceMap.put("Dr", new String[]{"Dr"});
        replaceMap.put("Military", new String[]{"Col", "Major", "Jonkheer", "Capt"});
        replaceMap.put("Rev", new String[]{"Rev"});
    }

    @Override
    public Frame apply(Frame df) {

        NominalVar title = NominalVar.empty(0, new ArrayList<>(replaceMap.keySet())).withName("Title");
        df.getVar("Name").stream().mapToString().forEach(name -> title.addLabel(titleFun.apply(name)));

        Var famSize = NumericVar.from(df.rowCount(), row ->
                1.0 + df.getIndex(row, "SibSp") + df.getIndex(row, "Parch")
        ).withName("FamilySize");

        Var ticket = NominalVar.from(df.rowCount(), row ->
                df.isMissing(row, "Ticket") ? "?" : df.getLabel(row, "Ticket")
                    .substring(0, 1).toUpperCase()
        ).withName("Ticket");

        Var cabin = NominalVar.from(df.rowCount(), row ->
                df.isMissing(row, "Cabin") ? "?" : (df.getLabel(row, "Cabin")
                    .substring(0, 1).toUpperCase())
        ).withName("Cabin");

        return df.removeVars("Ticket,Cabin").bindVars(famSize, ticket, cabin, title).solidCopy();
    }
}

Another try with random forests

So we have some new features and we want to learn from them. We can use a previous classifier, such as a random forest, to test them before submitting.

But we know that with a random forest we are in danger of overfitting. One idea is to transform numeric features into nominal ones through a process named discretization. For this purpose we use a filter from the library called FFQuantileDiscrete. This filter computes a given number of quantile intervals and labels each numerical value according to the interval it falls into. A conceptual sketch follows, and after that let's see how we proceed and how the transformed data looks:
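The sketch below illustrates what quantile discretization does conceptually. It is plain Java with a crude nearest-rank quantile estimate, not the library's implementation:

// illustration only: split a numeric variable into k quantile bins and label each
// value with the interval it falls into, similar to the "-Inf~14" or "31.8~36"
// labels visible in the summary below
double[] values = {22, 38, 26, 35, 35, 54, 2, 27, 14, 4}; // a few example ages
int k = 4;                                                 // number of quantile intervals

double[] sorted = values.clone();
java.util.Arrays.sort(sorted);

// cut points at the empirical 1/k, 2/k, ..., (k-1)/k quantiles (nearest-rank estimate)
double[] cuts = new double[k - 1];
for (int i = 1; i < k; i++) {
    cuts[i - 1] = sorted[i * (sorted.length - 1) / k];
}

// assign each value the label of the interval it falls into
for (double v : values) {
    int bin = 0;
    while (bin < cuts.length && v > cuts[bin]) {
        bin++;
    }
    String low = bin == 0 ? "-Inf" : String.valueOf(cuts[bin - 1]);
    String high = bin == cuts.length ? "Inf" : String.valueOf(cuts[bin]);
    System.out.println(v + " -> " + low + "~" + high);
}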

FFilter[] inputFilters = new FFilter[]{
        new CustomFilter(),
        new FFQuantileDiscrete(10, "Age"),
        new FFQuantileDiscrete(10, "Fare"),
        new FFQuantileDiscrete(3, "SibSp"),
        new FFQuantileDiscrete(3, "Parch"),
        new FFQuantileDiscrete(8, "FamilySize"),
        new FFMapVars("Survived,Sex,Pclass,Embarked,Title,Age,Fare,FamilySize,Ticket,Cabin")
};

// print a summary of the transformed data
train.applyFilters(inputFilters).printSummary();
> printSummary(frame, [Survived, Sex, Pclass, Embarked, Title, Age, Fare, FamilySize, 
Ticket, Cabin])
rowCount: 891
complete: 183/891
varCount: 10
varNames: 

 0. Survived : NOMINAL |  4.      Title : NOMINAL |  8. Ticket : NOMINAL | 
 1.      Sex : NOMINAL |  5.        Age : NOMINAL |  9.  Cabin : NOMINAL | 
 2.   Pclass : NUMERIC |  6.       Fare : NOMINAL |                        
 3. Embarked : NOMINAL |  7. FamilySize : NOMINAL |                        

Survived           Sex           Pclass    Embarked          Title            Age 
 0 : 549    male : 577     Min. : 1.000     S : 644       Mr : 520  31.8~36 :  91 
 1 : 342  female : 314  1st Qu. : 2.000     C : 168     Miss : 184    14~19 :  87 
                         Median : 3.000     Q :  77      Mrs : 128    41~50 :  78 
                           Mean : 2.309  NA's :   2   Master :  40  -Inf~14 :  77 
                        2nd Qu. : 3.000                   Dr :   7    22~25 :  70 
                           Max. : 3.000                  Rev :   6  (Other) : 244 
                                                     (Other) :   6     NA's : 177 
               Fare    FamilySize         Ticket          Cabin 
   7.854~8.05 : 106  -Inf~1 : 537        3 : 301        C :  59 
    -Inf~7.55 :  92     1~2 : 161        2 : 183        B :  47 
    27~39.688 :  91     2~3 : 102        1 : 146        D :  33 
    21.679~27 :  89   3~Inf :  91        P :  65        E :  32 
39.688~77.958 :  89                      S :  65        A :  15 
14.454~21.679 :  88                      C :  47  (Other) :   5 
      (Other) : 336                (Other) :  84     NA's : 687

We can see that the "Age" values are now intervals, and that missing values are still present.

The numbers of quantiles chosen are more or less arbitrary. There are no values which are good in general, only values which work for specific purposes.

As promised, we give another random forest a try to see if it generalizes better.

RandomSource.setSeed(123);

CForest model = CForest.newRF()
        .withInputFilters(inputFilters)
        .withMCols(4)
        .withBootstrap(0.7)
        .withClassifier(CTree.newCART()
            .withFunction(CTreePurityFunction.GainRatio).withMinGain(0.001))
        .withRuns(200);
model.train(train, "Survived");
CFit fit = model.fit(test);
new Confusion(train.getVar("Survived"), model.fit(train).firstClasses()).printSummary();
cv(train, model);

I tried some ideas to make the forest generalize better:

  • Smaller bootstrap percentage - this could lead to increased independence between trees
  • Use GainRatio as the purity function, because it is sometimes more conservative
  • Use MinGain to avoid growing trees with many leaves that contain a single instance
  • Use mCols=4, the number of variables tested at each split - more than the default value, to improve the quality of each tree
> Confusion

 Ac\Pr |    0    1 | total
 ----- |    -    - | -----
     0 | >532   17 |   549
     1 |   33 >309 |   342
 ----- |    -    - | -----
 total |  565  326 |   891


Complete cases 891 from 891
Acc: 0.9438833         (Accuracy )
F1:  0.9551167         (F1 score / F-measure)
MCC: 0.880954         (Matthew correlation coefficient)
Pre: 0.9415929         (Precision)
Rec: 0.9690346         (Recall)
G:   0.9552152         (G-measure)
Cross validation 10-fold
CV fold: 1, acc: 0.833333, mean: 0.833333, se: NaN
CV fold: 2, acc: 0.786517, mean: 0.809925, se: 0.033104
CV fold: 3, acc: 0.842697, mean: 0.820849, se: 0.030099
CV fold: 4, acc: 0.898876, mean: 0.840356, se: 0.046109
CV fold: 5, acc: 0.831461, mean: 0.838577, se: 0.040129
CV fold: 6, acc: 0.842697, mean: 0.839263, se: 0.035932
CV fold: 7, acc: 0.853933, mean: 0.841359, se: 0.033267
CV fold: 8, acc: 0.764045, mean: 0.831695, se: 0.041179
CV fold: 9, acc: 0.842697, mean: 0.832917, se: 0.038694
CV fold:10, acc: 0.842697, mean: 0.833895, se: 0.036612
=================
mean: 0.833895, se: 0.036612

These are the results. At first look the training performance might seem astonishing. But we know that the irreducible error for this data set is high and is close to . The cross-validation results tell a different story: it seems obvious that we failed to reduce the variance and that we still overfit a lot with this construct. Since this is a tutorial I will not insist on improving this model, but I think that even if it were improved, the gain would be very small. Perhaps another approach would be better.
