Data sets

For the purpose of this book some built-in data sets are used. Most of the built in data sets are available via rapaio.datasets.Datasets class. This is an utility class which provides some standard data sets used in many statistical and machine learning text books. There are a few which we will use for the purpose of this tutorial.

Iris data set

The Iris flower data set or Fisher's Iris data set is a multivariate data set introduced by Ronald Fisher in his 1936 paper The use of multiple measurements in taxonomic problems as an example of linear discriminant analysis. It is sometimes called Anderson's Iris data set because Edgar Anderson collected the data to quantify the morphological variation of Iris flowers of three related species.

Two of the three species were collected in the Gaspé Peninsula all from the same pasture, and picked on the same day and measured at the same time by the same person with the same apparatus.

The data set consists of 50 samples from each of three species of Iris (Iris setosa, Iris virginica and Iris versicolor). There are for measures for each sample: the length and the width of the sepals and petals, in centimeters. Based on the combination of these four features, Fisher developed a linear discriminant model to distinguish the species from each other.

Frame iris = Datasets.loadIrisDataset();
iris.printSummary();

This is the summary of this data set:

Frame Summary
=============
* rowCount: 150
* complete: 150/150
* varCount: 5
* varNames: 

 0. sepal-length : num | 2. petal-length : num | 4. class : nom |
 1.  sepal-width : num | 3.  petal-width : num |                 

   sepal-length      sepal-width     petal-length      petal-width            class 
   Min. : 4.300     Min. : 2.000     Min. : 1.000     Min. : 0.100      setosa : 50 
1st Qu. : 5.100  1st Qu. : 2.800  1st Qu. : 1.600  1st Qu. : 0.300  versicolor : 50 
 Median : 5.800   Median : 3.000   Median : 4.350   Median : 1.300   virginica : 50 
   Mean : 5.843     Mean : 3.057     Mean : 3.758     Mean : 1.199                  
2nd Qu. : 6.400  2nd Qu. : 3.300  2nd Qu. : 5.100  2nd Qu. : 1.800                  
   Max. : 7.900     Max. : 4.400     Max. : 6.900     Max. : 2.500

Pearson's Height Data

This simple data set comes from a famous experiment by Karl Pearson around 1903. The number of cases is 1078. The original data values were rounded to produce heights to the nearest inch.

Frame fs = Datasets.loadPearsonHeightDataset();
fs.printSummary();

The summary of the data set shows the two numeric variables with heights for fathers and sons.

Frame Summary
=============
* rowCount: 1078
* complete: 1078/1078
* varCount: 2
* varNames: 

 0. Father : num | 1. Son : num |

          Father               Son 
   Min. : 59.000     Min. : 58.500 
1st Qu. : 65.800  1st Qu. : 66.900 
 Median : 67.800   Median : 68.600 
   Mean : 67.687     Mean : 68.684 
2nd Qu. : 69.600  2nd Qu. : 70.500 
   Max. : 75.400     Max. : 78.400
Classical father and son heights based on Pearson's original experiment from 1907
Figure 1.3.1 Classical father and son heights based on Pearson's original experiment from 1907

results matching ""

    No results matching ""