Variables, frames and data manipulation
Two data structures are used throughout the library: variables and frames. A variable is a list of values; you can think of it as a column in a tabular data format. A set of variables is a data frame, which you can think of as a table with rows for observations and columns for variables.
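To make the column-and-table picture concrete, here is a minimal sketch in plain Java. The class `SimpleFrame` and its methods are hypothetical names invented for this illustration; they are not the library's own frame or variable types, and the library's actual design may differ.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only: NOT the library's frame or variable types.
class SimpleFrame {
    // Each entry maps a variable name to its column of values, in insertion order.
    private final Map<String, List<Object>> columns = new LinkedHashMap<>();

    // Adds a variable (column); every variable must have the same number of rows.
    void addVariable(String name, List<?> values) {
        if (!columns.isEmpty() && values.size() != rowCount()) {
            throw new IllegalArgumentException("all variables must have the same number of rows");
        }
        columns.put(name, new ArrayList<Object>(values));
    }

    // Number of observations (rows).
    int rowCount() {
        return columns.isEmpty() ? 0 : columns.values().iterator().next().size();
    }

    // Number of variables (columns).
    int varCount() {
        return columns.size();
    }

    // Reads one cell: the value of the named variable for the given observation.
    Object get(int row, String name) {
        return columns.get(name).get(row);
    }
}
```

In this sketch a frame is nothing more than a set of equally long named columns, so reading one observation means reading the same row index from every variable.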
Let's take a simple example. We will load the iris data set, which is already contained in the library.
// Load the iris data set that is bundled with the library.
Frame df = Datasets.loadIrisDataset();
// Print the frame summary.
df.printSummary();
Frame Summary
=============
* rowCount: 150
* complete: 150/150
* varCount: 5
* varNames:
0. sepal-length : num | 2. petal-length : num | 4. class : nom |
1. sepal-width  : num | 3. petal-width  : num |
sepal-length     sepal-width      petal-length     petal-width      class
Min.    : 4.300  Min.    : 2.000  Min.    : 1.000  Min.    : 0.100  setosa     : 50
1st Qu. : 5.100  1st Qu. : 2.800  1st Qu. : 1.600  1st Qu. : 0.300  versicolor : 50
Median  : 5.800  Median  : 3.000  Median  : 4.350  Median  : 1.300  virginica  : 50
Mean    : 5.843  Mean    : 3.057  Mean    : 3.758  Mean    : 1.199
2nd Qu. : 6.400  2nd Qu. : 3.300  2nd Qu. : 5.100  2nd Qu. : 1.800
Max.    : 7.900  Max.    : 4.400  Max.    : 6.900  Max.    : 2.500
The frame summary is a quick way to see general information about a data frame. Here the frame contains 150 observations (rowCount), all 150 of them complete, and five variables (varCount).
The listing continues by enumerating the name and type of each variable in the data frame. Notice that there are four numeric variables (num) and one nominal variable (nom), named class.
The summary listing ends with a section that describes each variable. For numeric variables it contains a six-number summary made of sample order statistics and the sample mean: the minimum and maximum values, the median, the first and third quartiles (the latter is labelled 2nd Qu. in the listing above), and the mean. For nominal variables we have an enumeration of the most frequent levels and their associated counts. For our class variable we see that there are three levels, each with 50 instances.
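The two kinds of per-variable summary described above can be sketched in plain Java as follows. This is an illustration under stated assumptions, not the library's implementation: the class `SummarySketch` and its methods are hypothetical, and the quartiles use linear interpolation between order statistics, which is only one of several common quantile conventions and may not match the one behind the listing.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only: NOT the library's summary code.
class SummarySketch {
    // Quantile of an ascending-sorted array, p in [0, 1], using linear
    // interpolation between neighbouring order statistics (an assumed convention).
    static double quantile(double[] sorted, double p) {
        double pos = p * (sorted.length - 1);
        int lo = (int) Math.floor(pos);
        int hi = (int) Math.ceil(pos);
        return sorted[lo] + (pos - lo) * (sorted[hi] - sorted[lo]);
    }

    // Returns {min, first quartile, median, mean, third quartile, max}.
    static double[] sixNumberSummary(double[] values) {
        if (values.length == 0) {
            throw new IllegalArgumentException("values must not be empty");
        }
        double[] sorted = values.clone();
        Arrays.sort(sorted);
        double sum = 0.0;
        for (double v : sorted) {
            sum += v;
        }
        return new double[] {
            sorted[0],
            quantile(sorted, 0.25),
            quantile(sorted, 0.50),
            sum / sorted.length,
            quantile(sorted, 0.75),
            sorted[sorted.length - 1]
        };
    }

    // Counts how often each level occurs, ordered from most to least frequent.
    // Ties keep the order in which the levels were first seen.
    static Map<String, Integer> levelCounts(List<String> labels) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String label : labels) {
            counts.merge(label, 1, Integer::sum);
        }
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> Integer.compare(b.getValue(), a.getValue()));
        Map<String, Integer> ordered = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> e : entries) {
            ordered.put(e.getKey(), e.getValue());
        }
        return ordered;
    }
}
```

Applied to a numeric column, `sixNumberSummary` yields the six values shown per variable; applied to a nominal column such as class, `levelCounts` yields one count per level.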