Variables, frames and data manipulation

Two main data structures are used throughout the library: variables and frames. A variable is a list of values, which you can think of as a column in a tabular data format. A set of variables is a data frame, which you can think of as a table with rows for observations and columns for variables.
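To make the column-oriented picture concrete, here is a minimal sketch in plain Java. The `FrameSketch` class and its methods are hypothetical names invented for this illustration; they are not the library's `Frame` API, and the values are only illustrative. The sketch stores each variable as a named list and requires all variables to have the same number of rows.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a data frame: an ordered set of named variables.
public class FrameSketch {
    // Variable name -> variable (a list of values, one per observation).
    private final Map<String, List<Object>> vars = new LinkedHashMap<>();

    // Adds a variable; every variable must have one value per row.
    public void addVar(String name, List<?> values) {
        if (!vars.isEmpty() && values.size() != rowCount()) {
            throw new IllegalArgumentException("all variables must have the same length");
        }
        vars.put(name, new ArrayList<Object>(values));
    }

    // Number of observations (rows).
    public int rowCount() {
        return vars.isEmpty() ? 0 : vars.values().iterator().next().size();
    }

    // Number of variables (columns).
    public int varCount() {
        return vars.size();
    }

    // Value of one variable for one observation.
    public Object get(int row, String varName) {
        return vars.get(varName).get(row);
    }

    public static void main(String[] args) {
        FrameSketch df = new FrameSketch();
        df.addVar("sepal-length", List.of(5.1, 4.9, 4.7));        // numeric variable
        df.addVar("class", List.of("setosa", "setosa", "setosa")); // nominal variable
        System.out.println(df.rowCount() + " rows, " + df.varCount() + " variables");
        System.out.println(df.get(1, "sepal-length"));
    }
}
```

Storing data by column rather than by row keeps each variable's values together, which suits per-variable operations such as the summaries discussed in this section.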

Let's take a simple example. We will load the iris data set, which is already contained in the library.

Frame df = Datasets.loadIrisDataset();
df.printSummary();
Frame Summary
=============
* rowCount: 150
* complete: 150/150
* varCount: 5
* varNames: 

 0. sepal-length : num | 2. petal-length : num | 4. class : nom |
 1.  sepal-width : num | 3.  petal-width : num |                 

   sepal-length      sepal-width     petal-length      petal-width            class 
   Min. : 4.300     Min. : 2.000     Min. : 1.000     Min. : 0.100      setosa : 50 
1st Qu. : 5.100  1st Qu. : 2.800  1st Qu. : 1.600  1st Qu. : 0.300  versicolor : 50 
 Median : 5.800   Median : 3.000   Median : 4.350   Median : 1.300   virginica : 50 
   Mean : 5.843     Mean : 3.057     Mean : 3.758     Mean : 1.199                  
2nd Qu. : 6.400  2nd Qu. : 3.300  2nd Qu. : 5.100  2nd Qu. : 1.800                  
   Max. : 7.900     Max. : 4.400     Max. : 6.900     Max. : 2.500

The frame summary is a simple way to see general information about a data frame. Here we see that the data frame contains 150 observations and five variables.

The listing continues by enumerating the name and type of each variable in the data frame. Notice that there are four numeric variables and one nominal variable, named class.

The summary listing ends with a section which describes each variable. For numeric variables the summary contains a six-number summary: some sample order statistics and the sample mean. These are the minimum and maximum values, the median, the first and third quartiles (the third quartile appears under the label "2nd Qu." in the listing), and the mean. For nominal variables we have an enumeration of the most frequent levels and their associated counts. For our class variable we see that there are three levels, each with 50 instances.
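The statistics in such a summary can be sketched in a few lines of plain Java. This is an illustration only, not the library's implementation: the class `SummarySketch` and its methods are hypothetical, and the quantile rule (linear interpolation between order statistics) is an assumption, since the text does not say which rule the library uses. The input array is assumed to be non-empty.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class SummarySketch {

    // Quantile of an already sorted array, by linear interpolation
    // between neighboring order statistics (assumed rule).
    public static double quantile(double[] sorted, double p) {
        double pos = p * (sorted.length - 1);
        int lo = (int) Math.floor(pos);
        int hi = (int) Math.ceil(pos);
        return sorted[lo] + (pos - lo) * (sorted[hi] - sorted[lo]);
    }

    // Returns {min, 1st quartile, median, mean, 3rd quartile, max}.
    public static double[] sixNumberSummary(double[] values) {
        double[] s = values.clone();
        Arrays.sort(s);
        double mean = Arrays.stream(s).average().orElse(Double.NaN);
        return new double[] {
            s[0], quantile(s, 0.25), quantile(s, 0.5), mean, quantile(s, 0.75), s[s.length - 1]
        };
    }

    // Counts how many times each level occurs in a nominal variable.
    public static Map<String, Integer> levelCounts(List<String> labels) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String label : labels) {
            counts.merge(label, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        double[] x = {4.0, 1.0, 3.0, 2.0, 5.0};
        System.out.println(Arrays.toString(sixNumberSummary(x)));
        System.out.println(levelCounts(List.of("a", "b", "a")));
    }
}
```

The numeric summary is returned in the same order as the rows of the listing: minimum, first quartile, median, mean, third quartile, maximum.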
