Variables

In statistics, the term variable has multiple meanings. A random variable is a process which produces values according to a distribution. Random variables can have multiple dimensions, so they can be grouped in vectors. The rapaio library does not model the concept of a random variable. Instead, a Var object models the concept of values drawn from a unidimensional random variable, in other words a sample of values. As a consequence, a Var object has a size and uses indexes to access the values from the sample.

A rapaio variable is a vector of values which have the same type and share the same meaning. Each variable implements an interface called Var. The Var interface declares methods for several kinds of tasks:

  • manipulating values: adding, removing, inserting and updating values through different representations
  • naming: a name offers an alternate way to identify a variable in a frame and is also useful for readable output
  • manipulating sets of values: concatenating variables by binding rows and filtering out values by mapping
  • streaming: traversing a variable using Java 8 streams
  • other tools: deep copy, deep compare, summary, etc.
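The ideas in the list above (a named sample with a size, index-based access, binding rows, filtering by mapping, and stream traversal) can be illustrated with a small self-contained sketch. The Sample class and all of its method names are hypothetical and exist only for this illustration; they are not the methods of the rapaio Var interface.

```java
import java.util.Arrays;
import java.util.stream.DoubleStream;

public class SampleSketch {

    // Hypothetical, minimal model of a named sample of double values.
    // This is NOT rapaio's Var; names are assumptions made for illustration.
    static final class Sample {
        private final String name;
        private final double[] values;

        Sample(String name, double... values) {
            this.name = name;
            this.values = Arrays.copyOf(values, values.length);
        }

        String name() { return name; }

        // A sample has a size and is accessed by index.
        int size() { return values.length; }
        double get(int row) { return values[row]; }

        // Concatenation: rows of this sample followed by rows of the other one.
        Sample bindRows(Sample other) {
            double[] all = Arrays.copyOf(values, values.length + other.values.length);
            System.arraycopy(other.values, 0, all, values.length, other.values.length);
            return new Sample(name, all);
        }

        // Filtering by mapping: keep only the rows at the given positions.
        Sample mapRows(int... rows) {
            return new Sample(name, Arrays.stream(rows).mapToDouble(r -> values[r]).toArray());
        }

        // Traversal with Java 8 streams.
        DoubleStream stream() { return Arrays.stream(values); }
    }

    public static void main(String[] args) {
        Sample a = new Sample("x", 1.0, 2.0, 3.0);
        Sample b = a.bindRows(new Sample("x", 4.0));
        Sample c = b.mapRows(0, 3);
        System.out.println(b.size());          // prints 4
        System.out.println(c.stream().sum());  // prints 5.0
    }
}
```

The sketch copies data on every operation to keep it short; a real library may well share storage between a variable and its mapped views instead.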

VarType: storage and representation of a variable

There are two main concepts which have to be understood when working with variables: storage and representation. Every variable stores its data internally using a certain Java data type, for example double, int, String, etc. A variable uses a single Java data type for storage and internal manipulation. At the same time, the data from a variable can be represented in different ways, all of them exposed through the Var interface for all variables.

However, not all representations are possible for all types of variables, because some of them do not make sense. For example, double floating point values can always be represented as strings, but strings in general cannot be represented as double values.

These are the data representations which a variable can implement:

  • value - double
  • label - String
  • index - int
  • stamp - long / Instant
  • binary - boolean

The Var interface offers methods to get, update and insert values in each of these data representations. Again, notice that not all data representations are available for all variables. The label representation is the exception: it is available for all sorts of variables. This is acceptable, since when information is stored in a text-like data format, any data type has to be transformed into a string and read back from a string representation.
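The difference between storage and representation can be sketched as follows. The SimpleVar interface and both implementations are hypothetical stand-ins written for this example, not the actual rapaio types: one variable stores doubles and supports both the value and the label representation, while the other stores free-form strings and rejects the value representation.

```java
import java.util.Arrays;

public class RepresentationSketch {

    // Hypothetical, reduced stand-in for a variable interface (assumption,
    // not rapaio's Var): one storage type, several representations.
    interface SimpleVar {
        int size();
        double value(int row);   // "value" representation
        String label(int row);   // "label" representation
    }

    // Storage: double. Both representations make sense.
    static final class DoubleBacked implements SimpleVar {
        private final double[] data;

        DoubleBacked(double... data) { this.data = Arrays.copyOf(data, data.length); }

        public int size() { return data.length; }
        public double value(int row) { return data[row]; }
        public String label(int row) { return Double.toString(data[row]); }
    }

    // Storage: String. Free-form text has no meaningful numeric representation.
    static final class TextBacked implements SimpleVar {
        private final String[] data;

        TextBacked(String... data) { this.data = data.clone(); }

        public int size() { return data.length; }
        public double value(int row) {
            throw new UnsupportedOperationException("text has no value representation");
        }
        public String label(int row) { return data[row]; }
    }

    public static void main(String[] args) {
        SimpleVar x = new DoubleBacked(1.5, 2.0);
        SimpleVar t = new TextBacked("hello", "world");
        System.out.println(x.label(0)); // prints 1.5
        System.out.println(t.label(1)); // prints world
    }
}
```

Throwing an exception for an unsupported representation is only one possible design; how rapaio itself reports such cases is not stated here.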

To accommodate all those legal possibilities, the rapaio library has a set of predefined variable types, which can be found in the enum VarType.

The defined variable types:

  • BINARY - binary values
  • INDEX - integer values
  • NOMINAL - string values from a predefined set of values, with no ordering (for example: male, female)
  • ORDINAL - string values from a predefined set of values, with ordering assigned (for example: low, medium and high)
  • NUMERIC - double precision floating point values
  • STAMP - time related variables
  • TEXT - strings with free form

A variable type is important for the following reasons:

  • it gives a useful meaning to a variable, so that machine learning or statistical algorithms can take full advantage of the meta information about variables
  • it encapsulates the details of the stored data type and hides them from the user, while allowing the usage of a single unified interface for all variables
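As an illustration of the first point, an algorithm can choose its behavior from the declared type rather than from the stored Java type. The sketch below uses a local enum which only mirrors the type names listed above; the SketchType enum, the summaryFor helper and the particular summaries chosen are assumptions made for this example and are not part of rapaio.

```java
public class TypeDispatchSketch {

    // Local stand-in which mirrors the type names listed above.
    // It is NOT rapaio's VarType enum.
    enum SketchType { BINARY, INDEX, NOMINAL, ORDINAL, NUMERIC, STAMP, TEXT }

    // Hypothetical helper: picks a descriptive summary based on the type's meaning.
    static String summaryFor(SketchType type) {
        switch (type) {
            case NUMERIC:
            case INDEX:
                return "mean and variance";
            case ORDINAL:
                return "median level";
            case BINARY:
            case NOMINAL:
                return "frequency counts";
            case STAMP:
                return "earliest and latest";
            default:
                // TEXT: free-form strings, no numeric summary
                return "no numeric summary";
        }
    }

    public static void main(String[] args) {
        for (SketchType type : SketchType.values()) {
            System.out.println(type + " -> " + summaryFor(type));
        }
    }
}
```

Note that NOMINAL and ORDINAL both hold strings from a predefined set, yet they receive different summaries: the type carries meaning which the storage alone does not.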
