Introduction

Why another library for statistics and machine learning?

There are many software stacks out there which provide plenty of nicely crafted tools for statistics, machine learning, data mining or pattern recognition. Many of them are available as open source, their quality is high and they are rich in features.

We have R, which is a standard even for companies these days. R provides a language, an interactive data analysis workbench and solid visualization. R also incorporates the experience of the many sound scientists who contributed to it.

There are Python stacks, which benefit from the fact that Python is a simple and elegant language and works excellently as a glue language. We have numpy, scipy, scikit-learn and plenty of other things. There is also matplotlib, which is beautiful and flexible, and IPython, which gives interactive analysis a new dimension.

There are also some other tools with different purposes or principles, like Weka, Shogun, Spark with MLlib and so on. Some of them have a lot of implemented algorithms. Some of them leverage distributed computation. Some of them exploit GPUs or other types of grids. And so on.

That being said, it seems legitimate to ask "Why another library for statistics and machine learning?". Socrates said that "understanding a question is half of an answer". In our case the question becomes complete once we add its context: "Why another library for statistics and machine learning, when there are many available already?". Thus, a new question arises: why are there so many libraries out there? My answer is: because none of them satisfies the tastes and needs of everybody.

Some reasons which provide motivation for this library are:

  • I love R. I like functional paradigms, but I do not believe that the R language, as it is now, is something which I really want to program in for a long time. Take for example the error handling.
  • I love Python for its simplicity and elegance. However there are two main things which I really do not like. If you want something to work with, you have to have an almost complete operating system. Perhaps this is because Python is really good as a glue language. This advantage comes at a price, and the price is a plethora of required things. But more than that, I really hate that something like showing a graph in a window of a given size can be done in too many ways, many of them undocumented and hackish. And I really want a language with full and intelligent auto-complete, at any time. I prefer to memorize ideas rather than syntax.
  • I like Weka for its wealth of implementations. But I really do not like the standard Java way of doing things. There are no short methods. And by the way, not all implementations are complete (and the same thing can be said for the Python stack).
  • Last but not least, I would really love to have an environment, a box with plenty of tools, which can be extended and which allows me to experiment, study and learn.

Prerequisites

The Rapaio library is written in Java. Since this is a continuous work in progress, the best way to use the library is to have the source code and use it locally. Of course, this is not the only possibility. One can get the published releases and use only the compiled artifacts. A list of published releases is on GitHub.

As far as I understand, one can contribute code to a GitHub project in two ways. The first one is to fork a project you like (that is, to clone it under your own account) and the second one is to commit directly into a project which either you own or you collaborate on.

Either way you have to get a link from GitHub to the project. In case of collaboration, the direct link to the rapaio project is https://github.com/padreati/rapaio.git. For a forked repository you would get a link like https://github.com/your_user_name/rapaio, where your_user_name is your GitHub user name.

Install IntelliJ IDEA

If you don't have it already, grab the "community edition" of IDEA from http://www.jetbrains.com/idea/download/. Extract the IntelliJ distribution somewhere convenient (e.g. under your home directory), and run bin/idea.sh to start the IDE. You should then see a series of step-by-step configuration dialog boxes, where you can enable or disable plugins. The fewer plugins you have activated, the more resource-efficient IDEA will be. The following steps will give you a minimal IDEA setup. You can always enable more plugins later if you need them.

  1. From the version control plugins you need only Git.
  2. From "Other plugins" you need GitHub and JUnit.

Super tip: In recent versions of Ubuntu, it is not straightforward in the default GUI to manually add application launcher icons. IDEA has a menu item to automatically create one at Tools -> Create desktop entry.

Install JDK

This library requires JDK 1.8 or newer. For testing, the project needs JUnit. No other dependencies are required. The main reason is that the first of the only two principles which govern this library is "Write yourself any feature you need, do not rely on external libraries". As a consequence, you need only the JDK at runtime, and for testing purposes you need only JUnit. Everything else either leverages the JDK (logging, image manipulation, graphics, etc.) or is written in the library itself.
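
If you are not sure which Java version you are running, a small check like the one below can tell you. This is only a sketch: the class and method names (JavaVersionCheck, parseMajor, isSupported) are made up for illustration and are not part of rapaio. It relies on the standard java.specification.version system property, which reports "1.8" for Java 8 and plain numbers such as "11" for later releases.

```java
/**
 * Hypothetical helper (not part of rapaio): checks whether the running
 * JVM satisfies the "JDK 1.8 or newer" requirement.
 */
final class JavaVersionCheck {

    /**
     * Extracts the major version from a java.specification.version value.
     * Java 8 and older use the "1.x" scheme (e.g. "1.8"),
     * Java 9 and newer use the plain major number (e.g. "11").
     */
    static int parseMajor(String specVersion) {
        String[] parts = specVersion.trim().split("\\.");
        int first = Integer.parseInt(parts[0]);
        if (first == 1 && parts.length > 1) {
            return Integer.parseInt(parts[1]);
        }
        return first;
    }

    /** Returns true when the given version is Java 8 or newer. */
    static boolean isSupported(String specVersion) {
        return parseMajor(specVersion) >= 8;
    }

    public static void main(String[] args) {
        String version = System.getProperty("java.specification.version");
        String verdict = isSupported(version) ? "is sufficient" : "is too old";
        System.out.println("Java " + version + " " + verdict);
    }
}
```

The code deliberately avoids Java 8 language features, so it also compiles and runs on an older JDK and can report the problem instead of failing to compile.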

Setup rapaio from GitHub

  1. Create a directory which will hold rapaio modules and perhaps other modules. Suppose this directory is /local/workspace/ (but you can name it however you like).

  2. IDEA allows you to check out Git repositories from GitHub in two ways. If you have already opened IDEA and a project is loaded, you can set up a new project for rapaio from there, in which case you should use the menu VCS -> Checkout from Version Control -> GitHub. Otherwise, from the Welcome to IntelliJ IDEA dialog box, you should use Checkout from Version Control -> GitHub.

  3. In both cases, the Login to GitHub dialog box will appear and you will have to fill in Host, Auth type: Password, Login (often your email) and Password.

  4. If you checked Save password, then you are asked for a master password. With this password the IDE encrypts its own repository of passwords.

  5. After a successful login, the Clone repository dialog box allows you to select a repository. The required one is https://github.com/padreati/rapaio.git. Set Parent directory to something like /local/workspace, set Directory Name to rapaio and press the Clone button.

  6. The IDE asks you if you want to create a project from sources. In my setup I select No since I want rapaio as a module. After selecting No, you will be back in the Welcome to IntelliJ IDEA screen.

  7. Now it's time to create the project. Select Create new project. In the New project modal dialog select Empty Project from the right panel and click Next. On the new screen set Project location to /local/workspace (the directory which holds the rapaio clone), and perhaps Project Name to rapaio, then press Finish.

  8. The IDE will create the new empty project for you (please remember that the rapaio sources are inside it) and open it. Because the project is empty, two things happen: it shows the warning Unregistered VCS root detected, which you should ignore for now, and it shows the Project Structure modal dialog. In Project Structure add a new module via + -> New module. In the New module screen select a Module SDK; it has to be JDK 8 or greater (if none exists, then select New -> JDK -> select JDK root -> OK). Then select Next. In the new modal dialog set Content root to /local/workspace/rapaio, leave the other fields unchanged, and click Finish.

  9. Now it's time to solve the VCS root configuration problems. For that you will have to click the Configure link in the notification. In the modal dialog you have to do 2 things: remove the project from VCS (select the project and click the - button) and register rapaio under VCS (select the rapaio module and click the + button).

  10. After all of these steps are accomplished, you can start to work. As a final setup step you might need to change directory markings with right click -> Mark directory as, for src, tst or other directories. In order to experiment, you can create your own module, add rapaio as a module dependency and work with data.
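
The steps above assume that the clone ended up in a rapaio directory directly under the workspace directory. If the module does not show up as expected, a quick check like the following can confirm the layout on disk. It is only a sketch: CheckoutLayout and looksLikeClone are hypothetical names, not part of rapaio, and /local/workspace is just the example path used in the steps.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

/**
 * Hypothetical helper (not part of rapaio): verifies that a Git clone
 * of a module exists inside a workspace directory.
 */
final class CheckoutLayout {

    /**
     * Returns true when workspace/moduleName is a directory that
     * contains a .git entry, which is what a Git clone leaves behind.
     */
    static boolean looksLikeClone(Path workspace, String moduleName) {
        Path module = workspace.resolve(moduleName);
        return Files.isDirectory(module) && Files.exists(module.resolve(".git"));
    }

    public static void main(String[] args) {
        // Assumption: the example workspace path from the steps above,
        // unless another one is given as the first argument.
        Path workspace = Paths.get(args.length > 0 ? args[0] : "/local/workspace");
        if (looksLikeClone(workspace, "rapaio")) {
            System.out.println("rapaio clone found under " + workspace);
        } else {
            System.out.println("no rapaio clone under " + workspace);
        }
    }
}
```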

Final note: please check that the sources for the project use JDK 8 or greater as language level. In case it is not properly configured (compilation errors or similar), you will have to go to Project Structure -> Modules -> rapaio -> Sources and set the field Language level to 8 - Lambdas, type annotations, etc.
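
One simple way to see whether the language level is set correctly is to compile a tiny class which uses Java 8 features. The sketch below is such a smoke test. LanguageLevelCheck and squares are hypothetical names and have nothing to do with the rapaio API. If the language level is below 8, the lambda and the stream call will not compile.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;

/**
 * Hypothetical smoke test (not part of rapaio): compiles only when the
 * language level is 8 or greater, because it uses a lambda and streams.
 */
final class LanguageLevelCheck {

    /** Squares every value of the list using a lambda and the Stream API. */
    static List<Integer> squares(List<Integer> values) {
        Function<Integer, Integer> square = x -> x * x; // lambda: needs level 8
        return values.stream().map(square).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(squares(Arrays.asList(1, 2, 3))); // prints [1, 4, 9]
    }
}
```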
