Use Hadoop and BigR to solve a Kaggle Competition

I previously explained how to use dashDB on Bluemix as the infrastructure for a Kaggle competition. In this post I’m going to explain how to use Hadoop instead of dashDB. Why should you use it? When the volume of data is HUGE, Hadoop is usually a good option. And the service ‘IBM Analytics for Hadoop‘ (powered by BigInsights) is, in my opinion, the best option… why?

– First, because IBM offers 20 GB of data storage for free!!! And this is VERY COOL!

– Second, you can use BigR. IBM InfoSphere BigInsights BigR is a library of functions that provides end-to-end integration between the R language and InfoSphere BigInsights. This is new in the Bluemix service, available since the last update.

The whole environment is READY to run R, with R installed on all the nodes of the Hadoop cluster.

In this post I’m going to show how to set up the environment to tackle the ‘Click-Through Rate Prediction‘ challenge, which has a prize of $15,000. And I repeat… all this powerful cloud infrastructure for FREE!


Getting started

As mentioned in the previous article, you need to create a free IBM Bluemix account. Then you will be able to create an instance of ‘IBM Analytics for Hadoop‘ with 20GB of storage for free. When you create the instance, you will automatically get all the credentials for your Hadoop instance.

When it is ready, you will be able to access the administration console. On the main page you can download the ‘bigr’ R package.

This will download a file called bigr-1.0.tar.gz.

Install the package in RStudio and load your data

Of course I will use RStudio to develop in R. To install the package from a local file rather than from CRAN, I used the following command:

install.packages("bigr-1.0.tar.gz", repos = NULL, type = "source")

Once that is done, you can connect to your Hadoop instance as follows (use the host, user and password from your own credentials):

library("bigr")

conn <- bigr.connect(host = "bi-hadoop-prod-399.services.dal.bluemix.net", port = 7052, database = "default", user = "biblumix", password = "xxxxxx")
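If you plan to share your script, you may not want the password written in it. Here is a minimal sketch of one way to do that, assuming you export the credentials as environment variables first. The variable names (BIGR_HOST, BIGR_PORT, BIGR_USER, BIGR_PASSWORD) and the helper bigr_args_from_env are my own invention for illustration, not part of bigr; only the bigr.connect arguments come from the call above.

```r
# Hypothetical helper (not part of bigr): collect the arguments for
# bigr.connect() from environment variables instead of hard-coding them.
# The BIGR_* variable names are assumptions made for this sketch.
bigr_args_from_env <- function(env = Sys.getenv) {
  password <- env("BIGR_PASSWORD")
  if (!nzchar(password)) {
    stop("BIGR_PASSWORD is not set")
  }
  list(
    host     = env("BIGR_HOST"),
    port     = as.integer(env("BIGR_PORT", "7052")),  # same port as in the call above
    database = "default",
    user     = env("BIGR_USER"),
    password = password
  )
}

# Connect only when the bigr package is installed and a password is available.
if (requireNamespace("bigr", quietly = TRUE) && nzchar(Sys.getenv("BIGR_PASSWORD"))) {
  library("bigr")
  conn <- do.call(bigr.connect, bigr_args_from_env())
}
```

With this, the script itself carries no secrets and the same file works for any instance, as long as the variables are set before starting R.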

The data for this exercise is 5.87GB. My machine has 16GB of memory, so I can load it all in memory. To read the data from the CSV file and then upload it to the Hadoop instance on Bluemix, I run the following commands:

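# Read the whole CSV into a local data frame
dataR <- read.csv("train.csv", stringsAsFactors = FALSE)

# Convert the local data frame into a bigr.frame
data <- as.bigr.frame(dataR)

# Store it as a delimited file (train.csv) on the Hadoop instance
data <- bigr.persist(data, dataSource = "DEL", dataPath = "train.csv", header = TRUE, delimiter = ",", useMapReduce = FALSE)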
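If you are not sure whether the file fits in your memory, you can get a rough idea before reading all of it. The sketch below reads only the first rows, compares their size in memory with their size on disk, and extrapolates to the whole file. It uses only base R; the helper estimate_csv_memory and the demo columns are hypothetical, and the result is only a ballpark figure (it assumes the first rows are representative and that no field contains a line break).

```r
# Hypothetical helper (not part of bigr): rough estimate of how much memory
# a CSV file will need once loaded with read.csv().
estimate_csv_memory <- function(path, sample_rows = 10000) {
  # Load only the first rows of the file
  sample_df <- read.csv(path, nrows = sample_rows, stringsAsFactors = FALSE)

  # Bytes that those same rows take on disk (header excluded, +1 per newline)
  raw_lines  <- readLines(path, n = nrow(sample_df) + 1)
  disk_bytes <- sum(nchar(raw_lines[-1], type = "bytes") + 1)

  # Bytes that the sample takes in memory
  mem_bytes <- as.numeric(object.size(sample_df))
  ratio     <- mem_bytes / disk_bytes

  list(
    rows_sampled      = nrow(sample_df),
    mem_per_disk_byte = ratio,
    estimated_gb      = file.size(path) * ratio / 1024^3
  )
}

# Demo on a small temporary file with made-up columns
demo_path <- tempfile(fileext = ".csv")
demo_df <- data.frame(
  id    = 1:100,
  click = rep(c(0L, 1L), 50),
  site  = rep(c("a", "b", "c", "d"), 25)
)
write.csv(demo_df, demo_path, row.names = FALSE)

est <- estimate_csv_memory(demo_path, sample_rows = 50)
print(est)
```

For the real file you would call estimate_csv_memory("train.csv") and compare estimated_gb with the memory of your machine before running read.csv on everything.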

If you log in to the Administration Console of BigInsights on Bluemix, you will see your CSV in the cloud. Loading the data can take some time because the file is so big.

Now you are all set to start using the power of R and Hadoop. Using BigR you will skip all the complexity of manually writing MapReduce jobs.

You can find very useful documentation here:

Analyzing data with IBM InfoSphere BigInsights Big R

Run Ad hoc R Scripts

Even though I always prefer to use RStudio, you might want to run an R script that you previously developed and saved. You can do that. In the Administration Console of BigInsights –> Application there is a pre-built application where you can insert your R script.


3 thoughts on “Use Hadoop and BigR to solve a Kaggle Competition”

  1. How do you do Machine Learning on big?
    I’ve tried with the package “caret” or just tried randomForest and none of them work. I get the following error:

    Error in as.data.frame.default(data) :
    cannot coerce class "structure("bigr.frame", package = "bigr")" to a data.frame
