Kaggle is a platform for predictive modelling and analytics competitions on which companies and researchers post their data, and statisticians and data miners from all over the world compete to produce the best model.
There are very tempting prizes for the winners.
One of the first problems you face in this kind of competition is that the datasets are too big to handle on a regular computer: not enough memory, not enough CPU power, and processing that is far too slow. So then what? Can I not participate? Do I have to pay a lot of money for a cloud environment? No! Use IBM Bluemix and dashDB!
The dashDB service on IBM Bluemix is one of the best solutions for this. Why?
– dashDB is powered by IBM BLU Acceleration and Netezza in-database analytics. It uses dynamic in-memory columnar technology and innovations such as actionable compression to rapidly scan and return relevant data. In-database analytic algorithms integrated from Netezza bring simplicity and performance to advanced analytics.
– You can get started for free. Create a free account on Bluemix.net and you can store 1 GB of data at no charge. After that, you pay as you grow: 1 GB to 10 GB of compressed database storage can hold, respectively, about 5 GB to 50 GB of uncompressed data, based on typical compression ratios. The compression ratio for your data varies with the characteristics and values in your data set.
– Full integration with R: you can not only run R scripts but also open an instance of RStudio, the best R development environment, completely embedded in the web browser. You don’t need to install any software on your computer; you can do it all in the cloud!
What I am going to explain is how to set up the environment to solve the challenge, not how to solve it. The challenge I want to use as an example is “Click-Through Rate Prediction” by Avazu. The prize for this challenge is $15,000. The first thing to do is download the dataset:
– train.gz (1.04 GB): unzipped, this file is 5.87 GB!
– test.gz (118.07 MB)
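A quick back-of-the-envelope check helps here. The sketch below takes the storage figures quoted earlier (1 GB of compressed storage holding roughly 5 GB of uncompressed data) and estimates how much compressed space the unzipped training file might need. The 5:1 ratio is only the "typical" figure, so treat the result as a rough guide: your real ratio depends on your data. The function and constant names are made up for this illustration.

```python
# Assumption: the "typical" 5:1 ratio implied by the plan description
# (1 GB compressed holds about 5 GB uncompressed). Real ratios vary.
TYPICAL_COMPRESSION_RATIO = 5.0


def estimated_compressed_gb(uncompressed_gb, ratio=TYPICAL_COMPRESSION_RATIO):
    """Estimate the compressed storage needed for an uncompressed size."""
    if ratio <= 0:
        raise ValueError("ratio must be positive")
    return uncompressed_gb / ratio


train_uncompressed_gb = 5.87  # size of the unzipped training file
needed_gb = estimated_compressed_gb(train_uncompressed_gb)

print(f"Estimated compressed size: {needed_gb:.2f} GB")
print("Within 1 GB of storage:", needed_gb <= 1.0)
print("Within 10 GB of storage:", needed_gb <= 10.0)
```

If the typical ratio holds, the training file lands a little above 1 GB compressed, so it is worth checking the actual size after loading, or loading a sample first.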
To deal with a dataset of this size, the best option is to set up a cloud platform, and I am going to explain how to do so using IBM Bluemix and dashDB.
Start IBM Bluemix and dashDB
Create your account on Bluemix.net and create any kind of application. Then attach a dashDB service to the application. When you create the service, you will get all the credentials of your DB instance.
You can find full documentation about how to set up the environment here: https://www.ng.bluemix.net/docs/#services/dashDB/index.html#dashDB
From there, one click takes you to the dashDB administration console, which is very user friendly.
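External clients need the service credentials assembled into a connection string. As a rough sketch, assuming your credentials include a host name, a port and a database name, this is how a DB2-style JDBC URL could be put together. The dictionary keys and every value below are hypothetical placeholders, not the actual layout of the Bluemix credentials, so copy the real values from your own service page.

```python
def db2_jdbc_url(hostname, port, database):
    """Build a DB2-style JDBC URL: jdbc:db2://<host>:<port>/<database>."""
    return f"jdbc:db2://{hostname}:{int(port)}/{database}"


# Hypothetical placeholder credentials (assumed field names and values);
# replace them with the ones shown for your own service instance.
credentials = {
    "hostname": "your-dashdb-host.example.com",
    "port": 50000,
    "db": "BLUDB",
}

url = db2_jdbc_url(credentials["hostname"], credentials["port"], credentials["db"])
print(url)
```

The user name and password are normally entered separately in the client rather than embedded in the URL.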
Loading the data
This might be the most “annoying” part. You can use different tools to do it, such as SQuirreL SQL Client: just create a table and load the data. In my case I used SPSS Modeler, connecting through ODBC. Feel free to contact me if you run into issues loading your data. Since the dataset is quite big, this process might take a long time, depending on your internet connection. But as soon as it is loaded, you will enjoy the power of dashDB!
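If you would rather script the load than use a GUI tool, one option is to stream the compressed file in batches so the 5.87 GB of uncompressed data never has to sit in memory at once. The sketch below is not the method I used (that was SPSS Modeler). It only shows the batching idea with the Python standard library. The helper names, the `TRAIN` table name and the tiny sample file are all made up for illustration, and the actual database call is left out because it depends on the driver you choose.

```python
import csv
import gzip
import os
import tempfile
from itertools import islice


def iter_batches(gz_path, batch_size=10000):
    """Yield (header, rows) batches from a gzip-compressed CSV file.

    The file is decompressed on the fly, so only one batch of rows
    is held in memory at a time.
    """
    with gzip.open(gz_path, mode="rt", newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle)
        header = next(reader, None)
        if header is None:  # empty file: nothing to yield
            return
        while True:
            batch = list(islice(reader, batch_size))
            if not batch:
                break
            yield header, batch


def insert_statement(table, header):
    """Build a parameterised INSERT with one '?' marker per column."""
    columns = ", ".join(header)
    markers = ", ".join("?" for _ in header)
    return f"INSERT INTO {table} ({columns}) VALUES ({markers})"


# Demo on a tiny made-up sample file (not the real competition data).
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "sample.gz")
    with gzip.open(sample_path, "wt", newline="", encoding="utf-8") as out:
        writer = csv.writer(out)
        writer.writerow(["id", "click", "hour"])
        writer.writerows([[i, i % 2, 14102100] for i in range(5)])

    for header, rows in iter_batches(sample_path, batch_size=2):
        # A real loader would pass this statement and `rows` to the
        # database driver's bulk-execute call here.
        print(insert_statement("TRAIN", header), "->", len(rows), "rows")
```

Smaller batches keep memory use low, while larger ones mean fewer round trips over your internet connection, which is usually the bottleneck.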
Note as well that the documentation in the dashDB administration console has instructions for connecting it to many different applications, such as IBM SPSS, IBM Cognos, SAS, Tableau, RStudio…
Start analyzing your data
As I mentioned before, with dashDB you can run R scripts to develop statistical models and plot their results based on data in the dashDB database.
The RODBC and ibmdbR packages are all you will need. You can download these packages and use them in your own RStudio instance, or use them directly in the dashDB environment in the cloud, where they are already installed and immediately ready to use.
– RODBC is a package that provides functions that you can use to access the data in the dashDB database.
– ibmdbR is a package that provides methods to read, write, and sample data from the dashDB database, and access methods for in-database analytics functions.
You can learn more about them in the documentation of dashDB and R.
Now there is no excuse: you can use dashDB on Bluemix and R to solve the most difficult challenges on Kaggle!