R Shiny Application to Predict Survival on the Titanic

One of the most popular exercises for getting started with data mining is predicting survival on the Titanic. On the Kaggle website this is one of the main challenges, and you can find extensive documentation and tutorials on how to solve it using Excel, Python, R…

In the IBM Extreme Blue team that I led last summer, the 4 students got started on data mining with this challenge, and we ended up creating a Shiny R application. Shiny is a web application framework for R that turns your analysis into an interactive web application without having to write HTML, CSS or JavaScript. We not only solved the problem with R and with SPSS, getting very good results; we also created this application so everybody can access it online and see the results. The model is built in the cloud, and the predictions are computed in the cloud as well.
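Our app's actual model is in the GitHub repository linked below, but to give a flavour of how little code a Shiny app needs, here is a minimal sketch that fits a logistic regression on the Titanic summary data shipped with base R and serves predictions interactively (the UI fields and the formula are illustrative, not the exact ones from our app):

```r
library(shiny)

# Fit a simple logistic regression on the Titanic summary data that
# ships with base R (datasets::Titanic), using the counts as weights.
tit <- as.data.frame(Titanic)
fit <- glm(Survived ~ Class + Sex + Age, family = binomial,
           weights = Freq, data = tit)

ui <- fluidPage(
  titlePanel("Titanic Survival Prediction"),
  sidebarLayout(
    sidebarPanel(
      selectInput("class", "Passenger class", levels(tit$Class)),
      selectInput("sex", "Sex", levels(tit$Sex)),
      selectInput("age", "Age group", levels(tit$Age))
    ),
    mainPanel(textOutput("prob"))
  )
)

server <- function(input, output) {
  # Re-predict whenever the user changes an input.
  output$prob <- renderText({
    nd <- data.frame(Class = input$class, Sex = input$sex, Age = input$age)
    p <- predict(fit, newdata = nd, type = "response")
    sprintf("Estimated survival probability: %.1f%%", 100 * p)
  })
}

shinyApp(ui, server)
```

That is the whole application: no HTML, CSS or JavaScript, and the reactive wiring between inputs and outputs is handled by Shiny.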


The application is hosted on Bluemix. To do so, we are using the Cloud Foundry R Custom Buildpack I modified to install R 3.1: https://github.com/aruizga7/cf-buildpack-r

You can find instructions on how to run R Shiny on Bluemix in this article: https://bluemixanalytics.wordpress.com/2014/09/11/new-developerworks-article-about-running-r-shiny-on-bluemix/

You can find the source code of this Shiny Application here: https://github.com/aruizga7/TitanicShinyApp.git


Use #Bluemix DashDB and R to solve a Kaggle Competition

Kaggle is a platform for predictive modelling and analytics competitions, on which companies and researchers post their data, and statisticians and data miners from all over the world compete to produce the best model.

There are very tempting awards for the winners:

Nowadays anyone can start solving these challenges, and there are powerful tools like SPSS Modeler that within a few clicks can rank you in the top 5% of the leaderboard.

One of the first problems that you face in this kind of competition is that the datasets are TOO big to handle on a regular computer. Not enough memory… not enough CPU power, and the process crawls… So then… what? I cannot participate? Or do I have to pay a lot of money for a cloud environment? No! Use IBM Bluemix and DashDB!

The IBM Bluemix service DashDB is one of the best solutions for this. Why?

– DashDB is powered by IBM BLU Acceleration and Netezza in-Database Analytics. It uses dynamic in-memory columnar technology and innovations such as actionable compression to rapidly scan and return relevant data. In-database analytic algorithms integrated from Netezza bring simplicity and performance to advanced analytics.

– You can get started for free. Sign up for a free account on Bluemix.net and store 1 GB of data at no charge. After that, you pay as you grow: 1 GB to 10 GB of available compressed database storage can hold, respectively, from 5 GB to 50 GB of uncompressed data, based on typical compression ratios. The compression ratio for your data varies based on the characteristics and values in your data set.

– Full integration with R: you can not only run R scripts but also open an instance of RStudio, the best R development environment, completely embedded in the web browser. You don’t need to install any software on your computer; you can do it all in the cloud!
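As a hedged sketch of what working against DashDB from R looks like with the ibmdbR package (DashDB's R interface): the connection values below come from your Bluemix service credentials, and the table name is purely hypothetical.

```r
library(ibmdbR)

# Connect to the DashDB database. "BLUDB" is the default DSN name for
# a DashDB service; user and password are placeholders.
con <- idaConnect("BLUDB", uid = "username", pwd = "password")
idaInit(con)

# An ida.data.frame is a lazy reference to a table: the rows stay in
# the database and operations are pushed down as SQL, so a table larger
# than your laptop's memory is never pulled into R.
flights <- ida.data.frame("SAMPLES.FLIGHTS")  # hypothetical table name
dim(flights)
head(flights)

idaClose(con)
```

This push-down style is exactly what makes the "dataset too big for my laptop" problem go away: R only ever sees the small results it asks for.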



Connect Google BigQuery to IBM SPSS Modeler using JDBC with R

If you want to mine your data using IBM SPSS Modeler and your data is stored in the Google Cloud, you can do it and I will show you how in this post.

I am not going to explain how valuable it is to use the cloud, or how cool it is to set up a Hadoop cluster using IBM, Amazon, Google or any other provider’s cloud. In seconds you can have your infrastructure ready to use. So if you are dealing with big amounts of data, you might need to mine it… and for that, the best tool to use is IBM SPSS Modeler!

There are, in my opinion, four different ways to connect IBM SPSS Modeler to Google BigQuery:

1. R –> There is a package called bigrquery: https://github.com/hadley/bigrquery

You cannot use it yet in Modeler 16, because Modeler 16 embeds R 2.15 and you need R 3.1 to install this package. But with the next release we will be able to install it, and then connecting to BigQuery and creating an extension node will be very easy.
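Once you have R 3.1, querying BigQuery from R is short. A sketch with bigrquery, querying one of Google's public sample datasets (the project id is a placeholder, and the `query_exec()` signature has varied across bigrquery releases, so check the documentation for your version):

```r
library(bigrquery)

# Aggregate a public sample table; billing goes to your own project.
sql <- "SELECT year, COUNT(*) AS births
        FROM [publicdata:samples.natality]
        GROUP BY year ORDER BY year"

births <- query_exec(sql, project = "my-project-id")
head(births)
```

The first call opens a browser window for OAuth authentication and caches the token, so subsequent queries run non-interactively.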

2. JDBC –> There is a JDBC Open Source driver to do that, and you can find it here: https://code.google.com/p/starschema-bigquery-jdbc/

I created an extension for SPSS Modeler using R to connect to BigQuery through JDBC. It is less direct than using the bigrquery package from the previous point, but still quite easy to do. Here you can see what it looks like: using the Custom Dialog Builder I created the user interface, and it is as easy as selecting your projectID, UserID and KeyID and then writing the query.
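Under the hood, the extension's R script is essentially RJDBC plumbing. A hedged sketch of that approach follows; the jar path, project id and credentials are placeholders, and the driver class name and URL format are taken from the starschema driver's documentation, so double-check them against your driver version.

```r
library(RJDBC)

# Load the open source starschema BigQuery JDBC driver.
drv <- JDBC(driverClass = "net.starschema.clouddb.jdbc.BQDriver",
            classPath = "/path/to/bqjdbc.jar")

# Authenticate with a service account and its private key file.
con <- dbConnect(drv,
                 "jdbc:BQDriver:my-project-id?withServiceAccount=true",
                 "service-account@my-project-id.iam.gserviceaccount.com",
                 "/path/to/key.p12")

# From here it is plain DBI: run SQL, get a data frame back.
shakespeare <- dbGetQuery(con, "SELECT word, word_count
                                FROM [publicdata:samples.shakespeare]
                                LIMIT 10")
dbDisconnect(con)
```

The Custom Dialog Builder simply exposes the projectID, UserID and KeyID values as parameters that get substituted into a script of this shape.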


3. If you are willing to pay, there are some companies that developed ODBC Drivers to connect to BigQuery: http://www.simba.com/connectors/google-bigquery-odbc

4. The best way might be using IBM SPSS Analytic Server, but BigQuery is not yet supported (but should be possible to implement).


Seven quick facts about R

  1. R is the highest paid IT skill (Dice.com survey, January 2014)
  2. R is the most-used data science language after SQL (O’Reilly survey, January 2014)
  3. R is used by 70% of data miners (Rexer survey, October 2013)
  4. R is #15 of all programming languages (RedMonk language rankings, January 2014)
  5. R is growing faster than any other data science language (KDNuggets survey, August 2013)
  6. R is the #1 Google Search for Advanced Analytics software (Google Trends, March 2014)
  7. R has more than 2 million users worldwide (Oracle estimate, February 2012)

Source: http://blog.revolutionanalytics.com/r-is-hot/

Join Wednesday a Webinar IBM SPSS + R

Combining SPSS Statistics and R can give you the best of both worlds. SPSS software complements R by extending R’s scalability: you can handle much larger data sets, and you can distribute R packages to a wide range of users, including non-programmers. Here are some of the benefits for R users when they integrate with SPSS Statistics:

  • Better interface
  • Reduced learning curve
  • Data access and transformation is much easier
  • Better output options
  • Superior performance
  • Collaborate better

At the end of this webinar you will see how extending the strengths of R with SPSS software just makes sense.

Register for the webinar to learn more about how combining both is a great solution for your advanced data analysis.


Murali Prakash
Market Manager-Products & Capabilities Predictive & Business Intelligence

Jon Peck
Senior Software Engineer at IBM

IBM SPSS + R  = Powerful Analytical Combination: https://event.on24.com/eventRegistration/EventLobbyServlet?target=registration.jsp&eventid=863536&sessionid=1&key=A83405E714B0EEC85FF77E51F8B79117&sourcepage=register

SPSS Modeler and R integration – Getting started

Since I have a new computer and I have to install everything from scratch, I will explain how to install and set up IBM SPSS Modeler to work with R.

I found a post on R-Bloggers titled ‘Thoughts on SPSS and R Integration‘. The post is very good because it goes step by step through how to integrate SPSS and R, and it explains how embarrassingly non-obvious it was to do. But that post is dated March 10, 2012, and a lot has happened since then.

In this tutorial I am going to focus on IBM SPSS Modeler version 16, the latest available, which has full integration with R. You can also use IBM SPSS Modeler 15 FP2, which gives you some of the integration, but not all of it.
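To give an idea of what the integration looks like once it is set up: in Modeler 16, an R transform node hands the incoming rows to your script as a data frame named modelerData, and any new output fields must also be declared in the modelerDataModel metadata frame. A rough sketch follows; the Age input field is hypothetical, and you should check the Modeler documentation for the exact order of the metadata attributes.

```r
# Script body of an R transform node in SPSS Modeler 16.
# modelerData holds the incoming rows as an ordinary R data frame.
modelerData$logAge <- log(modelerData$Age + 1)

# Every new field must be declared in modelerDataModel so downstream
# nodes know its name, storage type and role.
newField <- c(fieldName = "logAge", fieldLabel = "", fieldStorage = "real",
              fieldMeasure = "", fieldFormat = "", fieldRole = "")
modelerDataModel <- data.frame(modelerDataModel, logAge = newField)
```

Everything else in the stream (sources, downstream modeling, deployment) stays pure Modeler; the R node is just one step in the canvas.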


Twitter and IBM Partnership – IBM Insight 2014 Big Announcement

The conference is over and it is time to draw conclusions. It has been an exciting week, and I am proud to have seen the moment when IBM and Twitter announced their partnership… in my opinion the most important announcement of the Insight 2014 conference.
How was it? It was Wednesday, the 29th of October, and I was feeling a bit stressed because at 10h I had my first ever talk as a speaker at a conference like IBM Insight. Suddenly, on the stage appeared Twitter’s Vice President of Data Strategy and ex-CEO of Gnip (a cool company acquired by Twitter last year). He was explaining the importance of analysing data and talking about concepts like the Internet of Things… and then he said… imagine instrumenting the brains of all people, and analyzing all that data in an efficient way. And… BOOOM!!! They announced the partnership, and the whole crowd was really excited. The best and most valuable data source and the most powerful analytics engine and software, partnering together to bring this new source of business insight.


Space Time Boxes announced as Big Thing in Insight 2014

Last 18th of March I published an article on DeveloperWorks about Space Time Boxes (STB) using IBM SPSS Modeler 16. I invite you to read it and learn more at the following link.



This year at Insight 2014 I have seen Space Time Boxes integrated in many more products, and I have seen many use cases, including an amazing one in the General Session on the 30th of October with IBM Fellow Jeff Jonas. He explained how they use STB to track boats in Singapore.

Geospatial analytics is gaining importance: there are many new features in SPSS Modeler and SPSS Statistics related to geospatial algorithms and visualization tools. I will talk about them when the new releases of these products are available!

Geospatial Analytics with SPSS Modeler #IBMInsight

I was glad to attend the session on Geospatial Analytics with SPSS Modeler and see a lot of familiar stuff… basically I saw a demo of Space-Time-Boxes. Exactly the same demo, with the same dataset, is in the article I wrote on DeveloperWorks some months ago, which you can enjoy here:

– Mine spatial data with space-time-boxes in IBM SPSS Modeler: http://www.ibm.com/developerworks/library/ba-mine-spatial-data-spss-r

Then I also saw some interesting new geospatial nodes based on R… and these nodes were developed by my team last summer :-)  It is a great pleasure to come up with an idea, work on it, and see it presented in front of so many people and industry leaders at the main Big Data and Analytics conference.