SPSS Marketplace catalog in AnalyticsZone – R extensions available!

A Christmas gift for the whole IBM SPSS Community :-)

After many months of work, a catalog of extensions for IBM SPSS Modeler is now available in the AnalyticsZone. As you might already know, you can integrate R into Modeler and create new SPSS nodes that run R code behind the scenes.

I’ve been working on the development of more than 30 extensions for IBM SPSS Modeler using R. Today we are publishing the first 10, and we have many more coming! All of them have documentation, examples and video tutorials in a new YouTube channel called IBM Predictive Extensions.

Enter the Marketplace here –> Link


You can learn more here: R usable for non-programmer Predictive Extensions for IBM SPSS Modeler now available

Please let me know if you have any new ideas or extensions that you would like to post in the Marketplace. Enjoy!

Watson Analytics giving insights on the Titanic Survival

Yesterday I wrote a post about how to solve the Titanic Survival challenge that you can find on the Kaggle website. Today I want to show IBM Watson Analytics for the first time, and how rapidly you can get some insight into this same use case.

IBM Watson Analytics is available as a service, and you can use it under a Freemium license. It combines search, content analytics, and cognitive computing. Just upload your dataset (in this case train.csv) and select the target (survival); Watson Explorer then starts analyzing the dataset and gives you a 360-degree view of it.

Get started for free here: http://watsonanalytics.com

Here are some nice screenshots:

Continue reading

R Shiny Application to Predict Survival on the Titanic

One of the most popular exercises for getting started with Data Mining is predicting survival on the Titanic. On the Kaggle website this is one of the main challenges, and you can find detailed documentation and tutorials on how to solve it using Excel, Python, R…

In the IBM Extreme Blue team that I led last summer, the 4 students got started with Data Mining through this challenge, and we ended up creating a Shiny R application. Shiny is a web application framework for R that turns your analyses into interactive web applications without requiring you to write HTML, CSS or JavaScript. We not only solved the problem with R and with SPSS, getting very good results; we also created this application so that everybody can access it online and see the results. The model is built in the cloud, and the results are calculated in the cloud as well.


The application is hosted on Bluemix. To do so, we use the Cloud Foundry R custom buildpack, which I modified to install R 3.1: https://github.com/aruizga7/cf-buildpack-r

You can find instructions for running R Shiny on Bluemix in this article: https://bluemixanalytics.wordpress.com/2014/09/11/new-developerworks-article-about-running-r-shiny-on-bluemix/

You can find the source code of this Shiny Application here: https://github.com/aruizga7/TitanicShinyApp.git

Continue reading

Use Hadoop and BigR to solve a Kaggle Competition

I previously explained how to use DashDB on Bluemix as the infrastructure for a Kaggle competition. In this post I’m going to explain how to use Hadoop instead of DashDB. Why should you use it? When the volume of data is HUGE, Hadoop is a good option. The service ‘IBM Analytics for Hadoop’ (powered by BigInsights) is, in my opinion, the best choice… why?

– First, because IBM offers 20 GB of data storage for free!!! And this is VERY COOL!

– Second, you can use BigR. IBM InfoSphere BigInsights BigR is a library of functions that provides end-to-end integration between the R language and InfoSphere BigInsights. It is new in the Bluemix service, available since the last update.

The whole environment is READY to run R, with R installed on every node of the Hadoop cluster.

In this post I’m going to show how to set up the environment to solve the ‘Click-Through Rate Prediction’ challenge, which has a prize of $15,000. And I repeat… all this powerful cloud infrastructure for FREE!


Continue reading

Use #Bluemix DashDB and R to solve a Kaggle Competition

Kaggle is a platform for predictive modelling and analytics competitions on which companies and researchers post their data, and statisticians and data miners from all over the world compete to produce the best model.

There are very tempting prizes for the winners:

Nowadays anyone can start solving these challenges, and there are powerful tools like SPSS Modeler that, within a few clicks, can put you in the top 5% of the ranking.

One of the first problems you face in this kind of competition is that the datasets are TOO big to handle with a regular computer. Not enough memory… not enough CPU power, and the process does not perform well… So then… what? I cannot participate? Or do I have to pay a lot of money for a cloud environment? No! Use IBM Bluemix and DashDB!

The IBM Bluemix service DashDB is one of the best solutions for this. Why?

– DashDB is powered by IBM BLU Acceleration and Netezza in-Database Analytics. It uses dynamic in-memory columnar technology and innovations such as actionable compression to rapidly scan and return relevant data. In-database analytic algorithms integrated from Netezza bring simplicity and performance to advanced analytics.

– You can get started for free. Sign up for a free account on Bluemix.net, and you get 1 GB of stored data at no charge. After that, you pay as you grow. The service offers 1 GB to 10 GB of compressed database storage, which can hold from 5 GB to 50 GB of uncompressed data respectively, based on typical compression ratios. The compression ratio for your data varies based on the characteristics and values in your data set.

– Full integration with R: you can not only run R scripts but also open an instance of RStudio, the best R development environment, completely embedded in the web browser. You don’t need to install any software on your computer; you can do it all in the cloud!
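To make the storage figures in the list above concrete, here is a minimal sketch of the arithmetic. It assumes the roughly 5:1 compression ratio implied by the quoted numbers (1 GB holding 5 GB, 10 GB holding 50 GB); the real ratio depends on your data, and the function names are hypothetical helpers written only for this illustration, not part of any DashDB API.

```python
# Rough capacity estimate, assuming the ~5:1 ratio implied by the
# "1 GB holds 5 GB" and "10 GB holds 50 GB" figures quoted above.
ASSUMED_COMPRESSION_RATIO = 5.0  # assumption; the real ratio depends on your data


def uncompressed_capacity_gb(compressed_gb, ratio=ASSUMED_COMPRESSION_RATIO):
    """Estimate how much raw data fits in a given amount of compressed storage."""
    if compressed_gb < 0 or ratio <= 0:
        raise ValueError("storage must be non-negative and ratio positive")
    return compressed_gb * ratio


def fits(raw_dataset_gb, compressed_gb, ratio=ASSUMED_COMPRESSION_RATIO):
    """Return True if a raw dataset of this size should fit in the storage."""
    return raw_dataset_gb <= uncompressed_capacity_gb(compressed_gb, ratio)


if __name__ == "__main__":
    for storage_gb in (1, 10):
        raw_gb = uncompressed_capacity_gb(storage_gb)
        print(f"{storage_gb} GB compressed ~ {raw_gb:.0f} GB raw")
```

Before uploading a competition dataset, you could call fits() with its raw size to get a first idea of whether the free 1 GB is enough, keeping in mind that the ratio is only a typical value.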


Continue reading

Connect Google BigQuery to IBM SPSS Modeler using JDBC with R

If you want to mine your data using IBM SPSS Modeler and your data is stored in the Google Cloud, you can do it and I will show you how in this post.

I am not going to explain how valuable it is to use the cloud, or how cool it is to set up a Hadoop cluster using IBM’s, Amazon’s, Google’s or anyone else’s cloud. In seconds you can have your infrastructure ready to use. So if you are dealing with large amounts of data, you might need to mine it… and for this… the best tool to use is IBM SPSS Modeler!

There are (in my opinion) four different ways to connect Google BigQuery to IBM SPSS Modeler:

1. R –> There is a package called bigrquery: https://github.com/hadley/bigrquery

You cannot use it in Modeler 16 yet, because Modeler 16 uses R 2.15 and the package requires R 3.1. But in the next release we will be able to install this package, and connecting to BigQuery and creating an extension node will be very easy.

2. JDBC –> There is an open source JDBC driver for this, and you can find it here: https://code.google.com/p/starschema-bigquery-jdbc/

I created an extension for SPSS Modeler using R to connect to BigQuery through JDBC. It is less direct than using the bigrquery package from the previous point, but still quite easy to do. Here you can see what it looks like: I created the user interface with the Custom Dialog Builder, and using it is as easy as selecting your projectID, UserID and KeyID and then writing the query.


3. If you are willing to pay, there are some companies that developed ODBC Drivers to connect to BigQuery: http://www.simba.com/connectors/google-bigquery-odbc

4. The best way might be to use IBM SPSS Analytic Server, although BigQuery is not yet supported (it should be possible to implement).

Continue reading

Seven quick facts about R

  1. R is the highest paid IT skill (Dice.com survey, January 2014)
  2. R is the most-used data science language after SQL (O’Reilly survey, January 2014)
  3. R is used by 70% of data miners (Rexer survey, October 2013)
  4. R is #15 of all programming languages (RedMonk language rankings, January 2014)
  5. R is growing faster than any other data science language (KDNuggets survey, August 2013)
  6. R is the #1 Google Search for Advanced Analytics software (Google Trends, March 2014)
  7. R has more than 2 million users worldwide (Oracle estimate, February 2012)

Source: http://blog.revolutionanalytics.com/r-is-hot/

Rave Viz compatible with Bluemix

For those who want to create shiny visualizations, IBM has an engine called Rave that powers the Watson products. Rave stands for Rapidly Adaptive Visualization Engine; it is fully integrated with IBM Cognos and compatible with the newest web browsers and mobile devices. You can learn more about it here:


I am very excited to show, for the first time, a sample IBM Bluemix app using the Rave engine. Now your apps can integrate it easily… it is not only about doing powerful analytics… it is also about showing the results in a clear manner!

Here are some samples:


You will also find learning material: