Winner of Ballon d’Or analyzed in 2 minutes

In the previous post I showed some new nodes based on R that can be installed in IBM SPSS Modeler. It takes 2 minutes to get the CSV with the results of the Ballon d’Or vote (which Cristiano Ronaldo won for the third time), build a stream like the one below, and get a word cloud, a pie chart in the form of an easily shareable iFrame, and an HTML table, also in the form of an iFrame. A simple analysis, but quite interesting, easy to share and easy to perform!
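As a rough illustration of what the pie chart step computes, the sketch below (in Python rather than the R used by the nodes) tallies votes from CSV text and converts them to percentage shares. The column name `voted_for` and the sample rows are assumptions made up for illustration; the layout of the real file is not shown here.

```python
import csv
import io
from collections import Counter

# Hypothetical sample standing in for the votes CSV (column names are assumed).
SAMPLE_CSV = """voter,voted_for
v1,Player A
v2,Player B
v3,Player A
v4,Player C
"""

def vote_shares(csv_text, column="voted_for"):
    """Return {candidate: percentage of votes}, the numbers a pie chart would show."""
    rows = csv.DictReader(io.StringIO(csv_text))
    counts = Counter(row[column] for row in rows)
    total = sum(counts.values())
    return {name: 100.0 * n / total for name, n in counts.items()}

shares = vote_shares(SAMPLE_CSV)
for name, pct in sorted(shares.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {pct:.1f}%")
```

In the Modeler stream this tallying happens inside the node, so no code like this is needed there.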

BalonDorStream

The hashtag analyzed to create this word cloud is #BalonDor2014, based on 10,000 tweets:

Continue reading

Analysis of #CharlieHebdo sentiment with SPSS

With the horrible attack on the Charlie Hebdo office in Paris, we are once more experiencing a new way of following the latest news, this time powered by Twitter. It is amazing how fast people are sharing thoughts, photos, links, and absolutely everything. Twitter thus becomes a real-time data set of what the world’s population is thinking.

In this post I am going to show how to query tweets and do some simple analysis using IBM SPSS Modeler and the new SPSS Predictive Extensions based on R. All this analysis…without any coding at all! We are going to do three things:

– Create a Word Cloud with a new WordCloud node based on the R wordcloud package.

– Integration of rCharts with IBM SPSS Modeler. rCharts (developed by Ramnath Vaidyanathan) was born as an initiative to bring powerful JavaScript visualizations to R users, so they can now create interactive charts without JavaScript skills, using only R. With this integration in the SPSS workbench, you don’t even need to know R in order to use them. Simply drag and drop the node and start getting powerful results that are easy to share. These are the libraries available now in IBM SPSS Modeler:

– Integration with the new R package htmlwidgets. This package enables you to add new types of HTML output to R Markdown documents. There are different types of widgets, such as maps, charts, 3D scatterplots and more.

NewNodesSPSS
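To give an idea of what a word cloud is built from, here is a minimal sketch of the word-counting step, written in Python rather than the R the node uses. The stop-word list, the tokenizing rule and the placeholder tweets are all assumptions for illustration, not what the WordCloud node actually does internally.

```python
import re
from collections import Counter

# Small, assumed stop-word list; a real analysis would use a fuller one.
STOP_WORDS = {"the", "a", "an", "is", "in", "of", "to", "and", "rt"}

def word_frequencies(tweets, top_n=10):
    """Count the most frequent tokens; a word cloud sizes each word by this count."""
    counts = Counter()
    for text in tweets:
        # Drop URLs, then keep words, hashtags and mentions in lower case.
        text = re.sub(r"https?://\S+", "", text.lower())
        tokens = re.findall(r"[#@]?\w+", text)
        counts.update(t for t in tokens if t not in STOP_WORDS)
    return counts.most_common(top_n)

# Placeholder tweets, invented for this example.
example_tweets = [
    "RT The match is about to start #ExampleTag",
    "Watching with friends #ExampleTag http://example.com/x",
]
top_words = word_frequencies(example_tweets)
print(top_words[0])
```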

Continue reading

FCBarcelona vs Real Madrid Players Twitter Popularity

I created a simple visualization showing the number of followers of each of the players of FCBarcelona and Real Madrid.

In this case the result is a simple bar chart. Barcelona fans, don’t get mad: Messi doesn’t have a Twitter account…he is more active on Facebook, and I will generate the same kind of report based on Facebook data.

Below you will find the source code to do all this and in the coming days I will illustrate how to get exactly the same results but using only SPSS, some new R nodes and without coding at all. So…stay tuned!

Link to Full Screen Chart

Continue reading

SPSS Node to plot interactive maps

IBM SPSS Modeler already includes map capabilities, but they are far from perfect. Now we can create beautiful maps in a matter of seconds, all in the same SPSS Modeler workbench, thanks to the integration of SPSS Modeler with the R programming language. The extensions are available in the SPSS Modeler Marketplace that we launched last week, and they are free.

In the maps you can use the same color for all points or use a legend column to specify a color code. This legend may be categorical or continuous. Several color palettes are available (sequential, divergent, qualitative or monochrome), covering all possible uses of the node.
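The difference between a continuous and a categorical legend can be sketched as follows. This is an illustrative Python example, not the node's own code: the function names and the specific colors are assumptions chosen for the demonstration.

```python
def hex_color(rgb):
    """Format an (r, g, b) tuple of 0-255 integers as #rrggbb."""
    return "#{:02x}{:02x}{:02x}".format(*rgb)

def continuous_color(value, vmin, vmax, low=(255, 255, 204), high=(128, 0, 38)):
    """Linearly interpolate between two RGB colors (a two-color sequential palette)."""
    t = 0.0 if vmax == vmin else (value - vmin) / (vmax - vmin)
    t = min(1.0, max(0.0, t))  # clamp values outside the legend range
    return hex_color(tuple(round(l + t * (h - l)) for l, h in zip(low, high)))

def categorical_colors(labels, palette=("#1b9e77", "#d95f02", "#7570b3")):
    """Assign one palette color per distinct label, cycling if labels outnumber colors."""
    distinct = sorted(set(labels))
    return {label: palette[i % len(palette)] for i, label in enumerate(distinct)}

print(continuous_color(5, 0, 10))
print(categorical_colors(["b", "a", "b"]))
```

A continuous legend maps each numeric value to a position along a color ramp, while a categorical legend simply gives every distinct label its own color.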

More precisely, this node generates an HTML file which can be saved to a specific directory and/or opened in the default browser on execution. This HTML page is an interactive map, that is to say you can move, zoom in and out, etc. The R package used is called plotGoogleMaps.

Download extension: Plot Spatial Data

Continue reading

R Shiny Application to Predict Survival on the Titanic

One of the most popular exercises to get started with Data Mining is predicting survival on the Titanic. On the Kaggle website this is one of the main challenges, and you can find detailed documentation and tutorials on how to solve it using Excel, Python, R…
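For readers who have not seen the exercise, the sketch below shows the kind of very simple baseline that introductory tutorials often start from: compute the survival rate per group and predict the majority outcome. It is written in Python for illustration; the rows are made-up toy values, not the Kaggle data, and this is not the model used in our application.

```python
from collections import defaultdict

# Made-up toy rows of (sex, survived); NOT the Kaggle data.
TOY_PASSENGERS = [
    ("female", 1), ("female", 1), ("female", 0),
    ("male", 0), ("male", 0), ("male", 1),
]

def survival_rates(rows):
    """Return {group: fraction of that group that survived}."""
    totals = defaultdict(int)
    survived = defaultdict(int)
    for group, outcome in rows:
        totals[group] += 1
        survived[group] += outcome
    return {g: survived[g] / totals[g] for g in totals}

def fit_baseline(rows):
    """Predict survival (1) for a group when more than half of it survived."""
    return {g: int(rate > 0.5) for g, rate in survival_rates(rows).items()}

model = fit_baseline(TOY_PASSENGERS)
accuracy = sum(model[g] == y for g, y in TOY_PASSENGERS) / len(TOY_PASSENGERS)
print(model, round(accuracy, 2))
```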

In the IBM Extreme Blue team that I led last summer, the 4 students got started on Data Mining with this challenge, and we ended up creating a Shiny R application. Shiny is a web application framework for R that converts your analysis into interactive web applications without having to write HTML, CSS or JavaScript. We not only worked on solving the problem with R and with SPSS, getting very good results, we also created this application so everybody can access it online and see the results. The model is created in the cloud and the results are calculated in the cloud as well.

 http://titanicanalysis.mybluemix.net/

The application is hosted on Bluemix. To do so, we use the Cloud Foundry R Custom Buildpack, which I modified to install R 3.1: https://github.com/aruizga7/cf-buildpack-r

You can find instructions on how to run R Shiny on Bluemix in this article: https://bluemixanalytics.wordpress.com/2014/09/11/new-developerworks-article-about-running-r-shiny-on-bluemix/

You can find the source code of this Shiny Application here: https://github.com/aruizga7/TitanicShinyApp.git

Continue reading

Use Hadoop and BigR to solve a Kaggle Competition

I previously explained how to use DashDB on Bluemix as infrastructure for a Kaggle competition. In this post I’m going to explain how to use Hadoop instead of DashDB.  Why should I use it? When the volume of data is HUGE, using Hadoop is always a good option. Using the service ‘IBM Analytics for Hadoop‘ (powered by BigInsights) is in my opinion the best option…why?

– First because IBM offers 20 GB of data storage for free!!! And this is VERY COOL!

– Second, you can use BigR. IBM InfoSphere BigInsights BigR is a library of functions that provide end-to-end integration with the R language and InfoSphere BigInsights. This is new in the Bluemix service, available since the last update.

The whole environment is READY to run R, with R installed on all the nodes of the Hadoop Cluster.

In this post I’m going to show how to set up the environment to solve the challenge ‘Click-Through Rate Prediction‘, which has an award of $15,000. And I repeat…all this powerful cloud infrastructure for FREE!
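To show why this kind of problem suits Hadoop, here is a small Python sketch of the map and reduce pattern applied to a click-through rate. It runs locally on two in-memory partitions and does not use Hadoop or BigR; the record layout `(site_id, clicked)` and the values are hypothetical and may not match the competition's columns.

```python
from collections import defaultdict
from functools import reduce

# Hypothetical records of (site_id, clicked), split into two partitions.
PARTITIONS = [
    [("site_a", 1), ("site_b", 0), ("site_a", 0)],
    [("site_a", 1), ("site_b", 0)],
]

def map_partition(records):
    """Map step: per-partition {key: [clicks, impressions]}."""
    acc = defaultdict(lambda: [0, 0])
    for key, clicked in records:
        acc[key][0] += clicked
        acc[key][1] += 1
    return dict(acc)

def merge(left, right):
    """Reduce step: add partial counts key by key."""
    out = {k: list(v) for k, v in left.items()}
    for key, (clicks, shown) in right.items():
        pair = out.setdefault(key, [0, 0])
        pair[0] += clicks
        pair[1] += shown
    return out

totals = reduce(merge, map(map_partition, PARTITIONS))
ctr = {key: clicks / shown for key, (clicks, shown) in totals.items()}
print(ctr)
```

Because each partition is summarized independently before the partial counts are merged, the same logic can be spread over many nodes when the data no longer fits on one machine.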

Hadoop

Continue reading

Use #Bluemix DashDB and R to solve a Kaggle Competition

Kaggle is a platform for predictive modelling and analytics competitions on which companies and researchers post their data, and statisticians and data miners from all over the world compete to produce the best model.

There are very tempting awards for the winners:

Kaggle1

Nowadays anyone can start solving these challenges, and there are powerful tools like SPSS Modeler that, within a few clicks, can rank you in the top 5%.

One of the first problems that you face in this kind of competition is that the datasets are TOO big to handle with a regular computer. Not enough memory…not enough CPU power, and the process does not perform well…So then…what? I cannot participate? Or do I have to pay a lot of money for a Cloud Environment? No! Use IBM Bluemix and DashDB!

The IBM Bluemix service DashDB is one of the best solutions for this. Why?

– DashDB is powered by IBM BLU Acceleration and Netezza in-Database Analytics. It uses dynamic in-memory columnar technology and innovations such as actionable compression to rapidly scan and return relevant data. In-database analytic algorithms integrated from Netezza bring simplicity and performance to advanced analytics.

– You can get started for free. Create a free account on Bluemix.net and you get 1 GB of stored data at no charge. After that, you pay as you grow. The 1 GB to 10 GB of available compressed database storage can hold, respectively, from 5 GB to 50 GB of uncompressed data, based on typical compression ratios. The compression ratio for your data varies based on the characteristics and values in your data set.

– Full integration with R: You can not only run R scripts but also open an instance of RStudio, the best R development environment, completely embedded in the web browser. You don’t need to install any software on your computer; you can do it all in the Cloud!
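The storage figures in the list above imply a typical compression ratio of about 5 to 1. The short Python sketch below works through that arithmetic; the 4 GB dataset is an assumed example, and as noted above your own ratio will depend on your data.

```python
def implied_compression_ratio(uncompressed_gb, compressed_gb):
    """Ratio of raw data size to the space it occupies once compressed."""
    return uncompressed_gb / compressed_gb

def compressed_size_gb(uncompressed_gb, ratio):
    """Estimated stored size for a raw dataset at a given compression ratio."""
    return uncompressed_gb / ratio

# 5 GB of uncompressed data held in 1 GB of compressed storage.
ratio = implied_compression_ratio(5, 1)

# Assumed example: would a 4 GB raw dataset fit in the free 1 GB at that ratio?
fits_free_tier = compressed_size_gb(4, ratio) <= 1
print(ratio, fits_free_tier)
```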


Continue reading

Connect Google BigQuery to IBM SPSS Modeler using JDBC with R

If you want to mine your data using IBM SPSS Modeler and your data is stored in the Google Cloud, you can do it and I will show you how in this post.

I am not going to explain how valuable it is to use the cloud, or how cool it is to set up a Hadoop Cluster using the cloud of IBM, Amazon, Google or any other provider. In seconds you can have your infrastructure ready to use. So if you are dealing with large amounts of data you might need to mine it…and for this…the best tool to use is IBM SPSS Modeler!

There are, in my opinion, four different ways to connect Google BigQuery and IBM SPSS Modeler:

1. R –>There is a package called bigrquery https://github.com/hadley/bigrquery

You cannot use it yet in Modeler 16 because it uses R 2.15 and you need 3.1 to install this package. But in the next release we will be able to install this package, and connecting to BigQuery and creating an extension node will be very easy.

2. JDBC –> There is a JDBC Open Source driver to do that, and you can find it here: https://code.google.com/p/starschema-bigquery-jdbc/

I created an extension for SPSS Modeler using R to connect to BigQuery through JDBC. It is less direct than using the bigrquery package from the previous point, but still quite easy to do. Here you can see what it looks like: I created the user interface with the Custom Dialog Builder, and using it is as easy as selecting your projectID, UserID and KeyID and then writing the query.

BigQuerySPSS

3. If you are willing to pay, there are some companies that developed ODBC Drivers to connect to BigQuery: http://www.simba.com/connectors/google-bigquery-odbc

4. The best way might be to use IBM SPSS Analytic Server, but BigQuery is not yet supported (though it should be possible to implement).

Continue reading

Seven quick facts about R

  1. R is the highest paid IT skill (Dice.com survey, January 2014)
  2. R is the most-used data science language after SQL (O’Reilly survey, January 2014)
  3. R is used by 70% of data miners (Rexer survey, October 2013)
  4. R is #15 of all programming languages (RedMonk language rankings, January 2014)
  5. R is growing faster than any other data science language (KDNuggets survey, August 2013)
  6. R is the #1 Google Search for Advanced Analytics software (Google Trends, March 2014)
  7. R has more than 2 million users worldwide (Oracle estimate, February 2012)

Source: http://blog.revolutionanalytics.com/r-is-hot/