Perhaps others have written about this before (when I was trying to figure it out I couldn’t find anything specifically about it), or maybe it is so obvious that no one felt a need to write about it. However, I can be very literal-minded, so sometimes it helps to have things spelled out very clearly. Hopefully this is a useful starting point if you find yourself in a similar position some time.
In this case I set up a VM using Cloudera’s Distribution Including Apache Hadoop (CDH), for which Cloudera provides its own image for your virtualization solution of choice. I was interested in writing some R code to run using the RMR packages (R’s programming interface to MapReduce, aka RHadoop). Right away I ran into a roadblock: the version of R available in the CDH repository was not compatible with some of the R packages RMR requires.
This meant I would need to either build an older version of R from source (which is not a trivial task), play around with the repository settings and hope I don’t break yum, or use an older version of CDH that has access to an appropriate older version of R. None of these sounded like particularly compelling options to me, but then I had a realization. I could use Revolution Analytics’ distribution of R (Community Edition, since I *am* on a budget here). Certainly their build of R, with the packages in their own repository, would be suitable for running RMR. For those of you who don’t know, Revolution Analytics sponsors and develops the RMR package. This does work. So you would think this is where the story ends and we MapReduce away all happy, right?
Well, anyone writing serious R code probably runs said code in batch mode. But while you’re writing that code, and especially when you are used to writing code in other languages with an IDE of some kind, naturally you want an IDE for R development too, for comfort’s sake if nothing else (want some milk and cookies with that blanket?). While Revolution R Enterprise has an IDE, the only immediate options for everyone else are StatET for Eclipse or RStudio. As much as I am a big fan of Eclipse for Java, Perl, etc., I have become really fond of RStudio, so that’s what I opted for in this case.
One of the really neat things about RStudio is that it has a server option, which gives you a hosted R development environment. Maybe you see where I’m going with this. Setting up the correct port forwarding in your VM options lets you access RStudio from your browser and write R code that runs directly in your CDH environment. No need to worry about copying files from your local machine to the host running Hadoop. Sounds great in theory, right? Well, there was yet another issue.
RStudio is capable of running multiple instances of R. It finds R by using the which command. So what’s wrong with that? If you left the version of R from CDH/RHEL installed, you’ll be pointing at the wrong thing. If you were smarter than I was and removed it, then when the system tries which R it won’t find anything. Why is that? Revolution R is called Revo64, not R, so the which command returns nothing and RStudio cannot run. Is it time to pack up our marbles and go home? Nope! We can fix this. Fortunately, the RStudio team made their system flexible enough that you can point it somewhere else.
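If you want to see the problem for yourself, you can ask the shell the same question RStudio asks. This is only a rough sketch: `find_interpreter` is a helper name I made up, it uses `command -v` (the shell built-in that does the same lookup as which), and it assumes Revo64 is actually on your PATH, which may not be true for every install.

```shell
# Hypothetical helper: report where (if anywhere) a name resolves on the PATH.
find_interpreter() {
    if path=$(command -v "$1" 2>/dev/null); then
        echo "$1 -> $path"
    else
        echo "$1 -> not found"
    fi
}

find_interpreter R       # the name RStudio looks for by default
find_interpreter Revo64  # the name Revolution R uses
```

If the first line says "not found" (or points at the CDH/RHEL copy of R) and the second one shows a path, you are in the same spot I was.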
In the file put in the following line:
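As a sketch of what this step can look like: RStudio Server’s documentation describes `/etc/rstudio/rserver.conf` as the server configuration file and `rsession-which-r` as the setting that overrides which R binary gets used. I am assuming that is the file and setting that apply here. The helper name `set_rstudio_r` is made up for illustration, and where Revo64 lives on your system is something you should check rather than take from me.

```shell
# Hypothetical helper: point RStudio Server at a specific R binary.
# Usage: set_rstudio_r /path/to/Revo64 [config-file]
set_rstudio_r() {
    r_bin=$1
    conf=${2:-/etc/rstudio/rserver.conf}   # assumed default config location
    touch "$conf"
    # Keep every existing setting except an earlier rsession-which-r line,
    # so the option ends up in the file exactly once.
    grep -v '^rsession-which-r=' "$conf" > "$conf.new" || true
    echo "rsession-which-r=$r_bin" >> "$conf.new"
    mv "$conf.new" "$conf"
}

# Example call (needs root to write under /etc; assumes Revo64 is on the PATH):
# set_rstudio_r "$(command -v Revo64)"
```

Editing the file by hand works just as well; the function only saves you from ending up with two competing lines if you run it twice.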
Save the file and restart the RStudio service with this command:
> rstudio-server restart
And enjoy editing R code in RStudio via your web browser using Revolution R!