Grant McDermott bio photo

Grant McDermott

Economics.
Environment.
Data science.

Email Twitter LinkedIn   Scholar   ORCID GitHub

The tutorial below has been on my GitHub for a while. I don’t have a good reason for adding it to this blog as well (and at this stage), but here it is all the same.

This is a how-to guide for setting up a server or virtual machine (VM) with Google Compute Engine. In addition, I’ll also show you how to install RStudio Server on your VM, so that you can perform your analysis in almost exactly the same user environment as you’re used to, but now with the full power of cloud-based computation at your disposal. Trust me, it will be awesome.

Prerequisites (read carefully)

  1. Sign up for a 60-day free trial with the Google Cloud Platform. This requires an existing Google/Gmail acount. During the course of sign-up, you should create a project that will be associated with billing. This is purely ceremonial at present – we’re using the free trial period after all – but a billable project ID is required before gaining access to the platform.
  2. Download and follow the installation instructions for the Google Cloud SDK command line utility, gcloud here.

Tip: Please pay proper attention to the Google/Gmail account that you use to sign up for the Cloud Platform in Step 1. (E.g. You might have two Gmail accounts, where one is your personal Gmail and the other is linked to your university email.) Needless to say, you’ll want to make sure that you use the same account when setting up the gloud utility in Step 2. This might all sound obvious, but it has been the primary sticking point during class tutorials, where people encounter a bunch of puzzling authentication errors simply because they aren’t using a consistent account.

Introduction

So what is a virtual machine (VM) and why do I need one anyway? In the simplest sense, a VM is just an emulation of a computer running inside another (bigger) computer. It can potentially perform all or more of the operations that your physical laptop/desktop does, and it might have many of the same properties (from operating system to internal architecture.) The key advantage of a VM from our perspective is that very powerful machines can be “spun up” in the cloud almost effortlessly and then deployed to tackle jobs that are beyond the capabilities of your local computer. Got a big dataset that requires too much memory to analyse on your old laptop? Load it into a high-powered VM. Got some code that takes an age to run? Fire up a VM and let it chug away without consuming any local resources. Or, better yet, write the code in parallel and then spin up a VM with lots of cores (CPUs) to get the analysis done in a fraction of the time. All you need is a working internet connection and a web browser.

Another neat feature of VM’s is that everyone within a team or research group can work on the same OS up in the cloud, regardless of their local machine types and constraints. Anyone who’s ever encountered a situation where code works perfectly on their Linux/Mac/Windows machine, but fails to run on a colleague’s Windows/Mac/Linux machine, will be acutely aware of what I’m talking about. In this way, VMs can be used to overcome many of the interoperability hurdles that can plague scientific/programming teamwork.

Now, with that background knowledge in mind, Google Compute Engine is part of the Google Cloud Platform and delivers high-performance, rapidly scalable VMs. A new VM can be deployed or shut down within seconds, while existing VMs can easily be ramped up or down (cores added, RAM added, etc.) depending on a project’s needs. In my experience, Google Compute Engine is at least as good as Amazon’s AWS – say nothing of the other really cool products within the Cloud Platform suite – and most individual users would be really hard-pressed to spent more than a couple of dollars a month using it. (If that.) This is especially true for the researcher who only needs to crunch a particularly large dataset or run some intensive simulations on occasion, and can easily switch the machine off when it’s not being used.

Tip: While I very much stand by the above paragraph, it is ultimately your responsibility to keep track of your billing and utilisation rates. Take a look at Google’s Could Platform Pricing Calculator to see how much you can expect to be charged for a particular machine and level of usage. You can even set a budget and create usage alerts if you want to be extra cautious.

Two final housekeeping notes, before continuing.

First, it’s possible to complete nearly all of the steps in this guide via the Compute Engine browser console. However, we’ll stick with the gcloud command line utility (which you should have installed already), because that will make it easier to document our steps and will also save us some headaches further down the road. For example, when it comes to transferring files between your local computer and a Compute Engine instance en masse.

Second, almost all VMs run on some variant of Linux. Since we’ll only be connecting to our VM instance via the terminal, this only matters insofar as some of the commands might invoke slightly different syntax to what you’d normally use on a Mac or Windows PC. If you’re brand new to Linux, then I’d recommend taking a quick look at this website. It provides a great step-by-step overview of some of the key concepts and commands. One thing that I’ll briefly mention here is that Ubuntu – the Linux distribution that we’ll be using below – uses the apt package-management system. (Much like macOS uses Homebrew.) So when you see commands like apt install PACKAGENAME, that’s just a convenient way to install and manage packages.

Okay, introduction out of the way. Let’s get up a running.

Setup and log-in

You’ll need to choose an operating system for your VM, as well as the server zone (region). To see the available options, first open up the terminal (Windows, Mac, Linux). Then enter

~$ sudo gcloud compute images list
~$ sudo gcloud compute zones list

Tip: If you get an error message running the above commands, try re-running them without the “sudo” bit at the beginning. This stands for “superuser do”, which invokes special user privileges as a security check, but may be redundant on your system. Clearly, if this applies to you, then you will need to do the same for any other commands invoking “sudo” for the rest of this tutorial.

We’ll go with Ubuntu 16.04 and set our zone to the U.S. west coast.

Tip: You can set the default zone in your local client so that you don’t need to specify it every time. See here.

You can also choose a bunch of other options by using the appropriate flags – see here. I’m going to call my VM instance “rstudio” but you can obviously call it whatever you like. I’m also going to specify the type of machine that I want. In this case, I’ll go with the n1-standard-8 option (8 CPUs with 30GB RAM), but you can choose from a range of machine/memory/pricing options. (Assuming a monthly usage rate of 20 hours, this VM will only cost about $7.60 a month to maintain once our 60-day free trial ends. Regardless, you needn’t worry too much about these initial specs now: It’s very easy to change the specs of your VM down the line and Google will even suggest cheaper alternatives if it thinks that you aren’t using your resource capabilities efficiently over time.) In the terminal window, type:

~$ sudo gcloud compute instances create rstudio --image-family ubuntu-1604-lts --image-project ubuntu-os-cloud  --machine-type n1-standard-8 --zone us-west1-a

This should generate something like:

Created [https://www.googleapis.com/compute/v1/projects/YOUR-PROJECT/zones/us-west1-a/instances/rstudio].
NAME      ZONE        MACHINE_TYPE  PREEMPTIBLE  INTERNAL_IP  EXTERNAL_IP      STATUS
rstudio  us-west1-a  n1-standard-8               10.138.0.2   104.198.7.157  RUNNING

Write down the External IP address, as we’ll need it for running RStudio Server later.

Tip: This IP address is ephemeral in the sense that it is only uniquely assigned to your VM while it is running continuously. This shouldn’t create any significant problems, but if you prefer a static (i.e. non-ephemeral) IP address that is always going to be associated with a particular VM instance, then this is easily done. See here.

On a similar note, RStudio Server will run on port 8787 of the External IP, which we need to enable via the Compute Engine firewall.

~$ sudo gcloud compute firewall-rules create allow-rstudio --allow=tcp:8787

Congratulations: Set-up for your Compute Engine VM instance is complete! Easy, wasn’t it?

Not only have you successfully created your VM, but it should now be running on the Compute Engine cloud in the background. If not, see here for instructions on how to start it up.

The next step is to log in via SSH. This is a simple matter of providing your VM’s name and zone (if you forget to specify the zone or haven’t assigned a default, you’ll be prompted):

~$ sudo gcloud compute ssh rstudio --zone us-west1-a

Tip: When trying to SSH into your VM, you may get a message to effect of Error: Please login as the user "ubuntu" rather than the user "root". If this happens, simply amend your command as follows: ~$ sudo gcloud compute ssh ubuntu@rstudio --zone us-west1-a

Upon logging into a Compute Engine project via SSH for the first time, you will be prompted to generate a key passphrase. Needless to say, you should make a note of this passphrase for future long-ins. Your passphrase will be required for all future remote log-ins to Google Cloud projects via gcloud and SSH from your local computer. This includes additional VMs that you create under the same project account.

Passphrase successfully created and entered, you should now be connected to your VM via terminal. That is, you should see something like the following, where “root” is your username and “rstudio” is the server hostname (i.e. your VM):

root@rstudio:~#

Tip: Don’t worry if you’re logged in under a different username instead of “root” like I have here. Again, this is just a reflection of your OS security defaults and/or user preferences. However, it does mean that you will probably have to add “sudo” to the beginning of the remaining terminal commands while you are connected to your VM. In other words, reverse the earlier tweak that I suggested in this tutorial!

Next, we’ll install R.

Install R on your VM

You can find the full set of instructions and recommendations for installing R on Ubuntu here. Or you can just follow my choices below, which covers everything that you should need.

root@rstudio:~# sh -c 'echo "deb https://cloud.r-project.org/bin/linux/ubuntu xenial/" >> /etc/apt/sources.list'
root@rstudio:~# apt-key adv --keyserver keyserver.ubuntu.com --recv-keys E084DAB9
root@rstudio:~# apt update
root@rstudio:~# apt install r-base r-base-dev

Tip: Again, if you have trouble running any of the above commands – particularly if your username isn’t “root” – then you should try to add “sudo” to the beginning of each line. Continue doing this for the remaining commands below as needed.

In addition to the above, a number of important R packages require external Linux libraries that must first be installed separately on your VM. For example, the curl R package, which in turn is a dependency for many other packages. In Ubuntu (or other Debian-based distros), run the below commands in your VM’s terminal:

1) For the “tidyverse” suite of packages:

root@rstudio:~# apt install libcurl4-openssl-dev libssl-dev libxml2-dev

2) For the main spatial libraries (sp, rgeos, sf, etc.):

root@rstudio:~# add-apt-repository -y ppa:ubuntugis/ubuntugis-unstable
root@rstudio:~# apt update && apt upgrade
root@rstudio:~# apt install libgeos-dev libproj-dev libgdal-dev

R is now ready to go your on VM directly from terminal:

root@rstudio:~# R

However, we’d obviously prefer to use the awesome IDE interface provided by RStudio (Server). So that’s what we’ll install and configure next, making sure that we can run RStudio Server on our VM via a web browser like Chrome or Firefox from our local computer. (Hit q() and then n to exit the terminal version of R if you opened it above.)

Install and configure RStudio Server

Download RStudio Server on your VM

You should check what the latest available version of Rstudio Server is here, but as of today (30 May 2017) the following is what you need:

root@rstudio:~# apt install gdebi-core
root@rstudio:~# wget https://download2.rstudio.org/rstudio-server-1.0.143-amd64.deb
root@rstudio:~# gdebi rstudio-server-1.0.143-amd64.deb

Add a user

Now that you’re logged into your VM, you might notice that you haven’t actually signed in as a specific user. In fact, like me you may be automatically signed into the “rstudio” VM as root. (Fun fact: You can tell because the command line has a hashtag instead of a dollar sign.) This doesn’t matter for most applications, but RStudio Server specifically requires a username/password combination. So we first need to create a new user before continuing. For example, to create a new user called “elvis” enter the follow command in terminal.

root@rstudio:~# adduser elvis

You will then be prompted to specify a password for this user (and confirm various bits of biographical information which you can largely ignore).

Tip: Once created, you can now log into a user’s account on the VM directly via SSH, e.g. sudo gcloud compute ssh elvis@rstudio --zone us-west1-a

You are now ready to open up RStudio Server by navigating to the default 8787 port of your VM’s External IP address. (You remember writing this down earlier, right?) If you forgot to write the IP address down, don’t worry: You can find it by logging into your Google Cloud console and looking at your VM instances, or by opening up a new terminal window (not the one currently connected to your VM) and typing:

~$ sudo gcloud compute instances describe rstudio  --zone us-west1-a

Either way, once you have the address, open up your preferred web browser and navigate to:

http://EXTERNAL-IP-ADDRESS:8787

You will be presented with the following web page. Log in using the username/password that you created earlier.

And we’re all set. Here is RStudio Server running on my laptop via Google Chrome.

Tip: Hit F11 to go full screen in your browser. The server version of RStudio is then almost indistinguishable from the desktop version.

Stopping and (re)starting your VM instance

Stopping and (re)starting your VM instance is very simple, so you don’t have to worry about getting billed for times when you aren’t using it. In a new terminal (not the one currently synced to your VM instance):

~$ sudo gcloud compute instances stop rstudio
~$ sudo gcloud compute instances start rstudio

Again: Easy!

In one sense, this tutorial could end right now. You have successfully installed all the programs and components that you’ll need for high-performance statistical analysis and computing. Your VM will be ready to go with RStudio Server whenever you want it. However, there are still a few more tweaks and tips that we can use to really improve our user experience and reduce complications when interacting with these VMs from our local computers. The rest of this tutorial covers my main tips and recommendations.

Bonus: Getting the most out of your Compute Engine + RStudio Server setup

Transferring and syncing files between your VM and your local computer

You have two (three) main options.

1.1 Manually transfer files directly from RStudio Server

This is arguably the simplest option and works well for copying files from your VM to your local computer. However, I can’t guarantee that it will work as well going the other way; you may need to adjust some user privileges first.

1.2 Manually transfer files and folders using the command line or SCP

Manually transferring files or folders across systems is done fairly easily using the command line:

~$ sudo gcloud compute copy-files rstudio:/home/elvis/Papers/MyAwesomePaper/amazingresults.csv ~/local-directory/amazingresults-copy.csv --zone us-west1-a

It’s also possible to transfer files using your regular desktop file browser thanks to SCP. (On Linus and Mac OSX at least. Windows users first need to install a program call WinSCP.) See here.

Tip: The SCP solution is much more efficient when you have assigned a static IP address to your VM instance – otherwise you have to set it up each time you restart your VM instance and are assigned a new ephemeral IP address – so I’d advise doing that first.

2. Sync with Git(Hub), Box, Dropbox, or Google Drive

Ubuntu, like all Linux distros, comes with Git preinstalled. You should thus be able to sync your results across systems using Git(Hub) in the usual fashion. I tend to use the command line for all my Git operations – committing, pulling, pushing, etc. – and I also had some teething problems with Rstudio Server’s in-built Git UI when I first tried it on a VM. However, I believe that these issues have been mostly resolved so let me know if that works for you. (Update: I can confirm that it appears to be working fine.)

Similarly, while I haven’t tried it myself, you should also be able to install Box, Dropbox or Google Drive on your VM and sync across systems that way. If you go this route, then I’d advise installing these programs as sub-directories of the user’s “home” directory. Even then you may run into problems related to user permissions. However, just follow the instructions for linking to the hypothetical “TeamProject” folder that I describe in the next subsection below (except that you must obviously point towards the relevant Box/Dropbox/GDrive folder location instead) and you should be fine.

Tip: Remember that your VM lives on a server and doesn’t have the usual graphical interface – including installation utilities – of a normal desktop. You’ll thus need to follow command line installation instructions for these programs. Make sure you scroll down to the relevant sections of the links that I have provided above.

Last, but not least, Google themselves encourage data synchronisation on Compute Engine VMs using another product within their Cloud Platform, i.e. Google Storage. This is especially useful for really big data files and folders, but beyond the scope of this tutorial. (If you’re interested in learning more, see here and here.)

Reading and writing files to and from shared directories between multiple users on the same VM

The default configuration above works perfectly well in cases where you are a single user and don’t venture outside of your home directory (and its sub directories). Indeed, you can just add new folders within this user’s home directory using standard Linux commands and you will be able to access these from within RStudio Server when you log in as that user.

However, there’s a slight wrinkle in cases where you want to share information between multiple users on the same VM. (Which may well be necessary on a big group project.) In particular, RStudio Server is only going to be able to look for files in each individual user’s home directory (e.g. /home/elvis.) The reason has to do with user permissions; since Elvis is not a “super user”, RStudio Server doesn’t know that he is allowed to access other user’s directories in our VM, and vice versa. Thankfully, there’s a fairly easy workaround, involving standard Linux commands for adding user and group privileges. I won’t explain these in depth here, but an example follows below:

Let’s say that Elvis is working on a joint project together with a colleague called Priscilla. They have decided to keep all of their shared analysis in a new directory called “TeamProject”, hosted on Elvis’s home directory. We first need to give Priscilla her own user profile. Then, we create the new shared directory under Elvis’s home directory:

root@rstudio:~# adduser priscilla
root@rstudio:~# mkdir /home/elvis/TeamProject

Next, create a group – call it “projectgrp” – whose members should all have full read, write and execute access to files within the Papers directory. Then add both a default user (i.e. Elvis) and other members (i.e. Priscilla) to this group:

root@rstudio:~# groupadd projectgrp
root@rstudio:~# gpasswd -a elvis projectgrp
root@rstudio:~# gpasswd -a priscilla projectgrp

Next, set our default user (i.e. “elvis”) and the other projectgrp members as owners of this directory as well as all of its children directories (chown -R). Grant them all read, write and execute access (chmod -R 770):

root@rstudio:~# chown -R elvis:projectgrp /home/elvis/TeamProject
root@rstudio:~# chmod -R 770 /home/elvis/TeamProject

The next two commands are optional, but advised if Priscilla is only going to be working on this VM through the TeamProject directory. First, you can change her primary group ID to projectgrp, so that all the files she creates are automatically assigned to that group:

root@rstudio:~# usermod -g projectgrp priscilla

Second, you can add a symbolic link to the TeamProject directory in Priscilla’s home directory, so that it is immediately visible when she logs into RStudio Server. (Making sure that you switch to her account before running this command):

root@rstudio:~# su - priscilla
priscilla@rstudio:~$ ln -s /home/elvis/TeamProject /home/priscilla/TeamProject
priscilla@rstudio:~$ exit

Other tips

Remember to keep your VM system up to date (just like you would a normal computer).

root@rstudio:~# gcloud components update
root@rstudio:~# apt update
root@rstudio:~# apt upgrade

Summary

Assuming that you have gone through the initial set-up, here’s the tl;dr summary of how to deploy an existing VM with RStudio Server:

1) Start-up a VM instance.

~$ sudo gcloud compute instances start YOUR-VM-INSTANCE-NAME

2) Take note of the External IP address if you need to (see step 4 below):

~$ sudo gcloud compute instances describe YOUR-VM-INSTANCE-NAME

3) Log-in via SSH.

~$ sudo gcloud compute ssh YOUR-VM-INSTANCE-NAME

4) Open up a web browser and navigate to RStudio Server via your VM’s External IP address (enter your username/password as needed):

http://<external-ip-address>:8787

5) Stop your VM:

~$ sudo gcloud compute instances stop YOUR-VM-INSTANCE-NAME

And, remember, if you really want to avoid the command line, then you can always go through the Compute Engine browser console.

Additional resources

As a final word, just remember to consult the official documentation if you ever get stuck. There’s tonnes of useful advice and extra tips for getting the most out of your VM setup, including ways to integrate your system with other products within the Google Cloud platform like BigQuery, Storage, etc. etc.

  • Google Compute Engine documentation (link)
  • RStudio Server documentation (link)
  • Linux Journey guide (link)