This is the first of two chapters on Google Compute Engine. We’ll learn how to run R and RStudio Server on virtual machines (VMs) up on the cloud. This means that you’ll be able to conduct your analysis in (almost) exactly the same user environment as you’re used to, but now with the full power of cloud-based computation at your disposal. Trust us, it will be awesome.
These next instructions are important, so please read carefully.
- Sign up for a 12-month ($300 credit) free trial with Google Cloud Platform (GCP). This requires an existing Google/Gmail account. During the course of sign-up, you’ll be prompted to enter credit card details for billing purposes. Don’t worry, you won’t be charged unless/until you actively request continued access to GCP after your free trial ends. But a billable project ID is required before gaining access to the platform.
- Download and follow the installation instructions for the Google Cloud SDK command line utility, gcloud. This is how we’ll connect to GCP from our local computers via the shell.
Thus far in the course, we’ve spent quite a lot of time learning how to code efficiently. We’ve covered topics like functional programming, caching, parallel programming, and so on. All of these tools will help you make the most of the computational resources at your disposal. However, there’s a limit to how far they can take you. At some point, datasets become too big, simulations become too complex, and regressions take too damn long to run on your laptop. The only solution beyond this point is a bigger boat... or rather, more power.
The easiest and cheapest way to access more computational power these days is through the cloud. While there are a number of excellent cloud service providers, in this chapter we’re going to focus on Google Cloud Platform (GCP). GCP offers a range of incredibly useful services (some of which we’ll cover in later lectures), and the 12-month free trial makes an ideal entry point for learning about cloud computation.
The particular GCP product that we’re going to use today is Google Compute Engine (GCE). GCE is a service that allows users to launch so-called virtual machines on demand in the cloud (i.e. in Google’s data centers). There’s a lot more that we can say, and will say later, about the benefits that GCE can bring to us. But right now, you may well be asking yourself: “What is a virtual machine and why do I need one anyway?”
So, let’s take a step back and quickly clear up some terminology.
A virtual machine (VM) is just an emulation of a computer running inside another (bigger) computer. It can potentially perform all and more of the operations that your physical laptop or desktop does. It might even share many of the same properties, from operating system to internal architecture. The key advantage of a VM from our perspective is that very powerful machines can be “spun up” in the cloud almost effortlessly and then deployed to tackle jobs that are beyond the capabilities of your local computer. Got a big dataset that requires too much memory to analyse on your old laptop? Load it into a high-powered VM. Got some code that takes an age to run? Fire up a VM and let it chug away without consuming any local resources. Or, better yet, write code that runs in parallel and then spin up a VM with lots of cores to get the analysis done in a fraction of the time. All you need is a working internet connection and a web browser.
Now, with that background knowledge in mind, GCE delivers high-performance, rapidly scalable VMs. A new VM can be deployed or shut down within seconds, while existing VMs can easily be ramped up or down depending on a project’s needs (cores added, RAM added, etc.). In our experience, most people would be hard-pressed to spend more than a couple of dollars a month using GCE once their free trial is over. This is especially true for researchers or data scientists who only need to fire up a VM, or VM cluster, occasionally for the most computationally-intensive part of a project, and can then easily switch it off when it is not being used.
Disclaimer: While we very much stand by the above paragraph, it is ultimately your responsibility to keep track of your billing and utilisation rates. Take a look at GCP’s Pricing Calculator to see how much you can expect to be charged for a particular machine and level of usage. You can even set a budget and create usage alerts if you want to be extra cautious.
Our goal for the next two chapters is to set up a VM (or cluster of VMs) on GCE. What’s more, we want to install R and RStudio (Server) on these VMs, so that we can interact with them in exactly the same environment that we’re used to on our own computers. We’re going to show you two approaches:
- Manually configure GCE with RStudio Server (this chapter)
- Automate with googleComputeEngineR and friends (next chapter)
Both approaches have their merits, but we think it’s important to start with the manual configuration so that you get a good understanding of what’s happening underneath the hood. Let’s get started.
Windows users: You will need to run any multi-line commands (i.e. those that are chained with the backslash character) as single-line commands. Basically, delete the trailing “\” characters at the end of any sub-lines and run it as one long command on a single line.
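To make the backslash rule concrete, here is a minimal sketch for a Unix-style shell (e.g. bash). It uses `echo` as a harmless stand-in for a real command, so nothing is actually created; the variable names are purely illustrative.

```shell
# Multi-line form: the trailing backslash continues the command on the next line.
multi_line=$(echo gcloud compute instances create my-vm \
  --zone us-west1-a)

# Single-line form: the same command with the backslash and line break removed.
single_line=$(echo gcloud compute instances create my-vm --zone us-west1-a)

# Both forms expand to exactly the same command text.
[ "$multi_line" = "$single_line" ] && echo "Both forms are the same command"
```

The single-line form is the one that Windows users would type.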
You’ll need to choose an operating system (OS) for your VM, as well as its designated zone. Let’s quickly look at the available options, since this will also be a good time to confirm that you correctly installed the gcloud command-line interface. Open up your shell and enter:
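A minimal sketch of this check, assuming the standard `gcloud compute images list` and `gcloud compute zones list` subcommands. The wrapper function `list_gce_options` is a hypothetical convenience of ours; you can equally type the two commands directly.

```shell
# Hypothetical helper: print the OS images and zones available to your project.
list_gce_options() {
  gcloud compute images list   # available operating system images
  gcloud compute zones list    # available zones
}

# Usage (requires an installed and authenticated gcloud):
#   list_gce_options
```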
Tip: If you get an error message with the above commands, try re-running them with sudo at the beginning. If this works for you, then you will need to prepend “sudo” to the other shell commands in this lecture.
You’ll know that everything is working properly if these commands return a large range of options. If you get an error, please try reinstalling gcloud before continuing.
The key shell command for creating your VM is gcloud compute instances create.
You can specify the type of machine that you want and a range of other options by using the appropriate flags. Let me first show you an example of the command and then walk through our (somewhat arbitrary) choices in more detail. Note that we are going to call our VM instance “my-vm”, but you can call it whatever you want.
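A minimal sketch of such a command, using only the flags that are explained in the breakdown further on. We have wrapped it in a hypothetical helper function, `create_vm`, so that the VM name and zone are easy to change; the function name and its argument order are our own assumptions.

```shell
# Hypothetical helper: create an Ubuntu 20.04 VM with 8 CPUs and 30GB RAM.
#   $1 = VM name (e.g. my-vm), $2 = zone (e.g. us-west1-a)
create_vm() {
  gcloud compute instances create "$1" \
    --image-family ubuntu-2004-lts \
    --image-project ubuntu-os-cloud \
    --machine-type n1-standard-8 \
    --zone "$2"
}

# Usage (this creates a billable resource):
#   create_vm my-vm us-west1-a
```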
Tip: Windows users, remember that you can’t execute multi-line shell commands. Delete the trailing “\” characters above and run it as one long command on a single line.
Here is a breakdown of the command and a quick explanation of our choices.
gcloud compute instances create my-vm: Create a new VM called “my-vm”. Yes, we are very creative.
--image-family ubuntu-2004-lts --image-project ubuntu-os-cloud: Use Ubuntu 20.04 as the underlying operating system.
--machine-type n1-standard-8: We’ve elected to go with the “N1 Standard 8” option, which means that we’re getting 8 CPUs and 30GB RAM. However, you can choose from a range of machine/memory/pricing options. (Assuming a monthly usage rate of 20 hours, this particular VM will only cost about $7.60 a month to maintain once our free trial ends.) You needn’t worry too much about these initial specs now. New VMs are very easy to create and discard once you get the hang of it. It’s also very simple to change the specs of an already-created VM. GCE will even suggest cheaper specifications if it thinks that you aren’t using your resources efficiently down the line.
--zone us-west1-a: Our preferred zone. The zone choice shouldn’t really matter, although you’ll be prompted to choose one if you forget to include this flag. As a general rule, we advise picking whatever is closest to you.
Assuming that you ran the above command (perhaps changing the zone to one nearest you), you should see something like the following:
Created [https://www.googleapis.com/compute/v1/projects/YOUR-PROJECT/zones/YOUR-ZONE/instances/YOUR-VM].
NAME   ZONE        MACHINE_TYPE   PREEMPTIBLE  INTERNAL_IP    EXTERNAL_IP    STATUS
my-vm  us-west1-a  n1-standard-8               10.138.15.222  126.96.36.199  RUNNING
Write down the External IP address, as we’ll need it for running RStudio Server later.
RStudio Server runs on port 8787 of an associated IP address. Because Google Cloud by default blocks external traffic on GCE VMs for security reasons, we first need to enable the 8787 port via a firewall rule. The following command creates a firewall rule (which we’ll call “rstudio”) that does exactly this.
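A minimal sketch of such a rule, assuming the `--allow` flag of `gcloud compute firewall-rules create` with a `PROTOCOL:PORT` value. The wrapper function `open_rstudio_port` is a hypothetical name of ours.

```shell
# Hypothetical helper: create a project-wide firewall rule called "rstudio"
# that allows incoming TCP traffic on port 8787.
open_rstudio_port() {
  gcloud compute firewall-rules create rstudio --allow tcp:8787
}

# Usage:
#   open_rstudio_port
```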
Note that these firewall rules apply across every VM in a project. So you should only have to run the above command once.
Congratulations: Set-up for your GCE VM instance is already complete.
(Easy, wasn’t it?)
The next step is to log in via SSH (i.e. Secure Shell). This is a simple matter of providing your VM’s name and zone. If you forget to specify the zone or haven’t assigned a default, you’ll be prompted.
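As a minimal sketch, the log-in step could look as follows. The helper name `vm_ssh` is hypothetical; the underlying `gcloud compute ssh` command and its `--zone` flag are the same ones used elsewhere in this chapter.

```shell
# Hypothetical helper: open an SSH session on a VM.
#   $1 = VM name (optionally user@vm), $2 = zone
vm_ssh() {
  gcloud compute ssh "$1" --zone "$2"
}

# Usage:
#   vm_ssh my-vm us-west1-a
```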
IMPORTANT: Upon logging into a GCE instance via SSH for the first time, you will be prompted to generate a key passphrase. Needless to say, you should make a note of this passphrase for future log-ins. Your passphrase will be required for all future remote log-ins to Google Cloud projects via gcloud and SSH from your local computer. This includes additional VMs that you create under the same project account.
Passphrase successfully created and entered, you should now be connected to your VM via SSH. That is, you should see something like the following, where “grant” and “my-vm” will obviously be replaced by your own username and VM hostname.
Next, we’ll install R on our VM.
You can find the full set of instructions and recommendations for installing R on Ubuntu here. Or you can just follow our choices below, which should cover everything that you need. Note that you should be running these commands directly in the shell that is connected to your VM.
## grant@my-vm:~$
sudo sh -c 'echo "deb https://cloud.r-project.org/bin/linux/ubuntu focal-cran40/" >> /etc/apt/sources.list'
sudo apt-key adv --keyserver keyserver.ubuntu.com --recv-keys E298A3A825C0D65DFD57CBB651716619E084DAB9
sudo apt update && sudo apt upgrade -y
sudo apt install -y r-base r-base-dev
Base R is now installed and ready to go on your VM. However, we’re going to walk you through two extra steps, since this will avoid some common headaches down the road.
First, we’re going to change where we get our R libraries from:
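One way to do this is to append an `options(repos = ...)` line to an R profile file. The sketch below is an assumption on our part: the helper name `set_rspm_repo` is hypothetical, and the repository URL follows the RSPM pattern for Ubuntu 20.04 (“focal”) binaries, so check it against the RSPM setup page before relying on it.

```shell
# Hypothetical helper: append an RSPM repository setting to the R profile
# file given as $1. The URL is an assumed RSPM address for Ubuntu 20.04.
set_rspm_repo() {
  printf '%s\n' \
    'options(repos = c(RSPM = "https://packagemanager.rstudio.com/all/__linux__/focal/latest"))' \
    >> "$1"
}

# Usage on the VM, for the current user only:
#   set_rspm_repo "$HOME/.Rprofile"
```

Writing to the system-wide profile instead (typically /etc/R/Rprofile.site on Ubuntu) would apply the setting to all users, but requires superuser rights.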
The above command sets our default library source to RStudio Package Manager (RSPM), instead of the usual CRAN mirror(s). Why would we do this? Well, because RSPM provides pre-compiled R package binaries for Linux, whereas CRAN requires us to install and build from source. We don’t want you to worry too much about this. Just trust us that the above command will allow us to install R packages much faster and with fewer hiccups.
Second, let’s install some additional system libraries on our VM:
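A minimal sketch, assuming the Ubuntu development packages that the sf package documents as its system dependencies. This is probably a shorter list than you will eventually want, and the helper name `install_sf_system_libs` is hypothetical.

```shell
# Hypothetical helper: install the geospatial system libraries behind sf.
# The package list is an assumption; extend it to suit your own R packages.
install_sf_system_libs() {
  sudo apt install -y libudunits2-dev libgdal-dev libgeos-dev libproj-dev
}

# Usage (on the VM):
#   install_sf_system_libs
```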
The above system libraries are needed to power some common R packages under the hood. For example, we’ve just installed the underlying geospatial libraries that support the sf package.
If you followed our steps above, then you could already launch directly into R from the shell. However, we’d obviously prefer to use the awesome IDE interface provided by RStudio (Server). So that’s what we’ll install and configure next, making sure that we can run RStudio Server on our VM via a web browser from our local computer.
You should check what the latest available version of RStudio Server is here. At the time of writing, the following is what you need:
Now that you’re connected to your VM, you might notice that you never actually logged in as a specific user. (More discussion here.) This doesn’t matter for most applications, but RStudio Server specifically requires a username/password combination. So we must first create a new user and give them a password before continuing. For example, we can create a new user called “elvis” like so:
You will then be prompted to specify a user password (and confirm various bits of biographical information which you can ignore). An optional, but recommended step is to add your new user to the sudo group. We’ll cover this in more depth later in the tutorial, but being part of the sudo group will allow Elvis to temporarily invoke superuser privileges when needed.
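The two user-management steps can be sketched as follows, assuming the standard Ubuntu `adduser` and `usermod` utilities. The helper name `add_rstudio_user` is hypothetical.

```shell
# Hypothetical helper: create a new user and add them to the sudo group.
#   $1 = username (e.g. elvis)
add_rstudio_user() {
  sudo adduser "$1"            # prompts for a password and optional details
  sudo usermod -aG sudo "$1"   # optional: add the user to the sudo group
}

# Usage (on the VM):
#   add_rstudio_user elvis
```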
Tip: Once created, you can now log into a user’s account on the VM directly via SSH, e.g.
gcloud compute ssh elvis@my-vm --zone us-west1-a
Stopping and (re)starting your VM instance is highly advisable, since you don’t want to get billed for times when you aren’t using it. In a new shell window (not the one currently synced to your VM instance):
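A minimal sketch, assuming the `gcloud compute instances stop` and `gcloud compute instances start` subcommands. The helper names `vm_stop` and `vm_start` are hypothetical.

```shell
# Hypothetical helpers: stop and (re)start a VM.
#   $1 = VM name, $2 = zone
vm_stop()  { gcloud compute instances stop "$1" --zone "$2"; }
vm_start() { gcloud compute instances start "$1" --zone "$2"; }

# Usage (from your local computer):
#   vm_stop my-vm us-west1-a
#   vm_start my-vm us-west1-a
```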
Congratulations! You now have a fully-integrated VM running R and RStudio whenever you need it. Assuming that you have gone through the initial setup, here’s the tl;dr summary of how to deploy an existing VM with RStudio Server:
1. Start up your VM instance.
2. Take note of the External IP address for step 3 below.
3. Open up a web browser and navigate to RStudio Server on your VM at http://EXTERNAL-IP-ADDRESS:8787. Enter your username and password as needed.
4. Log in via SSH. (Optional)
5. Stop your VM.
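If you would rather not copy the External IP address by hand, the sketch below looks it up and prints the RStudio Server address. It assumes the `--format` option of `gcloud compute instances describe` and that the VM has a single network interface with an external IP; the helper name `rstudio_url` is hypothetical.

```shell
# Hypothetical helper: print the RStudio Server URL for a running VM.
#   $1 = VM name, $2 = zone
rstudio_url() {
  ip=$(gcloud compute instances describe "$1" --zone "$2" \
    --format='get(networkInterfaces[0].accessConfigs[0].natIP)')
  echo "http://${ip}:8787"
}

# Usage:
#   rstudio_url my-vm us-west1-a
```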
And, remember, if you really want to avoid the command line, then you can always go through the GCE browser console.
You have already completed all of the steps that you’ll need for high-performance computing in the cloud. Any VM that you create on GCE using the above methods will be ready to go with RStudio Server whenever you want it. However, there are still a few more tweaks and tips that we can use to really improve our user experience and reduce complications when interacting with these VMs from our local computers. The rest of this tutorial covers our main tips and recommendations.
First things first: Remember to keep your VM up to date, just like you would a normal computer. We recommend that you run the following command (really: two commands) regularly:
You can also update the gcloud utility components on your local computer (i.e. not your VM) with the following command:
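Both update steps can be sketched as follows, assuming the standard `apt` commands on the VM and `gcloud components update` on your local computer. The helper names are hypothetical.

```shell
# Hypothetical helper: update the Ubuntu packages. Run this on the VM.
update_vm() {
  sudo apt update && sudo apt upgrade -y
}

# Hypothetical helper: update the gcloud components. Run this locally.
update_gcloud() {
  gcloud components update
}
```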
You have three main options for moving files between your VM and your local computer.
RStudio’s “Files” pane (at the bottom-right) provides various options for moving files and directories around. This includes uploading from your local computer to VM, or exporting the other way around — see the screenshot below. This is arguably the simplest option and works especially well for quick or small jobs.
Manually transferring files or folders across systems is done fairly easily using the command line. Note that this next code chunk would be run in a new shell instance (i.e. not the one connected to your VM via SSH).
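A minimal sketch of such a transfer, assuming the `gcloud compute scp` subcommand. The helper names, argument order and example paths are all hypothetical.

```shell
# Hypothetical helpers: copy a file between the VM and your local computer.
#   $1 = VM name, $2 = source path, $3 = destination path, $4 = zone
vm_download() { gcloud compute scp "$1:$2" "$3" --zone "$4"; }
vm_upload()   { gcloud compute scp "$2" "$1:$3" --zone "$4"; }

# Usage (from your local computer; the paths are placeholders):
#   vm_download my-vm /home/elvis/results.csv ./results.csv us-west1-a
#   vm_upload   my-vm ./data.csv /home/elvis/data.csv us-west1-a
```

To copy a whole directory, `gcloud compute scp` also accepts a `--recurse` flag.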
It’s also possible to transfer files using your regular desktop file browser thanks to SCP. (On Linux and macOS at least. Windows users first need to install a program called WinSCP.) See here.
Tip: The file browser-based SCP solution is much more efficient when you have assigned a static IP address to your VM instance — otherwise you have to set it up each time you restart your VM instance and are assigned a new ephemeral IP address — so we’d advise doing that first.
This is our own preferred option. Ubuntu, like virtually all Linux distros, comes with Git preinstalled. You should thus be able to sync your results across systems using Git(Hub) in the usual fashion. We tend to use the command line for all our Git operations (committing, pulling, pushing, etc.) and this works exactly as expected once you’ve SSH’d into your VM. However, RStudio Server’s built-in Git UI also works well and comes with some nice added functionality (highlighted diff. sections and so forth).
While we haven’t tried it ourselves, you should also be able to install Box, Dropbox or Google Drive on your VM and sync across systems that way. If you go this route, then we’d advise installing these programs as sub-directories of the user’s “home” directory. Even then you may run into problems related to user permissions. However, just follow the instructions for linking to the hypothetical “TeamProject” folder that we describe below (except that you must obviously point towards the relevant Box/Dropbox/GDrive folder location instead) and you should be fine.
Tip: Remember that your VM lives on a server and doesn’t have the usual graphical interface — including installation utilities — of a normal desktop. You’ll thus need to follow command line installation instructions for these programs. Make sure you scroll down to the relevant sections of the links that we have provided above.
Last, but not least, Google themselves encourage data synchronisation on GCE VMs using another product within their Cloud Platform, i.e. Google Storage. This is especially useful for really big data files and folders, but beyond the scope of this lecture. (If you’re interested in learning more, see here and here.)
In the next chapter, we’ll build on today’s material by showing you how to automate a lot of steps with the googleComputeEngineR package and related tools. In the meantime, here are some further resources that you might find useful.
- We recommend consulting the official GCE documentation if you ever get stuck. There’s loads of useful advice and extra tips for getting the most out of your VM setup, including ways to integrate your system with other GCP products like Storage, BigQuery, etc.
- Other useful links include the RStudio Server documentation, and the Linux Journey guide for anyone who wants to learn more about Linux (yes, you!).
As we discussed in the previous chapter on parallel programming, R ships with its own BLAS/LAPACK libraries by default. While this default works well enough, you can get significant speedups by switching to more optimized libraries such as the Intel Math Kernel Library (MKL) or OpenBLAS. The former is slightly faster according to the benchmark tests that we’ve seen, but it’s a close run thing. Either can be installed with a simple one line command, e.g.
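A minimal sketch, assuming that the `intel-mkl` and `libopenblas-dev` packages are available from your VM’s Ubuntu repositories (check with `apt search` if in doubt). The helper name `install_fast_blas` is hypothetical.

```shell
# Hypothetical helper: install an optimized BLAS/LAPACK library.
#   $1 = "mkl" or "openblas"
install_fast_blas() {
  case "$1" in
    mkl)      sudo apt install -y intel-mkl ;;
    openblas) sudo apt install -y libopenblas-dev ;;
    *)        echo "usage: install_fast_blas mkl|openblas" >&2; return 1 ;;
  esac
}

# Usage (on the VM):
#   install_fast_blas mkl
```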
Tip: If you run this command in your shell (SSH), you’ll need to consent to some options. Press TAB on your keyboard to cycle through these and ENTER to select. Just say “yes” / “okay” to everything.
Once MKL (or OpenBLAS) is installed, your R session should automatically be configured to use it by default. You can check yourself by opening up R and checking the sessionInfo() output, which should return something like:
As per our parallel programming chapter, you may also wish to switch off MKL’s default multithreading capabilities in R to avoid nested parallelism. You can use the method that we saw in that chapter, or set it as an environment variable in your .Renviron file.
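For the environment variable route, a minimal sketch is to append MKL’s `MKL_NUM_THREADS` setting to your .Renviron file. The helper name `limit_mkl_threads` is hypothetical, and a value of 1 (i.e. no multithreading) is our assumption about what you want.

```shell
# Hypothetical helper: limit MKL to a single thread in future R sessions by
# appending the setting to the .Renviron file given as $1.
limit_mkl_threads() {
  echo "MKL_NUM_THREADS=1" >> "$1"
}

# Usage (on the VM; restart R afterwards):
#   limit_mkl_threads "$HOME/.Renviron"
```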