Week 5: Writing Fast Code

Stat 431

You might be wondering to yourself - why bother with matrices, when we have so many nice tools in R for data frames?

It turns out that the tradeoff for the niceness of R’s data frames is in speed: it takes longer to do basic calculations when you have variables of many different types.

In this module, you’ll learn a bit about how to make your code faster in general.

Time Estimates:

Videos: 10 min

Readings: 40 min

Activities: 10 min

Check-ins: 4

Code speed

Fortunately, we stand on the shoulders of giants. Many skilled computer scientists have put a lot of time and effort into writing R functions that run as fast as possible. For you, most of the work to speed up code is simply finding the right packages to rely on.

However, there are a few baseline principles that can get you a long way.

If your code takes a long time to run, the reason is often one of these:

You are doing a very large computation, relative to the power of your computer.
You are repeating a slow process many times.
You are trying to do simple things, but on a very large object.

To speed up the code, without deep knowledge of computer algorithms and inner workings, you can sometimes come up with clever ways to avoid these pitfalls.

First, though: as you start thinking about writing faster code, you’ll need a way to check whether your improvements actually sped up the code.

Required Reading: The tictoc and microbenchmark packages

Required Reading: lobstr::obj_size

Tip 1: Avoid larger calculations and memory demands

Save smaller data frames, if you are going to use them many times

Consider the following dataset:

library(fivethirtyeight)

data(congress_age)

head(congress_age)

## # A tibble: 6 x 13
##   congress chamber bioguide firstname middlename lastname suffix birthday  
##      <int> <chr>   <chr>    <chr>     <chr>      <chr>    <chr>  <date>    
## 1       80 house   M000112  Joseph    Jefferson  Mansfie… <NA>   1861-02-09
## 2       80 house   D000448  Robert    Lee        Doughton <NA>   1863-11-07
## 3       80 house   S000001  Adolph    Joachim    Sabath   <NA>   1866-04-04
## 4       80 house   E000023  Charles   Aubrey     Eaton    <NA>   1868-03-29
## 5       80 house   L000296  William   <NA>       Lewis    <NA>   1868-09-22
## 6       80 house   G000017  James     A.         Gallagh… <NA>   1869-01-16
## # … with 5 more variables: state <chr>, party <chr>, incumbent <lgl>,
## #   termstart <date>, age <dbl>

dim(congress_age)

## [1] 18635    13

Suppose we want to do some exploratory analysis:

tic()

congress_age %>%
  filter(chamber == "senate") %>%
  summarize(mean(age))

congress_age %>%
  filter(chamber == "senate") %>%
  ggplot() + 
  geom_histogram(aes(x = age))

congress_age %>%
  filter(chamber == "senate") %>%
  top_n(5, age)

toc()

## 0.53 sec elapsed

These are all three reasonable things to do, and they can’t be done in the same pipeline. But wait - we just made R do the process of subsetting to only Senators three separate times!

tic()
  congress_age %>%
    filter(chamber == "senate")

toc()

## 0.05 sec elapsed

Instead, how about:

tic()

 senate <- congress_age %>%
  filter(chamber == "senate")

senate %>%
  summarize(mean(age))

senate %>%
  ggplot() + 
  geom_histogram(aes(x = age))

senate %>%
  top_n(5, age)

toc()

## 0.36 sec elapsed

Be smart about order of matrix operations

Think carefully about how matrix multiplication works. You want to avoid doing multiplications between big matrices as much as you can.

Check-In 1: Order of matrix multiplication

Suppose you have two matrices and a (column) vector that you would like to multiply together: \({\bf A} {\bf B}{\bf y}\).

Question 1: What is the faster way to do this? Use tictoc or microbenchmark or similar to time the steps.

(You may want to make these matrices a bit smaller, if you are on a personal laptop, to avoid annoying R crashes. You can work your way up to bigger ones until you see a noticeable difference in measured speed.)

A <- matrix(rnorm(100000), nrow = 500, ncol = 200)
B <- matrix(rnorm(180000), nrow = 200, ncol = 900)
y <- matrix(rnorm(900))

## a
res <- (A %*% B) %*% y

## b
res <- A %*% (B %*% y)

Question 2: Intuitively, why was the faster one faster? Think about the matrix calculations that are being done at each step.

Canvas Link

Tip 2: Avoid repeating slow steps

Avoid for-loops

For-loops are deathly slow in R. If you absolutely must iterate over a process, rely on the apply or map function families. (These will typically be close to the same speed, depending on the situation.)

do_stuff_loop <- function(reps) {
  results <- vector(length = reps)
  for (i in 1:reps) {
    results[i] <- i
  }
  return(results)
}

do_stuff_apply <- function(reps) {
  sapply(1:reps, function(x) x)
}


do_stuff_map <- function(reps) {
  purrr::map_dbl(1:reps, ~ .x)
}


test <- microbenchmark(
  do_stuff_loop(1000),
  do_stuff_apply(1000),
  do_stuff_map(1000)
)

test

## Unit: microseconds
##                  expr  min   lq mean median   uq  max neval
##   do_stuff_loop(1000)   70   78  131     82   88 4832   100
##  do_stuff_apply(1000)  701  776  839    808  836 4075   100
##    do_stuff_map(1000) 1015 1123 1254   1163 1191 4767   100

Use vectorized functions

Better even than apply or map is not to iterate at all!

Writing vectorized functions is tough, but do-able in many cases. Here’s an example:

Required Video: Simple Vectorizing Example

Better yet - rely on built-in functions in other packages to do it for you!

Allocate memory in advance

A trick we use very often in coding is to “tack on” values to the end of a vector or list as we iterate through something. However, this is actually very slow! If you know how long your list is going to end up, consider pre-creating that list.

do_stuff_allocate <- function(reps) {
  results <- vector(length = reps)
  for (i in 1:reps) {
    results[i] <- i
  }
  return(results)
}

do_stuff_tackon <- function(reps) {
   results <- c()
  for (i in 1:reps) {
    results <- c(results, i)
  }
  return(results)
}



test <- microbenchmark(
  do_stuff_allocate(1000),
  do_stuff_tackon(1000)
)

test

## Unit: microseconds
##                     expr  min   lq mean median   uq   max neval
##  do_stuff_allocate(1000)   85   86  190     89   93 10115   100
##    do_stuff_tackon(1000) 1688 1709 2225   1751 2628  8043   100

Tip 3: Use faster existing functions

Because R has so many packages, there are often many functions to perform the same task. Not all these are created equal!

matrix operations

A few other functions you might find useful: rowSums() and colSums(), rowMeans() and colMeans().

Check-In 2: Cross products

For the two approaches below, which is faster?

You may want to start with a smaller matrix, then gradually increase the size until you see a noticeable time difference on your computer.

my_big_matrix <- matrix(rnorm(100000), nrow = 100, ncol = 1000)

## a
res <- t(my_big_matrix) %*% my_big_matrix

## b
res <- crossprod(my_big_matrix)

Canvas Link

data.table

For speeding up work with data frames, no package is better than data.table.

Required Video: 5 minute intro to data.table

Required Reading: data.table cheatsheet

Check-In 3: data.table

Here is some dplyr code. Re-write it in data.table syntax. Then perform a speed test to see which one is faster, and by how much.

congress_age %>%
  group_by(state) %>%
  summarize(mean(age))

## # A tibble: 50 x 2
##    state `mean(age)`
##    <chr>       <dbl>
##  1 AK           57.6
##  2 AL           53.0
##  3 AR           52.8
##  4 AZ           54.7
##  5 CA           53.6
##  6 CO           52.3
##  7 CT           49.7
##  8 DE           53.1
##  9 FL           53.4
## 10 GA           53.4
## # … with 40 more rows

Canvas Link

Tip 4: Only improve what needs improving

“We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%. A good programmer will not be lulled into complacency by such reasoning, he will be wise to look carefully at the critical code; but only after that code has been identified.” — Donald Knuth.

Speed and computational efficiency are important, but there can be a tradeoff.

If you make a small speed improvement, but your code becomes overly complex, confusing, or difficult to edit and test it’s not worth it!

Also, consider your time and energy as the programmer. Suppose you have a function that takes 30 minutes to run. It relies on a subfunction that takes 30 seconds to run. Should you really spend your efforts making the subfunction take only 10 seconds?

The art of finding the slow bits of your code where you can make the most improvement is called profiling

Required Reading: Profiling

For this reading, I strongly encourage you to skim. The code aspects are mostly more complicated that we expect in this class; but the general overview of the principles behind speeding up code is helpful.

In your work, you will most likely “profile” manually, by tictoc-ing bits of your code and so forth, not by these fancy methods.

Check-In 4: Profiling

Question 1: Why is vapply faster than sapply?

It specifies the output type.
It specifies the input type.
It performs fewer safety checks.
It performs fewer iterations.

Question 2: Why is mean(x) faster than sum(x)/length(x)?

sum() is vectorized, while mean() is not.
A mean is a smaller number than a sum, so it needs less memory.
mean() calculates the mean twice just in case.
mean() was written in an old version of R.

Question 3: What is the name of the site where you can search packages by topic?

CRANberries
CRAN search algorithm
CRAN archive network
CRAN task views

Canvas Link

Optional: Super advanced mode

The following are some suggestions of ways you can level-up your code speed. These are outside the scope of this class, but you are welcome to explore them!

Extra Resources:

Dive deep into the R efficiency rabbit hole…
Parallelize your computations
Make your code run in C++ with Rcpp
Learn to use recursion
Learn more about memory allocation and garbage collecting in R.