# Week 5: Writing Fast Code

You might be wondering to yourself - why bother with matrices, when we have so many nice tools in R for data frames?

It turns out that the tradeoff for the niceness of R’s data frames is speed: basic calculations take longer when your object holds variables of many different types.

In this module, you’ll learn a bit about how to make your code faster in general.

Time Estimates:
Videos: 10 min
Activities: 10 min
Check-ins: 4

## Code speed

Fortunately, we stand on the shoulders of giants. Many skilled computer scientists have put a lot of time and effort into writing R functions that run as fast as possible. For you, most of the work to speed up code is simply finding the right packages to rely on.

However, there are a few baseline principles that can get you a long way.

If your code takes a long time to run, the reason is often one of these:

• You are doing a very large computation, relative to the power of your computer.
• You are repeating a slow process many times.
• You are trying to do simple things, but on a very large object.

To speed up the code, without deep knowledge of computer algorithms and inner workings, you can sometimes come up with clever ways to avoid these pitfalls.

First, though: as you start thinking about writing faster code, you’ll need a way to check whether your improvements actually sped up the code.

Required Reading: The tictoc and microbenchmark packages
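As a quick preview of these tools, here is a minimal sketch (assuming both packages are installed): tictoc measures a single run with tic()/toc(), while microbenchmark repeats each expression many times and summarizes the distribution of timings.

```r
library(tictoc)
library(microbenchmark)

# tictoc: wall-clock time for one run of some code
tic()
x <- sum(rnorm(1e6))
toc()

# microbenchmark: runs each expression many times and reports min/median/max
microbenchmark(
  sqrt(1:1000),   # vectorized square root
  (1:1000)^0.5,   # same result via exponentiation
  times = 100
)
```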

## Tip 1: Avoid larger calculations and memory demands

### Save smaller data frames, if you are going to use them many times

Consider the following dataset:

library(fivethirtyeight)

data(congress_age)

head(congress_age)
## # A tibble: 6 x 13
##   congress chamber bioguide firstname middlename lastname suffix birthday
##      <int> <chr>   <chr>    <chr>     <chr>      <chr>    <chr>  <date>
## 1       80 house   M000112  Joseph    Jefferson  Mansfie… <NA>   1861-02-09
## 2       80 house   D000448  Robert    Lee        Doughton <NA>   1863-11-07
## 3       80 house   S000001  Adolph    Joachim    Sabath   <NA>   1866-04-04
## 4       80 house   E000023  Charles   Aubrey     Eaton    <NA>   1868-03-29
## 5       80 house   L000296  William   <NA>       Lewis    <NA>   1868-09-22
## 6       80 house   G000017  James     A.         Gallagh… <NA>   1869-01-16
## # … with 5 more variables: state <chr>, party <chr>, incumbent <lgl>,
## #   termstart <date>, age <dbl>
dim(congress_age)
## [1] 18635    13

Suppose we want to do some exploratory analysis:

library(tidyverse)
library(tictoc)

tic()

congress_age %>%
  filter(chamber == "senate") %>%
  summarize(mean(age))

congress_age %>%
  filter(chamber == "senate") %>%
  ggplot() +
  geom_histogram(aes(x = age))

congress_age %>%
  filter(chamber == "senate") %>%
  top_n(5, age)

toc()
## 0.53 sec elapsed

All three of these are reasonable things to do, and they can’t be combined into a single pipeline. But wait - we just made R repeat the work of subsetting to only Senators three separate times!

tic()
congress_age %>%
  filter(chamber == "senate")
toc()
## 0.05 sec elapsed

tic()

senate <- congress_age %>%
  filter(chamber == "senate")

senate %>%
  summarize(mean(age))

senate %>%
  ggplot() +
  geom_histogram(aes(x = age))

senate %>%
  top_n(5, age)

toc()
## 0.36 sec elapsed

### Be smart about order of matrix operations

Think carefully about how matrix multiplication works: the cost of each multiplication depends on the dimensions of the matrices involved, so you want to avoid multiplying two big matrices together whenever you can.
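A useful rule of thumb: multiplying an m-by-n matrix by an n-by-p matrix takes roughly m·n·p arithmetic operations, so how you group a chain of multiplications can change the total work dramatically. A small back-of-the-envelope sketch (the dimensions here are made up for illustration):

```r
# Rough cost of multiplying an (m x n) matrix by an (n x p) matrix
mult_cost <- function(m, n, p) m * n * p

# Chain: (1000 x 1000) %*% (1000 x 1000) %*% (1000 x 1)
# Grouping 1: multiply the two big matrices first, then the vector
cost1 <- mult_cost(1000, 1000, 1000) + mult_cost(1000, 1000, 1)
# Grouping 2: multiply by the vector first, so every step yields a vector
cost2 <- mult_cost(1000, 1000, 1) + mult_cost(1000, 1000, 1)

cost1 / cost2  # grouping 2 does roughly 500x less arithmetic
```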

Check-In 1: Order of matrix multiplication

Suppose you have two matrices and a (column) vector that you would like to multiply together: $${\bf A} {\bf B}{\bf y}$$.

Question 1: What is the faster way to do this? Use tictoc or microbenchmark or similar to time the steps.

(You may want to make these matrices a bit smaller, if you are on a personal laptop, to avoid annoying R crashes. You can work your way up to bigger ones until you see a noticeable difference in measured speed.)

A <- matrix(rnorm(100000), nrow = 500, ncol = 200)
B <- matrix(rnorm(180000), nrow = 200, ncol = 900)
y <- matrix(rnorm(900))

## a
res <- (A %*% B) %*% y

## b
res <- A %*% (B %*% y)

Question 2: Intuitively, why was the faster one faster? Think about the matrix calculations that are being done at each step.

## Tip 2: Avoid repeating slow steps

### Avoid for-loops

For-loops have a reputation for being slow in R. If you must iterate over a process, the apply and map function families are the idiomatic alternatives. (As the benchmark below shows, a well-written for-loop with pre-allocated output can actually hold its own; the three approaches are often close in speed, depending on the situation.)

do_stuff_loop <- function(reps) {
  results <- vector(length = reps)
  for (i in 1:reps) {
    results[i] <- i
  }
  return(results)
}

do_stuff_apply <- function(reps) {
  sapply(1:reps, function(x) x)
}

do_stuff_map <- function(reps) {
  purrr::map_dbl(1:reps, ~ .x)
}

library(microbenchmark)

test <- microbenchmark(
  do_stuff_loop(1000),
  do_stuff_apply(1000),
  do_stuff_map(1000)
)

test
## Unit: microseconds
##                  expr  min   lq mean median   uq  max neval
##   do_stuff_loop(1000)   70   78  131     82   88 4832   100
##  do_stuff_apply(1000)  701  776  839    808  836 4075   100
##    do_stuff_map(1000) 1015 1123 1254   1163 1191 4767   100

### Use vectorized functions

Better even than apply or map is not to iterate at all!

Writing vectorized functions is tough, but do-able in many cases. Here’s an example:

Required Video: Simple Vectorizing Example

Better yet - rely on built-in functions in other packages to do it for you!
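To see what vectorizing buys you, here is a small sketch comparing an element-by-element loop with the equivalent vectorized call (the function names here are invented for the example):

```r
# Looped version: squares each element one at a time in R code
square_loop <- function(x) {
  out <- vector("numeric", length(x))
  for (i in seq_along(x)) {
    out[i] <- x[i]^2
  }
  out
}

# Vectorized version: one call; the iteration happens in fast compiled code
square_vec <- function(x) x^2

x <- rnorm(1e5)
identical(square_loop(x), square_vec(x))  # same answer either way
```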

A trick we use very often in coding is to “tack on” values to the end of a vector or list as we iterate through something. However, this is actually very slow in R: each time the object grows, R has to copy the whole thing to a new spot in memory. If you know how long your result is going to end up, pre-allocate a vector of that length instead.

do_stuff_allocate <- function(reps) {
  results <- vector(length = reps)
  for (i in 1:reps) {
    results[i] <- i
  }
  return(results)
}

do_stuff_tackon <- function(reps) {
  results <- c()
  for (i in 1:reps) {
    results <- c(results, i)
  }
  return(results)
}

test <- microbenchmark(
  do_stuff_allocate(1000),
  do_stuff_tackon(1000)
)

test
## Unit: microseconds
##                     expr  min   lq mean median   uq   max neval
##  do_stuff_allocate(1000)   85   86  190     89   93 10115   100
##    do_stuff_tackon(1000) 1688 1709 2225   1751 2628  8043   100

## Tip 3: Use faster existing functions

Because R has so many packages, there are often many functions that perform the same task. Not all of these are created equal!

### matrix operations

A few built-in functions you might find useful: rowSums() and colSums(), rowMeans() and colMeans().
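These are compiled replacements for the equivalent apply() calls. A quick sketch of the comparison:

```r
library(microbenchmark)

m <- matrix(rnorm(100000), nrow = 100, ncol = 1000)

# Both compute the same row means...
all.equal(rowMeans(m), apply(m, 1, mean))

# ...but the specialized function skips the per-row R function calls
microbenchmark(
  rowMeans(m),
  apply(m, 1, mean),
  times = 50
)
```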

Check-In 2: Cross products

For the two approaches below, which is faster?

You may want to start with a smaller matrix, then gradually increase the size until you see a noticeable time difference on your computer.

my_big_matrix <- matrix(rnorm(100000), nrow = 100, ncol = 1000)

## a
res <- t(my_big_matrix) %*% my_big_matrix

## b
res <- crossprod(my_big_matrix)

### data.table

For speeding up work with data frames, no package is better than data.table.

Required Video: 5 minute intro to data.table
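To give a flavor before the video, here is a minimal sketch of data.table’s `DT[i, j, by]` syntax, using the congress_age data from earlier (assuming the data.table package is installed):

```r
library(data.table)
library(fivethirtyeight)

data(congress_age)
congress_dt <- as.data.table(congress_age)

# Filter (i), compute (j), and group (by) in a single bracket call:
# mean age of Senators, by congress
congress_dt[chamber == "senate", .(mean_age = mean(age)), by = congress]
```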