You might be wondering to yourself - why bother with matrices, when we have so many nice tools in R for data frames?
It turns out that the tradeoff for the niceness of R's data frames is in speed: it takes longer to do basic calculations when you have variables of many different types.
In this module, you'll learn a bit about how to make your code faster in general.
Fortunately, we stand on the shoulders of giants. Many skilled computer scientists have put a lot of time and effort into writing R functions that run as fast as possible. For you, most of the work to speed up code is simply finding the right packages to rely on.
However, there are a few baseline principles that can get you a long way.
If your code takes a long time to run, the reason is often one of these: you are making R repeat the same work unnecessarily, you are doing an expensive operation (like a big matrix multiplication) in an inefficient order, you are iterating or growing objects piece by piece instead of vectorizing and pre-allocating, or you are using a slower function when a faster one exists.
Even without deep knowledge of computer algorithms and their inner workings, you can often come up with clever ways to avoid these pitfalls.
First, though: as you start thinking about writing faster code, you'll need a way to check whether your improvements actually sped up the code.
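One simple tool for this is the tictoc package: tic() starts a timer and toc() prints the elapsed time. A minimal sketch (the expression being timed here is just a placeholder):

library(tictoc)

tic()                 # start a timer
x <- rnorm(1e7)       # placeholder: any code you want to time goes here
result <- mean(x)
toc()                 # stops the timer and prints something like "0.4 sec elapsed"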
Consider the following dataset:
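This is the congress_age data, which (as far as I know) ships with the fivethirtyeight package; peeking at its first rows and its dimensions gives the output shown below:

library(fivethirtyeight)   # assumed source of the congress_age data frame
library(tidyverse)

head(congress_age)         # first six rows
dim(congress_age)          # number of rows and columns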
## # A tibble: 6 x 13
##   congress chamber bioguide firstname middlename lastname  suffix birthday
##      <int> <chr>   <chr>    <chr>     <chr>      <chr>     <chr>  <date>
## 1       80 house   M000112  Joseph    Jefferson  Mansfie…  <NA>   1861-02-09
## 2       80 house   D000448  Robert    Lee        Doughton  <NA>   1863-11-07
## 3       80 house   S000001  Adolph    Joachim    Sabath    <NA>   1866-04-04
## 4       80 house   E000023  Charles   Aubrey     Eaton     <NA>   1868-03-29
## 5       80 house   L000296  William   <NA>       Lewis     <NA>   1868-09-22
## 6       80 house   G000017  James     A.         Gallagh…  <NA>   1869-01-16
## # … with 5 more variables: state <chr>, party <chr>, incumbent <lgl>,
## #   termstart <date>, age <dbl>
## [1] 18635    13
Suppose we want to do some exploratory analysis:
tic()

congress_age %>%
  filter(chamber == "senate") %>%
  summarize(mean(age))

congress_age %>%
  filter(chamber == "senate") %>%
  ggplot() +
  geom_histogram(aes(x = age))

toc()

## 0.53 sec elapsed
These are both reasonable things to do, and they can't be done in the same pipeline. But wait - we just made R do the work of subsetting to only Senators separately each time!
Instead, how about:
tic()

senate <- congress_age %>%
  filter(chamber == "senate")

senate %>%
  summarize(mean(age))

senate %>%
  ggplot() +
  geom_histogram(aes(x = age))

toc()

## 0.36 sec elapsed
Think carefully about how matrix multiplication works. You want to avoid doing multiplications between big matrices as much as you can.
Suppose you have two matrices and a (column) vector that you would like to multiply together: \({\bf A} {\bf B}{\bf y}\).
Question 1: Which is the faster way to do this? Use tictoc or microbenchmark (or similar) to time the steps.
(You may want to make these matrices a bit smaller, if you are on a personal laptop, to avoid annoying R crashes. You can work your way up to bigger ones until you see a noticeable difference in measured speed.)
A <- matrix(rnorm(100000), nrow = 500, ncol = 200)   # 500 x 200
B <- matrix(rnorm(180000), nrow = 200, ncol = 900)   # 200 x 900
y <- matrix(rnorm(900))                              # 900 x 1 column vector
## a
res <- (A %*% B) %*% y
## b
res <- A %*% (B %*% y)
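One way to set up the timing (a sketch using microbenchmark; wrapping each line in tic()/toc() works just as well):

library(microbenchmark)

microbenchmark(
  option_a = (A %*% B) %*% y,   # multiply the two matrices first, then the vector
  option_b = A %*% (B %*% y),   # multiply B by the vector first
  times = 20                    # fewer repetitions, to be gentle on a laptop
)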
Question 2: Intuitively, why was the faster one faster? Think about the matrix calculations that are being done at each step.
For-loops have a reputation for being deathly slow in R. If you absolutely must iterate over a process, you can also rely on the apply or map function families. (As the benchmark below shows, these are typically in the same ballpark as a well-written loop; the real gains come from not iterating at all.)
do_stuff_loop <- function(reps) {
  results <- vector(length = reps)
  for (i in 1:reps) {
    results[i] <- i
  }
  return(results)
}

do_stuff_apply <- function(reps) {
  sapply(1:reps, function(x) x)
}

do_stuff_map <- function(reps) {
  purrr::map_dbl(1:reps, ~ .x)
}
test <- microbenchmark(
  do_stuff_loop(1000),
  do_stuff_apply(1000),
  do_stuff_map(1000)
)

test
## Unit: microseconds
##                  expr  min   lq mean median   uq  max neval
##   do_stuff_loop(1000)   70   78  131     82   88 4832   100
##  do_stuff_apply(1000)  701  776  839    808  836 4075   100
##    do_stuff_map(1000) 1015 1123 1254   1163 1191 4767   100
Better even than apply or map is not to iterate at all!
Writing vectorized functions is tough, but do-able in many cases. Here's an example, continuing the do_stuff functions from above:
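(A minimal sketch; do_stuff_vectorized is just an illustrative name. Since the loop above only stores the value i in position i, the whole computation collapses to a single vectorized call.)

do_stuff_vectorized <- function(reps) {
  seq_len(reps)   # builds the entire vector in one step, with no explicit iteration
}

microbenchmark(
  do_stuff_loop(1000),
  do_stuff_vectorized(1000)
)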
Better yet - rely on built-in functions in other packages to do it for you!
A trick we use very often in coding is to "tack on" values to the end of a vector or list as we iterate through something. However, this is actually very slow! If you know how long your list is going to end up, consider pre-creating that list.
do_stuff_allocate <- function(reps) {
  results <- vector(length = reps)
  for (i in 1:reps) {
    results[i] <- i
  }
  return(results)
}

do_stuff_tackon <- function(reps) {
  results <- c()
  for (i in 1:reps) {
    results <- c(results, i)
  }
  return(results)
}
test <- microbenchmark(
  do_stuff_allocate(1000),
  do_stuff_tackon(1000)
)

test
## Unit: microseconds
##                     expr  min   lq mean median   uq   max neval
##  do_stuff_allocate(1000)   85   86  190     89   93 10115   100
##    do_stuff_tackon(1000) 1688 1709 2225   1751 2628  8043   100
Because R has so many packages, there are often many functions that perform the same task. Not all of these are created equal!
A few other functions you might find useful: rowSums() and colSums(), rowMeans() and colMeans().
For the two approaches below, which is faster?
You may want to start with a smaller matrix, then gradually increase the size until you see a noticeable time difference on your computer.
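One possible version of that comparison (a sketch; the matrix size here is just a starting point, so scale it up or down for your machine):

m <- matrix(rnorm(1e6), nrow = 1000, ncol = 1000)

tic()
means_1 <- apply(m, 1, mean)   # approach 1: iterate over rows with apply()
toc()

tic()
means_2 <- rowMeans(m)         # approach 2: the specialized built-in
toc()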
For speeding up work with data frames, no package is better than data.table.
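If you have not seen data.table before, the basic pattern is DT[i, j, by]: filter in i, compute in j, group with by. A quick sketch using the built-in mtcars data (just for illustration):

library(data.table)

mtcars_dt <- as.data.table(mtcars)                        # convert a data frame to a data.table
mtcars_dt[cyl > 4, .(mean_mpg = mean(mpg)), by = gear]    # filter, summarize, and group in one call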
Here is some dplyr code. Re-write it in data.table syntax. Then perform a speed test to see which one is faster, and by how much.
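For reference, a dplyr pipeline along these lines produces the output shown below (a sketch reconstructed from that output; the original chunk may have differed slightly):

congress_age %>%
  group_by(state) %>%
  summarize(mean(age))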
## # A tibble: 50 x 2
##    state `mean(age)`
##    <chr>       <dbl>
##  1 AK           57.6
##  2 AL           53.0
##  3 AR           52.8
##  4 AZ           54.7
##  5 CA           53.6
##  6 CO           52.3
##  7 CT           49.7
##  8 DE           53.1
##  9 FL           53.4
## 10 GA           53.4
## # … with 40 more rows
"We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%. A good programmer will not be lulled into complacency by such reasoning, he will be wise to look carefully at the critical code; but only after that code has been identified." - Donald Knuth
Speed and computational efficiency are important, but there can be a tradeoff.
If you make a small speed improvement but your code becomes overly complex, confusing, or difficult to edit and test, it's not worth it!
Also, consider your time and energy as the programmer. Suppose you have a function that takes 30 minutes to run. It relies on a subfunction that takes 30 seconds to run. Should you really spend your efforts making the subfunction take only 10 seconds?
The art of finding the slow bits of your code, where you can make the most improvement, is called profiling.
For this reading, I strongly encourage you to skim. The code aspects are mostly more complicated than we expect in this class; but the general overview of the principles behind speeding up code is helpful.
In your work, you will most likely "profile" manually, by tictoc-ing bits of your code and so forth, not by these fancy methods.
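For that kind of manual profiling, note that tictoc timers can be nested and labeled, which makes it easy to see which step of a pipeline dominates. A rough sketch (the labels and steps here are placeholders):

tic("everything")

tic("filter")
senate <- congress_age %>% filter(chamber == "senate")
toc()

tic("summarize")
senate %>% group_by(state) %>% summarize(mean(age))
toc()

toc()   # closes the outer "everything" timer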
Question 1: Why is vapply faster than sapply?
Question 2: Why is sum(x)/length(x) faster than mean(x)?

(a) sum() is vectorized, while mean() is not.
(b) mean() calculates the mean twice just in case.
(c) mean() was written in an old version of R.

Question 3: What is the name of the site where you can search packages by topic?
The following are some suggestions of ways you can level up your code speed. These are outside the scope of this class, but you are welcome to explore them!