(Since Week 1 is a review of Stat 331 material, you should be able to skip some of the required readings and viewings. It is your responsibility to decide which areas you need to review before diving into Stat 431.)
You should feel comfortable with:
Reading data into R from a URL.
Downloading data locally and reading it into R.
Dealing with .csv files, .txt files with a variety of delimiters, and Excel files.
Handling different variable types, especially strings and factors, and adjusting them if needed.
Using readLines() and/or read_lines() to load a file line by line.
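As a quick refresher, these import tools can be sketched as follows (a self-contained toy example using a temporary file, not a course dataset):

```r
library(readr)

# Write a tiny .csv to a temporary file so the example is self-contained
path <- tempfile(fileext = ".csv")
write_lines(c("species,petal_len", "setosa,1.4", "virginica,5.5"), path)

# read_csv() handles .csv files; read_delim(..., delim = ...) covers .txt
# files with other delimiters; readxl::read_excel() covers Excel files
dat <- read_csv(path, show_col_types = FALSE)

# Strings are read in as character; convert to factor if needed
dat$species <- factor(dat$species)

# readLines() / read_lines() load the raw file line by line instead
raw <- read_lines(path)
raw[1]   # "species,petal_len"
```

The same read_csv() call works with a URL in place of a local path.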
You should feel comfortable with:
Using the pipe operator (%>%).
Describing the overall structure and contents of a data frame or tibble.
Finding basic summary statistics for a data frame.
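For instance (a minimal sketch using the built-in iris data):

```r
library(dplyr)

# glimpse() (or base str()) describes the structure and contents
iris %>% glimpse()

# summary() gives basic summary statistics for every column
iris %>% summary()

# Quick checks of size and contents
iris %>% nrow()
iris %>% names()
```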
You should feel comfortable:
Using the five main dplyr verbs:
filter()
arrange()
select()
mutate()
summarize()
Using group_by() to perform groupwise operations.
Using at least a few other dplyr verbs for more nuanced tasks.
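Putting the verbs together (a hypothetical example on the built-in mtcars data, not one of the course exercises):

```r
library(dplyr)

result <- mtcars %>%
  filter(cyl != 8) %>%             # keep only 4- and 6-cylinder cars
  mutate(kml = mpg * 0.4251) %>%   # new column: miles/gallon -> km/liter
  select(cyl, kml) %>%             # keep only the columns we need
  group_by(cyl) %>%                # groupwise operations...
  summarize(avg_kml = mean(kml)) %>%  # ...one summary row per group
  arrange(desc(avg_kml))           # sort groups by fuel efficiency

result
```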
To get the full intended practice in this Check-In, you should try to answer these questions WITHOUT actually running the code.
Recall the world-famous iris dataset:
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
Question 1: Suppose we would like to study how the length-to-width ratio of petals differs across the species. Rearrange the following steps in the pipeline into an order that accomplishes this goal.
# a
arrange(Avg.Petal.Ratio)
# b
group_by(Species)
# c
iris
# d
summarize(
Avg.Petal.Ratio = median(Petal.Ratio)
)
# e
mutate(
Petal.Ratio = Petal.Length/Petal.Width
)
Question 2: Consider the base R code below. For each of the following dplyr pipelines, indicate whether it accomplishes the same task, produces an error, or returns a different result.
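Assuming the task is computing the mean petal length of the setosa flowers (the base R snippet itself is an assumption here), the base R version would be something like:

```r
# Base R (assumed task): mean petal length among setosa flowers
mean(iris$Petal.Length[iris$Species == "setosa"])
#> [1] 1.462
```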
# a
iris %>%
filter("Petal.Length") %>%
pull("setosa") %>%
mean()
# b
iris %>%
filter(Species == "setosa") %>%
select(Petal.Length) %>%
summarize(mean(Petal.Length))
# c
iris %>%
pull(Petal.Length) %>%
filter(Species == "setosa") %>%
mean()
# d
iris %>%
filter(Species == "setosa") %>%
select(Petal.Length) %>%
mean()
# e
iris %>%
filter(Species == "setosa") %>%
pull(Petal.Length) %>%
mean()
# f
iris %>%
select(Species == "setosa") %>%
filter(Petal.Length) %>%
summarize(mean(Petal.Length))
You should feel comfortable with:
Understanding what it means for data to be “tidy”
Using the *_join() family of functions (e.g., left_join(), inner_join()) to combine data.
Using bind_rows() and bind_cols(), or rbind() and cbind(), to combine data.
Using pivot_longer() and pivot_wider() to transform data.
Finding basic summary statistics for a data frame.
(Note: This video uses spread() and gather(). These functions have since been replaced by pivot_longer() and pivot_wider().)
Consider the following dataset, which contains information about arrests for violent crimes in each state:
## Murder Assault UrbanPop
## Alabama 13.2 236 58
## Alaska 10.0 263 48
## Arizona 8.1 294 80
## Arkansas 8.8 190 50
## California 9.0 276 91
## Colorado 7.9 204 78
Question 1: Consider the following code. What does it do, and why might it be an important step before reshaping the data?
Question 2: Fill in the blanks for the code that will produce the following:
us_arrests %>%
  pivot_____(cols = _____,
             _____ = "Crime",
             _____ = "Rate")
## # A tibble: 100 x 3
## UrbanPop Crime Rate
## * <int> <chr> <dbl>
## 1 58 Murder 13.2
## 2 58 Assault 236
## 3 48 Murder 10
## 4 48 Assault 263
## 5 80 Murder 8.1
## 6 80 Assault 294
## 7 50 Murder 8.8
## 8 50 Assault 190
## 9 91 Murder 9
## 10 91 Assault 276
## # ... with 90 more rows