Week 1: Review of Data Wrangling

Stat 431

(Since Week 1 consists of review of Stat 331 material, you should be able to skip some of the required readings and viewings. It is your responsibility to decide which areas you need to review before diving into Stat 431.)

Time Estimates:

Videos: 0-45 min

Readings: 0-90 min

Activities: 0-45 min

Check-ins: 3

Extra Resources:

RStudio Cheatsheets are a handy quick reference to keep nearby; they remind you which functions exist and what they do.
The R for Data Science Textbook is free online.
RStudio’s Primers are interactive lessons on the basics of R; these would be a great way to refresh your knowledge.

Reading Data

You should feel comfortable with:

Reading data into R from a url.
Downloading data locally and reading it into R.
Dealing with .csv files, .txt files with a variety of delimiters, and Excel files.
Handling different variable types, especially strings and factors, and adjusting them if needed.
Using readLines() and/or read_lines() to load a file line by line.
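
As a quick self-check on the skills listed above, here is a minimal sketch that assumes the readr package is installed. The file contents and the object names (scores, scores_txt, raw_lines) are made up for illustration. The same read_csv() call also accepts a url in place of a local path, and readxl::read_excel() plays the same role for Excel files (not shown, since it needs an actual .xlsx file).

```r
library(readr)

# Hypothetical toy file, written to a temporary path so the sketch is self-contained.
csv_path <- tempfile(fileext = ".csv")
writeLines(c("name,group,score", "Ana,a,3.5", "Ben,b,4.0", "Cy,a,2.5"), csv_path)

# Comma-delimited file; col_types = cols() lets readr guess the column types quietly.
scores <- read_csv(csv_path, col_types = cols())

# The same toy data with a different delimiter, read with read_delim().
txt_path <- tempfile(fileext = ".txt")
writeLines(c("name|group|score", "Ana|a|3.5", "Ben|b|4.0", "Cy|a|2.5"), txt_path)
scores_txt <- read_delim(txt_path, delim = "|", col_types = cols())

# readr reads text columns as character; convert to a factor where one is needed.
scores$group <- factor(scores$group)

# Load the raw file line by line, once with readr and once with base R.
raw_lines <- read_lines(csv_path)
raw_base  <- readLines(csv_path)
```

If any of these calls look unfamiliar, that is a sign the video and reading below are worth your time.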

Required Video: readr and readxl

Recommended Reading: R4DS Chapter 11: Data Import

Data Frames, Tibbles, Piping

You should feel comfortable with:

Using the pipe operator (%>%).
Describing the overall structure and contents of a data frame or tibble.
Finding basic summary statistics for a data frame.
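
The short sketch below illustrates these three skills. It assumes the dplyr package is installed and uses the built-in mtcars dataset; the object names (cars, piped, nested) are chosen only for this example.

```r
library(dplyr)

# mtcars ships with R; as_tibble() converts the data frame to a tibble.
cars <- as_tibble(mtcars)

# Structure and contents.
dim(cars)       # number of rows and columns
glimpse(cars)   # one line per column: name, type, first few values
head(cars, 3)   # first three rows

# Basic summary statistics for every column.
summary(cars)

# The pipe passes its left-hand side as the first argument of the
# right-hand side, so these two expressions compute the same value.
piped  <- cars %>% pull(mpg) %>% mean()
nested <- mean(pull(cars, mpg))
```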

Required Video: Data Frames and the Pipe

Recommended Reading: R4DS Chapter 10: Tibbles

Recommended Reading: R4DS Chapter 18: Pipes

Data Transformation

You should feel comfortable with:

Using the five main dplyr verbs:
- filter()
- arrange()
- select()
- mutate()
- summarize()
Using group_by() to perform groupwise operations.
Using at least a few other dplyr verbs for more nuanced tasks.
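
For reference, here is one possible pipeline that touches each of the verbs above. It is a sketch on the built-in mtcars dataset, assuming dplyr is installed; the new column and object names (wt_lbs, heavy_by_cyl, and so on) are invented for this example.

```r
library(dplyr)

heavy_by_cyl <- mtcars %>%
  filter(wt > 3) %>%               # keep rows: cars heavier than 3000 lbs
  select(mpg, cyl, wt) %>%         # keep only these columns
  mutate(wt_lbs = wt * 1000) %>%   # add a column (wt is recorded in 1000s of lbs)
  group_by(cyl) %>%                # later steps run once per cylinder count
  summarize(
    n_cars      = n(),
    mean_mpg    = mean(mpg),
    mean_wt_lbs = mean(wt_lbs)
  ) %>%                            # collapse each group to one row
  arrange(desc(mean_mpg))          # sort rows, highest average mpg first

# Two other verbs that are often handy.
cyl_counts <- mtcars %>% count(cyl)            # rows per cylinder count
gear_carb  <- mtcars %>% distinct(gear, carb)  # unique gear/carb combinations
```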

Required Reading: R4DS Chapter 5: Data Transformation

Required Video: dplyr (short)

Recommended Video: dplyr (long)

Check-In 1: dplyr and piping

To get the full intended practice in this Check-In, you should try to answer these questions WITHOUT actually running the code.

Recall the world-famous iris dataset:

head(iris)

##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

Question 1: Suppose we would like to study how the length-to-width ratio of petals differs across the species. Rearrange the following steps in the pipeline into an order that accomplishes this goal.

# a
arrange(Avg.Petal.Ratio)


# b
group_by(Species)

# c
iris 
  

# d
summarize(
  Avg.Petal.Ratio = median(Petal.Ratio)
)
  
# e
mutate(
  Petal.Ratio = Petal.Length/Petal.Width
)

Question 2: Consider the base R code below.

mean(iris[iris$Species == "setosa", "Petal.Length"])

For each of the following dplyr pipelines, indicate whether it:

Returns the exact same thing as the base R code;
Returns the correct information, but the wrong object type;
Returns incorrect information; or
Returns an error

# a
iris %>%
  filter("Petal.Length") %>%
  pull("setosa") %>%
  mean()


# b
iris %>%
  filter(Species == "setosa") %>%
  select(Petal.Length) %>%
  summarize(mean(Petal.Length))


# c
iris %>%
  pull(Petal.Length) %>%
  filter(Species == "setosa") %>%
  mean()

# d
iris %>%
  filter(Species == "setosa") %>%
  select(Petal.Length) %>%
  mean()

# e
iris %>%
  filter(Species == "setosa") %>%
  pull(Petal.Length) %>%
  mean()

# f
iris %>%
  select(Species == "setosa") %>%
  filter(Petal.Length) %>%
  summarize(mean(Petal.Length))

Canvas Link

Tidy Data and Combining Datasets

You should feel comfortable with:

Understanding what it means for data to be “tidy”.
Using the join_*() family of functions to combine data.
Using bind_rows() and bind_cols(), or cbind() and rbind(), to combine data.
Using pivot_longer() and pivot_wider() to transform data.
Finding basic summary statistics for a data frame.
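
As a reminder of how joining and binding differ, here is a minimal sketch assuming dplyr is installed. The two small tables (students, grades) and every value in them are hypothetical, made up only for illustration.

```r
library(dplyr)

# Hypothetical toy tables.
students <- tibble(id = c(1, 2, 3), name = c("Ana", "Ben", "Cy"))
grades   <- tibble(id = c(1, 2, 4), grade = c("A", "B", "C"))

# Joins match rows by a key column.
left  <- left_join(students, grades, by = "id")   # every student; grade is NA for id 3
inner <- inner_join(students, grades, by = "id")  # only ids found in both tables
anti  <- anti_join(students, grades, by = "id")   # students with no matching grade

# Binds simply stack tables, with no key involved.
new_students <- tibble(id = 5, name = "Di")
stacked <- bind_rows(students, new_students)              # match columns by name
widened <- bind_cols(students, tibble(year = c(1, 2, 1))) # match rows by position
```

Because bind_cols() pairs rows purely by position, a join is usually the safer choice whenever a shared key column exists.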

Required Reading: R4DS Chapter 12: Tidy Data

Required Reading: Visual illustrations of join functions

Recommended Video: Tidy Data

(Note: This video uses gather() and spread(). These functions have since been superseded by pivot_longer() and pivot_wider(), respectively.)

Recommended Video: Binding and Joining

Check-In 2: Pivoting

Consider the following dataset, which contains information about arrests for violent crimes in each state:

head(us_arrests)

##            Murder Assault UrbanPop
## Alabama      13.2     236       58
## Alaska       10.0     263       48
## Arizona       8.1     294       80
## Arkansas      8.8     190       50
## California    9.0     276       91
## Colorado      7.9     204       78

Question 1: Consider the following code. What does it do, and why might it be an important step before reshaping the data?

us_arrests <- us_arrests %>%
  rownames_to_column()

Question 2: Fill in the blanks for the code that will produce the following:

us_arrests %>%
  pivot_      (cols =               ,
                        = "Crime",
                         = "Rate")


## # A tibble: 100 x 3
##    UrbanPop Crime    Rate
##  *    <int> <chr>   <dbl>
##  1       58 Murder   13.2
##  2       58 Assault 236  
##  3       48 Murder   10  
##  4       48 Assault 263  
##  5       80 Murder    8.1
##  6       80 Assault 294  
##  7       50 Murder    8.8
##  8       50 Assault 190  
##  9       91 Murder    9  
## 10       91 Assault 276  
## # ... with 90 more rows

Canvas Link