Week 1: Review of Data Wrangling

Stat 431


(Week 1 reviews Stat 331 material, so you should be able to skip some of the required readings and videos. It is your responsibility to decide which areas you need to review before diving into Stat 431.)


Time Estimates:
     Videos: 0-45 min
     Readings: 0-90 min
     Activities: 0-45 min
     Check-ins: 3




Reading Data

You should feel comfortable with:

  • Reading data into R from a URL.

  • Downloading data locally and reading it into R.

  • Dealing with .csv files, .txt files with a variety of delimiters, and Excel files.

  • Handling different variable types, especially strings and factors, and adjusting them if needed.

  • Using readLines() and/or read_lines() to load a file line by line.
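
As a quick refresher on the skills above, here is a minimal sketch using readr. The file contents and the column names (name, score, group) are made up for illustration; the file is written to a temporary location so the example is self-contained.

```r
library(readr)

# Write a small semicolon-delimited file to a temporary location.
# (Hypothetical data, used only so this example can run on its own.)
tmp <- tempfile(fileext = ".txt")
writeLines(c("name;score;group", "Ana;90;a", "Ben;85;b"), tmp)

# read_delim() handles any single-character delimiter.
# col_types sets the variable types up front, including factors.
scores <- read_delim(
  tmp,
  delim = ";",
  col_types = cols(
    name  = col_character(),
    score = col_double(),
    group = col_factor()
  )
)

# read_lines() returns the file as a character vector, one element per line.
raw_lines <- read_lines(tmp)
```

The same readr functions also accept a URL string in place of a local path. Excel files follow a similar pattern with readxl::read_excel(), which is not shown here because it needs an .xlsx file on disk.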


Required Video: readr and readxl




Recommended Reading: R4DS Chapter 11: Data Import


Data Frames, Tibbles, Piping

You should feel comfortable with:

  • Using the pipe operator (%>%)

  • Describing the overall structure and contents of a data frame or tibble.

  • Finding basic summary statistics for a data frame.
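
One possible way to practice these three skills together, sketched with the built-in iris data. The object names iris_tbl and n_species are chosen here for illustration only.

```r
library(dplyr)  # re-exports %>%, as_tibble(), and glimpse()

# Convert the built-in data frame to a tibble.
iris_tbl <- as_tibble(iris)

# Structure: number of rows and columns, then column names, types, and first values.
iris_tbl %>% dim()
iris_tbl %>% glimpse()

# Basic summary statistics for every column.
iris_tbl %>% summary()

# Pipes chain left to right: pull one column, then count its distinct values.
n_species <- iris_tbl %>%
  pull(Species) %>%
  n_distinct()
```

Each piped call is equivalent to passing the left-hand side as the first argument, so iris_tbl %>% summary() is the same as summary(iris_tbl).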


Required Video: Data Frames and the Pipe




Recommended Reading: R4DS Chapter 10: Tibbles



Recommended Reading: R4DS Chapter 18: Pipes


Data Transformation

You should feel comfortable with:

  • Using the five main dplyr verbs:

    • filter()

    • arrange()

    • select()

    • mutate()

    • summarize()

  • Using group_by() to perform groupwise operations

  • Using at least a few other dplyr verbs for more nuanced tasks
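
The verbs above can be combined in a single pipeline. The sketch below uses the built-in mtcars data; the cutoff of 100 horsepower and the new column names (wt_lbs, mean_mpg, n_cars) are arbitrary choices made for illustration.

```r
library(dplyr)

mpg_by_cyl <- mtcars %>%
  filter(hp > 100) %>%                  # keep rows: cars with more than 100 horsepower
  select(mpg, cyl, wt) %>%              # keep columns
  mutate(wt_lbs = wt * 1000) %>%        # add a column (wt is recorded in 1000s of lbs)
  group_by(cyl) %>%                     # later summaries are computed per cylinder count
  summarize(
    mean_mpg = mean(mpg),
    n_cars   = n()
  ) %>%
  arrange(desc(mean_mpg))               # sort groups from highest to lowest mean mpg

# One example of a more specialized verb: count() is shorthand for
# group_by() followed by summarize(n = n()).
cyl_counts <- mtcars %>% count(cyl)
```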


Required Reading: R4DS Chapter 5: Data Transformation



Required Video: dplyr (short)




Recommended Video: dplyr (long)



Check-In 1: dplyr and piping


To get the full intended practice in this Check-In, you should try to answer these questions WITHOUT actually running the code.

Recall the world-famous iris dataset:

##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

Question 1: Suppose we would like to study how the length-to-width ratio of petals differs across the species. Rearrange the following steps in the pipeline into an order that accomplishes this goal.

Question 2: Consider the base R code below.

For each of the following dplyr pipelines, indicate whether it:

  1. Returns the exact same thing as the base R code;
  2. Returns the correct information, but the wrong object type;
  3. Returns incorrect information; or
  4. Returns an error.

Canvas Link     

Tidy Data and Combining Datasets

You should feel comfortable with:

  • Understanding what it means for data to be “tidy”

  • Using the join_*() family of functions to combine data.

  • Using bind_rows() and bind_cols(), or cbind() and rbind(), to combine data.

  • Using pivot_longer() and pivot_wider() to transform data.

  • Finding basic summary statistics for a data frame.
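
A small sketch of joining, binding, and widening, using toy data invented for this example (the students, exams, and majors are all hypothetical).

```r
library(dplyr)
library(tidyr)

# Toy data: one row per student per exam, so each row is one observation.
scores <- tibble(
  student = c("Ana", "Ana", "Ben", "Ben"),
  exam    = c("midterm", "final", "midterm", "final"),
  score   = c(88, 92, 75, 81)
)
majors <- tibble(
  student = c("Ana", "Ben"),
  major   = c("Stat", "Math")
)

# left_join() keeps every row of scores and adds the matching columns from majors.
joined <- left_join(scores, majors, by = "student")

# bind_rows() stacks data frames that share column names.
more_scores <- tibble(student = "Cal", exam = "midterm", score = 90)
stacked <- bind_rows(scores, more_scores)

# pivot_wider() moves the values of exam into their own columns,
# giving one row per student.
wide <- scores %>%
  pivot_wider(names_from = exam, values_from = score)
```

pivot_longer() performs the reverse reshaping, collapsing several columns into a pair of name and value columns.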


Required Reading: R4DS Chapter 12: Tidy Data



Required Reading: Visual illustrations of join functions



Recommended Video: Tidy Data


(Note: This video uses gather() and spread(). These functions have since been superseded by pivot_longer() and pivot_wider(), respectively.)


Recommended Video: Binding and Joining



Check-In 2: Pivoting


Consider the following dataset, which contains information about arrests for violent crimes in each state:

##            Murder Assault UrbanPop
## Alabama      13.2     236       58
## Alaska       10.0     263       48
## Arizona       8.1     294       80
## Arkansas      8.8     190       50
## California    9.0     276       91
## Colorado      7.9     204       78

Question 1: Consider the following code. What does it do, and why might it be an important step before reshaping the data?

Question 2: Fill in the blanks for the code that will produce the following:

us_arrests %>%
  pivot_      (cols =              ,
                        = "Crime",
                         = "Rate")

## # A tibble: 100 x 3
##    UrbanPop Crime    Rate
##  *    <int> <chr>   <dbl>
##  1       58 Murder   13.2
##  2       58 Assault 236  
##  3       48 Murder   10  
##  4       48 Assault 263  
##  5       80 Murder    8.1
##  6       80 Assault 294  
##  7       50 Murder    8.8
##  8       50 Assault 190  
##  9       91 Murder    9  
## 10       91 Assault 276  
## # ... with 90 more rows

Canvas Link