This is a public-facing version of a new course (STAT 431) at Cal Poly taught by Dr. Kelly Bodwin and Dr. Hunter Glanz.

This 9-week course is fully virtual and asynchronous. We invite you to follow along with us and our students.

Course Description

Advanced techniques for efficient use of computers to perform statistical computations and to analyze large amounts of data. Includes version control systems; tools supporting reproducibility; functional programming; randomization and bootstrapping; dynamic data visualizations; and R package development.

Disclaimer: This is not an Advanced R Programming course for developers. Rather, it is a second-year course for advanced uses of R in Data Science and Statistics.

Discord Server

To mimic a collaborative classroom environment, we have set up a Discord Server for this course.

Click here for more info, and for some demos of how to best use the server.

Prerequisites

We expect students to have experience at approximately the following levels. (Note: Week 1 consists of a review of some prerequisite material.)

    Two courses in Statistics (Basic hypothesis tests, ANOVA, multiple regression)
    One course in basic computer science (loops and conditionals, objects and functions, etc.)
    One course in statistical computing with R (Data cleaning and manipulation, visualization, strings and dates, writing functions, iteration, sampling and bootstrapping)

1. Coursework and practice

You will be given a sequence of readings, videos, and small practice activities. These are meant to replace the in-class lecture and group work experience; as such, you are strongly encouraged to work on these in groups. In particular, you may want to attack these assignments during the Work Parties on Discord.

Note: Some links in the coursework materials are marked with a special symbol. Ignore these; they are for the Cal Poly students only.

2. Lab Assignment

The majority of the true work in this class comes from Lab Assignments. You should plan to spend a large amount of time outside of class (4-8 hours each week) completing your Lab Assignment.

Lab Assignments will be posted on the Course GitHub Site. (They are also linked individually in the schedule below.) We encourage you to work together to review and give feedback on each other's work. We will post grading rubrics for each Lab Assignment, to help with your Peer Review or Self-Assessment.

3. Challenge

Finally, each week we will post a special challenge. This is your chance to push the limits of your new skills! While you won't be able to join the student competition, we encourage you to share your work via GitHub, Twitter, and/or the public Discord.

Coursework

Week | Date | Topic | Lecture and Exercises | Lab Assignment | Challenge
0 | 2020-03-30 | Basic Setup | Getting started with GitHub, Workflow | Lab 0: GitHub (No Peer Review) | None
1 | 2020-04-06 | Review | Data Wrangling, Data Visualization | Lab 1: Review, Peer Review Guide | Challenge 1 (See Submissions)
2 | 2020-04-13 | Advanced Data Visualization | Beyond basic plot types, Extending ggplot, Maps with Leaflet | Lab 2: Upgrade Plots (No Peer Review) | Challenge 2 (See Submissions)
3 | 2020-04-20 | Data From Many Sources | New Data Formats, New Data Sources | Lab 3: Use an API, Peer Review Guide | Challenge 3 (See Submissions)
4 | 2020-04-27 | Packages and Package-Based Workflow | Writing Functions, Basics of Packages, Contributing to Packages | Lab 4: Create a Package (Open-Ended Peer Review) | (None)
5 | 2020-05-04 | Matrices and Efficient Computing | Basics of Matrix Operations, Multiple Regression, Speed and Efficiency | Lab 5: Implement Regression | (See Lab 5)
6 | 2020-05-11 | Matrix Decomposition; Gradient Descent | Gradient Descent, Matrix Decomposition | Lab 6: Implement Regression 2 (Package Repo) | (See Lab 6)
7 | 2020-05-18 | Iteration to Convergence | k-means Clustering, Hierarchical Clustering | Lab 7: Implement Clustering (Package Repo) | (See Lab 7 Instructions)
8/9 | 2020-05-25 | The EM Algorithm | | |