Week 6: Gradient Descent

Stat 431

Time Estimates:

Videos: 15 min

Readings: 40 min

Activities: 60 min

Check-ins: 1

Gradient Descent

It’s often the case that there aren’t nice linear algebra moves to make to solve an optimization problem, e.g. computing regression coefficients that minimize the sum of the squared residuals. Recall, again, that we had the following expression for computing our multiple regression coefficients:

\(\hat{\beta} = (X'X)^{-1} X'Y\)

This expression gave us the coefficients that minimized the sum of squared errors in our multiple regression model:

\(\sum_{i=1}^n (y_i - \hat{y}_i)^2\)

Another category of approaches attempt to solve this problem in an iterative way:

Start the \(\hat{\beta}_0, \dots, \hat{\beta}_p\) coefficients with some initial values.
Compute the error (the expression above).
Change the values of the coefficients in a way that reduces the error.
Repeat steps 2 and 3 until we converge.

While this may sound simple enough, steps 3 and 4 are not trivial. One specific case of this procedure is known as gradient descent and can also be used to estimate our regression coefficients.

Once again, we’re not going to dwell too much on the mathematical derivations, but now that you’ve gained some knowledge in matrix operations in R you should try to understand the general ideas here. You are not responsible for knowing the calculus and linear algebra used in the video.

Required Video: Gradient Descent for Linear Regression

Required Reading: Gradient Descent for Linear Regression

Check-In 1: Gradient Descent for Linear Regression

Why is bad to only take large steps, and never small ones, in the gradient descent algorithm?

We may overshoot and miss the minimum we’re aiming for.
We just end up oscillating from side to side, never actually landing on the minimum.
Both of these

The example context in this reading/video is _____.

simple linear regression
multiple regression
polynomial regression

Based on your experience with regression, will our error ever reach 0?

Yes!
It’s not likely.
It could never reach 0.

How could the algorithm know when to stop?

It’s reached the maximum number of iterations (specified by the user)
The value of the error has not changed by a large amount (specified by the user)
The value of the error has gotten close enough (specified by the user) to 0
The values of the coefficients have not changed by a large amount (specified by the user)
All of these could be useful!

Canvas Link

Required Reading: Gradient Descent for Multiple Regression on Course Canvas Site (PDF)