In your prerequisite statistics coursework you undoubtedly spent some time on the topic of regression. The idea is that we have a response variable and at least one explanatory variable that is suspected to be related to the response in some way we’d like to model.
In simple linear regression there is just one explanatory variable and the model we’re estimating has an assumed straight-line form. In general, multiple regression extends this idea to accommodate many different complex relationships between a response and any number of explanatory variables, all within a single model.
It’s likely that you used software like R, JMP, or Minitab to estimate the regression model after specifying which variable was your response and which variables were your explanatory variables (predictors). There are many nice functions for performing regression, but today you’ll learn how to do it yourself!
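For instance, R’s built-in lm() function estimates a linear model from a formula. Here is a minimal sketch on made-up data (the variable names and numbers are just for illustration):

```r
# Simulate a small made-up dataset: one response (y) and two predictors (x1, x2)
set.seed(42)
n  <- 50
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 3 + 2 * x1 - 1.5 * x2 + rnorm(n)

# Fit the model: the formula names the response (left of ~) and the predictors
fit <- lm(y ~ x1 + x2)
summary(fit)  # estimated coefficients, standard errors, R-squared, etc.
```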
We’re not going to go through the mathematical derivations, but now that you’ve gained some knowledge of matrix operations in R, it should be relatively straightforward to implement regression in R. The video, however, does go into the mathematics of estimating the regression coefficients. You are not responsible for knowing the calculus and linear algebra used in the video.
Notice that this video didn’t actually involve any R! However, it makes the computations required to estimate regression coefficients extremely clear.
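To preview where this is going, here is a minimal sketch (not the video’s code; the simulated data are just for illustration) of computing the coefficient estimates directly with matrix operations in R, using the familiar formula b = (X'X)^(-1) X'y:

```r
# Simulated data: one response (y) and two predictors (x1, x2)
set.seed(42)
n  <- 50
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 3 + 2 * x1 - 1.5 * x2 + rnorm(n)

# Design matrix X: a column of 1s for the intercept plus the predictors
X <- cbind(1, x1, x2)

# Coefficient estimates via b = (X'X)^(-1) X'y
# t() transposes, %*% is matrix multiplication, and solve(A, b) computes A^(-1) b
b_hat <- solve(t(X) %*% X, t(X) %*% y)
b_hat

# These should match what lm() reports
coef(lm(y ~ x1 + x2))
```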
There are all sorts of things that can go wrong when performing regression! Unfortunately, it’s a bit beyond the scope of this class to go into them all. So, we’ll encourage you to explore more on your own or take more statistics courses like Cal Poly’s STAT 419 or STAT 434.
One of the things that can be challenging when performing traditional linear regression is having a very large number of explanatory variables. It’s even possible to have more explanatory variables than observations! However, if this is the case, then parts of the matrix algebra needed to estimate the coefficients will not work (the matrix X'X that we need to invert is no longer invertible). One method for dealing with this challenge is known as regularization or penalized regression. There are a couple of specific but popular models that fall under this umbrella: ridge regression and the lasso.
You’re going to learn a bit more about ridge regression, including how to implement it! The basic idea is that we don’t want every single explanatory variable in our dataset to contribute heavily to the fitted model. To do this, ridge regression penalizes coefficients for being large in magnitude (i.e., far from zero), shrinking them toward zero, while still trying to minimize the error (sum of squared residuals) just like traditional regression does.
This may sound a bit complicated, but it actually simplifies somewhat nicely. Check out the reading below for more details!
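As a preview of that simplification, the ridge estimates can be written as b = (X'X + lambda * I)^(-1) X'y. Here is a minimal sketch on made-up data, ignoring details like the intercept and standardizing the predictors; the penalty value lambda below is just an illustrative choice:

```r
# Simulated data with more predictors than observations (p > n)
set.seed(42)
n <- 20
p <- 30
X <- matrix(rnorm(n * p), nrow = n, ncol = p)
y <- X[, 1] - 2 * X[, 2] + rnorm(n)

# Ridge estimates: b = (X'X + lambda * I)^(-1) X'y
# lambda is an arbitrary illustrative value here; in practice it is tuned,
# e.g. by cross-validation
lambda  <- 1
b_ridge <- solve(t(X) %*% X + lambda * diag(p), t(X) %*% y)
head(b_ridge)

# Plain least squares would fail here: t(X) %*% X is singular when p > n,
# but adding lambda * diag(p) makes it invertible
```

In practice you’d typically let a package such as glmnet choose lambda for you, but the matrix version above shows why the penalty fixes the more-variables-than-observations problem.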
Notes: