In the last lab, you were asked to implement gradient descent - an approach that modifies your estimates bit by bit, inching ever closer to the ideal values.
Think about how you approached this iteration. If you had kept refining and refining, with ever smaller adjustments to your estimates, your code might never have stopped running. At some point, you had to stop and declare victory.
Perhaps you chose a very large number of iterations, and crossed your fingers that the estimates were “close enough” by then.
But perhaps you did something a bit better, and decided to stop iterating when your adjustments stopped making a big difference. Perhaps you cut off your loop when the estimate didn’t change much from one step to the next.
We call this approach convergence to small error. Your algorithm never reached a perfect endpoint, but it did reach a point where the error was small.
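For example, a stopping rule for a one-variable gradient descent might look like this rough sketch in R (the learning rate, tolerance, and example function are placeholders, not the ones from your lab):

# Sketch of "convergence to small error": stop once the estimate barely moves.
gradient_descent <- function(grad, start, step_size = 0.01,
                             tol = 1e-8, max_iter = 10000) {
  est <- start
  for (i in 1:max_iter) {
    new_est <- est - step_size * grad(est)
    # If the change from one step to the next is tiny, call it converged.
    if (abs(new_est - est) < tol) {
      return(new_est)
    }
    est <- new_est
  }
  new_est
}

# Example: minimize f(x) = (x - 3)^2, whose gradient is 2 * (x - 3).
gradient_descent(function(x) 2 * (x - 3), start = 0)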
This week, you will implement an algorithm to achieve true convergence; that is, there is a point at which no further improvement is possible.
If you did not implement convergence to small error in your Gradient Descent functions, do so now.
Briefly describe how you decided when your iteration should stop; that is, what constitutes “small enough” error?
In statistics, clustering means using data to sort observations into groups based on the variable values you observe.
One of the most well-known clustering methods is called k-means clustering.
As discussed in the video, k-means requires iteration. The steps are:
1. Choose a number of clusters, \(k\), and pick \(k\) starting centroids.
2. Assign each observation to the cluster whose centroid it is closest to.
3. Recompute each centroid as the center of the observations currently assigned to that cluster.
4. Repeat steps 2 and 3.
Once the centroids do not change at a subsequent step, the algorithm has converged - no further iteration could possibly help.
Clarification: The centroid of the observations simply means the average value on each axis, among the points in the cluster.
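To make the iteration concrete, here is a rough sketch of one assignment-and-update pass in R; the toy points and starting centroids are made up purely for illustration and are not part of the lab:

library(dplyr)

# Toy data and two made-up starting centroids.
pts <- tibble(x = c(1, 2, 1.5, 8, 9, 8.5), y = c(1, 1.5, 2, 8, 9, 8.5))
centroids <- tibble(x = c(0, 10), y = c(0, 10))

# Step 2: assign each point to the centroid it is closest to.
pts <- pts %>%
  rowwise() %>%
  mutate(cluster = which.min((centroids$x - x)^2 + (centroids$y - y)^2)) %>%
  ungroup()

# Step 3: recompute each centroid as the average of its points on each axis.
centroids <- pts %>%
  group_by(cluster) %>%
  summarize(x = mean(x), y = mean(y))

# Repeating these two steps until the centroids stop moving is all k-means does.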
Question 1:
Use this website to practice walking through the k-means steps.
Begin by choosing “Random” centers, and “Gaussian Mixture” data. Try stepping the clustering process for various choices of \(k\). Then try different data formats, to see the strengths and weaknesses of \(k\)-means.
Take a screenshot of the most visually pleasing clustering you can create on the site.
Question 2:
Consider the ever-thrilling iris dataset. Suppose we want to classify the flowers into species based on petal size.
There is one fairly clear cluster in the bottom left, and another, less distinct, group of points in the top right.
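The output below comes from a call along these lines (assuming, as in the Question 3 code, that the petal measurements are columns 3 and 4 of iris; the object name iris_km is just a suggestion):

iris_km <- kmeans(iris[, 3:4], centers = 3)
iris_km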
## K-means clustering with 3 clusters of sizes 52, 48, 50
##
## Cluster means:
## Petal.Length Petal.Width
## 1 4.3 1.34
## 2 5.6 2.04
## 3 1.5 0.25
##
## Clustering vector:
## [1] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [38] 3 3 3 3 3 3 3 3 3 3 3 3 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [75] 1 1 1 2 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 1 2 2 2 2
## [112] 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2
## [149] 2 2
##
## Within cluster sum of squares by cluster:
## [1] 13 16 2
## (between_SS / total_SS = 94.3 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
Question 3:
In the code below, which checks the results of kmeans() for many numbers of clusters, what is the blanked-out code?
test_k <-
  tibble(
    `Number of Clusters` = 2:6,
    SS = map_dbl(2:6, ~kmeans(iris[, 3:4], centers = .x)$ )
  )

test_k %>%
  ggplot() +
  aes(x = `Number of Clusters`, y = SS) +
  geom_point() +
  geom_line() +
  ylab("")
Although this is not our focus this week, it would be impossible to discuss k-means without also mentioning Principal Components Analysis (PCA).
Sometimes, we want to cluster data with a large number of variables. Technically, this is not a problem - we simply calculate the distance between points in \(3\) dimensions, or \(4\), or \(5000\). But it’s highly inconvenient: it’s difficult to visualize and difficult to interpret.
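For example, R's built-in dist() computes pairwise Euclidean distances no matter how many columns are involved:

# Pairwise Euclidean distances between 5 points in 100 dimensions.
set.seed(1)
many_dims <- matrix(rnorm(5 * 100), nrow = 5)
dist(many_dims)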
In reality, before k-means is used, we almost always apply dimension reduction to our data using PCA.
For this class, you do not need to know the underlying math behind PCA; only the general interpretation, and how to use R to calculate it.
Suppose we perform PCA on a dataset with 10 variables.
Which principal component will have the highest variance?
One of the coolest examples of PCA + k-means in history is the story of the Federalist Papers.
In the years after the American Revolution, the Founding Fathers were trying to persuade the public to support the new Constitution. Three of them decided to write a series of essays. These authors were:
John Jay, the first Chief Justice of the U.S. Supreme Court
Alexander Hamilton, the first Treasury Secretary
James Madison, the 4th President of the United States
All of the essays were published under a single shared pseudonym rather than under the authors' own names. Historical records and writings make it clear which of the men wrote each of 70 of the essays. However, 15 of them do not have known authors.
In 1963, two researchers named Mosteller and Wallace famously used PCA and k-means to try to identify the authors of these unknown essays. They studied the top 200 most common words in the English language, and how often each appeared in the essays.
It looks something like this (this dataset only includes the 70 most common words):
## # A tibble: 4 x 6
## Author a do is or this
## <fct> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 AH 0.0157 0.000628 0.00754 0.00377 0.00879
## 2 JJ 0.0174 0 0.00958 0.00599 0.00838
## 3 JJ 0.00899 0.000692 0.00484 0.0221 0.00415
## 4 JJ 0.00978 0.00122 0.00611 0.0147 0.000611
If we choose two arbitrary dimensions to plot the data on, it is unhelpful:
(Here DIS means “disputed authorship”.)
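For example, plotting the relative frequencies of two arbitrary words against each other might look like this (a sketch, assuming the data is stored in an object called fed, as in the code further down, and that the tidyverse is loaded):

fed %>%
  ggplot() +
  aes(x = a, y = do, color = Author) +
  geom_point()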
Instead, let’s perform PCA on the data.
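The later code refers to an object called fed_pca with Comp.1 and Comp.2 scores, which points to princomp(); a sketch of the call (the exact column selection is an assumption):

# Run PCA on the numeric word-frequency columns only.
fed_pca <- fed %>%
  select(-Author) %>%
  princomp()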
We’ll check out the variance by PC:
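One way to do that (a sketch):

summary(fed_pca)    # proportion of variance explained by each component
screeplot(fed_pca)  # the same information as a plot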
We are only interested in using the first two principal components.
fed_pc_scores <-
  fed_pca$scores %>%
  as_tibble() %>%
  select(Comp.1, Comp.2) %>%
  mutate(
    Author = fed$Author
  )
head(fed_pc_scores)
## # A tibble: 6 x 3
## Comp.1 Comp.2 Author
## <dbl> <dbl> <fct>
## 1 -0.00968 0.00696 AH
## 2 -0.0387 -0.0126 JJ
## 3 -0.0388 -0.00823 JJ
## 4 -0.0547 -0.0145 JJ
## 5 -0.0581 -0.00565 JJ
## 6 -0.0113 -0.00499 AH
Now, let’s see how well this approach separates the data:
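A plot along these lines shows the first two principal component scores, colored by author (a sketch of the figure):

fed_pc_scores %>%
  ggplot() +
  aes(x = Comp.1, y = Comp.2, color = Author) +
  geom_point()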
This looks much more helpful! And you probably already have a good guess about the author of the disputed essays.
Now there appear to be meaningful clusters in the data! Let’s use k-means to try to identify authorship.
Download the data yourself right here.
Perform the PCA reduction, as above.
Perform k-means clustering with 3 clusters on the first two principal components.
Use the results to predict the authorship of the unknown essays.
Create a table or plot/visualization showing your authorship assignment.
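A minimal skeleton of the steps above might look like the following; the object name fed_km and the closing summary are only suggestions, and turning the cluster labels into authorship predictions is left to you:

# Sketch only: k-means on the first two principal component scores.
fed_km <- kmeans(fed_pc_scores %>% select(Comp.1, Comp.2), centers = 3)

# One possible starting point for a table of cluster vs. listed author.
fed_pc_scores %>%
  mutate(cluster = fed_km$cluster) %>%
  count(Author, cluster)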