Week 7: k-means Clustering

Stat 431



Time Estimates:
     Videos: 10 min
     Readings: 20 min
     Activities: 40 min
     Check-ins: 6


Convergence

In the last lab, you were asked to implement gradient descent - an approach that modifies your estimates bit by bit, inching ever closer to the ideal values.

Think about how you approached this iteration. If you had kept refining and refining, with ever smaller adjustments to your estimates, your code might never stop running. At some point, you had to stop and declare victory.

Perhaps you chose a very large number of iterations, and crossed your fingers that the estimates were “close enough” by then.

But perhaps you did something a bit better, and decided to stop iterating when your adjustments stopped making a big difference. Perhaps you cut off your loop when the estimate didn’t change much from one step to the next.

We call this approach convergence to small error. Your algorithm never reached a perfect endpoint, but it did reach a point where the error was small.
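As a sketch of this idea, a tolerance-based stopping rule for gradient descent might look like the following (a minimal illustration minimizing the made-up objective \(f(x) = (x - 3)^2\); the step size and tolerance here are arbitrary choices, not recommendations):

```r
# Minimize f(x) = (x - 3)^2 by gradient descent, stopping when the
# estimate changes by less than a small tolerance between steps.
f_grad <- function(x) 2 * (x - 3)   # derivative of (x - 3)^2

x <- 0            # starting guess
step_size <- 0.1
tol <- 1e-8       # "small enough" change to declare convergence

repeat {
  x_new <- x - step_size * f_grad(x)
  if (abs(x_new - x) < tol) break   # the update is tiny: stop iterating
  x <- x_new
}

x   # very close to the true minimizer, 3, but never exactly 3
```

The loop never reaches exactly 3; it stops when one more step would barely move the estimate.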

This week, you will implement an algorithm to achieve true convergence; that is, there is a point at which no further improvement is possible.


Check-In 1: Convergence to small error


If you did not implement convergence to small error in your Gradient Descent functions, do so now.

Briefly describe how you decided when your iteration should stop; that is, what constitutes “small enough” error?

Canvas Link     

k-means clustering

In statistics, clustering is the task of sorting observations into groups based on the variable values you observe.

One of the most well-known clustering methods is called k-means clustering.


Required Video: Intro to K-means Clustering



As discussed in the video, k-means requires iteration. The steps are:

  1. Choose \(k\) starting seeds.
  2. Assign observations to closest seed.
  3. Re-calculate cluster centroids; set these as seeds.
  4. Repeat 2 and 3 until the centroids don’t change.

Once the centroids do not change at a subsequent step, the algorithm has converged - no further iteration could possibly help.

Clarification: The centroid of the observations simply means the average value on each axis, among the points in the cluster.
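The four steps above can be sketched in base R (a bare-bones illustration only; in practice you should use R's built-in kmeans() function, shown later in this lab):

```r
# A minimal k-means loop: iterate steps 2 and 3 until the centroids
# stop changing exactly -- true convergence, not just small error.
simple_kmeans <- function(X, k, seed = 1) {
  set.seed(seed)
  X <- as.matrix(X)

  # Step 1: choose k distinct observations as the starting seeds
  ux <- unique(X)
  centers <- ux[sample(nrow(ux), k), , drop = FALSE]

  repeat {
    # Step 2: assign each observation to the closest seed
    # (d is an n-by-k matrix of squared Euclidean distances)
    d <- sapply(1:k, function(j) colSums((t(X) - centers[j, ])^2))
    cluster <- max.col(-d, ties.method = "first")

    # Step 3: recompute centroids -- the mean on each axis within a cluster
    # (if a cluster ever ends up empty, keep its old center)
    new_centers <- t(sapply(1:k, function(j) {
      if (any(cluster == j)) colMeans(X[cluster == j, , drop = FALSE])
      else centers[j, ]
    }))

    # Step 4: stop once the centroids no longer change at all
    if (all(new_centers == centers)) break
    centers <- new_centers
  }

  list(cluster = cluster, centers = centers)
}

res <- simple_kmeans(iris[, 3:4], k = 3)
```

Because the cluster assignments can only take finitely many values, the loop is guaranteed to reach a step where the centroids repeat exactly.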


Check-In 2: k-means


Question 1:

Use this website to practice walking through the k-means steps.

Begin by choosing “Random” centers, and “Gaussian Mixture” data. Try stepping the clustering process for various choices of \(k\). Then try different data formats, to see the strengths and weaknesses of \(k\)-means.

Take a screenshot of the most visually pleasing clustering you can create on the site.

Canvas Link     

Question 2:

Consider the ever-thrilling iris dataset. Suppose we want to sort the flowers into species groups based on petal size.

iris %>%
  ggplot(aes(x = Petal.Length, y = Petal.Width)) +
  geom_point()

There is one fairly clear cluster in the bottom left, and another, less clear, group of points in the top right.

my_kmeans <-
  iris %>%
  select(Petal.Length, Petal.Width) %>%
  kmeans(centers = 3)

my_kmeans
## K-means clustering with 3 clusters of sizes 52, 48, 50
## 
## Cluster means:
##   Petal.Length Petal.Width
## 1          4.3        1.34
## 2          5.6        2.04
## 3          1.5        0.25
## 
## Clustering vector:
##   [1] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
##  [38] 3 3 3 3 3 3 3 3 3 3 3 3 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##  [75] 1 1 1 2 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 1 2 2 2 2
## [112] 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2
## [149] 2 2
## 
## Within cluster sum of squares by cluster:
## [1] 13 16  2
##  (between_SS / total_SS =  94.3 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
## [6] "betweenss"    "size"         "iter"         "ifault"
  1. What was the x-coordinate of the final centroids for cluster 1?
  2. Which cluster was Observation 78 assigned to?
  3. What would you type to see the number of iterations before convergence?

Canvas Link     

Question 3:

In the code below, which checks the results of kmeans() for several numbers of clusters, what is the blanked-out code?

test_k <-
  tibble(
    `Number of Clusters` = 2:6,
    SS = map_dbl(2:6, ~kmeans(iris[, 3:4], centers = .x)$            )
)

test_k %>%
  ggplot() +
    aes(x = `Number of Clusters`, y = SS) +
    geom_point() +
    geom_line() +
    ylab("")

Canvas Link     

Principal Components Analysis

Although this is not our focus this week, it would be impossible to discuss k-means without also mentioning Principal Components Analysis (PCA).

Sometimes, we want to cluster data with a large number of variables. Technically, this is not a problem - we simply calculate the distance between points in \(3\) dimensions, or \(4\), or \(5000\). But it’s highly inconvenient: it’s difficult to visualize and difficult to interpret.
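To see why the dimension count is not a technical obstacle, note that Euclidean distance is the same formula in any number of dimensions: the square root of the sum of squared coordinate differences. A quick illustration with two made-up five-dimensional points:

```r
# Euclidean distance works identically in 2, 5, or 5000 dimensions.
p <- c(1, 2, 3, 4, 5)
q <- c(2, 2, 3, 4, 7)

sqrt(sum((p - q)^2))   # by hand: sqrt(1 + 0 + 0 + 0 + 4) = sqrt(5)
dist(rbind(p, q))      # same answer from R's built-in dist()
```

The formula scales to any dimension; it is only our eyes (and plots) that give out past two or three.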

In practice, before running k-means, we almost always apply dimension reduction to our data using PCA.


Required Video: Intro to PCA



For this class, you do not need to know the underlying math behind PCA; only the general interpretation, and how to use R to calculate it.


Check-In 3: PCA


Suppose we perform PCA on a dataset with 10 variables.
Which will have the highest variance?

  1. The scores from PC 1
  2. The scores from PC 10
  3. They will be exactly equal
  4. It depends on the data

Canvas Link     

Case Study: The Federalist Papers

One of the coolest examples of PCA + k-means in history is the story of the Federalist Papers.

In the years following the American Revolution, the Founding Fathers were trying to persuade the public to support ratification of the proposed U.S. Constitution. Three of them decided to write a series of essays. These authors were:

  • John Jay, the first Chief Justice of the U.S. Supreme Court

  • Alexander Hamilton, the first Treasury Secretary

  • James Madison, the fourth President of the United States

All the essays were published under a shared pseudonym, "Publius." Historical records and writings make it clear which of the men wrote 70 of the essays. However, 15 of them do not have known authors.

In 1963, two researchers named Mosteller and Wallace famously used statistical analysis of word frequencies to try to identify the authors of these unknown essays. They studied the 200 most common words in the English language, and how often each appeared in the essays. (Here, we will tackle the same question with PCA and k-means.)

The data look something like this (this dataset includes only the 70 most common words):

## # A tibble: 4 x 6
##   Author       a       do      is      or     this
##   <fct>    <dbl>    <dbl>   <dbl>   <dbl>    <dbl>
## 1 AH     0.0157  0.000628 0.00754 0.00377 0.00879 
## 2 JJ     0.0174  0        0.00958 0.00599 0.00838 
## 3 JJ     0.00899 0.000692 0.00484 0.0221  0.00415 
## 4 JJ     0.00978 0.00122  0.00611 0.0147  0.000611
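Each number in this table is the proportion of an essay's words accounted for by a given word. As a toy illustration (using a made-up sentence, not the actual essays), such proportions can be computed like this:

```r
# Word-frequency proportions for a snippet of text:
# count each word, then divide by the total word count.
text  <- "to be or not to be"
words <- strsplit(tolower(text), "\\s+")[[1]]

table(words) / length(words)   # "to" and "be" each account for 2/6 of the words
```

Applying the same idea to each essay, with one column per common word, produces a table like the one above.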

If we choose two arbitrary dimensions to plot the data on, it is unhelpful:

(Here DIS means “disputed authorship”.)

PCA

Instead, let’s perform PCA on the data.

fed_pca <- 
  fed %>%
  select(-Author) %>%
  princomp()

We’ll check out the variance by PC:

tibble(
  pcs = 1:70,
  variance = fed_pca$sdev^2
  ) %>%
  ggplot(aes(x = pcs, y = variance)) +
  geom_col()

We are only interested in using the first two principal components.

fed_pc_scores <-
  fed_pca$scores %>%
    as_tibble() %>%
    select(Comp.1, Comp.2) %>%
    mutate(
      Author = fed$Author
    )

head(fed_pc_scores)
## # A tibble: 6 x 3
##     Comp.1   Comp.2 Author
##      <dbl>    <dbl> <fct> 
## 1 -0.00968  0.00696 AH    
## 2 -0.0387  -0.0126  JJ    
## 3 -0.0388  -0.00823 JJ    
## 4 -0.0547  -0.0145  JJ    
## 5 -0.0581  -0.00565 JJ    
## 6 -0.0113  -0.00499 AH

Now, let’s see how well this approach separates the data:

This looks much more helpful! And you probably already have a good guess about the author of the disputed essays.

k-means

Now there appear to be meaningful clusters in the data! Let’s use k-means to try to identify authorship.


Check-In 4: Case Study


  1. Download the data yourself right here.

  2. Perform the PCA reduction, as above.

  3. Perform k-means clustering with 3 clusters on the first two principal components.

  4. Use the results to predict the authorship of the unknown essays.

  5. Create a table or plot/visualization showing your authorship assignment.


Recommended Video: A Hint


Canvas Link