Week 7: k-means Clustering

Stat 431



Time Estimates:
     Videos: 10 min
     Readings: 20 min
     Activities: 40 min
     Check-ins: 6


Convergence

In the last lab, you were asked to implement gradient descent - an approach that modifies your estimates bit by bit, inching ever closer to the ideal values.

Think about how you approached this iteration. If you had kept refining and refining, with ever smaller adjustments to your estimates, your code might never have stopped running. At some point, you had to stop and declare victory.

Perhaps you chose a very large number of iterations, and crossed your fingers that the estimates were “close enough” by then.

But perhaps you did something a bit better, and decided to stop iterating when your adjustments stopped making a big difference. Perhaps you cut off your loop when the estimate didn’t change much from one step to the next.

We call this approach convergence to small error. Your algorithm never reached a perfect endpoint, but it did reach a point where the error was small.
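For instance, here is a minimal sketch of convergence to small error for a toy one-dimensional objective, f(x) = (x - 3)^2. The function, and names like tol and max_iter, are only for illustration; they are not part of your lab code.

# A minimal sketch of "convergence to small error": keep taking gradient
# steps until the update is smaller than a chosen tolerance.
grad_descent_tol <- function(x_start = 0, step = 0.1, tol = 1e-8, max_iter = 1e5) {
  x <- x_start
  for (i in seq_len(max_iter)) {
    x_new <- x - step * 2 * (x - 3)      # one gradient step for f(x) = (x - 3)^2
    if (abs(x_new - x) < tol) {          # stop when the adjustment is tiny
      return(list(estimate = x_new, iterations = i))
    }
    x <- x_new
  }
  list(estimate = x, iterations = max_iter)   # fell back on the iteration cap
}

grad_descent_tol()

Notice that the loop still needs a cap on the number of iterations as a safety net; the tolerance only decides when we are willing to stop early.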

This week, you will implement an algorithm to achieve true convergence; that is, there is a point at which no further improvement is possible.


Check-In 1: Convergence to small error


If you did not implement convergence to small error in your Gradient Descent functions, do so now.

Briefly describe how you decided when your iteration should stop; that is, what constitutes “small enough” error?

Canvas Link     

k-means clustering

In statistics, clustering is the task of using data to sort observations into groups based on the variable values you observe.

One of the most well-known clustering methods is called k-means clustering.


Required Video: Intro to K-means Clustering



As discussed in the video, k-means requires iteration. The steps are:

  1. Choose \(k\) starting seeds.
  2. Assign each observation to the closest seed.
  3. Re-calculate cluster centroids; set these as seeds.
  4. Repeat 2 and 3 until the centroids don’t change.

Once the centroids do not change at a subsequent step, the algorithm has converged - no further iteration could possibly help.

Clarification: The centroid of a cluster is simply the average value on each axis among the points in that cluster.
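To make the steps concrete, here is a rough sketch of the loop in R. This is not how you will run k-means in practice (you will use the built-in kmeans() function), and it skips details such as handling empty clusters; the function and object names are only for illustration.

# A bare-bones version of the four steps above, for two or more numeric columns.
my_kmeans_sketch <- function(dat, k) {
  dat <- as.matrix(dat)
  seeds <- dat[sample(nrow(dat), k), , drop = FALSE]   # Step 1: random starting seeds

  repeat {
    # Step 2: assign each observation to the closest seed (squared Euclidean distance)
    dists <- sapply(seq_len(k), function(j) {
      rowSums((dat - matrix(seeds[j, ], nrow(dat), ncol(dat), byrow = TRUE))^2)
    })
    assignments <- max.col(-dists)

    # Step 3: re-calculate the centroid of each cluster
    new_seeds <- t(sapply(seq_len(k), function(j) {
      colMeans(dat[assignments == j, , drop = FALSE])
    }))

    # Step 4: stop once the centroids no longer change (true convergence)
    if (all(new_seeds == seeds)) break
    seeds <- new_seeds
  }

  list(centers = seeds, cluster = assignments)
}

my_kmeans_sketch(iris[, 3:4], k = 3)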


Check-In 2: k-means


Question 1:

Use this website to practice walking through the k-means steps.

Begin by choosing “Random” centers, and “Gaussian Mixture” data. Try stepping through the clustering process for various choices of \(k\). Then try different data formats, to see the strengths and weaknesses of \(k\)-means.

Take a screenshot of the most visually pleasing clustering you can create on the site.

Canvas Link     

Question 2:

Consider the ever-thrilling iris dataset. Suppose we want to classify the flowers into species based on petal size.

library(tidyverse)

iris %>%
  ggplot(aes(x = Petal.Length, y = Petal.Width)) +
  geom_point()

There is one fairly clear cluster in the bottom left, and another, less clear, group of points in the top right.

my_kmeans <-
  iris %>%
  select(Petal.Length, Petal.Width) %>%
  kmeans(centers = 3)

my_kmeans
## K-means clustering with 3 clusters of sizes 52, 48, 50
## 
## Cluster means:
##   Petal.Length Petal.Width
## 1          4.3        1.34
## 2          5.6        2.04
## 3          1.5        0.25
## 
## Clustering vector:
##   [1] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
##  [38] 3 3 3 3 3 3 3 3 3 3 3 3 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##  [75] 1 1 1 2 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 1 2 2 2 2
## [112] 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2
## [149] 2 2
## 
## Within cluster sum of squares by cluster:
## [1] 13 16  2
##  (between_SS / total_SS =  94.3 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
## [6] "betweenss"    "size"         "iter"         "ifault"

  1. What was the x-coordinate of the final centroid for cluster 1?
  2. Which cluster was Observation 78 assigned to?
  3. What would you type to see the number of iterations before convergence?

Canvas Link     

Question 3:

In the code below, which checks the results of kmeans() for several numbers of clusters, what is the blanked-out code?

test_k <-
  tibble(
    `Number of Clusters` = 2:6,
    SS = map_dbl(2:6, ~kmeans(iris[, 3:4], centers = .x)$            )
)

test_k %>%
  ggplot() +
    aes(x = `Number of Clusters`, y = SS) +
    geom_point() +
    geom_line() +
    ylab("")

Canvas Link     

Principal Components Analysis

Although this is not our focus this week, it would be impossible to discuss k-means without also mentioning Principal Components Analysis (PCA).

Sometimes, we want to cluster data with a large number of variables. Technically, this is not a problem - we simply calculate the distance between points in \(3\) dimensions, or \(4\), or \(5000\). But it’s highly inconvenient: it’s difficult to visualize and difficult to interpret.
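For example, the Euclidean distance calculation looks the same no matter how many coordinates each point has; the two points below are made up purely for illustration.

# Two (made-up) points in five dimensions
p1 <- c(1, 2, 3, 4, 5)
p2 <- c(2, 2, 1, 0, 5)

sqrt(sum((p1 - p2)^2))   # Euclidean distance, computed by hand
dist(rbind(p1, p2))      # the built-in dist() function gives the same answer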

In practice, before k-means is used, we almost always apply dimension reduction to our data using PCA.


Required Video: Intro to PCA



For this class, you do not need to know the underlying math behind PCA; you only need the general interpretation, and how to use R to calculate it.
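As a starting point, here is a rough sketch of the PCA-then-cluster workflow using prcomp(), assuming we reduce the four numeric iris measurements to two principal components before running k-means. The object names are only for illustration.

# Compute principal components of the four numeric iris columns
my_pca <- prcomp(iris[, 1:4], scale. = TRUE)

# Keep only the scores on the first two principal components
pc_scores <- my_pca$x[, 1:2]

# Cluster in the reduced two-dimensional space
kmeans(pc_scores, centers = 3)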