Videos: 10 min

Readings: 20 min

Activities: 40 min

Check-ins: 6

In the last lab, you were asked to implement gradient descent - an approach that modifies your estimates bit by bit, inching ever closer to the ideal values.

Think about how you approached this iteration. If you had kept refining and refining, with ever smaller adjustments to your estimates, your code might never stop running. At some point, you had to stop and declare victory.

Perhaps you chose a very large number of iterations, and crossed your fingers that the estimates were “close enough” by then.

But perhaps you did something a bit better, and decided to stop iterating when your adjustments stopped making a big difference. Perhaps you cut off your loop when the estimate didn’t change much from one step to the next.

We call this approach *convergence to small error*. Your algorithm never reached a perfect endpoint, but it did reach a point where the error was small.
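In code, "small enough" usually means a tolerance you pick up front. Here is a minimal sketch of the idea on a stand-in objective function (the function, step size, and tolerance below are illustrative choices, not the lab's actual objective):

```r
# Minimize f(x) = (x - 3)^2 by gradient descent; the gradient is 2 * (x - 3).
# Stop once the estimate changes by less than `tol` from one step to the next
# -- that is, convergence to small error.
gradient_descent <- function(x = 0, step = 0.1, tol = 1e-8, max_iter = 1e5) {
  for (i in seq_len(max_iter)) {
    x_new <- x - step * 2 * (x - 3)
    if (abs(x_new - x) < tol) {
      return(x_new)   # change was "small enough"; declare victory
    }
    x <- x_new
  }
  x   # hit the iteration cap without converging
}

gradient_descent()
```

Note that the loop still needs `max_iter` as a safety net: the tolerance check decides when to stop in the typical case, but the cap guarantees the loop ends no matter what.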

This week, you will implement an algorithm to achieve **true convergence**; that is, there is a point at which no further improvement is possible.

If you did not implement convergence to small error in your Gradient Descent functions, do so now.

Briefly describe how you decided when your iteration should stop; that is, what constitutes “small enough” error?

In statistics, **clustering** means using data to sort observations into groups based on the variable values you observe.

One of the most well-known clustering methods is called **k-means** clustering.

As discussed in the video, k-means requires *iteration*. The steps are:

1. Choose \(k\) starting seeds.
2. Assign each observation to the closest seed.
3. Re-calculate the cluster centroids; set these as the new seeds.
4. Repeat Steps 2 and 3 until the centroids don't change.

Once the centroids do not change at a subsequent step, the algorithm has *converged* - no further iteration could possibly help.

**Clarification:** The *centroid* of the observations simply means the average value on each axis, among the points in the cluster.
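The steps above can be sketched directly in R. This is a from-scratch illustration of the iteration, not the built-in `kmeans()`; it assumes a numeric matrix input and that no cluster ever goes empty:

```r
# Toy k-means on a numeric matrix X, following the four steps above.
simple_kmeans <- function(X, k) {
  centers <- X[sample(nrow(X), k), , drop = FALSE]          # Step 1: choose k seeds
  repeat {
    # Step 2: assign each observation to its closest seed
    d <- as.matrix(dist(rbind(centers, X)))[-(1:k), 1:k]    # distances to each seed
    cluster <- max.col(-d)                                  # index of nearest seed
    # Step 3: recompute centroids (the mean on each axis within a cluster)
    new_centers <- do.call(rbind, lapply(1:k, function(j)
      colMeans(X[cluster == j, , drop = FALSE])))
    # Step 4: stop when the centroids no longer change -- true convergence
    if (all(new_centers == centers)) break
    centers <- new_centers
  }
  list(cluster = cluster, centers = centers)
}

set.seed(42)
fit <- simple_kmeans(as.matrix(iris[, 3:4]), k = 3)
fit$centers
```

Notice that the stopping rule here is exact equality of the centroids, not a "small enough" tolerance: once no observation switches clusters, the centroids are fixed and further iteration cannot change anything.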

**Question 1:**

Use this website to practice walking through the k-means steps.

Begin by choosing “Random” centers, and “Gaussian Mixture” data. Try stepping the clustering process for various choices of \(k\). Then try different data formats, to see the strengths and weaknesses of \(k\)-means.

Take a screenshot of the most visually pleasing clustering you can create on the site.

**Question 2:**

Consider the ever-thrilling *iris* dataset. Suppose we want to classify the flowers into species based on petal size.

There is one fairly clear cluster in the bottom left, and another, less clear, group of points in the top right.

```
## K-means clustering with 3 clusters of sizes 52, 48, 50
##
## Cluster means:
## Petal.Length Petal.Width
## 1 4.3 1.34
## 2 5.6 2.04
## 3 1.5 0.25
##
## Clustering vector:
## [1] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [38] 3 3 3 3 3 3 3 3 3 3 3 3 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [75] 1 1 1 2 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 1 2 2 2 2
## [112] 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2
## [149] 2 2
##
## Within cluster sum of squares by cluster:
## [1] 13 16 2
## (between_SS / total_SS = 94.3 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
```

- What was the x-coordinate of the final centroid for cluster 1?
- Which cluster was Observation 78 assigned to?
- What would you type to see the number of iterations before convergence?
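As a hint for questions like these: everything listed under "Available components" in the printout can be pulled out of the fitted object with `$`. For example (the object name `iris_km` is our own; the printout above comes from running `kmeans()` on the petal columns):

```r
# Fit k-means to the two petal measurements, as in the printout above
iris_km <- kmeans(iris[, 3:4], centers = 3)

iris_km$centers   # the final centroids, one row per cluster
iris_km$cluster   # the cluster assigned to each observation
iris_km$size      # how many observations landed in each cluster
```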

**Question 3:**

In the code below, which checks the results of `kmeans()` for many numbers of clusters, what is the blanked-out code?

```
test_k <- tibble(
  `Number of Clusters` = 2:6,
  SS = map_dbl(2:6, ~kmeans(iris[, 3:4], centers = .x)$      )
)

test_k %>%
  ggplot() +
  aes(x = `Number of Clusters`, y = SS) +
  geom_point() +
  geom_line() +
  ylab("")
```

Although this is not our focus this week, it would be impossible to discuss k-means without also mentioning Principal Components Analysis (PCA).

Sometimes, we want to cluster data with a large number of variables. Technically, this is not a problem - we simply calculate the distance between points in \(3\) dimensions, or \(4\), or \(5000\). But it’s highly inconvenient: it’s difficult to visualize and difficult to interpret.
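Euclidean distance generalizes to any number of dimensions with the same formula (the square root of the summed squared coordinate differences), which is why the number of variables poses no computational obstacle. A quick illustration in 4 dimensions, with made-up points:

```r
# Two points in 4-dimensional space
a <- c(1, 2, 3, 4)
b <- c(2, 2, 1, 4)

sqrt(sum((a - b)^2))   # distance computed by hand: sqrt(5)
dist(rbind(a, b))      # the same value from R's built-in dist()
```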

In practice, before k-means is used, we almost always apply *dimension reduction* to our data using PCA.

For this class, you do not need to know the underlying math behind PCA; only the general interpretation, and how to use R to calculate it.
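On the R side, PCA is available through the built-in `prcomp()`. A minimal sketch on the four *iris* measurements (scaling the variables first is our choice here, though it is a common default):

```r
# PCA on the four numeric iris measurements, standardized first
iris_pca <- prcomp(iris[, 1:4], scale. = TRUE)

summary(iris_pca)          # proportion of variance explained by each component
head(iris_pca$x[, 1:2])    # the first two principal component scores
```

The scores in `iris_pca$x` are the reduced-dimension coordinates: keeping just the first couple of columns gives you something easy to plot and to feed into k-means.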