Introduction

In this lab, you will write functions to implement k-means and hierarchical clustering.

You are asked to write these functions in a mini-package (clust431) for ease of sharing. You do not have to do any package creation or management tasks, apart from properly documenting your functions. In particular, you are not required to write unit tests, although you may find them useful.

Click Here to find a package skeleton to work from, if you wish.

Of course, you may not use dedicated clustering functions like kmeans() or hclust() in your own functions. Any other packages or functions are fair game.

Tasks

k-means

  • Write one function called k_means() that implements a very basic k-means algorithm. (A rough sketch of one possible approach appears after this list.)

    • Choose \(k\) random observations in the data as your starting points.
    • Do not do any fancy adjustments to balance cluster sizes and so forth.
    • Include an option in k_means() to automatically perform PCA before clustering, keeping only the first two principal components. (You may use built-in functions like princomp() for this.)
    • At a minimum, your function should output the cluster assignments and total sum of squares.
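
To make the expected behavior concrete, here is a minimal sketch of one possible approach. The function name, arguments, and return structure are illustrative assumptions, not requirements; your k_means() may look quite different.

```r
# Illustrative sketch only -- names, arguments, and return value are
# assumptions, not the required interface.
k_means_sketch <- function(data, k, pca = FALSE, max_iter = 100) {
  data <- as.matrix(data)
  if (pca) {
    # Optional PCA step: keep only the first two principal components
    data <- princomp(data)$scores[, 1:2]
  }

  # Choose k random observations as the starting centroids
  centers <- data[sample(nrow(data), k), , drop = FALSE]
  assignments <- integer(nrow(data))

  for (iter in seq_len(max_iter)) {
    # Squared Euclidean distance from every observation to every centroid
    dists <- sapply(seq_len(k), function(j) colSums((t(data) - centers[j, ])^2))
    new_assignments <- apply(dists, 1, which.min)
    if (identical(new_assignments, assignments)) break  # converged
    assignments <- new_assignments

    # Recompute each centroid as the mean of its cluster
    # (empty clusters are not handled in this sketch)
    for (j in seq_len(k)) {
      centers[j, ] <- colMeans(data[assignments == j, , drop = FALSE])
    }
  }

  # Total within-cluster sum of squares
  list(clusters = assignments,
       tot_ss   = sum((data - centers[assignments, ])^2))
}
```

For example, a call like k_means_sketch(iris[, 1:4], k = 3, pca = TRUE) would return one cluster label per row of the data, along with the sum of squares.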

Hierarchical Clustering

  • Write one function called hier_clust() that implements agglomerative hierarchical clustering. (A rough sketch of one possible approach appears after this list.)

    • You should allow the user to specify the desired number of clusters.
    • You only need to output the cluster assignments, not a dendrogram or cut heights, etc.
    • You may use whichever linkage approach you prefer.
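
Again for concreteness, here is a minimal sketch of one agglomerative approach. Single linkage is an arbitrary choice here, and all names are illustrative assumptions.

```r
# Illustrative sketch only; single linkage is an arbitrary choice.
hier_clust_sketch <- function(data, n_clusters = 2) {
  n <- nrow(data)
  d <- as.matrix(dist(data))        # Euclidean distance matrix
  clusters <- as.list(seq_len(n))   # every observation starts in its own cluster

  while (length(clusters) > n_clusters) {
    # Find the pair of clusters with the smallest single-linkage distance,
    # i.e., the smallest distance between any two of their members
    m <- length(clusters)
    best <- c(1L, 2L)
    best_d <- Inf
    for (a in seq_len(m - 1)) {
      for (b in seq(a + 1, m)) {
        link <- min(d[clusters[[a]], clusters[[b]]])
        if (link < best_d) {
          best_d <- link
          best <- c(a, b)
        }
      }
    }

    # Merge the closest pair and drop the absorbed cluster
    clusters[[best[1]]] <- c(clusters[[best[1]]], clusters[[best[2]]])
    clusters[[best[2]]] <- NULL
  }

  # Flatten the list of clusters into an assignment vector
  assignments <- integer(n)
  for (i in seq_along(clusters)) assignments[clusters[[i]]] <- i
  assignments
}
```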

ReadMe

  • Fill out the ReadMe.Rmd for your package, demonstrating that your functions are successful and that the output makes sense.

Challenge

This week there is no additional Challenge and no “free” 5 points. However, there are many opportunities to earn bonus points.

Although you are only required to write the most basic implementations of these methods, there are many ways to make them snazzier. You may add as many of the following features to your package as you like.

Minor Features:

up to +5 Bonus Points each

These features should not change the default behavior of your functions.

  • Include an option in k_means() to choose the initial clusters in a “smart” way; i.e., a way that is non-random and spreads the clusters out. (One possible approach is sketched after this list.)

  • Write a function that nicely plots the results of your k_means() clustering in the first two PC dimensions. (Hint: Check out geom_ellipse(). A bare-bones starting point is sketched after this list.)

  • Include an option in hier_clust() that changes the distance metric used to something besides Euclidean distance. (See the snippet after this list.)

  • Write a function that plots your hier_clust() results as a beautiful dendrogram. (A plain dendrogram like the automatic output of hclust() will not suffice. However, you may use dedicated packages like ggdendro for this task.)
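
For the “smart” initialization bonus, one non-random option is farthest-point (maximin) seeding, sketched below. The helper name is hypothetical.

```r
# Hypothetical helper: farthest-point ("maximin") seeding.
# Each new seed is the observation farthest from all seeds chosen so far.
smart_init <- function(data, k) {
  data <- as.matrix(data)
  # Start from the observation farthest from the overall centroid
  idx <- which.max(colSums((t(data) - colMeans(data))^2))
  while (length(idx) < k) {
    # Squared distance from every observation to each chosen seed
    d <- sapply(idx, function(j) colSums((t(data) - data[j, ])^2))
    nearest <- apply(as.matrix(d), 1, min)  # distance to the closest seed
    idx <- c(idx, which.max(nearest))       # farthest point becomes a new seed
  }
  data[idx, , drop = FALSE]  # k rows to use as starting centroids
}
```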
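
For the plotting bonus, a bare-bones starting point using ggplot2 is shown below. It uses stat_ellipse() from ggplot2 rather than the hinted geom_ellipse() from ggforce, and assumes the cluster labels are already in hand; a polished version would need nicer labels, themes, and so on.

```r
# Bare-bones sketch; `clusters` holds one label per row of `data`.
library(ggplot2)

plot_k_means <- function(data, clusters) {
  scores <- princomp(data)$scores[, 1:2]  # project onto the first two PCs
  df <- data.frame(PC1 = scores[, 1],
                   PC2 = scores[, 2],
                   cluster = factor(clusters))
  ggplot(df, aes(x = PC1, y = PC2, color = cluster)) +
    geom_point() +
    stat_ellipse() +  # one ellipse per cluster
    labs(title = "k-means clusters in the first two PC dimensions")
}
```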
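
For the distance-metric bonus, note that base R’s dist() already supports several metrics, so the option can simply be passed through:

```r
# dist() supports "euclidean", "manhattan", "maximum", "minkowski", etc.,
# so hier_clust() could accept a metric argument and pass it along:
d <- as.matrix(dist(data, method = "manhattan"))
```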

Major Features:

up to +10 Bonus Points each

  • Write a function that performs many run-throughs of k_means() with different random starting seeds, then somehow combines the outputs into one final clustering. No for-loops may be used in this function; instead, you should use something from the map or apply family. (A sketch appears after this list.)

  • Write a function that tries k-means for several values of \(k\), and suggests a good choice of \(k\) in a smart way. There is no “right” way to choose \(k\); you’ll have to be innovative. (One heuristic is sketched after this list.)

  • Write a function that takes cluster assignments AND true group memberships as input, and produces a visually pleasing summary of how accurate the clusters are. This needs to be a very polished visualization for full bonus. (A simple starting point is sketched after this list.)

  • Write a function that uses cluster assignments and incomplete group memberships to estimate the probability that each unknown observation belongs to each category. (A simple heuristic is sketched after this list.)
    For example, in the Federalist Papers data, you might return that one of the unknown essays is “50% Hamilton, 40% Jay, 10% Madison”. This does not need to be informed by any fancy math, but it needs to be somewhat logically justified.
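
For the multi-start feature, a sketch using purrr::map() is below. It assumes a k_means() that returns a list with clusters and tot_ss elements, and it combines runs by simply keeping the one with the lowest total sum of squares; that is only one reasonable way to “combine” the outputs.

```r
# Sketch: multi-start k-means with no for-loops (uses purrr::map()).
# Assumes k_means() returns list(clusters = ..., tot_ss = ...).
library(purrr)

k_means_multi <- function(data, k, n_starts = 25) {
  runs <- map(seq_len(n_starts), function(seed) {
    set.seed(seed)       # a different random start each time
    k_means(data, k)
  })
  # Keep the run with the lowest total sum of squares
  runs[[which.min(map_dbl(runs, "tot_ss"))]]
}
```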
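
For choosing \(k\), one familiar heuristic is the elbow method: run k-means over a range of \(k\) and stop once adding another cluster yields only a small relative improvement. In the sketch below, the tol cutoff is an arbitrary assumption, and the same k_means() return shape is assumed.

```r
# Sketch: elbow heuristic for choosing k. The tol cutoff is arbitrary.
suggest_k <- function(data, k_max = 10, tol = 0.1) {
  ss <- sapply(1:k_max, function(k) k_means(data, k)$tot_ss)
  # Relative improvement gained by moving from k clusters to k + 1
  rel_drop <- -diff(ss) / ss[-length(ss)]
  elbow <- which(rel_drop < tol)[1]  # first k where the gain is small
  if (is.na(elbow)) k_max else elbow
}
```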
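
For the accuracy summary, a simple starting point is a heatmap of the confusion table between clusters and true groups. This sketch alone would not count as “very polished.”

```r
# Sketch: heatmap of the clusters-vs-truth confusion table.
library(ggplot2)

cluster_accuracy_plot <- function(clusters, truth) {
  tab <- as.data.frame(table(cluster = factor(clusters), truth = factor(truth)))
  ggplot(tab, aes(x = truth, y = cluster, fill = Freq)) +
    geom_tile() +
    geom_text(aes(label = Freq), color = "white") +
    labs(x = "True group", y = "Cluster assignment", fill = "Count")
}
```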
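
For the final feature, one logically defensible heuristic is to give each unknown observation the label proportions of the known observations in its cluster. The sketch below assumes that truth contains NA for the unknown observations, which is an assumption about your data format.

```r
# Sketch: each unknown observation inherits the label proportions of the
# known observations in its cluster. Assumes NA marks unknown labels, and
# that every cluster contains at least one known observation.
guess_memberships <- function(clusters, truth) {
  known <- !is.na(truth)
  # Row j: proportion of each true label within cluster j (known obs only)
  props <- prop.table(table(clusters[known], truth[known]), margin = 1)
  # Look up each unknown observation's cluster row
  props[as.character(clusters[!known]), , drop = FALSE]
}
```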