Lab 7 Instructions
Introduction
In this lab, you will write functions to implement k-means and hierarchical clustering.
You are asked to write these functions in a mini-package (clust431) for ease of sharing. You do not have to do any package creation or management tasks, apart from properly documenting your functions. In particular, you are not required to write unit tests, although you may find it useful to do so.
A package skeleton to work from is linked on the assignment page, if you wish to use one.
Of course, no dedicated clustering functions like kmeans() or hclust() can be used in your functions. Any other packages or functions are fair game.
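Regarding documentation: R packages conventionally document functions with roxygen2 comment blocks placed directly above each function. A minimal illustration (the parameters shown are placeholders matching the tasks below, not a required interface):

```r
#' K-Means Clustering
#'
#' Implements a very basic k-means algorithm from scratch.
#'
#' @param data A numeric data frame or matrix of observations.
#' @param k The number of clusters.
#' @param pca If TRUE, cluster on the first two principal components.
#'
#' @return A list containing the cluster assignments and the total
#'   sum of squares.
#' @export
k_means <- function(data, k, pca = FALSE) {
  # function body goes here
}
```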
Tasks
k-means
Write one function called k_means() that implements a very basic k-means algorithm.
- Choose \(k\) random observations in the data as your starting points.
- Do not do any fancy adjustments to balance cluster sizes and so forth.
- Include an option in k_means() to automatically perform PCA before doing the k_means() clustering, using only the first 2 dimensions. (You may use built-in functions like princomp() for this.)
- At a minimum, your function should output the cluster assignments and total sum of squares.
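For orientation only, here is one minimal sketch of how the steps above could fit together. It assumes numeric input; the argument names, the iteration cap, the within-cluster interpretation of "total sum of squares", and the (absent) handling of empty clusters are all illustrative choices, not requirements:

```r
# A minimal sketch of basic k-means, following the steps above.
k_means <- function(data, k, pca = FALSE, max_iter = 100) {
  data <- as.matrix(data)
  if (pca) {
    # Reduce to the first 2 principal component scores before clustering
    data <- princomp(data)$scores[, 1:2]
  }
  # Choose k random observations as the starting centers
  centers <- data[sample(nrow(data), k), , drop = FALSE]
  assignments <- integer(nrow(data))
  for (i in seq_len(max_iter)) {
    # Distance from every observation to every center
    dists <- as.matrix(dist(rbind(centers, data)))[-(1:k), 1:k, drop = FALSE]
    new_assignments <- apply(dists, 1, which.min)
    if (identical(new_assignments, assignments)) break  # converged
    assignments <- new_assignments
    # Recompute each center as the mean of its cluster
    # (no special handling here if a cluster empties out)
    for (j in seq_len(k)) {
      centers[j, ] <- colMeans(data[assignments == j, , drop = FALSE])
    }
  }
  # Total within-cluster sum of squares
  tot_withinss <- sum((data - centers[assignments, , drop = FALSE])^2)
  list(clusters = assignments, tot_withinss = tot_withinss)
}
```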
Hierarchical Clustering
Write one function called hier_clust() that implements agglomerative hierarchical clustering.
- You should allow the user to determine the desired number of clusters.
- You only need to output the cluster assignments, not a dendrogram or cut heights, etc.
- You may use whichever linkage approach you prefer.
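Again purely for orientation, a minimal sketch using single linkage (one possible choice, per the last bullet) might look like this; the function and argument names are illustrative:

```r
# A minimal sketch of agglomerative clustering with single linkage.
hier_clust <- function(data, n_clusters) {
  data <- as.matrix(data)
  d <- as.matrix(dist(data))       # pairwise Euclidean distances
  diag(d) <- Inf
  clusters <- seq_len(nrow(data))  # each observation starts in its own cluster
  while (length(unique(clusters)) > n_clusters) {
    # Find the closest pair of observations currently in different clusters
    idx <- which(d == min(d), arr.ind = TRUE)[1, ]
    # Single linkage: merge the two clusters containing that pair
    clusters[clusters == clusters[idx[2]]] <- clusters[idx[1]]
    # Blank out within-cluster distances so they are never selected again
    d[outer(clusters, clusters, "==")] <- Inf
  }
  # Relabel the surviving clusters as 1, 2, ..., n_clusters
  list(clusters = as.integer(factor(clusters)))
}
```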
ReadMe
- Fill out the ReadMe.Rmd for your package, demonstrating that your functions are successful and that the output makes sense.
Challenge
This week, there is no additional Challenge, nor is there any “free” 5 points. However, there are many opportunities for you to get Bonuses.
Although you are only required to write the most basic implementations of these methods, there are many ways to make them snazzier. You may add as many of the following features to your package as you like.
Minor Features: up to +5 Bonus Points each
These features should not change the default behavior of your functions.
- Include an option in k_means() to choose the initial clusters in a “smart” way; i.e., a way that is non-random and spreads the clusters out. (See the sketch after this list.)
- Write a function that nicely plots the results of your k_means() clustering in the first two PC dimensions. (Hint: Check out geom_ellipse().)
- Include an option in hier_clust() that changes the distance metric used to something besides Euclidean distance.
- Write a function that prints your hier_clust() results as a beautiful dendrogram. (A plain dendrogram like the automatic output of hclust() will not suffice. However, you may use dedicated packages like ggdendro for this task.)
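For the first minor feature, one common non-random idea is a farthest-point heuristic in the spirit of k-means++, simplified here to be deterministic. A hedged sketch (the name smart_starts is hypothetical):

```r
# Illustrative "smart" starting points: greedily pick observations that
# are spread far apart (a deterministic cousin of k-means++).
smart_starts <- function(data, k) {
  data <- as.matrix(data)
  d <- as.matrix(dist(data))       # pairwise distances
  chosen <- which.max(rowSums(d))  # start with the most remote observation
  while (length(chosen) < k) {
    # Each observation's distance to its nearest already-chosen point
    nearest <- apply(d[, chosen, drop = FALSE], 1, min)
    chosen <- c(chosen, which.max(nearest))
  }
  data[chosen, , drop = FALSE]     # k well-spread starting centers
}
```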
Major Features: up to +10 Bonus Points each
- Write a function that performs many run-throughs of k_means() with different random starting seeds, then somehow combines the outputs into one final clustering. No for-loops may be used in this function; instead, you should use something from the map or apply family. (See the sketch after this list.)
- Write a function that tries k-means for several values of \(k\), and suggests a good choice of \(k\) in a smart way. There is no “right” way to choose \(k\); you’ll have to be innovative.
- Write a function that takes cluster assignments AND true group memberships as input, and produces a visually pleasing summary of how accurate the clusters are. This needs to be a very polished visualization for full bonus.
- Write a function that uses cluster assignments and incomplete group memberships to estimate the probabilities that each unknown observation belongs to each category. For example, in the Federalist Papers data, you might return that one of the unknown essays is “50% Hamilton, 40% Jay, 10% Madison”. This does not need to be informed by any fancy math, but it needs to be somewhat logically justified.
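For the first major feature, here is one loose sketch using purrr’s map family. It assumes a k_means() that returns a total sum of squares named tot_withinss, as in the earlier sketch, and “combines” the runs by simply keeping the best one; more clever combining rules are certainly possible:

```r
library(purrr)

# Illustrative combiner: run k_means() many times with different seeds
# and keep the run with the lowest total within-cluster sum of squares.
k_means_multi <- function(data, k, n_runs = 25) {
  runs <- map(seq_len(n_runs), function(seed) {
    set.seed(seed)  # a different random start for each run
    k_means(data, k)
  })
  best <- which.min(map_dbl(runs, "tot_withinss"))
  runs[[best]]
}
```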