---
title: "ETC3250 2019 - Lab 11"
author: "SOLUTION"
date: "Week 11"
output:
html_document: default
---
```{r, echo = FALSE, message = FALSE, warning = FALSE, warning = FALSE}
knitr::opts_chunk$set(
echo = FALSE,
message = FALSE,
warning = FALSE,
error = FALSE,
collapse = TRUE,
comment = "#",
fig.height = 4,
fig.width = 6,
fig.align = "center",
cache = FALSE
)
```
# About the data
The lab uses cluster analysis to analyse the happy paintings by Bob Ross. This was the subject of the [538 post](http://fivethirtyeight.com/features/a-statistical-analysis-of-the-work-of-bob-ross/), "A Statistical Analysis of the Work of Bob Ross".
We have taken the painting images from the [sales site](http://www.saleoilpaintings.com/paintings/bob-ross/bob-ross-sale-3_1.html), read the images into R, and resized them all to be 20 by 20 pixels. Each painting has been classified into one of 8 classes based on the title of the painting. This is the data that you will work with.
Each row corresponds to one painting, and the rgb color values at each pixel are in each column. With a $20\times20$ image, this leads to $400\times3=1200$ columns.
We have given you a set of 178 paintings. Your job is to cluster the data and thus organise the paintings.
Here is one of the original paintings and how it appears after image processing:
```{r out.width="70%"}
library(tidyverse)
library(ggdendro)
library(ggthemes)
library(magick)
library(fpc)
br5 <- image_read("data/paintings/bobross5.jpg")
plot(br5)
```
```{r out.width="40%"}
# Long form data is easier to plot.
p_long <- read_csv("data/paintings-long-train.csv")
p_long %>% filter(id == 5) %>%
ggplot(aes(x=x, y=-y, fill=h)) +
geom_tile() +
scale_fill_identity() +
theme_solid() +
theme(legend.position="none")
```
# Activities
1. Read in the data. How many variables? What are the variables summarising? How many paintings are processed?
```{r eval=FALSE}
# Read the data to cluster
paintings <- read_csv("data/paintings-subset.csv")
```
2. Conduct a hierarchical cluster analysis. Try using "Wards linkage". Make the dendrogram, and determine how many clusters you believe it suggests. (You can use Euclidean distance, and the variables are already on a comparable scale, so it should not be necessary to standardise them.)
```{r eval=FALSE}
p_d <- dist(paintings[,4:1203])
p_hc_w <- hclust(p_d, method="ward.D2")
ggdendrogram(p_hc_w)
```
**The dendrogram suggests anything from 2-possibly 20 clusters, and around 7 clusters looks maybe a good start.**
3. Compute the cluster statistics suite for a range of choices of cuts of the dendrogram (e.g. $k=2, ..., 10$). How many clusters would be suggested?
```{r eval=FALSE}
p_stats_ncl <- NULL; p_stats <- NULL
for (i in 2:10) {
cl <- cutree(p_hc_w, i)
x <- cluster.stats(p_d, cl)
p_stats_ncl <- cbind(p_stats_ncl, cl)
p_stats <- rbind(p_stats, c(x$within.cluster.ss, x$wb.ratio, x$ch, x$pearsongamma, x$dunn, x$dunn2))
}
colnames(p_stats) <- c("within.cluster.ss","wb.ratio", "ch", "pearsongamma", "dunn", "dunn2")
p_stats <- data.frame(p_stats)
p_stats$cl <- 2:10
p_stats_m <- p_stats %>%
gather(stat, value, -cl)
ggplot(data=p_stats_m) +
geom_line(aes(x=cl, y=value)) + xlab("# clusters") + ylab("") +
facet_wrap(~stat, ncol=3, scales = "free_y") +
theme_bw()
hcl <- cutree(p_hc_w, 5)
```
**There is no clear answer from the cluster statistics. Pearsongamma and wb.ratio suggest around 5.**
4. Using your best choice of $k$, run a $k$-means clustering. Compute a confusion matrix to compare the result from hierarchical with the $k$-means. Are they very similar, or quite different? (One thing that is useful to do when comparing two cluster solutions is to re-label the cluster id's from one group so that they best match, and the larger numbers are down the diagonal of the confusion matrix.)
```{r eval=FALSE}
# Choose k=5
kcl <- kmeans(paintings[,4:1203], 5, nstart=5)$cluster
table(hcl)
table(kcl)
table(hcl, kcl)[,c(4,2,5,3,1)] # rearrangement helps match the clusters
stats1 <- cluster.stats(p_d, hcl)
stats2 <- cluster.stats(p_d, kcl)
# k-means produces better cluster stats
```
**The comparison of cluster statistics on best solution from both suggests $k$-means is slightly better. The big difference is that the number of paintings in the $k$-means solution is more even than hierarchical.**
5. With your best solution, take a look at the paintings that fall into each group. Write a summary explaining how the clustering algorithm organised the paintings.
```{r eval=FALSE}
p_ids <- list.files("data/paintings")
# Hierarchical solution
cl_ids <- paintings[hcl==1, 1:2]
# read in painting images for a cluster and write into a separate directory, to look at externally
for (i in 1:nrow(cl_ids)) {
img <- image_read(paste0("data/paintings/bobross",cl_ids$id[i],".jpg"))
image_write(img, paste0("data/paintings_cluster/",cl_ids$name[i],".jpg"))
}
```
**I used the five clusters solution, and after taking a quick look at the results, this is the summary. The hierarchical makes more sensible clusters. So even though the statistics indicate k-means is slightly better, the clusters from hierarchical are easier to explain. Two clusters are mostly oval paintings, one with light frame, and the other with dark frame. Another cluster is bluish paintings. The two remaining are quite mixed, and suggest that breaking the paintings into more clusters might help produce more homogeneous groups of paintings, e.g. pinkish vs greenish, flowers vs lakes vs forests.**
6. Re-do analysis by first conducting a PCA, and compute interpoint distances on the reduced set of PCs.
```{r eval=FALSE}
p_pca <- prcomp(paintings[,4:1203], scale=FALSE, retx=TRUE)
summary(p_pca)
screeplot(p_pca, npcs=20, type="lines", main="")
p_df <- p_pca$x[,1:15]
p_pca_d <- dist(p_df)
p_pca_hc_w <- hclust(p_pca_d, method="ward.D2")
ggdendrogram(p_pca_hc_w)
p_stats_ncl <- NULL; p_stats <- NULL
for (i in 2:10) {
cl <- cutree(p_pca_hc_w, i)
x <- cluster.stats(p_pca_d, cl)
p_stats_ncl <- cbind(p_stats_ncl, cl)
p_stats <- rbind(p_stats, c(x$within.cluster.ss, x$wb.ratio, x$ch, x$pearsongamma, x$dunn, x$dunn2))
}
colnames(p_stats) <- c("within.cluster.ss","wb.ratio", "ch", "pearsongamma", "dunn", "dunn2")
p_stats <- data.frame(p_stats)
p_stats$cl <- 2:10
p_stats_m <- p_stats %>%
gather(stat, value, -cl)
ggplot(data=p_stats_m) +
geom_line(aes(x=cl, y=value)) + xlab("# clusters") + ylab("") +
facet_wrap(~stat, ncol=3, scales = "free_y") +
theme_bw()
hcl_pca <- cutree(p_hc_w, 8)
table(hcl, hcl_pca)
```
**The reduced set of variables results in a clearer suggestion, from the cluster statistics, of 7-8 clusters. Examining the 8 cluster solution with the previous 5 cluster solution produced on the full data set, shows that it is effectively breaks the bigger clusters and improves the solution.**