---
title: "ETC3250 Assignment 2"
date: "DUE: Tuesday 2/4/2019 before start of class"
output: html_document
editor_options:
chunk_output_type: console
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(
echo = FALSE,
eval = FALSE,
message = FALSE,
warning = FALSE)
```
## Instructions
- Assignment needs to be turned in as Rmarkdown, and as html, to moodle. That is, two files need to be submitted.
- You need to list your team members on the report. For each of the four assignments, one team member needs to be nominated as the leader, and is responsible for coordinating the efforts of other team members, and submitting the assignment.
- It is strongly recommended that you individually complete the assignment, and then compare your answers and explanations with your team mates. Each student will have the opportunity to report on other team member's efforts on the assignment, and if a member does not substantially contribute to the team submission they may get a reduced mark, or even a zero mark.
- R code should be hidden in the final report, unless it is specifically requested.
- Original work is expected. Any material used from external sources needs to be acknowledged.
- To make it a little easier for you, a skeleton of R code is provided in the `Rmd` file. Where you see `???` means that something is missing and you will need to fill it in with the appropriate function, argument or operator. You will also need to rearrange the code as necessary to do the calculations needed.
## Marks
- Total mark will be out or 25
- 5 points will be reserved for readability, and appropriate citing of external sources
- 5 points will be reserved for reproducibility, that the report can be re-generated from the submitted Rmarkdown.
- Accuracy and completeness of answers, and clarity of explanations will be the basis for the remaining 15 points.
## Exercises
1. This question is about the normal distribution, and how it relates to the classification rule provided by linear discriminant analysis.
a. Write down the density function for a bivariate normal distribution ($p=2$), with mean $\mu_k$ and variance $\Sigma$.
b. Show that the linear discriminant rule for two groups ($K=2$), $\pi_1=\pi_2$ and $\Sigma_1=\Sigma_2 = \Sigma = \left[\begin{array}{cc} \sigma_1^2 & \rho\sigma_1\sigma_2 \\ \rho\sigma_1\sigma_2 & \sigma_2^2 \end{array}\right]$ where $\rho$ is the population correlation between the two variables, and $\sigma_1, \sigma_2$ are the population standard deviations of the two variables, respectively, is equal to:
*Assign a new observation $x_0$ to group 1 if*
$$x_0'\Sigma^{-1}(\mu_1-\mu_2) > \frac{1}{2}(\mu_1 + \mu_2)'\Sigma^{-1}(\mu_1-\mu_2)$$
c. By generating a grid of values, draw the boundary between two groups, in the 2D space. Use these values for $\mu_1, \mu_2$ and $\sigma$.
$$\mu_1 = \left[\begin{array}{r} -2 \\ 2 \end{array}\right], ~~~\mu_2 = \left[\begin{array}{r} 2 \\ -2 \end{array}\right]$$
$$\Sigma = \left[\begin{array}{rr} 4 & 3 \\ 3 & 5 \end{array}\right]$$
d. Write down the rule using these parameter values, and sketch the line corresponding to the 1D discriminant space on the previous plot.
2. This question is about dimension reduction using PCA. We will use data from the world bank, on development indicators for 264 countries. Go to this site https://databank.worldbank.org/data/source/world-development-indicators/preview/on# to extract a copy of the data yourself. You should select all the countries that they have available, and just year 2017. There are some countries that are not countries, that you should exclude: "Upper middle income", "Pre-demographic dividend", "Post-demographic dividend", "Other small states", "OECD members", "Not classified", "Middle income", "Lower middle income", "Low income", "Low & middle income", "Least developed countries", "UN classification", "Late-demographic dividend", "IDA & IBRD total", "IDA blend", "IDA only", "IDA total", "High income", "Heavily indebted poor countries (HIPC)", "Fragile and conflict affected situations". Use the default set of variables that are chosen.
```{r}
library(readxl)
library(tidyverse)
WDI <- read_xlsx("data/Data_Extract_From_World_Development_Indicators.xlsx", na=???, n_max=14466)
# The following line removes the non-countries, because I selected all countries to
# download. You might not need to run this if you have already filtered out the
# countries before downloading.
exclude <- c("Upper middle income","Pre-demographic dividend", "Post-demographic dividend", "Other small states", "OECD members", "Not classified", "Middle income", "Lower middle income", "Low income", "Low & middle income", "Least developed countries: UN classification", "Late-demographic dividend", "IBRD only", "IDA & IBRD total", "IDA blend", "IDA only", "IDA total", "High income", "Heavily indebted poor countries (HIPC)", "Fragile and conflict affected situations", "Early-demographic dividend", "Data from database: World Development Indicators")
WDI <- WDI %>% filter(!(`Country Name` %in% exclude))
WDI %>% count(`Series Name`) %>% print(n=60)
WDI %>% count(`Country Name`) %>% print(n=300)
```
You will need to do some cleaning on the data. (*My code for cleaning is included in the assignment. It might work for you, but no guarantees.*)
i. Remove the two lines which are missing on the `Series Name`, and rename `2017 [YR2017]` to "value".
ii. Make a look up dictionary mapping `Series Code` to `Series Name`, so that we can do the analysis with the code (shorter), but refer back to the name as needed.
iii. Spread the data into wide form, so that you have variables `Country Name`, `Country Code`, and the 55 `Series Name` variables as columns.
iv. Using the `naniar` package make a missingness heatmap of the data
v. Remove any variable that is missing for more than 100 countries. Then, remove any countries that have missing on more than 2 variables.
vi. Use $k$ nearest neighbours imputation to fill in the missings.
```{r}
# Remove rows with missings and rename measured variable to value
WDI <- WDI %>% filter(!is.na(`Series Name`)) %>%
rename(value = `2017 [YR2017]`)
# Create lookup dictionary
WDI_names <- WDI %>%
select(`Series Code`, `Series Name`) %>%
distinct()
# Make wide form to inspect missing data
library(naniar)
library(impute)
WDI_wide <- WDI %>%
select(-`Series Name`) %>%
spread(`Series Code`, value)
vis_miss(WDI_wide, sort_miss=TRUE, cluster=TRUE)
# Remove variables and countries with too many missings
WDI_varmiss <- WDI %>% filter(is.na(value)) %>% count(`Series Code`, sort=TRUE) %>%
filter(n < 100) # Missing on 2/3 countries removed
WDI_nomiss <- WDI %>% filter(`Series Code` %in% WDI_varmiss$`Series Code`) # Down to 32 variables
WDI_cntmiss <- WDI_nomiss %>% filter(is.na(value)) %>% count(`Country Code`, sort=TRUE) %>% filter(n<3) # Keep countries with only two missings
WDI_nomiss <- WDI_nomiss %>% filter(`Country Code` %in% WDI_cntmiss$`Country Code`)
# Set up for knn, and impute missings
WDI_nomiss_wide <- WDI_nomiss %>% select(-`Series Name`) %>%
spread(`Series Code`, value)
WDI_impute <- impute.knn(as.matrix(WDI_nomiss_wide[,-c(1,2)]))$data
WDI_nomiss_wide[,3:34] <- WDI_impute
```
```{r}
# This is the code to install the impute package with a knn imputation fn
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("impute", version = "3.8")
```
a. How many countries are in the full data set? How many variables?
b. How many variables are missing on more than 100 countries?
c. How many countries have missing values on more than 2 variables, after the variables in b. have been removed?
d. Compute a principal component analysis for the cleaned and imputed data matrix. Tabulate the proportion of variation explained up to 8 PCs.
e. Make a scree plot. How many principal components would be suggested? What proportion of variation would be explained by your choice? Please explain your thinking and decisions.
```{r}
WDI_pca <- prcomp(WDI_nomiss_wide[,3:???], retx=???, center=???, scale=???)
WDI_pca_var <- tibble(n=1:length(WDI_pca$???), evl=WDI_pca$???^2)
ggplot(WDI_pca_var, aes(x=???, y=???)) + geom_???() +
xlab("Number of PCs") + ylab("Eigenvalue")
```
f. Make biplots of the first three principal components, and explain the contributions of the variables, and similarity of the variables.
```{r}
library(gridExtra)
WDI_pca_pcs <- as_tibble(WDI_pca$???[,1:3]) %>%
mutate(cnt=WDI_nomiss_wide$???,
name=WDI_nomiss_wide$`Country Name`)
WDI_pca_evc <- as_tibble(WDI_pca$???[,1:3]) %>%
mutate(origin=rep(0, 32), variable=as.character(1:32),
varname=rownames(WDI_pca$???)) %>%
mutate(PC1s = ???*(WDI_pca_var$???l[1]*2),
PC2s = ???*(WDI_pca_var$???[2]*2),
PC3s = ???*(WDI_pca_var$???[3]*2))
p1 <- ggplot() +
geom_segment(data=WDI_pca_evc, aes(x=origin, xend=???, y=origin, yend=???), colour="red") +
geom_text(data=WDI_pca_evc, aes(x=???, y=???, label=???), colour="red") +
geom_???(data=WDI_pca_pcs, aes(x=???, y=???)) +
geom_text(data=filter(WDI_pca_pcs, abs(PC1)>???), aes(x=???, y=???, label=cnt)) +
geom_text(data=filter(WDI_pca_pcs, abs(PC2)>???), aes(x=???, y=???, label=???)) +
xlab("PC1") + ylab("PC2") +
theme(aspect.ratio=1)
grid.arrange(p1, ???, ???, ncol=3)
```
g. Examine the largest coefficients for the first three principal components. Explain and interpret the first three principal components.
```{r}
WDI_pca_evc %>% select(variable, varname, ???) %>% arrange(desc(???)) %>% print(n=32)
WDI_pca_evc %>%
select(variable, varname, ???) %>%
filter(abs(???)>???) %>%
left_???(WDI_names, by=c("varname"="Series Code")) %>%
arrange(???)
```
h. Several countries stand out as outliers on the biplots. Examine the principal component scores for the first three principal components. Identify the countries that are at the extremes of each, and explain what their characteristics are. Make a plot to support your arguments.
```{r}
WDI_pca_pcs %>%
select(cnt, name, ???) %>%
???(abs(???)>???) %>%
arrange(???)
ggplot(WDI_nomiss_wide, aes(x=???)) + geom_density(fill="black", alpha=0.3) +
geom_vline(data=filter(WDI_nomiss_wide, `Country Name` %in% c(???, ???)), aes(xintercept=???, colour=`Country Name`))
ggplot(WDI_nomiss_wide, aes(x=???)) + geom_???(fill="black", alpha=0.3) +
geom_???(data=filter(WDI_nomiss_wide, `Country Name` %in% c(???, ???)), aes(xintercept=???, colour=`Country Name`))
```
i. Use bootstrap, resampling the countries to produce 95% confidence intervals for the coefficients on PC1. Revise your interpretation of PC1 based on significance of the coefficients.
```{r}
library(boot)
library(forcats)
compute_PC <- function(???, ???) {
pc <- prcomp(???[???,], center=???, scale=???)$???[,???]
if (???(pc[1]) < 0)
pc <- ???
return(???)
}
# Make sure the first coefficient of the PC is positive.
PC1_boot <- boot(data=WDI_nomiss_wide[,3:34], ???, R=1000)
colnames(PC1_boot$t) <- colnames(WDI_nomiss_wide[,3:34])
PC1_boot_t0 <- tibble(var=factor(names(PC1_boot$t0), levels=names(PC1_boot$t0)), t0=PC1_boot$t0)
PC1_boot_ci <- as_tibble(???) %>%
gather(var, coef) %>%
mutate(??? = factor(???, levels=levels(PC1_boot_t0$var))) %>%
group_by(???) %>%
summarise(q2.5 = quantile(???, ???),
q5 = median(???),
q97.5 = quantile(???, ???)) %>%
left_join(PC1_boot_t0)
PC1_boot_ci %>% ???(t0) %>% print(n=32)
ggplot(???, aes(x=fct_reorder(???, q2.5), y=t0)) +
geom_???(yintercept=0, colour="white", size=2) +
geom_???() +
geom_???(aes(ymin=q2.5, ymax=q97.5)) +
coord_flip()
```
j. True or false. *Principal* has the same meaning as *principle*.