## Instructions

• The assignment must be submitted to Moodle as both an R Markdown (.Rmd) file and an HTML file. That is, two files need to be submitted.
• You need to list your team members on the report. For each of the four assignments, one team member needs to be nominated as the leader, and is responsible for coordinating the efforts of other team members, and submitting the assignment.
• It is strongly recommended that you complete the assignment individually, and then compare your answers and explanations with your teammates. Each student will have the opportunity to report on other team members' efforts on the assignment, and a member who does not substantially contribute to the team submission may get a reduced mark, or even a zero mark.
• R code should be hidden in the final report, unless it is specifically requested.
• Original work is expected. Any material used from external sources needs to be acknowledged.
• To make it a little easier for you, a skeleton of R code is provided in the Rmd file. Where you see ???, something is missing and you will need to fill it in with the appropriate function, argument or operator. You will also need to rearrange the code as necessary to do the calculations needed.

## Marks

• The total mark will be out of 25.
• 5 points will be reserved for readability and appropriate citing of external sources.
• 5 points will be reserved for reproducibility, meaning that the report can be re-generated from the submitted Rmarkdown.
• Accuracy and completeness of answers, and clarity of explanations will be the basis for the remaining 15 points.

## Exercises

1. This question is about the normal distribution, and how it relates to the classification rule provided by linear discriminant analysis.

a. Write down the density function for a bivariate normal distribution ($p=2$), with mean $\mu_k$ and variance-covariance matrix $\Sigma$.
b. Show that the linear discriminant rule for two groups ($K=2$), with $\pi_1=\pi_2$ and $\Sigma_1=\Sigma_2 = \Sigma = \left[\begin{array}{cc} \sigma_1^2 & \rho\sigma_1\sigma_2 \\ \rho\sigma_1\sigma_2 & \sigma_2^2 \end{array}\right]$, where $\rho$ is the population correlation between the two variables and $\sigma_1, \sigma_2$ are their respective population standard deviations, is equivalent to: assign a new observation $x_0$ to group 1 if $$x_0'\Sigma^{-1}(\mu_1-\mu_2) > \frac{1}{2}(\mu_1 + \mu_2)'\Sigma^{-1}(\mu_1-\mu_2)$$
c. By generating a grid of values, draw the boundary between the two groups in the 2D space. Use these values for $\mu_1, \mu_2$ and $\Sigma$: $$\mu_1 = \left[\begin{array}{r} -2 \\ 2 \end{array}\right], ~~~\mu_2 = \left[\begin{array}{r} 2 \\ -2 \end{array}\right]$$ $$\Sigma = \left[\begin{array}{rr} 4 & 3 \\ 3 & 5 \end{array}\right]$$
d. Write down the rule using these parameter values, and sketch the line corresponding to the 1D discriminant space on the previous plot.
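
For the grid-based boundary plot, one possible starting point is sketched below. It is not the skeleton from the Rmd file. It uses only base R, the grid range and resolution are arbitrary choices, and the names `w`, `cutoff` and `grid` are made up for this illustration.

```r
# Parameter values as given in the exercise
mu1 <- c(-2, 2)
mu2 <- c(2, -2)
Sigma <- matrix(c(4, 3, 3, 5), nrow = 2, byrow = TRUE)

# Pieces of the rule x0' Sigma^{-1} (mu1 - mu2) > 0.5 (mu1 + mu2)' Sigma^{-1} (mu1 - mu2)
w <- solve(Sigma, mu1 - mu2)            # Sigma^{-1} (mu1 - mu2)
cutoff <- 0.5 * sum((mu1 + mu2) * w)    # right-hand side of the rule

# Grid over the 2D space (range and resolution are assumptions)
grid <- expand.grid(x1 = seq(-8, 8, length.out = 101),
                    x2 = seq(-8, 8, length.out = 101))

# Apply the rule to every grid point: group 1 if the score exceeds the cutoff
score <- drop(as.matrix(grid) %*% w)
grid$group <- ifelse(score > cutoff, 1, 2)

# Colour the grid by assigned group; the boundary is where the colour changes
plot(grid$x1, grid$x2, col = c("skyblue", "orange")[grid$group],
     pch = 15, cex = 0.4, xlab = "x1", ylab = "x2", asp = 1)
points(rbind(mu1, mu2), pch = 4, cex = 2, lwd = 2)  # mark the two group means
```

A finer grid gives a sharper boundary at the cost of more points to draw.
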
2. This question is about dimension reduction using PCA. We will use data from the World Bank on development indicators for 264 countries. Go to https://databank.worldbank.org/data/source/world-development-indicators/preview/on# to extract a copy of the data yourself. Select all of the available countries, and only the year 2017. Some of the entries listed as countries are aggregates rather than countries, and should be excluded: “Upper middle income”, “Pre-demographic dividend”, “Post-demographic dividend”, “Other small states”, “OECD members”, “Not classified”, “Middle income”, “Lower middle income”, “Low income”, “Low & middle income”, “Least developed countries”, “UN classification”, “Late-demographic dividend”, “IDA & IBRD total”, “IDA blend”, “IDA only”, “IDA total”, “High income”, “Heavily indebted poor countries (HIPC)”, “Fragile and conflict affected situations”. Use the default set of variables that are chosen.

You will need to do some cleaning on the data. (My code for cleaning is included in the assignment. It might work for you, but no guarantees.)

1. Remove the two rows that have a missing Series Name, and rename 2017 [YR2017] to “value”.
2. Make a lookup dictionary mapping Series Code to Series Name, so that we can do the analysis with the code (shorter), but refer back to the name as needed.
3. Spread the data into wide form, so that you have variables Country Name, Country Code, and the 55 Series Name variables as columns.
4. Using the naniar package, make a missingness heatmap of the data.
5. Remove any variable that is missing for more than 100 countries. Then, remove any countries that have missing values on more than 2 variables.
6. Use $k$-nearest neighbours imputation to fill in the missing values.
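
The row removal, lookup dictionary, reshaping and missingness filtering steps can be sketched in base R as follows. This is an illustration on a made-up toy data frame, not the cleaning code supplied with the assignment. The helper `drop_sparse()` and the object names are hypothetical, and the toy call uses smaller thresholds than the 100 and 2 stated above. The heatmap and imputation steps are not shown.

```r
# Toy long-format data standing in for the World Bank export (all values made up)
wdi <- data.frame(
  `Country Name`  = c("A", "A", "B", "B", "C", "C", NA),
  `Country Code`  = c("AAA", "AAA", "BBB", "BBB", "CCC", "CCC", NA),
  `Series Name`   = c("Series one", "Series two", "Series one", "Series two",
                      "Series one", "Series two", NA),
  `Series Code`   = c("S1", "S2", "S1", "S2", "S1", "S2", NA),
  `2017 [YR2017]` = c(1, NA, 2, 5, 3, NA, NA),
  check.names = FALSE
)

# Drop rows with a missing Series Name, and rename the year column
wdi <- wdi[!is.na(wdi[["Series Name"]]), ]
names(wdi)[names(wdi) == "2017 [YR2017]"] <- "value"

# Lookup dictionary: Series Code -> Series Name
dict <- unique(wdi[, c("Series Code", "Series Name")])
series_lookup <- setNames(as.character(dict[["Series Name"]]),
                          as.character(dict[["Series Code"]]))

# Wide form: one row per country code, one column per series code
wide <- tapply(wdi$value,
               list(wdi[["Country Code"]], wdi[["Series Code"]]),
               function(v) v[1])

# Drop sparse variables first, then sparse countries
# (defaults follow the thresholds stated in the cleaning steps)
drop_sparse <- function(x, max_var_na = 100, max_row_na = 2) {
  x <- x[, colSums(is.na(x)) <= max_var_na, drop = FALSE]
  x[rowSums(is.na(x)) <= max_row_na, , drop = FALSE]
}

# Toy thresholds, chosen only so that the small example drops something
cleaned <- drop_sparse(wide, max_var_na = 1, max_row_na = 0)
```

Filtering variables before countries matters: a country can fall under the row threshold only once the sparsest variables are gone.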

a. How many countries are in the full data set? How many variables?
b. How many variables are missing on more than 100 countries?
c. How many countries have missing values on more than 2 variables, after the variables identified in part b have been removed?
d. Compute a principal component analysis for the cleaned and imputed data matrix. Tabulate the proportion of variation explained up to 8 PCs.
e. Make a scree plot. How many principal components would be suggested? What proportion of variation would be explained by your choice? Please explain your thinking and decisions.
f. Make biplots of the first three principal components, and explain the contributions of the variables and the similarity between variables.
g. Examine the largest coefficients for the first three principal components. Explain and interpret the first three principal components.
h. Several countries stand out as outliers on the biplots. Examine the principal component scores for the first three principal components. Identify the countries that are at the extremes of each, and explain what their characteristics are. Make a plot to support your arguments.
i. Use the bootstrap, resampling the countries, to produce 95% confidence intervals for the coefficients on PC1. Revise your interpretation of PC1 based on the significance of the coefficients.
j. True or false: “principal” has the same meaning as “principle”.
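
For the PCA and bootstrap parts of the second question, a minimal base R sketch is given below. It runs on the built-in `USArrests` data as a stand-in for the cleaned and imputed country matrix. Standardising the variables, the number of bootstrap replicates, the seed and the percentile interval are all assumptions made for illustration, not requirements taken from the assignment.

```r
# Stand-in data: replace with the cleaned, imputed country-by-variable matrix
x <- as.matrix(USArrests)

# PCA on standardised variables (scaling is an assumption)
pca <- prcomp(x, center = TRUE, scale. = TRUE)

# Proportion and cumulative proportion of variation explained by each PC
prop_var <- pca$sdev^2 / sum(pca$sdev^2)
var_table <- data.frame(PC = seq_along(prop_var),
                        proportion = prop_var,
                        cumulative = cumsum(prop_var))

# Bootstrap the PC1 coefficients by resampling rows (countries in the assignment)
set.seed(2019)    # arbitrary seed, for reproducibility only
n_boot <- 1000    # arbitrary number of replicates
pc1 <- pca$rotation[, 1]
boot_pc1 <- replicate(n_boot, {
  idx <- sample(nrow(x), replace = TRUE)
  b <- prcomp(x[idx, ], center = TRUE, scale. = TRUE)$rotation[, 1]
  # The sign of a PC is arbitrary, so align each replicate with the original PC1
  if (sum(b * pc1) < 0) -b else b
})

# Percentile 95% intervals, one row per variable
ci <- t(apply(boot_pc1, 1, quantile, probs = c(0.025, 0.975)))
```

Without the sign alignment, replicates that flip direction would inflate the intervals so that most of them straddle zero.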