Instructions

Marks

Exercises

  1. This question is about the normal distribution, and how it relates to the classification rule provided by linear discriminant analysis.

    1. Write down the density function for a bivariate normal distribution (\(p=2\)), with mean \(\mu_k\) and variance-covariance matrix \(\Sigma\).
    2. Show that the linear discriminant rule for two groups (\(K=2\)), with equal priors \(\pi_1=\pi_2\) and common variance-covariance matrix \(\Sigma_1=\Sigma_2 = \Sigma = \left[\begin{array}{cc} \sigma_1^2 & \rho\sigma_1\sigma_2 \\ \rho\sigma_1\sigma_2 & \sigma_2^2 \end{array}\right]\), where \(\rho\) is the population correlation between the two variables and \(\sigma_1, \sigma_2\) are their population standard deviations, is equivalent to: assign a new observation \(x_0\) to group 1 if \[x_0'\Sigma^{-1}(\mu_1-\mu_2) > \frac{1}{2}(\mu_1 + \mu_2)'\Sigma^{-1}(\mu_1-\mu_2)\]
    3. By generating a grid of values, draw the boundary between the two groups in the 2D space. Use these values for \(\mu_1, \mu_2\) and \(\Sigma\): \[\mu_1 = \left[\begin{array}{r} -2 \\ 2 \end{array}\right], ~~~\mu_2 = \left[\begin{array}{r} 2 \\ -2 \end{array}\right]\] \[\Sigma = \left[\begin{array}{rr} 4 & 3 \\ 3 & 5 \end{array}\right]\]
    4. Write down the rule using these parameter values, and sketch the line corresponding to the 1D discriminant space on the previous plot.
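The quantities in the rule from Question 1 can be checked numerically before plotting. The following is a minimal sketch in plain Python, offered only as an illustration and not as the assignment's intended method; the helper names `inv2`, `matvec`, `dot`, and `classify`, and the grid range of -6 to 6, are assumptions made here. It computes \(\Sigma^{-1}(\mu_1-\mu_2)\) and the right-hand side of the rule for the values given above, then labels every point on an integer grid.

```python
# Illustrative check of the two-group LDA rule (equal priors, common Sigma).
# mu1, mu2 and sigma are the values given in the exercise; all function
# names and the grid range are assumptions made for this sketch.

def inv2(m):
    """Inverse of a 2x2 matrix given as nested lists."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def matvec(m, v):
    """Product of a 2x2 matrix with a length-2 vector."""
    return [m[0][0] * v[0] + m[0][1] * v[1],
            m[1][0] * v[0] + m[1][1] * v[1]]

def dot(u, v):
    """Inner product of two length-2 vectors."""
    return u[0] * v[0] + u[1] * v[1]

mu1 = [-2.0, 2.0]
mu2 = [2.0, -2.0]
sigma = [[4.0, 3.0], [3.0, 5.0]]

# Left-hand side coefficients: w = Sigma^{-1} (mu1 - mu2)
w = matvec(inv2(sigma), [mu1[0] - mu2[0], mu1[1] - mu2[1]])

# Right-hand side threshold: c = 0.5 * (mu1 + mu2)' Sigma^{-1} (mu1 - mu2)
c = 0.5 * dot([mu1[0] + mu2[0], mu1[1] + mu2[1]], w)

def classify(x0):
    """Return 1 if x0' w > c, otherwise 2."""
    return 1 if dot(x0, w) > c else 2

# Label each point of an integer grid; the boundary lies where labels change.
grid = [(x1, x2) for x1 in range(-6, 7) for x2 in range(-6, 7)]
labels = {pt: classify(pt) for pt in grid}
```

Colouring the grid points by `labels` in any plotting tool shows the boundary as the line where the label switches; a finer grid gives a sharper picture.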
  2. This question is about dimension reduction using PCA. We will use data from the World Bank on development indicators for 264 countries. Go to https://databank.worldbank.org/data/source/world-development-indicators/preview/on# to extract a copy of the data yourself. Select all the countries that are available, and only the year 2017. Some of the entries in the country list are aggregates rather than countries, and should be excluded: “Upper middle income”, “Pre-demographic dividend”, “Post-demographic dividend”, “Other small states”, “OECD members”, “Not classified”, “Middle income”, “Lower middle income”, “Low income”, “Low & middle income”, “Least developed countries”, “UN classification”, “Late-demographic dividend”, “IDA & IBRD total”, “IDA blend”, “IDA only”, “IDA total”, “High income”, “Heavily indebted poor countries (HIPC)”, “Fragile and conflict affected situations”. Use the default set of variables.

You will need to do some cleaning on the data. (My code for cleaning is included in the assignment. It might work for you, but no guarantees.)

  1. Remove the two rows that have a missing Series Name, and rename the column 2017 [YR2017] to “value”.
  2. Make a look up dictionary mapping Series Code to Series Name, so that we can do the analysis with the code (shorter), but refer back to the name as needed.
  3. Spread the data into wide form, so that you have variables Country Name, Country Code, and the 55 Series Name variables as columns.
  4. Using the naniar package, make a missingness heatmap of the data.
  5. Remove any variable that is missing for more than 100 countries. Then remove any country that has missing values on more than 2 variables.
  6. Use \(k\) nearest neighbours imputation to fill in the missing values.
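The last two cleaning steps (filtering on missingness, then \(k\) nearest neighbours imputation) can be sketched as follows. This is a plain-Python illustration on a made-up toy table, not the cleaning code supplied with the assignment: the country and variable names, the values, the helper names, and the small thresholds (3 and 1 in place of 100 and 2) are all assumptions. The distance used here is a root mean squared difference over the variables observed in both rows; on real data the variables would normally be standardised first.

```python
import math

# Toy wide-form table: one row per country, one column per indicator code.
# All names and values are made up for illustration; None marks a missing value.
data = {
    "AAA": {"v1": 1.0, "v2": 10.0, "v3": None, "v4": 5.0},
    "BBB": {"v1": 2.0, "v2": None, "v3": None, "v4": 6.0},
    "CCC": {"v1": 3.5, "v2": 30.0, "v3": None, "v4": 7.0},
    "DDD": {"v1": None, "v2": None, "v3": None, "v4": None},
    "EEE": {"v1": 2.5, "v2": 26.0, "v3": 1.0, "v4": 6.5},
}

def drop_sparse_variables(table, max_missing):
    """Drop variables that are missing for more than max_missing countries."""
    variables = list(next(iter(table.values())).keys())
    keep = [v for v in variables
            if sum(row[v] is None for row in table.values()) <= max_missing]
    return {c: {v: row[v] for v in keep} for c, row in table.items()}

def drop_sparse_countries(table, max_missing):
    """Drop countries that are missing more than max_missing variables."""
    return {c: row for c, row in table.items()
            if sum(x is None for x in row.values()) <= max_missing}

def knn_impute(table, k):
    """Replace each missing value with the mean of that variable over the
    k nearest countries for which it is observed."""
    filled = {c: dict(row) for c, row in table.items()}
    for c, row in table.items():
        for v, x in row.items():
            if x is not None:
                continue
            candidates = []
            for other, orow in table.items():
                if other == c or orow[v] is None:
                    continue
                # Compare only on variables observed in both rows.
                shared = [u for u in row
                          if row[u] is not None and orow[u] is not None]
                if not shared:
                    continue
                d = math.sqrt(sum((row[u] - orow[u]) ** 2 for u in shared)
                              / len(shared))
                candidates.append((d, orow[v]))
            candidates.sort(key=lambda t: t[0])
            nearest = candidates[:k]
            if nearest:
                filled[c][v] = sum(val for _, val in nearest) / len(nearest)
    return filled

# Toy thresholds; the assignment uses 100 countries and 2 variables.
cleaned = drop_sparse_countries(drop_sparse_variables(data, 3), 1)
imputed = knn_impute(cleaned, k=2)
```

The order matters: variables are filtered first, so the per-country missing counts are computed only on the variables that remain.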

    1. How many countries are in the full data set? How many variables?
    2. How many variables are missing on more than 100 countries?
    3. How many countries have missing values on more than 2 variables, after the variables identified in the previous part have been removed?
    4. Compute a principal component analysis for the cleaned and imputed data matrix. Tabulate the proportion of variation explained up to 8 PCs.
    5. Make a scree plot. How many principal components would be suggested? What proportion of variation would be explained by your choice? Please explain your thinking and decisions.

    6. Make biplots of the first three principal components, and explain the contributions of the variables, and similarity of the variables.

    7. Examine the largest coefficients for the first three principal components. Explain and interpret the first three principal components.

    8. Several countries stand out as outliers on the biplots. Examine the principal component scores for the first three principal components. Identify the countries that are at the extremes of each, and explain what their characteristics are. Make a plot to support your arguments.

    9. Use the bootstrap, resampling the countries, to produce 95% confidence intervals for the coefficients on PC1. Revise your interpretation of PC1 based on the significance of the coefficients.

    10. True or false. Principal has the same meaning as principle.
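For the bootstrap part of Question 2, the resampling logic can be sketched as below. This is a minimal plain-Python illustration under several assumptions, not the assignment's method: the data are simulated toy values, PC1 is found by power iteration on the correlation matrix, the intervals are simple percentile intervals, and the names `standardize`, `pc1`, and `bootstrap_pc1` are hypothetical. One detail that applies in any language is that the sign of an eigenvector is arbitrary, so each bootstrap replicate is flipped to agree with the original PC1 before the intervals are computed.

```python
import math
import random

def standardize(rows):
    """Centre and scale each column to mean 0 and standard deviation 1."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    sds = [math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / (n - 1))
           for j in range(p)]
    return [[(r[j] - means[j]) / sds[j] for j in range(p)] for r in rows]

def pc1(rows, iters=500):
    """Leading eigenvector of the correlation matrix, by power iteration."""
    z = standardize(rows)
    n, p = len(z), len(z[0])
    corr = [[sum(r[i] * r[j] for r in z) / (n - 1) for j in range(p)]
            for i in range(p)]
    v = [1.0 / math.sqrt(p)] * p
    for _ in range(iters):
        w = [sum(corr[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

def bootstrap_pc1(rows, n_boot=200, seed=1):
    """Percentile 95% intervals for the PC1 coefficients, resampling rows."""
    rng = random.Random(seed)
    ref = pc1(rows)
    n, p = len(rows), len(ref)
    reps = []
    for _ in range(n_boot):
        sample = [rows[rng.randrange(n)] for _ in range(n)]
        v = pc1(sample)
        # Eigenvector sign is arbitrary: flip to agree with the original PC1.
        if sum(a * b for a, b in zip(v, ref)) < 0:
            v = [-x for x in v]
        reps.append(v)
    lo_idx = (n_boot * 25) // 1000      # index of the 2.5% point
    hi_idx = n_boot - 1 - lo_idx        # index of the 97.5% point
    intervals = []
    for j in range(p):
        vals = sorted(r[j] for r in reps)
        intervals.append((vals[lo_idx], vals[hi_idx]))
    return ref, intervals

# Simulated toy data: two strongly related variables and one noise variable.
rng = random.Random(0)
toy = []
for _ in range(80):
    z = rng.gauss(0, 1)
    toy.append([z + 0.3 * rng.gauss(0, 1),
                z + 0.3 * rng.gauss(0, 1),
                rng.gauss(0, 1)])

loadings, ci = bootstrap_pc1(toy)
```

A coefficient whose interval covers zero cannot be distinguished from zero at this level, which is the basis for revising the interpretation of PC1.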