1. This question explores the bias-variance trade-off. Read in the simulated data possum_magic.rda. The data were generated from the following model:

\[ y = 2x + 10\sin(x) + \varepsilon, ~~\text{where}~~ x\in [-10, 20], ~~\varepsilon\sim N(0, 4^2)\]

  1. Make a plot of the data, overlaying the true model.
  2. Break the data into a \(2/3\) training and a \(1/3\) test set. (Hint: You can use the function createDataPartition from the caret package.) Fit a linear model, using the training set. Compute the training MSE and test MSE. Overlay the linear model fit on a plot of the data and true model.
  3. Now examine the behaviour of the training and test MSE for a loess fit.
    1. Look up the loess model fit, and write a paragraph explaining how this fitting procedure works. In particular, explain what the span argument does.
    2. Compute the training and test MSE for a range of span values: 0.5, 0.3, 0.2, 0.1, 0.05, 0.01. Plot the training and test MSE against the span parameter. (For each model, also make a plot of the data and fitted model for yourself; this need not be handed in.)
    3. Write a paragraph explaining the effect that increasing the flexibility of the fit has on the training and test MSE. Indicate what you think is the optimal span value for this data.
  4. Now examine the relationship between bias, variance and MSE. Compute the bias, MSE and hence the variance, for the test set, from the loess models fitted with span values 0.5, 0.3, 0.2, 0.1, 0.05, 0.01. Plot MSE, bias and variance against span. Write a few sentences explaining what you learn.
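A starting-point sketch in R for the steps above. Since possum_magic.rda is not available here, stand-in data are simulated from the stated model; the split uses base sample() instead of caret::createDataPartition, and all variable names are illustrative.

```r
# Stand-in data simulated from y = 2x + 10 sin(x) + e, e ~ N(0, 4^2);
# replace with load("possum_magic.rda") when working with the real file.
set.seed(2020)
n <- 1500
d <- data.frame(x = runif(n, -10, 20))
d$y <- 2 * d$x + 10 * sin(d$x) + rnorm(n, 0, 4)

# 2/3 training, 1/3 test split.
train_id <- sample(n, size = round(2 * n / 3))
train <- d[train_id, ]
test  <- d[-train_id, ]

mse <- function(obs, pred) mean((obs - pred)^2)

# Linear model: training and test MSE.
fit_lm <- lm(y ~ x, data = train)
mse_lm <- c(train = mse(train$y, predict(fit_lm)),
            test  = mse(test$y, predict(fit_lm, newdata = test)))

# Loess fits across the requested span values; the "direct" surface
# avoids interpolation artefacts at small spans.
spans <- c(0.5, 0.3, 0.2, 0.1, 0.05, 0.01)
err <- sapply(spans, function(s) {
  fit <- loess(y ~ x, data = train, span = s,
               control = loess.control(surface = "direct"))
  c(train = mse(train$y, predict(fit)),
    test  = mse(test$y, predict(fit, newdata = test)))
})
colnames(err) <- spans

# Test-set decomposition: the true f(x) = 2x + 10 sin(x) is known, so
# squared bias can be approximated by mean((pred - f)^2); the remainder
# MSE - bias^2 reflects variance (plus the irreducible error, 4^2 = 16).
f_test <- 2 * test$x + 10 * sin(test$x)
decomp <- sapply(spans, function(s) {
  fit <- loess(y ~ x, data = train, span = s,
               control = loess.control(surface = "direct"))
  p <- predict(fit, newdata = test)
  mse_s <- mse(test$y, p)
  bias2 <- mean((p - f_test)^2)
  c(mse = mse_s, bias2 = bias2, var_plus_noise = mse_s - bias2)
})
colnames(decomp) <- spans
```

Plotting err and decomp against spans (with matplot or ggplot2) gives the figures the question asks for; training MSE typically keeps falling as span shrinks, while test MSE eventually rises again.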
2. Using the simulated data set wombat_stew.rda, answer the following questions.
  1. Fit a linear model using both lm and glm. Make a summary of each model fit, and you will see that they differ: lm reports \(R^2\) while glm reports deviance. What is the relationship between these two goodness-of-fit statistics? Explain it, and write the R code that shows \(R^2\) can be computed from the deviance.

  2. Make a plot of the residuals from the model against each predictor. Overlay a smoother on each. (Hint: The ggduo function from the GGally package can be useful here. You can plot a single \(Y\) variable against multiple \(X\) variables.) Explain why the linear model may not be appropriate for this data.

  3. Explore adding polynomial terms for some or all predictors to produce the best possible model fit. Report your best \(R^2\), the final fitted model, and the residual vs predictor plots.
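For part 1, the link between the two statistics is direct for a gaussian fit: glm's deviance is the residual sum of squares and its null deviance is the total sum of squares, so \(R^2 = 1 - \text{deviance}/\text{null deviance}\). A sketch with simulated stand-in data (wombat_stew.rda is not available here; the variable names are illustrative):

```r
# Stand-in data; replace with load("wombat_stew.rda") for the real exercise.
set.seed(1)
d <- data.frame(x1 = runif(100), x2 = runif(100), x3 = runif(100))
d$y <- 1 + 2 * d$x1 - d$x2^2 + 0.5 * d$x3 + rnorm(100, 0, 0.2)

fit_lm  <- lm(y ~ x1 + x2 + x3, data = d)
fit_glm <- glm(y ~ x1 + x2 + x3, data = d)  # gaussian family by default

# For a gaussian glm: deviance = RSS, null.deviance = TSS, hence
# R^2 = 1 - deviance / null.deviance.
r2_from_deviance <- 1 - fit_glm$deviance / fit_glm$null.deviance
all.equal(r2_from_deviance, summary(fit_lm)$r.squared)  # TRUE
```

For part 2, a data frame augmented with resid = residuals(fit_lm) can be passed to GGally::ggduo with the predictors as columnsX and the residual column as columnsY to produce the residual-vs-predictor panels.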