Do it yourself

These exercises are from the textbook.

  1. Work your way through the lab in Section 6.5.1.
  1. Which variables make the best 8-variable model?

AtBat, Hits, Walks, CHmRun, CRuns, CWalks, DivisionW, PutOuts
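
A minimal sketch of how this can be reproduced with the leaps package, assuming the Hitters data from the ISLR package with missing salaries removed (setting `nvmax = 19` also covers the next item, forcing all 19 predictors to be considered):

```r
library(ISLR)
library(leaps)

# Drop players with a missing Salary, as in the textbook lab
hitters <- na.omit(Hitters)

# Best subset selection, allowing models of up to all 19 predictors
fit_full <- regsubsets(Salary ~ ., data = hitters, nvmax = 19)

# Coefficients of the best 8-variable model
coef(fit_full, 8)
```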

  1. Force it to examine all possible (19) predictors.

  2. Plot the model fit diagnostics for the best model of each size.

  1. What would these diagnostics suggest about an appropriate choice of models? Do your results compare with the textbook results? Why or why not?

BIC suggests 6 variables: it gets worse after 6, improves again at 8, and then worsens once more. The other criteria suggest around 10. The textbook also settles on 6 variables, so the results here are similar.
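
One way to produce the diagnostics, assuming the `fit_full` object from the sketch above:

```r
fit_sum <- summary(fit_full)

# RSS, adjusted R^2, Cp and BIC for the best model of each size
par(mfrow = c(2, 2))
plot(fit_sum$rss,   type = "b", xlab = "Number of variables", ylab = "RSS")
plot(fit_sum$adjr2, type = "b", xlab = "Number of variables", ylab = "Adjusted R^2")
plot(fit_sum$cp,    type = "b", xlab = "Number of variables", ylab = "Cp")
plot(fit_sum$bic,   type = "b", xlab = "Number of variables", ylab = "BIC")

# Model size minimising BIC
which.min(fit_sum$bic)
```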

  1. Fit forward stepwise selection. How would the decision about the best model change?

Best subset selection (exhaustive search)

#  (Intercept)         Hits        Walks       CAtBat        CHits 
#   79.4509472    1.2833513    3.2274264   -0.3752350    1.4957073 
#       CHmRun    DivisionW      PutOuts 
#    1.4420538 -129.9866432    0.2366813

Forward selection

#  (Intercept)        AtBat         Hits        Walks         CRBI 
#  109.7873062   -1.9588851    7.4498772    4.9131401    0.8537622 
#       CWalks    DivisionW      PutOuts 
#   -0.3053070 -127.1223928    0.2533404

There is some disagreement between the final sets of variables. Some variables, though, are common to all of the fits, which suggests these are the important ones; the remaining variables may only marginally improve the fit.
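
A sketch of the stepwise fits with `regsubsets()`, assuming the `hitters` data set defined earlier; swapping the `method` argument gives the backward fit, so the same comparison answers the next question as well. The coefficients shown above are the best 7-variable models from each search:

```r
fit_fwd <- regsubsets(Salary ~ ., data = hitters, nvmax = 19, method = "forward")
fit_bwd <- regsubsets(Salary ~ ., data = hitters, nvmax = 19, method = "backward")

# Compare the best 7-variable models from the three searches
coef(fit_full, 7)
coef(fit_fwd, 7)
coef(fit_bwd, 7)
```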

  1. Does the model change with backward stepwise selection?

Backward selection

#  (Intercept)        AtBat         Hits        Walks        CRuns 
#  105.6487488   -1.9762838    6.7574914    6.0558691    1.1293095 
#       CWalks    DivisionW      PutOuts 
#   -0.7163346 -116.1692169    0.3028847

Yes, the selected variables change slightly: relative to forward selection, CRuns replaces CRBI. The conclusion is the same as above, though: the core variables are shared, and the methods disagree only at the margins.

  1. Now repeat the process with a training and test split, using the test set to help decide on the best model.
  1. Break the data into a 2/3 training and 1/3 test set.
  2. Fit the best subsets. Compute the mean squared error for the test set. Which model would it suggest? Is the subset of models similar to that produced on the full data set? Do your results compare with the textbook results? Why or why not?
#  [1] 148477.80 116021.69 118712.55 111536.83 107754.48 107859.91 105282.05
#  [8] 108949.97 110464.56 112739.27 104015.64 104284.84  99201.25  97924.35
# [15]  94048.84  95355.35  95633.08  95525.94  95533.85
#  (Intercept)        AtBat         Hits        HmRun         Runs 
#  229.0276144   -2.1856435    7.3981741   10.6892521   -3.6899424 
#        Walks       CAtBat       CHmRun        CRuns         CRBI 
#    6.6574954   -0.1135930   -1.7682710    1.4533970    0.8509893 
#       CWalks      LeagueN    DivisionW      PutOuts      Assists 
#   -0.7882906   57.0155082 -114.0506360    0.1301923    0.4426554 
#       Errors 
#   -5.8463526

The model changes: the selection of best subsets is different because only a subset of the data was used to train the model. Some of the chosen variables are the same as for the full data set, which suggests that these are the more important variables for estimating salary, while the remaining variables are only marginally useful.
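
A sketch of the training/test comparison, assuming a 2/3–1/3 random split; `regsubsets()` has no `predict()` method, so predictions are built by multiplying the test-set model matrix by the coefficient vector, as in the textbook lab. The exact numbers above depend on the random split:

```r
set.seed(1)
n     <- nrow(hitters)
train <- sample(seq_len(n), size = round(2 * n / 3))
test  <- setdiff(seq_len(n), train)

fit_train <- regsubsets(Salary ~ ., data = hitters[train, ], nvmax = 19)

# Test-set model matrix, then mean squared error for the best model of each size
test_mat <- model.matrix(Salary ~ ., data = hitters[test, ])
test_mse <- sapply(1:19, function(i) {
  beta <- coef(fit_train, i)
  pred <- test_mat[, names(beta)] %*% beta
  mean((hitters$Salary[test] - pred)^2)
})
test_mse
coef(fit_train, which.min(test_mse))
```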

  1. Try again with cross-validation.
  1. It is said that 10-fold cross-validation is a reasonable choice for dividing the data. What size data sets would this create for this data? Argue whether this is good or bad. With your selection of an appropriate \(k\) conduct the cross-validation.

There isn't a lot of data: with 10-fold cross-validation only about 20 cases are held out each time, which leads to substantial variability between the predictions from each fold. I have used 5 folds, which also produces results with considerable variability. The model is fairly weak, so it probably doesn't matter much. The choice of the number of variables is consistent for most \(k\), even though the variability in error is substantial.
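
A sketch of the cross-validation, assuming k = 5 and the `hitters` data from above; each fold's errors are kept so they can be plotted by fold in a later question:

```r
set.seed(2)
k     <- 5
folds <- sample(rep(1:k, length.out = nrow(hitters)))

cv_errors <- matrix(NA, nrow = k, ncol = 19)

for (j in 1:k) {
  fit_cv   <- regsubsets(Salary ~ ., data = hitters[folds != j, ], nvmax = 19)
  fold_mat <- model.matrix(Salary ~ ., data = hitters[folds == j, ])
  for (i in 1:19) {
    beta <- coef(fit_cv, i)
    pred <- fold_mat[, names(beta)] %*% beta
    cv_errors[j, i] <- mean((hitters$Salary[folds == j] - pred)^2)
  }
}

# Average test error across folds, for each model size
colMeans(cv_errors)
```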

  1. The book talks about a “model matrix”. What is this? Why does the code for cross-validation need to use this?

The model matrix is the numerical design matrix that `model.matrix()` builds from the formula: factor variables such as League, Division and NewLeague are expanded into 0/1 dummy variables (LeagueN, DivisionW, NewLeagueN) and an intercept column is added. Because `regsubsets()` has no `predict()` method, the cross-validation code multiplies this matrix by the estimated coefficients to compute predictions for the held-out fold.
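
A quick illustration of the dummy coding, assuming the `hitters` data from above:

```r
# Factors become 0/1 indicator columns in the design matrix
head(model.matrix(Salary ~ ., data = hitters)[, c("(Intercept)", "LeagueN", "DivisionW")])
```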

  1. Plot the test error against the size of model, coloured by the CV fold. Why does the test error vary by fold? What does the variation mean? What size model is suggested?

The variability between folds is high, but all folds point to a model of around 5-10 variables.
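
One way to draw that plot from the `cv_errors` matrix in the sketch above, using ggplot2 (any plotting approach works; `matplot()` is a base-graphics alternative):

```r
library(ggplot2)

# Reshape the k x 19 matrix of fold errors into long form for plotting
cv_long <- data.frame(
  fold  = factor(rep(1:k, times = 19)),
  nvars = rep(1:19, each = k),
  mse   = as.vector(cv_errors)
)

ggplot(cv_long, aes(x = nvars, y = mse, colour = fold)) +
  geom_line() +
  labs(x = "Number of variables", y = "Test MSE", colour = "Fold")
```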

  1. How do your results compare with the textbook analysis? Can you explain any discrepancies?

Results are similar; different random splits produce somewhat different results.

  1. Now we are going to examine regularisation, using the lasso.
  1. Using your results from questions 1-3, fit the best least squares model to your training set. Write down the mean squared error and coefficient estimates for the final model. We'll use these to compare with the lasso fit.
# [1] 94048.84
#  (Intercept)        AtBat         Hits        HmRun         Runs 
#  229.0276144   -2.1856435    7.3981741   10.6892521   -3.6899424 
#        Walks       CAtBat       CHmRun        CRuns         CRBI 
#    6.6574954   -0.1135930   -1.7682710    1.4533970    0.8509893 
#       CWalks      LeagueN    DivisionW      PutOuts      Assists 
#   -0.7882906   57.0155082 -114.0506360    0.1301923    0.4426554 
#       Errors 
#   -5.8463526
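
Assuming the objects from the training/test sketch above, these values can be pulled out directly (the chosen size is 15 variables for the split shown here):

```r
best_size <- which.min(test_mse)
test_mse[best_size]            # mean squared error of the chosen model
coef(fit_train, best_size)     # least squares estimates for that model
```
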
  1. Fit the lasso to a range of \(\lambda\) values. Plot the standardised coefficients against \(\lambda\). What does this suggest about the predictors?

There are just a few predictors that are important for predicting salary.
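
A sketch of the lasso fit with the glmnet package, assuming the same training rows as above. `glmnet()` needs a numeric model matrix rather than a data frame, standardises the predictors by default, and the coefficient profile plot shows most coefficients shrinking to zero quickly:

```r
library(glmnet)

# Numeric predictor matrix (drop the intercept column) and response vector
x <- model.matrix(Salary ~ ., data = hitters)[, -1]
y <- hitters$Salary

grid      <- 10^seq(10, -2, length.out = 100)   # wide range of lambda values
fit_lasso <- glmnet(x[train, ], y[train], alpha = 1, lambda = grid)

# Coefficient paths against log(lambda)
plot(fit_lasso, xvar = "lambda", label = TRUE)
```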

  1. Conduct a cross-validation
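
A minimal sketch of cross-validating the lasso to choose \(\lambda\) with `cv.glmnet()`, assuming the objects from the previous block:

```r
set.seed(3)
cv_lasso <- cv.glmnet(x[train, ], y[train], alpha = 1)
plot(cv_lasso)

# Test MSE at the lambda chosen by cross-validation, to compare with least squares
best_lambda <- cv_lasso$lambda.min
lasso_pred  <- predict(fit_lasso, s = best_lambda, newx = x[test, ])
mean((y[test] - lasso_pred)^2)
```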