Class discussion

This is a diagram explaining boosting. The three tree models in the top row are combined to give the boosted model in box 4. Come up with some words and sentences, together, to explain the process.

Compare with the explanation at https://www.hackerearth.com/practice/machine-learning/machine-learning-algorithms/beginners-tutorial-on-xgboost-parameter-tuning-r/tutorial/

How would a single tree with multiple splits fit this data? What is different about the two approaches?

It might be almost the same. The first two splits might be the same as in boxes 1 and 2. The third split would be made only on the subset of observations in the middle. The difference is that with boosting all observations are used at each split, but weighted differently; you could think of the single tree as having weights too, either 0 or 1.
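To make the re-weighting idea concrete, here is a minimal hand-rolled sketch of AdaBoost-style boosting with three one-split trees (stumps). The simulated data and all object names are illustrative only, not the data in the diagram:

library(rpart)

# hypothetical 1-d data: y = 1 at both ends, -1 in the middle
set.seed(1)
d <- data.frame(x = runif(100))
d$y <- ifelse(d$x < 0.3 | d$x > 0.7, 1, -1)

w <- rep(1 / nrow(d), nrow(d))   # equal weights to start
stumps <- list()
alpha <- numeric(3)
for (m in 1:3) {
  # every observation is used at every iteration, via the weights
  fit <- rpart(factor(y) ~ x, data = d, weights = w,
               control = rpart.control(maxdepth = 1, cp = 0, minsplit = 2))
  pred <- ifelse(predict(fit, d, type = "class") == "1", 1, -1)
  err <- sum(w * (pred != d$y)) / sum(w)
  alpha[m] <- 0.5 * log((1 - err) / err)   # vote weight for this stump
  w <- w * exp(alpha[m] * (pred != d$y))   # up-weight the mistakes
  w <- w / sum(w)
  stumps[[m]] <- fit
}

# boxes 1-3 correspond to the stumps; box 4 is their weighted vote
score <- Reduce(`+`, Map(function(f, a)
  a * ifelse(predict(f, d, type = "class") == "1", 1, -1), stumps, alpha))
table(truth = d$y, boosted = sign(score))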

Activities

Do it yourself

This exercise is based on the lab material in chapter 8 of the textbook, and on exercise 11. Solutions to the textbook exercise can be found at https://blog.princehonest.com/stat-learning/ch8/11.html.

  1. Use the Caravan data from the ISLR package. Read the data description.
  2. Compute the proportion of caravans purchased to not purchased. Is this a balanced class data set? What problem might be encountered in assessing the accuracy of the model as a consequence?
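One way to produce the counts below, assuming the dplyr package is loaded alongside ISLR:

library(ISLR)
library(dplyr)
data(Caravan)
Caravan %>% count(Purchase)   # purchased vs not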
# # A tibble: 2 x 2
#   Purchase     n
#   <fct>    <int>
# 1 No        5474
# 2 Yes        348

It's not a balanced data set, because there are very few caravan purchasers. It means that in assessing the model we need to look at the error for each class separately, because the overall error will be dominated by the error for non-purchasers.

  3. Convert the response variable from a factor to an integer variable, where 1 indicates that the person purchased a caravan (see the sketch after the next step).

  4. Break the data into 2/3 training and test set, ensuring that the same ratio of the response variable is achieved in both sets. Check that your sampling has produced this.
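A sketch covering this step and the previous one, using base R to sample within each class (the seed and the object names Caravan.train and Caravan.test are illustrative):

# recode the response so that 1 = purchased a caravan
Caravan$Purchase <- ifelse(Caravan$Purchase == "Yes", 1, 0)

# stratified 2/3 training split: sample separately within each class
set.seed(2021)
yes <- which(Caravan$Purchase == 1)
no  <- which(Caravan$Purchase == 0)
train_id <- c(sample(yes, round(2/3 * length(yes))),
              sample(no,  round(2/3 * length(no))))
Caravan.train <- Caravan[train_id, ]
Caravan.test  <- Caravan[-train_id, ]

# check: the purchase rate should match in both sets
mean(Caravan.train$Purchase)
mean(Caravan.test$Purchase)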

It does produce the same proportions in each group.

  5. The solution code on the unofficial solution web site:
library(ISLR)
train = 1:1000
Caravan$Purchase = ifelse(Caravan$Purchase == "Yes", 1, 0)
Caravan.train = Caravan[train, ]
Caravan.test = Caravan[-train, ]

would use just the first 1000 cases for the training set. What is wrong about doing this?

It may be that the first 1000 cases contain all the purchasers, or that these were the early customers. It's generally never good to take the first X cases for training, because we might be introducing a difference between the training and test sets. The test set should be similar to the training set.
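A quick check of this concern, using the original factor coding of Purchase:

# compare the purchase rate in the first 1000 rows with the remainder
prop.table(table(Caravan$Purchase[1:1000]))
prop.table(table(Caravan$Purchase[-(1:1000)]))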

  6. Here we will fit a boosted tree model using the gbm package.
  7. Use 1000 trees, and a shrinkage value of 0.01.
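A sketch of the fit, assuming the training set from the earlier step is named Caravan.train (the seed and object name are illustrative):

library(gbm)
set.seed(2021)
caravan.boost <- gbm(Purchase ~ ., data = Caravan.train,
                     distribution = "bernoulli",
                     n.trees = 1000, shrinkage = 0.01)
summary(caravan.boost)   # relative influence of each variable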
#               var   rel.inf
# PPERSAUT PPERSAUT 21.352325
# PPLEZIER PPLEZIER 19.134857
# PBRAND     PBRAND  9.660461
# PBYSTAND PBYSTAND  4.631457
# ALEVEN     ALEVEN  4.179288
# AFIETS     AFIETS  3.650101
  8. Make a plot of the oob improvement against iteration number. What does this suggest about the number of iterations needed? Why do you think the oob improvement value varies so much, and can also be negative?
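One way to draw the plot, assuming the fitted model above; the oobag.improve component of a gbm fit stores the out-of-bag estimate of the improvement in the loss at each iteration:

# oob improvement at each of the 1000 iterations
plot(caravan.boost$oobag.improve, type = "l",
     xlab = "iteration", ylab = "OOB improvement")
abline(h = 0, lty = 2)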

Around 300 iterations would likely be sufficient, because the improvement plateaus at 0 around that number. The variation in improvement means that some iterations produce worse results: re-weighting the observations will sometimes worsen the model.

  9. Compute the error for the test set, and for each class. Consider a proportion 0.2 or greater to indicate that the customer will purchase a caravan.
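A sketch of the predictions and the confusion table below, assuming the objects above; addmargins() adds the Sum row and column:

boost.prob <- predict(caravan.boost, newdata = Caravan.test,
                      n.trees = 1000, type = "response")
boost.pred <- ifelse(boost.prob >= 0.2, 1, 0)
addmargins(table(Caravan.test$Purchase, boost.pred,
                 dnn = c("", "boost.pred")))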
#      boost.pred
#          0    1  Sum
#   0   1762   62 1824
#   1     95   21  116
#   Sum 1857   83 1940

Overall = (62 + 95)/1940 = 0.08092784
Non-purchasers = 62/1824 = 0.03399123
Purchasers = 95/116 = 0.8189655
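The same errors can be computed directly from the table:

tab <- table(Caravan.test$Purchase, boost.pred)
(tab[1, 2] + tab[2, 1]) / sum(tab)   # overall error
tab[1, 2] / sum(tab[1, ])            # non-purchaser error
tab[2, 1] / sum(tab[2, ])            # purchaser error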

  10. What are the 6 most important variables? Make a plot of each to examine the relationship between these variables and the response. Explain what you learn from these plots.
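A sketch of one way to examine these relationships, assuming the fitted model and training set above; the six variable names are taken from the relative influence table:

library(dplyr)
library(tidyr)
library(ggplot2)

top6 <- c("PPERSAUT", "PPLEZIER", "PBRAND", "PBYSTAND", "ALEVEN", "AFIETS")
Caravan.train %>%
  select(all_of(top6), Purchase) %>%
  pivot_longer(all_of(top6), names_to = "var", values_to = "value") %>%
  ggplot(aes(x = factor(value), fill = factor(Purchase))) +
  geom_bar(position = "fill") +
  facet_wrap(~ var, scales = "free_x") +
  labs(x = "variable value", y = "proportion", fill = "Purchase")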