---
title: "ETC3250 2019 - Lab 8"
author: "SOLUTION"
date: "Week 8"
output:
html_document: default
---
```{r, echo = FALSE, message = FALSE, warning = FALSE, warning = FALSE}
knitr::opts_chunk$set(
echo = FALSE,
message = FALSE,
warning = FALSE,
error = FALSE,
collapse = TRUE,
comment = "#",
fig.height = 4,
fig.width = 6,
fig.align = "center",
cache = FALSE
)
```
# Class discussion
This is a diagram explaining boosting. The three tree models in the top row are combined to give the boosted model in box 4. Come up with a some words and sentences, together, to explain the process.
![](boosting.png)
**Compare with the explanation at https://www.hackerearth.com/practice/machine-learning/machine-learning-algorithms/beginners-tutorial-on-xgboost-parameter-tuning-r/tutorial/**
How would a single tree with multiple splits fit this data? What is different about the two approaches?
**It might be almost the same. The first two splits might be the same as box 1, 2. The third split would be only on the subset in the middle. The difference is that with boosting all observations are used each split, but weighted differently. You could think of the single tree as having weights too, either 0 or 1.**
# Activities
- Sign up for a kaggle account.
- Upload s set of predictions for the tennis data challenge.
# Do it yourself
This exercise is based on the lab material in chapter 8 of the textbook, and exercise 11. Solutions to the textbook exercise can be found at https://blog.princehonest.com/stat-learning/ch8/11.html.
1. Use the `Caravan` data from the `ISLR` package. Read the data description.
a. Compute the proportion of caravans purchased to not purchased. Is this a balanced class data set? What problem might be encountered in assessing the accuracy of the model as a consequence?
```{r}
library(tidyverse)
library(ISLR)
data(Caravan)
library(gbm)
library(randomForest)
library(xgboost)
library(caret)
Caravan %>% count(Purchase)
```
**Its not a balanced data set, because there are few caravan purchasers. It means that in assessing the model we need to separately look at the error for each class, because the overall error will be dominated by the error for non-purchasers.**
b. Convert the response variable from a factor to an integer variable, where 1 indicates that the person purchased a caravan.
```{r}
mycaravan <- Caravan %>% mutate(Purchase = as.integer(ifelse(Caravan$Purchase == "Yes", 1, 0)))
```
c. Break the data into 2/3 training and test set, ensuring that the same ratio of the response variable is achieved in both sets. Check that your sampling has produced this.
```{r}
set.seed(20190425)
tr_indx <- createDataPartition(mycaravan$Purchase, p=2/3)$Resample1
c_tr = mycaravan[tr_indx, ]
c_ts = mycaravan[-tr_indx, ]
```
**It does produce the same proportions in each group.**
d. The solution code on the unofficial solution web site:
```
library(ISLR)
train = 1:1000
Caravan$Purchase = ifelse(Caravan$Purchase == "Yes", 1, 0)
Caravan.train = Caravan[train, ]
Caravan.test = Caravan[-train, ]
```
would use just the first 1000 cases for the training set. What is wrong about doing this?
**It may be that the first 1000 cases contain all the purchasers, or that these were the early customers. Its generally never good to take the first X cases for training because we might be introducing a difference between training and test sets. The test set should be similar to the training set.**
2. Here we will fit a boosted tree model, using the `gbm` package.
a. Use 1000 trees, and a shrinkage value of 0.01.
```{r}
c_boost = gbm(Purchase ~ ., data = c_tr, n.trees = 1000, shrinkage = 0.01,
distribution = "bernoulli")
head(summary(c_boost, plotit=FALSE), 6)
```
b. Make a plot of the oob improvement against iteration number. What does this suggest about the number of iterations needed? Why do you think the oob improvement value varies so much, and can also be negative?
```{r}
c_boost_diag <- tibble(iter=1:1000, tr_err=c_boost$train.error, oob_improve=c_boost$oobag.improve)
#ggplot(c_boost_diag, aes(x=iter, y=tr_err)) + geom_line()
ggplot(c_boost_diag, aes(x=iter, y=oob_improve)) + geom_point() + geom_smooth()
```
**Probably around 300 iterations might be sufficient, because it plateaus at 0 around that number. The variation on improvement means that some iterations produce worse results. Re-weighting the observations will sometimes worsen the model.**
c. Compute the error for the test set, and for each class. Consider a proportion 0.2 or greater to indicate that the customer will purchase a caravan.
```{r}
boost.prob = predict(c_boost, c_ts, n.trees = 1000, type = "response")
boost.pred = ifelse(boost.prob > 0.2, 1, 0)
addmargins(table(c_ts$Purchase, boost.pred))
```
**Overall=(62+95)/1940=0.08092784, Non-purchasers=62/1824=0.03399123, Purchasers=95/116=0.8189655**
d. What are the 6 most important variables? Make a plot of each to examine the relationship between these variables and the response. Explain what you learn from these plots.
```{r}
library(gridExtra)
p1 <- ggplot(c_tr, aes(x=Purchase, y=PPLEZIER)) + geom_jitter(height=0, alpha=0.5)
p2 <- ggplot(c_tr, aes(x=factor(PPLEZIER), fill=factor(Purchase))) +
geom_bar(position="fill") + scale_fill_brewer("", palette="Dark2")
p3 <- ggplot(c_tr, aes(x=Purchase, y=PPERSAUT)) + geom_jitter(height=0.1, alpha=0.5)
p4 <- ggplot(c_tr, aes(x=factor(PPERSAUT), fill=factor(Purchase))) +
geom_bar(position="fill") + scale_fill_brewer("", palette="Dark2")
p5 <- ggplot(c_tr, aes(x=Purchase, y=PBRAND)) + geom_jitter(height=0.1, alpha=0.5)
p6 <- ggplot(c_tr, aes(x=factor(PBRAND), fill=factor(Purchase))) +
geom_bar(position="fill") + scale_fill_brewer("", palette="Dark2")
p7 <- ggplot(c_tr, aes(x=Purchase, y=MINKGEM)) + geom_jitter(height=0.1, alpha=0.5)
p8 <- ggplot(c_tr, aes(x=factor(MINKGEM), fill=factor(Purchase))) +
geom_bar(position="fill") + scale_fill_brewer("", palette="Dark2")
grid.arrange(p1, p2, p3, p4, p5, p6, p7, p8, ncol=2)
```
**My recommendation for making these plots is to use a side-by-side dotplot. But we learn immediately that the predictors are primarily categorical. Using jitter can help compare the purchasers against the non-purchasers. Its messy! There doesn't really look like the classification could be good. Another approach is to focus on the proportions in each category of the predictors, using stacked bar chart. This loses the count data, so its still important to use the dotplots, too. From the bar charts, though, it can be seen why the variables are chosen to be important, because some cagtegories of predictors have higher purchase rates. Still, its messy, and doesn't give much confidence about the ability to predict whether a customer will purchase a caravan.**
3. Here we will fit a random forest model, using the `randomForest` package.
a. Use 1000 trees, using a numeric response so that predictions will be a number between 0-1, and set `importance=TRUE`. (Ignore the warning about not having enough distinct values to use regression.)
```{r}
c_rf <- randomForest(Purchase~., data=c_tr, ntree=1000, importance=TRUE)
```
b. Compute the error for the test set, and for each class. Consider a proportion 0.2 or greater to indicate that the customer will purchase a caravan.
```{r}
rf.prob <- predict(c_rf, newdata=c_ts)
rf.pred <- ifelse(rf.prob > 0.2, 1, 0)
addmargins(table(c_ts$Purchase, rf.pred))
```
**Overall=(89+167)/1940=0.1319588, Non-purchasers=167/1824=0.09155702, Purchasers=89/116=0.0.7672414**
c. What are the 6 most important variables? Make a plot of any that are different from those chosen by `gbm`. How does the set of variables compare with those chosen by `gbm`.
```{r}
as_tibble(c_rf$importance) %>% bind_cols(var=rownames(c_rf$importance)) %>%
arrange(desc(IncNodePurity)) %>% print(n=6)
```
**Some are the same, but there are some new ones. None of them look any better when they are plotted, than those found by gbm.**
4. Here we will fit a gradient boosted model, using the `xgboost` package.
a. Read the description of the XGBoost technique at https://www.hackerearth.com/practice/machine-learning/machine-learning-algorithms/beginners-tutorial-on-xgboost-parameter-tuning-r/tutorial/, or other sources. Explain how this algorithm might differ from earlier boosted tree algorithms.
**My understanding is that it is primarily a better weighting optimisation. But this algorithm also seems to have some better calibration for over-fitting, and regularisation to reduce variables.**
b. Tune the model fit to determine how many iterations to make. Then fit the model, using the parameter set provided.
```{r}
c_tr_xg <- xgb.DMatrix(data = as.matrix(c_tr[,-86]), label = c_tr[,86])
c_ts_xg <- xgb.DMatrix(data = as.matrix(c_ts[,-86]), label = c_ts[,86])
params <- list(booster = "gbtree", objective = "binary:logistic", eta=0.3, gamma=0, max_depth=6, min_child_weight=1, subsample=1, colsample_bytree=1)
xgbcv <- xgb.cv(params = params, data = c_tr_xg, nrounds = 100, nfold = 5, showsd = T, stratified = T, print.every.n = 10, early.stop.round = 20, maximize = F)
c_xgb <- xgb.train(params = params, data = c_tr_xg, nrounds = 10, watchlist = list(val=c_ts_xg,train=c_tr_xg), print.every.n = 10, early.stop.round = 10, maximize = F , eval_metric = "error")
```
**This would suggest 10 iterations.**
b. Compute the error for the test set, and for each class. Consider a proportion 0.2 or greater to indicate that the customer will purchase a caravan.
```{r}
xgbpred <- predict(c_xgb, c_ts_xg)
xgbpred <- ifelse (xgbpred > 0.2, 1, 0)
addmargins(table(c_ts$Purchase, xgbpred))
```
**Overall=(114+83)/1940=0.1015464, Non-purchasers=114/1824=0.0625, Purchasers=83/116=0.7155172**
c. Compute the variable importance. What are the 6 most important variables? Make a plot of any that are different from those chosen by `gbm` or `randomForest`. How does the set of variables compare with the other two methods.
```{r}
c_xgb_importance <- xgb.importance(feature_names = colnames(c_tr[,-86]), model = c_xgb)
head(c_xgb_importance, 6)
```
**Some are the same, but there are some new ones. None of them look any better when they are plotted, than those found by gbm.**
5. Compare and summarise the results of the three model fits.
**The `xgboost` model has a more balanced overall performance. Ideally one might make a ROC curve for each model but I can't do that at the moment.**