Class discussion

For the olive oil data set, the classification tree will use just one variable for its model. It splits on linoleic acid as shown.

# n= 249 
# 
# node), split, n, loss, yval, (yprob)
#       * denotes terminal node
# 
# 1) root 249 98 3 (0.3935743 0.6064257)  
#   2) linoleic>=1053.5 98  0 2 (1.0000000 0.0000000) *
#   3) linoleic< 1053.5 151  0 3 (0.0000000 1.0000000) *
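
As a rough illustration of how a single-variable split such as the one above can be found, here is a minimal Python sketch. It assumes Gini impurity as the splitting criterion and takes candidate thresholds at midpoints between adjacent sorted values. The data are hypothetical toy values, not the olive oil measurements, and the function names `gini` and `best_split` are made up for this example.

```python
def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def best_split(x, y):
    """Return (threshold, weighted impurity) of the best split on x."""
    pairs = sorted(zip(x, y))
    xs = [p[0] for p in pairs]
    ys = [p[1] for p in pairs]
    n = len(xs)
    best = (None, float("inf"))
    for i in range(1, n):
        if xs[i] == xs[i - 1]:
            continue  # no threshold fits between equal values
        threshold = (xs[i - 1] + xs[i]) / 2  # midpoint of adjacent values
        impurity = (i * gini(ys[:i]) + (n - i) * gini(ys[i:])) / n
        if impurity < best[1]:
            best = (threshold, impurity)
    return best


# Hypothetical toy values, NOT the olive oil measurements.
linoleic = [700, 820, 950, 1040, 1060, 1100, 1200, 1300]
region = [3, 3, 3, 3, 2, 2, 2, 2]

threshold, impurity = best_split(linoleic, region)
print(threshold, impurity)  # prints 1050.0 0.0
```

The chosen threshold sits halfway between the two closest observations from different classes, so it depends entirely on which observations happen to lie nearest the border.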

Question 1: There is no gap between the two groups at the split. Do you think this might be a problem with future data? Why?

There is more uncertainty at the border when the groups are close to each other, and the particular choice of training set might change the rule substantially. It's likely that new observations falling in the region close to the border will often be incorrectly predicted.
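
The dependence on the training set can be sketched with a small resampling experiment in Python. Everything here is an assumption made for illustration: the data are simulated toy values (not the olive oil measurements) for two classes that touch with no gap, and `midpoint_rule` is a hypothetical stand-in for the split a tree would pick when the classes are perfectly separated on one variable.

```python
import random

random.seed(1)

# Hypothetical toy values, NOT the olive oil measurements: two classes that
# touch along one variable, with no gap between them.
low_class = [random.uniform(500, 1050) for _ in range(150)]
high_class = [random.uniform(1050, 1500) for _ in range(100)]


def midpoint_rule(low, high):
    """Split point halfway between the closest points of two separated classes."""
    return (max(low) + min(high)) / 2


thresholds = []
for _ in range(200):
    # Draw a new training set by sampling with replacement.
    low_sample = random.choices(low_class, k=len(low_class))
    high_sample = random.choices(high_class, k=len(high_class))
    thresholds.append(midpoint_rule(low_sample, high_sample))

spread = max(thresholds) - min(thresholds)
print(min(thresholds), max(thresholds), spread)
```

Each resample moves the threshold, because it is determined only by the few observations nearest the border. A new observation that falls between the smallest and largest thresholds would be classified differently depending on which training set was drawn.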

Now look at some other variables, oleic and arachidic acid, in relation to linoleic.

Question 2: If you got to choose two variables for splitting the two groups, which would you choose, oleic or arachidic, in association with linoleic?

My pick would be arachidic because there is a big gap between the regions, although the tricky part is that the gap is nonlinear.

Question 3: Suppose you work with linoleic and arachidic. Would quadratic discriminant analysis produce a better separation than the tree? Argue your viewpoint.

No, it doesn’t. You might think so because it produces a quadratic boundary, and yes, that’s true, but the quadratic is driven by the classes having different elliptical variance-covariance, not by a nonlinear gap. The variance-covariance for region 3 is not elliptical, so the boundary that QDA fits need not follow the gap.
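
To see where the quadratic comes from, it may help to write out the standard QDA discriminant. This is a sketch under the usual assumption that each class is Gaussian. The symbols are not in the original and are introduced here: $\mu_k$, $\Sigma_k$ and $\pi_k$ are the mean, variance-covariance matrix and prior probability of class $k$, with $k = 2, 3$ matching the region labels in the tree output.

```latex
\delta_k(x) = -\tfrac{1}{2}\log|\Sigma_k|
              - \tfrac{1}{2}(x-\mu_k)^\top \Sigma_k^{-1}(x-\mu_k)
              + \log \pi_k

% The boundary is the set of x where \delta_2(x) = \delta_3(x):
-\tfrac{1}{2}\, x^\top\!\left(\Sigma_2^{-1} - \Sigma_3^{-1}\right) x
  + x^\top\!\left(\Sigma_2^{-1}\mu_2 - \Sigma_3^{-1}\mu_3\right) + c = 0,

c = -\tfrac{1}{2}\mu_2^\top \Sigma_2^{-1}\mu_2
    + \tfrac{1}{2}\mu_3^\top \Sigma_3^{-1}\mu_3
    - \tfrac{1}{2}\log|\Sigma_2| + \tfrac{1}{2}\log|\Sigma_3|
    + \log\frac{\pi_2}{\pi_3}
```

The quadratic term vanishes when $\Sigma_2 = \Sigma_3$, so the curvature of the boundary reflects only the difference between the two fitted variance-covariance matrices, not the shape of the gap between the groups.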

Question 4: Find a linear combination of linoleic and arachidic, and create a new variable to pass to the tree. Re-fit the tree with this variable instead of the original two. What does the model look like now? Is this better than the original tree?

# n= 249 
# 
# node), split, n, loss, yval, (yprob)
#       * denotes terminal node
# 
# 1) root 249 98 3 (0.3935743 0.6064257)  
#   2) linoarach>=109.0926 98  0 2 (1.0000000 0.0000000) *
#   3) linoarach< 109.0926 151  0 3 (0.0000000 1.0000000) *

The tree has the same structure as before, a single split with no training errors, but the model looks better. Even though it's only a linear boundary, it is more robust, because this combination of variables gives a bigger gap between the two classes.
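
The effect of the new variable on the gap can be sketched in Python. The source does not state how `linoarach` was constructed, so the coefficients below are hand-picked assumptions, the data are hypothetical toy values rather than the olive oil measurements, and `relative_gap` is a helper invented for this example. It divides the gap by the range of the variable so that variables on different scales can be compared.

```python
# Hypothetical toy values, NOT the olive oil measurements.
# Each row is (linoleic, arachidic, region).
rows = [
    (100, 10, 3), (104, 30, 3), (90, 20, 3), (80, 40, 3),
    (106, 80, 2), (110, 60, 2), (120, 90, 2), (130, 70, 2),
]

# Hand-picked coefficients for illustration only (an assumption).
A, B = 1, 1


def relative_gap(values, labels, low_label, high_label):
    """Gap between the closest points of two classes, as a fraction of the range."""
    low = [v for v, y in zip(values, labels) if y == low_label]
    high = [v for v, y in zip(values, labels) if y == high_label]
    return (min(high) - max(low)) / (max(values) - min(values))


linoleic = [r[0] for r in rows]
region = [r[2] for r in rows]

# New variable: a linear combination of the two original variables.
combo = [A * r[0] + B * r[1] for r in rows]

gap_single = relative_gap(linoleic, region, 3, 2)
gap_combo = relative_gap(combo, region, 3, 2)

# A single split on the new variable, placed in the middle of the gap.
low_max = max(v for v, y in zip(combo, region) if y == 3)
high_min = min(v for v, y in zip(combo, region) if y == 2)
split = (low_max + high_min) / 2

print(gap_single, gap_combo, split)  # prints 0.04 0.36 152.0
```

In this toy example the split on the combined variable leaves far more room on either side of the threshold than the split on linoleic alone, which is the sense in which the re-fitted tree is more robust.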

Question 5: In general, why is it often important to create new variables (feature engineering) when building models?

Often the problem being studied is better described by a specific set of variables, and obtaining these may require transforming or combining the original variables, as the linear combination of linoleic and arachidic did here.