Class discussion

For the olive oil data set, the classification tree uses just one variable: it splits on linoleic acid, as shown in the output below.

# n= 249 
# node), split, n, loss, yval, (yprob)
#       * denotes terminal node
# 1) root 249 98 3 (0.3935743 0.6064257)  
#   2) linoleic>=1053.5 98  0 2 (1.0000000 0.0000000) *
#   3) linoleic< 1053.5 151  0 3 (0.0000000 1.0000000) *
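
To see where a split value such as 1053.5 can come from, here is a minimal sketch in Python of a single-variable split search. It is an illustration only, not the routine that produced the output above: it assumes the threshold is placed midway between two adjacent observed values and that candidate splits are scored by Gini impurity. The data are synthetic, and the names `gini` and `best_split` are made up for this example.

```python
def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))


def best_split(values, labels):
    """Return (threshold, weighted_impurity) for the best single split."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = (None, float("inf"))
    for i in range(1, n):
        lo, hi = pairs[i - 1][0], pairs[i][0]
        if lo == hi:
            continue  # no threshold fits between tied values
        threshold = (lo + hi) / 2  # midpoint between adjacent values
        left = [lab for v, lab in pairs if v < threshold]
        right = [lab for v, lab in pairs if v >= threshold]
        impurity = (len(left) * gini(left) + len(right) * gini(right)) / n
        if impurity < best[1]:
            best = (threshold, impurity)
    return best


# Synthetic values (NOT the olive oil data), chosen so the two groups
# sit on either side of a narrow gap.
linoleic = [1190, 1120, 1057, 1050, 980, 900]
region = [2, 2, 2, 3, 3, 3]

threshold, impurity = best_split(linoleic, region)
print(threshold, impurity)
```

In this toy data the chosen threshold depends entirely on the two closest observations from opposite groups, which is worth keeping in mind for Question 1.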

Question 1: There is no gap between the two groups at the split on linoleic acid. Do you think this might be a problem with future data? Why?

Now look at two other variables, oleic and arachidic acid, in relation to linoleic.

Question 2: If you could choose two variables to separate the two groups, which would you pair with linoleic: oleic or arachidic?

Question 3: Suppose you work with linoleic and arachidic. Would quadratic discriminant analysis produce a better separation than the tree? Argue your viewpoint.

Question 4: Find a linear combination of linoleic and arachidic, and create a new variable to pass to the tree. Re-fit the tree with this variable instead of the original two. What does the model look like now? Is this better than the original tree?
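
As a starting point for Question 4, the sketch below (in Python, with synthetic data) shows the mechanics of building a new variable from a linear combination and checking how well it separates the groups. The coefficients `a` and `b` are hypothetical placeholders; in practice you might read them off a scatterplot or take them from a linear discriminant analysis. The helper `relative_gap` is made up for this example and measures the gap between the groups as a fraction of the variable's full range, so that variables on different scales can be compared.

```python
def relative_gap(values, labels, upper, lower):
    """Gap between the two classes, as a fraction of the variable's range.

    `upper` is the class expected to have the larger values.
    A negative result means the classes overlap on this variable.
    """
    top = min(v for v, lab in zip(values, labels) if lab == upper)
    bottom = max(v for v, lab in zip(values, labels) if lab == lower)
    return (top - bottom) / (max(values) - min(values))


# Synthetic values (NOT the olive oil data).
linoleic = [1190, 1120, 1057, 1050, 980, 900]
arachidic = [70, 75, 80, 10, 15, 20]
region = [2, 2, 2, 3, 3, 3]

# Hypothetical coefficients for the linear combination.
a, b = 1.0, 5.0
combo = [a * x + b * y for x, y in zip(linoleic, arachidic)]

gap_linoleic = relative_gap(linoleic, region, upper=2, lower=3)
gap_combo = relative_gap(combo, region, upper=2, lower=3)
print("relative gap on linoleic:", round(gap_linoleic, 3))
print("relative gap on combination:", round(gap_combo, 3))
```

The new variable `combo` is what you would pass to the tree in place of the two original variables. Whether it gives a wider gap on the real data depends on the coefficients you choose.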

Question 5: In general, why is it often important to create new variables (feature engineering) when building models?


If you need to use the lab time to coordinate the team submission for assignment 3, please do.