# Class discussion

For the olive oil data set, the classification tree uses just one variable in its model: it splits on linoleic acid, as shown below.

    # n= 249
    #
    # node), split, n, loss, yval, (yprob)
    #       * denotes terminal node
    #
    # 1) root 249 98 3 (0.3935743 0.6064257)
    #   2) linoleic>=1053.5 98  0 2 (1.0000000 0.0000000) *
    #   3) linoleic< 1053.5 151  0 3 (0.0000000 1.0000000) *
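
The numbers in the root line can be checked by hand. The sketch below is plain Python, not the software that produced the output. It recomputes the class proportions from the counts 98 and 151, and the impurity before and after the split. Using the Gini index is an assumption here: the output does not say which splitting criterion was used.

```python
# Counts read off the tree output: 98 cases of class 2, 151 cases of class 3.
n_class2, n_class3 = 98, 151
n_total = n_class2 + n_class3


def gini(counts):
    """Gini impurity of a node, given its class counts."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)


# Class proportions at the root, i.e. the (yprob) column for node 1.
root_probs = (n_class2 / n_total, n_class3 / n_total)

# Impurity at the root (assumed criterion: Gini).
root_gini = gini([n_class2, n_class3])

# After the split each child holds a single class, so both children are pure.
left_gini = gini([n_class2, 0])
right_gini = gini([0, n_class3])
weighted_child_gini = (n_class2 * left_gini + n_class3 * right_gini) / n_total

print("root proportions:", root_probs)
print("root Gini:", round(root_gini, 4))
print("weighted child Gini:", weighted_child_gini)
```

The root "loss" of 98 is the number of cases that would be misclassified if every case were assigned to class 3, the majority class. A loss of 0 in both children means the single split classifies this sample perfectly.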

Question 1: The tree separates the two groups perfectly, but there is no gap between them at the split point on linoleic acid. Do you think this might be a problem with future data? Why?

Now look at two other variables, oleic and arachidic acid, in relation to linoleic acid.

Question 2: If you could use two variables to separate the two groups, which would you pair with linoleic: oleic or arachidic?

Question 3: Suppose you work with linoleic and arachidic. Would quadratic discriminant analysis produce a better separation than the tree? Argue your viewpoint.

Question 4: Find a linear combination of linoleic and arachidic, and create a new variable to pass to the tree. Re-fit the tree with this variable instead of the original two. What does the model look like now? Is this better than the original tree?
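
As a rough illustration of the mechanics only, the sketch below builds a new variable from two existing ones and compares a single split on the original variable with a single split on the combination. Everything in it is hypothetical: the nine data points are made up (they are not olive oil measurements), the coefficients `(1, -1)` are arbitrary, and `best_threshold` is a helper written for this example that counts misclassifications instead of using the impurity measure a real tree would use.

```python
def best_threshold(values, labels):
    """Exhaustive search for a one-variable split, like a single tree node.

    Returns (threshold, errors, gap): the midpoint threshold with the fewest
    misclassified points, that error count, and the distance between the two
    neighbouring values the threshold sits between.
    """
    classes = sorted(set(labels))
    order = sorted(set(values))
    best = None
    for lo, hi in zip(order, order[1:]):
        thr = (lo + hi) / 2
        left = [lab for v, lab in zip(values, labels) if v < thr]
        right = [lab for v, lab in zip(values, labels) if v >= thr]
        # Each side predicts its majority class; the rest count as errors.
        errors = 0
        for side in (left, right):
            errors += len(side) - max(side.count(c) for c in classes)
        candidate = (errors, thr, hi - lo)
        if best is None or candidate[0] < best[0]:
            best = candidate
    errors, thr, gap = best
    return thr, errors, gap


# Hypothetical two-variable data: (x1, x2) pairs and a class label.
x1 = [10, 20, 30, 40, 49, 51, 60, 70, 80]
x2 = [0, 0, 10, 20, 30, 0, 10, 20, 30]
labels = ["a", "a", "a", "a", "a", "b", "b", "b", "b"]

# New variable: a linear combination with arbitrary illustrative weights.
w1, w2 = 1, -1
combo = [w1 * a + w2 * b for a, b in zip(x1, x2)]

split_x1 = best_threshold(x1, labels)
split_combo = best_threshold(combo, labels)

print("split on x1:   ", split_x1)
print("split on combo:", split_combo)
```

On this made-up data both splits make no errors, but the split on the combined variable sits in a much wider gap than the split on `x1` alone. Whether something similar happens with linoleic and arachidic depends on the coefficients you choose.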

Question 5: In general, why is it often important to create new variables (feature engineering) when building models?

# Activities

If you need to use the lab time to coordinate the team submission for assignment 3, please do.