Exercises

  1. About the data: The chocolates data was compiled by students in a previous class of Prof Cook, by collecting nutrition information on the chocolates as listed on their internet sites. All numbers were normalised to be equivalent to a 100g serving. Units of measurement are listed in the variable name.
  1. (2) Use the tour, with type of chocolate mapped to colour, and write a paragraph on whether the two types of chocolate differ on the nutritional variables.

The two groups are different from each other, but they are not separated groups. There are some chocolates of each type that are more similar to the other type on these nutritional characteristics.

  1. (2) Make a parallel coordinate plot of the chocolates, coloured by type, with the variables sorted by how well they separate the groups. The “uniminmax” scaling may work best for this data. Write a paragraph explaining how the types of chocolates differ in nutritional characteristics.

The pattern of the lines differs for the two types of chocolate. Milk chocolates tend to have higher values on the variables Na, Chol, Sugars and Carbs, and Dark chocolates tend to have higher values on Fiber, TotFat, CalFat and SatFat. Protein and Calories tend not to differ between the two types.

  1. (2) Identify one chocolate that is masquerading as dark, that is, one that is labelled dark but nutritionally looks more like a milk chocolate. Explain your answer.

Mars Dark chocolate is one that appears to be more nutritionally similar to milk chocolate.

  1. Subset the olive oil data to regions 2 and 3 only.
    1. Fit a linear discriminant analysis model.
    2. (3) Write down the rule. Make it clear which region is class 1 and which is class 2, relative to the formula in the notes.
## Call:
## lda(region ~ ., data = olive[, -c(1, 3)])
## 
## Prior probabilities of groups:
##         2         3 
## 0.3935743 0.6064257 
## 
## Group means:
##   palmitic palmitoleic  stearic    oleic  linoleic linolenic arachidic
## 2 1111.347     96.7449 226.1837 7268.020 1196.5306  27.09184  73.17347
## 3 1094.801     83.7351 230.8013 7793.053  727.0331  21.78808  37.57616
##   eicosenoic
## 2   1.938776
## 3   1.973510
## 
## Coefficients of linear discriminants:
##                      LD1
## palmitic     0.003883828
## palmitoleic  0.013671576
## stearic      0.017953780
## oleic        0.003574435
## linoleic    -0.007004066
## linolenic   -0.012046402
## arachidic   -0.019400750
## eicosenoic  -0.120815557
##        LD1
## 2 25.31806
## 3 31.07424
##           LD1
## [1,] 28.19615

Assign a new observation to region 3 (and to region 2 otherwise) if

\[0.003883828\,\text{palmitic} + 0.013671576\,\text{palmitoleic} + 0.017953780\,\text{stearic} + 0.003574435\,\text{oleic} - 0.007004066\,\text{linoleic} - 0.012046402\,\text{linolenic} - 0.019400750\,\text{arachidic} - 0.120815557\,\text{eicosenoic} - 28.19615 > 0,\]

where 28.19615 is the midpoint \((25.31806 + 31.07424)/2\) of the two group means on LD1, shown in the output above.
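The rule can be checked by hand in base R, using the coefficients and group means copied from the lda output above (a sketch; `coefs`, `cutoff` and `classify` are hypothetical names, not part of the model output):

```r
# Coefficients of LD1, copied from the lda output above
coefs <- c(0.003883828, 0.013671576, 0.017953780, 0.003574435,
           -0.007004066, -0.012046402, -0.019400750, -0.120815557)
# Cutoff: midpoint of the two group means on LD1
cutoff <- (25.31806 + 31.07424)/2   # 28.19615
# Assign region 3 if the LD1 score exceeds the cutoff, region 2 otherwise
classify <- function(obs) if (sum(coefs*obs) > cutoff) 3 else 2

# Sanity check: the region-3 group means project above the cutoff
classify(c(1094.801, 83.7351, 230.8013, 7793.053,
           727.0331, 21.78808, 37.57616, 1.973510))  # 3
```

Projecting the region-2 group means the same way gives a score of about 25.3, below the cutoff, so they are assigned to region 2 as expected.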

  1. This question is about decision trees. Here is a sample data set to work with:
# A tibble: 7 x 3
id    x     class
1    -4     1
2     1     1
3     3     2
4     4     1
5     5     1
6     6     2
7     8     2
  1. (2) Write down the formula for the impurity metric, entropy. Show that the entropy \(-p\log p-(1-p)\log(1-p)\) has its highest value at \(p=0.5\). Explain why a value of 0.5 leads to the worst possible split.

\[-p\log(p)-(1-p)\log(1-p)\]

The derivative with respect to \(p\) is \(\log((1-p)/p)\), which is zero when \(p = 0.5\); the second derivative, \(-1/(p(1-p))\), is negative, so this is the maximum. A value of 0.5 corresponds to a subset with equal numbers of observations from each class. Thus, it is the most mixed, least pure, group possible.
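A quick numerical check of the maximum, in base R (a sketch; `optimize` searches the open interval to avoid the undefined endpoints):

```r
# Two-class entropy as a function of the class-1 proportion p
entropy <- function(p) -p*log(p) - (1-p)*log(1-p)

# Numerically locate the maximum over (0, 1)
opt <- optimize(entropy, interval = c(0.01, 0.99), maximum = TRUE)
opt$maximum   # ~0.5
entropy(0.5)  # log(2), the largest possible value
```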

  1. (2) Write a function to compute the impurity measure, entropy, for a data partition.
myentropy <- function(p) {
  # Entropy of a two-class subset with class-1 proportion p
  if (p > 0 && p < 1) {
    ent <- -p*log(p) - (1-p)*log(1-p)
  } else {
    ent <- 0
  }
  return(ent)
}

# This only works for two classes, one variable
mysplit <- function(x, spl, cl) {
  n <- length(x)
  cl_unique <- unique(cl)
  # Partition the observations at the split value
  left <- x[x < spl]
  cl_left <- cl[x < spl]
  n_l <- length(left)
  right <- x[x >= spl]
  cl_right <- cl[x >= spl]
  n_r <- length(right)
  # Proportion of the first class on each side
  p_l <- length(cl_left[cl_left == cl_unique[1]])/n_l
  p_r <- length(cl_right[cl_right == cl_unique[1]])/n_r
  if (is.na(p_l)) p_l <- 0.5  # empty side: its weight n_l/n is 0 anyway
  if (is.na(p_r)) p_r <- 0.5
  # Impurity is the size-weighted average of the entropy on each side
  impurity <- (n_l/n)*myentropy(p_l) + (n_r/n)*myentropy(p_r)
  return(impurity)
}
  1. (2) Use your function to compute the entropy impurity measure for every possible split of the sample data.
## # A tibble: 6 x 2
##   split   imp
##   <dbl> <dbl>
## 1  -1.5 0.594
## 2   2   0.481
## 3   3.5 0.669
## 4   4.5 0.594
## 5   5.5 0.357
## 6   7   0.546
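The table above can be reproduced with a self-contained base-R sketch (the entropy is re-implemented inline; the candidate splits are the midpoints between consecutive sorted x values):

```r
# Data from the sample set above
x  <- c(-4, 1, 3, 4, 5, 6, 8)
cl <- c( 1, 1, 2, 1, 1, 2, 2)

# Two-class entropy for class-1 proportion p
entropy <- function(p) {
  if (p <= 0 || p >= 1) return(0)
  -p*log(p) - (1-p)*log(1-p)
}

# Size-weighted entropy impurity of splitting x at value spl
split_impurity <- function(spl) {
  left  <- cl[x <  spl]
  right <- cl[x >= spl]
  (length(left)/length(x))*entropy(mean(left == 1)) +
    (length(right)/length(x))*entropy(mean(right == 1))
}

# Candidate splits: midpoints between consecutive sorted x values
splits <- (head(sort(x), -1) + tail(sort(x), -1))/2
imp <- sapply(splits, split_impurity)
round(imp, 3)  # 0.594 0.481 0.669 0.594 0.357 0.546
```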
  1. (2) Make a plot of your splits and the impurity measure. Where would the split be made?
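A base-R sketch of the plot, using the split and impurity values from the table above. The minimum impurity (0.357) occurs at the split 5.5, between x = 5 and x = 6, so that is where the split would be made:

```r
# Split values and impurities from the table above
splits <- c(-1.5, 2, 3.5, 4.5, 5.5, 7)
imp    <- c(0.594, 0.481, 0.669, 0.594, 0.357, 0.546)

plot(splits, imp, type = "b", xlab = "split", ylab = "entropy impurity")
abline(v = splits[which.min(imp)], lty = 2)  # best split, at 5.5
```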

  1. (2) Subset the olive oil data to regions 2 and 3 only. Fit a classification tree to this data. Summarise the model fit, by writing out the decision tree.
## n= 249 
## 
## node), split, n, deviance, yval
##       * denotes terminal node
## 
## 1) root 249 59.42972 2.606426  
##   2) linoleic>=1053.5 98  0.00000 2.000000 *
##   3) linoleic< 1053.5 151  0.00000 3.000000 *
  1. (2) Compute the entropy impurity measure for all possible splits on linoleic acid. Plot this against the splits. Explain where the best split is.

  2. (3) Compute the entropy impurity measure for all possible splits on all of the other variables, except for eicosenoic acid. Plot all of these values against the splits, all six plots. Are there other candidates for splitting that are almost as good as the one chosen by the tree? Explain your reasoning.