Instructions

Marks

Exercises

  1. About the data: The chocolates data was compiled by students in a previous class of Prof Cook, by collecting nutrition information on the chocolates as listed on their internet sites. All numbers were normalised to be equivalent to a 100g serving. Units of measurement are listed in the variable name.

    1. Use the tour, with type of chocolate mapped to colour, and write a paragraph on whether the two types of chocolate differ on the nutritional variables.

    2. Make a parallel coordinate plot of the chocolates, coloured by type, with the variables sorted by how well they separate the groups. The “uniminmax” scaling might work best for this data. Write a paragraph explaining how the types of chocolates differ in nutritional characteristics.

    3. Identify one chocolate that is masquerading as dark, that is, one labelled dark that nutritionally looks more like a milk chocolate. Explain your answer.

  2. Subset the olive oil data to regions 2, 3 only.
    1. Fit a linear discriminant analysis model.
    2. Write down the rule. Make it clear which region is class 1 and which is class 2 relative to the formula in the notes.
  3. This question is about decision trees. Here is a sample data set to work with:

    # A tibble: 7 x 3
    id    x     class
    1    -4     1
    2     1     1
    3     3     2
    4     4     1
    5     5     1
    6     6     2
    7     8     2
    1. Write down the formula for the impurity metric, entropy. Show that the entropy function (\(-p\log(p)-(1-p)\log(1-p)\)) has its highest value at \(p=0.5\). Does it matter which base (2, e, 10, …) you use for the log? Explain why a value of 0.5 leads to the worst possible split.
    2. Write a function to compute the entropy impurity measure for a data partition.
    3. Use your function to compute the entropy impurity measure for every possible split of the sample data.
    4. Make a plot of your splits and the impurity measure. Where would the split be made?
    5. Subset the olive oil data to regions 2, 3 only. Fit a classification tree to this data. Summarise the model fit, by writing out the decision tree.
    6. Compute entropy impurity measure for all possible splits on linoleic acid. Plot this against the splits. Explain where the best split is.
    7. Compute entropy impurity measure for all possible splits on all of the other variables, except for eicosenoic acid. Plot these values against the split for each variable, giving six plots in all. Are there other possible candidates for splitting that are almost as good as the one chosen by the tree? Explain your answer.
  4. In the simulated data provided, vis_challenge.csv, determine how many groups there are, and whether there are any outliers. Explain your answers.
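
As a starting point for the entropy parts of exercise 3, the following is a minimal sketch in base R, not a model answer. It rests on several assumptions that are not stated in the exercise: log base 2 is used, the impurity of a split is taken as the size-weighted average of the entropy on each side, and candidate splits are placed at midpoints between consecutive sorted values of x. The names `entropy`, `split_impurity`, `splits` and `impurity` are hypothetical; the data values are copied from the sample table.

```r
# Sample data from exercise 3 (values copied from the table in the exercise)
d <- data.frame(
  x     = c(-4, 1, 3, 4, 5, 6, 8),
  class = c(1, 1, 2, 1, 1, 2, 2)
)

# Entropy of one group of class labels (assumption: log base 2).
entropy <- function(cl) {
  p <- table(cl) / length(cl)
  p <- p[p > 0]            # drop empty classes so 0 * log(0) never occurs
  -sum(p * log2(p))
}

# Impurity of a split at threshold s (assumption: size-weighted
# average of the entropy of the left and right partitions).
split_impurity <- function(x, cl, s) {
  left  <- cl[x < s]
  right <- cl[x >= s]
  (length(left) * entropy(left) + length(right) * entropy(right)) / length(cl)
}

# Candidate splits (assumption: midpoints between consecutive sorted x values)
xs <- sort(unique(d$x))
splits <- (head(xs, -1) + tail(xs, -1)) / 2

# Impurity at every candidate split, collected into a table
impurity <- sapply(splits, function(s) split_impurity(d$x, d$class, s))
data.frame(split = splits, impurity = impurity)
```

Plotting `impurity` against `splits`, for example with `plot(splits, impurity, type = "b")`, gives the picture asked for in part 4. The same two functions could be reused on a single olive oil variable by passing that column as `x` and the region as `cl`.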