Class discussion exercises

Textbook questions, chapter 2: 1, 2, 4

Question 1

  1. better performance
  2. worse performance
  3. better performance
  4. worse performance

Question 2

  1. regression and inference
  2. classification and prediction
  3. regression and prediction

Question 4

Many different answers are possible here; try to collect the responses from students.

  1. Classification: spam filters, credit application success, species (animals, plants) labelling

Spam filter: the response is ham or spam; the predictors are sender (from), subject, words used, …; this is a prediction problem.

  1. Regression: performance in sports, characteristics that lead to exam scores

Performance in sports: the response is fatigue; the predictors are length of match, number of rallies, score differential, …; this is probably an inference problem, aimed at understanding what drives fatigue, but it becomes a prediction problem if the goal is to identify players who need interventions.

  1. Clustering: grouping stamps, paintings, companies

Do it yourself

Textbook question 7

       Obs.   X1   X2   X3  Distance(0, 0, 0)   Y
       ---------------------------------------------
       1      0    3    0   3                   Red 
       2      2    0    0   2                   Red
       3      0    1    3   sqrt(10) ~ 3.2      Red
       4      0    1    2   sqrt(5) ~ 2.2       Green
       5      -1   0    1   sqrt(2) ~ 1.4       Green
       6      1    1    1   sqrt(3) ~ 1.7       Red
  1. Green. Observation #5 is the closest neighbor for K = 1.

  2. Red, because it is the most common of the three responses

  3. Red. Observations #2, #5 and #6 are the closest neighbors for K = 3: #2 is Red, #5 is Green, and #6 is Red, so Red wins the vote 2 to 1.
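
The distances and the K = 1 and K = 3 votes above can be checked with a few lines of base R. This is a minimal sketch rather than the textbook's method: the names `train`, `x0`, `dist0` and `knn_vote` are made up for illustration, and ties in the vote are simply broken in favour of the first class alphabetically.

```r
# Training observations, copied from the table above
train <- data.frame(
  X1 = c(0, 2, 0, 0, -1, 1),
  X2 = c(3, 0, 1, 1, 0, 1),
  X3 = c(0, 0, 3, 2, 1, 1),
  Y  = c("Red", "Red", "Red", "Green", "Green", "Red")
)

# Test point (hypothetical name): the origin
x0 <- c(0, 0, 0)

# Euclidean distance from each observation to x0
X <- as.matrix(train[, c("X1", "X2", "X3")])
dist0 <- sqrt(rowSums(sweep(X, 2, x0)^2))

# Illustrative helper: majority vote among the k nearest neighbours
knn_vote <- function(k) {
  nearest <- order(dist0)[seq_len(k)]   # indices of the k closest observations
  votes <- table(train$Y[nearest])      # count the class labels among them
  names(votes)[which.max(votes)]        # most frequent label (first wins ties)
}

round(dist0, 1)   # compare with the Distance column of the table
knn_vote(1)       # prediction for K = 1
knn_vote(3)       # prediction for K = 3
```

For real data a packaged implementation would normally be preferable; the hand-rolled version is only meant to make the neighbour ordering visible.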

Practice

Complete these exercises by writing your responses into an Rmarkdown document. Give your Rmd file to another group member and see if they can knit it to HTML.

  1. Download the chocolates data set, and read it into R (we recommend using read_csv from the tidyverse suite).

About the data: The chocolates data was compiled by students in a previous class of Prof Cook, by collecting nutrition information on the chocolates as listed on their internet sites. All numbers were normalised to be equivalent to a 100g serving. Units of measurement are listed in the variable name.

library(tidyverse)
choc <- read_csv("http://monba.dicook.org/data/chocolates.csv")
  1. Take a look at the type of variables in the data. If your question is “How do milk and dark chocolates differ?” what type of problem have you got?

This is a classification problem: the response, Type, is categorical (Milk or Dark).

  1. Compute the means and standard deviations for milk and dark on each of the variables. Make a nice table summary. (Try using the pipe operator, with the wrangling verbs group_by and summarise, and make the table with the kableExtra package.)
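library(kableExtra)
choc %>%
  # Long form: one row per chocolate and nutrient
  # (gather() still works but is superseded by pivot_longer() in current tidyr)
  gather(var, value, Calories:Protein_g) %>%
  group_by(Type, var) %>%
  # Mean and standard deviation for each type and nutrient
  summarise(mean = mean(value), sd = sd(value)) %>%
  # Put Dark and Milk side by side for each nutrient
  arrange(var) %>%
  kable(digits = 1) %>%
  kable_styling(bootstrap_options = "striped", full_width = FALSE)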

       Type   var         mean    sd
       --------------------------------
       Dark   CalFat      356.1   65.5
       Milk   CalFat      273.8   63.3
       Dark   Calories    550.9   62.7
       Milk   Calories    527.0   57.6
       Dark   Carbs_g      45.7   14.1
       Milk   Carbs_g      57.3    8.0
       Dark   Chol_mg       4.5    7.4
       Milk   Chol_mg      14.6    9.3
       Dark   Fiber_g       7.4    3.8
       Milk   Fiber_g       2.3    1.8
       Dark   Na_mg        20.2   29.1
       Milk   Na_mg        76.5   44.5
       Dark   Protein_g     7.5    1.9
       Milk   Protein_g     6.7    1.4
       Dark   SatFat_g     22.7    7.7
       Milk   SatFat_g     18.3    5.4
       Dark   Sugars_g     31.1   15.0
       Milk   Sugars_g     48.5   15.8
       Dark   TotFat_g     40.0    7.4
       Milk   TotFat_g     31.5    4.3

  1. Make side-by-side boxplots for each of the variables, for type of chocolate. (Use the grammar of graphics in ggplot2.) Write a paragraph explaining how the type of chocolate differs nutritionally.
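choc %>%
  # Long form so that each nutrient can get its own panel
  gather(var, value, Calories:Protein_g) %>%
  ggplot(aes(x = Type, y = value)) +
  geom_boxplot() +
  # Free y scales because the nutrients are measured in different units
  facet_wrap(~ var, scales = "free_y")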

Milk chocolates are generally higher in sugar, cholesterol, sodium (Na) and carbs, but lower in calories from fat, total fat and saturated fat. Dark chocolates tend to have more fibre.

  1. Compute two-sample t-tests for each of the variables. Which variable most distinguishes the chocolate type? (This may need to be done using the base R function t.test.)
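choc %>%
  # Long form: one row per chocolate and nutrient
  gather(var, value, Calories:Protein_g) %>%
  # A unique row id lets spread() run without duplicate-key errors
  rowid_to_column() %>%
  # One column per chocolate type; cells for the other type become NA
  spread(Type, value) %>%
  group_by(var) %>%
  # Two-sample t-test per nutrient (t.test() drops the NAs)
  summarise(p_val = t.test(Dark, Milk)$p.value) %>%
  # Smallest p-value first
  arrange(p_val)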
# # A tibble: 10 x 2
#    var          p_val
#    <chr>        <dbl>
#  1 Fiber_g   1.48e-12
#  2 TotFat_g  1.12e- 9
#  3 Na_mg     6.62e- 8
#  4 CalFat    2.10e- 7
#  5 Chol_mg   2.28e- 6
#  6 Sugars_g  4.10e- 6
#  7 Carbs_g   4.44e- 6
#  8 SatFat_g  2.78e- 3
#  9 Protein_g 2.20e- 2
# 10 Calories  7.44e- 2

Fibre (Fiber_g) has the smallest p-value, so it is the nutritional item that most distinguishes milk from dark chocolate.
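
To see what each p-value above is built from, the sketch below computes one test by hand and compares it with `t.test()`. By default `t.test()` does not assume equal variances (the Welch test), so that is the version reproduced here. The data frame `toy` and its values are made up purely for illustration and are not the chocolates data; the names `dark`, `milk`, `se` and `t_by_hand` are likewise hypothetical.

```r
# Made-up values for illustration only; these are NOT the chocolates data
toy <- data.frame(
  Type    = rep(c("Dark", "Milk"), each = 5),
  Fiber_g = c(7.1, 9.8, 5.2, 11.0, 6.4, 2.0, 3.1, 1.2, 2.9, 2.3)
)

dark <- toy$Fiber_g[toy$Type == "Dark"]
milk <- toy$Fiber_g[toy$Type == "Milk"]

# Welch statistic by hand: difference in means divided by its standard error
se <- sqrt(var(dark) / length(dark) + var(milk) / length(milk))
t_by_hand <- (mean(dark) - mean(milk)) / se

# The same test with t.test(); var.equal = FALSE (Welch) is the default
tt <- t.test(dark, milk)

# The formula interface gives the same test without reshaping the data
tt_formula <- t.test(Fiber_g ~ Type, data = toy)

c(by_hand = t_by_hand, t_test = unname(tt$statistic))
c(two_vectors = tt$p.value, formula = tt_formula$p.value)
```

The formula interface is one way to avoid the spread step: after `group_by(var)` on the long-form data, a call of the form `t.test(value ~ Type)` inside `summarise()` should give the same p-values, though that has not been run against the chocolates data here.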