Textbook questions, chapter 2: 1, 2, 4

- better performance
- worse performance
- better performance
- worse performance

- regression and inference
- classification and prediction
- regression and prediction

- Lots of different answers here, try to collect the responses from students

- spam filters, credit application success, species (animals, plants) labelling

spam filter: response: ham, spam; predictors: from, subject, words used, …; prediction problem

- performance in sports, characteristics that lead to exam scores

performance in sports: response: fatigue; predictors: length of match, number of rallies, score differential, …; probably inference to understand problem, possibly prediction if need to identify players needing interventions

- grouping stamps, paintings, companies

Textbook question 7

```
Obs.   X1   X2   X3   Distance(0, 0, 0)    Y
----------------------------------------------
1       0    3    0   3                    Red
2       2    0    0   2                    Red
3       0    1    3   sqrt(10) ~ 3.2       Red
4       0    1    2   sqrt(5)  ~ 2.2       Green
5      -1    0    1   sqrt(2)  ~ 1.4       Green
6       1    1    1   sqrt(3)  ~ 1.7       Red
```

Green. Observation #5 is the closest neighbor for K = 1.

Red, because it is the most common of the three responses

Red. Observations #2, 5, 6 are the closest neighbors for K = 3. 2 is Red, 5 is Green, and 6 is Red.
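The distances and the two KNN predictions above can be checked with a few lines of base R. This is only a minimal sketch: the object names (`train`, `test_point`, `dists`) and the helper `knn_predict()` are made up for illustration and do not come from the textbook.

```r
# Training observations from the table above
train <- data.frame(
  X1 = c(0, 2, 0, 0, -1, 1),
  X2 = c(3, 0, 1, 1, 0, 1),
  X3 = c(0, 0, 3, 2, 1, 1),
  Y  = c("Red", "Red", "Red", "Green", "Green", "Red")
)

# Test point X1 = X2 = X3 = 0
test_point <- c(0, 0, 0)

# Euclidean distance from each observation to the test point
X <- as.matrix(train[, c("X1", "X2", "X3")])
dists <- sqrt(rowSums(sweep(X, 2, test_point)^2))

# Hypothetical helper: majority vote among the k nearest observations
knn_predict <- function(k) {
  nearest <- order(dists)[seq_len(k)]
  votes <- table(train$Y[nearest])
  names(votes)[which.max(votes)]
}

knn_predict(1)  # "Green" (observation 5)
knn_predict(3)  # "Red"   (observations 5, 6, 2)
```

There are no distance ties in this example, so the sketch does not need a tie-breaking rule.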

Complete these exercises by writing your responses into an Rmarkdown document, outputting to `html`. Give your Rmd file to another group member and see if they can `knit` it.

- Download the chocolates data set and read it into R (we recommend using `read_csv` from the `tidyverse` suite).

*About the data:* The chocolates data was compiled by students in a previous class of Prof Cook, by collecting nutrition information on the chocolates as listed on their internet sites. All numbers were normalised to be equivalent to a 100g serving. Units of measurement are listed in the variable name.

```
library(tidyverse)
choc <- read_csv("http://monba.dicook.org/data/chocolates.csv")
```

- Take a look at the type of variables in the data. If your question is “How do milk and dark chocolates differ?” what type of problem have you got?

*This is a classification problem, because the response (`Type`: milk or dark) is categorical.*

- Compute the means and standard deviations for milk and dark on each of the variables. Make a nice table summary. (Try using the pipe operator with the wrangling verbs `group_by` and `summarise`, and make the table with the `kableExtra` package.)

```
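library(kableExtra)
choc %>%
  # Stack the nutrition columns into long form: one row per chocolate and variable
  gather(var, value, Calories:Protein_g) %>%
  # Summarise each variable separately for each chocolate type
  group_by(Type, var) %>%
  summarise(mean = mean(value), sd = sd(value)) %>%
  arrange(var) %>%
  # Render as a formatted table, rounded to one decimal place
  kable(digits = 1) %>%
  kable_styling(bootstrap_options = "striped", full_width = FALSE)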
```

Type | var | mean | sd |
---|---|---|---|
Dark | CalFat | 356.1 | 65.5 |
Milk | CalFat | 273.8 | 63.3 |
Dark | Calories | 550.9 | 62.7 |
Milk | Calories | 527.0 | 57.6 |
Dark | Carbs_g | 45.7 | 14.1 |
Milk | Carbs_g | 57.3 | 8.0 |
Dark | Chol_mg | 4.5 | 7.4 |
Milk | Chol_mg | 14.6 | 9.3 |
Dark | Fiber_g | 7.4 | 3.8 |
Milk | Fiber_g | 2.3 | 1.8 |
Dark | Na_mg | 20.2 | 29.1 |
Milk | Na_mg | 76.5 | 44.5 |
Dark | Protein_g | 7.5 | 1.9 |
Milk | Protein_g | 6.7 | 1.4 |
Dark | SatFat_g | 22.7 | 7.7 |
Milk | SatFat_g | 18.3 | 5.4 |
Dark | Sugars_g | 31.1 | 15.0 |
Milk | Sugars_g | 48.5 | 15.8 |
Dark | TotFat_g | 40.0 | 7.4 |
Milk | TotFat_g | 31.5 | 4.3 |
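`gather()` still works but is superseded in current `tidyr`; the same reshape can be written with `pivot_longer()`. The sketch below shows the equivalent summary pipeline on a tiny made-up tibble (`toy`, with invented values, not the chocolates data) so that it runs without downloading anything. The `.groups = "drop"` argument assumes dplyr 1.0.0 or later.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data: two dark and two milk chocolates, invented values
toy <- tibble(
  Type      = c("Dark", "Dark", "Milk", "Milk"),
  Calories  = c(500, 600, 520, 540),
  Protein_g = c(8, 6, 7, 7)
)

toy_summary <- toy %>%
  # Long form: one row per chocolate and variable (replaces gather())
  pivot_longer(Calories:Protein_g, names_to = "var", values_to = "value") %>%
  group_by(Type, var) %>%
  # .groups = "drop" returns an ungrouped tibble and silences the regrouping message
  summarise(mean = mean(value), sd = sd(value), .groups = "drop") %>%
  arrange(var)

toy_summary
```

On the real data, the `pivot_longer()` line would take the place of the `gather()` call, with the rest of the pipeline unchanged.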

- Make side-by-side boxplots for each of the variables, for type of chocolate. (Use the grammar of graphics in `ggplot2`.) Write a paragraph explaining how the type of chocolate differs nutritionally.

```
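choc %>%
  # Long form so that every variable can get its own panel
  gather(var, value, Calories:Protein_g) %>%
  ggplot(aes(x = Type, y = value)) +
  geom_boxplot() +
  # Free y scales: the variables are measured in different units
  facet_wrap(~ var, scales = "free_y")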
```

*Milk chocolates are generally higher in sugar, cholesterol, sodium (Na) and carbs, but lower in calories from fat and in saturated fat. Dark chocolates tend to have more fibre.*

- Compute two sample t-tests for each of the variables. Which variable most distinguishes the chocolate type? (This may need to be done using the base R function.)

```
choc %>%
  gather(var, value, Calories:Protein_g) %>%
  # A unique row id keeps each observation on its own row when spreading
  rowid_to_column() %>%
  # One column per chocolate type; the other type's column is NA on each row
  spread(Type, value) %>%
  group_by(var) %>%
  # t.test() drops the NAs and runs a Welch two-sample test by default
  summarise(p_val = t.test(Dark, Milk)$p.value) %>%
  arrange(p_val)
# # A tibble: 10 x 2
#    var          p_val
#    <chr>        <dbl>
#  1 Fiber_g   1.48e-12
#  2 TotFat_g  1.12e- 9
#  3 Na_mg     6.62e- 8
#  4 CalFat    2.10e- 7
#  5 Chol_mg   2.28e- 6
#  6 Sugars_g  4.10e- 6
#  7 Carbs_g   4.44e- 6
#  8 SatFat_g  2.78e- 3
#  9 Protein_g 2.20e- 2
# 10 Calories  7.44e- 2
```

Fibre is the nutritional item that most distinguishes milk from dark chocolate: it has the smallest p-value (1.48e-12).
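For a single variable, the base R route mentioned in the question can be sketched with the formula interface of `t.test()`. The data below are simulated purely for illustration (the `toy` data frame and its group means are assumptions, not the chocolates data), and the manual calculation shows what the default Welch statistic is.

```r
set.seed(1)

# Hypothetical simulated fibre values, 20 chocolates per type (not the real data)
toy <- data.frame(
  Type    = rep(c("Dark", "Milk"), each = 20),
  Fiber_g = c(rnorm(20, mean = 7, sd = 3), rnorm(20, mean = 2, sd = 2))
)

# Welch two-sample t-test (var.equal = FALSE is the default)
res <- t.test(Fiber_g ~ Type, data = toy)
res$p.value

# The same statistic by hand: difference in means over its standard error
dark <- toy$Fiber_g[toy$Type == "Dark"]
milk <- toy$Fiber_g[toy$Type == "Milk"]
t_manual <- (mean(dark) - mean(milk)) /
  sqrt(var(dark) / length(dark) + var(milk) / length(milk))

all.equal(unname(res$statistic), t_manual)
```

Because each variable is tested separately, the p-values in the table above are not adjusted for running ten tests; the ranking of variables is unaffected by that.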