Class discussion

Textbook question, chapter 4 Q8

Logistic regression: 20% training error rate, 30% test error rate.
KNN (K=1): 18% error rate, averaged over the training and test sets.

For KNN with K=1, the training error rate is 0%, because each training observation's nearest neighbour is itself, so it is always classified correctly. Since the reported 18% is the average of the training and test error rates, the test error rate must be 36%. I would choose logistic regression because of its lower test error rate of 30%.
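Spelling out the arithmetic:

\[
\frac{0\% + \text{test error}}{2} = 18\% \quad\Longrightarrow\quad \text{test error} = 36\%.
\]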

Do it yourself

  1. Run the K-nearest neighbours classification example in textbook section 4.6.5. The code below fits the model for \(k=1\).
library(tidyverse)
library(ISLR)
library(class)
data(Smarket)
# Training set: years before 2005, keeping the two lag predictors and the response
Smarket_tr <- Smarket %>% 
  dplyr::filter(Year < 2005) %>%
  dplyr::select(Lag1, Lag2, Direction)
# Test set: 2005 onwards
Smarket_ts <- Smarket %>% 
  dplyr::filter(Year >= 2005) %>%
  dplyr::select(Lag1, Lag2, Direction)
# knn() takes the training predictors, the test predictors, and the training labels
knn.pred <- knn(Smarket_tr[,1:2], Smarket_ts[,1:2],
                Smarket_tr[,3], k=1)
# Confusion table: predicted direction against the true test-set direction
table(knn.pred, Smarket_ts[,3])
#         
# knn.pred Down Up
#     Down   43 58
#     Up     68 83
  a. Compute the test error for \(k=1\) (a code sketch for parts a–c follows this list).
  b. Re-fit the model for \(k=3\), and compute the test error. How does this compare with the smaller \(k\)?
  c. Fit a range of values for \(k\), and find the best value.
  d. Would you put your money on this classification model, to invest in stock purchases?
  2. Run the linear discriminant analysis for the chocolates data from the lecture notes, and compute the training and test error (a placeholder sketch follows below).
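A minimal sketch for parts a–c, reusing the Smarket_tr and Smarket_ts objects created above. The helper function knn_test_error, the seed, and the grid of \(k\) values up to 25 are my own choices, not from the textbook.

# Ties among nearest neighbours are broken at random, so fix a seed for reproducibility
set.seed(1)

# Test error: proportion of test observations misclassified
knn_test_error <- function(k) {
  pred <- knn(Smarket_tr[, 1:2], Smarket_ts[, 1:2], Smarket_tr[, 3], k = k)
  mean(pred != Smarket_ts[, 3])
}

knn_test_error(1)  # part a: test error for k = 1
knn_test_error(3)  # part b: test error for k = 3

# part c: test error over a grid of k values; pick the k with the smallest error
k_grid <- 1:25
test_errs <- sapply(k_grid, knn_test_error)
k_grid[which.min(test_errs)]

For part d, a useful baseline is the proportion of Up days in the test set: the model only deserves your money if it beats that reliably.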
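For the chocolates exercise I don't have the lecture-notes data here, so the sketch below is a placeholder: the file name chocolates.csv, the class column Type, and the 80/20 split are assumptions to replace with whatever the lecture notes use.

library(MASS)
# Placeholder file and column names -- substitute the chocolates data from the lecture notes
choc <- read_csv("chocolates.csv")

# Assumed 80/20 training/test split
set.seed(2020)
tr_rows <- sample(nrow(choc), size = round(0.8 * nrow(choc)))
choc_tr <- choc[tr_rows, ]
choc_ts <- choc[-tr_rows, ]

# LDA with the (assumed) class column Type against all other columns as predictors
choc_lda <- lda(Type ~ ., data = choc_tr)

# Training and test error: proportion misclassified in each set
mean(predict(choc_lda, choc_tr)$class != choc_tr$Type)
mean(predict(choc_lda, choc_ts)$class != choc_ts$Type)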

Practice

  1. This data is an oldie but a goodie: it contains physical measurements on three species of flea beetles. You can find it at http://www.ggobi.org/book/data/flea.csv.

Source: Lubischew, A. A. (1962), On the Use of Discriminant Functions in Taxonomy, Biometrics 18, 455–477.

| Variable | Explanation |
|----------|-------------|
| species | Ch. concinna, Ch. heptapotamica, and Ch. heikertingeri |
| tars1 | width of the first joint of the first tarsus in microns |
| tars2 | width of the second joint of the first tarsus in microns |
| head | maximal width of the head between the external edges of the eyes, in 0.01 mm |
| aede1 | maximal width of the aedeagus in the fore-part in microns |
| aede2 | front angle of the aedeagus (1 unit = 7.5 degrees) |
| aede3 | aedeagus width from the side in microns |
  a. Read in the data, and make a scatterplot matrix with the points coloured by species. Write a few sentences explaining what you learn about the data, and which variables seem most promising for distinguishing the species.
library(MASS)
library(caret)
library(GGally)
flea <- read_csv("http://www.ggobi.org/book/data/flea.csv")
# Scatterplot matrix of the six measurement variables, coloured by species
ggscatmat(flea, columns=2:7, color="species") +
  scale_colour_brewer(palette="Dark2")