--- title: "ETC3250: Introduction" subtitle: "Semester 1, 2019" author: "
Professor Di Cook

Econometrics and Business Statistics
Monash University" date: "Week 1 (b)" output: xaringan::moon_reader: css: ["kunoichi", "ninjutsu", "mystyle.css"] lib_dir: libs nature: ratio: '16:9' highlightStyle: github highlightLines: true countIncrementalSlides: false --- {r setup, include=FALSE} library(knitr) knitr::opts_chunkset(tidy = FALSE, message = FALSE, warning = FALSE, echo = FALSE, fig.retina = 2) options(htmltools.dir.version = FALSE) library(magick)  class: split-30 .column[.pad10px[ ## Outline - .green[Introduction] - Learning from data ]] .column[.top50px[ - .orange[Better understand] or .orange[make predictions] about a certain phenomenon under study - .orange[Construct a model] of that phenomenon by finding relations between several variables - If phenomenon is complex or depends on a large number of variables, an .orange[analytical solution] might not be available - However, we can .orange[collect data] and learn a model that .orange[approximates] the true underlying phenomenon ]] --- class: split-30 .column[.pad10px[ ## Outline - .green[Introduction] - Learning from data ]] .column[.top50px[ {r fig.width=10, fig.height=4, fig.align='center'} library(tidyverse) library(gapminder) library(gridExtra) p1 <- gapminder %>% filter(country == "Australia") %>% ggplot(aes(x=year, y=lifeExp)) + geom_point() + geom_smooth() + xlab("predictor") + ylab("response") + ggtitle("Regression") + theme(aspect.ratio=1) flea <- read_csv("http://www.ggobi.org/book/data/flea.csv") p2 <- ggplot(flea, aes(x=tars1, y=aede1, colour = species)) + geom_point() + scale_colour_brewer(palette = "Dark2") + xlab("Var 1") + ylab("Var 2") + ggtitle("Classification") + theme(aspect.ratio=1, legend.position="None") p3 <- ggplot(flea, aes(x=tars1, y=aede1)) + geom_point() + xlab("Var 1") + ylab("Var 2") + ggtitle("Clustering") + theme(aspect.ratio=1) grid.arrange(p1, p2, p3, ncol=3) \mathcal{D} = \{(x_i, y_i)\}_{i = 1}^N, ~~~ \mbox{where}~ x_i = (x_{i1}, \dots, x_{ip})^{T}$.boxshadow[.green[Statistical learning] provides a framework for constructing models from the data.] ]] --- class: split-30 .column[.pad10px[ ## Outline - .green[Introduction] - Learning from data - Different learning problems ]] .column[.top50px[ - .green[Supervised] learning,$y_i$.orange[available] for all$x_i$- Regression (or prediction) - Classification - .green[Unsupervised] learning,$y_i$.orange[unavailable] for all$x_i$- .green[Semi-supervised] learning,$y_i$available only for few$x_i$- Other types of learning: reinforcement learning, online learning, active learning, etc. .boxshadow[Being able to .green[identify] which is the type of learning problem you have is important in practice]] ]] --- class: split-30 .column[.pad10px[ ## Outline - Introduction - .green[Supervised learning] ]] .column[.top50px[$\mathcal{D} = \{(y_i, x_i)\}_{i = 1}^N$where$(y_i, x_i) \sim P(Y, X) = P(X) \underbrace{P(Y|X)}_{}$where$P(Y, X)$means that these arise from some probability distribution.$\sim"$means distributed as, arise from. Typically, we only are interested in$P(Y|X)$, the distribution of$Y$conditional on$X$. ]] --- class: split-30 .column[.pad10px[ ## Outline - Introduction - .green[Supervised learning] ]] .column[.top50px[ -$Y = (Y_1, \dots, Y_q)$: response (output) (could be multivariate,$q=1$for us) -$X = (X_1, \dots, X_p)$: set of$p$predictors (input) We seek a function$h(X)$for predicting$Y$given values of the input$X$. This function is computed using$\mathcal{D}$. ]] --- class: split-30 .column[.pad10px[ ## Outline - Introduction - .green[Supervised learning] ]] .column[.top50px[$\mathcal{D} = \{(y_i, x_i)\}_{i = 1}^N \mbox{ where } (y_i, x_i) \sim P(Y, X)$We are interested in minimizing the expected .orange[out-of-sample] prediction error:$\mbox{Err}_{\mbox{out}}(h) = E[L(Y, h(X))]$where$L(y, {\hat{y}})$is a non-negative real-valued .orange[loss function], such as$L(y, \hat{y}) = (y - \hat{y})^2$and$L(y, \hat{y}) = I(y \neq \hat{y})$. .boxshadow[The goal is that the predictions from the model are accurate for future samples.] ]] --- class: split-30 .column[.pad10px[ ## Outline - Introduction - .green[Supervised learning] - Regression ]] .column[.top50px[ We often assume that our data arose from a statistical model$Y=f(X) + \varepsilon,$where$f$is the true unknown function,$\varepsilon$is the random error term with$E[\varepsilon] = 0$and is independent of$X$. - The additive error model is a useful approximation to the truth -$f(x) = E[Y|X = x]$- Not a deterministic relationship:$Y \neq f(X)$]] --- class: split-30 .column[.pad10px[ ## Outline - Introduction - .green[Supervised learning] - Regression ]] .column[.nopadding[ {r} if (!file.exists("images/2.2.png")) image_write(image_read("http://www-bcf.usc.edu/~gareth/ISL/Chapter2/2.2.pdf", density = 300), "images/2.2.png", format = "png", density = 300) .light-blue[Blue curve] is$f(x)$, the true functional relationship. .font_tiny[(Chapter2/2.2.pdf)] ]] --- class: split-30 .column[.pad10px[ ## Outline - Introduction - .green[Supervised learning] - Regression ]] .column[.nopadding[ {r} if (!file.exists("images/2.3.png")) image_write(image_read("http://www-bcf.usc.edu/~gareth/ISL/Chapter2/2.3.pdf", density = 300), "images/2.3.png", format = "png", density = 300) .light-blue[Blue surface] is$f(x)$, the true functional relationship. .font_tiny[(Chapter2/2.3.pdf)] ]] --- class: split-30 .column[.pad10px[ ## Outline - Introduction - .green[Supervised learning] - Regression - Why? ]] .column[.top50px[ - **Prediction**: -$\hat{y}_{*} = \hat f(x_{*})$for a new observation$x_{*}$- **Inference (or explanation)**: - Which predictors are associated with the response? - What is the relationship between the response and each predictor? ]] --- class: split-30 .column[.pad10px[ ## Outline - Introduction - .green[Supervised learning] - Regression - Why? - Estimation ]] .column[.pad10px[ {r} if (!file.exists("images/2.3.png")) image_write(image_read("http://www-bcf.usc.edu/~gareth/ISL/Chapter2/2.3.pdf", density = 300), "images/2.3.png", format = "png", density = 300)  {r} if (!file.exists("images/2.4.png")) image_write(image_read("http://www-bcf.usc.edu/~gareth/ISL/Chapter2/2.4.pdf", density = 300), "images/2.4.png", format = "png", density = 300)  Linear model:$\hat f(\mbox{education}, \mbox{seniority}) =~~~~~~~~\hat \beta_0 + \hat \beta_1 \times \mbox{education} + \hat \beta_2 \times \mbox{seniority}$.boxshadow[Why would we ever choose to use a **more restrictive method** instead of a **very flexible approach**?] .font_tiny[(Chapter2/2.3.pdf, 2.4.pdf)] ]] --- class: split-30 .column[.pad10px[ ## Outline - Introduction - .green[Supervised learning] - Regression - Why? - Estimation ]] .column[.nopadding[ {r} if (!file.exists("images/2.5.png")) image_write(image_read("http://www-bcf.usc.edu/~gareth/ISL/Chapter2/2.5.pdf", density = 300), "images/2.5.png", format = "png", density = 300) if (!file.exists("images/2.6.png")) image_write(image_read("http://www-bcf.usc.edu/~gareth/ISL/Chapter2/2.6.pdf", density = 300), "images/2.6.png", format = "png", density = 300)    .font_tiny[(Chapter2/2.3.pdf, 2.4.pdf, 2.5.pdf, 2.6.pdf)] ]] --- class: split-30 .column[.pad10px[ ## Outline - Introduction - .green[Supervised learning] - Regression - Why? - Estimation - Methods ]] .column[.nopadding[ {r} library(emo)  - .orange[Parametric methods] r set.seed(1000); emo::ji("math") Assumption about the form of$f$, e.g. linear r set.seed(1000); emo::ji("smile") The problem of estimating$f$reduces to estimating a set of parameters r set.seed(1000); emo::ji("smile") Usually a good starting point for many learning problems r set.seed(1000); emo::ji("frown") Poor performance if linearity assumption is wrong - .orange[Non-parametric methods] r set.seed(1000); emo::ji("smile") No *explicit* assumptions about the form of$f$, e.g. nearest neighbours:$\hat Y(x) = \frac1k \sum_{x_i \in N_k(x)} y_i$r set.seed(1000); emo::ji("smile") High flexibility: it can potentially fit a range of shapes r set.seed(1000); emo::ji("frown") A large number of observations is required to estimate$f$with good accuracy ]] --- class: split-30 .column[.pad10px[ ## Outline - Introduction - Supervised learning - .green[Assessing model accuracy] - Regression ]] .column[.top50px[ Suppose we have a regression model$y=f(x)+\varepsilon$. .orange[Estimate]$\hat{f}$from some .orange[training data],$Tr=\{x_i,y_i\}_{i=1}^n$. One common measure of accuracy is: .orange[Training Mean Squared Error]$MSE_{Tr} = \mathop{\mbox{Ave}}\limits_{i\in Tr}[y_i-\hat{f}(x_i)]^2 = \frac{1}{n}\sum_{i=1}^n [(y_i-\hat{f}(x_i)]^2$]] --- class: split-30 .column[.pad10px[ ## Outline - Introduction - Supervised learning - .green[Assessing model accuracy] - Regression ]] .column[.top50px[ Suppose we have a regression model$y=f(x)+\varepsilon$. .orange[Estimate]$\hat{f}$from some .orange[training data],$Tr=\{x_i,y_i\}_{i=1}^n$. One common measure of accuracy is: .orange[Training Mean Squared Error]$MSE_{Tr} = \mathop{\mbox{Ave}}\limits_{i\in Tr}[y_i-\hat{f}(x_i)]^2 = \frac{1}{n}\sum_{i=1}^n [(y_i-\hat{f}(x_i)]^2$Measure .orange[real accuracy] using .orange[test data]$Te=\{x_j,y_j\}_{j=1}^m$, .orange[Test Mean Squared Error]$MSE_{Te} = \mathop{\mbox{Ave}}\limits_{j\in Te}[y_j-\hat{f}(x_j)]^2 = \frac{1}{m}\sum_{j=1}^m [(y_j-\hat{f}(x_j)]^2$]] --- class: split-30 .column[.pad10px[ ## Outline - Introduction - Supervised learning - .green[Assessing model accuracy] - Regression - Training vs Test MSEs ]] .column[.top50px[ - In general, the more .orange[flexible] a method is, the .orange[lower] its .orange[training MSE] will be. i.e. it will “fit” the training data very well. - However, the .orange[test MSE] may be .orange[higher] for a more .orange[flexible] method than for a simple approach like linear regression. - Flexibility also makes interpretation more difficult. There is a trade-off between .orange[flexibility] and .orange[model interpretability]. ]] --- class: split-30 .column[.pad10px[ ## Outline - Introduction - .green[Supervised learning] - Regression - Why? - Estimation - Methods - Interpretability vs flexibility ]] .column[.nopadding[ {r} if (!file.exists("images/2.7.png")) image_write(image_read("http://www-bcf.usc.edu/~gareth/ISL/Chapter2/2.7.pdf", density = 300), "images/2.7.png", format = "png", density = 300) Simplistic overview of methods on the flexibility vs interpretability scale. Interpretability is when it is clear how the explanatory variable is related to the response, e.g. linear model. .orange[Poor interpretability] is often called a .orange["black box"] method. .font_tiny[(Chapter2/2.7.pdf)] ]] --- class: split-30 .column[.pad10px[ ## Outline - Introduction - Supervised learning - .green[Assessing model accuracy] - Regression - Training vs Test MSEs - Example ]] .column[.top50px[ .split-two[ .row[ {r} if (!file.exists("images/2.9.png")) image_write(image_read("http://www-bcf.usc.edu/~gareth/ISL/Chapter2/2.9.pdf", density = 300), "images/2.9.png", format = "png", density = 300) .font_tiny[(Chapter2/2.9.pdf)] ] .row[ .split-50[ .column[ .black[true curve] .orange[linear regression] .green[Smoothing] splines ] .column[ .gray[Training MSE] .red[Test MSE] .black[Dashed: Minimum test MSE] ] ] ] ] ]] --- class: split-30 .column[.pad10px[ ## Outline - Introduction - Supervised learning - .green[Assessing model accuracy] - Regression - Training vs Test MSEs - Example ]] .column[.top50px[ .split-two[ .row[ {r} if (!file.exists("images/2.10.png")) image_write(image_read("http://www-bcf.usc.edu/~gareth/ISL/Chapter2/2.10.pdf", density = 300), "images/2.10.png", format = "png", density = 300) .font_tiny[(Chapter2/2.9.pdf)] ] .row[ .split-50[ .column[ .black[true curve] .orange[linear regression] .green[Smoothing] splines ] .column[ .gray[Training MSE] .red[Test MSE] .black[Dashed: Minimum test MSE] ] ] ] ] ]] --- class: split-30 .column[.pad10px[ ## Outline - Introduction - Supervised learning - .green[Assessing model accuracy] - Regression - Training vs Test MSEs - Example ]] .column[.top50px[ .split-two[ .row[ {r} if (!file.exists("images/2.11.png")) image_write(image_read("http://www-bcf.usc.edu/~gareth/ISL/Chapter2/2.11.pdf", density = 300), "images/2.11.png", format = "png", density = 300) .font_tiny[(Chapter2/2.11.pdf)] ] .row[ .split-50[ .column[ .black[true curve] .orange[linear regression] .green[Smoothing] splines ] .column[ .gray[Training MSE] .red[Test MSE] .black[Dashed: Minimum test MSE] ] ] ] ] ]] --- class: split-30 .column[.pad10px[ ## Outline - Introduction - Supervised learning - .green[Assessing model accuracy] - Regression - Training vs Test MSEs - Example - Bias-variance tradeoff ]] .column[.top50px[ .boxshadow[There are two competing forces that govern the choice of learning method: .orange[bias] and .orange[variance].] .orange[Bias] is the error that is introduced by modeling a complicated problem by a simpler problem. - For example, linear regression assumes a linear relationship when few real relationships are exactly linear. - In general, the .orange[more flexible] a method is, the .orange[less bias] it will have. [This site](https://degreesofbelief.roryquinn.com/bias-variance-tradeoff) has a lovely explanation, if you don't like mine. ]] --- class: split-30 count: false .column[.pad10px[ ## Outline - Introduction - Supervised learning - .green[Assessing model accuracy] - Regression - Training vs Test MSEs - Example - Bias-variance tradeoff ]] .column[.top50px[ .boxshadow[There are two competing forces that govern the choice of learning method: .orange[bias] and .orange[variance].] .orange[Variance] refers to how much your estimate would change if you had different training data. - In general, the .orange[more flexible] a method is, the .orange[more variance] it has. - The .orange[size] of the training data has an impact on the variance. ]] --- ## MSE decomposition If$Y = f(x) + \varepsilon$and$f(x)=\mbox{E}[Y\mid X=x]$, then the expected **test** MSE for a new$Y$at$x_0$will be equal to$E[(Y-\hat{f}(x_0))^2] = [\mbox{Bias}(\hat{f}(x_0))]^2 + \mbox{Var}(\hat{f}(x_0)) + \mbox{Var}(\varepsilon)$Test MSE = Bias$^2$+ Variance + Irreducible variance - The expectation averages over the variability of$Y$as well as the variability in the training data. - As the flexibility of$\hat{f}$increases, its variance increases and its bias decreases. - Choosing the flexibility based on average test MSE amounts to a .orange[bias-variance trade-off] --- .nopadding[ {r} if (!file.exists("images/2.12.png")) image_write(image_read("http://www-bcf.usc.edu/~gareth/ISL/Chapter2/2.12.pdf", density = 300), "images/2.12.png", format = "png", density = 300) .green[squared bias], .orange[variance], Var(ε) (dashed line), and .red[test MSE] for the three data sets shown earlier. The vertical dotted line indicates the flexibility level corresponding to the smallest test MSE. .font_tiny[(Chapter2/2.12.pdf)] ] --- class: split-30 .column[.nopadding[ ## Outline - Introduction - Supervised learning - .green[Assessing model accuracy] - Regression - Training vs Test MSEs - Example - Bias-variance tradeoff - Optimal prediction ]] .column[.top50px[ The optimal MSE is obtained when$\hat{f}=f = \mbox{E}[Y\mid X=x].$Then .orange[bias=variance=0] and$\mbox{MSE} = \mbox{irreducible variance}$This is called the .orange[oracle predictor] because it is not achievable in practice. ]] --- class: split-30 .column[.pad10px[ ## Outline - Introduction - Supervised learning - .green[Assessing model accuracy] - Classification ]] .column[.nopadding[ Here the response variable$Y$is .orange[qualitative]. - e.g., email is one of${\cal C} = (\mbox{spam},\mbox{ham})$- e.g., voters are one of${\cal C} = (\mbox{Liberal},\mbox{Labor},\mbox{Green},\mbox{National},\mbox{Other})$Our goals are: 1. Build a classifier$C(x)$that assigns a class label from${\cal C} = \{\mathcal{C}_1, \dots, \mathcal{C}_K\}$to a future unlabeled observation$x$. 2. Such a classifier will divide the input space into regions$\mathcal{R}_k$called decision regions, one for each class, such that all points in$\mathcal{R}_k$are assigned to class$\mathcal{C}_k$3. Assess the uncertainty in each classification (i.e., the probability of misclassification). 4. Understand the roles of the different predictors among$X = (X_1,X_2,\dots,X_p)$. ]] --- class: split-30 .column[.pad10px[ ## Outline - Introduction - Supervised learning - .green[Assessing model accuracy] - Classification ]] .column[.top50px[ Recall that we want to minimize the expected prediction error$E_{(Y, X)}[L(Y, C(X))]$where$L(y, {\hat{y}})$is a non-negative real-valued .orange[loss function]. In classification, the output$Y$is a .orange[categorical variable], and our loss function can be represented by a$K \times K$matrix$L$, where$K = \mbox{card}(\mathcal{C})$.$L(k, l)$is the cost of classifying$C_k$as$C_l$. With zero-one loss, i.e.$L(y, \hat{y}) = I(y \neq \hat{y})$the cost is equal for mistakes between any class. ]] --- class: split-30 .column[.pad10px[ ## Outline - Introduction - Supervised learning - .green[Assessing model accuracy] - Classification - Optimal classifier - Error ]] .column[.top50px[ .orange[Compute]$\hat{C}$from some .orange[training data],$Tr=\{x_i,y_i\}_1^n$. In place of MSE, we now use the error rate (fraction of misclassifications). *Training Error Rate*$\text{Error rate}_{Tr} = \frac{1}{n}\sum_{i=1}^n I(y_i \ne \hat{C}(x_i))$Measure .orange[real accuracy] using .orange[test data]$Te=\{x_j,y_j\}_1^m$*Test Error Rate*$\text{Error rate}_{Te} = \frac{1}{m}\sum_{j=1}^m I(y_j \ne \hat{C}(x_j))$]] --- class: split-30 .column[.pad10px[ ## Outline - Introduction - Supervised learning - .green[Assessing model accuracy] - Classification - Error - Bayes classifier ]] .column[.top50px[ Let${\cal C} = \{\mathcal{C}_1, \dots, \mathcal{C}_K\}$, and let$p_k(x) = \text{P}(Y = C_k\mid X = x),\qquad k = 1, 2, \dots , K.$These are the .orange[conditional class probabilities] at$x$. Then the .orange[Bayes] classifier at$x$is$C(x) = C_j \quad \mbox{ if } p_j(x) = \max\{p_1(x), p_2(x), \dots, p_K(x)\}$*This gives the minimum average test error rate.* ]] --- class: split-30 .column[.pad10px[ ## Outline - Introduction - Supervised learning - .green[Assessing model accuracy] - Classification - Error - Bayes classifier ]] .column[.top50px[$1-\text{E}\left(\max_j \text{P}(Y = C_j | X)\right)$- The .orange[Bayes error rate] is the lowest possible error rate that could be achieved if we knew exactly the .orange[true] probability distribution of the data. - It is analogous to the irreducible error in regression. - On test data, no classifier can get lower error rates than the Bayes error rate. - In reality, the Bayes error rate is not known exactly. ]] --- class: split-30 .column[.pad10px[ ## Outline - Introduction - Supervised learning - .green[Assessing model accuracy] - Classification - Error - Bayes classifier ]] .column[.top50px[ {r} if (!file.exists("images/2.13.png")) image_write(image_read("http://www-bcf.usc.edu/~gareth/ISL/Chapter2/2.13.pdf", density = 300), "images/2.13.png", format = "png", density = 300) .font_tiny[(Chapter2/2.13.pdf)] ]] --- class: split-30 .column[.pad10px[ ## Outline - Introduction - Supervised learning - .green[Assessing model accuracy] - Classification - Error - Bayes classifier - kNN ]] .column[.pad10px[ One of the simplest classifiers. Given a test observation$x_0$, - Find the$K$nearest points to$x_0$in the training data:${\cal N}_0$. - Estimate conditional probabilities$=P(Y = C_j \mid X=x_0) = \frac{1}{K}\sum_{i\in {\cal N}_0} I(y_i = C_j).$- Classify$x_0\$ to class with largest probability. ]] --- class: split-30 .column[.pad10px[ ## Outline - Introduction - Supervised learning - .green[Assessing model accuracy] - Classification - Error - Bayes classifier - kNN ]] .column[.top50px[ {r} if (!file.exists("images/2.14.png")) image_write(image_read("http://www-bcf.usc.edu/~gareth/ISL/Chapter2/2.14.pdf", density = 300), "images/2.14.png", format = "png", density = 300) .font_tiny[(Chapter2/2.14.pdf)] ]] --- class: split-30 .column[.pad10px[ ## Outline - Introduction - Supervised learning - .green[Assessing model accuracy] - Classification - Error - Bayes classifier - kNN ]] .column[.top50px[ {r} if (!file.exists("images/2.16.png")) image_write(image_read("http://www-bcf.usc.edu/~gareth/ISL/Chapter2/2.16.pdf", density = 300), "images/2.16.png", format = "png", density = 300) .font_tiny[(Chapter2/2.16.pdf)] ]] --- class: split-30 .column[.pad10px[ ## Outline - Introduction - Supervised learning - .green[Assessing model accuracy] - Classification - Error - Bayes classifier - kNN ]] .column[.top50px[ {r} if (!file.exists("images/2.15.png")) image_write(image_read("http://www-bcf.usc.edu/~gareth/ISL/Chapter2/2.15.pdf", density = 300), "images/2.15.png", format = "png", density = 300) .font_tiny[(Chapter2/2.15.pdf)] ]] --- class: split-30 .column[.pad10px[ ## Outline - Introduction - Supervised learning - .green[Assessing model accuracy] - Classification - Error - Bayes classifier - kNN ]] .column[.top50px[ {r} if (!file.exists("images/2.17.png")) image_write(image_read("http://www-bcf.usc.edu/~gareth/ISL/Chapter2/2.17.pdf", density = 300), "images/2.17.png", format = "png", density = 300) .font_tiny[(Chapter2/2.17.pdf)] ]] --- ### A fundamental picture --- layout: false # r set.seed(2019); emo::ji("technologist") Made by a human with a computer ### Slides at [https://monba.dicook.org](https://monba.dicook.org). ### Code and data at [https://github.com/dicook/Business_Analytics](https://github.com/dicook/Business_Analytics).
### Created using [R Markdown](https://rmarkdown.rstudio.com) with flair by [**xaringan**](https://github.com/yihui/xaringan), and [**kunoichi** (female ninja) style](https://github.com/emitanaka/ninja-theme). 