Getting up and running with the computer:

What is R?

https://www.computerworld.com/article/2497143/business-intelligence/business-intelligence-beginner-s-guide-to-r-introduction.html

From Wikipedia: “R is a programming language and software environment for statistical computing and graphics supported by the R Foundation for Statistical Computing. The R language is widely used among statisticians and data miners for developing statistical software and data analysis.”

R is free to use and, as of February 2019, has more than 14,000 user-contributed add-on packages on the Comprehensive R Archive Network (CRAN).

What is RStudio?

From Julie Lowndes:

If R were an airplane, RStudio would be the airport, providing many, many supporting services that make it easier for you, the pilot, to take off and go to awesome places. Sure, you can fly an airplane without an airport, but having those runways and supporting infrastructure is a game-changer.

The RStudio integrated development environment (IDE) has multiple components including:

  1. Source editor (to edit your scripts)
  2. Console window (to run your scripts and to test small pieces of code)
  3. Other tabs/panes

The “Help” menu has a cheatsheet with tips for using this interface.

RStudio Projects

Using projects to organise your work

Exercise 1

Create a project for this unit, in the directory.

Exercise 2

Download the lab1.Rmd from the course web site.

What is RMarkdown?

There’s a cheatsheet on R Markdown in the “Help” pages of RStudio.

When you click the Knit button, a document is generated that includes both the content and the output of any embedded R code chunks within the document.
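As a rough sketch, knitting can also be triggered from the console with `rmarkdown::render()`, which is approximately what the Knit button does. The file contents below are made up for illustration (they are not lab1.Rmd), and the render step only runs if the rmarkdown package and pandoc are available.

```r
# Build a tiny, hypothetical R Markdown file: text plus one R code chunk.
fence <- strrep("`", 3)  # three backticks, which open and close a chunk
rmd_lines <- c(
  "---",
  "title: \"Knit demo\"",
  "output: html_document",
  "---",
  "",
  "Some text, followed by an embedded R code chunk:",
  "",
  paste0(fence, "{r}"),
  "x <- c(50, 12, 20)",
  "mean(x)",
  fence
)
rmd_file <- tempfile(fileext = ".Rmd")
writeLines(rmd_lines, rmd_file)

# Knit from code when the tools are installed; render() returns the
# path of the generated document.
if (requireNamespace("rmarkdown", quietly = TRUE) &&
    rmarkdown::pandoc_available()) {
  out_file <- rmarkdown::render(rmd_file, quiet = TRUE)
  out_file
}
```

The generated HTML file contains the text, the chunk's code, and the printed value of `mean(x)`.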

Equations can be included using LaTeX (https://latex-project.org/) commands like this:

$$s^2 = \frac{1}{n-1}\sum_{i=1}^n (x_i-\bar{x})^2.$$

produce

\[s^2 = \frac{1}{n-1}\sum_{i=1}^n (x_i-\bar{x})^2.\]

We can also use inline mathematical symbols such as $\alpha$ and $\infty$, which produce \(\alpha\) and \(\infty\), respectively.
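The sample variance formula displayed above can be checked numerically. This is a minimal sketch on a made-up vector, comparing a term-by-term calculation with R's built-in `var()`, which uses the same n - 1 denominator.

```r
# Hypothetical data, chosen only to illustrate the formula
x <- c(1, 2, 3, 4, 5, 6)

n <- length(x)       # number of observations
x_bar <- mean(x)     # sample mean

# s^2 = 1/(n-1) * sum of squared deviations from the mean
s2 <- sum((x - x_bar)^2) / (n - 1)
s2

# The built-in function should agree
var(x)
all.equal(s2, var(x))
```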

For more details on using R Markdown see http://rmarkdown.rstudio.com. Spend a few minutes looking over that website before continuing with this document.

Exercise 3

Look at the text in the lab1.Rmd document.

Some R Basics

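(100+2)/3                            # arithmetic with parentheses: 34
5*10^2                               # ^ is evaluated before *: 500
1/0                                  # division by zero gives Inf
0/0                                  # undefined result: NaN (not a number)
(0i-9)^(1/2)                         # complex arithmetic: the square root of -9 is 3i
sqrt(2*max(-10,0.2,4.5))+100         # nested function calls: 103
x <- sqrt(2*max(-10,0.2,4.5))+100    # store the result in x
x                                    # print the value of x
log(100)                             # natural logarithm, about 4.605
log(100,base=10)                     # logarithm base 10: 2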

Data Types

Operations

set.seed(1000)
x <- rnorm(6)
x
# [1] -0.44577826 -1.20585657  0.04112631  0.63938841 -0.78655436 -0.38548930
sum(x + 10)
# [1] 57.85684

x[1]
# [1] -0.4457783
x[c(TRUE, FALSE, TRUE, TRUE, FALSE, FALSE)]
# [1] -0.44577826  0.04112631  0.63938841

x <- list(
  a = 10,
  b = c(1, "2")
)
x$a
# [1] 10
x[["a"]]
# [1] 10
x["a"]
# $a
# [1] 10
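The three ways of indexing a list shown above differ in what they return. A short sketch, using the same list, that inspects the class of each result:

```r
x <- list(
  a = 10,
  b = c(1, "2")  # mixing a number and a string coerces both to character
)

class(x[["a"]])  # the element itself: "numeric"
class(x$a)       # shorthand for x[["a"]]: "numeric"
class(x["a"])    # a smaller list that still carries the name a: "list"
class(x$b)       # "character", because of the coercion noted above
```

Use `[[` or `$` when you want the value, and `[` when you want to keep a list.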

Examining ‘structure’

str(x)
# List of 2
#  $ a: num 10
#  $ b: chr [1:2] "1" "2"

Missing Values

x <- c(50, 12, NA, 20)
mean(x)
# [1] NA
mean(x, na.rm=TRUE)
# [1] 27.33333
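Before removing missing values it often helps to find out where they are and how many there are. A small sketch using base R's `is.na()` on the same vector:

```r
x <- c(50, 12, NA, 20)

is.na(x)              # TRUE only for the missing entry
sum(is.na(x))         # number of missing values
observed <- x[!is.na(x)]
observed              # the observed values only
mean(observed)        # same as mean(x, na.rm = TRUE)
```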

Counting Categories

table(c(1, 2, 3, 1, 2, 8, 1, 4, 2))
# 
# 1 2 3 4 8 
# 3 3 1 1 1
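The counts returned by `table()` can be sorted or turned into proportions. A brief sketch on the same values:

```r
counts <- table(c(1, 2, 3, 1, 2, 8, 1, 4, 2))

sort(counts, decreasing = TRUE)  # most frequent categories first
props <- prop.table(counts)      # counts divided by the total
props

# Categories that share the highest count (the names are character strings)
top <- names(counts)[as.vector(counts) == max(counts)]
top
```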

Functions

One of the powerful aspects of R is how well it supports reproducibility. If you are going to repeat the same analysis over and over again, wrap the operations in a function that you can then apply to different data sets.

average <- function(x)
{
  return(sum(x)/length(x))
}

y1 <- c(1,2,3,4,5,6)
average(y1)
# [1] 3.5

y2 <- c(1, 9, 4, 4, 0, 1, 15)
average(y2)
# [1] 4.857143

Now write a function to compute the mode of a vector, and confirm that it returns 4 when applied to y <- c(1, 1, 2, 4, 4, 4, 9, 4, 4, 8)
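One possible approach, offered as a sketch rather than the expected answer: count how often each distinct value occurs and return the most frequent one. The name `most_common` is an arbitrary choice, used because base R's own `mode()` reports the storage type of an object rather than the statistical mode. If several values tie, this version returns the one that appears first.

```r
most_common <- function(x) {
  values <- unique(x)                      # distinct values, in order of appearance
  counts <- tabulate(match(x, values))     # how many times each one occurs
  values[which.max(counts)]                # value with the largest count
}

y <- c(1, 1, 2, 4, 4, 4, 9, 4, 4, 8)
most_common(y)
```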

Exercise 4

Getting data

Data can be found in R packages

library(tidyverse)
data(economics, package = "ggplot2")
# data frames are essentially a list of vectors
glimpse(economics)
# Observations: 574
# Variables: 6
# $ date     <date> 1967-07-01, 1967-08-01, 1967-09-01, 1967-10-01, 1967-1…
# $ pce      <dbl> 507.4, 510.5, 516.3, 512.9, 518.1, 525.8, 531.5, 534.2,…
# $ pop      <int> 198712, 198911, 199113, 199311, 199498, 199657, 199808,…
# $ psavert  <dbl> 12.5, 12.5, 11.7, 12.5, 12.5, 12.1, 11.7, 12.2, 11.6, 1…
# $ uempmed  <dbl> 4.5, 4.7, 4.6, 4.9, 4.7, 4.8, 5.1, 4.5, 4.1, 4.6, 4.4, …
# $ unemploy <int> 2944, 2945, 2958, 3143, 3066, 3018, 2878, 3001, 2877, 2…

These are not usually kept up to date but are good for practicing your analysis skills on.
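The comment in the code above, that a data frame is essentially a list of vectors, can be verified directly. A small sketch with a made-up data frame (the column names are hypothetical):

```r
# Toy data for illustration only
df <- data.frame(id = 1:3, score = c(2.5, 4.0, 3.5))

is.list(df)                          # a data frame is a list
length(df)                           # one list element per column
identical(df$score, df[["score"]])   # columns are extracted like list elements
sapply(df, class)                    # each column is a vector with its own type
```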

Or in their own packages

library(gapminder)
glimpse(gapminder)
# Observations: 1,704
# Variables: 6
# $ country   <fct> Afghanistan, Afghanistan, Afghanistan, Afghanistan, Af…
# $ continent <fct> Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, …
# $ year      <int> 1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992, …
# $ lifeExp   <dbl> 28.801, 30.332, 31.997, 34.020, 36.088, 38.438, 39.854…
# $ pop       <int> 8425333, 9240934, 10267083, 11537966, 13079460, 148803…
# $ gdpPercap <dbl> 779.4453, 820.8530, 853.1007, 836.1971, 739.9811, 786.…

I now primarily use the readr package (part of the tidyverse suite) for reading data. It mimics the base R reading functions but is implemented in C++, so it reads large files quickly, and it also attempts to identify the types of variables.

candy <- read_csv("https://raw.githubusercontent.com/fivethirtyeight/data/master/candy-power-ranking/candy-data.csv")
glimpse(candy)
# Observations: 85
# Variables: 13
# $ competitorname   <chr> "100 Grand", "3 Musketeers", "One dime", "One q…
# $ chocolate        <dbl> 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0,…
# $ fruity           <dbl> 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1,…
# $ caramel          <dbl> 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0,…
# $ peanutyalmondy   <dbl> 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0,…
# $ nougat           <dbl> 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0,…
# $ crispedricewafer <dbl> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
# $ hard             <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,…
# $ bar              <dbl> 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0,…
# $ pluribus         <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 1, 0, 1,…
# $ sugarpercent     <dbl> 0.732, 0.604, 0.011, 0.011, 0.906, 0.465, 0.604…
# $ pricepercent     <dbl> 0.860, 0.511, 0.116, 0.511, 0.511, 0.767, 0.767…
# $ winpercent       <dbl> 66.97173, 67.60294, 32.26109, 46.11650, 52.3414…
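To see the type guessing on something small, here is a sketch that writes a made-up CSV file to a temporary location and reads it back. The file contents and column names are hypothetical. Passing `col_types = cols()` asks readr to guess every column, and a column specification such as `col_character()` overrides the guess.

```r
library(readr)

# A tiny, made-up CSV file (hypothetical data)
csv_file <- tempfile(fileext = ".csv")
writeLines(
  c("name,score,day",
    "a,1.5,2019-02-01",
    "b,2.0,2019-02-02"),
  csv_file
)

# Let readr guess the type of every column
toy <- read_csv(csv_file, col_types = cols())
sapply(toy, class)

# State one column type explicitly instead of relying on the guess
toy_chr <- read_csv(csv_file, col_types = cols(day = col_character()))
class(toy_chr$day)
```

Here readr is expected to read `score` as a number and `day` as a date in the first call, while the second call keeps `day` as text.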

You can pull data together yourself, or look at data compiled by someone else.

Question 1

Question 2

Question 3

Question 4

  1. Read in the OECD PISA data (file student_sub.rds is available from the course web site)
  2. Tabulate the countries (CNT)
  3. Extract the values for Australia (AUS) and Shanghai (QCN)
  4. Compute the average and standard deviation of the reading scores (PV1READ), for each country
  5. Write a few sentences explaining what you learn about reading in these two countries.
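The first four steps above could be sketched as follows. Because student_sub.rds is not reproduced here, the code builds a small made-up stand-in with the two column names mentioned in the steps (CNT and PV1READ), saves it as an .rds file, and reads it back. The values are invented and say nothing about the real data. Replace the temporary file with the path to student_sub.rds when working on the question.

```r
library(dplyr)

# Hypothetical stand-in for student_sub.rds (made-up rows)
toy <- data.frame(
  CNT = c("AUS", "AUS", "QCN", "QCN", "NZL"),
  PV1READ = c(500, 520, 560, 580, 510)
)
rds_file <- tempfile(fileext = ".rds")
saveRDS(toy, rds_file)

# Step 1: read the .rds file
student <- readRDS(rds_file)

# Step 2: tabulate the countries
count(student, CNT)

# Steps 3 and 4: keep two countries, then summarise reading scores by country
reading <- student %>%
  filter(CNT %in% c("AUS", "QCN")) %>%
  group_by(CNT) %>%
  summarise(
    mean_read = mean(PV1READ, na.rm = TRUE),
    sd_read = sd(PV1READ, na.rm = TRUE)
  )
reading
```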

Homework

Using your free DataCamp account, work your way through the free tutorial Introduction to R. This provides some good insights on the data types you will commonly use in R.

Got a question?

It is always good to try to solve your problem yourself first. Most likely the error is a simple one, like a missing “)” or “,”. For deeper questions about packages, analyses and functions, knitting your Rmd into a document, or simply the error message being generated, you can often google for an answer. You will frequently be directed to the Q&A site http://stackoverflow.com.

Stack Overflow is a great place to get answers to tougher questions about R and data analysis. Always check first whether someone has asked your question before, because the answer might already be available. If not, make a reproducible example of your problem, following the guidelines here, and ask away. Remember that the people who kindly answer questions on Stack Overflow have day jobs too, and provide this community support as a kindness to all of us.