---
title: An Incomplete Introduction to R
---

This tutorial is based on the [Base R](http://github.com/rstudio/cheatsheets/raw/master/base-r.pdf) cheat sheet. Other practical cheat sheets, including one on [Advanced R](https://www.rstudio.com/wp-content/uploads/2016/02/advancedR.pdf), one on [R Markdown](https://github.com/rstudio/cheatsheets/raw/master/rmarkdown-2.0.pdf), and one on the [RStudio IDE](https://github.com/rstudio/cheatsheets/raw/master/rstudio-ide.pdf) can be found at .

---

# About R

* R is a programming language for *statistical computing*
* It is very popular in the empirical sciences
* Strength: *Data analysis* and *visualization*
* R is *open source* (in contrast to SPSS, for example)
* There is an extensive collection of libraries (*packages*)
* For the exercises, we will use RStudio with R Markdown

---

# A Quick Tour

## Important RStudio Shortcuts

* Insert new R code chunk: *Ctrl+Alt+I*
* Execute code line: *Ctrl+Enter*
* Execute whole chunk: *Ctrl+Shift+Enter*
* Run code and render document: *Ctrl+Shift+K*

---

## A Simple Example

```{r}
x <- 1:7
y <- x^2
plot(x, y)
```

---

# Basics

## Simple Expressions -- Using R as a Calculator

Arithmetic Operations: `+`, `-`, `*`, `/`, `%%`, `%/%`

```{r}
22 + 12 * 5 / 3
22 %/% 5  # integer division
22 %% 5   # remainder
```

Mathematical Functions: `log(x)`, `exp(x)`, `round(x, n)`, `signif(x, n)`, ...

```{r}
log(exp(42))
round(1.234, 1)
signif(123.4567, 4)
```

Comparison Operators: `==`, `!=`, `<`, `<=`, `>`, `>=`

```{r}
42 >= 23
23 == 42
```

Boolean Operators: `!`, `&`, `|`

```{r}
!TRUE
TRUE & FALSE
FALSE | 42  # non-zero numbers are coerced to TRUE
```

Be careful with numbers...
```{r}
x <- 123456789123456789
x == x+1  # TRUE: doubles cannot represent integers this large exactly
```

## Variable Assignments

Values are usually assigned to names via `<-`

```{r}
x <- 42
x
```

Most of the time, we can also use `=`

```{r}
x = 23
x
```

Assignments have a return value and can therefore be nested

```{r}
a <- (b <- 23) + 19
x = (y = 23) + 19
```

In some contexts, we can only use `<-`. Inside a function call, for example, `=` would be read as naming an argument

```{r}
x = mean(y <- 5:9)
```

## Writing Functions (basics)

**Conditional Statements** are of the form `if (condition) {...} else {...}`

```{r}
if (42 > 23) {
  print('foo')
} else {
  print('bar')
}
```

**While Loops** are of the form `while (condition) {...}`

```{r}
x <- 42.5
while (x > 0) {
  x <- x - 15
  print(x)
}
```

**For Loops** are of the form `for (variable in sequence) {...}`

```{r}
for (i in 1:5) {
  print(i)
}
```

**Functions** are defined using the `function` keyword:

```{r}
factorial <- function(n) {
  if (n == 0) {
    1
  } else {
    n * factorial(n-1)
  }
}
factorial(5)
```

* The value of the last evaluated expression is returned
* You can also return early using `return(x)`

## Getting Help

Need to know how some built-in function works?

```{r}
?mean
```

Need to search the help (e.g. to find some particular function)?

```{r}
help.search('t-test')
```

Want more information about the type and structure of a given object?
```{r}
class(iris)
str(iris)
```

---

# Data Structures

## Vectors

Vectors are *homogeneous* (all elements are of the same type), *one-dimensional* collections

```{r}
typeof(3)
typeof(as.integer(3))
typeof('foo')
typeof(TRUE)
```

Construction of Vectors

```{r}
3:7
c(2,3,5,7,11)
seq(2, 3, by=0.3)
rep(c(TRUE, FALSE), times=3)
rep(c("foo", "bar"), each=3)
```

Scalars in R are actually vectors with only one element

```{r}
c(1) == 1
```

Most operations work element-wise on arbitrary vectors (and matrices)

```{r}
sin(1:5)
1:3 * 4
1:3 * 4:6
1:5 > 3
```

Some functions on vectors

```{r}
x <- c(7,5,11,3,7)
sort(x)
rev(x)
unique(x)
table(x)
c(x,x)
```

Selecting vector elements and sub-vectors

```{r}
x <- c(1,2,3,4,5)
x[3]                # third element
x[-3]               # everything except the third element
x[2:4]
x[-(2:4)]
x[c(1,3)]
x[c(TRUE,FALSE,FALSE,TRUE,FALSE)]
x[x < 3]            # elements satisfying a condition
x[x %% 2 == 1]
```

Vectors can also be named

```{r}
x <- c(1,2,3)
names(x) <- c('foo','bar','baz')
x
x['bar']
```

## Matrices

A homogeneous, *two-dimensional* collection (= a vector plus dimensions)

```{r}
m <- matrix(1:9, nrow=3, ncol=3)
m
m[1,]   # first row
m[,2]   # second column
m[3,3]
m[8]    # eighth element in column-major order
```

Matrix Operations: `t(m)`, `m %*% n`, `solve(m, n)`

```{r}
t(m)
m %*% m
```

## Lists

Lists are *heterogeneous*, one-dimensional collections

```{r}
lst <- list(names=c('Jack', 'Jim'), primes=c(2, 3, 5, 7, 11))
lst
```

We can extract elements

```{r}
lst$names
lst[[2]]
```

We can extract sublists

```{r}
lst[1]
lst['primes']
```

## Data Frames

A special case of a list where all elements have the same length

* Comparable to a table in Excel
* One single element = table column
* Length of the elements = number of rows

```{r}
df <- data.frame(name = c('Jack', 'Jim'), age = c(23, 42))
df
```

Some functions that work on Matrices and Data Frames

```{r}
dim(df)
nrow(df)
ncol(df)
```

---

# An Example with Plotting

Let's first load a data set of IMDB movie ratings (from 2005)

```{r}
mvs <- read.csv('movies.csv')
```

We can get a first overview of the data set

```{r}
str(mvs)
head(mvs)
# View(mvs)
```

Now, we usually
would form some hypotheses, e.g.:

* Recent movies are more expensive
* Longer movies are more expensive
* Longer movies are better

We can get a first impression of the data via

```{r}
plot(mvs[, c('year', 'length', 'budget', 'rating')])
```

Let's take a more in-depth look at our first hypothesis

```{r}
plot(mvs[, c('year', 'budget')])
```

A box plot for each year might be clearer

```{r}
boxplot(budget~year, mvs[mvs[,'year']>1960,])
```

There clearly seems to be some correlation. We will learn in the lecture how to analyze things like this...

So what about that film from '68? Did it really cost 100 million?

```{r}
expensive = mvs[,'budget'] > 9e7
old = mvs[,'year'] < 1980
expoldmvs <- mvs[expensive&old,]
head(expoldmvs)
head(na.omit(expoldmvs))  # drop rows containing missing values
```

Other interesting queries might include

* What are the worst movies that were more expensive than 100 million dollars?

```{r}
expmvs <- na.omit(mvs[mvs[,'budget'] > 1e8,])
head(expmvs[order(expmvs[,'rating']),])
```

* How many movies have been made each year?

```{r}
plot(table(mvs[,'year']))
```

* What's the distribution of budgets of movies from 2000?

```{r}
hist(mvs[mvs[,'year']==2000,'budget'])
```

* ...

---

# Summary

* R is a programming language popular in the empirical sciences
* There is no scalar type; everything can be done with vectors/matrices
* With data frames and the built-in functions, it is easy to perform data analysis
* Pro Tip: **Use the cheat sheets!**
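
The second and third summary points can be tried out without `movies.csv`. The following is only a minimal sketch using the built-in `iris` data set (the same one passed to `str()` earlier); the names `long_petals` and `by_species` are made up for illustration and are not part of the original tutorial.

```{r}
# A single number is just a vector of length one
length(42)
is.vector(42)

# A vectorized comparison yields a logical vector that filters rows
long_petals <- iris[iris[, 'Petal.Length'] > 6, ]
nrow(long_petals)

# Built-in functions summarize a column per group
by_species <- tapply(iris[, 'Petal.Length'], iris[, 'Species'], mean)
by_species
```

The row filter follows the same pattern as `mvs[mvs[,'year']>1960,]` in the plotting example, so it should carry over to the movie data unchanged.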