The cook book

    ATTENTION ATTENTION! This online book is currently in the development stage. Its content and perhaps also its layout will change from time to time. Please do not consider it a finished work yet!

    I started working on it in June 2016 and will continue working on it whenever there is time and a mood for it.

    Another or an other book?

    It is obvious that there is already a glamorous body of books about using R, some of them with an appealing flavour of writing, some rather straightforward, some partly cryptic. R is described from so many different angles and user perspectives, one might indeed (and in the case of this book absolutely obviously) ask: Why another book?

    Don’t think I wrote this book just for the splendid reader. I am also a bit egoistic: writing things up lets one think about them and question the correctness of one’s daily work and the approaches that come to mind while using R. But there is also another reason for me to write down essential parts and patches about R and how I understand and use it. While most books arrange their content logically according to a common workflow (my main chapters actually do this, as well) or from small to big or simple to complex, this book is simply an organised collection of recipes, arranged by purpose of application. There will be only a minimal introduction to the content of each chapter and there will usually be no such thing as a didactic arrangement of chapter content. Rather, there will be task-oriented code snippets optimised for copy-paste action.

    There are some exceptions. One is the chapter Kaleidoscope, which is just a pure collection of graphs, maps and other materials produced with R that I thought would be interesting to show. At least I had great fun producing or reproducing them.

    Basics

    I got introduced to R on the backseat of a van during a four week long field excursion through the southwestern United States. With a laptop on my knees (and sufficient power from a 230 V plug) I reproduced the sparse but elegant examples from the R introduction PDF file that was installed along with R. Seven years later, I use R almost daily for an extremely wide range of purposes and consider R one of the most versatile, flexible, elegant tools I ever encountered. Well, this might have to do with where my research interests gravitated to: from the geographer's early playground, fieldwork, to abstract analysis of natural phenomena mainly driven by numeric data at different levels of quantity, quality and, of course, intended application.

    This ongoing transition is certainly reflected by the overall quality of the book and specifically by the degree of elaboration of individual chapters. Whenever I read code I wrote some years or even only months ago, I feel somewhere between depressed, amused and disgusted – for different reasons (correctness, clarity, style, and so on), but mostly simply for “why did I not do it like I would do it now?” Well, the answer is clear. But again, R does not demand a steep learning curve. There is really little overhead and the formal requirements are low compared to other languages. So this chapter just gives some very brief and mainly unjustified definitions of essential foundations and phrases of the typical R jargon that you might want to be familiar with to have fun with this book.

    Useful references that discuss the following items in a bit more detail and with more official ties to R are:

    Data types

    R supports the usual data types known from many other programs but also further, more specialised ones. Data types define how values are handled and which operations are possible with them. Different data types can be converted among each other to some degree, if certain constraints are respected that mainly arise from the hierarchy position of a data type. Conversion is always possible in the following direction:

    logical to integer to double to complex to character

    Conversion in the other direction usually results in loss of information or is only possible in exceptional situations. This is because each step up the hierarchy broadens the set of allowed values and operations. While, for example, logical data can only be TRUE or FALSE, character data can be of any alpha-numeric value. Arithmetic operations and quantitative comparisons are only possible up to the level of complex data, but not for character. It is essential to know what is possible with which kind of data type and how to convert between data types (and of course to be aware of the consequences of conversion).
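    A minimal sketch of these conversion rules (the values are chosen arbitrarily):

```r
    ## moving "up" the hierarchy keeps all information
    as.integer(TRUE)    ## 1
    as.numeric(2L)      ## 2
    as.character(3.5)   ## "3.5"
    
    ## moving "down" may lose information or fail
    as.integer(3.9)                       ## 3, the decimal part is truncated
    suppressWarnings(as.numeric("text"))  ## NA
```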

    The following list contains the commonly experienced data types in R (the ones shown above) as well as some further data types that are also encountered and deserve a brief discussion:

    • Integer and Double, i.e., numeric. Integer values are numeric values without decimal extension. Double values are numeric values with a decimal extension. Practically, R does not make a difference between these two data types; it handles both as numeric data types. To convert a double object to an integer object, use as.integer(). This will truncate a fractional number rather than rounding it. Hence, it may be necessary to round first (as.integer(round(x, digits = 0))). A double object can be generated by numeric() and converted to from another data type by as.numeric(). Whether an object’s data type is double or integer can be queried with is.numeric() or is.integer(). More generally, the data type can be queried with typeof().

    • Complex. Complex objects contain a real and an imaginary part, both of them numeric. Complex data can be generated by complex(). To extract the real part use Re(), to extract the imaginary part use Im(). Whether an object’s data type is complex can be queried with is.complex(). Objects can be converted to complex with as.complex().

    • Logical. Logical data can take two values, TRUE (1) or FALSE (0), and are usually the result of a logical operation using Boolean operators. It is possible to do calculations with them, i.e., with their representations by 1 and 0. Actually, data of type logical can also be NA, though this will rarely be of practical use. For the lazy ones, R supports specifying T instead of TRUE and F instead of FALSE. To create logical objects use logical(). To convert any data type to logical use as.logical(). Note, however, that all values will be converted to TRUE except for 0 (FALSE) or NA (NA). To query if an object is logical use is.logical() – the return value is itself a logical value.

    • Character. Literally a character is only a single letter, number, symbol, digit and so on, i.e., the smallest addressable unit possible. In R (and many other software/languages) the data type character means a string (also if it contains only one character) of alpha-numeric characters. The data type character is the most open data type in R; nearly everything can be converted to character. However, this comes at the cost of reduced possibilities of data treatment. You cannot do any calculations with characters (e.g., "1" + "1" will give an error). Objects of type character are denoted by double quotation marks ("") or single quotation marks (''), though the latter is less common but may be needed when using paste() to generate long text strings with quotation marks inside them. Character objects can be created using character(), queried by is.character() and objects can be converted to characters using as.character().

    • Time/Date. Handling dates and times is a wide field (and accordingly deserves a chapter of its own). In R, these data types are represented by the POSIX (Portable Operating System Interface) format. There are two different classes: POSIXct (ct stands for calendar time) and POSIXlt (lt stands for local time). POSIXct values are nothing else than a numeric description of the number of seconds that have passed since a given reference date (in the case of R, 1970-01-01 00:00:00 UTC). Hence, you can do arithmetics and any other calculations as with other data of type numeric. However, specifying and displaying dates requires a long string, denoting year, month, day, hour, minute and second (maybe with fractions) in a predefined format. Querying the data type is performed with inherits(x, "POSIXct") (base R has no is.POSIXct()) and conversion to it is done by as.POSIXct(). Defining POSIXct formats is more complicated. To convert between different character representations of POSIX data use the function strptime(). To extract parts of a POSIX date and return them as character string use strftime() (a wrapper for format.POSIXlt()). For more information see chapter Time series.

    • NA, NaN, NULL. Missing values in the statistical sense, that is, variables whose value is not known, have the value NA. The default type of NA is logical, unless coerced to some other type. Testing for NA is done using is.na(). To specify an explicit string "NA", use as.character(NA) rather than "NA". Numeric calculations whose result is undefined, such as 0/0, produce the value NaN. This exists only in the double type and for real or imaginary components of the complex type. The function is.nan() is provided to check specifically for NaN; is.na() also returns TRUE for NaN. Coercing NaN to logical or integer type gives an NA of the appropriate type, but coercion to character gives the string "NaN". NaN values are incomparable, so tests of equality or collation involving NaN will result in NA. They are regarded as matching any NaN value (and no other value, not even NA) by match(). There is a special object called NULL. It is used whenever there is a need to indicate or specify that an object is absent. It should not be confused with a vector or list of zero length. The NULL object has no type and no modifiable properties. There is only one NULL object in R, to which all instances refer. To test for NULL use is.null(). You cannot set attributes on NULL. NULL can be used, for example, to remove list elements by assigning them NULL.

    • Further types. There are, for example, raw and user-defined types, which are actually derivatives of the six atomic data types of R. They are thus not described in this scope.
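    A few of the points above, condensed into a short example:

```r
    x <- 2.7
    
    ## truncation vs. rounding
    as.integer(x)                     ## 2, fraction truncated
    as.integer(round(x, digits = 0))  ## 3
    
    ## query the data type
    typeof(x)      ## "double"
    is.numeric(x)  ## TRUE
    
    ## logical values behave as 1 and 0 in arithmetic
    sum(c(TRUE, FALSE, TRUE))  ## 2
    
    ## NaN is (also) NA, but NA is not NaN
    is.na(NaN)   ## TRUE
    is.nan(NA)   ## FALSE
```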

    Data structures

    Data structures are perhaps unknown to users of spreadsheet software, simply because spreadsheet software does not care about data structures, it uses spaghetti data. The structure of data defines how the values are organised and, thus, can be accessed (or indexed, see next chapter). Obviously, data structures are only relevant if the data contain (significantly) more than one value.

    There is one paramount function that gives access to how data is structured: str(). It returns not just the name of the specific data structure but also information about the contained object names, included data types, dimensions and the first few values. Thus, knowing and using str() is essential to avoid pitfalls and wrong assumptions when writing scripts. Somewhat similar to str() is the function dim(), which returns (or sets) the dimensions of objects, i.e., converts some data structures to other data structures.

    Data structures can be converted into each other to a certain degree, i.e., whenever the conversion does not violate any of the constraints of the target data structure. Conversion is performed similarly to the conversion of data types, with the function as.DATASTRUCTURE(), where DATASTRUCTURE denotes the target data structure. To query the current data structure use is.DATASTRUCTURE(), which returns a logical value. Alternatively, and more universally, class() can be used.
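    For example, str(), dim() and class() at work on a small matrix:

```r
    ## create a small example matrix
    X <- matrix(data = 1:6, nrow = 2)
    
    ## str() shows type, dimensions and the first values
    str(X)  ## int [1:2, 1:3] 1 2 3 4 5 6
    
    ## query dimensions and class
    dim(X)    ## 2 3
    class(X)  ## "matrix" "array" (in R >= 4.0)
    
    ## assigning dim() turns a vector into a matrix
    x <- 1:6
    dim(x) <- c(2, 3)
    is.matrix(x)  ## TRUE
```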

    Data structures follow a more or less logical order from simple but constrained to complex but lax:

    vector to matrix (to array) to list to data frame.

    There are further data structures commonly encountered in R. The most common and important ones are S4-objects. Actually, a user can define their own data structure simply by renaming the data structure using class(). However, all these “derivative” data structures, including S4-objects, consist of the five structures listed above.

    • Vector. Vectors are the most primitive data structure in R. They are one-dimensional objects (m:1) of length m, i.e., they contain m elements organised as m rows in one column. There are no zero-dimensional data structures (scalars) in R; an object of length 1 is still a vector. Vectors can either be atomic vectors or lists. This is a bit confusing, because lists are a data structure by themselves. Throughout this book the term vector is used to refer to atomic vectors, not lists. To test if an object is a vector, use the function is.vector(). However, to be sure, it is more appropriate to use either is.atomic() or is.list(). Vectors can be of any data type but this type must be used consistently. This means a vector cannot contain a mix of numeric and character values. Accordingly, if a data structure with mixed data types is converted into a vector, the data type is coerced. Vectors are created with functions like c() (combine or concatenate) or seq() (sequence). The number of elements of a vector (its length) can be queried and changed with length(). Each element of a vector can be labelled with names. These names can be queried with names().

    • Factor. Factors denote vectors with nominal values. Each element of a factor is stored as an integer value. Additionally, a factor comprises a further internal vector that denotes the names associated with the integer values. Thus, factors are actually integer vectors with the values referring to the set of provided names. Factors can be useful when dealing with large amounts of categorical data but they can also be a pain in the neck when importing, for example, ASCII data without explicitly setting the option to convert (or not convert) the imported data to factors. The usual consequences are unexpected behaviours of functions like c(), nchar() or sum().

    • Matrix. Adding one more dimension leads to a matrix, i.e., a two-dimensional data structure (m:n). Matrices consist of values organised in m rows and n columns. Thus, their length is the product of m and n. Matrices are a frequently used data structure for many algebraic operations with accordingly optimised algorithms (like t(), transpose, and diag(), matrix diagonal). To query the number of rows use nrow(), the number of columns is returned by ncol() and length() provides the total number of elements. To test if an object is a matrix use is.matrix(). Like vectors, matrices can only be of one consistent data type. Matrices are created from scratch with matrix(). They can also be created by binding vectors, either row-wise (rbind()) or column-wise (cbind()). Likewise, it is possible to use dim() to convert other data structures to matrices. Names of matrix rows are queried and set by rownames() and likewise for columns with colnames().

    • Data frame.

    • List.

    • S4-object.
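    The vector, factor and matrix properties described above can be sketched as follows:

```r
    ## vector: one consistent data type
    v <- c(1, 2, 3)
    is.atomic(v)  ## TRUE
    c(1, "a")     ## mixing types coerces to character: "1" "a"
    
    ## factor: integer codes plus a set of labels
    f <- factor(c("low", "high", "low"))
    typeof(f)  ## "integer"
    levels(f)  ## "high" "low"
    
    ## matrix: two dimensions, one data type
    M <- rbind(v, 4:6)
    nrow(M)    ## 2
    length(M)  ## 6
```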

    Indexing R-objects

    Importing/reading data

    Building blocks

    Logical (Boolean) operators

    Logical operations (“Is x greater than y?”) are possible with all data types. They are performed with Boolean operators and will usually return logical values (TRUE or FALSE). In R, Boolean operators are different from their literal version as it is usually implemented in spreadsheet software:

    Logical operator Implementation in R Example
    greater than > 1 > 2
    smaller than < 2 < 1
    greater or equal >= FALSE >= 0
    smaller or equal <= "a" <= "b"
    equals == 0 == round(0.1, 0)
    AND & or && 1:5 > 1 & 1:5 <= 3
    OR | or || x == TRUE | x > 0
    negation ! "a" != "b"

    Note that AND and OR can be applied in two different forms, using & and using &&. The operators with a single symbol are the vectorised versions. Thus, they evaluate the operation for each element of the input vectors and return a corresponding output vector. The operators with two symbols evaluate only the first element of each vector. Note, however, that since R 4.3.0 using && or || with operands of length greater than one is an error, so the second example below only works in older versions of R:

    1:5 < 4 & 1:5 > 2
    ## [1] FALSE FALSE  TRUE FALSE FALSE
    1:5 < 4 && 1:5 > 2
    ## [1] FALSE

    loops, BOOLEAN algebra, conditionals

    Good practice

    A clean start

    Structuring code

    Commenting

    Clearing the workbench

    Useful functions

    quote Hadley that it is good to read R code (your own, from books and from forums) like you would read a journal article or newspaper articles to stay awake, learn new functions and discover other peoples approaches.

    There is a fundamental body of functions that occur in almost every R script or function. These are functions you should have in your mind, along with their arguments and the way they work, without needing to look at their documentation. This must be your basic vocabulary, your everyday communication toolbox.

    Tweaking and optimising

    Converting data structures

    Sometimes it becomes necessary to convert data structures between each other. While there is always the possibility to do this in a loop, filling the predefined output object, there are also more elegant ways to do this.

    Vector to matrix to vector

    To convert a vector to a matrix, it can simply be bound row-wise or column-wise (rbind() or cbind()). The other way around depends on what is desired. If the entire matrix shall be converted to a vector, use as.numeric() (or convert to any other data type, see data types). This will convert the matrix column-wise (i.e., combine all values from column 1 to column n). If the matrix needs to be converted row-wise, transpose it first. If only selected rows of the matrix shall be converted, index them (see data structures).

    Since a data frame can be very similar to a matrix and can always be created from the latter, conversions of the style vector to data frame to vector are an analogue to the above.

    ## create example data set
    X <- rbind(1:5, 6:10)
    
    ## convert matrix to vector (column-wise)
    x <- as.numeric(X)
    
    ## convert matrix to vector (row-wise)
    x <- as.numeric(t(X))
    
    ## convert only the first row
    x <- X[1,]

    Vector to list to vector

    Vectors can be converted to lists either with list() or as.list(), which have fundamentally different results. While list() returns a list of length one with the input vector being the one and only list element, as.list() returns a list of the same length as the input vector, where each element of the list is the ith element of the vector.

    To collapse all elements of a list to one vector, use unlist(). To transfer the list content element-wise use do.call() (see below).

    ## create example data set
    x <- 1:3
    
    ## convert vector to list with one element
    list(x)
    ## [[1]]
    ## [1] 1 2 3
    ## convert vector to list by elements
    as.list(x)
    ## [[1]]
    ## [1] 1
    ## 
    ## [[2]]
    ## [1] 2
    ## 
    ## [[3]]
    ## [1] 3
    ## collapse list content to one vector
    unlist(list(x))
    ## [1] 1 2 3

    Matrix to list to matrix

    To convert the rows or columns of a matrix to a list, it is easiest (and fastest) to convert the matrix to a data frame first and then the data frame to a list. If the matrix is supposed to be converted row-wise, it can simply be transposed before converting it to a data frame.

    The corresponding back-conversion uses the function do.call(), provided with the operation to be performed (in this case either rbind or cbind) and the object to which the action shall be applied.

    ## create example data set
    X <- matrix(data = 1:10, 
                nrow = 5)
    
    ## convert matrix col-wise to list
    X_col <- as.list(as.data.frame(X))
    
    ## convert matrix row-wise to list
    X_row <- as.list(as.data.frame(t(X)))
    
    ## convert list to matrix, colwise
    X <- do.call(cbind, X_col)
    
    ## convert list to matrix, row-wise
    X <- do.call(rbind, X_row)

    Fighting the loops

    Vectorisation

    One of the most frequently repeated sentences in the R community is to vectorise your code instead of using loops. Although for()-loops appear reasonable from a user’s view, they are pretty slow and, as Hadley Wickham puts it, not very expressive. R has dedicated alternatives to for()-loops, each designed for its specific task and the data structure to handle.

    Some operations are already vectorised without one explicitly thinking about it. To add, for example, the elements of two vectors a <- 1:10 and b <- 1:10 one could write c <- numeric(10); for(i in 1:10) {c[i] <- a[i] + b[i]} but intuitively one would simply write a + b, the vectorised form.

    For most structures more complex than vectors, there is the apply()-family. Sadly, it is apply() itself that is not really a vectorised solution but a wrapper for for()-loops (don’t believe it? Type apply in the R console to inspect the code). The really vectorised alternatives to apply() are introduced in the chapter The apply family, ordered by the task one wishes to carry out with them rather than by their name.

    For some of the most commonly encountered tasks, like calculating row-wise or column-wise sums or averages, there are predefined vectorised functions that should or could be used instead of loops. They are usually comparably fast as correctly used apply()-variants but most certainly faster than using apply() or using loops. Some of those specialised functions are discussed in the chapter Some vectorised functions for specific purposes.

    Here is a brief example comparing the evaluation times of calculating the row-wise sums using different approaches. It uses the function microbenchmark from the microbenchmark package to infer the computation time:

    ## create example data set
    X <- matrix(data = runif(10^4), 
                nrow = 100)
    
    ## define the for()-loop approach
    f1 <- function(X) {
      x <- numeric(length = nrow(X))
      for(i in 1:length(x)) {
        x[i] <- sum(X[i,])
      }
      return(x)
    }
    
    ## define the apply function
    f2 <- function(X) {
      x <- apply(X = X, 
                 MARGIN = 1, 
                 FUN = sum)
      return(x)
    }
    
    ## define the rowSums function
    f3 <- function(X) {
      x <- rowSums(X)
      return(x)
    }
    
    ## define the lapply function
    X_list <- as.list(as.data.frame(X))
    f4 <- function(X_list) {
      x <- lapply(X = X_list, FUN = sum)
      return(x)
    }
    
    ## perform benchmark of all four approaches
    t <- microbenchmark::microbenchmark(f1(X), 
                                   f2(X), 
                                   f3(X), 
                                   f4(X_list))
    
    print(t)
    ## Unit: microseconds
    ##        expr     min       lq      mean   median       uq      max neval
    ##       f1(X) 165.619 179.3165 266.59347 193.6740 239.5990 3433.721   100
    ##       f2(X) 197.919 214.2525 341.17391 235.8895 320.9185 2335.912   100
    ##       f3(X)  33.974  36.2915  42.75499  37.8380  43.5080  143.876   100
    ##  f4(X_list)  51.107  55.9160  72.52390  63.6040  77.1620  176.262   100

    Note that for the last approach the matrix needs to be converted first to a data frame and then to a list. If this operation needs to be done each time the row-wise sum is to be evaluated, then this approach will be the slowest of all! However, when the data is already in a proper structure, lapply() is almost as fast as rowSums() and both are a factor of 3 to 7 faster than the for()-loop and apply(). Anyhow, this example is close to nonsense because calculating row-wise sums is easy. But if you need to do some more sophisticated evaluations, then rowSums() will not be enough and you need to consider one of the three other approaches.

    The apply family

    As noted above, the apply-family includes a series of functions that are handy for real vectorisation of tasks but not apply()! However, apply() might still be handy when you wish to work with matrices and don’t want to convert them into other, more suitable data structures. Beyond apply() there are the following members:

    • rapply(): I have not found a useful application for this function, yet.
    • tapply(): apply a function to subsets of a vector, defined by a second vector.
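    For example, tapply() sums a vector within groups defined by a second vector:

```r
    ## values and a grouping vector of the same length
    x <- c(1, 2, 3, 10, 20, 30)
    groups <- c("a", "a", "a", "b", "b", "b")
    
    ## sum x within each group
    tapply(X = x, INDEX = groups, FUN = sum)
    ##  a  b 
    ##  6 60
```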

    Some of the examples below were found on Stack Overflow.

    matrix – manipulation – vector

    Usually this task is intended to be performed row-wise or column-wise. For simple tasks there are predefined functions such as rowSums() and colSums(), rowMeans() and colMeans(), and so on. A more generic way is to use the apply()-function. It requires the input matrix, the MARGIN (i.e., whether the matrix shall be manipulated row-wise, 1, or column-wise, 2) and the function FUN to be applied, as well as optional further function arguments.

    ## create example data set
    X <- matrix(data = 1:10, 
                nrow = 5)
    
    ## calculate row-wise means while ignoring NA-values
    apply(X = X,
          MARGIN = 1,
          FUN = mean,
          na.rm = TRUE)
    ## [1] 3.5 4.5 5.5 6.5 7.5

    matrix – manipulation – data frame

    There is no function to do this job. If a matrix shall be manipulated and returned as a data frame, this has to be done in two steps: i) manipulation and output as matrix and ii) conversion of the matrix to a data frame.

    IS IT NOT FASTER TO FIRST CONVERT TO DATA FRAME AND THEN USE AN APPROPRIATE APPLY-FUNCTION?

    ## create example data set
    X <- matrix(letters[1:10], 
                nrow = 5, 
                byrow = TRUE)
    
    ## sort data set column-wise
    X_sort <- apply(X = X, 
                    MARGIN = 2, 
                    FUN = sort)
    
    ## convert matrix to data frame
    X_sort <- as.data.frame(x = X_sort)

    matrix – manipulation – list

    Similar to

    data frame – manipulation – vector

    data frame – manipulation – matrix

    data frame – manipulation – data frame

    data frame – manipulation – list

    list – manipulation – vector

    sapply() or, more clumsily, unlist(lapply())

    vapply() is similar to sapply() but can be tweaked to be faster when providing information about the expected output structure.
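    A short comparison of the two:

```r
    x_list <- list(a = 1:3, b = 4:6)
    
    ## sapply() guesses the output structure from the results
    sapply(X = x_list, FUN = sum)
    ##  a  b 
    ##  6 15
    
    ## vapply() is told the expected structure via FUN.VALUE,
    ## which makes it safer and often a bit faster
    vapply(X = x_list, FUN = sum, FUN.VALUE = numeric(1))
```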

    list – manipulation – matrix

    list – manipulation – data frame

    list – manipulation – list

    lapply()

    A special case is using two lists as input and returning a list object.

    ## input list A, just a list of vectors
    A <- list(c(1, 2, 3),
              c(4, 5, 6))
    
    ## input list B, scalars used for multiplication
    B <- list(1,
              2)
    
    ## list B is element-wise applied to list A; SIMPLIFY = FALSE
    ## keeps the result as a list
    C <- mapply(FUN = function(X, Y) {
          X * Y}, 
          X = A, 
          Y = B, 
          SIMPLIFY = FALSE)

    Some vectorised functions for specific purposes

    rowSums() and rowMeans() from base R, rowMedians() and the many other functions from the matrixStats package

    running or rolling functions
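    As a brief sketch, here are row-wise statistics and a simple running mean. The running mean uses stats::filter() from base R, so no extra package is needed (the matrixStats package provides further specialised functions such as rowMedians()):

```r
    ## row-wise sums and means without loops
    X <- matrix(data = 1:20, nrow = 4)
    rowSums(X)
    rowMeans(X)
    
    ## a simple centred running mean over a window of 3 values
    x <- 1:10
    stats::filter(x, filter = rep(1/3, 3), sides = 2)
    ## NA at the margins, the window means in between
```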

    Multi-core environment

    Writing a function

    Creating a package

    Visualisation

    Plots in general

    Using colours

    Interactive plots

    Animations and 3D

    Statistics

    Spatial data

    Time series

    Time and date formats

    The POSIX (Portable Operating System Interface) standard has been introduced to guarantee compatibility between different operating systems. In R, dates and times are represented by classes that follow this standard (POSIXct and POSIXlt).

    check these functions/types:

    ISOdate(), ISOdatetime(), strftime(), strptime(), date(), difftime(), julian(), months(), quarters(), weekdays()

    strptime() to convert between different character representations of these types.

    strftime() (a wrapper for format.POSIXlt and format.POSIXct) to extract parts of POSIX times and return them as character strings. Further wrappers for convenience are weekdays(), julian() and quarters().
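    The functions above can be sketched as follows (all dates and formats are arbitrary examples):

```r
    ## create a POSIXct object from a standard character string
    t1 <- as.POSIXct("2017-05-01 12:30:00", tz = "UTC")
    
    ## POSIXct is numeric under the hood: seconds since 1970-01-01 UTC
    as.numeric(t1 + 60) - as.numeric(t1)  ## 60
    
    ## parse a non-standard character representation
    t2 <- strptime("01.05.2017 12:30", format = "%d.%m.%Y %H:%M", tz = "UTC")
    
    ## extract parts as character strings
    strftime(t1, format = "%Y-%m", tz = "UTC")  ## "2017-05"
    weekdays(t1)
```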

    Signal processing

    Kaleidoscope

    Converting a sound file to a wordcloud

    Would it not be great to visualise the summary of a speech or lecture? All that is needed are two R packages and a few minutes of patience. Thanks to my work I found this website when I googled for how to make sound files from seismic records (by the way, another chapter): https://cran.r-project.org/web/packages/transcribeR/vignettes/Transcribing_audio_with_transcribeR.html

    Following this vignette we end up with a script like this one.

    ## load libraries
    library("transcribeR")
    library("wordcloud")
    
    ## set key and input data
    API_KEY <- "61c3d762-3fdd-445e-841a-baa460f2b26c"
    WAV_DIR <- "~/Desktop/SPEECH/input/"
    CSV_LOCATION <- "~/Desktop/SPEECH/output/text_out.txt"
    
    ## send audio file to the web
    sendAudioGetJobs(wav.dir = WAV_DIR,
                     api.key = API_KEY,
                     csv.location = CSV_LOCATION)
    
    ## be patient, actually much more than the 60 s suggest
    Sys.sleep(60)
    
    ## check/retrieve the transcibed text
    x <- retrieveText(job.file = CSV_LOCATION,
                      api.key = API_KEY)
    
    ## make the wordcloud of the text
    w <- wordcloud(words = x$TRANSCRIPT)
    ## Loading required package: tm
    ## Loading required package: NLP

    Now let us pull this a bit apart. We need the libraries for sending and retrieving the words from speech recognition (library("transcribeR")) and for building a word cloud from a text file (library("wordcloud")).

    transcribeR simply manages the task of sending a sound file in the *.wav format to the HPE website that does the conversion, and queries whether the file has been processed in order to then return it. In order to use transcribeR you need an account at the HPE website (https://www.havenondemand.com/signup.html). Then you will get the required API key.

    Next, your .wav-files to be processed must be present in an input directory. Note, you don’t specify a file but a directory. So make sure the directory only contains what you wish to be transcribed. If you only have .mp3-files… Well, there are many websites that handle the conversion job, e.g., http://online-audio-converter.com/ or whatever software on your computer.

    Next, specify a text file where the transcription output will be stored. Usually this will be an empty file. transcribeR will write some header information and then the transcribed words when you evaluate these two functions with a significant pause between them. Allow at least as much time as the sound file plays, to account for upload, transcription, post-processing and so on.

    Finally, the wordcloud can be created from the transcription part, being isolated by x$TRANSCRIPT.

    sendAudioGetJobs(wav.dir = WAV_DIR,
                     api.key = API_KEY,
                     interval = "-1",
                     encode = "multipart",
                     existing.csv = NULL,
                     csv.location = CSV_LOCATION,
                     language = "en-US",
                     verbose = TRUE)
    
    x <- retrieveText(job.file = CSV_LOCATION,
                 api.key = API_KEY)

    The function sendAudioGetJobs() also allows other than American English to be transcribed. Use printLanguages() to see the supported languages. Reading the documentation for usage of the argument interval might also be worth the time. The wordcloud can also be modified, e.g., in terms of colours, the minimum count of a word to be included, the overall number of words, etc. By the way, I used the brilliant speech of George W. Bush (wow I made five typos while typing this name) about the ultimatum to Saddam Hussein for this wordcloud.

    Now, the rest is imagination. For example, writing a wrapper that submits snippets of speech in 30 s slices and builds a wordcloud as a speech is being held. Or summarising a lecture of 90 minutes in just a few words. Or simply mapping the essence of a card game evening with friends – given they all speak clearly enough, in a language supported by transcribeR.