3 RStudio in practice & the tidyverse
3.1 RStudio Workflow
Up until now, we wrote our code directly into the RStudio console, pressed “Enter” and received the desired output. This works but will not satisfy our needs in the long run. The main problem is that the code we wrote essentially disappears after running it. Imagine that you want to rerun your code a week from now or even tomorrow. Maybe you took notes and can recreate it, but that means a lot of unsatisfying and error-prone work. Also, at some point you may want to share code with colleagues, fellow students, or the R community in general. At the same time, as our code gets more complex, spans multiple lines and consists of many interdependent blocks, you will inevitably run into the situation where you realise you made a mistake or have to change some code at the very beginning of your R session. This would mean recreating and rerunning most or all of the code you have already written.
These are some of the reasons why we should start writing our code into so-called R Scripts.
3.1.1 R Scripts
To create a new R Script, you can click on “File” > “New File” > “R Script”, or more conveniently press “CTRL” + “Shift” + “N” simultaneously. This creates an untitled script that we can write our code into.
Let’s start with something simple by recreating some of the code from last week.
a <- 17
b <- 4
the_answer <- (a + b) * 2
the_answer
## [1] 42
We assign two numerical values to the objects a and b, assign a calculation
based on these objects to the new object the_answer and prompt R to return its
value to us. Instead of writing the code line by line into the console, now we
write the whole block into the newly created script. We can now run the complete
script by clicking on “Source” in the upper toolbar attached to the script’s
tab. In most cases I prefer running the script line by line though. This allows
full control of the process and enables you to stop in certain lines to
e.g. contemplate what the code is doing, check for errors or change details of
the code before moving on to the next line. You can do this either by clicking
on “Run” in the toolbar or pressing “CTRL” + “Enter” simultaneously. In both
cases, RStudio copies the line of code where your text cursor is currently
residing into the console and runs it for you. The text cursor then conveniently
jumps to the next line in the script. In this way you can quickly run your
script line by line, while having full control over when to stop.
You can decide for yourself what the right approach to running your code is, based on the given situation. But remember that R always assumes that you know what you’re doing. There will be no warning prompts if you are about to overwrite work you have previously done.
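To illustrate the point about missing warnings, here is a minimal sketch. It reuses the object name the_answer from the script above; the second value is made up purely for demonstration.

```r
# Assign a value to an object
the_answer <- 42

# Assigning to the same name again silently replaces the old value;
# R gives no warning and does not ask for confirmation
the_answer <- "overwritten"

the_answer
## [1] "overwritten"
```

If you run a script line by line and jump back and forth, it is easy to overwrite an object this way without noticing.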
When you are done writing your script, you might want to save it to the hard drive, preserving your work for later re-runs or for sharing. By clicking on “File” > “Save” or pressing “CTRL” + “S” you can save the file with a name of your choosing. The file extension for R Scripts is always “.R”.
One problem – that you will run into sooner or later – is that you will try to
run incomplete code from a script, most commonly a missing closing bracket. In
this case, RStudio puts the code to be run into the console and begins a new
line, starting with +, and then nothing happens. R assumes that your code will
continue in a further line and waits for you to enter it after the +. In most
cases the right approach is to cancel the entered command, fix your code and
re-run it afterwards. To cancel an already entered command, you have to click
into the “Console” tab and press “CTRL” + “C” or “Esc” on your keyboard. The >
prompt will reappear in the console and you can continue with your work.
3.1.2 Projects
In many cases, your work will consist of multiple scripts, data files, graphics saved to the disk or additional output. So it makes sense to assign your files to a place on your hard drive. You can do this “by hand” but a convenient approach might be to use RStudio’s project functionality.
By clicking on “File” > “New Project”, you can start the project creation wizard. If you have already created a folder on your hard drive that shall contain the project, you can click on “Existing Directory”, select the folder and click on “Create Project”. You can also create the folder on the fly by clicking on “New Directory” > “New Project” and then choosing a folder name and the sub-folder where it should be placed, before creating the project.
RStudio will now close all files currently open and switch to your newly created project. The name you chose for the project’s folder will also be its name, seen in RStudio’s title bar. When you look at the “Files” tab (lower right), you will also see that you are now in the project’s folder. This is your current working directory, a concept we will talk about momentarily. All scripts you create while working in your project will become a part of it. So when you want to continue your work later, you can click on “File” > “Open Project”. All files opened the last time you worked on the project will be reopened and you will again be in the project’s working directory. This is an easy and convenient way to keep your work tidy.
At this point, I would advise you to create a project for this introduction to web scraping and create R scripts for each chapter as parts of the project. The name and sub-folder you choose are not important from the point of view of functionality, but they should make sense to you.
We should now briefly talk about the working directory. If you try to open or
save a file directly from an R script – without specifying a complete path – R
will always assume you refer to your working directory. If you created a
project, this automatically set the project’s folder as the working directory.
You can always check for your current working directory by entering getwd()
into the console. You can change your current working directory by clicking on
“Session” > “Set Working Directory” > “Choose Directory…” or by using the
function setwd() with the desired path enclosed by " as the function’s
argument.
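The following sketch shows how relative paths are resolved against the working directory. It switches to R's temporary folder via tempdir() so that nothing on your system is cluttered. The object name original_dir and the file name example_note.txt are assumptions made up for this illustration.

```r
# Remember where we started, so we can return afterwards
original_dir <- getwd()

# Switch the working directory to the session's temporary folder
setwd(tempdir())

# A file name without a complete path refers to the working directory
writeLines("saved via a relative path", "example_note.txt")

# The file was indeed created inside the current working directory
file.exists(file.path(getwd(), "example_note.txt"))
## [1] TRUE

# Clean up and restore the original working directory
unlink("example_note.txt")
setwd(original_dir)
```

If you work inside an RStudio project, you will rarely need setwd() at all, as the project folder is already set as the working directory.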
3.2 R packages
The R world is open and collaborative by nature. Besides the packages that come with your R installation – base R – an ever-growing number of additional packages, written by professionals and users, is available for download by anyone. Every package is focussed on a specific use case and brings with it a number of functions that enable R to be used for tasks that the original software designers did not have in mind or at the very least provide a smoother user experience in cases where the original base R solutions are more complicated.
The packages, their documentation and various other related information are hosted at CRAN – “Comprehensive R Archive Network” – which you already got to know during the installation of R. If you install a package directly from RStudio, it uses CRAN to find and download the package and the associated files.
3.2.1 Installing and loading packages
To install a package we can use the R function install.packages() where the
name of the package to be installed is written enclosed by " between the
parentheses. Normally we do this using the console. Installing packages from
an R script works as well, but as we only need to perform the installation once,
there is no benefit in it. It actually slows things down if we repeat the
installation every time we run a script. At the same time, if we share our
script, it is impolite to force a (re)installation on somebody else.
For this introduction we will focus on the packages of the tidyverse – more on them below. To install the core tidyverse package, you should type:
install.packages("tidyverse")
R will output a lot of information concerning the installation process, and
close with a satisfying DONE (tidyverse) if everything went according to plan.
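If you are unsure whether a package is already installed, you can check this without installing or loading anything. The following is a minimal sketch under stated assumptions: missing_packages() is a hypothetical helper written for this illustration and not part of R. It relies on the base R function system.file(), which returns an empty string for packages that are not installed.

```r
# Hypothetical helper (not part of base R): return the names of packages
# that are not installed, without installing or loading anything
missing_packages <- function(pkgs) {
  is_installed <- vapply(
    pkgs,
    function(pkg) nzchar(system.file(package = pkg)),
    logical(1)
  )
  pkgs[!is_installed]
}

# "stats" ships with R; "tidyverse" is only reported if it is not installed yet
missing_packages(c("stats", "tidyverse"))
```

A check like this only reports what is missing and leaves the decision to install to whoever runs the script.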
Now that the installation is complete, we can load the package. This should
normally be done in the first lines of a script. This way all necessary packages
are loaded at the beginning of running a script and other users that see your
code also immediately see which packages are required. Loading a package is done
with library() with the name of the package in the parentheses, this time
without the need for enclosing it in ".
library(tidyverse)
Loading the tidyverse package returns a lot of information to us, some of which we will look at in more detail during the course of this chapter. Please note that not all packages are that verbose in their loading process. Often you will get no output at all which is a good sign, as this also means that the package loaded correctly. If anything goes wrong, R will return an error message.
3.2.2 Namespaces
Looking at the last lines of the returned message when loading the tidyverse
package, we’re informed that there are two conflicts. These arise when two or
more loaded packages include functions with the same name. Here we can see that
the tidyverse package dplyr masks the functions filter() and lag() from
the base R package stats. If we had used filter() without loading
dplyr, the function from the stats package would have been used. After loading
it, the function from dplyr masks the function from stats and is used instead.
If we had a case where we want to load dplyr, but still use filter() from
stats, we can still do this by explicitly declaring the namespace which we are
referring to. The namespace basically is a reference for R where to look up the
function we have called. If we just write the function’s name, R looks for it in
the list of loaded packages, which would result in applying filter() from
dplyr here. But we can tell R to look up the function in another namespace, by
using the notation namespace::function. So to call filter() from stats
while the function is masked by the identically named function from dplyr, we
could write stats::filter(). As the function will not work without further
arguments, we can’t try this out directly, but the same principle applies to
loading the help files:
?dplyr::filter()
?stats::filter()
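For completeness, here is a minimal sketch of what an explicit call to stats::filter() with arguments could look like. The example data and the object names x and moving_avg are assumptions chosen for illustration. The function applies a linear filter to a series, used here as a moving average over three values.

```r
# A short numeric series to smooth
x <- c(1, 2, 3, 4, 5)

# Explicitly call filter() from stats: a centred moving average over three values.
# This works whether or not dplyr is loaded, because the namespace is spelled out.
moving_avg <- stats::filter(x, filter = rep(1 / 3, 3))

# The first and last value have no complete window and are therefore NA
as.numeric(moving_avg)
## [1] NA  2  3  4 NA
```

Writing the namespace explicitly also makes it obvious to readers of your code which package a function comes from.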
3.3 Tidyverse
While we will use some base R functions throughout this course, our main focus will lie on the tidyverse packages.
The tidyverse is a collection of R packages, all following a shared philosophy concerning the syntax of their functions and the way in which data is represented. We will see how the philosophy underlying the tidyverse can lead to more intuitive R code, especially when using the pipe (%>%), in the next chapter. If you want to learn more about the concept of tidy data, the structure of data representation underlying the tidyverse, a read of the chapter on this concept from “R for Data Science” by Wickham & Grolemund is highly recommended: https://r4ds.had.co.nz/tidy-data.html.
Right now, the core tidyverse consists of eight packages. These are the
packages that are loaded when we type library(tidyverse) and that are listed
in the corresponding output under “Attaching packages”. As the name suggests,
the packages comprise the core functionalities that define the tidyverse. This
includes reading, cleaning and transforming data, handling certain data types,
plotting graphs and more. Over the course of this introduction to web scraping,
we will make use of several of these packages, so in most chapters we will begin
our scripts with loading the tidyverse package.
Besides the core tidyverse, a number of additional and more specialised packages
are part of the tidyverse and were already installed when you ran
install.packages("tidyverse") above. Among them, the package rvest is of
special importance to us, as it will be our main tool for web scraping
throughout the course.
For a full list of tidyverse packages and the corresponding descriptions of their functionality, you can visit: https://www.tidyverse.org/packages/
3.3.1 Tibbles
The “tibble” package is part of the core tidyverse and offers an alternative to the data frame data structure that is used in base R to represent data in tabular form. The differences between data frames and tibbles are relatively minor. If you are interested in the details, you can read up on them in this section from “R for Data Science” and the chapter on tibbles in general: https://r4ds.had.co.nz/tibbles.html#tibbles-vs.-data.frame. For now, it will suffice to know that tibbles are used throughout this introduction, but that all examples will also work with the classic data frames.
The syntax to create a tibble is simple. Every column represents a variable,
every row an observation. You should think of the columns as vectors, where the
first position in each vector corresponds to the first observation (row), the
second position in each vector to the second observation, and so on. In this way,
we can create tibbles vector by vector or variable by variable, using the
function tibble(). We assign a name to the variable followed by = and the
data to be assigned to the variable. The variable-data pairs are separated by
a comma (,):
tibble(numbers = c(0, 1, 2), strings = c("zero", "one", "two"), logicals = c(FALSE, TRUE, TRUE))
## # A tibble: 3 x 3
## numbers strings logicals
## <dbl> <chr> <lgl>
## 1 0 zero FALSE
## 2 1 one TRUE
## 3 2 two TRUE
For longer code like this, it is advisable to use multiple lines and clearer formatting to create code that is readable and intuitive:
tibble(
  numbers = c(0, 1, 2),
  strings = c("zero", "one", "two"),
  logicals = c(FALSE, TRUE, TRUE)
)
## # A tibble: 3 x 3
## numbers strings logicals
## <dbl> <chr> <lgl>
## 1 0 zero FALSE
## 2 1 one TRUE
## 3 2 two TRUE
R understands that all five lines are part of one command as it evaluates
everything between the opening and closing bracket of the tibble() function
together. We just have to make sure that we don’t miss the closing bracket or a
comma (,) that separates the variable-data pairs. This actually is a main source of
errors and will be high on your list of things to check if something does not
work as planned.
We can also use calculations and functions directly in tibble creation, circumventing the need to assign the results to an object first:
tibble(
  numbers = c(1, 2, 3),
  roots = sqrt(numbers),
  rounded = round(roots)
)
## # A tibble: 3 x 3
## numbers roots rounded
## <dbl> <dbl> <dbl>
## 1 1 1 1
## 2 2 1.41 1
## 3 3 1.73 2
Sidenote: The function sqrt() takes the square roots of the data it is applied
to.
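If you want to build the same table as a classic base R data frame, note that data.frame(), unlike tibble(), cannot refer to a column that is defined earlier in the same call. One possible approach, sketched here with the made-up object name numbers_df, is to add the computed columns step by step:

```r
# Base R: create the data frame first ...
numbers_df <- data.frame(numbers = c(1, 2, 3))

# ... then add columns that build on existing ones, one step at a time
numbers_df$roots <- sqrt(numbers_df$numbers)
numbers_df$rounded <- round(numbers_df$roots)

numbers_df
```

The resulting data frame holds the same three columns as the tibble above; only the way it is printed differs.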
3.4 Additional resources
When learning R and when using functions and packages that are new to you, you will regularly run into situations where you need help in understanding what is happening and what you can do. Luckily, there are a lot of resources that will help you on your R journey.
You have already learned about the built-in help functionalities of R. Many packages also come with so-called vignettes which offer more in-depth introductions to the functionalities of the packages. Let’s see if the tibble package comes with vignettes. To do this we can write:
vignette(package = "tibble")
We get a list of all vignettes available for the specific package. To access a
specific vignette, we also use the vignette() function, this time with the
specific name of the vignette as the function’s argument:
vignette("types")
You can also always check the CRAN page for the package in question. Here you can access the documentation as well as available vignettes, e.g.: https://cran.r-project.org/web/packages/tibble/index.html.
Another highly recommended resource is the collection of RStudio cheatsheets found at: https://www.rstudio.com/resources/cheatsheets/. These are available for many popular packages and present a comprehensive list of the functions offered by the packages.
The RStudio homepage also offers many more resources for learning R and specific packages, including a number of webinars and tutorial videos available under the menu “Resources”: https://www.rstudio.com/
In general, the internet offers a lot of resources that you can access. One of the most important skills you have to develop as an aspiring R user is to understand the problem you are facing to the best of your abilities and formulate a short but precise Google search. In most cases you can assume that you are not the first or last person to have a specific problem. Someone will have written a blogpost, asked a question on https://stackoverflow.com/, made a video tutorial, and so on. If you can find these resources, you are already halfway there.
There are also a lot of books available on R and RStudio in general, as well as on more specific applications in R. I want to recommend three of them in particular, available as paperback or online:
Intro to R for Social Scientists by Jasper Tjaden. An accessible introduction to R that expands on the concepts only touched here. Written for a seminar at the University of Potsdam in summer 2021. Available under: https://jaspertjaden.github.io/course-intro2r/
R Cookbook, 2nd Edition by J.D. Long & Paul Teetor. The book consists of recipes for specific tasks you might want to perform. It is not designed as a course but rather as a reference for concrete questions. Available under: https://rc2e.com/
R for Data Science by Hadley Wickham and Garrett Grolemund. An introduction to data science using (almost) exclusively the tidyverse packages. Available under: https://r4ds.had.co.nz/
3.1.3 Comments
You should get into the habit of commenting your code as early as possible. Comments are started with one or multiple #. Everything following the # on the same line will not be evaluated by R and thus serves as the perfect place to comment on what you were doing and thinking while writing the code. Why do this? When you reopen a script that you have not been working on in a while, it can be hard to understand what you tried to do in the first place. Commented code makes this much easier. This is even more true if you share your code with other people. They may have very different approaches to certain R problems and clearly commented code will help them to quickly understand it. You should see this as a sign of respect towards the time your peers may invest in helping you with your coding problems.
If you plan on using setwd() in your script, it is a good idea to comment out this line before sharing your script. Other people will have different folder structures and will want to decide for themselves. The same goes for all lines that will save something to the hard drive, e.g. data sets or exported graphics. The R and RStudio communities are very welcoming and you will always find people that are willing to lend you their help, so you should return the favour and be polite in your code. This includes writing clear comments and not cluttering anyone’s hard drive with files they may not want to have.
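A commented version of the small script from the beginning of this chapter could look like the following sketch. The path and the file name answer.csv are placeholders made up for illustration.

```r
# Set the working directory.
# Commented out before sharing; adjust the path to your own folder structure.
# setwd("path/to/your/project")

# Assign two numbers and compute a result from them
a <- 17
b <- 4
the_answer <- (a + b) * 2  # add first, then double

# Saving to the hard drive is commented out as well,
# so nothing is written unless the reader decides to do so
# write.csv(data.frame(answer = the_answer), "answer.csv")

# Return the result
the_answer
## [1] 42
```

Anyone who wants to run the commented lines only has to remove the leading # from them.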