6 Scraping of tables & dynamic websites
6.1 Scraping of tables
In web scraping, we will often pursue the goal of transferring the extracted
data into a tibble or data frame in order to be able to analyse it further. It
is particularly helpful if the data we are interested in is already stored in an
HTML table, because rvest allows us to read out complete tables quickly and
easily with the function html_table().
As a reminder, the basic structure of the HTML code for tables is as follows:
<table>
<tr> <th>#</th> <th>Tag</th> <th>Effect</th> </tr>
<tr> <td>1</td> <td>"b"</td> <td>bold</td> </tr>
<tr> <td>2</td> <td>"i"</td> <td>italics</td> </tr>
</table>
The <table> tag encompasses the entire table. Rows are defined by <tr>, column
headings by <th>, and cells by <td>.
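This structure can be tried out without scraping a live page – a small sketch, assuming rvest's minimal_html() helper, which parses an HTML fragment from a string:

```r
library(rvest)

# Parse the example table from an inline HTML string
html <- minimal_html('
<table>
  <tr> <th>#</th> <th>Tag</th> <th>Effect</th> </tr>
  <tr> <td>1</td> <td>"b"</td> <td>bold</td> </tr>
  <tr> <td>2</td> <td>"i"</td> <td>italics</td> </tr>
</table>')

html %>%
  html_node("table") %>%
  html_table()
```

The result is a small data frame with the columns "#", "Tag" and "Effect", extracted from the fragment just as it would be from a full page.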
Before we start scraping, we load the necessary packages as usual:
library(tidyverse)
library(rvest)
6.1.1 Table with CSS selectors from Wikipedia
On the Wikipedia page on “CSS”, there is also a table with CSS selectors. This is our scraping target.
First we parse the website:
website <- "https://en.wikipedia.org/wiki/CSS" %>%
  read_html()
If we look at the source code and search – CTRL+F – for “<table”, we see that this page contains a large number of HTML tables. These include not only the elements that are recognisable at first glance as “classic” tables, but also, among other things, the “info boxes” at the top right edge of the article or the fold-out lists of further links at the bottom. If you want to look at this more closely, the Web Developer Tools can be very helpful here.
Instead of simply selecting all <table> nodes on the page, one strategy might
be to use the WDTs to create a CSS selector for that specific table:
"table.wikitable:nth-child(28)". We thus select the table of class "wikitable"
which is the 28th child of the parent hierarchy level –
<div class="mw-parser-output">.
If we only want to select a single HTML element, it can be helpful to use the
function html_node() instead of html_nodes().
node <- website %>%
  html_node(css = "table.wikitable:nth-child(28)")

nodes <- website %>%
  html_nodes(css = "table.wikitable:nth-child(28)")

nodes
## {xml_nodeset (1)}
## [1] <table class="wikitable"><tbody>\n<tr>\n<th>Pattern</th>\n<th>Matches</th ...

node
## {html_node}
## <table class="wikitable">
## [1] <tbody>\n<tr>\n<th>Pattern</th>\n<th>Matches</th>\n<th>First defined<br>i ...
The difference lies mainly in the type of object the function returns,
recognisable by the entry inside the { } in the output. In the first case, we
get a list of HTML elements – an "xml_nodeset" – even if this list, as here,
consists of only one entry. html_node() returns the HTML element itself – an
"html_node" – as the function's output. Why is this relevant? In many cases it
can be easier to work directly with the HTML element instead of a list of HTML
elements, for example when transferring tables into data frames and tibbles,
but more on that later.
To read out the table selected in this way, we only need to apply the function
html_table() to the HTML node.

css_table_df <- node %>%
  html_table()

css_table_df %>%
  head(n = 4)
## Pattern
## 1 E
## 2 E:link
## 3 E:active
## 4 E::first-line
## Matches
## 1 an element of type E
## 2 an E element is the source anchor of a hyperlink of which the target is not yet visited (:link) or already visited (:visited)
## 3 an E element during certain user actions
## 4 the first formatted line of an E element
## First definedin CSS level
## 1 1
## 2 1
## 3 1
## 4 1
The result is a data frame that contains the scraped contents of the HTML table
and adopts the column names stored in the <th> tags for the columns of the
data frame. Due to the very long cells in the “Matches” column, the output of
the RStudio console is unfortunately not particularly helpful. Another advantage
of using tibbles instead of data frames is that long cell contents are
automatically abbreviated in the output. To convert a data frame into a tibble,
we can use the function as_tibble().

css_table_tbl <- node %>%
  html_table() %>%
  as_tibble()

css_table_tbl %>%
  head(n = 4)
## # A tibble: 4 x 3
## Pattern Matches `First definedin CSS…
## <chr> <chr> <int>
## 1 E an element of type E 1
## 2 E:link an E element is the source anchor of a hype… 1
## 3 E:active an E element during certain user actions 1
## 4 E::first-l… the first formatted line of an E element 1
6.1.2 Scraping multiple tables
It could also be our scraping goal to scrape not only the first, but all four
content tables of the Wikipedia article. If we look at the four tables in the
source code and/or the WDTs, we see that they all carry the class "wikitable".
This allows us to select them easily. Please note that the function
html_nodes() must be used again, as we no longer need just one element, but a
list of several selected elements.

tables <- website %>%
  html_nodes(css = "table.wikitable") %>%
  html_table()
The result is a list of four data frames, each of which contains one of the four tables. If we want to select an individual data frame from the list, for example, to transfer it into a new object, we have to rely on subsetting.
We learned about basic subsetting for vectors using [#] in chapter 2. For
lists, things can get a little more complicated.
There are basically two ways of subsetting lists in R: list_name[#] and
list_name[[#]]. The most relevant difference for us is what kind of object R
returns to us. In the first case, the returned object is always a list, even if
it may only consist of one element. Using double square brackets, on the other
hand, returns the single element directly. So the difference is not dissimilar
to that between html_nodes() and html_node().
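The distinction can be illustrated with a small, made-up list:

```r
my_list <- list(numbers = 1:3, words = c("a", "b"))

class(my_list[1])    ## "list" - single brackets keep the list wrapper
class(my_list[[1]])  ## "integer" - double brackets return the element itself
```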
For example, if our goal is to select the third data frame from the list of four data frames, which subsetting should we use?
tables[3] %>%
  str()
## List of 1
## $ :'data.frame': 7 obs. of 2 variables:
## ..$ Selectors : chr [1:7] "h1 {color: white;}" "p em {color: green;}" ".grape {color: red;}" "p.bright {color: blue;}" ...
## ..$ Specificity: chr [1:7] "0, 0, 0, 1" "0, 0, 0, 2" "0, 0, 1, 0" "0, 0, 1, 1" ...
tables[[3]] %>%
  str()
## 'data.frame': 7 obs. of 2 variables:
## $ Selectors : chr "h1 {color: white;}" "p em {color: green;}" ".grape {color: red;}" "p.bright {color: blue;}" ...
## $ Specificity: chr "0, 0, 0, 1" "0, 0, 0, 2" "0, 0, 1, 0" "0, 0, 1, 1" ...
In the first case, we see that we have a list of length 1, which contains a data
frame with 7 rows and 2 variables, as well as further information about these
variables. In the second case, we get the data frame directly, i.e. no longer as
an element of a list. So we have to use list_name[[#]] to directly select a
single data frame from a list of data frames.
If we are interested in selecting several elements from a list instead, this is
only possible with list_name[]. Instead of selecting an element with a single
number, we can select several with a vector of numbers in one step.

tables[c(1, 3)] %>%
  str()
## List of 2
## $ :'data.frame': 42 obs. of 3 variables:
## ..$ Pattern : chr [1:42] "E" "E:link" "E:active" "E::first-line" ...
## ..$ Matches : chr [1:42] "an element of type E" "an E element is the source anchor of a hyperlink of which the target is not yet visited (:link) or already visited (:visited)" "an E element during certain user actions" "the first formatted line of an E element" ...
## ..$ First definedin CSS level: int [1:42] 1 1 1 1 1 1 1 1 1 1 ...
## $ :'data.frame': 7 obs. of 2 variables:
## ..$ Selectors : chr [1:7] "h1 {color: white;}" "p em {color: green;}" ".grape {color: red;}" "p.bright {color: blue;}" ...
## ..$ Specificity: chr [1:7] "0, 0, 0, 1" "0, 0, 0, 2" "0, 0, 1, 0" "0, 0, 1, 1" ...
As a result, we get a list again that contains the two elements selected here.
6.1.3 Tables with NAs
What happens when we try to read a table with missing values? Consider the following example: https://webscraping-tures.github.io/table_na.html
At first glance, it is already obvious that several cells of the table are unoccupied here. Values are missing. Let’s try to read in the table anyway.
table_na <- "https://webscraping-tures.github.io/table_na.html" %>%
  read_html() %>%
  html_node(css = "table")

table_na %>%
  html_table()
## Error: Table has inconsistent number of columns. Do you want fill = TRUE?
We get a helpful error message informing us that the number of columns is not
constant across the entire table. We are also kindly offered a possible solution
directly. The function html_table() can be instructed with the argument
fill = TRUE to automatically fill rows with a differing number of columns with
NA. This stands for “Not Available” and represents missing values in R.

wahlbet <- table_na %>%
  html_table(fill = TRUE)

wahlbet %>%
  head(n = 4)
## Bundesland Wahljahr Wahlbeteiligung
## 1 Baden-Württemberg 2016.0 NA
## 2 Bayern 2018.0 NA
## 3 Berlin NA 66.9
## 4 Brandenburg 61.3 NA
As we can see, html_table() was able to fill the four cells with missing values
with NA and read the table despite the problems. However, there are two
different types of problems in the HTML source code, which the automatic repair
handles differently. Let’s first look at the source code of the first two rows:
<tr>
<td>Baden-Württemberg</td>
<td>2016</td>
<td></td>
</tr>
<tr>
<td>Bayern</td>
<td>2018</td>
</tr>
For “Baden-Württemberg”, we see that the third column is created in the source
code, but there is no content in this cell. html_table() could have read this
without fill = TRUE and would have filled the cell with an NA. In contrast,
for “Bayern” the cell is completely missing. This means that the second row of
the table consists of only two columns, while the rest of the table has three
columns. This is the problem that the error message pointed out to us.
Nevertheless, R was able to draw the correct conclusion in both cases and fill
the cell in both rows with an NA.
But let’s also look at the third and fourth rows in the source code:
<tr>
<td>Berlin</td>
<td></td>
<td>66.9</td>
</tr>
<tr>
<td>Brandenburg</td>
<td>61.3</td>
</tr>
The second column is missing in both cases. In the first case it is created but
empty, in the second it does not exist. In the first case, html_table() can
again handle it without any problems. For “Brandenburg”, however, the function
reaches its limits. We, as human observers, quickly realise that the last state
election in Brandenburg did not take place in the year 61.3 and that this must
therefore be the turnout. R cannot distinguish this so easily and takes 61.3 as
the value for the column “Election year” and inserts an NA for “Voter turnout”.
What to do? First of all, we should be aware that such problems exist. So if
html_table() gives this error message, we should not simply set fill = TRUE,
but try to find out why the problem exists and whether the option to have it
fixed automatically will actually get us there. If this is not the case, one
approach could be to write our own extractor function that fixes the problems
directly during scraping. However, this is a rather advanced method and outside
the scope of what is feasible in this introduction. Still, we can at least
correct the problems that arise afterwards.
Our problem lies exclusively in row four. The value in its second column must
be moved to the third column, and the second column must then itself be set to
NA. For this we need subsetting again. In the case of a data frame, we specify
the row and column in the form df[row, column] to select a cell. So we can tell
R: “Write into cell three the content of cell two, and then write NA into cell
two”.

wahlbet[4, 3] <- wahlbet[4, 2]
wahlbet[4, 2] <- NA

wahlbet %>%
  head(n = 4)
## Bundesland Wahljahr Wahlbeteiligung
## 1 Baden-Württemberg 2016 NA
## 2 Bayern 2018 NA
## 3 Berlin NA 66.9
## 4 Brandenburg NA 61.3
6.2 Dynamic websites
In the “reality” of the modern internet, we will increasingly encounter websites that are no longer based exclusively on static HTML files, but generate content dynamically. You know this, for example, in the form of timelines in social media offerings that are generated dynamically based on your user profile. Other websites may generate the displayed content with JavaScript functions or in response to input in HTML forms.
In many of these cases, it is no longer sufficient from a web scraping perspective to read in an HTML page and extract the data you are looking for, as this is often not contained in the HTML source code but is loaded dynamically in the background. The good news is that there are usually ways of scraping the information anyway.
Perhaps the operator of a page or service offers an API (Application Programming Interface). In this case, we can register for access to this interface and then get access to the data of interest. This is possible with Twitter, for example. In other cases, we may be able to identify in the embedded scripts how and from which database the information is loaded and access it directly. Or we use the Selenium WebDriver to “remotely control” a browser window and scrape what the browser “sees”.
However, all of these approaches are advanced methods that are beyond the scope of this introduction.
But in cases where an HTML file is dynamically generated based on input into an HTML form, we can read it using the methods we already know.
6.2.1 HTML forms and HTML queries
As an example, let’s first look at the OPAC catalogue of the Potsdam University Library https://opac.ub.uni-potsdam.de/ in the browser.
If we enter the term “test” in the search field and click on Search, the browser window will show us the results of the search query. But what actually interests us here is the browser’s address bar. Instead of the URL “https://opac.ub.uni-potsdam.de/”, there is now a much longer URL in the form: “https://opac.ub.uni-potsdam.de/DB=1/CMD?ACT=SRCHA&IKT=1016&SRT=YOP&TRM=test”. The first part is obviously still the URL of the website called up, let’s call this the base URL. However, the part “CMD?ACT=SRCHA&IKT=1016&SRT=YOP&TRM=test” was added to the end of the URL. This is the HTML query we are interested in here. Between the base URL and the query there are one or more components, which in this case may also differ depending on your browser. However, these are also irrelevant for the actual search query. The shortened URL: “https://opac.ub.uni-potsdam.de/CMD?ACT=SRCHA&IKT=1016&SRT=YOP&TRM=test” leads to the same result.
A query is a request in which data from an HTML form is sent to the server. In response, the server generates a new website, which is sent back to the user and displayed in the browser. In this case, the query was triggered by clicking on the “Search” button. If we understand what the components of the query do, we could manipulate it and use it specifically to have a website of interest created and parsed.
6.2.2 HTML forms
To do this, we first need to take a look at the HTML code of the search form. To understand this, you should display the source code of the page and search for “<form” or use the WDTs to look at the form and its components.
<form action="CMD"
class="form"
name="SearchForm"
method="GET">
...
</form>
HTML forms are encompassed by the <form> tag. Within the tag, one or more form
elements such as text entry fields, drop-down option lists, buttons, etc. can be
placed.
<form> itself carries a number of attributes in this example. The first
attribute of interest to us is method="GET". This specifies the method of data
transfer between client and server. It is important to note that the method
“GET” uses queries in the URL for the transmission of data and the method
“POST” does not. We can therefore only manipulate queries if the “GET” method
is used. If no method is specified in the <form> tag, “GET” is used as the
default.
The second attribute of interest to us is action="CMD". This specifies which
action should be triggered after the form has been submitted. Often the value of
action= is the name of a file on the server to which the data will be sent and
which then returns a dynamically generated HTML page to the user.
Let us now look at the elements of the form. For this, the rvest function
html_form() can be helpful.
"https://opac.ub.uni-potsdam.de/" %>%
read_html() %>%
html_node(css = "form") %>%
html_form()
## <form> 'SearchForm' (GET CMD)
## <select> 'ACT' [1/3]
## <select> 'IKT' [0/13]
## <select> 'SRT' [0/4]
## <input checkbox> 'FUZZY': Y
## <input text> 'TRM':
## <input submit> '': Suchen
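As an aside, newer rvest versions (1.0 and above) also offer helpers for filling in and submitting a form without assembling the query by hand – a sketch, assuming those helpers are available in your installed version:

```r
library(rvest)

form <- "https://opac.ub.uni-potsdam.de/" %>%
  read_html() %>%
  html_node(css = "form") %>%
  html_form()

# Fill in the search term field; all other fields keep their defaults
filled <- html_form_set(form, TRM = "test")

# Submitting would send the GET request and return the server's response
# result <- html_form_submit(filled)
```

For this chapter, however, we stick to building the query ourselves, since understanding its components is the point of the exercise.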
The output shows us in the first line the values for method= and action= as
well as the name of the form. The other six lines show that the form consists
of three <select> and three <input> elements. We also see the names of these
elements as well as the default value that is sent when the form is submitted,
as long as no other value is selected or entered.
Let’s look at some of these elements. <select> elements are drop-down lists of
options that can be selected. This is the source code for the first <select>
element in our example:
<select name="ACT">
<OPTION VALUE="SRCH">suchen [oder]
<OPTION VALUE="SRCHA" SELECTED>suchen [und]
<OPTION value="BRWS">Index blättern
</select>
The attribute name="ACT" defines the element’s name, which is used when
transmitting the data from the form via the query. The <option> tags define
the selectable options, i.e. the drop-down menu. value="" represents the
value transmitted by the form. The user is shown the text following the tag.
The default selection is either the first value in the list or – as in this
case – the option explicitly marked with the selected attribute.
The three other elements are <input> tags: input fields whose specific type is
specified via the attribute type="". These can be, for example, text boxes
(type="text") or checkboxes (type="checkbox"), but there are many more
options available. A comprehensive list can be found at:
https://www.w3schools.com/html/html_form_input_types.asp.
Here is the source code for two of the three <input> elements on the example
page:
<input type="text" name="TRM" value="" size="50">
<input type="submit" class="button" value=" Suchen ">
The first tag is of the type “text”, i.e. a text field, in this case the text
field into which the search term is entered. In addition to the name of the
element, a default value of the field is specified via value="". In this case,
the default value is an empty field. The second tag is of the type “submit”.
This is the “Search” button, which triggers the transmission of the form data
via the query when it is clicked.
6.2.3 The query
But what exactly is being transmitted? Let’s look again at the example query from above:
CMD?ACT=SRCHA&IKT=1016&SRT=YOP&TRM=test
The value of the action="" attribute forms the first part of the query and is
appended after the base URL. The value of the attribute tells the server what
to do with the other transmitted data. This is followed by a ?, which
introduces the data to be transmitted as several pairs of the name="" and
value="" attributes of the individual elements. The pairs are connected
with &.
ACT=SRCHA thus stands for the fact that the option “SRCHA” has been selected
in the element with the name “ACT”. What the values of the two other <select>
elements “IKT” and “SRT” stand for, you can work out yourself with a look into
the source code or the WDTs. The text entered in the field is transmitted as
the value of the <input type="text"> with the name “TRM” – here “test”.
The server receives the form data in this way, can then decide on the basis of
the action="" attribute – here “CMD” – how the data is to be processed, and
constructs the website accordingly, which it sends back to us and which is
displayed in our browser.
6.2.4 Manipulating the query and scraping the result
Now that we know what the components of the query mean, we can manipulate them. Instead of writing queries by hand, we should use R to combine them for us. We will also encounter the technique of manipulating URLs directly in the R code more often. So we should learn it early.
The function str_c() assembles the strings listed as arguments, i.e. sequences
of characters, into a single string. Strings stored in other R objects can also
be included. If our goal is to manipulate both the search method and the search
term, we could achieve it in this way:
base_url <- "https://opac.ub.uni-potsdam.de/"
method <- "SRCHA"
term <- "test"

url <- str_c(base_url, "CMD?ACT=", method, "&IKT=1016&SRT=YOP&TRM=", term)

url
## [1] "https://opac.ub.uni-potsdam.de/CMD?ACT=SRCHA&IKT=1016&SRT=YOP&TRM=test"
If we now change the strings stored in the method and term objects and
generate the complete URL again, these components of the query are manipulated
accordingly.
method <- "SRCH"
term <- "web+scraping"

url <- str_c(base_url, "CMD?ACT=", method, "&IKT=1016&SRT=YOP&TRM=", term)

url
## [1] "https://opac.ub.uni-potsdam.de/CMD?ACT=SRCH&IKT=1016&SRT=YOP&TRM=web+scraping"
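The same URL can also be assembled from a named vector of query parameters, which scales better when several fields change at once – a sketch using only str_c():

```r
library(stringr)

# Name/value pairs matching the form elements from above
params <- c(ACT = "SRCH", IKT = "1016", SRT = "YOP", TRM = "web+scraping")

# Join each pair with "=" and connect the pairs with "&"
query <- str_c(names(params), params, sep = "=", collapse = "&")

str_c("https://opac.ub.uni-potsdam.de/", "CMD?", query)
## [1] "https://opac.ub.uni-potsdam.de/CMD?ACT=SRCH&IKT=1016&SRT=YOP&TRM=web+scraping"
```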
The search method was set to the value “SRCH”, i.e. an “OR” search, and the search term to “web scraping”. It is important to note that no spaces may appear in the query; when the form is submitted, they are replaced by “+”. So instead of “web scraping” we have to use the string “web+scraping”.
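Rather than replacing spaces by hand, the search term can also be encoded in R – a sketch; str_replace_all() mirrors what this form does, while utils::URLencode() is the more general tool (it encodes a space as “%20”, which servers typically also accept):

```r
library(stringr)

term_raw <- "web scraping"

# Replace spaces with "+", as the form does on submission
str_replace_all(term_raw, " ", "+")
## [1] "web+scraping"

# More general percent-encoding of reserved characters
utils::URLencode(term_raw, reserved = TRUE)
## [1] "web%20scraping"
```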
As an example application, we can now have the server perform an “AND” search for the term “web scraping”, read out the HTML page generated by the server and extract the 10 titles displayed.
base_url <- "https://opac.ub.uni-potsdam.de/"
method <- "SRCHA"
term <- "web+scraping"

url <- str_c(base_url, "CMD?ACT=", method, "&IKT=1016&SRT=YOP&TRM=", term)

url
## [1] "https://opac.ub.uni-potsdam.de/CMD?ACT=SRCHA&IKT=1016&SRT=YOP&TRM=web+scraping"

website <- url %>%
  read_html()
The search results are displayed as tables in the generated HTML file. The
<table> tag has the attribute-value combination summary="hitlist", which we
can use for our CSS selector:
hits <- website %>%
  html_node(css = "table[summary='hitlist']") %>%
  html_table() %>%
  as_tibble()

hits %>%
  head(n = 10)
## # A tibble: 10 x 4
## X1 X2 X3 X4
## <lgl> <dbl> <chr> <lgl>
## 1 NA NA "" NA
## 2 NA 1 "Introduction to Data Systems : Building from Python/ Bres… NA
## 3 NA NA "" NA
## 4 NA 2 "An Introduction to Data Analysis in R : Hands-on Coding, … NA
## 5 NA NA "" NA
## 6 NA 3 "Quantitative portfolio management : with applications in … NA
## 7 NA NA "" NA
## 8 NA 4 "Automate the boring stuff with Python : practical program… NA
## 9 NA NA "" NA
## 10 NA 5 "Spotify teardown : inside the black box of streaming musi… NA
This worked, but we see that the table consists mainly of empty rows and cells.
These are invisible on the website, but are used to format the display.
Instead of repairing the table afterwards, it makes more sense to extract only
the cells that contain the information we are looking for. These are the <td>
tags with class="hit" and the attribute-value combination align="left".
On this basis, we can construct a unique CSS selector.
hits <- website %>%
  html_nodes(css = "td.hit[align='left']") %>%
  html_text(trim = TRUE)

hits %>%
  head(n = 5)
## [1] "Introduction to Data Systems : Building from Python/ Bressoud, Thomas. - 1st ed. 2020. - Cham : Springer International Publishing, 2020"
## [2] "An Introduction to Data Analysis in R : Hands-on Coding, Data Mining, Visualization and Statistics from Scratch/ Zamora Saiz, Alfonso. - 1st ed. 2020. - Cham : Springer International Publishing, 2020"
## [3] "Quantitative portfolio management : with applications in Python/ Brugière, Pierre. - Cham : Springer, [2020]"
## [4] "Automate the boring stuff with Python : practical programming for total beginners/ Sweigart, Albert. - 2nd edition. - San Francisco : No Starch Press, [2020]"
## [5] "Spotify teardown : inside the black box of streaming music/ Eriksson, Maria. - Cambridge, Massachusetts : The MIT Press, [2019]"
6.2.5 Additional resources
In order to process this information further and, for example, separate it into data on author, title, year, etc., advanced knowledge in dealing with strings is necessary, which unfortunately goes beyond the scope of this introduction. A good first overview can be found in the chapter “Strings” from “R for Data Science” by Wickham and Grolemund: https://r4ds.had.co.nz/strings.html
The appropriate “cheat sheet” is also recommended: https://raw.githubusercontent.com/rstudio/cheatsheets/master/strings.pdf
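As a small taste of what such post-processing might look like, one of the scraped entries can be split at its “/ ” separator – a rough sketch only; parsing all entries robustly needs the string tools referenced above:

```r
library(stringr)

entry <- "Quantitative portfolio management : with applications in Python/ Brugière, Pierre. - Cham : Springer, [2020]"

# Split once at the first "/ ", separating the title from author and imprint
parts <- str_split(entry, fixed("/ "), n = 2)[[1]]

parts[1]
## [1] "Quantitative portfolio management : with applications in Python"
parts[2]
## [1] "Brugière, Pierre. - Cham : Springer, [2020]"
```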