6 Scraping of tables & dynamic websites
6.1 Scraping of tables
In web scraping, we will often pursue the goal of transferring the extracted
data into a tibble or data frame in order to be able to analyse it further. It
is particularly helpful if the data we are interested in is already stored in an
HTML table, because rvest allows us to read out complete tables quickly and
easily with the function html_table().
As a reminder, the basic structure of the HTML code for tables is as follows:
<table>
<tr> <th>#</th> <th>Tag</th> <th>Effect</th> </tr>
<tr> <td>1</td> <td>"b"</td> <td>bold</td> </tr>
<tr> <td>2</td> <td>"i"</td> <td>italics</td> </tr>
</table>
The <table> tag encompasses the entire table. Rows are defined by <tr>, column
headings by <th>, and cells by <td>.
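This structure can be tried out without scraping a live page – a small sketch, assuming rvest's minimal_html() helper, which parses an HTML fragment from a string:

```r
library(rvest)

# Parse the example table from an inline HTML string
html <- minimal_html('
<table>
  <tr> <th>#</th> <th>Tag</th> <th>Effect</th> </tr>
  <tr> <td>1</td> <td>"b"</td> <td>bold</td> </tr>
  <tr> <td>2</td> <td>"i"</td> <td>italics</td> </tr>
</table>')

html %>%
  html_node("table") %>%
  html_table()
```

The result is a small data frame with the columns "#", "Tag" and "Effect", extracted from the fragment just as it would be from a full page.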
Before we start scraping, we load the necessary packages as usual:
library(tidyverse)
library(rvest)
6.1.1 Table with CSS selectors from Wikipedia
On the Wikipedia page on “CSS”, there is also a table with CSS selectors. This is our scraping target.
First we parse the website:
website <- "https://en.wikipedia.org/wiki/CSS" %>%
  read_html()
If we look at the source code and search – CTRL+F – for “<table”, we see that this page contains a large number of HTML tables. These include not only the elements that are recognisable at first glance as “classic” tables, but also, among other things, the “info boxes” at the top right edge of the article or the fold-out lists of further links at the bottom. If you want to look at this more closely, the Web Developer Tools can be very helpful here.
Instead of simply selecting all <table> nodes on the page, one strategy might
be to use the WDTs to create a CSS selector for that specific table:
"table.wikitable:nth-child(28)". We thus select the table of class "wikitable"
which is the 28th child of the parent hierarchy level –
<div class="mw-parser-output">.
If we only want to select a single HTML element, it can be helpful to use the
function html_node() instead of html_nodes().
node <- website %>%
  html_node(css = "table.wikitable:nth-child(28)")

nodes <- website %>%
  html_nodes(css = "table.wikitable:nth-child(28)")

nodes
## {xml_nodeset (1)}
## [1] <table class="wikitable"><tbody>\n<tr>\n<th>Pattern</th>\n<th>Matches</th ...

node
## {html_node}
## <table class="wikitable">
## [1] <tbody>\n<tr>\n<th>Pattern</th>\n<th>Matches</th>\n<th>First defined<br>i ...
The difference lies mainly in the type of object the function returns,
recognisable by the entry inside the { } in the output. In the first case, we
get a list of HTML elements – an "xml_nodeset" – even if this list, as here,
consists of only one entry. html_node() returns the HTML element itself – an
"html_node" – as the function's output. Why is this relevant? In many cases it
can be easier to work directly with the HTML element instead of a list of HTML
elements, for example when transferring tables into data frames and tibbles,
but more on that later.
To read out the table selected in this way, we only need to apply the function
html_table() to the HTML node.

css_table_df <- node %>%
  html_table()

css_table_df %>%
  head(n = 4)
## Pattern
## 1 E
## 2 E:link
## 3 E:active
## 4 E::first-line
## Matches
## 1 an element of type E
## 2 an E element is the source anchor of a hyperlink of which the target is not yet visited (:link) or already visited (:visited)
## 3 an E element during certain user actions
## 4 the first formatted line of an E element
## First definedin CSS level
## 1 1
## 2 1
## 3 1
## 4 1
The result is a data frame that contains the scraped contents of the HTML table
and adopts the column names stored in the <th> tags for the columns of the
data frame. Due to the very long cells in the “Matches” column, the output of
the RStudio console is unfortunately not particularly helpful. Another advantage
of using tibbles instead of data frames is that long cell contents are
automatically abbreviated in the output. To convert a data frame into a tibble,
we can use the function as_tibble().

css_table_tbl <- node %>%
  html_table() %>%
  as_tibble()

css_table_tbl %>%
  head(n = 4)
## # A tibble: 4 x 3
## Pattern Matches `First definedin CSS…
## <chr> <chr> <int>
## 1 E an element of type E 1
## 2 E:link an E element is the source anchor of a hype… 1
## 3 E:active an E element during certain user actions 1
## 4 E::first-l… the first formatted line of an E element 1
6.1.2 Scraping multiple tables
It could also be our scraping goal to scrape not only the first, but all four
content tables of the Wikipedia article. If we look at the four tables in the
source code and/or the WDTs, we see that they all carry the class "wikitable".
This allows us to select them easily. Please note that the function
html_nodes() must be used again, as we no longer need just one element, but a
list of several selected elements.

tables <- website %>%
  html_nodes(css = "table.wikitable") %>%
  html_table()
The result is a list of four data frames, each of which contains one of the four tables. If we want to select an individual data frame from the list, for example, to transfer it into a new object, we have to rely on subsetting.
We learned about basic subsetting for vectors using [#] in chapter 2. For
lists, things can get a little more complicated.
There are basically two ways of subsetting lists in R: list_name[#] and
list_name[[#]]. The most relevant difference for us is what kind of object R
returns to us. In the first case, the returned object is always a list, even if
it may only consist of one element. Using double square brackets, on the other
hand, returns the single element directly. So the difference is not dissimilar
to that between html_nodes() and html_node().
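The distinction can be illustrated with a small, made-up list:

```r
my_list <- list(numbers = 1:3, words = c("a", "b"))

class(my_list[1])    ## "list" - single brackets keep the list wrapper
class(my_list[[1]])  ## "integer" - double brackets return the element itself
```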
For example, if our goal is to select the third data frame from the list of four data frames, which subsetting should we use?
tables[3] %>%
  str()
## List of 1
## $ :'data.frame': 7 obs. of 2 variables:
## ..$ Selectors : chr [1:7] "h1 {color: white;}" "p em {color: green;}" ".grape {color: red;}" "p.bright {color: blue;}" ...
## ..$ Specificity: chr [1:7] "0, 0, 0, 1" "0, 0, 0, 2" "0, 0, 1, 0" "0, 0, 1, 1" ...
tables[[3]] %>%
  str()
## 'data.frame': 7 obs. of 2 variables:
## $ Selectors : chr "h1 {color: white;}" "p em {color: green;}" ".grape {color: red;}" "p.bright {color: blue;}" ...
## $ Specificity: chr "0, 0, 0, 1" "0, 0, 0, 2" "0, 0, 1, 0" "0, 0, 1, 1" ...
In the first case, we see that we have a list of length 1, which contains a data
frame with 7 rows and 2 variables, as well as further information about these
variables. In the second case, we get the data frame directly, i.e. no longer as
an element of a list. So we have to use list_name[[#]] to directly select a
single data frame from a list of data frames.
If we are interested in selecting several elements from a list instead, this is
only possible with list_name[]. Instead of selecting an element with a single
number, we can select several with a vector of numbers in one step.

tables[c(1, 3)] %>%
  str()
## List of 2
## $ :'data.frame': 42 obs. of 3 variables:
## ..$ Pattern : chr [1:42] "E" "E:link" "E:active" "E::first-line" ...
## ..$ Matches : chr [1:42] "an element of type E" "an E element is the source anchor of a hyperlink of which the target is not yet visited (:link) or already visited (:visited)" "an E element during certain user actions" "the first formatted line of an E element" ...
## ..$ First definedin CSS level: int [1:42] 1 1 1 1 1 1 1 1 1 1 ...
## $ :'data.frame': 7 obs. of 2 variables:
## ..$ Selectors : chr [1:7] "h1 {color: white;}" "p em {color: green;}" ".grape {color: red;}" "p.bright {color: blue;}" ...
## ..$ Specificity: chr [1:7] "0, 0, 0, 1" "0, 0, 0, 2" "0, 0, 1, 0" "0, 0, 1, 1" ...
As a result, we get a list again that contains the two elements selected here.
6.1.3 Tables with NAs
What happens when we try to read a table with missing values? Consider the following example: https://webscraping-tures.github.io/table_na.html
At first glance, it is already obvious that several cells of the table are unoccupied here. Values are missing. Let’s try to read in the table anyway.
table_na <- "https://webscraping-tures.github.io/table_na.html" %>%
  read_html() %>%
  html_node(css = "table")

table_na %>%
  html_table()
## Error: Table has inconsistent number of columns. Do you want fill = TRUE?
We get a helpful error message informing us that the number of columns is not
constant across the entire table. We are also kindly offered a possible solution
directly. The function html_table() can be instructed with the argument
fill = TRUE to automatically fill rows with a differing number of columns with
NA. This stands for “Not Available” and represents missing values in R.

wahlbet <- table_na %>%
  html_table(fill = TRUE)

wahlbet %>%
  head(n = 4)
## Bundesland Wahljahr Wahlbeteiligung
## 1 Baden-Württemberg 2016.0 NA
## 2 Bayern 2018.0 NA
## 3 Berlin NA 66.9
## 4 Brandenburg 61.3 NA
As we can see, html_table() was able to fill the four cells with missing values
with NA and read the table despite the problems. However, there are two
different types of problems in the HTML source code, which the automatic repair
handles differently. Let’s first look at the source code of the first two rows:
<tr>
<td>Baden-Württemberg</td>
<td>2016</td>
<td></td>
</tr>
<tr>
<td>Bayern</td>
<td>2018</td>
</tr>
For “Baden-Württemberg”, we see that the third column is created in the source
code, but there is no content in this cell. html_table() could have read this
without fill = TRUE and would have filled the cell with an NA. In contrast,
for “Bayern” the cell is completely missing. This means that the second row of
the table consists of only two columns, while the rest of the table has three
columns. This is the problem that the error message pointed out to us.
Nevertheless, R was able to draw the correct conclusion in both cases and fill
the cell in both rows with an NA.
But let’s also look at the third and fourth rows in the source code:
<tr>
<td>Berlin</td>
<td></td>
<td>66.9</td>
</tr>
<tr>
<td>Brandenburg</td>
<td>61.3</td>
</tr>
The second column is missing in both cases. In the first case it is created but
empty, in the second it does not exist. In the first case, html_table() can
again handle it without any problems. For “Brandenburg”, however, the function
reaches its limits. We, as human observers, quickly realise that the last state
election in Brandenburg did not take place in the year 61.3 and that this must
therefore be the turnout. R cannot distinguish this so easily and takes 61.3 as
the value for the column “Election year” and inserts an NA for “Voter turnout”.
What to do? First of all, we should be aware that such problems exist. So if
html_table() gives this error message, we should not simply set fill = TRUE,
but try to find out why the problem exists and whether the option to have it
fixed automatically will actually get us there. If this is not the case, one
approach could be to write our own extractor function that fixes the problems
directly during scraping. However, this is a rather advanced method and outside
the scope of what is feasible in this introduction. Still, we can at least
correct the problems that arise afterwards.
Our problem lies exclusively in row four. The value in its second column must
be moved to the third column, and the second column must then itself be set to
NA. For this we need subsetting again. In the case of a data frame, we specify
the row and column in the form df[row, column] to select a cell. So we can tell
R: “Write into cell three the content of cell two, and then write NA into cell
two”.

wahlbet[4, 3] <- wahlbet[4, 2]
wahlbet[4, 2] <- NA

wahlbet %>%
  head(n = 4)
## Bundesland Wahljahr Wahlbeteiligung
## 1 Baden-Württemberg 2016 NA
## 2 Bayern 2018 NA
## 3 Berlin NA 66.9
## 4 Brandenburg NA 61.3
6.2 Dynamic websites
In the “reality” of the modern internet, we will increasingly encounter websites that are no longer based exclusively on static HTML files, but generate content dynamically. You know this, for example, in the form of timelines in social media offerings that are generated dynamically based on your user profile. Other websites may generate the displayed content with JavaScript functions or in response to input in HTML forms.
In many of these cases, it is no longer sufficient from a web scraping perspective to read in an HTML page and extract the data you are looking for, as this is often not contained in the HTML source code but is loaded dynamically in the background. The good news is that there are usually ways of scraping the information anyway.
Perhaps the operator of a page or service offers an API (Application Programming Interface). In this case, we can register for access to this interface and then get access to the data of interest. This is possible with Twitter, for example. In other cases, we may be able to identify in the embedded scripts how and from which database the information is loaded and access it directly. Or we use the Selenium WebDriver to “remotely control” a browser window and scrape what the browser “sees”.
However, all of these approaches are advanced methods that are beyond the scope of this introduction.
But in cases where an HTML file is dynamically generated based on input into an HTML form, we can read it using the methods we already know.
6.2.1 HTML forms and HTML queries
As an example, let’s first look at the OPAC catalogue of the Potsdam University Library https://opac.ub.uni-potsdam.de/ in the browser.
If we enter the term “test” in the search field and click on Search, the browser window will show us the results of the search query. But what actually interests us here is the browser’s address bar. Instead of the URL “https://opac.ub.uni-potsdam.de/”, there is now a much longer URL in the form: “https://opac.ub.uni-potsdam.de/DB=1/CMD?ACT=SRCHA&IKT=1016&SRT=YOP&TRM=test”. The first part is obviously still the URL of the website called up, let’s call this the base URL. However, the part “CMD?ACT=SRCHA&IKT=1016&SRT=YOP&TRM=test” was added to the end of the URL. This is the HTML query we are interested in here. Between the base URL and the query there are one or more components, which in this case may also differ depending on your browser. However, these are also irrelevant for the actual search query. The shortened URL: “https://opac.ub.uni-potsdam.de/CMD?ACT=SRCHA&IKT=1016&SRT=YOP&TRM=test” leads to the same result.
A query is a request in which data from an HTML form is sent to the server. In response, the server generates a new website, which is sent back to the user and displayed in the browser. In this case, the query was triggered by clicking on the “Search” button. If we understand what the components of the query do, we could manipulate it and use it specifically to have a website of interest created and parsed.
6.2.2 HTML forms
To do this, we first need to take a look at the HTML code of the search form. To understand this, you should display the source code of the page and search for “<form” or use the WDTs to look at the form and its components.
<form action="CMD"
class="form"
name="SearchForm"
method="GET">
...
</form>
HTML forms are encompassed by the <form> tag. Within the tag, one or more form
elements such as text entry fields, drop-down option lists, buttons, etc. can be
placed.
<form> itself carries a number of attributes in this example. The first
attribute of interest to us is method="GET". This specifies the method of data
transfer between client and server. It is important to note that the method
“GET” uses queries in the URL for the transmission of data and the method
“POST” does not. We can therefore only manipulate queries if the “GET” method
is used. If no method is specified in the <form> tag, “GET” is used as the
default.
The second attribute of interest to us is action="CMD". This specifies which
action should be triggered after the form has been submitted. Often the value of
action= is the name of a file on the server to which the data will be sent and
which then returns a dynamically generated HTML page to the user.
Let us now look at the elements of the form. For this, the rvest function
html_form() can be helpful.
"https://opac.ub.uni-potsdam.de/" %>%
read_html() %>%
html_node(css = "form") %>%
html_form()
## <form> 'SearchForm' (GET CMD)
## <select> 'ACT' [1/3]
## <select> 'IKT' [0/13]
## <select> 'SRT' [0/4]
## <input checkbox> 'FUZZY': Y
## <input text> 'TRM':
## <input submit> '': Suchen
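As an aside, newer rvest versions (1.0 and above) also offer helpers for filling in and submitting a form without assembling the query by hand – a sketch, assuming those helpers are available in your installed version:

```r
library(rvest)

form <- "https://opac.ub.uni-potsdam.de/" %>%
  read_html() %>%
  html_node(css = "form") %>%
  html_form()

# Fill in the search term field; all other fields keep their defaults
filled <- html_form_set(form, TRM = "test")

# Submitting would send the GET request and return the server's response
# result <- html_form_submit(filled)
```

For this chapter, however, we stick to building the query ourselves, since understanding its components is the point of the exercise.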
The output shows us in the first line the values for method= and action= as
well as the name of the form. The other six lines show that the form consists
of three <select> and three <input> elements. We also see the names of these
elements as well as the default value that is sent when the form is submitted,
as long as no other value is selected or entered.
Let’s look at some of these elements. <select> elements are drop-down lists of
options that can be selected. This is the source code for the first <select>
element in our example:
<select name="ACT">
<OPTION VALUE="SRCH">suchen [oder]
<OPTION VALUE="SRCHA" SELECTED>suchen [und]
<OPTION value="BRWS">Index blättern
</select>
The attribute name="ACT" defines the element’s name, which is used when
transmitting the data from the form via the query. The <option> tags define
the selectable options, i.e. the drop-down menu. value="" represents the
value transmitted by the form. The user is shown the text following the tag.
The default selection is either the first value in the list or – as in this
case – the option explicitly marked with the selected attribute.
The three other elements are <input> tags: input fields whose specific type is
specified via the attribute type="". These can be, for example, text boxes
(type="text") or checkboxes (type="checkbox"), but there are many more
options available. A comprehensive list can be found at:
https://www.w3schools.com/html/html_form_input_types.asp.
Here is the source code for two of the three <input> elements on the example
page:
<input type="text" name="TRM" value="" size="50">
<input type="submit" class="button" value=" Suchen ">
The first tag is of the type “text”, i.e. a text field, in this case the text
field into which the search term is entered. In addition to the name of the
element, a default value of the field is specified via value="". In this case,
the default value is an empty field. The second tag is of the type “submit”.
This is the “Search” button, which triggers the transmission of the form data
via the query when it is clicked.
6.2.3 The query
But what exactly is being transmitted? Let’s look again at the example query from above:
CMD?ACT=SRCHA&IKT=1016&SRT=YOP&TRM=test
The value of the action="" attribute forms the first part of the query and is
appended after the base URL. The value of the attribute tells the server what
to do with the other transmitted data. This is followed by a ?, which
introduces the data to be transmitted as several pairs of the name="" and
value="" attributes of the individual elements. The pairs are connected
with &.
ACT=SRCHA thus stands for the fact that the option “SRCHA” has been selected
in the element with the name “ACT”. What the values of the two other <select>
elements “IKT” and “SRT” stand for, you can work out yourself with a look into
the source code or the WDTs. The text entered in the field is transmitted as
the value of the <input type="text"> with the name “TRM” – here “test”.
The server receives the form data in this way, can then decide on the basis of
the action="" attribute – here “CMD” – how the data is to be processed, and
constructs the website accordingly, which it sends back to us and which is
displayed in our browser.
6.2.4 Manipulating the query and scraping the result
Now that we know what the components of the query mean, we can manipulate them. Instead of writing queries by hand, we should use R to combine them for us. We will also encounter the technique of manipulating URLs directly in the R code more often. So we should learn it early.
The function str_c() assembles the strings listed as arguments, i.e. sequences
of characters, into a single string. Strings stored in other R objects can also
be included. If our goal is to manipulate both the search method and the search
term, we could achieve it in this way:
base_url <- "https://opac.ub.uni-potsdam.de/"
method <- "SRCHA"
term <- "test"

url <- str_c(base_url, "CMD?ACT=", method, "&IKT=1016&SRT=YOP&TRM=", term)

url
## [1] "https://opac.ub.uni-potsdam.de/CMD?ACT=SRCHA&IKT=1016&SRT=YOP&TRM=test"
If we now change the strings stored in the method and term objects and
generate the complete URL again, these components of the query are manipulated
accordingly.
method <- "SRCH"
term <- "web+scraping"

url <- str_c(base_url, "CMD?ACT=", method, "&IKT=1016&SRT=YOP&TRM=", term)

url
## [1] "https://opac.ub.uni-potsdam.de/CMD?ACT=SRCH&IKT=1016&SRT=YOP&TRM=web+scraping"
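The same URL can also be assembled from a named vector of query parameters, which scales better when several fields change at once – a sketch using only str_c():

```r
library(stringr)

# Name/value pairs matching the form elements from above
params <- c(ACT = "SRCH", IKT = "1016", SRT = "YOP", TRM = "web+scraping")

# Join each pair with "=" and connect the pairs with "&"
query <- str_c(names(params), params, sep = "=", collapse = "&")

str_c("https://opac.ub.uni-potsdam.de/", "CMD?", query)
## [1] "https://opac.ub.uni-potsdam.de/CMD?ACT=SRCH&IKT=1016&SRT=YOP&TRM=web+scraping"
```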
The search method was set to the value “SRCH”, i.e. an “OR” search, and the search term to “web scraping”. It is important to note that no spaces may appear in the query; when the form is submitted, they are replaced by “+”. So instead of “web scraping” we have to use the string “web+scraping”.
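Rather than replacing spaces by hand, the search term can also be encoded in R – a sketch; str_replace_all() mirrors what this form does, while utils::URLencode() is the more general tool (it encodes a space as “%20”, which servers typically also accept):

```r
library(stringr)

term_raw <- "web scraping"

# Replace spaces with "+", as the form does on submission
str_replace_all(term_raw, " ", "+")
## [1] "web+scraping"

# More general percent-encoding of reserved characters
utils::URLencode(term_raw, reserved = TRUE)
## [1] "web%20scraping"
```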
As an example application, we can now have the server perform an “AND” search for the term “web scraping”, read out the HTML page generated by the server and extract the 10 titles displayed.
base_url <- "https://opac.ub.uni-potsdam.de/"
method <- "SRCHA"
term <- "web+scraping"

url <- str_c(base_url, "CMD?ACT=", method, "&IKT=1016&SRT=YOP&TRM=", term)

url
## [1] "https://opac.ub.uni-potsdam.de/CMD?ACT=SRCHA&IKT=1016&SRT=YOP&TRM=web+scraping"

website <- url %>%
  read_html()
The search results are displayed as tables in the generated HTML file. The
<table> tag has the attribute-value combination summary="hitlist", which we
can use for our CSS selector:
hits <- website %>%
  html_node(css = "table[summary='hitlist']") %>%
  html_table() %>%
  as_tibble()

hits %>%
  head(n = 10)
## # A tibble: 10 x 4
## X1 X2 X3 X4
## <lgl> <dbl> <chr> <lgl>
## 1 NA NA "" NA
## 2 NA 1 "Introduction to Data Systems : Building from Python/ Bres… NA
## 3 NA NA "" NA
## 4 NA 2 "An Introduction to Data Analysis in R : Hands-on Coding, … NA
## 5 NA NA "" NA
## 6 NA 3 "Quantitative portfolio management : with applications in … NA
## 7 NA NA "" NA
## 8 NA 4 "Automate the boring stuff with Python : practical program… NA
## 9 NA NA "" NA
## 10 NA 5 "Spotify teardown : inside the black box of streaming musi… NA
This worked, but we see that the table consists mainly of empty rows and cells.
These are invisible on the website, but are used to format the display.
Instead of repairing the table afterwards, it makes more sense to extract only
the cells that contain the information we are looking for. These are the <td>
tags with class="hit" and the attribute-value combination align="left".
On this basis, we can construct a unique CSS selector.
hits <- website %>%
  html_nodes(css = "td.hit[align='left']") %>%
  html_text(trim = TRUE)

hits %>%
  head(n = 5)
## [1] "Introduction to Data Systems : Building from Python/ Bressoud, Thomas. - 1st ed. 2020. - Cham : Springer International Publishing, 2020"
## [2] "An Introduction to Data Analysis in R : Hands-on Coding, Data Mining, Visualization and Statistics from Scratch/ Zamora Saiz, Alfonso. - 1st ed. 2020. - Cham : Springer International Publishing, 2020"
## [3] "Quantitative portfolio management : with applications in Python/ Brugière, Pierre. - Cham : Springer, [2020]"
## [4] "Automate the boring stuff with Python : practical programming for total beginners/ Sweigart, Albert. - 2nd edition. - San Francisco : No Starch Press, [2020]"
## [5] "Spotify teardown : inside the black box of streaming music/ Eriksson, Maria. - Cambridge, Massachusetts : The MIT Press, [2019]"
6.2.5 Additional resources
In order to process this information further and, for example, separate it into data on author, title, year, etc., advanced knowledge in dealing with strings is necessary, which unfortunately goes beyond the scope of this introduction. A good first overview can be found in the chapter “Strings” from “R for Data Science” by Wickham and Grolemund: https://r4ds.had.co.nz/strings.html
The appropriate “cheat sheet” is also recommended: https://raw.githubusercontent.com/rstudio/cheatsheets/master/strings.pdf
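As a small taste of what such post-processing might look like, one of the scraped entries can be split at its “/ ” separator – a rough sketch only; parsing all entries robustly needs the string tools referenced above:

```r
library(stringr)

entry <- "Quantitative portfolio management : with applications in Python/ Brugière, Pierre. - Cham : Springer, [2020]"

# Split once at the first "/ ", separating the title from author and imprint
parts <- str_split(entry, fixed("/ "), n = 2)[[1]]

parts[1]
## [1] "Quantitative portfolio management : with applications in Python"
parts[2]
## [1] "Brugière, Pierre. - Cham : Springer, [2020]"
```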