7 Scraping of multi-page websites
In many cases, we do not want to scrape the content of a single page, but of several sub-pages in one step. In this session we will look at two common variations: index pages and pagination.
7.1 Index-pages
An index page, in this context, is a website on which links to the various sub-pages are listed. We can think of this as a table of contents.
The website for the tidyverse packages serves as an example:
https://www.tidyverse.org/packages/.
Under the heading “Core tidyverse”, the eight packages that are loaded in R with library(tidyverse) are listed. In addition to the name and icon, a short description of the package and a link to further information are part of the list.
Let’s look at one of the sub-pages for the core packages. Since they all have the same structure, you can choose any package as an example. It could be our scraping goal to create a table with the names of the core packages, the current version number, and the links to CRAN and the matching chapter in “R for Data Science” by Wickham and Grolemund. By now we have all the tools to extract this data from the websites. We could now “manually” scrape the individual sub-pages and merge the data. It would be more practical, however, if we could start from the index page and scrape all eight sub-pages and the data of interest they contain in one step. This is exactly what we will look at in the following.
7.1.1 Scraping of the index
library(tidyverse)
library(rvest)
As a first step, we need to extract the links to the sub-pages from the source code of the index page. As always, we download the website and parse it.
<- "https://www.tidyverse.org/packages/" %>%
website read_html()
In this case, the links are stored twice in the source code: in one case the image of the icon is linked, in the other the name of the package. You can follow this in the source code and/or with the WDTs yourself by now. However, we need each link only once. One of several ways to select them is to select the <a> tags that directly follow the individual <div class="package"> tags.
a_nodes <- website %>%
  html_nodes(css = "div.package > a")

a_nodes
## {xml_nodeset (8)}
## [1] <a href="https://ggplot2.tidyverse.org/" target="_blank">\n <img class ...
## [2] <a href="https://dplyr.tidyverse.org/" target="_blank">\n <img class=" ...
## [3] <a href="https://tidyr.tidyverse.org/" target="_blank">\n <img class=" ...
## [4] <a href="https://readr.tidyverse.org/" target="_blank">\n <img class=" ...
## [5] <a href="https://purrr.tidyverse.org/" target="_blank">\n <img class=" ...
## [6] <a href="https://tibble.tidyverse.org/" target="_blank">\n <img class= ...
## [7] <a href="https://stringr.tidyverse.org/" target="_blank">\n <img class ...
## [8] <a href="https://forcats.tidyverse.org/" target="_blank">\n <img class ...
Since we need the actual URLs to be able to read out the sub-pages in the following, we should now extract the values of the href attributes.
links <- a_nodes %>%
  html_attr(name = "href")

links
## [1] "https://ggplot2.tidyverse.org/" "https://dplyr.tidyverse.org/"
## [3] "https://tidyr.tidyverse.org/" "https://readr.tidyverse.org/"
## [5] "https://purrr.tidyverse.org/" "https://tibble.tidyverse.org/"
## [7] "https://stringr.tidyverse.org/" "https://forcats.tidyverse.org/"
7.1.2 Iteration with map()
Before starting to parse the sub-pages, we must think about how we can get R to apply these steps automatically to several URLs, one after the other. One possibility from base R would be to use a “for loop”. However, I would like to introduce the map() family of functions from the tidyverse package purrr. These follow the basic logic of the tidyverse, can easily be included in pipes and have a short and intuitively understandable syntax.
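For comparison, here is a minimal sketch of how such a repetition could be written with a base R for loop: it pre-allocates a list and fills it element by element, which is exactly the pattern that map() wraps up for us. The example anticipates the rounding task used below; the object name result is made up for illustration.

x <- c(1.28, 1.46, 1.64, 1.82)

# Pre-allocate a list with one slot per element of x ...
result <- vector("list", length(x))
# ... and fill each slot by applying round() to the corresponding element.
for (i in seq_along(x)) {
  result[[i]] <- round(x[[i]])
}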
The map() function takes a vector or list as input, applies a function specified in the second argument to each of the elements of the input, and returns a list of the results of the applied function.
x <- c(1.28, 1.46, 1.64, 1.82)

map(x, round)
## [[1]]
## [1] 1
##
## [[2]]
## [1] 1
##
## [[3]]
## [1] 2
##
## [[4]]
## [1] 2
For each element of the numerical vector x, map() individually applies the function round(). round() does what the name suggests and rounds the input up or down, depending on the numerical value. As a result, map() returns a list.
If we want to have a vector as output, we can use specific variants of the map functions depending on the desired type – logical, integer, double or character. Here is a quote from the help on ?map:
“map_lgl(), map_int(), map_dbl() and map_chr() return an atomic vector of the indicated type (or die trying).”
For example, if we want to have a numeric vector instead of a list as output for the above example, we can use map_dbl():
x <- c(1.28, 1.46, 1.64, 1.82)

map_dbl(x, round)
## [1] 1 1 2 2
Or for a character vector, we can apply map_chr(). The function toupper() used here returns the input as uppercase letters.
<- c("abc", "def", "gah")
x
map_chr(x, toupper)
## [1] "ABC" "DEF" "GAH"
If we want to pass additional arguments to the applied function, they are listed after the name of the function. Here, the number of decimal places to round to is changed from the default value of 0 to 1.
x <- c(1.28, 1.46, 1.64, 1.82)

map_dbl(x, round, digits = 1)
## [1] 1.3 1.5 1.6 1.8
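As an aside, purrr also offers a formula shorthand for specifying the applied function, which some find easier to read once arguments come into play; this sketch should be equivalent to the call above:

# The formula notation ~ defines an anonymous function; .x stands for the current element.
map_dbl(x, ~ round(.x, digits = 1))
## [1] 1.3 1.5 1.6 1.8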
This gives us an overview of iteration with map(), but it can only serve as a first introduction. For a more detailed introduction to for loops and the map functions, I recommend the chapter on “Iteration” from “R for Data Science”: https://r4ds.had.co.nz/iteration.html
For a more interactive German introduction, I recommend the section “Schleifen” in the StartR app by Fabian Class: https://shiny.lmes.uni-potsdam.de/startR/#section-schleifen
7.1.3 Scraping the sub-pages
We can now use map() to parse all sub-pages in one step. As input, we use the character vector that contains the URLs of the sub-pages, and as the function to be applied, the familiar read_html(). The function is applied to each of the eight URLs, one after the other. As output we get a list of the eight parsed sub-pages.
pages <- links %>%
  map(read_html)
If we look at the sub-pages in the browser, we can see that the HTML structure is identical for each sub-page in terms of the information we are interested in – name, version number and CRAN as well as “R for Data Science” links. We can therefore extract the data for each of them using the same CSS selectors.
pages %>%
  map(html_node, css = "a.navbar-brand") %>%
  map_chr(html_text)
## [1] "ggplot2" "dplyr" "tidyr" "readr" "purrr" "tibble" "stringr"
## [8] "forcats"
The name of the package is displayed in the menu bar in the upper section of the pages. It is enclosed by an <a> tag. For example, for https://ggplot2.tidyverse.org/ this is: <a class="navbar-brand" href="index.html">ggplot2</a>. The CSS selector used here is one of several possible options to retrieve the desired information.
So what happens in detail in the code shown? The input is the previously created list with the eight parsed websites. In the second line, by using map(), the function html_node() with the argument css = "a.navbar-brand" is applied to each of the parsed pages, so that for each of the eight pages the corresponding HTML node is selected in turn. These nodes are passed through the pipe to the third line, where we again iterate over each node, this time with the familiar function html_text(). For each of the eight selected nodes, the text between the start and end tag is extracted. Since map_chr() is used here, a character vector is returned as output.
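If the piped version feels dense, the same two steps can also be written out with intermediate objects; this is just a sketch, and the object names name_nodes and package_names are made up for illustration:

# Step 1: select the <a class="navbar-brand"> node on each of the eight parsed pages.
name_nodes <- map(pages, html_node, css = "a.navbar-brand")

# Step 2: extract the text of each selected node as a character vector.
package_names <- map_chr(name_nodes, html_text)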
pages %>%
  map(html_node, css = "span.version.version-default") %>%
  map_chr(html_text)
## [1] "3.3.2" "1.0.6" "1.1.3" "1.4.0" "0.3.4"
## [6] "3.1.2" "1.4.0.9000" "0.5.1"
The extraction of the current version number of the packages works the same way. For ggplot2, it is contained in the following tag: <span class="version version-default" data-toggle="tooltip" data-placement="bottom" title="Released version">3.3.2</span>. This can also be selected easily, but we see an interesting detail: the class name here contains a space. This indicates that the <span> tag carries both the class version and the class version-default. We can select it by attaching both class names to span in the selector, each with a leading ".". Strictly speaking, however, we do not need to do this here. Either class name by itself is sufficient for selection, as neither appears anywhere else on the website. In the interest of the most explicit CSS selectors possible, I would still recommend using both class names, but this is also a matter of taste.
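As a quick check of that claim, here is a sketch using only one of the two class names; on these pages it should select the same nodes, assuming the class really does not occur elsewhere:

# Selecting by the class "version" alone; the result should match the output above.
pages %>%
  map(html_node, css = "span.version") %>%
  map_chr(html_text)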
pages %>%
  map(html_node, css = "ul.list-unstyled > li:nth-child(1) > a") %>%
  map_chr(html_attr, name = "href")
## [1] "https://cloud.r-project.org/package=ggplot2"
## [2] "https://cloud.r-project.org/package=dplyr"
## [3] "https://cloud.r-project.org/package=tidyr"
## [4] "https://cloud.r-project.org/package=readr"
## [5] "https://cloud.r-project.org/package=purrr"
## [6] "https://cloud.r-project.org/package=tibble"
## [7] "https://cloud.r-project.org/package=stringr"
## [8] "https://cloud.r-project.org/package=forcats"
pages %>%
  map(html_node, css = "ul.list-unstyled > li:nth-child(4) > a") %>%
  map_chr(html_attr, name = "href")
## [1] "https://r4ds.had.co.nz/data-visualisation.html"
## [2] "http://r4ds.had.co.nz/transform.html"
## [3] "https://r4ds.had.co.nz/tidy-data.html"
## [4] "http://r4ds.had.co.nz/data-import.html"
## [5] "http://r4ds.had.co.nz/iteration.html"
## [6] "https://r4ds.had.co.nz/tibbles.html"
## [7] "http://r4ds.had.co.nz/strings.html"
## [8] "http://r4ds.had.co.nz/factors.html"
The extraction of the links also follows the same basic principle. The selector is a little more complicated, but can easily be understood using the WDTs. We select the <a> tags of the first and fourth <li> children of the unordered list with the class list-unstyled. Here we apply the function html_attr() with the argument name = "href" to each of the eight selected nodes to get the data of interest, the URLs of the links.
If we are only interested in the final result, we can also extract the data of the sub-pages directly during the creation of a tibble:
tibble(
  name = pages %>%
    map(html_node, css = "a.navbar-brand") %>%
    map_chr(html_text),
  version = pages %>%
    map(html_node, css = "span.version.version-default") %>%
    map_chr(html_text),
  CRAN = pages %>%
    map(html_node, css = "ul.list-unstyled > li:nth-child(1) > a") %>%
    map_chr(html_attr, name = "href"),
  Learn = pages %>%
    map(html_node, css = "ul.list-unstyled > li:nth-child(4) > a") %>%
    map_chr(html_attr, name = "href")
)

## # A tibble: 8 x 4
## name version CRAN Learn
## <chr> <chr> <chr> <chr>
## 1 ggplot2 3.3.2 https://cloud.r-project.org/… https://r4ds.had.co.nz/data-v…
## 2 dplyr 1.0.6 https://cloud.r-project.org/… http://r4ds.had.co.nz/transfo…
## 3 tidyr 1.1.3 https://cloud.r-project.org/… https://r4ds.had.co.nz/tidy-d…
## 4 readr 1.4.0 https://cloud.r-project.org/… http://r4ds.had.co.nz/data-im…
## 5 purrr 0.3.4 https://cloud.r-project.org/… http://r4ds.had.co.nz/iterati…
## 6 tibble 3.1.2 https://cloud.r-project.org/… https://r4ds.had.co.nz/tibble…
## 7 stringr 1.4.0.90… https://cloud.r-project.org/… http://r4ds.had.co.nz/strings…
## 8 forcats 0.5.1 https://cloud.r-project.org/… http://r4ds.had.co.nz/factors…
7.2 Pagination
Another common form of dividing a website into several sub-pages is pagination. You all know this from everyday life on the internet: we enter a search term in Google and get results that are spread over several pages. These are accessible and navigable via the numbered links and the forward/back arrows at the bottom of the page. This is “pagination in action”, and we encounter similar variants on many websites.
In the press release archive on the website of the Brandenburg state parliament, pagination is used to distribute the releases over several sub-pages. To illustrate the scraping of such a website, we can, for example, aim to scrape the date, title and link for all press releases from 2020 and summarise them in a tibble.
You can find the website at: https://www.landtag.brandenburg.de/de/aktuelles/presse/pressemitteilungsarchiv/archiv_pressemitteilungen_2020/980521
7.2.1 The query
Let’s first look at the page in the browser and examine what happens when we click through the sub-pages. If we select the second sub-page, the URL in the browser window changes. We see that the end of the URL is extended from “980521” to “980521?skip=15”. So a query is sent to the server and the corresponding sub-page is sent back and displayed. On the third sub-page, the query changes again to “980521?skip=30”.
What could “skip=15” mean? If we look at the list of press releases, we see that exactly 15 releases are displayed on each sub-page. We can therefore assume that “skip=15” instructs the server to skip the first 15 releases and thus display entries 16-30. “skip=30” then skips the first 30 and so on. The principle can be further confirmed by testing what “skip=0” triggers. The first sub-page is displayed again. So “980521” is actually functionally equivalent to “980521?skip=0”.
With this we already know that we will be able to manipulate the URLs directly from RStudio and thus scrape the sub-pages.
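If we want to confirm this from R right away, we can request one of the sub-pages with the query parameter appended – a small sketch, where the object name page_2 is just for illustration:

# Reading the second sub-page directly via the "?skip=" query parameter
page_2 <- read_html("https://www.landtag.brandenburg.de/de/aktuelles/presse/pressemitteilungsarchiv/archiv_pressemitteilungen_2020/980521?skip=15")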
7.2.2 Scraping the sub-pages
Before we start scraping all the press releases, we first need to find out how many sub-pages there are. The easiest way would be to do this by eye: we can see in the browser that the highest selectable sub-page is “12”. But we can also find this out in the scraping process itself. This has several advantages. We might not only want to scrape the releases from 2020, but those from several or all years. To do this, we would have to check in the browser for each year how many sub-pages there are and adjust the code accordingly. If we extract the page number in the R code, this can easily be generalised to other years with different page numbers. If, on the other hand, the goal were to scrape the releases from 2021, we would have to check whether the page number has changed every time we repeat the process over the course of the year, as new press releases are added. By extracting it in the code, this step is eliminated and we can run the script regularly with minimal effort and be sure that it remains functional.
The links to the sub-pages are contained in the HTML code in an unordered list (<ul>). The penultimate list element (<li>) contains the page number of the last page, here “12”. Please note that the last list element is the “Forward” button. With this information we can construct a selector and extract the highest page number as the second-to-last list element.
<- "https://www.landtag.brandenburg.de/de/aktuelles/presse/pressemitteilungsarchiv/archiv_pressemitteilungen_2020/980521" %>%
website read_html()
<- website %>%
max html_node(css = "ul.pagination.pagination-sm > li:nth-last-child(2)") %>%
html_text() %>%
as.numeric()
In the last line of code above, you again see the function as.numeric(). Remember that html_text() always returns a character vector. Since we need a numeric value to be able to calculate with it in R, we have to convert it into a number.
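A tiny illustration of why the conversion matters – the page number arrives as text, and arithmetic only works on numbers:

# "12" * 15 would fail with "non-numeric argument to binary operator"
as.numeric("12") * 15 - 15
## [1] 165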
Now we can start constructing the links to all sub-pages directly in our R script. To do this, we need two components that we can then combine to create the URLs. First we have to define the constant part of the URL, the base URL. In this case, this is the complete URL up to and including “?skip=”. In addition, we need the values that are inserted after “?skip=”. We can easily calculate these. Each sub-page contains 15 press releases. So we can multiply the number of the sub-page by 15, but then we have to subtract 15 again, because the 15 press releases shown on the current sub-page should not be skipped. So for page 1 we calculate \(1 \cdot 15 - 15 = 0\), for page 2 \(2 \cdot 15 - 15 = 15\), and so on. To do this in one step for all sub-pages, we can use 1:max * 15 - 15 to instruct R to repeat the calculation for all numbers from 1 to the maximum value – which we previously stored in the object max. The operator : stands for “from–to”. In this way we get a numeric vector with the values for “?skip=”. In the third step we can combine the base URL and the calculated values with str_c() into complete URLs, which we parse in the fourth step with map().
<- "https://www.landtag.brandenburg.de/de/aktuelles/presse/pressemitteilungsarchiv/archiv_pressemitteilungen_2020/980521?skip="
base_url
<- 1:max * 15 - 15
skips
skips ## [1] 0 15 30 45 60 75 90 105 120 135 150 165
<- str_c(base_url, skips)
links
<- links %>%
pages map(read_html)
Now we can extract the data we are interested in. Let’s start with the date of the press release. This is enclosed by a <p> tag of the class date. With map() we first select the corresponding nodes and in the next step extract the text of these nodes. Since at this point a list of character vectors – one per sub-page – is returned, and this would be unnecessarily complicated in further processing, we can use unlist() to receive a single character vector as output.
pages %>%
  map(html_nodes, css = "p.date") %>%
  map(html_text) %>%
  unlist() %>%
  head(n = 5)
## [1] "30.12.2020" "17.12.2020" "11.12.2020" "10.12.2020" "04.12.2020"
However, we can go one step further and store the data as a vector of the type “Date”. This could be advantageous for potential further analyses. The tidyverse package lubridate makes it easy to convert dates from character or numeric vectors into the “Date” format. The package is not part of the core tidyverse and has to be loaded explicitly. Among other things, it offers a number of functions of the form dmy(), mdy(), ymd() and so on, where d stands for “day”, m for “month” and y for “year”. The order in which the letters appear in the function name tells R the format of the data we want to convert to the “Date” format. On the website of the Brandenburg state parliament, the dates are written in the form Day.Month.Year, which is typical in Germany, so we use the function dmy(). If, for example, they were in the form Month.Day.Year, which is typical in the USA, we would have to use mdy() accordingly. It is irrelevant whether the components of the date are separated by “.”, “/”, “-” or spaces. Even written-out or abbreviated month names can be processed by lubridate.
library(lubridate)
pages %>%
  map(html_nodes, css = "p.date") %>%
  map(html_text) %>%
  unlist() %>%
  dmy() %>%
  head(n = 5)
## [1] "2020-12-30" "2020-12-17" "2020-12-11" "2020-12-10" "2020-12-04"
More about the handling of dates and times in R as well as the further possibilities opened up by lubridate, can be found in the corresponding chapter in “R for Data Science”: https://r4ds.had.co.nz/dates-and-times.html
Next, we can extract the titles of the press releases. By now you will be able to understand the code and the CSS selector for this yourself.
pages %>%
  map(html_nodes, css = "p.result-name") %>%
  map(html_text, trim = TRUE) %>%
  unlist() %>%
  head(n = 5)
## [1] "Zum Tode von Paul-Heinz Dittrich"
## [2] "Symbol für Zusammenhalt auch in schwierigen Zeiten: Parlament und Regierung erhalten Fotocollage zu Erntekronen"
## [3] "Termine des Landtages Brandenburg in der Zeit vom 12. bis 20. Dezember 2020"
## [4] "Hinweise für Medien zu den Plenarsitzungen des Landtages vom 15. bis 18. Dezember 2020"
## [5] "Termine des Landtages Brandenburg in der Zeit vom 7. bis 13. Dezember 2020"
The last things to extract are the links to the individual press releases. Here, too, nothing surprising happens at first.
pages %>%
  map(html_nodes, css = "li.ce.ce-teaser > a") %>%
  map(html_attr, name = "href") %>%
  unlist() %>%
  head(n = 5)
## [1] "/de/meldungenzum_tode_von_paul-heinz_dittrich/980480?_referer=980521"
## [2] "/de/meldungensymbol_fuer_zusammenhalt_auch_in_schwierigen_zeiten:_parlament_und_regierung_erhalten_fotocollage_zu_erntekronen/979730?_referer=980521"
## [3] "/de/meldungentermine_des_landtages_brandenburg_in_der_zeit_vom_12._bis_20._dezember_2020/978690?_referer=980521"
## [4] "/de/meldungenhinweise_fuer_medien_zu_den_plenarsitzungen_des_landtages_vom_15._bis_18._dezember_2020/976366?_referer=980521"
## [5] "/de/meldungentermine_des_landtages_brandenburg_in_der_zeit_vom_7._bis_13._dezember_2020/974725?_referer=980521"
However, the links stored in the HTML code only contain part of the complete URL. We could now construct the complete URLs again with str_c(), but we need one new concept for this. The pipe passes the result of a step along to the next line as the first argument of the next function. If we use str_c() within the pipe, it therefore receives the extracted end parts of the URLs as its first argument. str_c("https://www.landtag.brandenburg.de") would thus lead to the end parts of the URLs being placed before “https://www.landtag.brandenburg.de”. However, we want this to happen the other way round. To do this, we need to tell str_c() to use the data passed through the pipe as its second argument. We can achieve this with the placeholder ., which refers to the data passed through the pipe.
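A minimal sketch of the difference, using a made-up path instead of the scraped links:

# Without the dot, the piped value becomes the first argument of str_c():
"/de/some_page" %>% str_c("https://www.landtag.brandenburg.de")
## [1] "/de/some_pagehttps://www.landtag.brandenburg.de"

# With the dot, we control where the piped value is inserted:
"/de/some_page" %>% str_c("https://www.landtag.brandenburg.de", .)
## [1] "https://www.landtag.brandenburg.de/de/some_page"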
In this way we can combine the URLs correctly:
pages %>%
  map(html_nodes, css = "li.ce.ce-teaser > a") %>%
  map(html_attr, name = "href") %>%
  unlist() %>%
  str_c("https://www.landtag.brandenburg.de", .) %>%
  head(n = 5)
## [1] "https://www.landtag.brandenburg.de/de/meldungenzum_tode_von_paul-heinz_dittrich/980480?_referer=980521"
## [2] "https://www.landtag.brandenburg.de/de/meldungensymbol_fuer_zusammenhalt_auch_in_schwierigen_zeiten:_parlament_und_regierung_erhalten_fotocollage_zu_erntekronen/979730?_referer=980521"
## [3] "https://www.landtag.brandenburg.de/de/meldungentermine_des_landtages_brandenburg_in_der_zeit_vom_12._bis_20._dezember_2020/978690?_referer=980521"
## [4] "https://www.landtag.brandenburg.de/de/meldungenhinweise_fuer_medien_zu_den_plenarsitzungen_des_landtages_vom_15._bis_18._dezember_2020/976366?_referer=980521"
## [5] "https://www.landtag.brandenburg.de/de/meldungentermine_des_landtages_brandenburg_in_der_zeit_vom_7._bis_13._dezember_2020/974725?_referer=980521"
As always, we can perform the complete extraction of the data during the construction of a tibble:
tibble(
  date = pages %>%
    map(html_nodes, css = "p.date") %>%
    map(html_text) %>%
    unlist() %>%
    dmy(),
  name = pages %>%
    map(html_nodes, css = "p.result-name") %>%
    map(html_text, trim = TRUE) %>%
    unlist(),
  link = pages %>%
    map(html_nodes, css = "li.ce.ce-teaser > a") %>%
    map(html_attr, name = "href") %>%
    unlist() %>%
    str_c("https://www.landtag.brandenburg.de", .)
)

## # A tibble: 175 x 3
## date name link
## <date> <chr> <chr>
## 1 2020-12-30 Zum Tode von Paul-Heinz Dittrich https://www.landtag.brandenburg…
## 2 2020-12-17 Symbol für Zusammenhalt auch in … https://www.landtag.brandenburg…
## 3 2020-12-11 Termine des Landtages Brandenbur… https://www.landtag.brandenburg…
## 4 2020-12-10 Hinweise für Medien zu den Plena… https://www.landtag.brandenburg…
## 5 2020-12-04 Termine des Landtages Brandenbur… https://www.landtag.brandenburg…
## 6 2020-11-27 Termine des Landtages Brandenbur… https://www.landtag.brandenburg…
## 7 2020-11-25 UN-Women-Flagge weht im Innenhof… https://www.landtag.brandenburg…
## 8 2020-11-23 Hinweise für Medien zur Sondersi… https://www.landtag.brandenburg…
## 9 2020-11-23 Attikafiguren von Perseus und An… https://www.landtag.brandenburg…
## 10 2020-11-20 Termine des Landtages Brandenbur… https://www.landtag.brandenburg…
## # … with 165 more rows