10 Scraping Berlin police reports
From here on, we will turn to the basics of data analysis and to using regular expressions on textual data. We start by showing, and briefly explaining, how to scrape the data that will serve as our example for the remaining chapters.
The website https://www.berlin.de/polizei/polizeimeldungen/archiv/ contains reports by the Berlin police that are open to the public, beginning with the year 2014. We will gather the links to all subpages, download them and extract the text of each report, as well as the date, time and district it refers to.
10.1 Gathering the links
library(tidyverse)
library(rvest)
We begin by downloading the main page.
<- "https://www.berlin.de/polizei/polizeimeldungen/archiv/" %>%
website read_html()
The next step is to extract the URLs for the yearly archives. All the <a> tags that contain those links have a title= attribute whose value begins with “Link”. We can use this in constructing the selector, extract the value of the href= attribute and, as they are incomplete URLs, append them to the base URL.
links <- website %>%
  html_nodes(css = "div.html5-section.body a[title^='Link']") %>%
  html_attr(name = "href") %>%
  str_c("https://www.berlin.de", .)
Each yearly archive splits the reports for that year over several subpages using pagination; to gather the links to all subpages, we have to understand how this pagination works. In this case, the query “?page_at_1_0=” followed by the number of the desired subpage is simply appended to the URL of the yearly archive. For example, “https://www.berlin.de/polizei/polizeimeldungen/archiv/2020/?page_at_1_0=5” gives access to the fifth subpage for 2020.
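To make the pattern explicit, such a subpage URL can be pasted together with str_c(); this is just a minimal illustration using the example values from above:

# paste the archive URL, the query string and the subpage number together
str_c("https://www.berlin.de/polizei/polizeimeldungen/archiv/2020/", "?page_at_1_0=", 5)
## [1] "https://www.berlin.de/polizei/polizeimeldungen/archiv/2020/?page_at_1_0=5"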
The number of police reports is not constant over the years, so the number of subpages per year will also differ. We do not know in advance how many subpages we will have to download, but we can scrape this information from the last pagination link for each year. We read the first page for each year, select the <a> tag for the last subpage, extract its text and save it as a number. Note that I also added a waiting time of two seconds between each download, as shown in chapter 9.
max_pages <- links %>%
  map(~ {
    Sys.sleep(2)
    read_html(.x)
  }) %>%
  map(html_node, css = "li.pager-item.last > a") %>%
  map(html_text) %>%
  as.numeric()
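Since the total number of subpages equals the sum of these values, a quick look at the workload ahead could be (a small sketch; the total was 337 at the time of writing, as noted below):

# total number of subpages across all yearly archives
sum(max_pages)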
Now we can construct the links to all subpages; in my script, this is done by nesting two for loops.
pag_links <- character()

for (i in 1:length(links)) {
  for (j in 1:max_pages[i]) {
    pag_links <- append(pag_links, values = str_c(links[i], "?page_at_1_0=", j))
  }
}
The outer loop is a counter for the elements in the object links; in this case the loop counts from 1 to 8, one step for each yearly archive page. For each of these iterations, the inner loop counts from 1 to the number of the last subpage for that year, which we assigned to the vector max_pages. Using both counters together, we can construct the links to the subpages with str_c(). The function takes the link to a yearly archive, indicated by the outer loop counter i, appends “?page_at_1_0=” and, after this, the value of the inner loop counter j. The resulting link is then added to the vector pag_links using the function append(), which adds the data indicated by the argument values = to the object specified in the first argument. For this to work, we first have to initialise pag_links as an empty object outside of the loop. Here I created it as an empty character vector, using pag_links <- character(). If you want to create an empty object without any data type assigned, you can also use empty_vector <- NULL.
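As an aside, the same vector could also be built without explicit loops, for example with purrr's map2(); this is only a sketch, assuming links and max_pages exist as created above:

# loop-free alternative (sketch): for each yearly link, paste on the page numbers 1 to max_pages
pag_links <- map2(links, max_pages, ~ str_c(.x, "?page_at_1_0=", 1:.y)) %>%
  unlist()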
10.2 Downloading the subpages
Now that the links to all subpages are gathered, we can finally download them. Please note that the download will take up to 20 minutes due to the number of subpages (337 at the time of writing) and the additional waiting time of 2 seconds between each iteration.
pages <- pag_links %>%
  map(~ {
    Sys.sleep(2)
    read_html(.x)
  })
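A quick sanity check (a sketch): we should now have one parsed HTML document per constructed subpage link.

# both lengths should be equal -- one parsed document per link
length(pages) == length(pag_links)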
10.3 Extracting the data of interest
The goal is to extract the text of the report, the date/time and the district the report refers to. Let us try to approach this using the techniques we developed over the last chapters:
reports <- tibble(
  Date = pages %>%
    map(html_nodes, css = "div.cell.date") %>%
    map(html_text) %>%
    unlist(),
  Report = pages %>%
    map(html_nodes, css = "div.cell.text > a") %>%
    map(html_text) %>%
    unlist(),
  District = pages %>%
    map(html_nodes, css = "span.category") %>%
    map(html_text) %>%
    unlist()
)

## Error: Tibble columns must have compatible sizes.
## * Size 16698: Existing data.
## * Size 16303: Column `District`.
## ℹ Only values of size one are recycled.
That did not work. But why? Let us look at the error message we received. It informs us that the column “District” we tried to create is shorter than the other two. Since we cannot create a tibble out of columns of different lengths, we get this error.
The fact that the “District” column is shorter than the other two must mean that there are police reports for which the district is not listed on the website, which we can confirm by browsing some of the subpages, e.g. https://www.berlin.de/polizei/polizeimeldungen/archiv/2014/.
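If we prefer to confirm this in code rather than in the browser, we can also count the extracted nodes per selector directly; this sketch should reproduce the two sizes from the error message (16698 dates versus 16303 districts at the time of writing):

# number of date cells found across all subpages
pages %>%
  map(html_nodes, css = "div.cell.date") %>%
  map_int(length) %>%
  sum()

# number of district tags found -- smaller, because the tag is sometimes missing
pages %>%
  map(html_nodes, css = "span.category") %>%
  map_int(length) %>%
  sum()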
10.3.1 Dealing with missing data in lists
Some of the <span> tags that contain the district are missing. Using the approach presented above, html_nodes() extracts just the <span> tags that are present. We tell R that we want all <span> tags of the class “category”, and this is what R returns to us. The cases where the tag is missing are simply skipped. But this is not what we want. What we actually want is for R to look at every single police report and save its text, date and time, as well as the district if it is not missing. If it is missing, we want R to save an NA, the representation of missing values in R, in the corresponding cell of the tibble.
The <div> and <span> tags that contain the data of interest are, in this case, nested in <li> tags. The <li> tags thus contain the whole set of data: the text, the date/time and the district. The approach here is to make R examine every single <li> tag and extract the data that is present, as well as save an NA for every piece of data that is missing.
To start, we have to extract all the <li> tags and their content from the subpages. Right now, the subpages are saved in the object pages. We use a for loop that takes every element of pages, extracts all the <li> tags from it and adds them to a new list using append(). append() takes an existing object as its first argument and adds the data indicated in the values = argument to it. For this to work, we have to initiate the new list as an empty object before the loop starts.
list_items <- NULL

for (i in 1:length(pages)) {
  list_items <- append(list_items, values = html_nodes(pages[[i]], css = "li.row-fluid"))
}
The newly created list list_items contains a nodeset for each <li> tag from all subpages:
list_items[[1]]

## {html_node}
## <li class="row-fluid">
## [1] <div class="span2 cell date">21.06.2021 15:17 Uhr</div>\n
## [2] <div class="span10 cell text">\n<a href="/polizei/polizeimeldungen/2021/p ...
We can now make R examine every element of this list one after the other and extract all the data it contains. But what happens when we try to extract a node that is not present in one of the elements? R returns an NA:
html_node(list_items[[1]], css = "span.notthere")
## {xml_missing}
## <NA>
Using a for loop that iterates over the elements in list_items, we write a tibble row by row, filling the data cells with the extracted information, and with NA if a node could not be found for this element of list_items. We have to initiate the tibble before the for loop starts: we define the column names, the type of data to be saved and also the length of the columns. The latter is not strictly necessary, as we could also have created a tibble with a column length of \(0\), but pre-defining the length increases computational efficiency. Still, the for loop has to iterate over several thousand elements and extract the data they contain, which will take several minutes to complete.
reports <- tibble(
  Date = character()[1:length(list_items)],
  Report = character()[1:length(list_items)],
  District = character()[1:length(list_items)]
)

for (i in 1:length(list_items)) {
  reports[i, "Date"] <- html_node(list_items[[i]], css = "div.cell.date") %>%
    html_text()
  reports[i, "Report"] <- html_node(list_items[[i]], css = "div.cell.text > a") %>%
    html_text()
  reports[i, "District"] <- html_node(list_items[[i]], css = "span.category") %>%
    html_text()
}
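As a side note, the same tibble could also be built without writing it row by row, for example with purrr's map_chr(); this is only a sketch using the same CSS selectors as above, and it relies on html_text() returning an NA for a missing node:

# loop-free alternative (sketch): html_node() yields a missing node where the tag is
# absent, and html_text() turns that into NA
reports <- tibble(
  Date = map_chr(list_items, ~ html_text(html_node(.x, css = "div.cell.date"))),
  Report = map_chr(list_items, ~ html_text(html_node(.x, css = "div.cell.text > a"))),
  District = map_chr(list_items, ~ html_text(html_node(.x, css = "span.category")))
)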
10.4 Saving the data
As discussed in chapter 8, we save the scraped data at this point. You have seen that downloading all the subpages took a considerable amount of time; if we repeated the download every time we want to analyse the data further, we would create a lot of unnecessary traffic and waste a lot of our own time.
save(reports, file = "reports.RData")
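In later chapters, we can then restore the object from disk with load() instead of scraping everything again:

# restores the object reports into the current R session
load("reports.RData")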