Scraping PDFs in R

Knowing how to source data from non-traditional sources (i.e. somewhere where the data can’t be easily downloaded in a neat, tabular way) is a powerful skill for any data scientist or analyst to have in their toolkit. Most commonly, this takes the form of webscraping (I’m looking at you BeautifulSoup and rvest) or pulling data through an API, like the Twitter API. Scraping data from PDFs is likely a slightly less-populated corner of this world, but valuable nonetheless. If your team relies on data that comes from, say, a PDF newsletter, that gets fed into an automated data pipeline, you’re generally left with two clear options: either the data can be highlighted, copied, and pasted (but will still need to be formatted manually) or hand key the values into another, more tabular format so they can be used in a more automated fashion (which requires even more manual work, and is even more vulnerable to human error).

Fortunately, as with most things in R (and Python, of course) there is a package for that- in this case I’ll be using pdftools to scrape tables from a statistical release produced by the Federal Reserve Board.I want to focus more on the general workflow, rather than the specifics necessary to scrape this particular PDF. Since the primary function in the pdftools package pulls the contents of PDF as raw text, regular expressions will be exceptionally useful for making this simple and flexible. With the pdftools library, a bit of regular expression know-how, and a bit of standardization (if the document you’re working with changes every time, no amount of regular expressions will make it feasible to automate), I hope this can be more broadly applicable to other, similar data sources and can save time and frustration.

Set-up

As always, I’m operating under the assumption that the reader has R/RStudio installed on their device. In addition to pdftools, I’ll be using a few other libraries from the tidyverse, predominantly stringr, which provides a better suite of functions for working with strings and regular expressions than is available in base R. The others, purrr is here for use of the map function, which provides a cleaner and arguably more stable alternative to the base R apply family.

library(pdftools)
library(stringr)
library(purrr)

Extracting the table

With the libraries attached, the obvious first step is to do the initial scraping of our PDF document. pdftools makes this a trivial exercise, and most of the work in this post will be devoted to a brief overview of regular expressions and string manipulation to get everything formatted nicely. Simply calling the pdf_text function on a pdf document scrapes the text from the document and stores in a variable of our choosing.

# Set path to the newsletter we want to scrape
pdf_name <- "g20.pdf"

# Read in newsletter text using the pdftools module
g20_text <- pdf_text(pdf_name)

So what does this look like? pdf_text creates a character vector, where each element is a page of the document, formatted as a single character string. To get a sense of what we’re working with, we can examine the first 1,000 characters of the first page.

str(g20_text)

##  chr [1:3] "G.20                                                             Finance Companies1                            "| __truncated__ ...

substr(g20_text[1],1,1000)

## [1] "G.20                                                             Finance Companies1                                         For immediate release\r\n                                                                       October 2018                                                   December 2018\r\nOwned and Managed Receivables Outstanding2\r\nSeasonally adjusted. Billions of dollars except as noted.\r\n                                                                                         2017                             2018\r\n                                                                                                                          r        r        r     p\r\n                                           2013     2014    2015    2016    2017       Q3       Q4      Q1      Q2     Q3      Aug      Sep    Oct\r\nTotal percent change (annual rate)3         1.1      2.0     0.4    -3.0     -1.8    -5.4      0.4    -0.1     0.0    -2.9      0.4    -13.0   -0.5\r\n     Consumer              "

Each page contains (at least) a table to scrape. In this example, I’ll work with the table on the second page, which looks like this :

# Separate out the page which contains the table of interest
table_page <- g20_text[[2]]

# Inspect
substr(table_page,1,1000)

## [1] "Owned and Managed Receivables Outstanding (Levels)2\r\n(Billions of dollars)\r\nNot seasonally adjusted\r\n                                                                                   2017                            2018\r\n                                                                                                                   r        r      r      p\r\n                                       2013    2014    2015    2016    2017      Q3       Q4     Q1      Q2     Q3      Aug    Sep     Oct\r\nTotal                               1,413.9 1,442.2 1,310.8 1,270.9 1,248.3 1,248.2 1,248.3 1,238.2 1,241.2 1,240.4 1,242.0 1,240.4 1,236.0\r\n  Consumer                            855.5   882.6   746.6   744.6   737.3   737.2   737.3   725.6   727.2   730.2   731.5   730.2   731.0\r\n     Motor vehicle loans              316.2   330.8   307.6   307.7   304.6   303.7   304.6   301.3   303.0   308.2   308.8   308.2   309.3\r\n     Motor vehicle leases             167.3   188.9   176.1   189.5   193."

At present, this looks more like a garbled mess of text than a workable table. Fortunately, it’s a fairly well-structured garbled mess of text, and with a bit of regular expression magic, can be condensed into a tidy bit of tabular data. The first thing to do is convert this single, long string into reasonable pieces. Using str_split from the stringr package, I can pass in a character upon which to split a string into either the elements of a vector (using the simplify argument) or list. So, the string ‘a|b|c|d|e|f’ could be broken out as, splitting on the ‘|’ character:

##      [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] "a"  "b"  "c"  "d"  "e"  "f"

Here, I’ll break at the new line symbol, or “\r\n”:

# Split full table into individual lines of text based on '\n' line break character
g20_table_raw <- str_split(table_page, "\r\n", simplify=TRUE)
str(g20_table_raw)

##  chr [1, 1:26] "Owned and Managed Receivables Outstanding (Levels)2" ...

print(g20_table_raw)

##      [,1]                                                 
## [1,] "Owned and Managed Receivables Outstanding (Levels)2"
##      [,2]                    [,3]                     
## [1,] "(Billions of dollars)" "Not seasonally adjusted"
##      [,4]                                                                                                                     
## [1,] "                                                                                   2017                            2018"
##      [,5]                                                                                                                                         
## [1,] "                                                                                                                   r        r      r      p"
##      [,6]                                                                                                                                        
## [1,] "                                       2013    2014    2015    2016    2017      Q3       Q4     Q1      Q2     Q3      Aug    Sep     Oct"
##      [,7]                                                                                                                                         
## [1,] "Total                               1,413.9 1,442.2 1,310.8 1,270.9 1,248.3 1,248.2 1,248.3 1,238.2 1,241.2 1,240.4 1,242.0 1,240.4 1,236.0"
##      [,8]                                                                                                                                         
## [1,] "  Consumer                            855.5   882.6   746.6   744.6   737.3   737.2   737.3   725.6   727.2   730.2   731.5   730.2   731.0"
##      [,9]                                                                                                                                         
## [1,] "     Motor vehicle loans              316.2   330.8   307.6   307.7   304.6   303.7   304.6   301.3   303.0   308.2   308.8   308.2   309.3"
##      [,10]                                                                                                                                        
## [1,] "     Motor vehicle leases             167.3   188.9   176.1   189.5   193.2   192.5   193.2   192.8   193.9   194.4   194.1   194.4   194.1"
##      [,11]                                                                                                                                        
## [1,] "     Revolving6                        67.1    60.3    25.7    25.5    26.6    26.1     26.6   24.6    24.1    23.5     23.7   23.5    23.5"
##      [,12]                                                                                                                                        
## [1,] "     Other7                           304.9   302.5   237.3   221.8   212.9   214.9   212.9   207.0   206.2   204.2   204.9   204.2   204.1"
##      [,13]                                                                                                                                        
## [1,] "  Real estate                         157.0   148.1   159.5   138.3   123.8   127.8   123.8   125.5   122.6   117.9   122.3   117.9   114.8"
##      [,14]                                                                                                                                        
## [1,] "     One-to-four family               114.7   104.6   123.4   106.8    95.2    99.0     95.2   96.9    94.8    89.9     94.5   89.9    87.1"
##      [,15]                                                                                                                                        
## [1,] "     Other                             42.3    43.5    36.0    31.5    28.6    28.8     28.6   28.7    27.8    28.0     27.8   28.0    27.7"
##      [,16]                                                                                                                                        
## [1,] "  Business                            401.3   411.5   404.7   388.0   387.2   383.2   387.2   387.1   391.4   392.3   388.2   392.3   390.2"
##      [,17]                                                                                                                                        
## [1,] "     Motor vehicles                   143.0   150.9   103.6   105.3   106.1   101.7   106.1   109.3   108.2   106.5   104.6   106.5   107.2"
##      [,18]                                                                                                                                        
## [1,] "       Retail loans                    28.1    30.0    14.5    15.0    15.7    15.2     15.7   15.9    16.1    16.7     16.4   16.7    16.8"
##      [,19]                                                                                                                                        
## [1,] "       Wholesale loans8                86.2    88.6    79.7    81.3    81.7    77.8     81.7   84.7    83.2    80.9     79.2   80.9    81.3"
##      [,20]                                                                                                                                        
## [1,] "       Leases                          28.8    32.4     9.4     8.9     8.7     8.7      8.7    8.7     8.9     9.0      9.0    9.0     9.0"
##      [,21]                                                                                                                                        
## [1,] "     Equipment                        172.4   173.9   218.9   197.2   195.3   196.7   195.3   197.9   201.8   201.7   200.6   201.7   199.5"
##      [,22]                                                                                                                                        
## [1,] "       Loans                          110.1   115.5   122.1   115.2   117.4   119.5   117.4   119.4   125.1   126.2   125.1   126.2   124.8"
##      [,23]                                                                                                                                        
## [1,] "       Leases                          62.3    58.4    96.8    82.0    77.9    77.2     77.9   78.5    76.7    75.5     75.5   75.5    74.7"
##      [,24]                                                                                                                                        
## [1,] "     Other business receivables9       85.9    86.7    82.2    85.6    85.8    84.8     85.8   79.8    81.4    84.1     83.0   84.1    83.6"
##      [,25]                                 [,26]
## [1,] "Footnotes appear on the third page." ""

This already resembles a table, and it’s only required a few lines of code to get here. For the rest of the formatting and cleaning, I’ll use two other functions from the stringr package: str_detect, and str_extract. As with most functions in the Tidyverse, the names are fairly self explanatory: str_detect will detect the location of a given string in a vector of strings, and return TRUE where it is found and FALSE otherwise, much like the base R grepl function. Unsurprisingly, str_extract will do the same, but instead return the chosen string or regular expression.

# A vector of strings
fruit_str <- c("apples", "oranges", "blueberries", "pineapples", "watermelon", "strawberries")

# str_detect
apples_lgl <- str_detect(fruit_str, "apples")
fruit_str[apples_lgl]

## [1] "apples"     "pineapples"

# str_extract
str_extract(fruit_str, "apples")

## [1] "apples" NA       NA       "apples" NA       NA

These functions detect or extract exactly what has been passed in as the pattern argument to the function. In order to build in some more power and a lot more flexibility, regular expressions can be exceptionally useful. I intend for this to serve as a fairly gentle introduction to regular expressions - they can be immeasurably powerful, but get exceptionally complicated very quickly. I have a copy of this cheat sheet posted up in my office, and Rex Egg is another great resource. I am by no means a regular expression master, but have developed enough of a command to make my life easier for automating these kinds of tasks. In order to prevent this from becoming a massive regular expression tutorial, I’m going to build my expressions before using them in the stringr functions and explain exactly what they’re doing, rather than providing a comprehensive overview. I’d encourage the reader to check out the resources for building out there regex knowledge.

Looking at the excerpt of the table above, I’m interested in dropping the header info. I can detect these by wrapping them in parentheses and separating them with the ‘|’ character, which essentially says detect ‘Owned’ OR ‘Footnotes’ OR ‘Billions’ OR adjusted. After extracting the location of these strings, I can drop their indexes from the vector. With only table entries to operate with, I’ll be using str_split again to split at spaces. The "\\s+" expression splits at all space characters ("\\s") that match at least one time (+) - this ensures we split at the blocks of spaces that have different lengths. Before that though, note the formatting of the first ‘real’ row of table entries. The commas, used to denote thousands, are going to cause us grief when we convert the values to numeric later on. So, before splitting things up, we can use str_replace_all to remove those pesky commas.

# Assemble a regular expression to identify table entries of interest
row_expr <- paste(c("Owned", "Footnotes", "Billions", "adjusted"), collapse = "|")
row_expr

## [1] "Owned|Footnotes|Billions|adjusted"

# Remove rows which do not contain values for input based on names from PDF
g20_table <- g20_table_raw[which(!str_detect(g20_table_raw, row_expr))]

# Remove commmas
g20_table <- map_chr(g20_table, str_replace_all, pattern=",", replacement="")

# Split at any number of spaces
elements <- map(g20_table, function(x) str_split(x, "\\s+", simplify = TRUE))
print(elements[[4]])

##      [,1]    [,2]     [,3]     [,4]     [,5]     [,6]     [,7]    
## [1,] "Total" "1413.9" "1442.2" "1310.8" "1270.9" "1248.3" "1248.2"
##      [,8]     [,9]     [,10]    [,11]    [,12]    [,13]    [,14]   
## [1,] "1248.3" "1238.2" "1241.2" "1240.4" "1242.0" "1240.4" "1236.0"

With these two commands, we’re nearly done formatting the values for the table. Looking at the elements list, there are numeric values from the table mixed with empty strings, and row. In order to ensure we strip the unnecessary text, but keep the values, I’ll need to build something a bit more flexible. The entry_expr expression below will detect the [0-9]*\\.[0-9]* pattern. [0-9]* will detect any digits ([0-9]), any number of times (the * modifer matches the digits any number of times). This will capture the whole number portions of the table entries. In order to capture the decimal point, I ‘escape’ it, as before, since the . means something special in regular expressions.I detect this expression vector by vector in the map function, and convert each piece to numeric. After binding the elements of the list together, we finally have the values neatly formatted exactly as they appear in the table.

# Remove unnecessary text to convert table values from text to numeric
entry_expr <- "[0-9]*\\.[0-9]*"
num_values <- map(elements, function(x) as.numeric(x[str_detect(x, entry_expr)]))

g20_values <- do.call("rbind", num_values)
print(g20_values)

##         [,1]   [,2]   [,3]   [,4]   [,5]   [,6]   [,7]   [,8]   [,9]
##  [1,] 1413.9 1442.2 1310.8 1270.9 1248.3 1248.2 1248.3 1238.2 1241.2
##  [2,]  855.5  882.6  746.6  744.6  737.3  737.2  737.3  725.6  727.2
##  [3,]  316.2  330.8  307.6  307.7  304.6  303.7  304.6  301.3  303.0
##  [4,]  167.3  188.9  176.1  189.5  193.2  192.5  193.2  192.8  193.9
##  [5,]   67.1   60.3   25.7   25.5   26.6   26.1   26.6   24.6   24.1
##  [6,]  304.9  302.5  237.3  221.8  212.9  214.9  212.9  207.0  206.2
##  [7,]  157.0  148.1  159.5  138.3  123.8  127.8  123.8  125.5  122.6
##  [8,]  114.7  104.6  123.4  106.8   95.2   99.0   95.2   96.9   94.8
##  [9,]   42.3   43.5   36.0   31.5   28.6   28.8   28.6   28.7   27.8
## [10,]  401.3  411.5  404.7  388.0  387.2  383.2  387.2  387.1  391.4
## [11,]  143.0  150.9  103.6  105.3  106.1  101.7  106.1  109.3  108.2
## [12,]   28.1   30.0   14.5   15.0   15.7   15.2   15.7   15.9   16.1
## [13,]   86.2   88.6   79.7   81.3   81.7   77.8   81.7   84.7   83.2
## [14,]   28.8   32.4    9.4    8.9    8.7    8.7    8.7    8.7    8.9
## [15,]  172.4  173.9  218.9  197.2  195.3  196.7  195.3  197.9  201.8
## [16,]  110.1  115.5  122.1  115.2  117.4  119.5  117.4  119.4  125.1
## [17,]   62.3   58.4   96.8   82.0   77.9   77.2   77.9   78.5   76.7
## [18,]   85.9   86.7   82.2   85.6   85.8   84.8   85.8   79.8   81.4
##        [,10]  [,11]  [,12]  [,13]
##  [1,] 1240.4 1242.0 1240.4 1236.0
##  [2,]  730.2  731.5  730.2  731.0
##  [3,]  308.2  308.8  308.2  309.3
##  [4,]  194.4  194.1  194.4  194.1
##  [5,]   23.5   23.7   23.5   23.5
##  [6,]  204.2  204.9  204.2  204.1
##  [7,]  117.9  122.3  117.9  114.8
##  [8,]   89.9   94.5   89.9   87.1
##  [9,]   28.0   27.8   28.0   27.7
## [10,]  392.3  388.2  392.3  390.2
## [11,]  106.5  104.6  106.5  107.2
## [12,]   16.7   16.4   16.7   16.8
## [13,]   80.9   79.2   80.9   81.3
## [14,]    9.0    9.0    9.0    9.0
## [15,]  201.7  200.6  201.7  199.5
## [16,]  126.2  125.1  126.2  124.8
## [17,]   75.5   75.5   75.5   74.7
## [18,]   84.1   83.0   84.1   83.6

Though things have been split up with my explanations, it only took about 10 lines of code to take the PDF document and convert the tables on the last page into a tabular matrix that will automate smoothly. Naturally, the simplicity of this process is largely dependent on how well-structure the document you’re working with is, but assuming there is some structure in place, this is a fairly straight-forward process with the help of some regular expression work.

Formatting (table replication)

Getting the values themselves in this tabular format is the primary goal of this post, but these will be harder to work with and debug without knowledge of the structure of the PDF document. Relying on the underlying structure of the PDF remaining the same, with a bit more regular expression work, we can replicate the table essentially as it appears in the document, grabbing the table headers and row names to populate a dataframe that mirrors the structure of the table.

Getting row names will be straight-forward enough - we can adapt the previous structure to extract the elements containing text (i.e. character strings) instead of numbers. Since they’ve been split at spaces, we can simply paste them back together using str_c from stringr, using a single space as the separater.

# Recreate row names
str_values <- map(elements, function(x) x[str_detect(x, "[A-z]")])
str_values <- map(str_values, str_c, collapse=" ")

# Drop footnotes
row_names <- flatten_chr(str_values)
row_names <- str_replace(row_names, "[0-9]", "")

# Drop header info
row_names <- row_names[(which(str_detect(row_names, "^[A-Z][a-z]")))]

row.names(g20_values) <- row_names
print(g20_values)

##                              [,1]   [,2]   [,3]   [,4]   [,5]   [,6]
## Total                      1413.9 1442.2 1310.8 1270.9 1248.3 1248.2
## Consumer                    855.5  882.6  746.6  744.6  737.3  737.2
## Motor vehicle loans         316.2  330.8  307.6  307.7  304.6  303.7
## Motor vehicle leases        167.3  188.9  176.1  189.5  193.2  192.5
## Revolving                    67.1   60.3   25.7   25.5   26.6   26.1
## Other                       304.9  302.5  237.3  221.8  212.9  214.9
## Real estate                 157.0  148.1  159.5  138.3  123.8  127.8
## One-to-four family          114.7  104.6  123.4  106.8   95.2   99.0
## Other                        42.3   43.5   36.0   31.5   28.6   28.8
## Business                    401.3  411.5  404.7  388.0  387.2  383.2
## Motor vehicles              143.0  150.9  103.6  105.3  106.1  101.7
## Retail loans                 28.1   30.0   14.5   15.0   15.7   15.2
## Wholesale loans              86.2   88.6   79.7   81.3   81.7   77.8
## Leases                       28.8   32.4    9.4    8.9    8.7    8.7
## Equipment                   172.4  173.9  218.9  197.2  195.3  196.7
## Loans                       110.1  115.5  122.1  115.2  117.4  119.5
## Leases                       62.3   58.4   96.8   82.0   77.9   77.2
## Other business receivables   85.9   86.7   82.2   85.6   85.8   84.8
##                              [,7]   [,8]   [,9]  [,10]  [,11]  [,12]
## Total                      1248.3 1238.2 1241.2 1240.4 1242.0 1240.4
## Consumer                    737.3  725.6  727.2  730.2  731.5  730.2
## Motor vehicle loans         304.6  301.3  303.0  308.2  308.8  308.2
## Motor vehicle leases        193.2  192.8  193.9  194.4  194.1  194.4
## Revolving                    26.6   24.6   24.1   23.5   23.7   23.5
## Other                       212.9  207.0  206.2  204.2  204.9  204.2
## Real estate                 123.8  125.5  122.6  117.9  122.3  117.9
## One-to-four family           95.2   96.9   94.8   89.9   94.5   89.9
## Other                        28.6   28.7   27.8   28.0   27.8   28.0
## Business                    387.2  387.1  391.4  392.3  388.2  392.3
## Motor vehicles              106.1  109.3  108.2  106.5  104.6  106.5
## Retail loans                 15.7   15.9   16.1   16.7   16.4   16.7
## Wholesale loans              81.7   84.7   83.2   80.9   79.2   80.9
## Leases                        8.7    8.7    8.9    9.0    9.0    9.0
## Equipment                   195.3  197.9  201.8  201.7  200.6  201.7
## Loans                       117.4  119.4  125.1  126.2  125.1  126.2
## Leases                       77.9   78.5   76.7   75.5   75.5   75.5
## Other business receivables   85.8   79.8   81.4   84.1   83.0   84.1
##                             [,13]
## Total                      1236.0
## Consumer                    731.0
## Motor vehicle loans         309.3
## Motor vehicle leases        194.1
## Revolving                    23.5
## Other                       204.1
## Real estate                 114.8
## One-to-four family           87.1
## Other                        27.7
## Business                    390.2
## Motor vehicles              107.2
## Retail loans                 16.8
## Wholesale loans              81.3
## Leases                        9.0
## Equipment                   199.5
## Loans                       124.8
## Leases                       74.7
## Other business receivables   83.6

Column names are a bit trickier, considering they rely on two levels of the table. We can get the first ‘row’ easily enough - the “Q[1-4]” structure is unique with respect to the rest of the table, so we can seek it out and use the resulting row. We’ll trim leading spaces with the str_trim function, and then split concatenated string into a vector where each element is a header.

# Recreate headers - identify quarter structure and pull
header_values <- map(elements, str_c, collapse=" ")
header_values <- flatten_chr(header_values[str_detect(header_values, "Q[1-4]")])

# Remove leading space
header_values <- str_trim(header_values, "both")

# Split to create a vector of headers
header_names <- str_split(header_values, pattern = " ", simplify = T)
##header_names <- flatten_chr(header_names)

colnames(g20_values) <- header_names
print(g20_values)

##                              2013   2014   2015   2016   2017     Q3
## Total                      1413.9 1442.2 1310.8 1270.9 1248.3 1248.2
## Consumer                    855.5  882.6  746.6  744.6  737.3  737.2
## Motor vehicle loans         316.2  330.8  307.6  307.7  304.6  303.7
## Motor vehicle leases        167.3  188.9  176.1  189.5  193.2  192.5
## Revolving                    67.1   60.3   25.7   25.5   26.6   26.1
## Other                       304.9  302.5  237.3  221.8  212.9  214.9
## Real estate                 157.0  148.1  159.5  138.3  123.8  127.8
## One-to-four family          114.7  104.6  123.4  106.8   95.2   99.0
## Other                        42.3   43.5   36.0   31.5   28.6   28.8
## Business                    401.3  411.5  404.7  388.0  387.2  383.2
## Motor vehicles              143.0  150.9  103.6  105.3  106.1  101.7
## Retail loans                 28.1   30.0   14.5   15.0   15.7   15.2
## Wholesale loans              86.2   88.6   79.7   81.3   81.7   77.8
## Leases                       28.8   32.4    9.4    8.9    8.7    8.7
## Equipment                   172.4  173.9  218.9  197.2  195.3  196.7
## Loans                       110.1  115.5  122.1  115.2  117.4  119.5
## Leases                       62.3   58.4   96.8   82.0   77.9   77.2
## Other business receivables   85.9   86.7   82.2   85.6   85.8   84.8
##                                Q4     Q1     Q2     Q3    Aug    Sep
## Total                      1248.3 1238.2 1241.2 1240.4 1242.0 1240.4
## Consumer                    737.3  725.6  727.2  730.2  731.5  730.2
## Motor vehicle loans         304.6  301.3  303.0  308.2  308.8  308.2
## Motor vehicle leases        193.2  192.8  193.9  194.4  194.1  194.4
## Revolving                    26.6   24.6   24.1   23.5   23.7   23.5
## Other                       212.9  207.0  206.2  204.2  204.9  204.2
## Real estate                 123.8  125.5  122.6  117.9  122.3  117.9
## One-to-four family           95.2   96.9   94.8   89.9   94.5   89.9
## Other                        28.6   28.7   27.8   28.0   27.8   28.0
## Business                    387.2  387.1  391.4  392.3  388.2  392.3
## Motor vehicles              106.1  109.3  108.2  106.5  104.6  106.5
## Retail loans                 15.7   15.9   16.1   16.7   16.4   16.7
## Wholesale loans              81.7   84.7   83.2   80.9   79.2   80.9
## Leases                        8.7    8.7    8.9    9.0    9.0    9.0
## Equipment                   195.3  197.9  201.8  201.7  200.6  201.7
## Loans                       117.4  119.4  125.1  126.2  125.1  126.2
## Leases                       77.9   78.5   76.7   75.5   75.5   75.5
## Other business receivables   85.8   79.8   81.4   84.1   83.0   84.1
##                               Oct
## Total                      1236.0
## Consumer                    731.0
## Motor vehicle loans         309.3
## Motor vehicle leases        194.1
## Revolving                    23.5
## Other                       204.1
## Real estate                 114.8
## One-to-four family           87.1
## Other                        27.7
## Business                    390.2
## Motor vehicles              107.2
## Retail loans                 16.8
## Wholesale loans              81.3
## Leases                        9.0
## Equipment                   199.5
## Loans                       124.8
## Leases                       74.7
## Other business receivables   83.6

At this point, the only thing missing is the row of years that floats above the last eight columns. It likely makes the most sense to concatenate the years onto the column names that are lacking a year, but doing so from raw text will be highly subject to the underlying spacing structure, inviting a significant amount of error in an automated framework. Here, I’d recommend relying on the underlying structure of the releases (of a monthly newsletter, statistical release, etc.) to enforce some pre-determined structure. For example, we’d expect the same structure for every October release of the G.20. Contingent upon where we are in the year of releases, we can dynamically set a vector of years to concatenate, should we need them. I’ll restrict this bit of code to deal with December only, but it can easily be built out to capture the structure of other months.

# Pull Month - Year portions of the release - the first will correspond to the month of the underlying data
months_expr <- "(January|February|March|April|May|June|July|August|September|October|November|December)"
release_no <- str_extract(g20_text[[1]], paste(months_expr, "[1-2][0-9]{3}"))

# Skeleton of control flow structure necessary to process releases - built out for October only
if(grepl("October", release_no)) {
  year_vec <- c("", "", "", "", "", "_2017", "_2017", "_2018", "_2018", "_2018", "_2018", "_2018", "_2018")
}

# Concatenate
colnames(g20_values) <- paste0(colnames(g20_values), year_vec)
print(g20_values)

##                              2013   2014   2015   2016   2017 Q3_2017
## Total                      1413.9 1442.2 1310.8 1270.9 1248.3  1248.2
## Consumer                    855.5  882.6  746.6  744.6  737.3   737.2
## Motor vehicle loans         316.2  330.8  307.6  307.7  304.6   303.7
## Motor vehicle leases        167.3  188.9  176.1  189.5  193.2   192.5
## Revolving                    67.1   60.3   25.7   25.5   26.6    26.1
## Other                       304.9  302.5  237.3  221.8  212.9   214.9
## Real estate                 157.0  148.1  159.5  138.3  123.8   127.8
## One-to-four family          114.7  104.6  123.4  106.8   95.2    99.0
## Other                        42.3   43.5   36.0   31.5   28.6    28.8
## Business                    401.3  411.5  404.7  388.0  387.2   383.2
## Motor vehicles              143.0  150.9  103.6  105.3  106.1   101.7
## Retail loans                 28.1   30.0   14.5   15.0   15.7    15.2
## Wholesale loans              86.2   88.6   79.7   81.3   81.7    77.8
## Leases                       28.8   32.4    9.4    8.9    8.7     8.7
## Equipment                   172.4  173.9  218.9  197.2  195.3   196.7
## Loans                       110.1  115.5  122.1  115.2  117.4   119.5
## Leases                       62.3   58.4   96.8   82.0   77.9    77.2
## Other business receivables   85.9   86.7   82.2   85.6   85.8    84.8
##                            Q4_2017 Q1_2018 Q2_2018 Q3_2018 Aug_2018
## Total                       1248.3  1238.2  1241.2  1240.4   1242.0
## Consumer                     737.3   725.6   727.2   730.2    731.5
## Motor vehicle loans          304.6   301.3   303.0   308.2    308.8
## Motor vehicle leases         193.2   192.8   193.9   194.4    194.1
## Revolving                     26.6    24.6    24.1    23.5     23.7
## Other                        212.9   207.0   206.2   204.2    204.9
## Real estate                  123.8   125.5   122.6   117.9    122.3
## One-to-four family            95.2    96.9    94.8    89.9     94.5
## Other                         28.6    28.7    27.8    28.0     27.8
## Business                     387.2   387.1   391.4   392.3    388.2
## Motor vehicles               106.1   109.3   108.2   106.5    104.6
## Retail loans                  15.7    15.9    16.1    16.7     16.4
## Wholesale loans               81.7    84.7    83.2    80.9     79.2
## Leases                         8.7     8.7     8.9     9.0      9.0
## Equipment                    195.3   197.9   201.8   201.7    200.6
## Loans                        117.4   119.4   125.1   126.2    125.1
## Leases                        77.9    78.5    76.7    75.5     75.5
## Other business receivables    85.8    79.8    81.4    84.1     83.0
##                            Sep_2018 Oct_2018
## Total                        1240.4   1236.0
## Consumer                      730.2    731.0
## Motor vehicle loans           308.2    309.3
## Motor vehicle leases          194.4    194.1
## Revolving                      23.5     23.5
## Other                         204.2    204.1
## Real estate                   117.9    114.8
## One-to-four family             89.9     87.1
## Other                          28.0     27.7
## Business                      392.3    390.2
## Motor vehicles                106.5    107.2
## Retail loans                   16.7     16.8
## Wholesale loans                80.9     81.3
## Leases                          9.0      9.0
## Equipment                     201.7    199.5
## Loans                         126.2    124.8
## Leases                         75.5     74.7
## Other business receivables     84.1     83.6

With that, not only have the table values been converted to a tidy, tabular format, but so have the headers and row.names. With a bit more time devoted to building out the control flow necessary to process each of the 12 statistical releases for the year, this process could be automated to populated a databases of these values, which could in turn be used for dynamic analysis and visualization. With just a bit of code, functions from the pdftools and stringr libraries, and some underlying structure, PDF tables can become a part of automated processes.

Scraping PDFs in R

Jared Berry

January 2019

Set-up

Extracting the table

Formatting (table replication)