Early on in my time at the Federal Reserve, I was exposed to a number of datasets within our section that had robust geographic identifiers, making it possible to visualize the data geographically. I recalled a skills course from my graduate program that demonstrated how to create geographic heatmaps in Tableau, though only after a fair amount of preprocessing. Lacking access to Tableau in my role, I wondered whether I could build these kinds of visualizations with my (at the time) six short weeks of exposure to the R programming language. Naturally, as a testament to the vast array of libraries readily available in R, the answer was “yes.” As it turns out, with a bit of effort invested in the initial setup of the data.frame objects for plotting, and with the help of the ggplot2 library, putting together these kinds of visualizations is fairly straightforward. The resulting geographical heatmaps are impressive to look at, and they are ubiquitous in news articles, blog posts, and other pieces that seek to portray trends across the United States.

The goal of this post is two-fold: to demonstrate how to effectively clean and manipulate messy data, and to demonstrate how useful the ggplot2 package can be for quickly building publication-quality visualizations. Data cleaning and manipulation, fortunately or unfortunately depending on your perspective, constitutes much of the data analysis workflow (something like 80%). Becoming adept at it frees up more time to spend on the analysis itself, whether that is modeling, visualization, or both, and I’m a devout evangelist of the dplyr package for exactly this kind of work.

Using libraries from the Tidyverse (predominantly dplyr and ggplot2), this tutorial will walk through some of the fundamental commands of the dplyr library for quick and human-readable data manipulation code, and will make use of ggplot2 to create geographical heatmaps. While these visualizations are static, I hope to revisit this tutorial in a later post to deploy them in Shiny for a more interactive experience, so stay tuned for that.

Set-up

I’m going to operate under the assumption that the reader already has R and RStudio installed. I’ll begin by attaching the libraries I’ll be using throughout this exploration.

library(dplyr)
library(ggplot2)
library(readxl)
library(viridis)

Naturally, we’ll need data that contains geographical information to visualize in a geographical heatmap. Here, I’m using unemployment and median household income data from the USDA and BLS, available here. Since this data comes as an Excel workbook (a .xlsx file), we’ll use the readxl package to load it into memory.

# Load the data, skipping the title/source rows at the top of the file
unemployment_data <- read_excel("unemployment_data.xlsx", skip=7)

# Inspect
glimpse(unemployment_data)
## Observations: 3,275
## Variables: 52
## $ FIPStxt                                   <chr> "00000", "01000", "0...
## $ State                                     <chr> "US", "AL", "AL", "A...
## $ Area_name                                 <chr> "United States", "Al...
## $ Rural_urban_continuum_code_2013           <dbl> NA, NA, 2, 3, 6, 1, ...
## $ Urban_influence_code_2013                 <dbl> NA, NA, 2, 2, 6, 1, ...
## $ Metro_2013                                <dbl> NA, NA, 1, 1, 0, 1, ...
## $ Civilian_labor_force_2007                 <dbl> 152191093, 2175612, ...
## $ Employed_2007                             <dbl> 145156134, 2089127, ...
## $ Unemployed_2007                           <dbl> 7034959, 86485, 806,...
## $ Unemployment_rate_2007                    <dbl> 4.622451, 4.000000, ...
## $ Civilian_labor_force_2008                 <dbl> 153761095, 2176489, ...
## $ Employed_2008                             <dbl> 144860350, 2053477, ...
## $ Unemployed_2008                           <dbl> 8900745, 123012, 126...
## $ Unemployment_rate_2008                    <dbl> 5.788685, 5.700000, ...
## $ Civilian_labor_force_2009                 <dbl> 153825455, 2162999, ...
## $ Employed_2009                             <dbl> 139594698, 1924747, ...
## $ Unemployed_2009                           <dbl> 14230757, 238252, 24...
## $ Unemployment_rate_2009                    <dbl> 9.251237, 11.000000,...
## $ Civilian_labor_force_2010                 <dbl> 154270492, 2196042, ...
## $ Employed_2010                             <dbl> 139408092, 1964559, ...
## $ Unemployed_2010                           <dbl> 14862400, 231483, 22...
## $ Unemployment_rate_2010                    <dbl> 9.633988, 10.500000,...
## $ Civilian_labor_force_2011                 <dbl> 154606066, 2202670, ...
## $ Employed_2011                             <dbl> 140765697, 1990413, ...
## $ Unemployed_2011                           <dbl> 13840369, 212257, 21...
## $ Unemployment_rate_2011                    <dbl> 8.952022, 9.600000, ...
## $ Civilian_labor_force_2012                 <dbl> 155118904, 2176337, ...
## $ Employed_2012                             <dbl> 142600247, 2003290, ...
## $ Unemployed_2012                           <dbl> 12518657, 173047, 17...
## $ Unemployment_rate_2012                    <dbl> 8.070362, 8.000000, ...
## $ Civilian_labor_force_2013                 <dbl> 155485443, 2174000, ...
## $ Employed_2013                             <dbl> 144018031, 2017043, ...
## $ Unemployed_2013                           <dbl> 11467412, 156957, 16...
## $ Unemployment_rate_2013                    <dbl> 7.375232, 7.200000, ...
## $ Civilian_labor_force_2014                 <dbl> 156098972, 2161618, ...
## $ Employed_2014                             <dbl> 146470452, 2015087, ...
## $ Unemployed_2014                           <dbl> 9628520, 146531, 149...
## $ Unemployment_rate_2014                    <dbl> 6.168215, 6.800000, ...
## $ Civilian_labor_force_2015                 <dbl> 157044310, 2157376, ...
## $ Employed_2015                             <dbl> 148746591, 2025949, ...
## $ Unemployed_2015                           <dbl> 8297719, 131427, 133...
## $ Unemployment_rate_2015                    <dbl> 5.28368, 6.10000, 5....
## $ Civilian_labor_force_2016                 <dbl> 158921892, 2173175, ...
## $ Employed_2016                             <dbl> 151183680, 2045624, ...
## $ Unemployed_2016                           <dbl> 7738212, 127551, 132...
## $ Unemployment_rate_2016                    <dbl> 4.869192, 5.900000, ...
## $ Civilian_labor_force_2017                 <dbl> 160588515, 2168444, ...
## $ Employed_2017                             <dbl> 153594100, 2073106, ...
## $ Unemployed_2017                           <dbl> 6994415, 95338, 1001...
## $ Unemployment_rate_2017                    <dbl> 4.355489, 4.400000, ...
## $ Median_Household_Income_2016              <dbl> 57617, 46309, 54487,...
## $ Med_HH_Income_Percent_of_State_Total_2016 <dbl> NA, 100.0, 117.7, 12...

Data clean-up

Deferring to the framework formalized by Hadley Wickham (see ‘Tidy Data’), this is clearly a ‘messy’ dataset. As a quick aside, ‘tidy’ data has the following characteristics: (1) each variable forms a column, (2) each observation forms a row, and (3) each type of observational unit forms a table. With this framework in mind, it should be immediately clear why this data is ‘messy.’ As a panel dataset with both cross-sectional and time-series elements, this dataset is ‘wide’: the four variables of interest (Employed, Unemployed, Unemployment_rate, and Civilian_labor_force) are spread out along the time dimension, rather than ‘long’, wherein the time elements are stacked atop one another. Moreover, we’re working with three different observational units, as the data contains observations at the country, state, and county levels. Looking at a single observation of the data helps put this in context:

##   FIPStxt State          Area_name Rural_urban_continuum_code_2013
## 1   01001    AL Autauga County, AL                               2
##   Urban_influence_code_2013 Metro_2013 Civilian_labor_force_2007
## 1                         2          1                     24383
##   Employed_2007 Unemployed_2007 Unemployment_rate_2007
## 1         23577             806                    3.3
##   Civilian_labor_force_2008 Employed_2008 Unemployed_2008
## 1                     24687         23420            1267
##   Unemployment_rate_2008 Civilian_labor_force_2009 Employed_2009
## 1                    5.1                     24703         22301
##   Unemployed_2009 Unemployment_rate_2009
## 1            2402                    9.7

Rather than representing the time dimension of this data with a separate column for each variable for each year, we’d like to stretch these out so that each observation is a county-year pair. Additionally, since our final geographical visualization will be at the county level, we can remove the other observational units. Stripping some of the other identifier variables, we’ll end up with a dataset that has three columns of identifiers (FIPStxt, State, Area_name), a date column, and the four variables listed above.
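To make the target concrete, here is roughly what the ‘long’ layout looks like, with one row per county-year pair (the figures are the Autauga County values that reappear later in this post, shown here purely for illustration; the remaining three variables would sit alongside Unemployment_rate):

FIPStxt State Area_name          date       Unemployment_rate
01001   AL    Autauga County, AL 2007-01-01 3.3
01001   AL    Autauga County, AL 2008-01-01 5.1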

# Specify a vector to capture the years of observation
data_years <- seq(2007, 2017, by=1)

# Separate out the variables that are time series and those that are fixed
ts_vars <- c("Civilian_labor_force", "Employed", "Unemployed", "Unemployment_rate")
fixed_vars <- names(unemployment_data)[1:3]

I use a loop to parse out the time-series variables, place these cuts of the data in a list, and stack everything up in ‘long’ format. Admittedly, we don’t need much dplyr syntax to make this work. At a high level, this bit of code does the following: 1) for each year in the data_years vector created above, it pastes the year onto the root of each time-series variable name to index those columns in the data.frame; 2) selects these variables, along with the ‘fixed’ variables, from the ‘wide’-format data.frame using the dplyr select statement; 3) renames the variables to drop the year suffixes so they are uniform across all years; 4) assigns a date so we don’t lose the time dimension entirely; and 5) stores the result in a list. I then make handy use of the Reduce command to bind all the resulting year-level data.frames into a single data.frame object.

# Create and pre-allocate an empty list to store the year subsets
ts_unemployment <- vector("list", length(data_years))
names(ts_unemployment) <- paste0("y", data_years)

for (y in data_years) {
  date_variables <- paste0(ts_vars, "_", y)
  
  year_data <- select(unemployment_data, fixed_vars, date_variables)
  names(year_data)[which(!names(year_data) %in% fixed_vars)] <- ts_vars
  year_data$date <- as.Date(paste0(y, "-01-01"))
  
  ts_unemployment[[paste0("y", y)]] <- year_data
}

# Stack everything row-wise
ts_unemployment <- Reduce("rbind", ts_unemployment)

# Inspect
glimpse(ts_unemployment)
## Observations: 36,025
## Variables: 8
## $ FIPStxt              <chr> "00000", "01000", "01001", "01003", "0100...
## $ State                <chr> "US", "AL", "AL", "AL", "AL", "AL", "AL",...
## $ Area_name            <chr> "United States", "Alabama", "Autauga Coun...
## $ Civilian_labor_force <dbl> 152191093, 2175612, 24383, 82659, 10334, ...
## $ Employed             <dbl> 145156134, 2089127, 23577, 80099, 9684, 8...
## $ Unemployed           <dbl> 7034959, 86485, 806, 2560, 650, 359, 849,...
## $ Unemployment_rate    <dbl> 4.622451, 4.000000, 3.300000, 3.100000, 6...
## $ date                 <date> 2007-01-01, 2007-01-01, 2007-01-01, 2007...
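As an aside, dplyr’s bind_rows function accepts a list of data.frames directly, so the Reduce call above could equivalently have been written as follows:

# Alternative to the Reduce("rbind", ...) line above: bind_rows stacks a list of data.frames
ts_unemployment <- bind_rows(ts_unemployment)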

It’s worth noting that, in instances where you’re working with fewer dimensions to gather (here we have three identifiers, four variables, and a time element for each of them), it can be much cleaner to deploy the gather function from tidyr. While gather is immensely powerful for this sort of work, I find its syntax difficult to manage, and for cases like this one the looping structure works just as well and makes it clearer how the cleaning is being done.
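For the curious, here is a rough sketch of what the tidyr route might look like for this dataset: gather the year-suffixed columns into key-value pairs, split the key into a variable name and a year, and spread the variable names back out into columns. The select helpers and the regular expression below assume the column naming pattern shown in the glimpse output above.

library(tidyr)

# A sketch of the gather/spread alternative (assumes the column names shown above)
ts_tidyr <- unemployment_data %>%
  select(FIPStxt, State, Area_name,
         starts_with("Civilian_labor_force"), starts_with("Employed"),
         starts_with("Unemployed"), starts_with("Unemployment_rate")) %>%
  # Collapse the year-suffixed columns into variable_year/value pairs
  gather(key = "variable_year", value = "value", -FIPStxt, -State, -Area_name) %>%
  # Split e.g. "Unemployment_rate_2007" into "Unemployment_rate" and "2007"
  extract(variable_year, into = c("variable", "year"), regex = "(.*)_(\\d{4})$") %>%
  # Spread the four variable names back out into their own columns
  spread(variable, value) %>%
  mutate(date = as.Date(paste0(year, "-01-01")))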

Now that the data is formatted correctly (read: ‘tidy’), we simply pare things down to the county level, removing the state- and country-level observations, and arrange everything according to the county-year pairings we created.

# Base R has a number of datasets 'hidden' as default objects that can be called
# with the 'data' function; here we load the built-in state names to filter against
data(state)

# Use the dplyr 'filter' function to drop the state- and national-level rows,
# keeping only county-level observations
county_unemp <- filter(ts_unemployment, !(Area_name %in% state.name) & Area_name!='United States')

# Sort the data by state/county, then by year
county_unemp <- arrange(county_unemp, FIPStxt, Area_name, date)

# Inspect
glimpse(county_unemp)
## Observations: 35,464
## Variables: 8
## $ FIPStxt              <chr> "01001", "01001", "01001", "01001", "0100...
## $ State                <chr> "AL", "AL", "AL", "AL", "AL", "AL", "AL",...
## $ Area_name            <chr> "Autauga County, AL", "Autauga County, AL...
## $ Civilian_labor_force <dbl> 24383, 24687, 24703, 25713, 25836, 25740,...
## $ Employed             <dbl> 23577, 23420, 22301, 23431, 23677, 23961,...
## $ Unemployed           <dbl> 806, 1267, 2402, 2282, 2159, 1779, 1605, ...
## $ Unemployment_rate    <dbl> 3.3, 5.1, 9.7, 8.9, 8.4, 6.9, 6.2, 5.8, 5...
## $ date                 <date> 2007-01-01, 2008-01-01, 2009-01-01, 2010...

Formatting geographical characteristics

Fortunately for us, ggplot2 provides convenient access to geographical data (via its map_data function, which draws on the maps package), making it fairly straightforward to format our data for a geographical heatmap:

# Get county-level map data (available when ggplot2 is loaded)
counties_gg <- map_data("county")
glimpse(counties_gg)
## Observations: 87,949
## Variables: 6
## $ long      <dbl> -86.50517, -86.53382, -86.54527, -86.55673, -86.5796...
## $ lat       <dbl> 32.34920, 32.35493, 32.36639, 32.37785, 32.38357, 32...
## $ group     <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ order     <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 1...
## $ region    <chr> "alabama", "alabama", "alabama", "alabama", "alabama...
## $ subregion <chr> "autauga", "autauga", "autauga", "autauga", "autauga...

As we can see, this counties_gg object contains latitude-longitude pairs for each region (state) and subregion (county) in the United States. It’s these lat-long pairs that are in turn used by the ggplot2 geom_polygon function to actually fill the areas they prescribe. We’ll now need to merge these lat-long pairs with our existing data, which requires a bit of string formatting/manipulation. Note that the subregion field of the counties_gg data is a lowercase county name without any additional text (no state abbreviation, no ‘County’, etc.), so we’ll need to strip these from our data for the merge to be stable. I’m using a bit of regex to simplify this process. A discussion of regular expressions is far outside the scope of this demonstration, but the following bit of code replaces every instance of a space followed by County, Parish, Census Area, Borough, or Municipality (and all subsequent text) with an empty string, leaving only the name of the county/subregion, which will match up with the counties_gg data.

# Replace extraneous text from county names (Area_name variable)
county_unemp$subregion <- gsub(" (County|Parish|Census Area|Borough|Municipality).*", "", county_unemp$Area_name)

# Lowercase
county_unemp$subregion <- tolower(county_unemp$subregion)
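As a quick sanity check, we can apply the same pattern to a few example strings (the names below are just illustrative):

# Test the pattern on a handful of sample Area_name-style strings
sample_names <- c("Autauga County, AL", "Orleans Parish, LA", "Anchorage Municipality, AK")
gsub(" (County|Parish|Census Area|Borough|Municipality).*", "", sample_names)
# Returns: "Autauga" "Orleans" "Anchorage"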

We’ll also need to convert the abbreviated state names to their full names, again making use of the data(state) objects, and I merge in these characteristics using the dplyr left_join command. I find the dplyr join family far more intuitive than base R’s merge function and use them exclusively for merges: they do exactly what their names suggest, they align exactly with their SQL counterparts, and as a result they tend to be more stable and less prone to side effects.

# Create a state_info object to merge state names back in for our keys later
state_info <- cbind(state.name, state.abb)

# Since this is a matrix, we have to convert to a data.frame
state_info <- data.frame(state_info)
colnames(state_info) <- c("region", "state")

# Format keys correctly
state_info$region <- tolower(state_info$region)

# Merge in the state_info data.frame we created
county_data <- left_join(county_unemp, state_info, by=c("State"="state"))
## Warning: Column `State`/`state` joining character vector and factor,
## coercing into character vector
# Inspect
glimpse(county_data)
## Observations: 35,464
## Variables: 10
## $ FIPStxt              <chr> "01001", "01001", "01001", "01001", "0100...
## $ State                <chr> "AL", "AL", "AL", "AL", "AL", "AL", "AL",...
## $ Area_name            <chr> "Autauga County, AL", "Autauga County, AL...
## $ Civilian_labor_force <dbl> 24383, 24687, 24703, 25713, 25836, 25740,...
## $ Employed             <dbl> 23577, 23420, 22301, 23431, 23677, 23961,...
## $ Unemployed           <dbl> 806, 1267, 2402, 2282, 2159, 1779, 1605, ...
## $ Unemployment_rate    <dbl> 3.3, 5.1, 9.7, 8.9, 8.4, 6.9, 6.2, 5.8, 5...
## $ date                 <date> 2007-01-01, 2008-01-01, 2009-01-01, 2010...
## $ subregion            <chr> "autauga", "autauga", "autauga", "autauga...
## $ region               <chr> "alabama", "alabama", "alabama", "alabama...
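For readers coming from base R, the rough equivalent of the left_join above is a merge call with all.x = TRUE, which keeps every row of the left-hand table:

# Base R equivalent of the left_join above, shown only for comparison
county_data_base <- merge(county_unemp, state_info,
                          by.x = "State", by.y = "state", all.x = TRUE)

One practical difference is that merge re-sorts the result by the merge key by default, whereas left_join preserves the row order of the left-hand table - another reason I prefer the dplyr joins.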

Now that everything is set up for a merge with the lat-long pairs from the ggplot2 data, we’ll select a single year of data and a specific variable for plotting and move forward with creating the heatmap. Here I focus on the most recent year of data and the Unemployment_rate variable. In a future tutorial, with the help of Shiny, we’ll make use of the full assemblage of data when creating reactive heatmaps.

# Focus on 2017 numbers
county_data_17 <- filter(county_data, date == "2017-01-01")

# Pull our keys and some variables of interest
county_subset <- select(county_data_17, region, subregion, Unemployment_rate)

# Merge with counties_gg
map_data <- left_join(counties_gg, county_subset, by=c("region", "subregion"))

# Inspect
glimpse(map_data)
## Observations: 87,949
## Variables: 7
## $ long              <dbl> -86.50517, -86.53382, -86.54527, -86.55673, ...
## $ lat               <dbl> 32.34920, 32.35493, 32.36639, 32.37785, 32.3...
## $ group             <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ order             <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 1...
## $ region            <chr> "alabama", "alabama", "alabama", "alabama", ...
## $ subregion         <chr> "autauga", "autauga", "autauga", "autauga", ...
## $ Unemployment_rate <dbl> 3.9, 3.9, 3.9, 3.9, 3.9, 3.9, 3.9, 3.9, 3.9,...

Creating the heatmap

The ggplot2 syntax is somewhat alien relative to, well, just about any other syntax in R, or any other programming language I’ve dealt with for that matter. That aside, it’s fairly easy to pick up: ggplot2 plots are built up in layers. I won’t go into great detail on the syntax or the other options of the ggplot2 library, but the documentation available for the package is fantastic for getting up and running.

We call the ggplot function, specify the map_data data.frame, and begin setting our aes (aesthetic) parameters: here x and y correspond to the longitude and latitude values merged in above, and the group aesthetic is set to the group variable from our data, which identifies the polygon each lat-long pair belongs to. We add an additional layer by ending the data layer line with a +. Using the geom_polygon function, we can then fill each county polygon with its Unemployment_rate value.

#Plot
ggplot(map_data, aes(x=long,y=lat,group=group)) +
  geom_polygon(aes(fill=Unemployment_rate))

The 80/20 rule certainly applies here - with 20% of the work, we’re about 80% of the way to a publication-quality heatmap. The remaining 20% requires a good deal more ggplot2 code, detailed below and commented line-by-line. Again, rather than explaining what each bit of syntax does, I’ll defer to the documentation linked above.

ggplot(map_data, aes(x=long,y=lat,group=group)) +
  geom_polygon(aes(fill=Unemployment_rate)) +
  
  # Delineate county borders with a thin line
  geom_path(size=0.1) +
  
  # Apply a viridis color-scale - "inferno" works well here
  scale_fill_viridis(option="inferno", direction=-1) +
  
  # Set title
  ggtitle("Unemployment Rate, by County - 2017") +
  
  # Specify legend placement and title text formatting; remove a number of theme items that clutter the plot
  theme(plot.title = element_text(size=9, face='bold'),
        legend.direction = "vertical",
        legend.position = "right",
        legend.title=element_blank(),
        panel.background=element_blank(),
        axis.title.x = element_blank(),
        axis.title.y = element_blank(),
        axis.text.y = element_blank(),
        axis.text.x = element_blank(),
        axis.ticks = element_blank())
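If you’d like to write the finished figure to disk, ggplot2’s ggsave function will save the most recently rendered plot; the file name and dimensions below are just examples:

# Save the last plot; adjust width/height/dpi to suit your publication target
ggsave("county_unemployment_2017.png", width = 10, height = 6, dpi = 300)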

And with that, we have a high-quality geographical heatmap of unemployment across the United States. Naturally, this static visualization will leave most readers wanting: we’re unable to capture the time dimension we worked so hard to clean up, and we can’t necessarily tell which counties are experiencing which rates of unemployment without prior knowledge, a map, or a listing of the data. Providing options to move the plot through time, and to surface the names of the counties, would make this plot significantly more powerful. Shiny has the power to do just that, and I’ll revisit these improvements in a later post.