Blog

Ad Recall Entity Resolution

For Super Bowl LV Commercials, No Brand Fills Void Left by Budweiser

This was another opportunity to leverage some of the open-end canonicalization/entity resolution work laid out in last year’s analysis, and more generally in the Edit Distance and Approximate String Matching post below.

Election Issue Open-Ended Analysis

What Is the 2020 Election About? We Gave 4,400 People a Blank Space to Tell Us

We’ve been doing a lot to better operationalize the use of open-ended responses at Morning Consult, and this is a testament to that work. My colleague, Sam Goodgame, built the engine that powers this in Python, predominantly leveraging Spacy to do much of the NLP work. We do quite a bit of analysis in R, as well, (primarily a suite of functions to operationalize unaided awareness/recall type questions - a post on that to come) and we leveraged many of those utilities to map the categorized open-ended responses back to the respondent-level data, which let us cut things across demographics to deepen the analysis.

Congressional District Media Usage

The Congressional Districts Potentially Most Vulnerable to Facebook Misinformation
In Blue Districts, Republicans More Likely Than Democrats to Use Twitter
Fox News’ Upper Hand Next Year: The Country’s Eyeballs, Down to Districts

One of the mandates which I was brought on to fulfill at Morning Consult could largely be construed as “we’re sitting on mountains of data, and we need people like you to help us do more with it.” This was one of those tangible intersections of my work and niche interests in R (creating heatmaps, see below). This project involved taking millions of respondent-level observations from our brand tracking engine and distilling them into reprentative, congressional-district level data points (of particular interest in an election year). All of the data work that went into the maps above was done in R, along with the generation of the maps themselves (thank you USABoundaries package, which saved me the trouble of messing with the TIGER shape files directly) - credit goes to Joanna Piacenza for spinning the narratives around these data, and to our design team for prettifying things for publication.

A Capstone By Another Name (Goodbye EViews)

End-to-end Exchange Rate Forecasting Capstone Pipeline

In the “Research” tab above, I’ve posted my capstone (examining the role of “fundamentals” (or theory) in long-term exchange rate forecasting) from my master’s program. As a part of the MIEF Winter Intercession this year, I led an “R for Capstone: Case-Study,” as the latest chapter in my efforts to normalize R as a better alternative to the proprietary languages that typically dominate econometrics coursework. The R script above establishes an end-to-end analytics pipeline from data ingestion to reporting that constitutes a majority of the analysis featured in this research. In particular, I leverage quantmod, which is a phenomenal package for rapidly ingesting financial and macroeconomic time-series data from FRED and Yahoo! Finance, without the trouble of using something like httr to pull data from an API. With this lightweight ingestion set up, one should be able to pull this R script off the shelf, install all the requisite packages, and run all the same analysis (with visualizations and regression tables vis-a-vis stargazer!).

Edit Distance and Approximate String Matching

Edit Distance
Download RMarkdown

Distance metrics are excellent for implementing approximate string matching, whether as an initial step in entity resolution or building out programs to execute algorithms similar to those one would see in predictive text and spell check. This post delves into a dynamic programming implementation of edit distance to understand how things work “under-the-hood”, and then provides examples of using the built in adist and agrep functions readily available in R.

Scraping PDFs in R with `pdftools` and `stringr`

PDF scraping in R
Download RMarkdown

With the sheer volume at our disposal in today’s world, it’s no surprise that much of it is not neatly-packaged for use “out-of-the-box.” This brief tutorial demonstrates the uses of pdftools and stringr to automate the processing of tables that come in PDF files (e.g. tables available in newsletters, statistical releases, earnings reports, etc.). With a bit of regular expression know-how, and these packages, one can scrape tables off PDFs into tidy, tabular formats for further use in analytics pipelines.

Building heatmaps in R with `dplyr` and `ggplot2`

Heatmaps in R
Download RMarkdown

For my inaugural post in this blog I’m revisiting one of the first challenges I came up against in my time as an analyst at the Board. I didn’t fully appreciate the power of R prior to working through this exercise, and was blown away by how simple it can be to build incredible visualizations in R. This is a brief tutorial using the dplyr and ggplot2 libraries to work through the process of cleaning a poorly-formatted dataset and harness the power of the ggplot2 library to turn that data into a geographical heatmap of the United States, at the county level.