Github Repository
Write-up
Presentation
We constructed a novel data set of carefully engineered financial features and used machine learning algorithms in Python to predict excess returns relative to an index. I was responsible for developing our modeling framework, including the custom cross-validation routines we used in both time-series and panel-data contexts, as well as the code to pull fundamentals data from the SimFin API. I plan to detail the modeling and cross-validation frameworks in separate blog posts; resources will also be posted here.
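Those posts are still to come, but to give a rough flavor of the time-series side of the cross-validation work, the sketch below generates expanding-window train/validation splits so that each validation fold contains only observations that come after its training window. The function name and fold parameters are hypothetical and are not taken from the project code.

```python
import numpy as np

def expanding_window_splits(n_obs, n_folds=5, min_train=60):
    """Yield (train_idx, valid_idx) pairs where each validation fold
    strictly follows its training window in time. Hypothetical sketch,
    not the project's actual implementation."""
    # Space the validation folds evenly over the observations that
    # remain after reserving an initial training window.
    fold_size = (n_obs - min_train) // n_folds
    for k in range(n_folds):
        train_end = min_train + k * fold_size
        valid_end = min(train_end + fold_size, n_obs)
        yield np.arange(train_end), np.arange(train_end, valid_end)

# Example: 5 expanding-window folds over 300 monthly observations.
for train_idx, valid_idx in expanding_window_splits(300):
    print(len(train_idx), len(valid_idx))
```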
Advanced R: Modeling (Beamer Presentation)
Advanced R: Modeling script
For the first time this Intercession at Johns Hopkins SAIS (described below), I taught an “advanced” R course, essentially a very gentle introduction to statistical learning and predictive modeling in R, largely in response to requests from students. Both the R code and the motivation for the curriculum draw heavily on the Introduction to Statistical Learning book and materials, and I defer credit to those authors for much of the work here. My goal was to offer some experience building and validating predictive models in R, which diverges somewhat from the heavily inferential econometric training in the standard curriculum.
Introduction to R Presentation (Beamer Presentation)
Introduction to R script
I’ve had the privilege of teaching R at Johns Hopkins SAIS during the Winter Intercession, and more recently during the Fall Semester, for the MIEF cohort (of which I am an alum). The Intercession is designed to provide brief, intensive, skills-based courses to the cohort, covering a wide range of topics including programming languages, investing, infrastructure finance, and more. I’ve used these presentation materials to teach R in that setting, as well as to teach R to incoming RAs in the Monetary Affairs division at the Fed. My focus is getting up and running with the basics of R syntax, data cleaning and manipulation using dplyr (along with some select functions from the tidyr package), and introductory plotting using ggplot. To complete the course, students produce a deliverable in which they clean the World Development Indicators from the World Bank (a real-world data set that demands a fair bit of cleaning) and run some analysis using the functions and workflow we cover in the course.
Annotated notebook for the Home Credit Default Risk competition
Download Jupyter notebook
Home Credit Default Risk was my first “real” competition on Kaggle. The notebook was built from the ground up, with no pre-existing structure to get things started, and I learned a lot by struggling through the competition. I opted not to post an official kernel on the competition site because, amidst all my travel over the summer while the competition was live, I didn’t have time to annotate my work in a way that I felt would be helpful for others. I’ve posted it here instead to walk through the work I did in the competition.
Annotated notebook tackling the Titanic dataset
Download Jupyter notebook
While I have no definitive proof of this, the Titanic dataset hosted on Kaggle is likely one of the most ubiquitous datasets in the field, largely because it’s such a common place for folks to take their first stab at an end-to-end data science project. I’m no exception - the Titanic dataset is an excellent place to start for a manageable first project, from EDA to running initial classification models. I took advantage of this by building out a fairly involved analysis of the dataset, demonstrating some useful techniques for exploring and visualizing data, as well as applying, evaluating, and comparing a number of workhorse machine learning models to tackle the underlying classification problem.
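To give a flavor of the model-comparison step, the sketch below assumes Kaggle's Titanic train.csv is in the working directory and cross-validates two workhorse classifiers on a handful of numeric features. The feature choices and settings are illustrative only and not lifted from the notebook itself.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Assumes Kaggle's Titanic train.csv sits in the working directory.
df = pd.read_csv("train.csv")

# A deliberately small, illustrative feature set.
df["Sex"] = (df["Sex"] == "female").astype(int)
df["Age"] = df["Age"].fillna(df["Age"].median())
X = df[["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare"]]
y = df["Survived"]

models = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "random forest": RandomForestClassifier(n_estimators=200, random_state=0),
}

# Compare mean 5-fold cross-validated accuracy for each model.
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    print(f"{name}: {scores.mean():.3f}")
```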
Audi US Expansion Case Study Presentation
Monte Carlo Model (Excel Download)
The presentation and accompanying Excel workbook are the culmination of a semester-long project to develop a Monte Carlo model of the possible states of the world facing Audi as it decides how best to expand into the US market in the coming years. Using assumptions generated from publicly available information, we built a toy model of the U.S. economy, particularly as it relates to a foreign automaker, with a dashboard that lets the user specify assumptions and directly compare a number of strategies via Monte Carlo analysis. The workbook is fairly large (~13Mb) and will be downloaded by clicking the link above. New assumptions can be processed with the underlying Monte Carlo machinery using the CTRL-n shortcut. We do not represent Audi in any way - all opinions and analysis are our own, and are generated using publicly available data at the time of writing (Spring 2017).
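The workbook's machinery lives in Excel/VBA, but the basic idea is straightforward to sketch: draw many random states of the world from assumed distributions and compare the resulting distribution of outcomes across strategies. The Python sketch below uses entirely hypothetical demand, share, and margin assumptions that are not drawn from the workbook.

```python
import numpy as np

rng = np.random.default_rng(0)
n_sims = 10_000

# Hypothetical assumptions: upfront cost and achievable market share
# under two expansion strategies; none of these figures come from the model.
strategies = {
    "expand dealer network": {"fixed_cost": 800e6, "share": rng.normal(0.035, 0.008, n_sims)},
    "build US plant": {"fixed_cost": 1500e6, "share": rng.normal(0.045, 0.012, n_sims)},
}
market_size = rng.normal(17e6, 1.5e6, n_sims)  # simulated total US auto sales (units)
unit_margin = rng.normal(4000, 800, n_sims)    # simulated profit per unit (USD)

# Compare the simulated profit distribution across strategies.
for name, s in strategies.items():
    profit = np.clip(s["share"], 0, None) * market_size * unit_margin - s["fixed_cost"]
    print(f"{name}: mean ${profit.mean() / 1e9:.2f}B, "
          f"5th percentile ${np.percentile(profit, 5) / 1e9:.2f}B")
```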