Home Credit Default

Home Credit Default Risk is my first 'real' Kaggle competition submission. Since the Titanic dataset is so ubiquitous, and I got some help establishing the initial structure from my MOOC resources, it didn't provide quite the same challenge as a legitimate competition would. Home Credit asked Kagglers to predict whether or not an individual would pay back, or default on, a loan, with the underlying mission of broadening financial inclusion for the unbanked community.

This is a larger dataset (though not huge), with approximately 300,000 observations in the training data and 50,000 in the test set, and the competition also offered the possibility of using a number of auxiliary datasets. The primary training dataset included observations at the individual borrower level, while the supplementary datasets provided information at coarser units of observation. Given my time constraints when this competition was live (the summer of 2018), I focused on doing the best I could with the main dataset, and did not spend the time incorporating the supplemental data. Moreover, since I didn't have time to annotate this notebook in a way I felt did the work here justice prior to competition close in August, I didn't submit a formal kernel. This notebook details the work I did on this competition end-to-end. While I didn't place particularly high on the public leaderboard, I learned a lot by struggling through this, and am happy with the progress I made working through things.

1. Set-up

In [199]:
# Import necessary libraries for data preparation/EDA
import os
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

%matplotlib inline

sns.set_style('darkgrid')
In [325]:
# Set directory and import application training data
os.chdir(r'C:\Users\Jared\Documents\datasci\projects\kaggle\home_credit_default')
application_train = pd.read_csv('application_train.csv')
test = pd.read_csv('application_test.csv')

application_train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 307511 entries, 0 to 307510
Columns: 122 entries, SK_ID_CURR to AMT_REQ_CREDIT_BUREAU_YEAR
dtypes: float64(65), int64(41), object(16)
memory usage: 286.2+ MB

I begin by stacking the test and training sets on top of one another. I acknowledge there are some potential issues with this, first and foremost data leakage: using test data along with training data for pre-processing potentially injects information from the test set that we wouldn't typically be able to observe. That said, since this represents something of a full population provided by Kaggle, steps like mean/median imputation of missing values benefit from using the full sample. By concatenating, I can still separate the test data from the training data when I actually conduct my modelling steps further on. This makes cleaning the data much easier, and provides more realistic estimates of the means and medians used to fill missing values than treating the two sets separately.

In [348]:
# Stack train and test data
t_obs = application_train.shape[0]
print(t_obs)

test['TARGET'] = np.nan
test = test[application_train.columns.tolist()]

all_data = pd.concat([application_train, test], axis=0)
all_data.reset_index(inplace=True)
all_data.info()
307511
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 356255 entries, 0 to 356254
Columns: 123 entries, index to AMT_REQ_CREDIT_BUREAU_YEAR
dtypes: float64(66), int64(41), object(16)
memory usage: 334.3+ MB

2. EDA

I chose to work with both my training dataset and the stacked 'all_data' dataset concurrently in my exploratory data analysis and cleaning. I like to visualize the data I'm working with relative to the target; naturally, that's not feasible with the stacked dataset, whose test portion lacks the target, so both datasets will come into play here.

2a. Target distribution

A quick countplot of the target variable makes it clear that the dataset we're dealing with here is fairly imbalanced, which has some ramifications for how we train and evaluate our models (more on this below).

In [214]:
# Countplot of the target variable indicates how imbalanced this dataset is
sns.countplot('TARGET', data=application_train, palette='viridis')
plt.show()
print('% Positive Class: {}'.format(100*round(application_train['TARGET'].mean(),4)))
% Positive Class: 8.07

2b. Missing values

In [4]:
# Visualize missing data - determine the extent of missings in the dataset
# Not super informative with this many features
sns.heatmap(application_train.isnull(), yticklabels=False, xticklabels=False, cbar=False, cmap='viridis')
plt.show()
In [203]:
# Create a dataframe with information on these missing values
# (counts come from application_train; the percentages use the stacked all_data row count as the denominator)
tot_msg = application_train.isnull().sum(axis=0)
pct_msg = tot_msg/all_data.shape[0]
msg_info = pd.concat([tot_msg, pct_msg], axis=1)
msg_info.columns = ['# Missing Obs.', '% Missing Obs.']

msg_info[2:62]
Out[203]:
# Missing Obs. % Missing Obs.
NAME_CONTRACT_TYPE 0 0.000000
CODE_GENDER 0 0.000000
FLAG_OWN_CAR 0 0.000000
FLAG_OWN_REALTY 0 0.000000
CNT_CHILDREN 0 0.000000
AMT_INCOME_TOTAL 0 0.000000
AMT_CREDIT 0 0.000000
AMT_ANNUITY 12 0.000034
AMT_GOODS_PRICE 278 0.000780
NAME_TYPE_SUITE 1292 0.003627
NAME_INCOME_TYPE 0 0.000000
NAME_EDUCATION_TYPE 0 0.000000
NAME_FAMILY_STATUS 0 0.000000
NAME_HOUSING_TYPE 0 0.000000
REGION_POPULATION_RELATIVE 0 0.000000
DAYS_BIRTH 0 0.000000
DAYS_EMPLOYED 0 0.000000
DAYS_REGISTRATION 0 0.000000
DAYS_ID_PUBLISH 0 0.000000
OWN_CAR_AGE 202929 0.569617
FLAG_MOBIL 0 0.000000
FLAG_EMP_PHONE 0 0.000000
FLAG_WORK_PHONE 0 0.000000
FLAG_CONT_MOBILE 0 0.000000
FLAG_PHONE 0 0.000000
FLAG_EMAIL 0 0.000000
OCCUPATION_TYPE 96391 0.270567
CNT_FAM_MEMBERS 2 0.000006
REGION_RATING_CLIENT 0 0.000000
REGION_RATING_CLIENT_W_CITY 0 0.000000
WEEKDAY_APPR_PROCESS_START 0 0.000000
HOUR_APPR_PROCESS_START 0 0.000000
REG_REGION_NOT_LIVE_REGION 0 0.000000
REG_REGION_NOT_WORK_REGION 0 0.000000
LIVE_REGION_NOT_WORK_REGION 0 0.000000
REG_CITY_NOT_LIVE_CITY 0 0.000000
REG_CITY_NOT_WORK_CITY 0 0.000000
LIVE_CITY_NOT_WORK_CITY 0 0.000000
ORGANIZATION_TYPE 0 0.000000
EXT_SOURCE_1 173378 0.486668
EXT_SOURCE_2 660 0.001853
EXT_SOURCE_3 60965 0.171127
APARTMENTS_AVG 156061 0.438060
BASEMENTAREA_AVG 179943 0.505096
YEARS_BEGINEXPLUATATION_AVG 150007 0.421066
YEARS_BUILD_AVG 204488 0.573993
COMMONAREA_AVG 214865 0.603121
ELEVATORS_AVG 163891 0.460038
ENTRANCES_AVG 154828 0.434599
FLOORSMAX_AVG 153020 0.429524
FLOORSMIN_AVG 208642 0.585654
LANDAREA_AVG 182590 0.512526
LIVINGAPARTMENTS_AVG 210199 0.590024
LIVINGAREA_AVG 154350 0.433257
NONLIVINGAPARTMENTS_AVG 213514 0.599329
NONLIVINGAREA_AVG 169682 0.476294
APARTMENTS_MODE 156061 0.438060
BASEMENTAREA_MODE 179943 0.505096
YEARS_BEGINEXPLUATATION_MODE 150007 0.421066
YEARS_BUILD_MODE 204488 0.573993
In [204]:
msg_info[63:122]
Out[204]:
# Missing Obs. % Missing Obs.
ELEVATORS_MODE 163891 0.460038
ENTRANCES_MODE 154828 0.434599
FLOORSMAX_MODE 153020 0.429524
FLOORSMIN_MODE 208642 0.585654
LANDAREA_MODE 182590 0.512526
LIVINGAPARTMENTS_MODE 210199 0.590024
LIVINGAREA_MODE 154350 0.433257
NONLIVINGAPARTMENTS_MODE 213514 0.599329
NONLIVINGAREA_MODE 169682 0.476294
APARTMENTS_MEDI 156061 0.438060
BASEMENTAREA_MEDI 179943 0.505096
YEARS_BEGINEXPLUATATION_MEDI 150007 0.421066
YEARS_BUILD_MEDI 204488 0.573993
COMMONAREA_MEDI 214865 0.603121
ELEVATORS_MEDI 163891 0.460038
ENTRANCES_MEDI 154828 0.434599
FLOORSMAX_MEDI 153020 0.429524
FLOORSMIN_MEDI 208642 0.585654
LANDAREA_MEDI 182590 0.512526
LIVINGAPARTMENTS_MEDI 210199 0.590024
LIVINGAREA_MEDI 154350 0.433257
NONLIVINGAPARTMENTS_MEDI 213514 0.599329
NONLIVINGAREA_MEDI 169682 0.476294
FONDKAPREMONT_MODE 210295 0.590293
HOUSETYPE_MODE 154297 0.433108
TOTALAREA_MODE 148431 0.416643
WALLSMATERIAL_MODE 156341 0.438846
EMERGENCYSTATE_MODE 145755 0.409131
OBS_30_CNT_SOCIAL_CIRCLE 1021 0.002866
DEF_30_CNT_SOCIAL_CIRCLE 1021 0.002866
OBS_60_CNT_SOCIAL_CIRCLE 1021 0.002866
DEF_60_CNT_SOCIAL_CIRCLE 1021 0.002866
DAYS_LAST_PHONE_CHANGE 1 0.000003
FLAG_DOCUMENT_2 0 0.000000
FLAG_DOCUMENT_3 0 0.000000
FLAG_DOCUMENT_4 0 0.000000
FLAG_DOCUMENT_5 0 0.000000
FLAG_DOCUMENT_6 0 0.000000
FLAG_DOCUMENT_7 0 0.000000
FLAG_DOCUMENT_8 0 0.000000
FLAG_DOCUMENT_9 0 0.000000
FLAG_DOCUMENT_10 0 0.000000
FLAG_DOCUMENT_11 0 0.000000
FLAG_DOCUMENT_12 0 0.000000
FLAG_DOCUMENT_13 0 0.000000
FLAG_DOCUMENT_14 0 0.000000
FLAG_DOCUMENT_15 0 0.000000
FLAG_DOCUMENT_16 0 0.000000
FLAG_DOCUMENT_17 0 0.000000
FLAG_DOCUMENT_18 0 0.000000
FLAG_DOCUMENT_19 0 0.000000
FLAG_DOCUMENT_20 0 0.000000
FLAG_DOCUMENT_21 0 0.000000
AMT_REQ_CREDIT_BUREAU_HOUR 41519 0.116543
AMT_REQ_CREDIT_BUREAU_DAY 41519 0.116543
AMT_REQ_CREDIT_BUREAU_WEEK 41519 0.116543
AMT_REQ_CREDIT_BUREAU_MON 41519 0.116543
AMT_REQ_CREDIT_BUREAU_QRT 41519 0.116543
AMT_REQ_CREDIT_BUREAU_YEAR 41519 0.116543
In [249]:
# In order to speed up some processing, remove the housing summary statistics that are predominantly missing from the data
to_drop_ = all_data.columns[45:92].tolist()
all_data.drop(to_drop_, axis=1, inplace=True)
all_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 356255 entries, 0 to 356254
Data columns (total 76 columns):
index                          356255 non-null int64
SK_ID_CURR                     356255 non-null int64
TARGET                         307511 non-null float64
NAME_CONTRACT_TYPE             356255 non-null object
CODE_GENDER                    356255 non-null object
FLAG_OWN_CAR                   356255 non-null object
FLAG_OWN_REALTY                356255 non-null object
CNT_CHILDREN                   356255 non-null int64
AMT_INCOME_TOTAL               356255 non-null float64
AMT_CREDIT                     356255 non-null float64
AMT_ANNUITY                    356219 non-null float64
AMT_GOODS_PRICE                355977 non-null float64
NAME_TYPE_SUITE                354052 non-null object
NAME_INCOME_TYPE               356255 non-null object
NAME_EDUCATION_TYPE            356255 non-null object
NAME_FAMILY_STATUS             356255 non-null object
NAME_HOUSING_TYPE              356255 non-null object
REGION_POPULATION_RELATIVE     356255 non-null float64
DAYS_BIRTH                     356255 non-null int64
DAYS_EMPLOYED                  356255 non-null int64
DAYS_REGISTRATION              356255 non-null float64
DAYS_ID_PUBLISH                356255 non-null int64
OWN_CAR_AGE                    121014 non-null float64
FLAG_MOBIL                     356255 non-null int64
FLAG_EMP_PHONE                 356255 non-null int64
FLAG_WORK_PHONE                356255 non-null int64
FLAG_CONT_MOBILE               356255 non-null int64
FLAG_PHONE                     356255 non-null int64
FLAG_EMAIL                     356255 non-null int64
OCCUPATION_TYPE                244259 non-null object
CNT_FAM_MEMBERS                356253 non-null float64
REGION_RATING_CLIENT           356255 non-null int64
REGION_RATING_CLIENT_W_CITY    356255 non-null int64
WEEKDAY_APPR_PROCESS_START     356255 non-null object
HOUR_APPR_PROCESS_START        356255 non-null int64
REG_REGION_NOT_LIVE_REGION     356255 non-null int64
REG_REGION_NOT_WORK_REGION     356255 non-null int64
LIVE_REGION_NOT_WORK_REGION    356255 non-null int64
REG_CITY_NOT_LIVE_CITY         356255 non-null int64
REG_CITY_NOT_WORK_CITY         356255 non-null int64
LIVE_CITY_NOT_WORK_CITY        356255 non-null int64
ORGANIZATION_TYPE              356255 non-null object
EXT_SOURCE_1                   162345 non-null float64
EXT_SOURCE_2                   355587 non-null float64
EXT_SOURCE_3                   286622 non-null float64
OBS_30_CNT_SOCIAL_CIRCLE       355205 non-null float64
DEF_30_CNT_SOCIAL_CIRCLE       355205 non-null float64
OBS_60_CNT_SOCIAL_CIRCLE       355205 non-null float64
DEF_60_CNT_SOCIAL_CIRCLE       355205 non-null float64
DAYS_LAST_PHONE_CHANGE         356254 non-null float64
FLAG_DOCUMENT_2                356255 non-null int64
FLAG_DOCUMENT_3                356255 non-null int64
FLAG_DOCUMENT_4                356255 non-null int64
FLAG_DOCUMENT_5                356255 non-null int64
FLAG_DOCUMENT_6                356255 non-null int64
FLAG_DOCUMENT_7                356255 non-null int64
FLAG_DOCUMENT_8                356255 non-null int64
FLAG_DOCUMENT_9                356255 non-null int64
FLAG_DOCUMENT_10               356255 non-null int64
FLAG_DOCUMENT_11               356255 non-null int64
FLAG_DOCUMENT_12               356255 non-null int64
FLAG_DOCUMENT_13               356255 non-null int64
FLAG_DOCUMENT_14               356255 non-null int64
FLAG_DOCUMENT_15               356255 non-null int64
FLAG_DOCUMENT_16               356255 non-null int64
FLAG_DOCUMENT_17               356255 non-null int64
FLAG_DOCUMENT_18               356255 non-null int64
FLAG_DOCUMENT_19               356255 non-null int64
FLAG_DOCUMENT_20               356255 non-null int64
FLAG_DOCUMENT_21               356255 non-null int64
AMT_REQ_CREDIT_BUREAU_HOUR     308687 non-null float64
AMT_REQ_CREDIT_BUREAU_DAY      308687 non-null float64
AMT_REQ_CREDIT_BUREAU_WEEK     308687 non-null float64
AMT_REQ_CREDIT_BUREAU_MON      308687 non-null float64
AMT_REQ_CREDIT_BUREAU_QRT      308687 non-null float64
AMT_REQ_CREDIT_BUREAU_YEAR     308687 non-null float64
dtypes: float64(23), int64(41), object(12)
memory usage: 206.6+ MB

2c. Distributions of key variables

These variables represent what I would consider the foundation of how decisions to extend loans are made. It doesn't take much imagination or institutional knowledge to figure that the income, credit, annuity, and employment variables would likely be highly important for determining whether or not an individual is able to pay back a loan. Here I explore the distributions of these variables relative to the target, and uncover some idiosyncrasies in how they've been coded in the dataset.

AMT_INCOME_TOTAL

In [250]:
application_train['Default'] = np.where(application_train['TARGET'] == 1, "Defaulted", "Paid")
In [208]:
# Look at distribution of AMT_INCOME_TOTAL
plt.figure(figsize=(18,6))
plt.title('Distribution of AMT_INCOME_TOTAL, by Default Status')
sns.stripplot(x='AMT_INCOME_TOTAL', y='Default', data=application_train, jitter=1, palette='viridis')
plt.show()
In [19]:
# Given the clear outliers toward the end of the distribution, remove these
application_train['AMT_INCOME_TOTAL'].max()
Out[19]:
117000000.0
In [20]:
_ = application_train[application_train['AMT_INCOME_TOTAL'] != 1.170000e+08]
plt.figure(figsize=(18,6))
plt.title('Distribution of AMT_INCOME_TOTAL, by Default Status - No Outliers')
sns.stripplot(x='AMT_INCOME_TOTAL', y='Default', data=_, jitter=1, palette='viridis')
plt.show()
In [21]:
plt.figure(figsize=(18,6))
plt.title("Distribution of AMT_INCOME_TOTAL - No Outliers, Individual Defaults")
sns.distplot(_[_['TARGET']==1].AMT_INCOME_TOTAL)
plt.show()
In [22]:
plt.figure(figsize=(18,6))
plt.title("Distribution of AMT_INCOME_TOTAL - No Outliers, Individual Doesn't Default")
sns.distplot(_[_['TARGET']==0].AMT_INCOME_TOTAL)
plt.show()

AMT_CREDIT

In [23]:
plt.figure(figsize=(18,6))
plt.title("Distribution of AMT_CREDIT")
sns.stripplot(x='AMT_CREDIT', y='Default', data=application_train, jitter=1, palette='viridis')
plt.show()
In [59]:
fig, ax = plt.subplots(figsize=(18,6))
labels_ = ['Paid', 'Defaulted']  # indexed by TARGET value: 0 = paid, 1 = defaulted
for target in [1,0]:
    sns.distplot(application_train[application_train['TARGET']==target].AMT_CREDIT, label=labels_[target])
    
plt.legend()
plt.show()

AMT_ANNUITY

In [28]:
plt.figure(figsize=(18,6))
plt.title("Distribution of AMT_ANNUITY")
sns.stripplot(x='AMT_ANNUITY', y='Default', data=application_train, jitter=1, palette='viridis')
plt.show()
In [60]:
fig, ax = plt.subplots(figsize=(18,10))
labels_ = ['Paid', 'Defaulted']
for target in [1,0]:
    sns.distplot(application_train[application_train['TARGET']==target].AMT_ANNUITY.dropna(), label=labels_[target])
      
plt.legend()
plt.show()

AMT_GOODS_PRICE

In [33]:
plt.figure(figsize=(18,6))
plt.title("Distribution of AMT_GOODS_PRICE")
sns.stripplot(x='AMT_GOODS_PRICE', y='Default', data=application_train, jitter=1, palette='viridis')
plt.show()
In [61]:
fig, ax = plt.subplots(figsize=(18,10))
labels_ = ['Paid', 'Defaulted']
for target in [1,0]:
    sns.distplot(application_train[application_train['TARGET']==target].AMT_GOODS_PRICE.dropna(), label=labels_[target])
    
plt.legend()
plt.show()

DAYS_EMPLOYED

In [35]:
plt.figure(figsize=(18,6))
plt.title("Distribution of DAYS_EMPLOYED")
sns.stripplot(x='DAYS_EMPLOYED', y='Default', data=application_train, jitter=1, palette='viridis')
plt.show()
In [36]:
# Given the clear outliers toward the end of the distribution, remove these
application_train['DAYS_EMPLOYED'].max()
Out[36]:
365243
In [37]:
_ = application_train[application_train['DAYS_EMPLOYED'] != 365243]
plt.figure(figsize=(18,6))
plt.title('Distribution of DAYS_EMPLOYED, by Default Status - No Outliers')
sns.stripplot(x='DAYS_EMPLOYED', y='Default', data=_, jitter=1, palette='viridis')
plt.show()

DAYS_BIRTH

In [39]:
plt.figure(figsize=(18,6))
plt.title("Distribution of DAYS_BIRTH")
sns.stripplot(x='DAYS_BIRTH', y='Default', data=application_train, jitter=1, palette='viridis')
plt.show()

EXT_SOURCE Features

While there isn't a whole lot of background provided on how these external source variables are constructed, they VERY reliably top variable importance figures, and they very clearly exhibit different distributions relative to the target variable, as the distribution plots here show. Based on some of the background from the competition, these are essentially scores from third parties used to determine the credit-worthiness of applicants. It follows that they should contain some predictive power with respect to whether or not a borrower defaults, so I suppose it is comforting (and not unexpected) that they perform so well.

In [62]:
fig, ax = plt.subplots(figsize=(18,10))
labels_ = ['Paid', 'Defaulted']
for target in [1,0]:
    sns.distplot(application_train[application_train['TARGET']==target].EXT_SOURCE_1.dropna(), label=labels_[target])
    
plt.legend()
plt.show()
In [63]:
fig, ax = plt.subplots(figsize=(18,10))
labels_ = ['Paid', 'Defaulted']
for target in [1,0]:
    sns.distplot(application_train[application_train['TARGET']==target].EXT_SOURCE_2.dropna(), label=labels_[target])
    
plt.legend()
plt.show()
In [64]:
fig, ax = plt.subplots(figsize=(18,10))
labels_ = ['Paid', 'Defaulted']
for target in [1,0]:
    sns.distplot(application_train[application_train['TARGET']==target].EXT_SOURCE_3.dropna(), label=labels_[target])
    
plt.legend()
plt.show()

I was also curious about whether or not these external scores could represent transformations of some of the other variables in our dataset. It turns out that they are not highly correlated with the other variables in the dataset (the highest being the 0.26 correlation between the first external source variable and the credit/goods price features). This further suggests that they add explanatory power above and beyond what we can gather from the other features in the dataset.

In [179]:
ext_and_amts = application_train[['AMT_INCOME_TOTAL', 'AMT_CREDIT', 'AMT_GOODS_PRICE', 'AMT_ANNUITY',
                                  'EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3', 'Default']]

plt.figure(figsize=(10,6))
sns.heatmap(ext_and_amts.drop('Default',axis=1).corr(), cmap='viridis', annot=True)
plt.title("Correlations between 'AMT' and 'EXT_SOURCE' features")
plt.show()
In [181]:
plt.figure(figsize=(10,10))
sns.pairplot(ext_and_amts.dropna(), hue='Default')
plt.show()
<matplotlib.figure.Figure at 0x1f8d361f780>

2d. Visualizing categoricals

In [209]:
sns.countplot('TARGET', hue='CODE_GENDER', data=application_train, palette='viridis')
plt.show()
In [41]:
sns.countplot('TARGET', hue='FLAG_OWN_REALTY', data=application_train, palette='viridis')
plt.show()
In [42]:
sns.countplot('TARGET', hue='FLAG_OWN_CAR', data=application_train, palette='viridis')
plt.show()
In [43]:
sns.countplot('TARGET', hue='NAME_CONTRACT_TYPE', data=application_train, palette='viridis')
plt.show()
In [44]:
plt.figure(figsize=(18,6))
sns.countplot('TARGET', hue='NAME_HOUSING_TYPE', data=application_train, palette='viridis')
plt.show()
In [45]:
plt.figure(figsize=(18,6))
sns.countplot('TARGET', hue='NAME_TYPE_SUITE', data=application_train, palette='viridis')
plt.show()
In [46]:
plt.figure(figsize=(18,6))
sns.countplot('TARGET', hue='NAME_INCOME_TYPE', data=application_train, palette='viridis')
plt.show()
In [47]:
plt.figure(figsize=(18,6))
sns.countplot('TARGET', hue='NAME_EDUCATION_TYPE', data=application_train, palette='viridis')
plt.show()
In [48]:
plt.figure(figsize=(18,6))
sns.countplot('TARGET', hue='NAME_FAMILY_STATUS', data=application_train, palette='viridis')
plt.show()
In [217]:
plt.figure(figsize=(18,6))
sns.countplot('TARGET', hue='OCCUPATION_TYPE', data=application_train.dropna(), palette='viridis')
plt.show()

3. Feature engineering/data cleaning

I identified some of the idiosyncratic outliers in the dataset through my exploratory analysis above, and found two others by trawling through other kernels in the competition. Since I did not end up using those features, I did not remove the strangely coded outliers/missing values therein. Moreover, rather than just dropping the outliers (as I have done in the application_train cut of the data), I replace them with null/NA values and then impute them, since all of the test data must be present to generate predictions that can actually be scored.

In [211]:
# Remove outliers per the findings above and determined from other kernels in the training data alone
application_train = application_train[application_train['AMT_INCOME_TOTAL'] != 1.170000e+08]
application_train = application_train[application_train['AMT_REQ_CREDIT_BUREAU_QRT'] != 261] # From other kernels
application_train = application_train[application_train['OBS_30_CNT_SOCIAL_CIRCLE'] < 300] # From other kernels
application_train['DAYS_EMPLOYED'] = (application_train['DAYS_EMPLOYED'].apply(lambda x: x if x != 365243 else np.nan))
In [349]:
# Remove outliers per the findings above and determined from other kernels in the full sample dataset
all_data['DAYS_EMPLOYED'] = (all_data['DAYS_EMPLOYED'].apply(lambda x: x if x != 365243 else np.nan))
all_data['AMT_INCOME_TOTAL'] = (all_data['AMT_INCOME_TOTAL'].apply(lambda x: x if x != 1.170000e+08 else np.nan))

# Since these feel more like outliers than encoded missing values, I replace them with the max value in the feature
all_data['AMT_INCOME_TOTAL'][all_data['AMT_INCOME_TOTAL'].isnull()] = all_data['AMT_INCOME_TOTAL'].dropna().max()
C:\Users\Jared\Anaconda3\lib\site-packages\ipykernel_launcher.py:6: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  

3a. Categorical clean-up

There are a fair number of categorical variables in the data that need to be appropriately encoded so we can use them in the machine learning models below. Binary variables are simply pre-processed with the LabelEncoder, and one-hot encoding is used for the categorical variables with more than two classes.
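For what it's worth, the multi-class encoding can also be done in a single step with pandas' get_dummies; a minimal sketch of that alternative (not what is used in the cells below, and shown here for just two of the columns) might look like:

# Hypothetical alternative: one-hot encode object columns directly with get_dummies
# (rows with NaN get all-zero indicator columns unless dummy_na=True is passed)
ohe_alt = pd.get_dummies(all_data[['CODE_GENDER', 'NAME_TYPE_SUITE']], prefix_sep='_CLASS_')
all_data_alt = pd.concat([all_data.drop(['CODE_GENDER', 'NAME_TYPE_SUITE'], axis=1), ohe_alt], axis=1)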

In [213]:
# Using LabelEncoder for our binary categoricals
from sklearn.preprocessing import LabelEncoder
In [214]:
binary_cats = ['FLAG_OWN_REALTY', 'FLAG_OWN_CAR', 'NAME_CONTRACT_TYPE']
for cat in binary_cats:
    label = LabelEncoder()
    application_train[cat] = label.fit_transform(application_train[cat])
    
print(application_train[binary_cats].info())
<class 'pandas.core.frame.DataFrame'>
Int64Index: 306487 entries, 0 to 307510
Data columns (total 3 columns):
FLAG_OWN_REALTY       306487 non-null int64
FLAG_OWN_CAR          306487 non-null int64
NAME_CONTRACT_TYPE    306487 non-null int64
dtypes: int64(3)
memory usage: 9.4 MB
None
In [350]:
for cat in binary_cats:
    label = LabelEncoder()
    all_data[cat] = label.fit_transform(all_data[cat])
    
print(all_data[binary_cats].info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 356255 entries, 0 to 356254
Data columns (total 3 columns):
FLAG_OWN_REALTY       356255 non-null int64
FLAG_OWN_CAR          356255 non-null int64
NAME_CONTRACT_TYPE    356255 non-null int64
dtypes: int64(3)
memory usage: 8.2 MB
None
In [253]:
# Using OneHotEncoder for multi-category categoricals
from sklearn.preprocessing import OneHotEncoder
In [219]:
# First impute majority class for null values in 'NAME_TYPE_SUITE'
application_train['NAME_TYPE_SUITE'][application_train['NAME_TYPE_SUITE'].isnull()] = 'Unaccompanied'
application_train['OCCUPATION_TYPE'][application_train['OCCUPATION_TYPE'].isnull()] = 'None'
C:\Users\Jared\Anaconda3\lib\site-packages\ipykernel_launcher.py:2: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  
C:\Users\Jared\Anaconda3\lib\site-packages\ipykernel_launcher.py:3: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  This is separate from the ipykernel package so we can avoid doing imports until
In [351]:
all_data['NAME_TYPE_SUITE'][all_data['NAME_TYPE_SUITE'].isnull()] = 'Unaccompanied'
all_data['OCCUPATION_TYPE'][all_data['OCCUPATION_TYPE'].isnull()] = 'None'
C:\Users\Jared\Anaconda3\lib\site-packages\ipykernel_launcher.py:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """Entry point for launching an IPython kernel.
C:\Users\Jared\Anaconda3\lib\site-packages\ipykernel_launcher.py:2: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  
In [352]:
multi_cats = ['CODE_GENDER', 'NAME_TYPE_SUITE', 'NAME_EDUCATION_TYPE', 
              'NAME_FAMILY_STATUS', 'NAME_HOUSING_TYPE', 'NAME_INCOME_TYPE',
              'ORGANIZATION_TYPE', 'OCCUPATION_TYPE']
feat_matrices = []
for cat in multi_cats:
    
    # Convert to integer
    label = LabelEncoder()
    label_encoded = label.fit_transform(all_data[cat])
    
    # Convert to multi-category with OHE
    ohe = OneHotEncoder(sparse=False)
    label_encoded = label_encoded.reshape(len(label_encoded),1)
    ohe_encoded = ohe.fit_transform(label_encoded)
    
    # Formatting to concatenate with all_data
    ohe_df = pd.DataFrame(ohe_encoded)
    n_cats = ohe_df.columns.tolist()
    cat_names = [cat + "_CLASS_" + str(i) for i in n_cats]
    ohe_df.columns = cat_names
    
    feat_matrices.append(ohe_df)
        
all_multi_encs = pd.concat(feat_matrices, axis=1)
all_multi_encs.iloc[:,1:3].head()
Out[352]:
CODE_GENDER_CLASS_1 CODE_GENDER_CLASS_2
0 1.0 0.0
1 0.0 0.0
2 1.0 0.0
3 0.0 0.0
4 1.0 0.0

3b. Converting some continuous variables to binned variants

Based on my experience with some other competitions and several iterations of training models in this one, creating coarser, binned variants of the continuous variables generally improves performance. I opted to create bins, with cuts based on quantiles, for all of the 'key' continuous variables (read: the variables that make intuitive sense to include in the model).

Again, I perform this cleaning on both the training dataset alone (to visualize the bins relative to the target variable) and on the stacked dataset. This also allows for more granular imputation of the missing values in the external source variables, which are missing a fair number of values.
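To keep the cells below easy to follow I spell the steps out feature by feature, but the repeated 'median-fill, then quantile-bin' pattern could just as well be wrapped in a small helper. A minimal sketch (hypothetical, not used below):

# Median-fill a numeric series and cut it into quantile bins labelled 0..n_buckets-1
def bucketize(series, n_buckets):
    filled = series.fillna(series.median())
    return pd.qcut(filled, n_buckets, labels=np.arange(n_buckets))

# e.g. all_data['AMT_ANNUITY_BUCKETS'] = bucketize(all_data['AMT_ANNUITY'], 5)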

AMT_ANNUITY

In [104]:
# Create a AMT_ANNUITY bucket feature
application_train['AMT_ANNUITY'][application_train['AMT_ANNUITY'].isnull()] = application_train['AMT_ANNUITY'].median()
application_train['AMT_ANNUITY_BUCKETS'] = pd.qcut(application_train['AMT_ANNUITY'], 5, labels=np.arange(0,5))

sns.countplot(application_train['AMT_ANNUITY_BUCKETS'], palette='viridis')
plt.show()
C:\Users\Jared\Anaconda3\lib\site-packages\ipykernel_launcher.py:2: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  
In [105]:
sns.countplot(x='Default', hue='AMT_ANNUITY_BUCKETS', data=application_train, palette='viridis')
plt.show()
In [353]:
all_data['AMT_ANNUITY'][all_data['AMT_ANNUITY'].isnull()] = all_data['AMT_ANNUITY'].median()
all_data['AMT_ANNUITY_BUCKETS'] = pd.qcut(all_data['AMT_ANNUITY'], 5, labels=np.arange(0,5))
C:\Users\Jared\Anaconda3\lib\site-packages\ipykernel_launcher.py:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """Entry point for launching an IPython kernel.

AMT_INCOME_TOTAL

In [106]:
# Create a AMT_INCOME_TOTAL bucket feature
application_train['AMT_INCOME_TOTAL_BUCKETS'] = pd.qcut(application_train['AMT_INCOME_TOTAL'], 7, labels=np.arange(0,7))

sns.countplot(application_train['AMT_INCOME_TOTAL_BUCKETS'], palette='viridis')
plt.show()
In [107]:
sns.countplot(x='Default', hue='AMT_INCOME_TOTAL_BUCKETS', data=application_train, palette='viridis')
plt.show()
In [354]:
all_data['AMT_INCOME_TOTAL_BUCKETS'] = pd.qcut(all_data['AMT_INCOME_TOTAL'], 7, labels=np.arange(0,7))

AMT_CREDIT

In [108]:
# Create a AMT_CREDIT bucket feature
application_train['AMT_CREDIT_BUCKETS'] = pd.qcut(application_train['AMT_CREDIT'], 4, labels=np.arange(0,4))

sns.countplot(application_train['AMT_CREDIT_BUCKETS'], palette='viridis')
plt.show()
In [109]:
sns.countplot(x='Default', hue='AMT_CREDIT_BUCKETS', data=application_train, palette='viridis')
plt.show()
In [355]:
all_data['AMT_CREDIT_BUCKETS'] = pd.qcut(all_data['AMT_CREDIT'], 4, labels=np.arange(0,4))

AMT_GOODS_PRICE

In [110]:
# Since AMT_GOODS_PRICE has missing values and its correlation with credit is really high, impute missings by the credit buckets
# Visualize AMT_GOODS_PRICE relative to the credit buckets, which do seem to separate things out somewhat
plt.figure(figsize=(6,6))
sns.boxplot(x='AMT_CREDIT_BUCKETS', y='AMT_GOODS_PRICE', data=application_train ,palette='viridis')
plt.show()
In [111]:
# Impute the missing values in AMT_GOODS_PRICE based on credit buckets
AMT_GOODS_PRICE_by_AMT_CREDIT_BUCKETS = application_train.groupby('AMT_CREDIT_BUCKETS').median()['AMT_GOODS_PRICE']
print(AMT_GOODS_PRICE_by_AMT_CREDIT_BUCKETS)

application_train['AMT_GOODS_PRICE'] = np.where((application_train['AMT_GOODS_PRICE'].isnull()) & (application_train['AMT_CREDIT_BUCKETS'] == 0), AMT_GOODS_PRICE_by_AMT_CREDIT_BUCKETS.iloc[0],
              np.where((application_train['AMT_GOODS_PRICE'].isnull()) & (application_train['AMT_CREDIT_BUCKETS'] == 1), AMT_GOODS_PRICE_by_AMT_CREDIT_BUCKETS.iloc[1],
              np.where((application_train['AMT_GOODS_PRICE'].isnull()) & (application_train['AMT_CREDIT_BUCKETS'] == 2), AMT_GOODS_PRICE_by_AMT_CREDIT_BUCKETS.iloc[2],
              np.where((application_train['AMT_GOODS_PRICE'].isnull()) & (application_train['AMT_CREDIT_BUCKETS'] == 3), AMT_GOODS_PRICE_by_AMT_CREDIT_BUCKETS.iloc[3],         
              application_train['AMT_GOODS_PRICE']))))
AMT_CREDIT_BUCKETS
0    180000.0
1    342000.0
2    585000.0
3    927000.0
Name: AMT_GOODS_PRICE, dtype: float64
In [112]:
application_train['AMT_GOODS_BUCKETS'] = pd.qcut(application_train['AMT_GOODS_PRICE'], 6, labels=np.arange(0,6))

sns.countplot(application_train['AMT_GOODS_BUCKETS'], palette='viridis')
plt.show()
In [113]:
sns.countplot('Default', hue='AMT_GOODS_BUCKETS', data=application_train, palette='viridis')
plt.show()
In [356]:
# Impute the missing values in AMT_GOODS_PRICE based on credit buckets
AMT_GOODS_PRICE_by_AMT_CREDIT_BUCKETS = all_data.groupby('AMT_CREDIT_BUCKETS').median()['AMT_GOODS_PRICE']
print(AMT_GOODS_PRICE_by_AMT_CREDIT_BUCKETS)

all_data['AMT_GOODS_PRICE'] = np.where((all_data['AMT_GOODS_PRICE'].isnull()) & (all_data['AMT_CREDIT_BUCKETS'] == 0), AMT_GOODS_PRICE_by_AMT_CREDIT_BUCKETS.iloc[0],
              np.where((all_data['AMT_GOODS_PRICE'].isnull()) & (all_data['AMT_CREDIT_BUCKETS'] == 1), AMT_GOODS_PRICE_by_AMT_CREDIT_BUCKETS.iloc[1],
              np.where((all_data['AMT_GOODS_PRICE'].isnull()) & (all_data['AMT_CREDIT_BUCKETS'] == 2), AMT_GOODS_PRICE_by_AMT_CREDIT_BUCKETS.iloc[2],
              np.where((all_data['AMT_GOODS_PRICE'].isnull()) & (all_data['AMT_CREDIT_BUCKETS'] == 3), AMT_GOODS_PRICE_by_AMT_CREDIT_BUCKETS.iloc[3],         
              all_data['AMT_GOODS_PRICE']))))

all_data['AMT_GOODS_BUCKETS'] = pd.qcut(all_data['AMT_GOODS_PRICE'], 6, labels=np.arange(0,6))
AMT_CREDIT_BUCKETS
0    180000.0
1    337500.0
2    540000.0
3    904500.0
Name: AMT_GOODS_PRICE, dtype: float64

DAYS_REGISTRATION

In [114]:
days_feats = application_train[['DAYS_EMPLOYED', 'DAYS_BIRTH', 'DAYS_REGISTRATION', 'DAYS_ID_PUBLISH']]
print(days_feats.info())

plt.figure(figsize=(10,6))
sns.heatmap(days_feats.corr(), cmap='viridis', annot=True)
plt.title("Correlations between 'DAYS' features")
plt.show()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 306487 entries, 0 to 307510
Data columns (total 4 columns):
DAYS_EMPLOYED        251285 non-null float64
DAYS_BIRTH           306487 non-null int64
DAYS_REGISTRATION    306487 non-null float64
DAYS_ID_PUBLISH      306487 non-null int64
dtypes: float64(2), int64(2)
memory usage: 21.7 MB
None
In [115]:
# Create a DAYS_REGISTRATION bucket feature
application_train['DAYS_REGISTRATION_BUCKETS'] = pd.qcut(application_train['DAYS_REGISTRATION'], 4, labels=np.arange(0,4))

sns.countplot(application_train['DAYS_REGISTRATION_BUCKETS'], palette='viridis')
plt.show()
In [116]:
sns.countplot(x='Default', hue='DAYS_REGISTRATION_BUCKETS', data=application_train, palette='viridis')
plt.show()
In [357]:
all_data['DAYS_REGISTRATION_BUCKETS'] = pd.qcut(all_data['DAYS_REGISTRATION'], 4, labels=np.arange(0,4))

DAYS_EMPLOYED

In [117]:
# Visualize DAYS_EMPLOYED relative to the registration buckets, which do seem to separate things out somewhat
plt.figure(figsize=(6,6))
sns.boxplot(x='DAYS_REGISTRATION_BUCKETS', y='DAYS_EMPLOYED', data=application_train ,palette='viridis')
plt.show()
In [118]:
# Impute the missing values in DAYS_EMPLOYED based on registration buckets
DAYS_EMPLOYED_by_DAYS_REGISTRATION_BUCKETS = application_train.groupby('DAYS_REGISTRATION_BUCKETS').median()['DAYS_EMPLOYED']
print(DAYS_EMPLOYED_by_DAYS_REGISTRATION_BUCKETS)

application_train['DAYS_EMPLOYED'] = np.where((application_train['DAYS_EMPLOYED'].isnull()) & (application_train['DAYS_REGISTRATION_BUCKETS'] == 0), DAYS_EMPLOYED_by_DAYS_REGISTRATION_BUCKETS.iloc[0],
              np.where((application_train['DAYS_EMPLOYED'].isnull()) & (application_train['DAYS_REGISTRATION_BUCKETS'] == 1), DAYS_EMPLOYED_by_DAYS_REGISTRATION_BUCKETS.iloc[1],
              np.where((application_train['DAYS_EMPLOYED'].isnull()) & (application_train['DAYS_REGISTRATION_BUCKETS'] == 2), DAYS_EMPLOYED_by_DAYS_REGISTRATION_BUCKETS.iloc[2],
              np.where((application_train['DAYS_EMPLOYED'].isnull()) & (application_train['DAYS_REGISTRATION_BUCKETS'] == 3), DAYS_EMPLOYED_by_DAYS_REGISTRATION_BUCKETS.iloc[3],         
              application_train['DAYS_EMPLOYED']))))
DAYS_REGISTRATION_BUCKETS
0   -2057.0
1   -1714.0
2   -1567.0
3   -1436.0
Name: DAYS_EMPLOYED, dtype: float64
In [119]:
# Create a DAYS_EMPLOYED bucket feature
application_train['DAYS_EMPLOYED_BUCKETS'] = pd.qcut(application_train['DAYS_EMPLOYED'], 6, labels=np.arange(0,6))

sns.countplot(application_train['DAYS_EMPLOYED_BUCKETS'], palette='viridis')
plt.show()
In [120]:
sns.countplot(x='Default', hue='DAYS_EMPLOYED_BUCKETS', data=application_train, palette='viridis')
plt.show()
In [358]:
DAYS_EMPLOYED_by_DAYS_REGISTRATION_BUCKETS = all_data.groupby('DAYS_REGISTRATION_BUCKETS').median()['DAYS_EMPLOYED']
print(DAYS_EMPLOYED_by_DAYS_REGISTRATION_BUCKETS)

all_data['DAYS_EMPLOYED'] = np.where((all_data['DAYS_EMPLOYED'].isnull()) & (all_data['DAYS_REGISTRATION_BUCKETS'] == 0), DAYS_EMPLOYED_by_DAYS_REGISTRATION_BUCKETS.iloc[0],
              np.where((all_data['DAYS_EMPLOYED'].isnull()) & (all_data['DAYS_REGISTRATION_BUCKETS'] == 1), DAYS_EMPLOYED_by_DAYS_REGISTRATION_BUCKETS.iloc[1],
              np.where((all_data['DAYS_EMPLOYED'].isnull()) & (all_data['DAYS_REGISTRATION_BUCKETS'] == 2), DAYS_EMPLOYED_by_DAYS_REGISTRATION_BUCKETS.iloc[2],
              np.where((all_data['DAYS_EMPLOYED'].isnull()) & (all_data['DAYS_REGISTRATION_BUCKETS'] == 3), DAYS_EMPLOYED_by_DAYS_REGISTRATION_BUCKETS.iloc[3],         
              all_data['DAYS_EMPLOYED']))))

all_data['DAYS_EMPLOYED_BUCKETS'] = pd.qcut(all_data['DAYS_EMPLOYED'], 6, labels=np.arange(0,6))
DAYS_REGISTRATION_BUCKETS
0   -2082.0
1   -1726.0
2   -1579.0
3   -1451.0
Name: DAYS_EMPLOYED, dtype: float64

DAYS_ID_PUBLISH

In [121]:
# Create a DAYS_ID_PUBLISH bucket feature
application_train['DAYS_ID_PUBLISH_BUCKETS'] = pd.qcut(application_train['DAYS_ID_PUBLISH'], 12, labels=np.arange(0,12))

sns.countplot(application_train['DAYS_ID_PUBLISH_BUCKETS'], palette='viridis')
plt.show()
In [122]:
sns.countplot(x='Default', hue='DAYS_ID_PUBLISH_BUCKETS', data=application_train, palette='viridis')
plt.show()
In [359]:
all_data['DAYS_ID_PUBLISH_BUCKETS'] = pd.qcut(all_data['DAYS_ID_PUBLISH'], 12, labels=np.arange(0,12))
In [360]:
# Encode buckets to integers
type_series = all_data.dtypes == 'category'
cat_feats = type_series[type_series].index.tolist()

for cat in cat_feats:
    label = LabelEncoder()
    all_data[cat] = label.fit_transform(all_data[cat])

3c. External Source Variables

As I mentioned briefly above, these variables seem exceptionally valuable in predicting home credit default. Since they are missing values fairly extensively, I opted to impute the missing values based on the corresponding key variables that I binned in section 3b, choosing the imputation variable based on its correlation with the external source variable in question.
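As a side note, the nested np.where blocks in the cells that follow could be written more compactly with a groupby transform; a minimal sketch for EXT_SOURCE_1 that is equivalent in spirit:

# Fill each missing EXT_SOURCE_1 with the median of its AMT_CREDIT_BUCKETS group
bucket_medians = all_data.groupby('AMT_CREDIT_BUCKETS')['EXT_SOURCE_1'].transform('median')
ext_source_1_filled = all_data['EXT_SOURCE_1'].fillna(bucket_medians)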

EXT_SOURCE_1

In [124]:
# Visualize the external source variable relative to the credit buckets, which do seem to separate things out somewhat
plt.figure(figsize=(6,6))
sns.boxplot(x='AMT_CREDIT_BUCKETS', y='EXT_SOURCE_1', data=application_train ,palette='viridis')
plt.show()
In [125]:
# Impute the missing values in EXT_SOURCE_1 based on credit buckets
EXT_SOURCE_1_by_AMT_CREDIT_BUCKETS = application_train.groupby('AMT_CREDIT_BUCKETS').median()['EXT_SOURCE_1']
print(EXT_SOURCE_1_by_AMT_CREDIT_BUCKETS)
AMT_CREDIT_BUCKETS
0    0.454348
1    0.474239
2    0.514458
3    0.567700
Name: EXT_SOURCE_1, dtype: float64
In [126]:
application_train['EXT_SOURCE_1'] = np.where((application_train['EXT_SOURCE_1'].isnull()) & (application_train['AMT_CREDIT_BUCKETS'] == 0), EXT_SOURCE_1_by_AMT_CREDIT_BUCKETS.iloc[0],
              np.where((application_train['EXT_SOURCE_1'].isnull()) & (application_train['AMT_CREDIT_BUCKETS'] == 1), EXT_SOURCE_1_by_AMT_CREDIT_BUCKETS.iloc[1],
              np.where((application_train['EXT_SOURCE_1'].isnull()) & (application_train['AMT_CREDIT_BUCKETS'] == 2), EXT_SOURCE_1_by_AMT_CREDIT_BUCKETS.iloc[2],
              np.where((application_train['EXT_SOURCE_1'].isnull()) & (application_train['AMT_CREDIT_BUCKETS'] == 3), EXT_SOURCE_1_by_AMT_CREDIT_BUCKETS.iloc[3],         
              application_train['EXT_SOURCE_1']))))
In [361]:
EXT_SOURCE_1_by_AMT_CREDIT_BUCKETS = all_data.groupby('AMT_CREDIT_BUCKETS').median()['EXT_SOURCE_1']
print(EXT_SOURCE_1_by_AMT_CREDIT_BUCKETS)

all_data['EXT_SOURCE_1'] = np.where((all_data['EXT_SOURCE_1'].isnull()) & (all_data['AMT_CREDIT_BUCKETS'] == 0), EXT_SOURCE_1_by_AMT_CREDIT_BUCKETS.iloc[0],
              np.where((all_data['EXT_SOURCE_1'].isnull()) & (all_data['AMT_CREDIT_BUCKETS'] == 1), EXT_SOURCE_1_by_AMT_CREDIT_BUCKETS.iloc[1],
              np.where((all_data['EXT_SOURCE_1'].isnull()) & (all_data['AMT_CREDIT_BUCKETS'] == 2), EXT_SOURCE_1_by_AMT_CREDIT_BUCKETS.iloc[2],
              np.where((all_data['EXT_SOURCE_1'].isnull()) & (all_data['AMT_CREDIT_BUCKETS'] == 3), EXT_SOURCE_1_by_AMT_CREDIT_BUCKETS.iloc[3],         
              all_data['EXT_SOURCE_1']))))
AMT_CREDIT_BUCKETS
0    0.457615
1    0.474126
2    0.510874
3    0.570275
Name: EXT_SOURCE_1, dtype: float64

EXT_SOURCE_2

In [127]:
# Visualize the external source variable relative to the income buckets, which do seem to separate things out somewhat
plt.figure(figsize=(6,6))
sns.boxplot(x='AMT_INCOME_TOTAL_BUCKETS', y='EXT_SOURCE_2', data=application_train ,palette='viridis')
plt.show()
In [128]:
# Impute the missing values in EXT_SOURCE_2 based on INCOME_TOTAL buckets
EXT_SOURCE_2_by_AMT_INCOME_TOTAL_BUCKETS = application_train.groupby('AMT_INCOME_TOTAL_BUCKETS').median()['EXT_SOURCE_2']
print(EXT_SOURCE_2_by_AMT_INCOME_TOTAL_BUCKETS)
AMT_INCOME_TOTAL_BUCKETS
0    0.516571
1    0.537897
2    0.554476
3    0.563728
4    0.577193
5    0.591770
6    0.623135
Name: EXT_SOURCE_2, dtype: float64
In [129]:
application_train['EXT_SOURCE_2'] = np.where((application_train['EXT_SOURCE_2'].isnull()) & (application_train['AMT_INCOME_TOTAL_BUCKETS'] == 0), EXT_SOURCE_2_by_AMT_INCOME_TOTAL_BUCKETS.iloc[0],
              np.where((application_train['EXT_SOURCE_2'].isnull()) & (application_train['AMT_INCOME_TOTAL_BUCKETS'] == 1), EXT_SOURCE_2_by_AMT_INCOME_TOTAL_BUCKETS.iloc[1],
              np.where((application_train['EXT_SOURCE_2'].isnull()) & (application_train['AMT_INCOME_TOTAL_BUCKETS'] == 2), EXT_SOURCE_2_by_AMT_INCOME_TOTAL_BUCKETS.iloc[2],
              np.where((application_train['EXT_SOURCE_2'].isnull()) & (application_train['AMT_INCOME_TOTAL_BUCKETS'] == 3), EXT_SOURCE_2_by_AMT_INCOME_TOTAL_BUCKETS.iloc[3],         
              np.where((application_train['EXT_SOURCE_2'].isnull()) & (application_train['AMT_INCOME_TOTAL_BUCKETS'] == 4), EXT_SOURCE_2_by_AMT_INCOME_TOTAL_BUCKETS.iloc[4],
              np.where((application_train['EXT_SOURCE_2'].isnull()) & (application_train['AMT_INCOME_TOTAL_BUCKETS'] == 5), EXT_SOURCE_2_by_AMT_INCOME_TOTAL_BUCKETS.iloc[5],
              np.where((application_train['EXT_SOURCE_2'].isnull()) & (application_train['AMT_INCOME_TOTAL_BUCKETS'] == 6), EXT_SOURCE_2_by_AMT_INCOME_TOTAL_BUCKETS.iloc[6],  
              application_train['EXT_SOURCE_2'])))))))
In [362]:
# Impute the missing values in EXT_SOURCE_2 based on INCOME_TOTAL buckets
EXT_SOURCE_2_by_AMT_INCOME_TOTAL_BUCKETS = all_data.groupby('AMT_INCOME_TOTAL_BUCKETS').median()['EXT_SOURCE_2']
print(EXT_SOURCE_2_by_AMT_INCOME_TOTAL_BUCKETS)

all_data['EXT_SOURCE_2'] = np.where((all_data['EXT_SOURCE_2'].isnull()) & (all_data['AMT_INCOME_TOTAL_BUCKETS'] == 0), EXT_SOURCE_2_by_AMT_INCOME_TOTAL_BUCKETS.iloc[0],
              np.where((all_data['EXT_SOURCE_2'].isnull()) & (all_data['AMT_INCOME_TOTAL_BUCKETS'] == 1), EXT_SOURCE_2_by_AMT_INCOME_TOTAL_BUCKETS.iloc[1],
              np.where((all_data['EXT_SOURCE_2'].isnull()) & (all_data['AMT_INCOME_TOTAL_BUCKETS'] == 2), EXT_SOURCE_2_by_AMT_INCOME_TOTAL_BUCKETS.iloc[2],
              np.where((all_data['EXT_SOURCE_2'].isnull()) & (all_data['AMT_INCOME_TOTAL_BUCKETS'] == 3), EXT_SOURCE_2_by_AMT_INCOME_TOTAL_BUCKETS.iloc[3],         
              np.where((all_data['EXT_SOURCE_2'].isnull()) & (all_data['AMT_INCOME_TOTAL_BUCKETS'] == 4), EXT_SOURCE_2_by_AMT_INCOME_TOTAL_BUCKETS.iloc[4],
              np.where((all_data['EXT_SOURCE_2'].isnull()) & (all_data['AMT_INCOME_TOTAL_BUCKETS'] == 5), EXT_SOURCE_2_by_AMT_INCOME_TOTAL_BUCKETS.iloc[5],
              np.where((all_data['EXT_SOURCE_2'].isnull()) & (all_data['AMT_INCOME_TOTAL_BUCKETS'] == 6), EXT_SOURCE_2_by_AMT_INCOME_TOTAL_BUCKETS.iloc[6],  
              all_data['EXT_SOURCE_2'])))))))
AMT_INCOME_TOTAL_BUCKETS
0    0.516462
1    0.536108
2    0.552568
3    0.562507
4    0.576656
5    0.595771
6    0.622017
Name: EXT_SOURCE_2, dtype: float64

EXT_SOURCE_3

As the correlation heatmap in the exploratory data analysis would suggest, the third external source variable is not well-correlated with any of the other 'key' variables in the data, so I opted for simple median-imputation here.

In [130]:
# EXT_SOURCE_3 is poorly correlated with all of the other AMT variables, so we will apply median imputing here
application_train['EXT_SOURCE_3'][application_train['EXT_SOURCE_3'].isnull()] = application_train['EXT_SOURCE_3'].median()
C:\Users\Jared\Anaconda3\lib\site-packages\ipykernel_launcher.py:2: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  
In [363]:
all_data['EXT_SOURCE_3'][all_data['EXT_SOURCE_3'].isnull()] = all_data['EXT_SOURCE_3'].median()
C:\Users\Jared\Anaconda3\lib\site-packages\ipykernel_launcher.py:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """Entry point for launching an IPython kernel.

Capturing non-linearities

Considering how powerful the external source variables are in model training, I tried to exploit some non-linear combinations/interactions of them to see whether any additional information could be gleaned from these kinds of transformations.

In [131]:
application_train['EXT_SOURCE_1OVER2'] = application_train['EXT_SOURCE_1']/application_train['EXT_SOURCE_2']
application_train['EXT_SOURCE_1OVER3'] = application_train['EXT_SOURCE_1']/application_train['EXT_SOURCE_3']
application_train['EXT_SOURCE_2OVER1'] = application_train['EXT_SOURCE_2']/application_train['EXT_SOURCE_1']
application_train['EXT_SOURCE_2OVER3'] = application_train['EXT_SOURCE_2']/application_train['EXT_SOURCE_3']
application_train['EXT_SOURCE_3OVER2'] = application_train['EXT_SOURCE_3']/application_train['EXT_SOURCE_2']
application_train['EXT_SOURCE_3OVER1'] = application_train['EXT_SOURCE_3']/application_train['EXT_SOURCE_1']
In [364]:
all_data['EXT_SOURCE_1OVER2'] = all_data['EXT_SOURCE_1']/all_data['EXT_SOURCE_2']
all_data['EXT_SOURCE_1OVER3'] = all_data['EXT_SOURCE_1']/all_data['EXT_SOURCE_3']
all_data['EXT_SOURCE_2OVER1'] = all_data['EXT_SOURCE_2']/all_data['EXT_SOURCE_1']
all_data['EXT_SOURCE_2OVER3'] = all_data['EXT_SOURCE_2']/all_data['EXT_SOURCE_3']
all_data['EXT_SOURCE_3OVER2'] = all_data['EXT_SOURCE_3']/all_data['EXT_SOURCE_2']
all_data['EXT_SOURCE_3OVER1'] = all_data['EXT_SOURCE_3']/all_data['EXT_SOURCE_1']

3d. Inquiries and domain-specific ratios

In [132]:
# Create a total inquiries feature
application_train['NUM_INQ_TOT'] = application_train[['AMT_REQ_CREDIT_BUREAU_HOUR',
       'AMT_REQ_CREDIT_BUREAU_DAY', 'AMT_REQ_CREDIT_BUREAU_WEEK',
       'AMT_REQ_CREDIT_BUREAU_MON', 'AMT_REQ_CREDIT_BUREAU_QRT',
       'AMT_REQ_CREDIT_BUREAU_YEAR']].sum(axis=1)
In [133]:
plt.figure(figsize=(18,6))
plt.title('Distribution of NUM_INQ_TOT')
sns.stripplot(x='NUM_INQ_TOT', y='Default', data=application_train, jitter=1, palette='viridis')
plt.show()
In [145]:
# Create a NUM_INQ_TOT bucket feature
application_train['NUM_INQ_TOT_BUCKETS'] = pd.qcut(application_train['NUM_INQ_TOT'], 3, labels=np.arange(0,3))

label = LabelEncoder()
application_train['NUM_INQ_TOT_BUCKETS'] = label.fit_transform(application_train['NUM_INQ_TOT_BUCKETS'])

sns.countplot(application_train['NUM_INQ_TOT_BUCKETS'], palette='viridis')
plt.show()
In [135]:
sns.countplot(x='Default', hue='NUM_INQ_TOT_BUCKETS', data=application_train, palette='viridis')
plt.show()
In [365]:
all_data['NUM_INQ_TOT'] = all_data[['AMT_REQ_CREDIT_BUREAU_HOUR',
       'AMT_REQ_CREDIT_BUREAU_DAY', 'AMT_REQ_CREDIT_BUREAU_WEEK',
       'AMT_REQ_CREDIT_BUREAU_MON', 'AMT_REQ_CREDIT_BUREAU_QRT',
       'AMT_REQ_CREDIT_BUREAU_YEAR']].sum(axis=1)

all_data['NUM_INQ_TOT_BUCKETS'] = pd.qcut(all_data['NUM_INQ_TOT'], 3, labels=np.arange(0,3))

label = LabelEncoder()
all_data['NUM_INQ_TOT_BUCKETS'] = label.fit_transform(all_data['NUM_INQ_TOT_BUCKETS'])

sns.countplot(all_data['NUM_INQ_TOT_BUCKETS'], palette='viridis')
plt.show()
In [136]:
application_train['INC_CRED_RATIO'] = application_train['AMT_CREDIT']/application_train['AMT_INCOME_TOTAL']

These ratio variables, which rank quite highly in variable importance when training models, are motivated more by intuition about the relationships they capture than by simple curiosity about the value of transformations, as with the non-linear external source features above.

I would contend that these ratios provide information above and beyond what the underlying variables provide on their own:

1) The size of the loan relative to the borrower's income should be informative about the likelihood of default. A large loan isn't necessarily indicative of a higher default rate if the borrower's income is far greater than the loan size, so this ratio likely provides valuable information for the model.

2) A short employment history is likely more telling in relation to how old the borrower is. Older borrowers with poor or shallow employment histories are presumably less likely to repay loans, so the ratio of days employed to the borrower's age is a sensible feature for our model.

3) The size of the annuity on the loan (i.e. the payments) relative to both the size of the loan itself and the individual's income is likely highly indicative of whether an individual will repay that loan. Large loans with small annuities are likely paid down far more often than loans that require extremely high payments.

In [440]:
all_data['INC_CRED_RATIO'] = all_data['AMT_CREDIT']/all_data['AMT_INCOME_TOTAL']
all_data['EMP_AGE_RATIO'] = all_data['DAYS_EMPLOYED']/all_data['DAYS_BIRTH']
all_data['INC_ANN_RATIO'] = all_data['AMT_ANNUITY']/all_data['AMT_INCOME_TOTAL']
all_data['CRED_ANN_RATIO'] = all_data['AMT_ANNUITY']/all_data['AMT_CREDIT']

4. Assemble dataset

Now that all of the pre-processing is complete, the models must be trained only on the training data, where target values are observed. I separate the stacked data back out here and proceed with the two datasets separately.

In [441]:
all_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 356255 entries, 0 to 356254
Columns: 142 entries, index to CRED_ANN_RATIO
dtypes: float64(78), int64(51), object(13)
memory usage: 386.0+ MB
In [442]:
train_ = all_data[:t_obs]
test_ = all_data[t_obs:]
In [443]:
train_.reset_index(inplace=True)
test_.reset_index(inplace=True)
In [444]:
train_.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 307511 entries, 0 to 307510
Columns: 143 entries, level_0 to CRED_ANN_RATIO
dtypes: float64(78), int64(52), object(13)
memory usage: 335.5+ MB
In [445]:
test_.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48744 entries, 0 to 48743
Columns: 143 entries, level_0 to CRED_ANN_RATIO
dtypes: float64(78), int64(52), object(13)
memory usage: 53.2+ MB

4a. Feature selection

Several iterations of this notebook and the models below led to this selection of features for model fitting. Given the variables I had to work with, both in the initial dataset and as a result of feature engineering, this selection seemed to perform the best with the models I chose to work with.
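For reference, the variable importances I mention in section 5 (from quick Gradient Boosting fits) informed how I narrowed this list down. A minimal sketch of that kind of check, with an illustrative candidate list rather than the exact one I iterated over:

# Fit a quick gradient boosting model and rank candidate features by importance
# (slow on the full training set; a random subsample would do for a quick read)
from sklearn.ensemble import GradientBoostingClassifier

candidate_feats = ['EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3', 'DAYS_BIRTH', 'AMT_CREDIT', 'AMT_INCOME_TOTAL']
gbc_check = GradientBoostingClassifier()
gbc_check.fit(train_[candidate_feats], train_['TARGET'])

importances = pd.Series(gbc_check.feature_importances_, index=candidate_feats)
print(importances.sort_values(ascending=False))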

In [548]:
# Select features for training
X_full = pd.concat([train_[['TARGET',
                            'EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3', 
                            'AMT_GOODS_BUCKETS', 'AMT_ANNUITY_BUCKETS', 'AMT_CREDIT_BUCKETS', 'AMT_INCOME_TOTAL_BUCKETS',
                            'DAYS_EMPLOYED_BUCKETS', 'DAYS_REGISTRATION_BUCKETS', 'DAYS_ID_PUBLISH_BUCKETS',
                            'DAYS_BIRTH',
                            'NUM_INQ_TOT_BUCKETS', 
                            'INC_CRED_RATIO', 'EMP_AGE_RATIO', 'INC_ANN_RATIO', 'CRED_ANN_RATIO',
                            'REGION_RATING_CLIENT', 'REGION_POPULATION_RELATIVE',
                            'EXT_SOURCE_1OVER2', 'EXT_SOURCE_1OVER3', 'EXT_SOURCE_2OVER1',
                            'EXT_SOURCE_2OVER3', 'EXT_SOURCE_3OVER2', 'EXT_SOURCE_3OVER1',
                          ]],
                   all_multi_encs.iloc[:t_obs,0:3]], axis=1)

4b. Downsampling

Given the class imbalance in this dataset, one additional source of tuning I used in fitting my models was the downsampling procedure detailed below. With so few positive classes in the target variable, I found that my models were predicting 0 (i.e. did not default) for almost every borrower. Since there are so few 1's (i.e. defaults), a dummy predictor that assigns the negative class to every observation will still score high in terms of accuracy (roughly 92%, based on the imbalance we found in the EDA above). In order to focus the models more acutely on the positive class, I opted to downsample the data by randomly selecting a given number of rows from the negative class to establish a user-specified ratio of negative to positive classes. It's easy to get carried away with this, since we'd still like to represent the underlying distribution of defaults in the population, so downsampling to something like 50/50 would be inappropriate. Through my iterations of fitting models on the data and submitting to the public leaderboard, I found that a 20/80 split seemed to perform better than other downsampled training sets.

Given this target balancing, the block of code below randomly selects enough observations from the negative-class portion of the data that, in the final training set, 80% of rows come from the negative class and the original positive-class data make up the remaining 20%. In terms of evaluation, I did not focus on accuracy, and instead paid closer attention to confusion matrices and AUC, since these are far more reliable indicators of how the models are actually doing.
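The next cell is the implementation I actually ran; an equivalent, more compact sketch of the same rebalancing using pandas' sample is shown here for illustration (note that np.random.choice samples with replacement by default, whereas sample(replace=False) draws unique negative rows):

# Hypothetical compact version of the downsampling below
target_balance = 0.20
pos = X_full[X_full['TARGET'] == 1]
n_select_neg = int(len(pos) * (1/target_balance - 1))
neg = X_full[X_full['TARGET'] == 0].sample(n=n_select_neg, replace=False, random_state=101)  # random_state is an arbitrary choice
downsampled = pd.concat([pos, neg]).sort_index()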

In [549]:
# Specify original index numbers
X_full['orig_index'] = X_full.index

# Separate out positive and negative classes to rebalance negative class
pos_class_sample = X_full[X_full['TARGET']==1]
neg_class_sample = X_full[X_full['TARGET']==0]

pos_class_sample.reset_index(inplace=True)
n_pos = pos_class_sample.shape[0]
print(n_pos)

neg_class_sample.reset_index(inplace=True)

target_balance = 0.20
balance_factor = (1/target_balance)-1
n_select_neg = int(n_pos*balance_factor)
print(n_select_neg)

n_rows = np.arange(neg_class_sample.shape[0])
neg_idx = np.random.choice(n_rows, n_select_neg)

neg_class_sample = neg_class_sample.iloc[neg_idx]

sampled_stack = pd.concat([pos_class_sample, neg_class_sample], axis=0)
X_sampled = sampled_stack.sort_values('orig_index').drop('orig_index', axis=1).reset_index()

X_ = X_sampled.drop('TARGET', axis=1)
y = X_sampled['TARGET']
24825
99300
In [546]:
X_.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 99300 entries, 0 to 99299
Data columns (total 29 columns):
level_0                       99300 non-null int64
index                         99300 non-null int64
EXT_SOURCE_1                  99300 non-null float64
EXT_SOURCE_2                  99300 non-null float64
EXT_SOURCE_3                  99300 non-null float64
AMT_GOODS_BUCKETS             99300 non-null int64
AMT_ANNUITY_BUCKETS           99300 non-null int64
AMT_CREDIT_BUCKETS            99300 non-null int64
AMT_INCOME_TOTAL_BUCKETS      99300 non-null int64
DAYS_EMPLOYED_BUCKETS         99300 non-null int64
DAYS_REGISTRATION_BUCKETS     99300 non-null int64
DAYS_ID_PUBLISH_BUCKETS       99300 non-null int64
DAYS_BIRTH                    99300 non-null int64
NUM_INQ_TOT_BUCKETS           99300 non-null int64
INC_CRED_RATIO                99300 non-null float64
EMP_AGE_RATIO                 99300 non-null float64
INC_ANN_RATIO                 99300 non-null float64
CRED_ANN_RATIO                99300 non-null float64
REGION_RATING_CLIENT          99300 non-null int64
REGION_POPULATION_RELATIVE    99300 non-null float64
EXT_SOURCE_1OVER2             99300 non-null float64
EXT_SOURCE_1OVER3             99300 non-null float64
EXT_SOURCE_2OVER1             99300 non-null float64
EXT_SOURCE_2OVER3             99300 non-null float64
EXT_SOURCE_3OVER2             99300 non-null float64
EXT_SOURCE_3OVER1             99300 non-null float64
CODE_GENDER_CLASS_0           99300 non-null float64
CODE_GENDER_CLASS_1           99300 non-null float64
CODE_GENDER_CLASS_2           99300 non-null float64
dtypes: float64(17), int64(12)
memory usage: 22.0 MB
In [550]:
# Scale features prior to estimation
from sklearn.preprocessing import StandardScaler

X_.drop(['index', 'level_0'], axis=1, inplace=True)

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_)
X = pd.DataFrame(X_scaled)
X.columns  = X_.columns

X.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 124125 entries, 0 to 124124
Data columns (total 27 columns):
EXT_SOURCE_1                  124125 non-null float64
EXT_SOURCE_2                  124125 non-null float64
EXT_SOURCE_3                  124125 non-null float64
AMT_GOODS_BUCKETS             124125 non-null float64
AMT_ANNUITY_BUCKETS           124125 non-null float64
AMT_CREDIT_BUCKETS            124125 non-null float64
AMT_INCOME_TOTAL_BUCKETS      124125 non-null float64
DAYS_EMPLOYED_BUCKETS         124125 non-null float64
DAYS_REGISTRATION_BUCKETS     124125 non-null float64
DAYS_ID_PUBLISH_BUCKETS       124125 non-null float64
DAYS_BIRTH                    124125 non-null float64
NUM_INQ_TOT_BUCKETS           124125 non-null float64
INC_CRED_RATIO                124125 non-null float64
EMP_AGE_RATIO                 124125 non-null float64
INC_ANN_RATIO                 124125 non-null float64
CRED_ANN_RATIO                124125 non-null float64
REGION_RATING_CLIENT          124125 non-null float64
REGION_POPULATION_RELATIVE    124125 non-null float64
EXT_SOURCE_1OVER2             124125 non-null float64
EXT_SOURCE_1OVER3             124125 non-null float64
EXT_SOURCE_2OVER1             124125 non-null float64
EXT_SOURCE_2OVER3             124125 non-null float64
EXT_SOURCE_3OVER2             124125 non-null float64
EXT_SOURCE_3OVER1             124125 non-null float64
CODE_GENDER_CLASS_0           124125 non-null float64
CODE_GENDER_CLASS_1           124125 non-null float64
CODE_GENDER_CLASS_2           124125 non-null float64
dtypes: float64(27)
memory usage: 25.6 MB

5. Modelling

Admittedly, I spent much, much more time in this competition/notebook pre-processing and engineering features than modelling. One reason for this: my laptop had a fair amount of difficulty fitting models on this dataset. While the downsampling helped considerably, it still took some time for my equipment to fit models, so I wasn't able to iterate through different tuning parameters as rapidly as I would have liked (which further necessitated the use of the LightGBM model below). To get a sense of a baseline, I used k-Nearest Neighbors, and to quickly get a sense of variable importances, I used Gradient Boosting as I worked through feature selection and engineering.

In [535]:
# Split the sampled training data into train and test sets
from sklearn.model_selection import train_test_split, cross_validate
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=101)
In [451]:
from sklearn.metrics import confusion_matrix, classification_report

5a. kNN

In [185]:
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier()

knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)

##_ = cross_validate(knn, X_train, y_train,cv=5)['test_score']
##print("% Mean Accuracy: {}".format(np.array(_).mean()))

print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
[[30572  2106]
 [ 6769  1454]]
             precision    recall  f1-score   support

          0       0.82      0.94      0.87     32678
          1       0.41      0.18      0.25      8223

avg / total       0.74      0.78      0.75     40901
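Since accuracy is uninformative here, AUC gives a better read on how well these predictions rank borrowers by risk. As a quick sketch (not run in the original notebook), it could be computed for the kNN fit above from the predicted probabilities of the positive class:

from sklearn.metrics import roc_auc_score

# Hypothetical check: AUC from the positive-class probabilities, not the hard labels
knn_probs = knn.predict_proba(X_test)[:, 1]
print('kNN validation AUC: {:.4f}'.format(roc_auc_score(y_test, knn_probs)))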

5b. GBC

In [474]:
from sklearn.ensemble import GradientBoostingClassifier
gbc = GradientBoostingClassifier()

gbc.fit(X_train, y_train)
y_pred = gbc.predict(X_test)

print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

for row in sorted(list(zip(X_train.columns.tolist(), gbc.feature_importances_.tolist())), key=lambda x:x[1], reverse=True):
    print(row)
[[32007   756]
 [ 7019  1180]]
             precision    recall  f1-score   support

        0.0       0.82      0.98      0.89     32763
        1.0       0.61      0.14      0.23      8199

avg / total       0.78      0.81      0.76     40962

('CRED_ANN_RATIO', 0.1941860218030331)
('EXT_SOURCE_3', 0.14616254049109526)
('EXT_SOURCE_2', 0.13947702906532883)
('EXT_SOURCE_1', 0.1168937939973662)
('DAYS_BIRTH', 0.08452613383778759)
('EMP_AGE_RATIO', 0.04858924882032566)
('AMT_GOODS_BUCKETS', 0.04586961726222063)
('INC_ANN_RATIO', 0.03559809104919816)
('AMT_ANNUITY_BUCKETS', 0.031225239234614937)
('DAYS_ID_PUBLISH_BUCKETS', 0.029607255452156304)
('INC_CRED_RATIO', 0.021504342562178192)
('CODE_GENDER_CLASS_1', 0.021120362062858132)
('DAYS_EMPLOYED_BUCKETS', 0.019640399918516054)
('AMT_CREDIT_BUCKETS', 0.019057042594577734)
('REGION_RATING_CLIENT', 0.014095489068423648)
('REGION_POPULATION_RELATIVE', 0.010437371699728873)
('AMT_INCOME_TOTAL_BUCKETS', 0.009766458309302216)
('CODE_GENDER_CLASS_0', 0.004523250537517899)
('NUM_INQ_TOT_BUCKETS', 0.00410574314168401)
('DAYS_REGISTRATION_BUCKETS', 0.003614569092086684)
('CODE_GENDER_CLASS_2', 0.0)
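The importance ranking is easier to scan as a plot. A minimal sketch, assuming matplotlib and pandas are available under the aliases used elsewhere in this notebook (this plot was not part of my original workflow):

# Hypothetical: horizontal bar plot of the GBC feature importances printed above
importances = pd.Series(gbc.feature_importances_, index=X_train.columns).sort_values()
plt.figure(figsize=(8, 6))
importances.plot(kind='barh')
plt.xlabel('Feature importance')
plt.tight_layout()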

Prepare test dataset

In [551]:
# Select the same features used in training, this time from the test data
X_totest = pd.concat([test_[['EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3', 
                             'AMT_GOODS_BUCKETS', 'AMT_ANNUITY_BUCKETS', 'AMT_CREDIT_BUCKETS', 'AMT_INCOME_TOTAL_BUCKETS',
                             'DAYS_EMPLOYED_BUCKETS', 'DAYS_REGISTRATION_BUCKETS', 'DAYS_ID_PUBLISH_BUCKETS',
                             'DAYS_BIRTH',
                             'NUM_INQ_TOT_BUCKETS', 
                             'INC_CRED_RATIO', 'EMP_AGE_RATIO', 'INC_ANN_RATIO', 'CRED_ANN_RATIO',
                             'REGION_RATING_CLIENT', 'REGION_POPULATION_RELATIVE',
                             'EXT_SOURCE_1OVER2', 'EXT_SOURCE_1OVER3', 'EXT_SOURCE_2OVER1',
                             'EXT_SOURCE_2OVER3', 'EXT_SOURCE_3OVER2', 'EXT_SOURCE_3OVER1',
                          ]], all_multi_encs.iloc[t_obs:,0:3].reset_index().drop('index',axis=1)], axis=1)
In [552]:
# Scale features prior to estimation
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_totest)
X_test = pd.DataFrame(X_scaled)
X_test.columns  = X_totest.columns
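One caveat worth flagging: the scaler here is re-fit on the test features, so the test set is standardized with its own statistics rather than those of the training data. A minimal sketch of the alternative, assuming the scaler fitted on the training features had been kept under a separate (hypothetical) name such as train_scaler:

# Hypothetical: reuse the training-set scaler so test features share the same scale
X_test = pd.DataFrame(train_scaler.transform(X_totest), columns=X_totest.columns)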

5c. LightGBM

Getting LightGBM up and running for this competition was a breakthrough, and I credit the fantastic individual who put together this kernel for helping me get it working.

In [486]:
import lightgbm as lgb
In [553]:
# Create more usefully named variants of the existing feature DataFrames
features = X
labels = y
test_features = X_test
In [554]:
# Convert to NumPy arrays - store feature names
feature_names = features.columns.tolist()
features = np.array(features)
test_features = np.array(test_features)
In [555]:
from sklearn.model_selection import KFold

# Create the kfold object
k_fold = KFold(n_splits = 5, shuffle = True, random_state = 101)
In [556]:
# Empty array for feature importances
feature_importance_values = np.zeros(len(feature_names))
    
# Empty array for test predictions
test_predictions = np.zeros(test_features.shape[0])
    
# Empty array for out of fold validation predictions
out_of_fold = np.zeros(features.shape[0])
    
# Lists for recording validation and training scores
valid_scores = []
train_scores = []
In [557]:
# Iterate through each fold
for train_indices, valid_indices in k_fold.split(features):
        
    # Training data for the fold
    train_features, train_labels = features[train_indices], labels[train_indices]
    #Validation data for the fold
    valid_features, valid_labels = features[valid_indices], labels[valid_indices]
    
    # Create the bst
    bst = lgb.LGBMClassifier(n_estimators=10000, objective = 'binary', 
                            class_weight = 'balanced', learning_rate = 0.05, 
                            reg_alpha = 0.1, reg_lambda = 0.1, 
                            subsample = 0.8, random_state = 101)
        
    # Train the bst
    bst.fit(train_features, train_labels, eval_metric = 'auc',
            eval_set = [(valid_features, valid_labels), (train_features, train_labels)],
            eval_names = ['valid', 'train'],
            early_stopping_rounds = 100, verbose = 200)
    
    # Record the best iteration
    best_iteration = bst.best_iteration_
        
    # Record the feature importances
    feature_importance_values += bst.feature_importances_ / k_fold.n_splits
        
    # Make predictions
    test_predictions += bst.predict_proba(test_features, num_iteration = best_iteration)[:, 1] / k_fold.n_splits
        
    # Record the out of fold predictions
    out_of_fold[valid_indices] = bst.predict_proba(valid_features, num_iteration = best_iteration)[:, 1]
        
    # Record the best score
    valid_score = bst.best_score_['valid']['auc']
    train_score = bst.best_score_['train']['auc']
        
    valid_scores.append(valid_score)
    train_scores.append(train_score)    
Training until validation scores don't improve for 100 rounds.
[200]	valid's auc: 0.75444	train's auc: 0.795669
[400]	valid's auc: 0.756467	train's auc: 0.82712
[600]	valid's auc: 0.757756	train's auc: 0.852899
[800]	valid's auc: 0.758772	train's auc: 0.874441
Early stopping, best iteration is:
[788]	valid's auc: 0.758863	train's auc: 0.873269
Training until validation scores don't improve for 100 rounds.
[200]	valid's auc: 0.759216	train's auc: 0.794844
[400]	valid's auc: 0.761445	train's auc: 0.825947
[600]	valid's auc: 0.761625	train's auc: 0.85158
Early stopping, best iteration is:
[637]	valid's auc: 0.761917	train's auc: 0.855891
Training until validation scores don't improve for 100 rounds.
[200]	valid's auc: 0.761492	train's auc: 0.794511
[400]	valid's auc: 0.763792	train's auc: 0.826541
[600]	valid's auc: 0.764906	train's auc: 0.851752
[800]	valid's auc: 0.765387	train's auc: 0.873426
[1000]	valid's auc: 0.765852	train's auc: 0.892151
Early stopping, best iteration is:
[999]	valid's auc: 0.765862	train's auc: 0.892064
Training until validation scores don't improve for 100 rounds.
[200]	valid's auc: 0.764495	train's auc: 0.79347
[400]	valid's auc: 0.765426	train's auc: 0.825469
Early stopping, best iteration is:
[372]	valid's auc: 0.765564	train's auc: 0.821399
Training until validation scores don't improve for 100 rounds.
[200]	valid's auc: 0.758703	train's auc: 0.795766
[400]	valid's auc: 0.759832	train's auc: 0.826954
[600]	valid's auc: 0.760387	train's auc: 0.852869
Early stopping, best iteration is:
[591]	valid's auc: 0.760449	train's auc: 0.851905
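With the fold loop finished, the recorded scores give a quick summary of how the cross-validated model is doing. This summary is a sketch I've added here rather than part of the original run; it assumes the arrays populated in the loop above are still in scope:

from sklearn.metrics import roc_auc_score

# Mean of the per-fold validation AUCs recorded in the loop above
print('Mean validation AUC: {:.5f}'.format(np.mean(valid_scores)))
# AUC computed over the stacked out-of-fold predictions
print('Out-of-fold AUC: {:.5f}'.format(roc_auc_score(labels, out_of_fold)))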
In [558]:
final_predictions = pd.DataFrame({'SK_ID_CURR': test['SK_ID_CURR'], 'TARGET': test_predictions})
final_predictions.to_csv('default_predictions.csv', index=False)
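Finally, a cheap sanity check on the submission file before uploading never hurts; this check is a sketch rather than something from the original run:

# Hypothetical: confirm the submission has the expected shape and probability range
submission = pd.read_csv('default_predictions.csv')
print(submission.shape)
print(submission['TARGET'].describe())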