Home Credit Default Risk is my first 'real' Kaggle competition submission. Since the Titanic dataset is so ubiquitous, and I got some help establishing the initial structure from my MOOC resources, it didn't provide quite the same challenge as a legitimate competition would. Home Credit asked Kagglers to predict whether an individual would pay back, or default on, a loan, with the underlying mission of broadening financial inclusion for the unbanked community.
This is a larger dataset (though not huge), with approximately 300,000 observations in the training data and 50,000 in the test set, and the competition also offered a number of auxiliary datasets. The primary training dataset contains observations at the individual borrower level, while the supplementary datasets provide information at coarser units of observation. Given my time constraints while this competition was live (the summer of 2018), I focused on doing the best I could with the main dataset and did not spend the time incorporating the supplemental data. Moreover, since I didn't have time to annotate this notebook in a way I felt did the work justice before the competition closed in August, I didn't submit a formal kernel. This notebook details the work I did on this competition end-to-end. While I didn't place particularly high on the public leaderboard, I learned a lot by struggling through this, and I am happy with the progress I made working through things.
# Import necessary libraries for data preparation/EDA
import os
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
%matplotlib inline
sns.set_style('darkgrid')
# Set directory and import application training data
os.chdir(r'C:\Users\Jared\Documents\datasci\projects\kaggle\home_credit_default')
application_train = pd.read_csv('application_train.csv')
test = pd.read_csv('application_test.csv')
application_train.info()
I begin by stacking the training and test sets on top of one another. I acknowledge there are some potential issues with this, first and foremost data leakage: using test data along with training data for pre-processing potentially injects information contained in the test set that we wouldn't typically be able to observe. Since Kaggle provides something close to a full population here, steps like mean/median imputation of missing values benefit from using the full sample. Because I concatenate, I can separate the test data back out from the training data when I actually conduct my modelling steps further on. This makes cleaning the data much easier and gives more realistic estimates of the means and medians used to fill missing values than treating the two sets separately.
# Stack train and test data
t_obs = application_train.shape[0]
print(t_obs)
test['TARGET'] = np.nan
test = test[application_train.columns.tolist()]
all_data = pd.concat([application_train, test], axis=0)
all_data.reset_index(inplace=True)
all_data.info()
I chose to work with both the training dataset and the stacked 'all_data' dataset concurrently in my exploratory data analysis and cleaning. I like to visualize the data I'm working with relative to the target; naturally, that's not feasible with the stacked dataset, which lacks target values for the test rows, so both datasets come into play here.
A quick countplot of the target variable makes it clear that the dataset we're dealing with is fairly imbalanced, which has some ramifications for how we train and evaluate our models (more on this below).
# Countplot of the target variable indicates how imbalanced this dataset is
sns.countplot('TARGET', data=application_train, palette='viridis')
plt.show()
print('% Positive Class: {}'.format(100*round(application_train['TARGET'].mean(),4)))
# Visualize missing data - determine the extent of missings in the dataset
# Not super informative with this many features
sns.heatmap(application_train.isnull(), yticklabels=False, xticklabels=False, cbar=False, cmap='viridis')
plt.show()
# Create a dataframe with information on these missing values
tot_msg = application_train.isnull().sum(axis=0)
pct_msg = tot_msg/application_train.shape[0]
msg_info = pd.concat([tot_msg, pct_msg], axis=1)
msg_info.columns = ['# Missing Obs.', '% Missing Obs.']
msg_info[2:62]
msg_info[62:122]
# In order to speed up some processing, remove the housing summary statistics that are predominantly missing from the data
to_drop_ = all_data.columns[45:92].tolist()
all_data.drop(to_drop_, axis=1, inplace=True)
all_data.info()
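As an aside, here is a sketch of an alternative that avoids hard-coding column positions: flag any remaining column whose share of missing values exceeds a threshold (the 50% cutoff is an assumption for illustration, not something used elsewhere in this notebook).
# Sketch: list remaining columns that are more than half missing (hypothetical threshold)
missing_frac = all_data.isnull().mean()
print(missing_frac[missing_frac > 0.5].sort_values(ascending=False))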
These variables represent what I would consider the foundation of how decisions to extend loans are made. It doesn't take much imagination or institutional knowledge to see that the income, credit, annuity, and employment variables are likely to be highly important for determining whether or not an individual is able to pay back a loan. Here I explore the distributions of these variables relative to the target, and uncover some idiosyncrasies about how they've been coded in the dataset.
application_train['Default'] = np.where(application_train['TARGET'] == 1, "Defaulted", "Paid")
# Look at distribution of AMT_INCOME_TOTAL
plt.figure(figsize=(18,6))
plt.title('Distribution of AMT_INCOME_TOTAL, by Default Status')
sns.stripplot(x='AMT_INCOME_TOTAL', y='Default', data=application_train, jitter=1, palette='viridis')
plt.show()
# Given the clear outliers toward the end of the distribution, remove these
application_train['AMT_INCOME_TOTAL'].max()
_ = application_train[application_train['AMT_INCOME_TOTAL'] != 1.170000e+08]
plt.figure(figsize=(18,6))
plt.title('Distribution of AMT_INCOME_TOTAL, by Default Status - No Outliers')
sns.stripplot(x='AMT_INCOME_TOTAL', y='Default', data=_, jitter=1, palette='viridis')
plt.show()
plt.figure(figsize=(18,6))
plt.title("Distribution of AMT_INCOME_TOTAL - No Outliers, Individual Defaults")
sns.distplot(_[_['TARGET']==1].AMT_INCOME_TOTAL)
plt.show()
plt.figure(figsize=(18,6))
plt.title("Distribution of AMT_INCOME_TOTAL - No Outliers, Individual Doesn't Default")
sns.distplot(_[_['TARGET']==0].AMT_INCOME_TOTAL)
plt.show()
plt.figure(figsize=(18,6))
plt.title("Distribution of AMT_CREDIT")
sns.stripplot(x='AMT_CREDIT', y='Default', data=application_train, jitter=1, palette='viridis')
plt.show()
fig, ax = plt.subplots(figsize=(18,6))
labels_ = ['Defaulted', 'Paid']
for target in [1,0]:
    sns.distplot(application_train[application_train['TARGET']==target].AMT_CREDIT, label=labels_[target])
plt.legend()
plt.show()
plt.figure(figsize=(18,6))
plt.title("Distribution of AMT_ANNUITY")
sns.stripplot(x='AMT_ANNUITY', y='Default', data=application_train, jitter=1, palette='viridis')
plt.show()
fig, ax = plt.subplots(figsize=(18,10))
labels_ = ['Defaulted', 'Paid']
for target in [1,0]:
    sns.distplot(application_train[application_train['TARGET']==target].AMT_ANNUITY.dropna(), label=labels_[target])
plt.legend()
plt.show()
plt.figure(figsize=(18,6))
plt.title("Distribution of AMT_GOODS_PRICE")
sns.stripplot(x='AMT_GOODS_PRICE', y='Default', data=application_train, jitter=1, palette='viridis')
plt.show()
fig, ax = plt.subplots(figsize=(18,10))
labels_ = ['Defaulted', 'Paid']
for target in [1,0]:
    sns.distplot(application_train[application_train['TARGET']==target].AMT_GOODS_PRICE.dropna(), label=labels_[target])
plt.legend()
plt.show()
plt.figure(figsize=(18,6))
plt.title("Distribution of DAYS_EMPLOYED")
sns.stripplot(x='DAYS_EMPLOYED', y='Default', data=application_train, jitter=1, palette='viridis')
plt.show()
# DAYS_EMPLOYED uses 365243 (roughly 1,000 years) as a placeholder value; remove these clear outliers before plotting
application_train['DAYS_EMPLOYED'].max()
_ = application_train[application_train['DAYS_EMPLOYED'] != 365243]
plt.figure(figsize=(18,6))
plt.title('Distribution of DAYS_EMPLOYED, by Default Status - No Outliers')
sns.stripplot(x='DAYS_EMPLOYED', y='Default', data=_, jitter=1, palette='viridis')
plt.show()
plt.figure(figsize=(18,6))
plt.title("Distribution of DAYS_BIRTH")
sns.stripplot(x='DAYS_BIRTH', y='Default', data=application_train, jitter=1, palette='viridis')
plt.show()
While there isn't a whole lot of background provided on how these external source variables are constructed, they VERY reliably top variable importance figures, and they clearly exhibit different distributions relative to the target variable, as the distribution plots here show. Based on some of the background from the competition, these are essentially scores from third parties used to determine the credit-worthiness of applicants. It follows that they should contain some predictive power with respect to whether or not a borrower defaults, so it is comforting, and not unexpected, that they perform so well.
fig, ax = plt.subplots(figsize=(18,10))
labels_ = ['Defaulted', 'Paid']
for target in [1,0]:
    sns.distplot(application_train[application_train['TARGET']==target].EXT_SOURCE_1.dropna(), label=labels_[target])
plt.legend()
plt.show()
fig, ax = plt.subplots(figsize=(18,10))
labels_ = ['Defaulted', 'Paid']
for target in [1,0]:
    sns.distplot(application_train[application_train['TARGET']==target].EXT_SOURCE_2.dropna(), label=labels_[target])
plt.legend()
plt.show()
fig, ax = plt.subplots(figsize=(18,10))
labels_ = ['Defaulted', 'Paid']
for target in [1,0]:
    sns.distplot(application_train[application_train['TARGET']==target].EXT_SOURCE_3.dropna(), label=labels_[target])
plt.legend()
plt.show()
I was also curious about whether these external scores could represent transformations of some of the other variables in our dataset. It turns out that they are not highly correlated with the other variables (the highest being the 0.26 correlation between the first external source variable and the credit/goods price features). This further suggests that they add explanatory power above and beyond what we can gather from the other features in the dataset.
ext_and_amts = application_train[['AMT_INCOME_TOTAL', 'AMT_CREDIT', 'AMT_GOODS_PRICE', 'AMT_ANNUITY',
'EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3', 'Default']]
plt.figure(figsize=(10,6))
sns.heatmap(ext_and_amts.drop('Default',axis=1).corr(), cmap='viridis', annot=True)
plt.title("Correlations between 'AMT' and 'EXT_SOURCE' features")
plt.show()
plt.figure(figsize=(10,10))
sns.pairplot(ext_and_amts.dropna(), hue='Default')
plt.show()
sns.countplot('TARGET', hue='CODE_GENDER', data=application_train, palette='viridis')
plt.show()
sns.countplot('TARGET', hue='FLAG_OWN_REALTY', data=application_train, palette='viridis')
plt.show()
sns.countplot('TARGET', hue='FLAG_OWN_CAR', data=application_train, palette='viridis')
plt.show()
sns.countplot('TARGET', hue='NAME_CONTRACT_TYPE', data=application_train, palette='viridis')
plt.show()
plt.figure(figsize=(18,6))
sns.countplot('TARGET', hue='NAME_HOUSING_TYPE', data=application_train, palette='viridis')
plt.show()
plt.figure(figsize=(18,6))
sns.countplot('TARGET', hue='NAME_TYPE_SUITE', data=application_train, palette='viridis')
plt.show()
plt.figure(figsize=(18,6))
sns.countplot('TARGET', hue='NAME_INCOME_TYPE', data=application_train, palette='viridis')
plt.show()
plt.figure(figsize=(18,6))
sns.countplot('TARGET', hue='NAME_EDUCATION_TYPE', data=application_train, palette='viridis')
plt.show()
plt.figure(figsize=(18,6))
sns.countplot('TARGET', hue='NAME_FAMILY_STATUS', data=application_train, palette='viridis')
plt.show()
plt.figure(figsize=(18,6))
sns.countplot('TARGET', hue='OCCUPATION_TYPE', data=application_train.dropna(), palette='viridis')
plt.show()
I identified some of the idiosyncratic outliers in the dataset through my exploratory analysis above, and found two others trawling through other kernels in the competition. For features I did not end up using, I did not bother removing the strangely coded outliers/missing values. Moreover, rather than just dropping the outliers (as I have done in the application_train cut of the data), in the stacked data I replace them with null/NA values and then impute them, since all the test rows must be present to generate predictions that can actually be scored.
# Remove outliers per the findings above and determined from other kernels in the training data alone
application_train = application_train[application_train['AMT_INCOME_TOTAL'] != 1.170000e+08]
application_train = application_train[application_train['AMT_REQ_CREDIT_BUREAU_QRT'] != 261] # From other kernels
application_train = application_train[application_train['OBS_30_CNT_SOCIAL_CIRCLE'] < 300] # From other kernels
application_train['DAYS_EMPLOYED'] = (application_train['DAYS_EMPLOYED'].apply(lambda x: x if x != 365243 else np.nan))
# Remove outliers per the findings above and determined from other kernels in the full sample dataset
all_data['DAYS_EMPLOYED'] = (all_data['DAYS_EMPLOYED'].apply(lambda x: x if x != 365243 else np.nan))
all_data['AMT_INCOME_TOTAL'] = (all_data['AMT_INCOME_TOTAL'].apply(lambda x: x if x != 1.170000e+08 else np.nan))
# Since these feel more like outliers than encoded missing values, I replace them with the max value in the feature
all_data.loc[all_data['AMT_INCOME_TOTAL'].isnull(), 'AMT_INCOME_TOTAL'] = all_data['AMT_INCOME_TOTAL'].dropna().max()
There are a fair number of categorical variables in the data that need to be appropriately encoded so we can use them in the machine learning models below. Binary variables are simply pre-processed with LabelEncoder, and one-hot encoding is used on the categorical variables with more than two classes.
# Using LabelEncoder for our binary categoricals
from sklearn.preprocessing import LabelEncoder
binary_cats = ['FLAG_OWN_REALTY', 'FLAG_OWN_CAR', 'NAME_CONTRACT_TYPE']
for cat in binary_cats:
    label = LabelEncoder()
    application_train[cat] = label.fit_transform(application_train[cat])
print(application_train[binary_cats].info())
for cat in binary_cats:
    label = LabelEncoder()
    all_data[cat] = label.fit_transform(all_data[cat])
print(all_data[binary_cats].info())
# Using OneHotEncoder for multi-category categoricals
from sklearn.preprocessing import OneHotEncoder
# First impute majority class for null values in 'NAME_TYPE_SUITE'
application_train.loc[application_train['NAME_TYPE_SUITE'].isnull(), 'NAME_TYPE_SUITE'] = 'Unaccompanied'
application_train.loc[application_train['OCCUPATION_TYPE'].isnull(), 'OCCUPATION_TYPE'] = 'None'
all_data.loc[all_data['NAME_TYPE_SUITE'].isnull(), 'NAME_TYPE_SUITE'] = 'Unaccompanied'
all_data.loc[all_data['OCCUPATION_TYPE'].isnull(), 'OCCUPATION_TYPE'] = 'None'
multi_cats = ['CODE_GENDER', 'NAME_TYPE_SUITE', 'NAME_EDUCATION_TYPE',
'NAME_FAMILY_STATUS', 'NAME_HOUSING_TYPE', 'NAME_INCOME_TYPE',
'ORGANIZATION_TYPE', 'OCCUPATION_TYPE']
feat_matrices = []
for cat in multi_cats:
    # Convert to integer
    label = LabelEncoder()
    label_encoded = label.fit_transform(all_data[cat])
    # Convert to multi-category with OHE
    ohe = OneHotEncoder(sparse=False)
    label_encoded = label_encoded.reshape(len(label_encoded),1)
    ohe_encoded = ohe.fit_transform(label_encoded)
    # Formatting to concatenate with all_data
    ohe_df = pd.DataFrame(ohe_encoded)
    n_cats = ohe_df.columns.tolist()
    cat_names = [cat + "_CLASS_" + str(i) for i in n_cats]
    ohe_df.columns = cat_names
    feat_matrices.append(ohe_df)
all_multi_encs = pd.concat(feat_matrices, axis=1)
all_multi_encs.iloc[:,1:3].head()
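For reference, pandas' get_dummies would produce an equivalent dummy matrix in a single step (a sketch only; the pipeline below keeps the LabelEncoder/OneHotEncoder output and its _CLASS_ column names).
# Sketch: one-step dummy encoding of the same columns (column names will differ from the _CLASS_ scheme above)
ohe_alternative = pd.get_dummies(all_data[multi_cats])
ohe_alternative.head()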
Based on my experience with some other competitions and several iterations of training models in this competition, creating coarser, binned variants of the continuous variables generally improves performance. I opted to create bins, with cuts based on quantiles, for all of the 'key' continuous variables (read: the variables that make intuitive sense to include in the model).
Again, I perform this cleaning both on the training dataset alone (to visualize the bins relative to the target variable) and on the stacked data. This also allows for more granular imputation of missing values in the external source variables, which are missing a fair number of values.
# Create a AMT_ANNUITY bucket feature
application_train.loc[application_train['AMT_ANNUITY'].isnull(), 'AMT_ANNUITY'] = application_train['AMT_ANNUITY'].median()
application_train['AMT_ANNUITY_BUCKETS'] = pd.qcut(application_train['AMT_ANNUITY'], 5, labels=np.arange(0,5))
sns.countplot(application_train['AMT_ANNUITY_BUCKETS'], palette='viridis')
plt.show()
sns.countplot(x='Default', hue='AMT_ANNUITY_BUCKETS', data=application_train, palette='viridis')
plt.show()
all_data.loc[all_data['AMT_ANNUITY'].isnull(), 'AMT_ANNUITY'] = all_data['AMT_ANNUITY'].median()
all_data['AMT_ANNUITY_BUCKETS'] = pd.qcut(all_data['AMT_ANNUITY'], 5, labels=np.arange(0,5))
# Create a AMT_INCOME_TOTAL bucket feature
application_train['AMT_INCOME_TOTAL_BUCKETS'] = pd.qcut(application_train['AMT_INCOME_TOTAL'], 7, labels=np.arange(0,7))
sns.countplot(application_train['AMT_INCOME_TOTAL_BUCKETS'], palette='viridis')
plt.show()
sns.countplot(x='Default', hue='AMT_INCOME_TOTAL_BUCKETS', data=application_train, palette='viridis')
plt.show()
all_data['AMT_INCOME_TOTAL_BUCKETS'] = pd.qcut(all_data['AMT_INCOME_TOTAL'], 7, labels=np.arange(0,7))
# Create a AMT_CREDIT bucket feature
application_train['AMT_CREDIT_BUCKETS'] = pd.qcut(application_train['AMT_CREDIT'], 4, labels=np.arange(0,4))
sns.countplot(application_train['AMT_CREDIT_BUCKETS'], palette='viridis')
plt.show()
sns.countplot(x='Default', hue='AMT_CREDIT_BUCKETS', data=application_train, palette='viridis')
plt.show()
all_data['AMT_CREDIT_BUCKETS'] = pd.qcut(all_data['AMT_CREDIT'], 4, labels=np.arange(0,4))
# AMT_GOODS_PRICE has missing values and is highly correlated with credit, so impute the missings by credit bucket
# Visualize AMT_GOODS_PRICE relative to the credit buckets, which do seem to separate things out somewhat
plt.figure(figsize=(6,6))
sns.boxplot(x='AMT_CREDIT_BUCKETS', y='AMT_GOODS_PRICE', data=application_train ,palette='viridis')
plt.show()
# Impute the missing values in AMT_GOODS_PRICE based on credit buckets
AMT_GOODS_PRICE_by_AMT_CREDIT_BUCKETS = application_train.groupby('AMT_CREDIT_BUCKETS').median()['AMT_GOODS_PRICE']
print(AMT_GOODS_PRICE_by_AMT_CREDIT_BUCKETS)
application_train['AMT_GOODS_PRICE'] = np.where((application_train['AMT_GOODS_PRICE'].isnull()) & (application_train['AMT_CREDIT_BUCKETS'] == 0), AMT_GOODS_PRICE_by_AMT_CREDIT_BUCKETS.iloc[0],
np.where((application_train['AMT_GOODS_PRICE'].isnull()) & (application_train['AMT_CREDIT_BUCKETS'] == 1), AMT_GOODS_PRICE_by_AMT_CREDIT_BUCKETS.iloc[1],
np.where((application_train['AMT_GOODS_PRICE'].isnull()) & (application_train['AMT_CREDIT_BUCKETS'] == 2), AMT_GOODS_PRICE_by_AMT_CREDIT_BUCKETS.iloc[2],
np.where((application_train['AMT_GOODS_PRICE'].isnull()) & (application_train['AMT_CREDIT_BUCKETS'] == 3), AMT_GOODS_PRICE_by_AMT_CREDIT_BUCKETS.iloc[3],
application_train['AMT_GOODS_PRICE']))))
application_train['AMT_GOODS_BUCKETS'] = pd.qcut(application_train['AMT_GOODS_PRICE'], 6, labels=np.arange(0,6))
sns.countplot(application_train['AMT_GOODS_BUCKETS'], palette='viridis')
plt.show()
sns.countplot('Default', hue='AMT_GOODS_BUCKETS', data=application_train, palette='viridis')
plt.show()
# Impute the missing values in AMT_GOODS_PRICE based on credit buckets
AMT_GOODS_PRICE_by_AMT_CREDIT_BUCKETS = all_data.groupby('AMT_CREDIT_BUCKETS').median()['AMT_GOODS_PRICE']
print(AMT_GOODS_PRICE_by_AMT_CREDIT_BUCKETS)
all_data['AMT_GOODS_PRICE'] = np.where((all_data['AMT_GOODS_PRICE'].isnull()) & (all_data['AMT_CREDIT_BUCKETS'] == 0), AMT_GOODS_PRICE_by_AMT_CREDIT_BUCKETS.iloc[0],
np.where((all_data['AMT_GOODS_PRICE'].isnull()) & (all_data['AMT_CREDIT_BUCKETS'] == 1), AMT_GOODS_PRICE_by_AMT_CREDIT_BUCKETS.iloc[1],
np.where((all_data['AMT_GOODS_PRICE'].isnull()) & (all_data['AMT_CREDIT_BUCKETS'] == 2), AMT_GOODS_PRICE_by_AMT_CREDIT_BUCKETS.iloc[2],
np.where((all_data['AMT_GOODS_PRICE'].isnull()) & (all_data['AMT_CREDIT_BUCKETS'] == 3), AMT_GOODS_PRICE_by_AMT_CREDIT_BUCKETS.iloc[3],
all_data['AMT_GOODS_PRICE']))))
all_data['AMT_GOODS_BUCKETS'] = pd.qcut(all_data['AMT_GOODS_PRICE'], 6, labels=np.arange(0,6))
days_feats = application_train[['DAYS_EMPLOYED', 'DAYS_BIRTH', 'DAYS_REGISTRATION', 'DAYS_ID_PUBLISH']]
print(days_feats.info())
plt.figure(figsize=(10,6))
sns.heatmap(days_feats.corr(), cmap='viridis', annot=True)
plt.title("Correlations between 'DAYS' features")
plt.show()
# Create a DAYS_REGISTRATION bucket feature
application_train['DAYS_REGISTRATION_BUCKETS'] = pd.qcut(application_train['DAYS_REGISTRATION'], 4, labels=np.arange(0,4))
sns.countplot(application_train['DAYS_REGISTRATION_BUCKETS'], palette='viridis')
plt.show()
sns.countplot(x='Default', hue='DAYS_REGISTRATION_BUCKETS', data=application_train, palette='viridis')
plt.show()
all_data['DAYS_REGISTRATION_BUCKETS'] = pd.qcut(all_data['DAYS_REGISTRATION'], 4, labels=np.arange(0,4))
# Visualize DAYS_EMPLOYED relative to the registration buckets, which do seem to separate things out somewhat
plt.figure(figsize=(6,6))
sns.boxplot(x='DAYS_REGISTRATION_BUCKETS', y='DAYS_EMPLOYED', data=application_train ,palette='viridis')
plt.show()
# Impute the missing values in DAYS_EMPLOYED based on registration buckets
DAYS_EMPLOYED_by_DAYS_REGISTRATION_BUCKETS = application_train.groupby('DAYS_REGISTRATION_BUCKETS').median()['DAYS_EMPLOYED']
print(DAYS_EMPLOYED_by_DAYS_REGISTRATION_BUCKETS)
application_train['DAYS_EMPLOYED'] = np.where((application_train['DAYS_EMPLOYED'].isnull()) & (application_train['DAYS_REGISTRATION_BUCKETS'] == 0), DAYS_EMPLOYED_by_DAYS_REGISTRATION_BUCKETS.iloc[0],
np.where((application_train['DAYS_EMPLOYED'].isnull()) & (application_train['DAYS_REGISTRATION_BUCKETS'] == 1), DAYS_EMPLOYED_by_DAYS_REGISTRATION_BUCKETS.iloc[1],
np.where((application_train['DAYS_EMPLOYED'].isnull()) & (application_train['DAYS_REGISTRATION_BUCKETS'] == 2), DAYS_EMPLOYED_by_DAYS_REGISTRATION_BUCKETS.iloc[2],
np.where((application_train['DAYS_EMPLOYED'].isnull()) & (application_train['DAYS_REGISTRATION_BUCKETS'] == 3), DAYS_EMPLOYED_by_DAYS_REGISTRATION_BUCKETS.iloc[3],
application_train['DAYS_EMPLOYED']))))
# Create a DAYS_EMPLOYED bucket feature
application_train['DAYS_EMPLOYED_BUCKETS'] = pd.qcut(application_train['DAYS_EMPLOYED'], 6, labels=np.arange(0,6))
sns.countplot(application_train['DAYS_EMPLOYED_BUCKETS'], palette='viridis')
plt.show()
sns.countplot(x='Default', hue='DAYS_EMPLOYED_BUCKETS', data=application_train, palette='viridis')
plt.show()
DAYS_EMPLOYED_by_DAYS_REGISTRATION_BUCKETS = all_data.groupby('DAYS_REGISTRATION_BUCKETS').median()['DAYS_EMPLOYED']
print(DAYS_EMPLOYED_by_DAYS_REGISTRATION_BUCKETS)
all_data['DAYS_EMPLOYED'] = np.where((all_data['DAYS_EMPLOYED'].isnull()) & (all_data['DAYS_REGISTRATION_BUCKETS'] == 0), DAYS_EMPLOYED_by_DAYS_REGISTRATION_BUCKETS.iloc[0],
np.where((all_data['DAYS_EMPLOYED'].isnull()) & (all_data['DAYS_REGISTRATION_BUCKETS'] == 1), DAYS_EMPLOYED_by_DAYS_REGISTRATION_BUCKETS.iloc[1],
np.where((all_data['DAYS_EMPLOYED'].isnull()) & (all_data['DAYS_REGISTRATION_BUCKETS'] == 2), DAYS_EMPLOYED_by_DAYS_REGISTRATION_BUCKETS.iloc[2],
np.where((all_data['DAYS_EMPLOYED'].isnull()) & (all_data['DAYS_REGISTRATION_BUCKETS'] == 3), DAYS_EMPLOYED_by_DAYS_REGISTRATION_BUCKETS.iloc[3],
all_data['DAYS_EMPLOYED']))))
all_data['DAYS_EMPLOYED_BUCKETS'] = pd.qcut(all_data['DAYS_EMPLOYED'], 6, labels=np.arange(0,6))
# Create a DAYS_ID_PUBLISH bucket feature
application_train['DAYS_ID_PUBLISH_BUCKETS'] = pd.qcut(application_train['DAYS_ID_PUBLISH'], 12, labels=np.arange(0,12))
sns.countplot(application_train['DAYS_ID_PUBLISH_BUCKETS'], palette='viridis')
plt.show()
sns.countplot(x='Default', hue='DAYS_ID_PUBLISH_BUCKETS', data=application_train, palette='viridis')
plt.show()
all_data['DAYS_ID_PUBLISH_BUCKETS'] = pd.qcut(all_data['DAYS_ID_PUBLISH'], 12, labels=np.arange(0,12))
# Encode buckets to integers
type_series = all_data.dtypes == 'category'
cat_feats = type_series[type_series].index.tolist()
for cat in cat_feats:
    label = LabelEncoder()
    all_data[cat] = label.fit_transform(all_data[cat])
As I mentioned briefly above, these variables seem exceptionally valuable in predicting home credit default. Since they are missing values fairly extensively, I opted to impute the missings based on the corresponding key variables that I binned in section 3b. The choice of which bucketed variable to use is determined by its correlation with the external source variable in question.
# Visualize the external source variable relative to the credit buckets, which do seem to separate things out somewhat
plt.figure(figsize=(6,6))
sns.boxplot(x='AMT_CREDIT_BUCKETS', y='EXT_SOURCE_1', data=application_train ,palette='viridis')
plt.show()
# Impute the missing values in EXT_SOURCE_1 based on credit buckets
EXT_SOURCE_1_by_AMT_CREDIT_BUCKETS = application_train.groupby('AMT_CREDIT_BUCKETS').median()['EXT_SOURCE_1']
print(EXT_SOURCE_1_by_AMT_CREDIT_BUCKETS)
application_train['EXT_SOURCE_1'] = np.where((application_train['EXT_SOURCE_1'].isnull()) & (application_train['AMT_CREDIT_BUCKETS'] == 0), EXT_SOURCE_1_by_AMT_CREDIT_BUCKETS.iloc[0],
np.where((application_train['EXT_SOURCE_1'].isnull()) & (application_train['AMT_CREDIT_BUCKETS'] == 1), EXT_SOURCE_1_by_AMT_CREDIT_BUCKETS.iloc[1],
np.where((application_train['EXT_SOURCE_1'].isnull()) & (application_train['AMT_CREDIT_BUCKETS'] == 2), EXT_SOURCE_1_by_AMT_CREDIT_BUCKETS.iloc[2],
np.where((application_train['EXT_SOURCE_1'].isnull()) & (application_train['AMT_CREDIT_BUCKETS'] == 3), EXT_SOURCE_1_by_AMT_CREDIT_BUCKETS.iloc[3],
application_train['EXT_SOURCE_1']))))
EXT_SOURCE_1_by_AMT_CREDIT_BUCKETS = all_data.groupby('AMT_CREDIT_BUCKETS').median()['EXT_SOURCE_1']
print(EXT_SOURCE_1_by_AMT_CREDIT_BUCKETS)
all_data['EXT_SOURCE_1'] = np.where((all_data['EXT_SOURCE_1'].isnull()) & (all_data['AMT_CREDIT_BUCKETS'] == 0), EXT_SOURCE_1_by_AMT_CREDIT_BUCKETS.iloc[0],
np.where((all_data['EXT_SOURCE_1'].isnull()) & (all_data['AMT_CREDIT_BUCKETS'] == 1), EXT_SOURCE_1_by_AMT_CREDIT_BUCKETS.iloc[1],
np.where((all_data['EXT_SOURCE_1'].isnull()) & (all_data['AMT_CREDIT_BUCKETS'] == 2), EXT_SOURCE_1_by_AMT_CREDIT_BUCKETS.iloc[2],
np.where((all_data['EXT_SOURCE_1'].isnull()) & (all_data['AMT_CREDIT_BUCKETS'] == 3), EXT_SOURCE_1_by_AMT_CREDIT_BUCKETS.iloc[3],
all_data['EXT_SOURCE_1']))))
# Visualize the external source variable relative to the income buckets, which do seem to separate things out somewhat
plt.figure(figsize=(6,6))
sns.boxplot(x='AMT_INCOME_TOTAL_BUCKETS', y='EXT_SOURCE_2', data=application_train ,palette='viridis')
plt.show()
# Impute the missing values in EXT_SOURCE_2 based on INCOME_TOTAL buckets
EXT_SOURCE_2_by_AMT_INCOME_TOTAL_BUCKETS = application_train.groupby('AMT_INCOME_TOTAL_BUCKETS').median()['EXT_SOURCE_2']
print(EXT_SOURCE_2_by_AMT_INCOME_TOTAL_BUCKETS)
application_train['EXT_SOURCE_2'] = np.where((application_train['EXT_SOURCE_2'].isnull()) & (application_train['AMT_INCOME_TOTAL_BUCKETS'] == 0), EXT_SOURCE_2_by_AMT_INCOME_TOTAL_BUCKETS.iloc[0],
np.where((application_train['EXT_SOURCE_2'].isnull()) & (application_train['AMT_INCOME_TOTAL_BUCKETS'] == 1), EXT_SOURCE_2_by_AMT_INCOME_TOTAL_BUCKETS.iloc[1],
np.where((application_train['EXT_SOURCE_2'].isnull()) & (application_train['AMT_INCOME_TOTAL_BUCKETS'] == 2), EXT_SOURCE_2_by_AMT_INCOME_TOTAL_BUCKETS.iloc[2],
np.where((application_train['EXT_SOURCE_2'].isnull()) & (application_train['AMT_INCOME_TOTAL_BUCKETS'] == 3), EXT_SOURCE_2_by_AMT_INCOME_TOTAL_BUCKETS.iloc[3],
np.where((application_train['EXT_SOURCE_2'].isnull()) & (application_train['AMT_INCOME_TOTAL_BUCKETS'] == 4), EXT_SOURCE_2_by_AMT_INCOME_TOTAL_BUCKETS.iloc[4],
np.where((application_train['EXT_SOURCE_2'].isnull()) & (application_train['AMT_INCOME_TOTAL_BUCKETS'] == 5), EXT_SOURCE_2_by_AMT_INCOME_TOTAL_BUCKETS.iloc[5],
np.where((application_train['EXT_SOURCE_2'].isnull()) & (application_train['AMT_INCOME_TOTAL_BUCKETS'] == 6), EXT_SOURCE_2_by_AMT_INCOME_TOTAL_BUCKETS.iloc[6],
application_train['EXT_SOURCE_2'])))))))
# Impute the missing values in EXT_SOURCE_2 based on INCOME_TOTAL buckets
EXT_SOURCE_2_by_AMT_INCOME_TOTAL_BUCKETS = all_data.groupby('AMT_INCOME_TOTAL_BUCKETS').median()['EXT_SOURCE_2']
print(EXT_SOURCE_2_by_AMT_INCOME_TOTAL_BUCKETS)
all_data['EXT_SOURCE_2'] = np.where((all_data['EXT_SOURCE_2'].isnull()) & (all_data['AMT_INCOME_TOTAL_BUCKETS'] == 0), EXT_SOURCE_2_by_AMT_INCOME_TOTAL_BUCKETS.iloc[0],
np.where((all_data['EXT_SOURCE_2'].isnull()) & (all_data['AMT_INCOME_TOTAL_BUCKETS'] == 1), EXT_SOURCE_2_by_AMT_INCOME_TOTAL_BUCKETS.iloc[1],
np.where((all_data['EXT_SOURCE_2'].isnull()) & (all_data['AMT_INCOME_TOTAL_BUCKETS'] == 2), EXT_SOURCE_2_by_AMT_INCOME_TOTAL_BUCKETS.iloc[2],
np.where((all_data['EXT_SOURCE_2'].isnull()) & (all_data['AMT_INCOME_TOTAL_BUCKETS'] == 3), EXT_SOURCE_2_by_AMT_INCOME_TOTAL_BUCKETS.iloc[3],
np.where((all_data['EXT_SOURCE_2'].isnull()) & (all_data['AMT_INCOME_TOTAL_BUCKETS'] == 4), EXT_SOURCE_2_by_AMT_INCOME_TOTAL_BUCKETS.iloc[4],
np.where((all_data['EXT_SOURCE_2'].isnull()) & (all_data['AMT_INCOME_TOTAL_BUCKETS'] == 5), EXT_SOURCE_2_by_AMT_INCOME_TOTAL_BUCKETS.iloc[5],
np.where((all_data['EXT_SOURCE_2'].isnull()) & (all_data['AMT_INCOME_TOTAL_BUCKETS'] == 6), EXT_SOURCE_2_by_AMT_INCOME_TOTAL_BUCKETS.iloc[6],
all_data['EXT_SOURCE_2'])))))))
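The nested np.where calls above get verbose; as a sketch (not what was actually run for the results here), the same bucket-median imputation can be written more compactly with groupby/transform.
# Sketch: fill missing values with the median of the corresponding bucket
def impute_by_bucket(df, value_col, bucket_col):
    bucket_medians = df.groupby(bucket_col)[value_col].transform('median')
    return df[value_col].fillna(bucket_medians)
# e.g. all_data['EXT_SOURCE_2'] = impute_by_bucket(all_data, 'EXT_SOURCE_2', 'AMT_INCOME_TOTAL_BUCKETS')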
As the correlation heatmap in the exploratory data analysis would suggest, the third external source variable is not well-correlated with any of the other 'key' variables in the data, so I opted for simple median-imputation here.
# EXT_SOURCE_3 is poorly correlated with all of the other AMT variables, so we will apply median imputing here
application_train.loc[application_train['EXT_SOURCE_3'].isnull(), 'EXT_SOURCE_3'] = application_train['EXT_SOURCE_3'].median()
all_data.loc[all_data['EXT_SOURCE_3'].isnull(), 'EXT_SOURCE_3'] = all_data['EXT_SOURCE_3'].median()
Considering how powerful the external source variables are in model training, I tried to exploit some non-linear combinations/interactions of them, to see if any additional information could be gleaned from these kinds of transformations.
application_train['EXT_SOURCE_1OVER2'] = application_train['EXT_SOURCE_1']/application_train['EXT_SOURCE_2']
application_train['EXT_SOURCE_1OVER3'] = application_train['EXT_SOURCE_1']/application_train['EXT_SOURCE_3']
application_train['EXT_SOURCE_2OVER1'] = application_train['EXT_SOURCE_2']/application_train['EXT_SOURCE_1']
application_train['EXT_SOURCE_2OVER3'] = application_train['EXT_SOURCE_2']/application_train['EXT_SOURCE_3']
application_train['EXT_SOURCE_3OVER2'] = application_train['EXT_SOURCE_3']/application_train['EXT_SOURCE_2']
application_train['EXT_SOURCE_3OVER1'] = application_train['EXT_SOURCE_3']/application_train['EXT_SOURCE_1']
all_data['EXT_SOURCE_1OVER2'] = all_data['EXT_SOURCE_1']/all_data['EXT_SOURCE_2']
all_data['EXT_SOURCE_1OVER3'] = all_data['EXT_SOURCE_1']/all_data['EXT_SOURCE_3']
all_data['EXT_SOURCE_2OVER1'] = all_data['EXT_SOURCE_2']/all_data['EXT_SOURCE_1']
all_data['EXT_SOURCE_2OVER3'] = all_data['EXT_SOURCE_2']/all_data['EXT_SOURCE_3']
all_data['EXT_SOURCE_3OVER2'] = all_data['EXT_SOURCE_3']/all_data['EXT_SOURCE_2']
all_data['EXT_SOURCE_3OVER1'] = all_data['EXT_SOURCE_3']/all_data['EXT_SOURCE_1']
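These ratios assume none of the external source scores are exactly zero; as a defensive step that was not part of my original run, any infinities produced by a zero denominator could be converted back to missing values so they can be imputed like other missings.
# Sketch: convert any divide-by-zero infinities in the ratio features back to NaN
ratio_cols = ['EXT_SOURCE_1OVER2', 'EXT_SOURCE_1OVER3', 'EXT_SOURCE_2OVER1',
              'EXT_SOURCE_2OVER3', 'EXT_SOURCE_3OVER2', 'EXT_SOURCE_3OVER1']
all_data[ratio_cols] = all_data[ratio_cols].replace([np.inf, -np.inf], np.nan)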
# Create a total inquiries feature
application_train['NUM_INQ_TOT'] = application_train[['AMT_REQ_CREDIT_BUREAU_HOUR',
'AMT_REQ_CREDIT_BUREAU_DAY', 'AMT_REQ_CREDIT_BUREAU_WEEK',
'AMT_REQ_CREDIT_BUREAU_MON', 'AMT_REQ_CREDIT_BUREAU_QRT',
'AMT_REQ_CREDIT_BUREAU_YEAR']].sum(axis=1)
plt.figure(figsize=(18,6))
plt.title('Distribution of NUM_INQ_TOT')
sns.stripplot(x='NUM_INQ_TOT', y='Default', data=application_train, jitter=1, palette='viridis')
plt.show()
# Create a NUM_INQ_TOT bucket feature
application_train['NUM_INQ_TOT_BUCKETS'] = pd.qcut(application_train['NUM_INQ_TOT'], 3, labels=np.arange(0,3))
label = LabelEncoder()
application_train['NUM_INQ_TOT_BUCKETS'] = label.fit_transform(application_train['NUM_INQ_TOT_BUCKETS'])
sns.countplot(application_train['NUM_INQ_TOT_BUCKETS'], palette='viridis')
plt.show()
sns.countplot(x='Default', hue='NUM_INQ_TOT_BUCKETS', data=application_train, palette='viridis')
plt.show()
all_data['NUM_INQ_TOT'] = all_data[['AMT_REQ_CREDIT_BUREAU_HOUR',
'AMT_REQ_CREDIT_BUREAU_DAY', 'AMT_REQ_CREDIT_BUREAU_WEEK',
'AMT_REQ_CREDIT_BUREAU_MON', 'AMT_REQ_CREDIT_BUREAU_QRT',
'AMT_REQ_CREDIT_BUREAU_YEAR']].sum(axis=1)
all_data['NUM_INQ_TOT_BUCKETS'] = pd.qcut(all_data['NUM_INQ_TOT'], 3, labels=np.arange(0,3))
label = LabelEncoder()
all_data['NUM_INQ_TOT_BUCKETS'] = label.fit_transform(all_data['NUM_INQ_TOT_BUCKETS'])
sns.countplot(all_data['NUM_INQ_TOT_BUCKETS'], palette='viridis')
plt.show()
application_train['INC_CRED_RATIO'] = application_train['AMT_CREDIT']/application_train['AMT_INCOME_TOTAL']
These ratio variables, which perform quite well in terms of variable importance when training models, are motivated more by intuition about the relationships they capture than by simple curiosity about the value of transformations, as with the non-linear external source features above.
I would contend that these ratios provide information above and beyond what the underlying variables provide on their own. 1) The ratio of the size of the loan to the borrower's income should say something about the likelihood of default: a large loan isn't necessarily indicative of a higher default rate if the borrower's income is far greater than the loan size, so credit relative to income likely provides valuable information for the model. 2) A short employment history is likely more informative about default when considered relative to the borrower's age; an older borrower with a poor/shallow employment history is probably less likely to repay, so the ratio of days employed to the age of the borrower is a sensible feature. 3) The size of the annuity on the loan (i.e. the payments) relative to both the size of the loan itself and the individual's income is likely highly indicative of how likely an individual is to repay that loan; large loans with small annuities are likely paid down much more often than loans that require extremely high payments.
all_data['INC_CRED_RATIO'] = all_data['AMT_CREDIT']/all_data['AMT_INCOME_TOTAL']
all_data['EMP_AGE_RATIO'] = all_data['DAYS_EMPLOYED']/all_data['DAYS_BIRTH']
all_data['INC_ANN_RATIO'] = all_data['AMT_ANNUITY']/all_data['AMT_INCOME_TOTAL']
all_data['CRED_ANN_RATIO'] = all_data['AMT_ANNUITY']/all_data['AMT_CREDIT']
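As a quick visual check, added here in the same style as the earlier distribution plots (it was not part of the original run), the credit-to-income ratio can be plotted against default status in the training cut.
plt.figure(figsize=(18,6))
plt.title('Distribution of INC_CRED_RATIO, by Default Status')
sns.stripplot(x='INC_CRED_RATIO', y='Default', data=application_train, jitter=1, palette='viridis')
plt.show()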
With all of the pre-processing complete, the models must be trained only on the rows that actually have target values. I separate the stacked data back out into training and test sets here and proceed with them separately.
all_data.info()
train_ = all_data[:t_obs]
test_ = all_data[t_obs:]
train_.reset_index(inplace=True)
test_.reset_index(inplace=True)
train_.info()
test_.info()
Several iterations of this notebook and the models below led to this selection of features for model fitting. Given the variables I had to work with, both from the initial dataset and from feature engineering, this selection seemed to perform the best.
# Select features for training
X_full = pd.concat([train_[['TARGET',
'EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3',
'AMT_GOODS_BUCKETS', 'AMT_ANNUITY_BUCKETS', 'AMT_CREDIT_BUCKETS', 'AMT_INCOME_TOTAL_BUCKETS',
'DAYS_EMPLOYED_BUCKETS', 'DAYS_REGISTRATION_BUCKETS', 'DAYS_ID_PUBLISH_BUCKETS',
'DAYS_BIRTH',
'NUM_INQ_TOT_BUCKETS',
'INC_CRED_RATIO', 'EMP_AGE_RATIO', 'INC_ANN_RATIO', 'CRED_ANN_RATIO',
'REGION_RATING_CLIENT', 'REGION_POPULATION_RELATIVE',
'EXT_SOURCE_1OVER2', 'EXT_SOURCE_1OVER3', 'EXT_SOURCE_2OVER1',
'EXT_SOURCE_2OVER3', 'EXT_SOURCE_3OVER2', 'EXT_SOURCE_3OVER1',
]],
all_multi_encs.iloc[:t_obs,0:3]], axis=1)
Given the class imbalance in this dataset, one additional source of tuning I used in fitting my models was the downsampling procedure detailed below. With so few positive cases in the target variable, I found that my models were predicting 0 (i.e. did not default) for almost every borrower. Since there are so few 1's (i.e. defaults), a dummy predictor that assigns the negative class to every observation will still score high in terms of accuracy (about 92% based on the imbalance we found in the EDA above). In order to focus the models more acutely on the positive class, I opted to downsample the data by randomly selecting a given number of rows from the negative class to establish a user-specified ratio of negative to positive classes. It's easy to get carried away with this, since we'd still like to represent the underlying distribution of defaults in the population, so downsampling to something like 50/50 would be inappropriate. Through my iterations of fitting models on the data and submitting to the public leaderboard, I found that a 20/80 split seemed to perform better than other downsampled training sets.
Given this target balancing, the block of code below randomly selects enough observations from the negative-class portion of the data that, in the final training set, 80% of rows come from the negative class and the original positive-class data make up the remaining 20%. In terms of evaluation, I did not focus on accuracy, and instead paid closer attention to confusion matrices and AUC, since these are far more reliable for assessing how these models are actually doing.
# Specify original index numbers
X_full['orig_index'] = X_full.index
# Separate out positive and negative classes to rebalance negative class
pos_class_sample = X_full[X_full['TARGET']==1]
neg_class_sample = X_full[X_full['TARGET']==0]
pos_class_sample.reset_index(inplace=True)
n_pos = pos_class_sample.shape[0]
print(n_pos)
neg_class_sample.reset_index(inplace=True)
target_balance = 0.20
balance_factor = (1/target_balance)-1
n_select_neg = int(n_pos*balance_factor)
print(n_select_neg)
n_rows = np.arange(neg_class_sample.shape[0])
neg_idx = np.random.choice(n_rows, n_select_neg, replace=False)  # Sample without replacement so no negative-class row is duplicated
neg_class_sample = neg_class_sample.iloc[neg_idx]
sampled_stack = pd.concat([pos_class_sample, neg_class_sample], axis=0)
X_sampled = sampled_stack.sort_values('orig_index').drop('orig_index', axis=1).reset_index()
X_ = X_sampled.drop('TARGET', axis=1)
y = X_sampled['TARGET']
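A quick sanity check, not in the original notebook, confirms the downsampled training set lands near the 20/80 split targeted above.
# Share of positive cases after downsampling should be close to target_balance
print('Positive-class share after downsampling: {:.3f}'.format(y.mean()))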
X_.info()
# Scale features prior to estimation
from sklearn.preprocessing import StandardScaler
X_.drop(['index', 'level_0'], axis=1, inplace=True)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_)
X = pd.DataFrame(X_scaled)
X.columns = X_.columns
X.info()
Admittedly, I spent much, much more time in this competition/notebook on pre-processing and feature engineering than on modeling. One reason for this: my laptop had a fair amount of difficulty fitting models on this dataset. While the downsampling helped considerably, it still took my equipment some time to fit models, so I wasn't able to iterate over different tuning parameters as rapidly as I would have liked (which further necessitated the use of the LightGBM model below). To get a sense of a baseline, I used k-Nearest Neighbors, and to quickly get a sense of variable importances while working through feature selection and engineering, I used Gradient Boosting.
# Split
from sklearn.model_selection import train_test_split, cross_validate
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=101)
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier()
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)
##_ = cross_validate(knn, X_train, y_train,cv=5)['test_score']
##print("% Mean Accuracy: {}".format(np.array(_).mean()))
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
from sklearn.ensemble import GradientBoostingClassifier
gbc = GradientBoostingClassifier()
gbc.fit(X_train, y_train)
y_pred = gbc.predict(X_test)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
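Since the discussion above emphasizes AUC over raw accuracy, this small addition (not in the original run) scores the gradient boosting baseline on the holdout split with ROC AUC.
from sklearn.metrics import roc_auc_score
# Use the predicted probability of the positive class to compute ROC AUC on the holdout split
gbc_probs = gbc.predict_proba(X_test)[:, 1]
print('GBC holdout ROC AUC: {:.4f}'.format(roc_auc_score(y_test, gbc_probs)))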
for row in sorted(list(zip(X_train.columns.tolist(), gbc.feature_importances_.tolist())), key=lambda x:x[1], reverse=True):
    print(row)
# Select features for training
X_totest = pd.concat([test_[['EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3',
'AMT_GOODS_BUCKETS', 'AMT_ANNUITY_BUCKETS', 'AMT_CREDIT_BUCKETS', 'AMT_INCOME_TOTAL_BUCKETS',
'DAYS_EMPLOYED_BUCKETS', 'DAYS_REGISTRATION_BUCKETS', 'DAYS_ID_PUBLISH_BUCKETS',
'DAYS_BIRTH',
'NUM_INQ_TOT_BUCKETS',
'INC_CRED_RATIO', 'EMP_AGE_RATIO', 'INC_ANN_RATIO', 'CRED_ANN_RATIO',
'REGION_RATING_CLIENT', 'REGION_POPULATION_RELATIVE',
'EXT_SOURCE_1OVER2', 'EXT_SOURCE_1OVER3', 'EXT_SOURCE_2OVER1',
'EXT_SOURCE_2OVER3', 'EXT_SOURCE_3OVER2', 'EXT_SOURCE_3OVER1',
]], all_multi_encs.iloc[t_obs:,0:3].reset_index().drop('index',axis=1)], axis=1)
# Scale features prior to estimation
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_totest)
X_test = pd.DataFrame(X_scaled)
X_test.columns = X_totest.columns
Getting LightGBM up and running for this competition was a breakthrough, and I credit the fantastic individual who put together this kernel with helping me do so.
import lightgbm as lgb
# Create more usefully named variants of existing feature Dataframes
features = X
labels = y
test_features = X_test
# Convert to NumPy arrays - store feature names
feature_names = features.columns.tolist()
features = np.array(features)
test_features = np.array(test_features)
from sklearn.model_selection import KFold
# Create the kfold object
k_fold = KFold(n_splits = 5, shuffle = True, random_state = 101)
# Empty array for feature importances
feature_importance_values = np.zeros(len(feature_names))
# Empty array for test predictions
test_predictions = np.zeros(test_features.shape[0])
# Empty array for out of fold validation predictions
out_of_fold = np.zeros(features.shape[0])
# Lists for recording validation and training scores
valid_scores = []
train_scores = []
# Iterate through each fold
for train_indices, valid_indices in k_fold.split(features):
    # Training data for the fold
    train_features, train_labels = features[train_indices], labels[train_indices]
    # Validation data for the fold
    valid_features, valid_labels = features[valid_indices], labels[valid_indices]
    # Create the bst
    bst = lgb.LGBMClassifier(n_estimators=10000, objective = 'binary',
                             class_weight = 'balanced', learning_rate = 0.05,
                             reg_alpha = 0.1, reg_lambda = 0.1,
                             subsample = 0.8, random_state = 101)
    # Train the bst
    bst.fit(train_features, train_labels, eval_metric = 'auc',
            eval_set = [(valid_features, valid_labels), (train_features, train_labels)],
            eval_names = ['valid', 'train'],
            early_stopping_rounds = 100, verbose = 200)
    # Record the best iteration
    best_iteration = bst.best_iteration_
    # Record the feature importances
    feature_importance_values += bst.feature_importances_ / k_fold.n_splits
    # Make predictions
    test_predictions += bst.predict_proba(test_features, num_iteration = best_iteration)[:, 1] / k_fold.n_splits
    # Record the out of fold predictions
    out_of_fold[valid_indices] = bst.predict_proba(valid_features, num_iteration = best_iteration)[:, 1]
    # Record the best score
    valid_score = bst.best_score_['valid']['auc']
    train_score = bst.best_score_['train']['auc']
    valid_scores.append(valid_score)
    train_scores.append(train_score)
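To summarize cross-validated performance before generating the submission, this small addition (not in the original) reports the average per-fold AUCs and the AUC of the stitched-together out-of-fold predictions.
from sklearn.metrics import roc_auc_score
# Average per-fold scores plus the overall out-of-fold AUC
print('Mean train AUC: {:.4f}'.format(np.mean(train_scores)))
print('Mean valid AUC: {:.4f}'.format(np.mean(valid_scores)))
print('Out-of-fold AUC: {:.4f}'.format(roc_auc_score(labels, out_of_fold)))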
final_predictions = pd.DataFrame({'SK_ID_CURR': test['SK_ID_CURR'], 'TARGET': test_predictions})
final_predictions.to_csv('default_predictions.csv', index=False)