I believe the "Titanic" dataset is usually the first one data science aspirants tackle when putting their skills into practice. I'm no exception, and this notebook represents one of my first end-to-end data science projects, from initial EDA, to feature engineering, to model training and evaluation.
# Import data manipulation and visualization libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os
%matplotlib inline
sns.set_style('darkgrid')
# Set directory and import application training data
os.chdir(r'C:\Users\Jared\Documents\datasci\projects\kaggle\titanic')
titanic = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
These datasets (along with a few others in this projects tab) come from Kaggle, which I highly recommend as a starting place for honing data science chops. I begin by stacking the training and test sets on top of one another. I acknowledge there are some potential issues with this, first and foremost being data leakage: using test data alongside training data for pre-processing potentially injects information from the test set that we wouldn't normally be able to observe. That said, since Kaggle provides something close to the full population here, steps like mean/median imputation of missing values benefit from using the full sample. Because I keep track of where the training data ends, I can separate the test data back out when I reach the actual modelling steps further on.
# Stack train and test data
t_obs = titanic.shape[0]
t_obs
all_data = pd.concat([titanic, test], axis=0)
# Some initial data inspection
titanic.head()
titanic.info()
titanic.describe()
These initial inspections shed some light on why the Titanic dataset is such a good one for a first pass at data science. The dataset is small (just 12 variables and fewer than 1,000 training observations), which means exploratory data analysis is quick and training models takes very little time, so it's easy to get results and iterate as you learn. I'll note here, given the concatenation mentioned above, that exploratory analysis (anything viewed relative to survival) is done with respect to the training data only, while cleaning and pre-processing is done with respect to the full stacked dataset.
# Visualization for missing data
sns.heatmap(titanic.isnull(),yticklabels=False,cbar=False,cmap='viridis')
plt.show()
This is a quick, informative way to visualize the incidence of missing observations in our data. Note that, as clean as it looks here, this plot becomes decidedly less informative once the number of features gets too high. For now, though, we can quickly see that the 'Cabin' feature is going to be essentially useless, and that we'll have to do a bit of work to clean up the 'Age' feature (and, to a lesser extent, the 'Embarked' feature).
# Visualizing the target variable
sns.countplot(x='Survived', data=titanic, palette='viridis')
plt.show()
A quick countplot of the target shows that we're working with relatively balanced classes, which means we don't have to address the issues that arise with highly imbalanced classes (e.g. down-sampling, less traditional evaluation metrics, etc.).
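For a quick numeric check on that balance, the class proportions can be printed directly (a one-line sketch on the same training frame):
# Share of passengers in each target class
titanic['Survived'].value_counts(normalize=True)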
# Survival by class
sns.countplot(x='Survived', hue='Pclass', data=titanic, palette='viridis')
plt.show()
When evaluating our feature variables, we're looking for those that best separate the target classes. Naturally, if we had some feature in our dataset that could perfectly predict the target, we'd essentially be done, so the more separability we can tease out here the better. Looking at the class variable, the odds of survival are not good for those in 3rd class, while they appear somewhat higher for those in 1st class.
sns.countplot(x='Survived',hue='Sex',data=titanic, palette='viridis')
plt.show()
Here again, there's a fair amount of separability along the 'Sex' variable - females clearly have a much higher chance of surviving than males. This supports narrative accounts of the Titanic disaster that women and children were given priority for lifeboats.
sns.distplot(titanic['Age'].dropna(), bins=30, color='darkblue', kde=False, hist_kws=dict(edgecolor='black', linewidth=0.5))
plt.show()
# Possibly an outlier at ~500 - may be worth removing
sns.distplot(titanic['Fare'].dropna(), bins=30, color='darkblue', kde=False, hist_kws=dict(edgecolor='black', linewidth=0.5))
plt.show()
# Clearly an outlier at 8 - may be worth removing
sns.distplot(titanic['SibSp'], bins=30, color='darkblue', kde=False, hist_kws=dict(edgecolor='black', linewidth=0.5))
plt.show()
I'll attend to the outliers mentioned in the previous plots further along in the notebook - some kernels choose to remove these, but I'm inclined to believe they provide additional predictive power.
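For reference, the extreme fares flagged above can be inspected directly before deciding what to do with them (a small sketch on the training frame):
# Look at the rows with the largest fares
titanic.nlargest(3, 'Fare')[['Name', 'Pclass', 'Fare', 'Survived']]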
Moving forward, I conduct the rest of my EDA alongside feature engineering and data cleaning. Visualizing each engineering choice in the context of how the data look provides firmer justification for it.
Broadly speaking, my feature engineering and data cleaning address a few key issues in this dataset, some of which were gleaned from trial and error while iterating on this analysis several times.
# Begin by encoding the 'Pclass' feature as a set of binaries
class_dummies = pd.get_dummies(all_data['Pclass'])
class_dummies.columns = ['First', 'Second', 'Third']
sns.boxplot(x='Pclass', y='Age', data=all_data ,palette='viridis')
plt.show()
In order to use the full sample, the missing values we noted in 'Age' during the EDA above have to be dealt with. One could simply impute the overall mean or median age, but the 'Pclass' variable lets us do this more carefully. In the boxplot above, note that the typical age differs by class - older passengers are more common in 1st class, while the average age in 3rd class is relatively lower. As such, I impute the median age, by class, for the missing 'Age' values, especially since the 'Pclass' variable is fully populated.
# Replace NaN age values with the median age within each class
age_by_Pclass = all_data.groupby('Pclass')['Age'].median()
all_data['Age'] = np.where((all_data['Age'].isnull()) & (all_data['Pclass'] == 1), age_by_Pclass.iloc[0],
np.where((all_data['Age'].isnull()) & (all_data['Pclass'] == 2), age_by_Pclass.iloc[1],
np.where((all_data['Age'].isnull()) & (all_data['Pclass'] == 3), age_by_Pclass.iloc[2],
all_data['Age'])))
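As an aside, the nested np.where above can be collapsed into a single groupby transform; this is just a sketch of an equivalent alternative on the same all_data frame, not the code the rest of the notebook relies on.
# Equivalent, more compact imputation: fill missing ages with the class-wise median
all_data['Age'] = all_data['Age'].fillna(all_data.groupby('Pclass')['Age'].transform('median'))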
from sklearn.preprocessing import LabelEncoder
# Create age buckets
all_data['age_buckets'] = pd.qcut(all_data['Age'], 6)
label = LabelEncoder()
all_data['age_buckets'] = label.fit_transform(all_data['age_buckets'])
sns.countplot(x=all_data['age_buckets'], palette='viridis')
plt.show()
Working through this analysis, I found that most of the continuous variables (notably age, fare, and the family counts) fit the data better when binned. After imputing the missing values, I cut the age variable into six quantile buckets.
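To sanity-check where those six quantile cuts land, the bin edges can be inspected before label-encoding (a small sketch; 'Age' has already been imputed at this point):
# Inspect the six quantile bins and how many passengers fall in each
pd.qcut(all_data['Age'], 6).value_counts().sort_index()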
sns.countplot(x='Survived', hue='age_buckets', data=all_data, palette='viridis')
plt.show()
Here again, we expect additional predictive power from the binned 'Age' feature - in particular, those in the second age bucket have very low odds of survival.
# Create dummy variables for 'Sex'
binary_sex = pd.get_dummies(all_data['Sex'])
# Create dummy variables for 'Embarked' - impute 'S' since this is the most common occurrence
all_data.loc[all_data['Embarked'].isnull(), 'Embarked'] = 'S'
binary_embark = pd.get_dummies(all_data['Embarked'])
Recall that there were two missing values in the 'Embarked' variable. Since the incidence of missing values is so low, more exhaustive imputation is likely unnecessary; I simply populate the missing values with the most common 'Embarked' class.
len(all_data.Name.unique())
all_data.Name.unique()[1:5]
The 'Name' variable requires a bit more engineering to be a useful predictive feature. At present, it acts as something of an individual fixed effect (there's my econometrics training creeping in) that uniquely identifies each passenger by name and title. However, there is a common structure to the way these names are entered in the dataset, so I extract the title portion of each name for use in my model.
# Extract a 'Title' variable from the names in the stacked train/test data
all_titles = all_data.Name.str.extract(r'(?P<Title>(?<=, ).*?(?=\.))', expand=True)
# Visualize which titles occur and how frequently
plt.figure(figsize=(18,4))
sns.countplot(x=all_titles['Title'].tolist(), palette='viridis')
plt.show()
By extracting these titles, we whittle the 1,307 unique names down to 17 unique titles, which can be further honed by grouping synonymous titles into fewer buckets.
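A quick check on that count (a one-line sketch against the extracted titles):
# Number of distinct titles after extraction
all_titles['Title'].nunique()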
# Map titles to create groups, lumping the obscure titles into a single bucket
title_map = {'Mr':'Mr', 'Mrs':'Mrs', 'Mme':'Mrs', 'Miss':'Ms', 'Ms':'Ms', 'Mlle':'Ms',
'Master':'Spec', 'Rev':'Spec', 'Dr':'Spec',
'Don':'Rare', 'Major':'Rare', 'Lady':'Rare', 'Sir':'Rare', 'Col':'Rare', 'Capt':'Rare', 'the Countess':'Rare',
'Jonkheer':'Rare', 'Dona':'Rare'}
all_data['title_buckets'] = all_titles['Title'].map(title_map)
sns.countplot(x=all_data['title_buckets'].tolist(), palette='viridis')
plt.show()
I've opted for five unique title buckets in my analysis. 'Mr', 'Mrs', and 'Ms' are fairly standard choices. The remaining titles I break out into two buckets - 'Rare' and 'Spec', or special. In the special category, I've included doctors, reverends, and masters. The remaining titles, which have far fewer observations, are lumped into the 'Rare' bucket. The distribution of these categories relative to the target is displayed below, and we observe a pattern similar to the 'Sex' feature.
title_dummies = pd.get_dummies(all_data['title_buckets'])
# Visualize these relative to survival
sns.countplot(x='Survived', hue='title_buckets', data=all_data, palette='viridis')
plt.show()
I treat the 'Fare' feature in much the same way as 'Age': I fill the missing values using median fares by class (it makes sense that the average fare is much higher for the higher classes) and bucket the result accordingly.
# Visualize class versus fare
plt.figure(figsize=(6,6))
sns.boxplot(x='Pclass', y='Fare', data=all_data ,palette='viridis')
plt.show()
fare_by_Pclass = all_data.groupby('Pclass')['Fare'].median()
all_data['Fare'] = np.where((all_data['Fare'].isnull()) & (all_data['Pclass'] == 1), fare_by_Pclass.iloc[0],
np.where((all_data['Fare'].isnull()) & (all_data['Pclass'] == 2), fare_by_Pclass.iloc[1],
np.where((all_data['Fare'].isnull()) & (all_data['Pclass'] == 3), fare_by_Pclass.iloc[2],
all_data['Fare'])))
# Create fare buckets
all_data['fare_buckets'] = pd.qcut(all_data['Fare'], 5)
label = LabelEncoder()
all_data['fare_buckets'] = label.fit_transform(all_data['fare_buckets'])
sns.countplot(x=all_data['fare_buckets'], palette='viridis')
plt.show()
# Having paid a high fare clearly has value
sns.countplot(x='Survived', hue='fare_buckets', data=all_data, palette='viridis')
plt.show()
Using these five quantile buckets, we can plainly see that those in the highest fare bucket fared (no pun intended) better than their lower-fare counterparts, which corresponds nicely with what we found in the class variable.
You'll notice a trend here: converting the continuous variables into discretized variants. I chose to combine 'SibSp' (the number of siblings/spouses on board) and 'Parch' (the number of parents/children on board) into a single family-size variable reflecting the entire size of a family on board, adding one for the individual the counts are relative to.
# Parents and children plus siblings/spouses - large families do not fare very well
all_data['Size'] = all_data['SibSp'] + all_data['Parch'] + 1
all_data['Alone'] = np.where(all_data['Size'] == 1, 1, 0)
sns.countplot(x='Survived', hue='Size', data=all_data, palette='viridis')
plt.show()
# Create Size buckets
all_data['fam_buckets'] = np.where(all_data['Size'] == 1, 1,
##np.where(all_data['Size'] == 2, 2,
np.where((all_data['Size'] >= 2) & (all_data['Size'] <= 4),2,
np.where((all_data['Size'] >= 5) & (all_data['Size'] < 8),3,4)))
sns.countplot(x=all_data['fam_buckets'], palette='viridis')
plt.show()
# Large families are separable - big families don't make it, or very rarely make it, and your chances are better in the 2-4 range
sns.countplot(x='Survived', hue='fam_buckets', data=all_data, palette='viridis')
plt.show()
As alluded to above when I chose not to remove some of the outliers, there's a fairly clear distinction in survival rate by family size. Simply put, larger families survive at a much lower rate, and the very largest families in our training data simply do not survive.
Pursuing this vein of feature engineering was inspired by this kernel. While I did not go to the same lengths to develop the feature as the linked notebook, I did examine the extent to which these groups could be identified in a simpler way, and developed a group-size feature that could be bucketed much like the other continuous features established herein.
# Extract last names
all_data['fam_name'] = all_data.Name.str.extract('(?P<Title>.*?(?=,))', expand=False)
print('Number of unique surnames in sample: {}'.format(len(all_data['fam_name'].unique())))
print('Number of unique tickets in sample: {}'.format(len(all_data['Ticket'].unique())))
# It's fair to assume most of the variation in last names is captured above in the fam_buckets variable, but
# tickets might capture those that booked together who are not necessarily related
all_data['Ticket'].value_counts()[0:10]
# Case in point: this group below is considered 'Alone' but clearly booked together as a group
all_data[all_data['Ticket'] == '1601']
tickets = pd.DataFrame(all_data['Ticket'].value_counts())
tickets.reset_index(inplace=True)
tickets.columns = ['Ticket', 'Group_size']
ticket_map = dict(zip(tickets['Ticket'].tolist(), tickets['Group_size'].tolist()))
all_data['Group_size'] = all_data['Ticket'].map(ticket_map)
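For reference, the same group-size feature can be built in one step with a groupby transform rather than the value_counts/map round trip above (a sketch on the same all_data frame):
# Equivalent: count how many passengers share each ticket number
all_data['Group_size'] = all_data.groupby('Ticket')['Ticket'].transform('count')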
sns.countplot(x='Survived', hue='Group_size', data=all_data, palette='viridis')
plt.show()
# Create Group_size buckets
all_data['group_buckets'] = np.where(all_data['Group_size'] == 1, 1,
##np.where(all_data['Group_size'] == 2, 2,
np.where((all_data['Group_size'] >= 2) & (all_data['Group_size'] <= 4),2,
np.where((all_data['Group_size'] >= 5) & (all_data['Group_size'] <= 8),3,4)))
sns.countplot(x='Survived', hue='group_buckets', data=all_data, palette='viridis')
plt.show()
I still contend that the 'Cabin' feature is largely useless in this analysis, but a number of kernels have attempted to glean predictive power from it. The simplest way to do so is to divide the feature into those assigned a cabin and those not.
# Recode missings and examine distribution
all_data.loc[all_data['Cabin'].isnull(), 'Cabin'] = 'X'
all_data['Section'] = all_data.Cabin.str.extract(r'(?P<Section>[A-Z])', expand=True)['Section']
plt.figure(figsize=(18,4))
sns.countplot(x='Survived', hue='Section', data=all_data, palette='viridis')
plt.show()
plt.figure(figsize=(18,4))
sns.countplot(x='Pclass', hue='Section', data=all_data, palette='viridis')
plt.show()
section_binaries = pd.get_dummies(all_data['Section'])
all_data['assigned'] = np.where(all_data['Cabin'] == 'X', 1, 0)  # note: 1 here flags passengers with no recorded cabin
sns.countplot(x='Survived', hue='assigned', data=all_data, palette='viridis')
plt.show()
Doggedly pursuing higher public scores on the Kaggle leaderboard led to many iterations of the features above, and drove much of the cleaning and engineering detailed so far. In the end, I achieved the highest cross-validation scores, and the highest public scores on the Kaggle website, with a subset of the features described and engineered above. To some extent, this subset was also determined by iterating on feature importance plots.
I used the bucketed 'Fare', 'Family', and 'Age' features, the binary 'Sex' variables, the constructed 'Title' feature, and dummy variables built on the 'Class' feature. The 'Embarked' and 'Section' binaries did not improve predictive accuracy. Additionally, despite all the work put into the 'group_buckets' variable based on groups of individuals purchasing tickets together, I did not find that it reliably improved predictive accuracy, with the exception of the LightGBM model, which is why I've left it in here.
# Select features for training
feats = pd.concat([all_data[['fare_buckets', 'fam_buckets', 'age_buckets', 'group_buckets']],
binary_sex,
title_dummies,
## section_binaries,
## binary_embark,
class_dummies
], axis=1)
feats.info()
# Since some feature engineering and variable imputation had to be done with respect to the full population, I split the data
X_ = feats[:t_obs]
test_ = feats[t_obs:]
y = titanic['Survived']
# Scale features prior to estimation
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_)
X = pd.DataFrame(X_scaled)
X.columns = X_.columns
Since I intend to use a wide range of models to assess their predictive power with this selection of features, it's good practice to scale all the variables in the dataset. I rely on sklearn's StandardScaler to perform this scaling, fitting it on the training rows only and reusing it for the test rows later on.
X.info()
plt.figure(figsize=(10,6))
sns.heatmap(X.corr(), cmap='viridis', annot=True)
plt.show()
It's well established that there is no one perfect algorithm for every machine learning problem. With the sklearn library, it's trivial to test the predictive accuracy of a number of models. In this analysis, I assess the performance of the following 8 classifiers: k-Nearest Neighbors, Logistic Regression, Decision Tree, Support Vector Machine, Random Forest, AdaBoost, ExtraTrees, and a Gradient-Boosting Classifier.
kNN and Logistic Regression are fairly standard baseline models that show how well we can do absent more complex methods, while the latter four are a mix of ensemble models regularly referenced in other kernels assessing this dataset.
# Train-test split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=101)
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.model_selection import cross_validate
model_performance = {}
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier()
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)
_ = cross_validate(knn, X_train, y_train,cv=5)['test_score']
model_performance['k-Nearest Neighbors'] = np.array(_).mean()
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()
lr.fit(X_train, y_train)
y_pred = lr.predict(X_test)
_ = cross_validate(lr, X_train, y_train,cv=5)['test_score']
model_performance['Logistic Regression'] = np.array(_).mean()
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
from sklearn.tree import DecisionTreeClassifier
dtc = DecisionTreeClassifier()
dtc.fit(X_train, y_train)
y_pred = dtc.predict(X_test)
_ = cross_validate(dtc, X_train, y_train,cv=5)['test_score']
model_performance['Decision Tree'] = np.array(_).mean()
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
# Support Vector Machine
from sklearn.svm import SVC
svc = SVC()
svc.fit(X_train, y_train)
y_pred = svc.predict(X_test)
_ = cross_validate(svc, X_train, y_train,cv=5)['test_score']
model_performance['Support Vector Machine'] = np.array(_).mean()
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
# Random Forest
from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier()
rfc.fit(X_train, y_train)
y_pred = rfc.predict(X_test)
_ = cross_validate(rfc, X_train, y_train,cv=5)['test_score']
model_performance['Random Forest'] = np.array(_).mean()
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
# Adaboost
from sklearn.ensemble import AdaBoostClassifier
ada = AdaBoostClassifier()
ada.fit(X_train, y_train)
y_pred = ada.predict(X_test)
_ = cross_validate(ada, X_train, y_train,cv=5)['test_score']
model_performance['Adaboost'] = np.array(_).mean()
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
# ExtraTrees
from sklearn.ensemble import ExtraTreesClassifier
xtc = ExtraTreesClassifier()
xtc.fit(X_train, y_train)
y_pred = xtc.predict(X_test)
_ = cross_validate(xtc, X_train, y_train,cv=5)['test_score']
model_performance['ExtraTrees'] = np.array(_).mean()
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
# Gradient-Boosted Decision Tree
from sklearn.ensemble import GradientBoostingClassifier
gbc = GradientBoostingClassifier()
gbc.fit(X_train, y_train)
y_pred = gbc.predict(X_test)
_ = cross_validate(gbc, X_train, y_train,cv=5)['test_score']
model_performance['Gradient-Boosting Classifier'] = np.array(_).mean()
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
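The eight cells above all follow the same fit / cross-validate / report pattern; purely for reference, the same thing could be condensed into a single loop (a sketch reusing the objects already imported and defined above):
# Condensed equivalent of the eight cells above: fit, cross-validate, and report each model
models = {'k-Nearest Neighbors': KNeighborsClassifier(), 'Logistic Regression': LogisticRegression(),
          'Decision Tree': DecisionTreeClassifier(), 'Support Vector Machine': SVC(),
          'Random Forest': RandomForestClassifier(), 'Adaboost': AdaBoostClassifier(),
          'ExtraTrees': ExtraTreesClassifier(), 'Gradient-Boosting Classifier': GradientBoostingClassifier()}
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    model_performance[name] = cross_validate(model, X_train, y_train, cv=5)['test_score'].mean()
    print(name)
    print(confusion_matrix(y_test, y_pred))
    print(classification_report(y_test, y_pred))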
_ = sorted([x for x in model_performance.items()], key=lambda x: x[1])
fig, ax = plt.subplots(figsize=(10,6))
ax.barh(np.arange(len(_)), [x[1] for x in _])
ax.set_yticks(np.arange(len(_)))
ax.set_yticklabels([x[0] for x in _])
ax.set_xlabel('Mean Accuracy')
ax.set_title('Mean CV Accuracy, 5-fold CV')
plt.show()
_ = sorted(list(zip(X_train.columns.tolist(), gbc.feature_importances_.tolist())), key=lambda x:x[1])
fig, ax = plt.subplots(figsize=(10,6))
ax.barh(np.arange(len(_)), [x[1] for x in _])
ax.set_yticks(np.arange(len(_)))
ax.set_yticklabels([x[0] for x in _])
ax.set_xlabel('Contribution')
ax.set_title('Variable Importance from GBC Estimation')
plt.show()
from sklearn.model_selection import GridSearchCV
# Tune the gbm
grid_values = {'n_estimators': [10, 100, 1000],
'learning_rate': [0.001, 0.01, 0.1],
'max_depth': [1,2,3],
'max_features': ['sqrt', 'log2']}
gbc_grid = GridSearchCV(gbc, param_grid = grid_values)
gbc_grid.fit(X, y)
print(gbc_grid.best_params_)
svc = SVC()
# Tune the SVM
grid_values = {'C':[1,10,20,30,40,50,60,70,80,90,100],
'gamma':[0.01],
'kernel': ['rbf', 'poly']}
svc_grid = GridSearchCV(svc, param_grid = grid_values)
svc_grid.fit(X,y)
print(svc_grid.best_params_)
rfc = RandomForestClassifier()
# Tune the RandomForest
grid_values = {'max_depth':[3,4,5,6,7,8,9,10,11,12],
'max_features':['log2', 'sqrt'],
'n_estimators':[3,4,5,6,7,8,9,10,11,12],
'min_samples_split':[2,3,4]
}
rfc_grid = GridSearchCV(rfc, param_grid = grid_values)
rfc_grid.fit(X_train, y_train)
print(rfc_grid.best_params_)
While the simple Logistic Regression performs best in terms of mean CV accuracy (per the plot above), I'm inclined to use the Gradient Boosting Machine (which performs second-best by the same metric) because its hyperparameters can be more finely tuned. As such, using the hyperparameters established in the grid search above, I conduct a final run of the Gradient Boosting Machine to generate predictions for submission.
# Scale features prior to estimation, reusing the scaler fit on the training rows
test_scaled = scaler.transform(test_)
test_feats = pd.DataFrame(test_scaled, columns=test_.columns)
# Create more usefully named variants of existing feature Dataframes
features = X
labels = y
test_features = test_feats
# Fit final model
clf = GradientBoostingClassifier(learning_rate=0.01, max_depth=3, max_features='log2', n_estimators=1000)
clf.fit(X,y)
y_pred = clf.predict(test_feats)
# Prepare data for evaluation
final_predictions = pd.DataFrame.from_dict(dict(zip(test['PassengerId'], y_pred)), orient='index')
final_predictions.reset_index(inplace=True)
final_predictions.columns = ['PassengerId', 'Survived']
final_predictions.to_csv('survivor_predictions.csv',index=False)
When I began working on this notebook, I was fully aware of the LightGBM model and the degree to which it (along with XGBoost) was being employed in winning kernels, but found working with it a bit more difficult. After working for some time on the Home Credit Default Risk competition, I stumbled across an immensely helpful kernel for getting it up and running. While I spent less time on hyperparameter tuning here, I found that once I had things working it immediately improved the scores I was generating, both on my cross-validation sets and for my submissions to the competition. This landed my kernel within the top 15%.
# Convert to NumPy arrays - store feature names
feature_names = features.columns.tolist()
features = np.array(features)
test_features = np.array(test_features)
from sklearn.model_selection import KFold
# Create the kfold object
k_fold = KFold(n_splits = 5, shuffle = True, random_state = 101)
# Empty array for feature importances
feature_importance_values = np.zeros(len(feature_names))
# Empty array for test predictions
test_predictions = np.zeros(test_features.shape[0])
# Empty array for out of fold validation predictions
out_of_fold = np.zeros(features.shape[0])
# Lists for recording validation and training scores
valid_scores = []
train_scores = []
import lightgbm as lgb
# Iterate through each fold
for train_indices, valid_indices in k_fold.split(features):
    # Training data for the fold
    train_features, train_labels = features[train_indices], labels[train_indices]
    # Validation data for the fold
    valid_features, valid_labels = features[valid_indices], labels[valid_indices]
    # Create the bst
    bst = lgb.LGBMClassifier(n_estimators=10000, objective = 'binary',
                             class_weight = 'balanced', learning_rate = 0.05,
                             reg_alpha = 0.1, reg_lambda = 0.1,
                             subsample = 0.8, random_state = 101)
    # Train the bst
    bst.fit(train_features, train_labels, eval_metric = 'auc',
            eval_set = [(valid_features, valid_labels), (train_features, train_labels)],
            eval_names = ['valid', 'train'],
            early_stopping_rounds = 100, verbose = 200)
    # Record the best iteration
    best_iteration = bst.best_iteration_
    # Record the feature importances
    feature_importance_values += bst.feature_importances_ / k_fold.n_splits
    # Make predictions
    test_predictions += bst.predict_proba(test_features, num_iteration = best_iteration)[:, 1] / k_fold.n_splits
    # Record the out of fold predictions
    out_of_fold[valid_indices] = bst.predict_proba(valid_features, num_iteration = best_iteration)[:, 1]
    # Record the best scores
    valid_score = bst.best_score_['valid']['auc']
    train_score = bst.best_score_['train']['auc']
    valid_scores.append(valid_score)
    train_scores.append(train_score)
# Prepare data for evaluation
final_predictions = pd.DataFrame({'PassengerId': test['PassengerId'], 'Survived': test_predictions})
final_predictions['Survived'] = np.where(final_predictions['Survived'] > 0.7, 1, 0)
final_predictions.to_csv('survivor_predictions.csv',index=False)
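The 0.7 cut-off on the averaged fold probabilities was a judgement call; a quick look at how those probabilities are distributed (a small sketch using the test_predictions array built above) helps sanity-check where that threshold falls:
# Distribution of the averaged test-set survival probabilities across the five folds
plt.hist(test_predictions, bins=20)
plt.xlabel('Predicted survival probability')
plt.show()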
I find it hard to believe any autodidact can really get started in this field in a vacuum. I looked through a LOT of kernels, including the one linked above, and owe a lot of credit to Jose Portilla's bootcamp on Udemy for shaping my thinking about the structure and first steps needed to get this off the ground. While this isn't necessarily an endorsement, I found the course very helpful for getting the gist of the general workflow of these data science projects.