Predicting IBM Churn with LDA and XGBoost

Last updated on Sep 18, 2021 11 min read

HR Churn Analysis

This is me fiddling around with an employee attrition dataset on my last few days at PNC.

Description

Uncover the factors that lead to employee attrition and explore important questions such as ‘show me a breakdown of distance from home by job role and attrition’ or ‘compare average monthly income by education and attrition’. This is a fictional data set created by IBM data scientists.

Education 1 ‘Below College’ 2 ‘College’ 3 ‘Bachelor’ 4 ‘Master’ 5 ‘Doctor’

EnvironmentSatisfaction 1 ‘Low’ 2 ‘Medium’ 3 ‘High’ 4 ‘Very High’

JobInvolvement 1 ‘Low’ 2 ‘Medium’ 3 ‘High’ 4 ‘Very High’

JobSatisfaction 1 ‘Low’ 2 ‘Medium’ 3 ‘High’ 4 ‘Very High’

PerformanceRating 1 ‘Low’ 2 ‘Good’ 3 ‘Excellent’ 4 ‘Outstanding’

RelationshipSatisfaction 1 ‘Low’ 2 ‘Medium’ 3 ‘High’ 4 ‘Very High’

WorkLifeBalance 1 ‘Bad’ 2 ‘Good’ 3 ‘Better’ 4 ‘Best’

Libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import plotly.express as px
import seaborn as sns
import sklearn

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = 'all'
%load_ext autoreload

Read in Data

df = pd.read_csv('ibm_attrition_file.csv').sort_values('Attrition')

df.head()
df.shape
df.dtypes

sns.displot(df, x='Attrition', shrink = .8)

	Age	Attrition	BusinessTravel	DailyRate	Department	DistanceFromHome	Education	EducationField	EmployeeCount	EmployeeNumber	...	RelationshipSatisfaction	StandardHours	StockOptionLevel	TotalWorkingYears	TrainingTimesLastYear	WorkLifeBalance	YearsAtCompany	YearsInCurrentRole	YearsSinceLastPromotion	YearsWithCurrManager
734	22	No	Travel_Rarely	217	Research & Development	8	1	Life Sciences	1	1019	...	1	80	1	4	3	2	4	3	1	1
949	39	No	Travel_Rarely	524	Research & Development	18	2	Life Sciences	1	1322	...	1	80	0	9	6	3	8	7	1	7
948	30	No	Travel_Rarely	634	Research & Development	17	4	Medical	1	1321	...	4	80	2	9	2	3	9	1	0	8
945	50	No	Travel_Rarely	1322	Research & Development	28	3	Life Sciences	1	1317	...	2	80	0	25	2	3	3	2	1	2
944	28	No	Non-Travel	1476	Research & Development	1	3	Life Sciences	1	1315	...	1	80	3	10	6	3	9	8	7	5

5 rows × 35 columns

(1470, 35)






Age                          int64
Attrition                   object
BusinessTravel              object
DailyRate                    int64
Department                  object
DistanceFromHome             int64
Education                    int64
EducationField              object
EmployeeCount                int64
EmployeeNumber               int64
EnvironmentSatisfaction      int64
Gender                      object
HourlyRate                   int64
JobInvolvement               int64
JobLevel                     int64
JobRole                     object
JobSatisfaction              int64
MaritalStatus               object
MonthlyIncome                int64
MonthlyRate                  int64
NumCompaniesWorked           int64
Over18                      object
OverTime                    object
PercentSalaryHike            int64
PerformanceRating            int64
RelationshipSatisfaction     int64
StandardHours                int64
StockOptionLevel             int64
TotalWorkingYears            int64
TrainingTimesLastYear        int64
WorkLifeBalance              int64
YearsAtCompany               int64
YearsInCurrentRole           int64
YearsSinceLastPromotion      int64
YearsWithCurrManager         int64
dtype: object






<seaborn.axisgrid.FacetGrid at 0x7ffca1cf6510>

EDA

I first look at some features that might be important in the dataset.

Attrition by Age

Attrition rates are higher among younger employees.

sns.displot(
    df.Age,
    kde = True
)

<seaborn.axisgrid.FacetGrid at 0x7ffc9e629ed0>

sns.displot(data=df, x="Age", hue="Attrition", multiple="stack")

<seaborn.axisgrid.FacetGrid at 0x7ffca2c9f410>

Job Satisfaction

Lower job satisfaction unsurprisingly yields a higher attrition rate although perhaps not as high as one might expect.

g = sns.displot(
    data = df,
    x = 'JobSatisfaction',
    hue = 'Attrition',
    multiple = 'stack',
    discrete=True,
    shrink = .8
    )

g.set_axis_labels("Job Satisfaction", "Count", labelpad=10)
g.set(title="Lower Job Satisfaction leads to Higher Attrition")
g.fig.set_size_inches(6.5, 4.5)
g.despine(trim=True)

<seaborn.axisgrid.FacetGrid at 0x7ffca3e423d0>






<seaborn.axisgrid.FacetGrid at 0x7ffca3e423d0>






<seaborn.axisgrid.FacetGrid at 0x7ffca3e423d0>

Gender and Marital Status

Single men and women are more likely to leave the company but age may be a confounder. Men are slightly more likely to leave than women.

df_new = df.copy()
df_new['Attr'] = df_new['Attrition'].apply(lambda x: 1 if x == 'Yes' else 0)
df_new['Count'] = 1
df_new_group = df_new.groupby(['Gender', 'MaritalStatus'])[['Attr', 'Count']].sum()
df_new_group['pct'] = df_new_group['Attr'] / df_new_group['Count']

df_heat = df_new_group.reset_index()[['Gender','MaritalStatus','pct']]\
    .pivot('MaritalStatus', 'Gender')

sns.heatmap(
    data=df_heat
)

<matplotlib.axes._subplots.AxesSubplot at 0x7ffca5c05c10>

Correlation Plots

df_num = df[df.columns[df.dtypes == 'int64']][['Age', 'DailyRate', 'DistanceFromHome',\
                                              'MonthlyIncome', 'PerformanceRating', 'JobSatisfaction', \
                                             'WorkLifeBalance']]


g = sns.PairGrid(df_num)
g.map_upper(sns.histplot)
g.map_lower(sns.kdeplot, fill=True)
g.map_diag(sns.histplot, kde=True)

<seaborn.axisgrid.PairGrid at 0x7ffca5c05cd0>






<seaborn.axisgrid.PairGrid at 0x7ffca5c05cd0>






<seaborn.axisgrid.PairGrid at 0x7ffca5c05cd0>

Many features are correlated although no features have a higher correlation than 0.3 or less than -0.1.

df_heat = df[df.columns[df.dtypes == 'int64']]\
    .drop(['EmployeeCount', 'EmployeeNumber', 'StandardHours'], axis=1)

corr = df_heat.corr()

# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(11, 9))

# Generate a custom diverging colormap
cmap = sns.diverging_palette(230, 20, as_cmap=True)

# Draw the heatmap with the mask and correct aspect ratio
sns.heatmap(corr,  cmap=cmap, vmax=.3, center=0,
            square=True, linewidths=.5, cbar_kws={"shrink": .5})

<matplotlib.axes._subplots.AxesSubplot at 0x7ffca7201b10>

Predictive Analytics - LDA

Linear Discriminant Analysis (LDA) is a feature reduction method for data with discrete classes. It is like PCA except that it takes advantage of information about the classification in the training data. It projects data into fewer dimensions by maximizing both the mean distance between the median data point of each class and minimizing the “spread” within each class.

# Drop columns with no information
df.columns
df_clean = df.drop(['EmployeeCount', 'EmployeeNumber', 'StandardHours', 'Over18'], axis=1)
df_clean['Attrition'] = df_clean['Attrition'].apply(lambda x: 1 if x == 'Yes' else 0)

Index(['Age', 'Attrition', 'BusinessTravel', 'DailyRate', 'Department',
       'DistanceFromHome', 'Education', 'EducationField', 'EmployeeCount',
       'EmployeeNumber', 'EnvironmentSatisfaction', 'Gender', 'HourlyRate',
       'JobInvolvement', 'JobLevel', 'JobRole', 'JobSatisfaction',
       'MaritalStatus', 'MonthlyIncome', 'MonthlyRate', 'NumCompaniesWorked',
       'Over18', 'OverTime', 'PercentSalaryHike', 'PerformanceRating',
       'RelationshipSatisfaction', 'StandardHours', 'StockOptionLevel',
       'TotalWorkingYears', 'TrainingTimesLastYear', 'WorkLifeBalance',
       'YearsAtCompany', 'YearsInCurrentRole', 'YearsSinceLastPromotion',
       'YearsWithCurrManager'],
      dtype='object')

Encode Features

df_objs = df_clean.columns[df_clean.dtypes == 'object']

df_clean[df_objs]
df_enc = pd.get_dummies(df_clean, prefix = df_objs)
df_enc['Attrition'] = df_clean['Attrition'].apply(lambda x: 'Yes' if x == 1 else 'No')
df_enc.head()

	BusinessTravel	Department	EducationField	Gender	JobRole	MaritalStatus	OverTime
734	Travel_Rarely	Research & Development	Life Sciences	Male	Laboratory Technician	Married	No
949	Travel_Rarely	Research & Development	Life Sciences	Male	Manufacturing Director	Single	No
948	Travel_Rarely	Research & Development	Medical	Female	Manager	Married	Yes
945	Travel_Rarely	Research & Development	Life Sciences	Female	Research Director	Married	Yes
944	Non-Travel	Research & Development	Life Sciences	Female	Laboratory Technician	Married	No
...	...	...	...	...	...	...	...
370	Travel_Rarely	Sales	Life Sciences	Female	Sales Representative	Single	No
1036	Travel_Frequently	Research & Development	Life Sciences	Male	Laboratory Technician	Married	Yes
1033	Travel_Frequently	Research & Development	Life Sciences	Female	Manufacturing Director	Single	No
1057	Travel_Frequently	Sales	Technical Degree	Female	Sales Executive	Single	No
0	Travel_Rarely	Sales	Life Sciences	Female	Sales Executive	Single	Yes

1470 rows × 7 columns

	Age	Attrition	DailyRate	DistanceFromHome	Education	EnvironmentSatisfaction	HourlyRate	JobInvolvement	JobLevel	JobSatisfaction	...	JobRole_Manufacturing Director	JobRole_Research Director	MaritalStatus_Married	MaritalStatus_Single	OverTime_No	OverTime_Yes
734	22	No	217	8	1	2	94	1	1	1	...	0	0	1	0	1	0
949	39	No	524	18	2	1	32	3	2	3	...	1	0	0	1	1	0
948	30	No	634	17	4	2	95	3	3	1	...	0	0	1	0	0	1
945	50	No	1322	28	3	4	43	3	4	1	...	0	1	1	0	0	1
944	28	No	1476	1	3	3	55	1	2	4	...	0	0	1	0	1	0

5 rows × 52 columns

Split and Normalize Data

import sklearn.preprocessing as sp
import sklearn.model_selection as skms
train, test = skms.train_test_split(df_enc, random_state=42)
trainY, testY = train.Attrition, test.Attrition
trainX, testX = train.drop(['Attrition'], axis=1), test.drop(['Attrition'], axis=1)

df_objs = df_clean.columns[df_clean.dtypes == 'object']

df_clean[df_objs]
df_enc = pd.get_dummies(df_clean, prefix = df_objs)
df_enc.head()

scaler = sp.StandardScaler().fit(trainX)
scale_train = scaler.transform(trainX)
scale_test = scaler.transform(testX)

	BusinessTravel	Department	EducationField	Gender	JobRole	MaritalStatus	OverTime
734	Travel_Rarely	Research & Development	Life Sciences	Male	Laboratory Technician	Married	No
949	Travel_Rarely	Research & Development	Life Sciences	Male	Manufacturing Director	Single	No
948	Travel_Rarely	Research & Development	Medical	Female	Manager	Married	Yes
945	Travel_Rarely	Research & Development	Life Sciences	Female	Research Director	Married	Yes
944	Non-Travel	Research & Development	Life Sciences	Female	Laboratory Technician	Married	No
...	...	...	...	...	...	...	...
370	Travel_Rarely	Sales	Life Sciences	Female	Sales Representative	Single	No
1036	Travel_Frequently	Research & Development	Life Sciences	Male	Laboratory Technician	Married	Yes
1033	Travel_Frequently	Research & Development	Life Sciences	Female	Manufacturing Director	Single	No
1057	Travel_Frequently	Sales	Technical Degree	Female	Sales Executive	Single	No
0	Travel_Rarely	Sales	Life Sciences	Female	Sales Executive	Single	Yes

1470 rows × 7 columns

	Age	DailyRate	DistanceFromHome	Education	EnvironmentSatisfaction	HourlyRate	JobInvolvement	JobLevel	JobSatisfaction	...	JobRole_Manufacturing Director	JobRole_Research Director	MaritalStatus_Married	MaritalStatus_Single	OverTime_No	OverTime_Yes
734	22	217	8	1	2	94	1	1	1	...	0	0	1	0	1	0
949	39	524	18	2	1	32	3	2	3	...	1	0	0	1	1	0
948	30	634	17	4	2	95	3	3	1	...	0	0	1	0	0	1
945	50	1322	28	3	4	43	3	4	1	...	0	1	1	0	0	1
944	28	1476	1	3	3	55	1	2	4	...	0	0	1	0	1	0

5 rows × 52 columns

Cross Validation Training

import sklearn.discriminant_analysis as sda
import sklearn.metrics as sm

from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold

model = sda.LinearDiscriminantAnalysis()

# model.fit(scale_train, trainY)
# define model evaluation method
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=42)
# evaluate model
scores = cross_val_score(model, scale_train, trainY, scoring='accuracy', cv=cv, n_jobs=2)
# summarize result
scores

print('Mean Accuracy: %.3f (%.3f)' % (np.mean(scores), np.std(scores)))

model.fit(scale_train, trainY)

test_preds = model.predict(scale_test)

array([0.84684685, 0.87387387, 0.86363636, 0.88181818, 0.85454545,
       0.89090909, 0.89090909, 0.85454545, 0.90909091, 0.88181818,
       0.86486486, 0.87387387, 0.88181818, 0.88181818, 0.85454545,
       0.81818182, 0.89090909, 0.88181818, 0.86363636, 0.9       ,
       0.91891892, 0.87387387, 0.89090909, 0.85454545, 0.82727273,
       0.88181818, 0.84545455, 0.88181818, 0.89090909, 0.87272727])



Mean Accuracy: 0.873 (0.022)





LinearDiscriminantAnalysis(n_components=None, priors=None, shrinkage=None,
                           solver='svd', store_covariance=False, tol=0.0001)

Quick evaluation on test set

model = sda.LinearDiscriminantAnalysis()
model.fit(scale_train, trainY)

test_preds = model.predict(scale_test)
sm.confusion_matrix(test_preds, testY)
print(f'Test Accuracy: {round(sm.accuracy_score(test_preds, testY),3)*100}%')

LinearDiscriminantAnalysis(n_components=None, priors=None, shrinkage=None,
                           solver='svd', store_covariance=False, tol=0.0001)






array([[310,  30],
       [  8,  20]])



Test Accuracy: 89.7%

Extract Feature Importances

lda_features = pd.DataFrame(data={'Feature': trainX.columns, 'Contribution':model.coef_[0], 'abs_cont': np.abs(model.coef_[0])})\
    .sort_values('abs_cont', ascending = False)[['Feature', 'Contribution']]

The test accuracy is actually even higher than the cross validated training accuracy

Visualize in one dimension

The accuracy is pretty high even reducing everything down to just one dimension which is pretty impressive. I show with some jitter below.

fitted_model = model.fit(scale_train, trainY)

one_d_train = fitted_model.transform(scale_train)
one_d_train = pd.DataFrame(one_d_train, columns = ['X'])
one_d_train['Label'] = trainY.values

jitter = np.random.random(one_d_train.shape[0])

fig, axs = plt.subplots(ncols=2)

fig.set_size_inches(15, 6)
# df['korisnika'].plot(ax=axs[0])
# df['osiguranika'].plot(ax=axs[1])

sns.scatterplot(data=one_d_train, x = 'X', y=jitter,  hue="Label", hue_order = ['No', 'Yes'], ax = axs[0])

one_d_test = fitted_model.transform(scale_test)
one_d_test = pd.DataFrame(one_d_test, columns = ['X'])
one_d_test['Label'] = testY.values

jitter = np.random.random(one_d_test.shape[0])

sns.scatterplot(data=one_d_test, x = 'X', y=jitter,  hue="Label", hue_order = ['No', 'Yes'], ax = axs[1])

<matplotlib.axes._subplots.AxesSubplot at 0x7ffca2b77490>






<matplotlib.axes._subplots.AxesSubplot at 0x7ffca58f45d0>

But how does this compare to XGBoost?

from xgboost import XGBClassifier
from xgboost import plot_importance

model = XGBClassifier()
model.fit(trainX, trainY)

train_xgb_preds = model.predict(trainX)
test_xgb_preds = model.predict(testX)

sm.confusion_matrix(train_xgb_preds, trainY)
print(f'Test Accuracy: {round(sm.accuracy_score(train_xgb_preds, trainY),3)*100}%')

sm.confusion_matrix(test_xgb_preds, testY)
print(f'Test Accuracy: {round(sm.accuracy_score(test_xgb_preds, testY),3)*100}%')

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1,
              importance_type='gain', interaction_constraints='',
              learning_rate=0.300000012, max_delta_step=0, max_depth=6,
              min_child_weight=1, missing=nan, monotone_constraints='()',
              n_estimators=100, n_jobs=4, num_parallel_tree=1,
              objective='binary:logistic', random_state=0, reg_alpha=0,
              reg_lambda=1, scale_pos_weight=1, subsample=1,
              tree_method='exact', use_label_encoder=True,
              validate_parameters=1, verbosity=None)






array([[915,   0],
       [  0, 187]])



Test Accuracy: 100.0%





array([[311,  35],
       [  7,  15]])



Test Accuracy: 88.6%

Untrained, XGBoost classifies the training data perfectly but is actually outperformed by LDA on the test data. I suspect a simple grid search would amerliorate this. First, however I want to compare the feature imporance in the two jobs.

ax = plot_importance(model)
fig = ax.figure
fig.set_size_inches(10, 10)

lda_features

	Feature	Contribution
50	OverTime_Yes	0.507787
49	OverTime_No	-0.507787
4	EnvironmentSatisfaction	-0.491681
45	JobRole_Sales Representative	0.489383
11	NumCompaniesWorked	0.487821
8	JobSatisfaction	-0.445302
6	JobInvolvement	-0.417925
21	YearsSinceLastPromotion	0.389570
16	TotalWorkingYears	-0.380487
22	YearsWithCurrManager	-0.299832
48	MaritalStatus_Single	0.295092
18	WorkLifeBalance	-0.284870
2	DistanceFromHome	0.262430
0	Age	-0.254437
14	RelationshipSatisfaction	-0.252732
20	YearsInCurrentRole	-0.251739
41	JobRole_Manufacturing Director	-0.241343
24	BusinessTravel_Travel_Frequently	0.238570
23	BusinessTravel_Non-Travel	-0.219165
37	JobRole_Healthcare Representative	-0.215543
15	StockOptionLevel	-0.199956
43	JobRole_Research Scientist	-0.195798
19	YearsAtCompany	0.193145
17	TrainingTimesLastYear	-0.191518
39	JobRole_Laboratory Technician	0.184980
42	JobRole_Research Director	-0.179125
38	JobRole_Human Resources	0.176008
47	MaritalStatus_Married	-0.157675
29	EducationField_Human Resources	0.153075
27	Department_Research & Development	0.144829
34	EducationField_Technical Degree	0.144608
26	Department_Human Resources	-0.139333
46	MaritalStatus_Divorced	-0.137834
9	MonthlyIncome	0.136029
7	JobLevel	-0.122578
36	Gender_Male	0.105500
35	Gender_Female	-0.105500
44	JobRole_Sales Executive	0.094384
1	DailyRate	-0.091678
33	EducationField_Other	-0.090564
28	Department_Sales	-0.089418
12	PercentSalaryHike	-0.086685
32	EducationField_Medical	-0.086016
13	PerformanceRating	0.065262
25	BusinessTravel_Travel_Rarely	-0.056734
5	HourlyRate	-0.051925
31	EducationField_Marketing	0.049313
30	EducationField_Life Sciences	-0.031653
40	JobRole_Manager	-0.028730
3	Education	-0.025777
10	MonthlyRate	0.009458

Interestingly LDA shows the monthly and hourly rates having some of the lowest feature importance while XGBoost has them among the highest.