Clustering and Customer Churn Analysis

7 min readNov 27, 2020

Small Business | Tax Office Insights

Abstract

Through data exploration and analysis of clientele data over a three-year period, recommendations for business development were provided through customer segmentation and analysis of customer churn. The most significant customer loss occurs in clients between the ages of 25 and 35 coming from single income households. Our analysis outlines strategies for targeted marketing and customer retention.

Methodology and Business Problem

A small tax office is looking to understand their client base in order to prepare for the upcoming tax season. Small businesses depend on face-to-face interactions in order to build long lasting relationships with their customer base. COVID-19 has taken a toll on businesses nationwide and the tax office is looking to prepare for the upcoming year by understanding customer retention and leveraging targeted marketing.

Our team performed analysis on the office customer base for targeted marketing and customer retention. Three years of tax data were provided for customer analysis 2016, 2017 and 2018 (tax years 2017, 2018 and 2019 respectively).

Breakdown

The project was broken out into three sections for simplicity. The first included data manipulation and cleaning. The second was an analysis of customer segmentation via Principal Component Analysis and K-Means clustering. The third was an observation of customer churn (customer turnover) via multiple machine learning algorithms and neural networks.

Due to the sensitivity of the data and the need for significant anonymizing of data, the data cleaning portion will not be covered in this post.

An executive dashboard was built to give a birds eye view of the data after cleaning, using Google’s data studio.

To begin, we’ll introduce the data science framework we followed throughout our analysis. The OSEMN framework (obtain, scrub, explore, model and interpret).

The packages that were used in our analysis are:

Python Packages

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

Sklearn Packages

from sklearn.preprocessing import StandardScaler, normalize
from sklearn.metrics import silhouette_score
from sklearn import cluster
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.ensemble import GradientBoostingClassifier
import sklearn.metrics as metrics
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, accuracy_score 
from sklearn import preprocessing
from sklearn.metrics import classification_report
from sklearn.metrics import f1_score

Pycaret Packages

import pycaret.preprocess as preprocess
from pycaret.classification import *
import pycaret.preprocess as preprocess

Tensorflow Packages

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense,Dropout
from tensorflow.keras.callbacks import EarlyStopping

Exploring the data

Through the analysis of financial features occurring in one, two or all three of the listed years, the feature ‘Customer Churn’ was generated. In order to make our analysis ‘fair’ in nature, new customers were not included in the customer retention analysis.

Prior to the removal of these new customers, a comparison was made to observe trends in returning customers versus new ones based on customer age, tax preparer name and single versus dual income household using the python package seaborn.

An analysis of the distribution of the data and any effects of multicollinearity was conducted using plotly’s iplot function as well as seaborn’s heatmap visualization.

clients.iplot(kind='box')corr = clients.corr() # analyzing correlation
fig, ax = plt.subplots(figsize=(11,14))
mask = np.triu(np.ones_like(corr, dtype=np.bool))
sns.heatmap(corr, mask=mask, square=True, annot=True, cmap='YlGnBu')
ax.patch.set_edgecolor('black')  
ax.patch.set_linewidth('1')
ax.set_title('Correlation & Heat Map', fontsize=15, fontfamily='serif')
plt.show()

The data was relatively well distributed with only Federal Tax (owed) and Adjusted Gross Income having significant outliers. This makes sense as a larger income earned results in higher federal tax liability. With no multicollinearity values over 0.75, our clustering analysis continued with all of the features at our disposal.

Customer Segmentation

Prior to modeling, we ensured our data was numerical (no need for one hot encoding) and the data was scaled using a standard scaler.

scaler = StandardScaler()
df_scaled = scaler.fit_transform(clients)

Next, principal component analysis was conducted in order to reduce the dimensionality of the dataset from 14 features to 2 features while retaining as much of the relevant data as possible.

# Reducing to two columns
pca = PCA(n_components=2)
principal_comp = pca.fit_transform(df_scaled) # passing scaled data# creating dataframe out of 2 component result
pca_df = pd.DataFrame(data = principal_comp, columns=['PCA1','PCA2'])

Through the use of the elbow method and a silhouette coefficient, an appropriate number of clusters was selected. We selected three.

K-Means Clustering

We fit the model, visualized the segmentation using seaborn and the following commonalities in the segments were discovered.

# 3 Clusters
km = cluster.KMeans(n_clusters=3, max_iter=300, random_state=101)
pca_df['Three_Clusters'] = km.fit_predict(pca_df)# Three cluster scatterplot/seaborn
sns.lmplot(x='PCA1', y='PCA2', data=pca_df, fit_reg=False, hue='Three_Clusters',
           palette='Paired',size=6,aspect=1);

The three-segment grouping offered insight into customer trends. The most prevalent cluster, cluster 1 — represented the highest customer churn with client features that include:

Smaller family sizes, mid-range age

Male and female, single income households

Preparation completed by only preparers

Mid-range tax and (AGI) income

Pays slightly lower than the fees listed

Processed mid-week, typically in late February

With this information. We began our customer churn analysis.

Customer Churn | Modeling

The data was one-hot encoded (giving categorical values a numerical one for use in machine learning) and separated into X and y variables effectively splitting out our target customer churn.

# One Hot Encode
df = pd.get_dummies(df, drop_first=True)# Identify X, y
y = df['Customer_Churn_Yes']
X = df.drop(['Customer_Churn_Yes'], axis=1)

A test/train split was performed and the data was standardized finally preparing it for modeling.

# Test/Train split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=101)# Standardize the data
scaler = StandardScaler() # transform "X" features
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

Pycaret offered a broad overview of 15 machine learning algorithms and their performance on the classification of customer churn. Their results were:

Additionally, a neural network was built using tensorflow for comparison.

early_stop = EarlyStopping(monitor='val_loss',mode='min',verbose=1,patience=25)model = Sequential()

model.add(Dense(21,activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(18,activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(15,activation='relu'))
model.add(Dropout(0.5))

# Binary Classification
model.add(Dense(1,activation='sigmoid'))

model.compile(loss='binary_crossentropy',optimizer='adam')

The model was fit and the results were compared to the 15 previously run models based on accuracy, AUC, precision, recall and F1.

model.fit(x=X_train,y=y_train,epochs=500,validation_data=(X_test,y_test),
          callbacks=[early_stop])

After an analysis of the scores, our team selected to move forward with a Gradient Boosting Classifier.

gb_clf = GradientBoostingClassifier()
gb_clf.fit(X_train, y_train) # fit model
gb_prediction = gb_clf.predict(X_test)
gbclf_score = round(accuracy_score(gb_prediction, y_test)*100,2)
print('Accuracy Percentage', gbclf_score, '\n')
print(classification_report(y_test, gb_prediction), '\n\n')

The top 25 predictive features were separated for analysis using the following, and then visualized using matplotlib.

# Feature Importance
gb_model = gb_clf.fit(X_train, y_train)
gb_feature = pd.DataFrame({'Importance': gb_model.feature_importances_, 'Column': X.columns})
gb_feature = gb_feature.sort_values(by='Importance', ascending=False) 
print('Gradient Boosting Top 25 Features')
gb_feature[:25] # top 25 features

Visualization

gb_feature = gb_feature[:25] # top 25 features
gb_feature.plot(kind='barh',x='Column',y='Importance',figsize=(20, 10),cmap='coolwarm')
plt.title('Gradient Boosting Feature Importance \n', fontsize=16)
plt.show()

The variation in fee and price was significant, visually represented using a plotly spreadplot.

from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
import cufflinks as cf
cf.go_offline()churn[['Fee','Paid']].iplot(kind='spread',title='Spread Plot Fee vs Paid | Customer Churn',
                            yTitle='Fee vs Paid',colors=['red','blue'])

The our analysis of age, month and day with respect to customer churn is visualized below.

Conclusions

Through customer segmentation and customer churn analysis the following actionable recommendations were made. They include targeted strategies as well as general themes for building rapport with the client base most prone to turnover.

Recommendations

The tax office needs to target consumers with mid-range ages, specifically those that are between 25–35 years old. Additionally, clients over the age of 65 years old should be targeted during the off-season as they represent a large portion of customer churn, and the fear of illness (COVID-19) may lead to additional customer loss. These mid-range age and retired individuals should be given special time slots on Sundays each week.

To help build rapport with the mid-range demographic, the Tax office should begin offering financial wellness checks to those that are interested. A simple check of finance plans for retirement and other provided resources can go along way when looking to build trust with a client base. The office should invest in video conferencing software such as Zoom, GoToMeeting or Google Meet in order to offer alternative methods of tax preparation to their client base an offer a user friendly approach to their younger demographic.

The office should attempt to redistribute its client flow with an emphasis on taking care of clients who are middle aged, live in single income households and do business in February. The customer segmentation analysis illustrated customer churn occurring in the middle age, single income, preparer only tax returns. This leads me to believe that this demographic waits in the office for services and are processed by the first preparer available as opposed to having an administrative representative complete the returns while receiving a follow up call later in the evening or the next day (the process followed at the office). This would be a great practice for retaining those clients and cutting down wait times while allowing administrative employees to train and develop their preparation skill sets to become future preparers.

Final Thoughts

Unfortunately, this study was limited by poor record keeping and the lack of the most recent tax information. Future work should include a more comprehensive dataset with additional features such as education completed, occupation, etc. The dataset offered great insight into a small business and yielded several insights to aid growth as we continue to live through a global pandemic.