Customer Segmentation Report for Arvato Financial Services.

Stephan Bauer
15 min readOct 1, 2021

Capstone Project Report

Project Definition

Arvato Financial Services is a company operating in the mail-order sales business in Germany. The company wants better identify persons that become more likely customers with their marketing campaigns.

Problem Statement

The Objective is to predict individuals who are more likely to become new customers from a given set of individuals.

To achieve this we use unsupervised and supervised machine learning approaches

  • Unsupervised learning to segment non customers and customers into clusters to identify the main customer and population cluster and extract information about the driving features.
  • Supervised learning to build and train a model to predict the probability of an individual to become a customer. For the supervised learning we will use different Binary Classifiers

Metrics

In order to find the correct metric this guide Metrics for Imbalanced Data helps to find one:

  • Data is very imbalanced
  • We need probability
  • We need class labels
  • Both classes are important
  • => This leads to ROC AUC or PR-AUC

The Metric used to evaluate the supervised learning model is Area Under the Receiver Operating Characteristic Curve (short ROC AUC, see ROC Crash Course, Guide to ROC curves and when to use) because we have a very imbalanced binary classification problem to solve.

Data Exploration

The given data consists of four datasets and 2 metadata files describing the features of the dataset.

Metadata

  • DIAS Information Levels — Attributes 2017.xlsx: is a top-level list of attributes and descriptions, organized by informational category.
  • DIAS Attributes — Values 2017.xlsx: is a detailed mapping of data values for each feature in alphabetical order.
  • The metadata file does not describe all features used in the dataset. Features for that no description exist are e.g.

    ‘AKT_DAT_KL’, ‘ALTERSKATEGORIE_FEIN’, ‘ALTER_KIND1’, ‘ALTER_KIND2’, , … · ‘VK_DISTANZ’, ‘VK_ZG11’

Index (Unique Identifier)

The “LNR” column can be used as index for the datasets.

Variable Value Checks

We now check which values occur in the data set and if they match with the values defined in the metadata files.

A first glance on the used values shows that the dataset contains huge number of floats also the metadata just defines int values and some string codes

Meaning of Value “…”

A check of the data set show that there are many values that are not defined in the metadata value set. Let’s identify which values that are and which columns (features) use them.

  • E.g. ANZ_HAUSHALTE_AKTIV,ANZ_STATISTISCHE_HAUSHALTE, KBA13_ANZAHL_PKW use int and floats from 0 to N -> obviously that is the total number of items counted.
  • D19_LETZTER_KAUF_BRANCHE contains categorical values like ‘D19_LEBENSMITTEL’, ‘D19_BANKEN_DIREKT’, ‘D19_BANKEN_GROSS’
  • EINGEZOGENAM_HH_JAHR GEBURTSJAHR MIN_GEBAEUDEJAHR are columns of type date (year)

Unknown Data

The features of the datasets are mainly numerical encoded nominal and ordinal values, e.g.

As you can see there are encoded values for “unknown”. These values are handled in the same way as missing values (NaNs / NULLs). That means I replace unknown and missing values by a measure of central tendency. For ordinal values the mode and mean are useful replacements for nominal just mode.

Duplicate data

Let’s check if the data sets contain records for the same individual.

There no records for the same individual. Next we check if there are same data points that means if there exist exactly the same records ignoring LNR column.

This shows that the general population and the customer dataset have little more than 40,000 identical records. That are ~5% for the general population and ~21% for the customer dataset

Handling of Data Load Errors

During the load of the datasets we face some warning that columns 18 and 19 have mixed types. These columns are CAMEO_DEUG_2015, CAMEO_INTL_2015.

A check of these two columns shows that they contain numerical encoded ordinal values but also strings “X” and “XX”

We replace them by np.NaN.

Analyse Feature Values

A check of the values used in the feature columns show that many of them contain a mix of int and float values. In addition there are columns with a wide range of different values. We see that there are date columns like EINGEZOGENAM_HH_JAHR, GEBURTSJAHR and MIN_GEBAEUDEJAHR.

These columns are converted to int.

Missing Data

Let’s now check how many data is missing.

Note that we treat unknown values as missing values

A first check shows that for most columns less than 20% of data is missing. For a few columns almost all data is missing

The following table shows that there are 10 columns for that more than 30% of data is missing. Five of them are the ALTER_KIND and TITEL_KZ columns for that more than >90% is missing. For all other columns less than 30% is missing.

Therefore we will drop the 10 columns with more than 30% missing data and keep the others. In addition we drop D19_LETZTER_KAUF_BRANCHE as this columns contains >30% values “D19_UNBEKANNT” which means also unknown.

Note: The ALTER Kind columns have a huge number of missing values as they indicate the age of child 1..4 and if values are available or missing is directly related to ANZ_KIND which indicated the number of children.

Categorical Features

Most features of the dataset contain categorical data but it is already encoded by int values. But some features still have categorical data. In order to be able to apply machine learning algorithms we need to encode them (See also https://machinelearningmastery.com/how-to-prepare-categorical-data-for-deep-learning-in-python/). The features that are not encoded categorical are

We will apply the following actions

Data Correlation

Now let’s have a look on the correlation of the data set.

The heat map shows some areas that are heavily positive correlated. Let’s have a closer look on these main correlations

The highest absolute correlation (>0.9) have these features

Detailed view on some selected very high correlated features:

Data Pre-processing

In order to prepare the data for the further analysis we will execute the following pre-processing steps:

  1. Drop additional customer data set
    We drop the columns ‘CUSTOMER_GROUP’, ‘ONLINE_PURCHASE’, ‘PRODUCT_GROUP’ as these are just contained in the customer dataset
  2. Handle data load errors
    We convert “X” and “XX” values in columns ‘CAMEO_DEUG_2015’, ‘CAMEO_INTL_2015’ to np.NaN
  3. Drop duplicates
    Drop identical rows.
  4. Handle unknown values
    We replace unknown values with np.NaN in order to treat them like missing data. We use the Metadata to identify unknown values. We also convert all negative values to np.NaN as we assume that this is also unknown as -1 is in general for unknown according to the metadata.
  5. Encode categorical variables
    Encoding of categorical variables is not required as the only not encoded categorical features are dropped.
  1. Drop columns — Too many missing values
    Drop all columns with more than 30% missing data. That are:

    ‘ALTER_KIND4’, ‘ALTER_KIND3’, ‘ALTER_KIND2’, ‘ALTER_KIND1’,
    ‘ALTER_HH’, ‘AGER_TYP’,
    ‘EXTSEL992’, ‘KK_KUNDENTYP’, ‘KBA05_BAUMAX’, ‘TITEL_KZ’
  2. Drop Columns — Too much correlation
    Drop columns with abs(correlation ) > 0.9.
    E.g. ‘CAMEO_DEU_2015’,’LP_STATUS_GROB’,’LP_FAMILIE_GROB’,…, ‘PLZ8_ANTG3’

Review Data Cleaning

Compare Missing Values

Before cleaning

After cleaning

There is a significant increase of columns with less missing data. The main reason are:

  • Dropping of columns with more than 30% missing data
  • some “unknown” values were replaced values by np.NaN
  • Duplicate rows have been removed. There are for many columns the same number of removed missing values ~45730. This is the cause the decrease of number of columns for bin 2 (10–20%) and increase of bin 1 (0% — 10%)

Correlation

You can see that the very high correlated data areas are removed after the cleaning.

Implementation Step 1 - Unsupervised learning

In order to segment the customers with unsupervised learning we use a clustering algorithm.
For clustering there is number of popular algorithms. For the algorithm selection we focus on the ones that scikit-learn provides and the article clustering algorithms with python or How to Combine PCA and K-means Clustering describe.

According to the references the most popular algorithms are

  • Affinity Propagation
  • Agglomerative Clustering
  • BIRCH
  • DBSCAN
  • K-Means
  • Mini-Batch K-Means
  • Mean Shift
  • OPTICS
  • Spectral Clustering
  • Mixture of Gaussians

The bold one are the one(s) we use.

The dimension of the dataset is quite high so that it is worth to consider a reduction of the dimensionality which will increase the performance and in many cases the accuracy of algorithm. In particular the popular K-means which we will use will profit from it (See e.g. PCA with k-means https://365datascience.com/tutorials/python-tutorials/pca-k-means/ )

The main steps outlined in the combination of all guides are

The approach is

  1. Load and Prepare Data
  2. PCA — Principal Component Analysis
    This algorithms is also provided by scikit-learn. It will transform the given space of features to new space with basis vectors that are linear combinations of the given features so that the new vectors point in direction of the maximum variance. For more information how to execute and interpret the values see In Depth: Principal Component Analysis
    * Impute missing data
    * Standardize data
    * Execute PCA
    *Define the level of explained variance to be kept -> this defines number of components to keep
    *Train (FIT) PCA on reduced number of components
  3. K-means
    Finally we cluster the data by applying KMeans on the data. We use 3 steps for this.
    a. Fit KMeans on the general population data
    b. Determine the best number of clusters
    c. Cluster general and customer data on defined number of clusters

PCA — Principal Component Analysis

Executing the PCA and plotting the Explained variance over the number of components show the following graph.

The dashed line shows that 70% of the variance is explained by 81 components.

Let’s now check how much variance is explained by the first components and which features are the driving ones for these ones.

Customer Segmentation (Clustering)

As mentioned in the section introduction we focus on the KMeans algorithm for unsupervised clustering.

Determine number of Clusters

In order to find the best number of clusters we use the yellowbricks library that applies the elbow method that is also referenced in the guides mentioned at the beginning of this chapter.

After we have determined the number of clusters we will apply PCA on cleaned customer dataset and run then the KMeans algorithm on customer and general population dataset for the given number of clusters.

Finally we can compare the proportion per cluster between customer and general population.

The elbow is at k=9 clusters. As this is not a clear elbow we will compare the segmentation for different number n =[8,…14] of clusters.

For n=10,11,12 clusters the result are listed in table form below

The one in the middle (clusters = 11) seems to be best suited as the differences between customer and general population in the clusters is best. In the first 4 clusters is more than 67% of all customer data whereas just 46% of general population is in these four.

As you can see there are much more customers in cluster 4 compared to general population (~18.5% difference). Also in cluster 0, the customer proportion is almost 3% higher. In these 2 clusters are more than 45% of the customers compared to ~24% of general population.

If we take the first 4 clusters ordered by customers_pct we have more than 67% of all customers included which is 21.5% compared to general population.

In order to understand the meaning of the clusters we investigate the main positive and negative drivers of features for cluster main customer clusters. The results are shown below.

Let’s now have a look on the differences of proportion between customers and general population fir the features that the main positive drivers for customers.

The main obviously differences in positive weights for customers are

  • Much more FINANZ_VORSORGER (finance provision)
  • CJT_TYP6 and CJT_TYP5 and CJT_TYP4 is much higher (not described)
  • KOMBIALTER of customers is higher

The main obviously differences in negative weights for customers are

  • FIANZ_SPARER, FINANZ_ANLEGER is much higher
  • PRAEGENDE_JUGENDJAHRE are dominated by 40er-60er

Implementation Step 2 - Supervised Learning

For the supervised learning we follow these steps

  1. Load Data and clean Data
    by applying the implemented PreProcessor
  2. 1. Impute missing data
    Note:
    Scaling is not required. Do Decision Trees need Feature Scaling? | by Praveen Thenraj | Towards Data Science).
  3. Split Data into Test and validation Data
    It’s important to split the data before we impute missing values in order to avoid that the training data learns from validation data
  4. Train first model
  5. Compare the segmentation of the datasets
    We compare the mailout training and test data to the customer data in order to see if the data is clustered similar -> this is just for investigation and not needed for the model
  6. Run GridSearch
  7. Split the given training dataset mailout_train into X_train, y_train where X_train are all features and y_train are the label (RESPONSE column).
  8. Build a model pipeline
  9. Run GridSearch on the given datasets for different parameters and algorithms

First Supervised Model

We start to train XGBClassifier with no tuning on the mailout training data set. In order to validate the result we split the mailout training data into training (80%) and validation data (20%) by random.

The result is a ROCAUC score of 0.5 which is the worst we can get and this confusion matrix

For RandomForrestClassifier and SVC the result looks identical.

The GaussianNBClassifier it looks a bit better:

This is because of a very imbalanced distribution of the labels

Just 426 of 34369 labes are positive which is ~1.2% is labeld with RESPONSE=1.

Because of this we use the BalanceRandomForest classifier. This results in a much better prediction. The ROCAUC score is in this case 0.657 and confusion matrix is

Refinement 1 –GridSearch

In order to improve the results of default parameters of the classifiers we use GridsSearch to train the algorithms for the given training data with different hyperparameters to investigate if this will improve the prediction further.

We run the GridSearch for

  • BalancedRandomForestClassifier
    ‘n_estimators’: [30,90,250],
    ‘max_depth’: [1, 2, 3],
  • XGBClassifier
    ‘n_estimators’: [30,90,250],
    ‘max_depth’ : [1,2,3],
    ‘learning_rate’ : [0.15,0.3],
    ‘scale_pos_weight’ : [scale_pos_weight, 1.5*scale_pos_weight],

Scale_pos_weight is the ratio of (training data with label = 1) / (training data ith label = 0)

This search results in a best estimator with the following parameters

Using this estimator we receive a ROCAUC score of 0.721 and get this confusion matrix

Compare Segmentation

First let’s check how the segmentation of the training and test data looks compared to the overall customer and general population datasets

You can see that the proportion of the training data is also higher in the main customer clusters and that the proportion of individuals that have a positive RESPONSE is also higher the main customer clusters 4 and 0 compared to the overall mailout_training dataset. Note that in cluster 4 and 0 the positive RESPONSE quote is much higher compared to the overall mailout distribution share.

Second Model — Using General Population and Customer data as training set

As improvement of the results of the previous model we now use the general population and customer data sets and label them with 0 for general population and 1 for customers. The advantage of this generated dataset is that these data set contains much more data and in particular much more data with positive labels ~150K. We can then test if models trained on this created dataset will work for predictions on the mailout dataset by using the complete mailout training data set as validation set.

Let’s start and try this for some classifiers with default parameters.

Using XGBoost with default parameters results in a quite good result of

ROCAUC: 0.718

Using RandomForest is even better with

Refinement — GridSearch

Like we did above for the mailout training dataset we will try GridSearchCV again. For the RandomForest this breaks in most cases because of too less memory. Therefore we start to run the GridSearch on the XGBClassifier only.

We now try a limited GridSearchCV on Randomforest for n_estimators =70 and some different max_depth (1,2,3,None). The result is

This is the best score we have so far.

Results

Segmentation

The customer segmentation shows that there is a clear over representation in cluster number four. This cluster contains >32.5% of all customers. Cluster 4,0,6,9 contain more than 67% of all customers. So focusing on these clusters for campaigns and mainly on cluster3 should be result in better response ratio.

Prediction

Using Machine Learning to predict responses we find that the training models on the mailout training dataset will not work for many models with default parameters because of the heavily imbalanced data. Once we are suing classifiers or parameters that consider the imbalance like BalancedRandomForestClassifier or XGBClassifier with scale_pos_weight parameter we get better results.

Using the general population and customer data set to build a new training dataset and validate against the mailout gives much better results. For the RandomForestClassifier with default Parameter we already get a ROCAUC score and confusion matrix like below

Using GridSearch for RandomForest for the large dataset failed but worked for very limited search parameter space. Using default parameters on the mailout test result in a Kaggle score of 0.86738 and using

gives

Gives a kaggle score of 0.87088

Improvement

In order to improve the data even more we could try to combine the XGBClassifier, the RandomForestClassifier and the KMeans classification. E.g. the XGBClassifier give this validation result for using the mailout training as validation

Whereas RandomForestClassifier gives

As you can see XGB has a better sensitivity

Sensitivity = 458 / (74 + 458) = 0.861 (XGBC) compared to 0.705 (RFC)

and RandomForestClassifier a better

Specificity = 33846 / (33646/8584) = 0.798 (RFC) vs 0.576 (XGBC)

References:

--

--