Customer Clustering with RapidMiner & Power BI

Customer Segmentation with Clustering

This is a detailed study of customer segmentation with clustering. I used RapidMiner for the clustering and Power BI for all of the visuals. All steps and findings are explained in detail. My aim was to help the business better understand who its customers are, what matters to them, and how it can strategize accordingly.

Abstract

To beat the competition, companies need to know their customers very well. In this research, I used a data set from a grocery shop that includes customers' general characteristics, such as birth year, education level, and marital status, as well as their spending habits: where they buy, how much they spend, and on which products. After examining the data set, I made some improvements to support the analysis. I then segmented the customers with clustering to get a detailed picture of the company's ideal customers. The purpose of this analysis is to help a business adapt its products and marketing strategies to target customers in different segments. For example, instead of spending money to market a new product to every customer in the company's database, a company can analyze which customer segment is most likely to buy the product and then market it only to that segment. I used RapidMiner and Power BI to create the clustering analyses and visualizations. The analyses show clearly different spending habits in each cluster, and the visualizations throughout the report make the customers' preferences easy to see.

METHODOLOGY

i. Problem

Successful companies know their customers very well and can anticipate their needs. The companies that do this best succeed because they can divide customers into groups that reflect the similarities between them. In this project, I segmented customers with the help of clustering. This way, the company can better understand its customers and adapt its products to their specific needs, behaviors, and concerns.

ii. Research Questions

In this research, I wanted to answer the following three questions in order to understand the customers' general characteristics and spending habits, and to identify possible improvements for the company.

What are the statistical characteristics of the customers?
What are the spending habits of the customers?
How to make more targeted marketing campaigns?

iii. Data Set Description

I found this data set on Kaggle Datasets. Its original name is Customer Personality Analysis; it was uploaded by Akash Patel and provided by Dr. Omar Romero-Hernandez for learning purposes. The data set was very clean, with almost no null values, and its description was very clear. I found the attributes customizable and usable for my project, which is why I chose this data set.

Attributes

People

ID: Customer's unique identifier
Year_Birth: Customer's birth year
Education: Customer's education level
Marital_Status: Customer's marital status
Income: Customer's yearly household income
Kidhome: Number of children in customer's household
Teenhome: Number of teenagers in customer's household
Dt_Customer: Date of customer's enrollment with the company
Recency: Number of days since customer's last purchase
Complain: 1 if the customer complained in the last 2 years, 0 otherwise

Data Discovery

iv. Visualizations for Data Discovery

To understand the general characteristics of the data set, I prepared some visualizations with the raw data before preprocessing. All of the visuals are interactive and were created with Power BI.

My main goal is to detect what can affect individuals' purchasing behavior. As the 1st figure shows, there are 2240 individuals in the data set, and they have a total of 995 kids and 1134 teens living with them. The 2nd figure shows the marital status of the individuals; with this information, we can decide whether an individual's household includes another adult. The only problem is that there are more categories than this analysis needs. I only need to know how many people affect the purchase, so this attribute clearly needs preprocessing.

The 3rd and 4th graphs show customers' educational information. The majority are university graduates, followed by PhD and Master's graduates. The smallest group has basic education. The pie chart shows average income by education level. PhD graduates have the highest average income, around 56K yearly. Master's and university graduates earn roughly the same, almost 53K yearly. People with a basic education have the lowest income, nearly 20K yearly.

The average amount spent on specific categories can be seen in the 5th graph; around 75% of the earnings come from wines and meat products. I noticed that the amount spent on wines increases consistently with education level. People with basic education spend most of their money on gold and fish products, while all the other education levels spend most of theirs on wines and meat.

Customers strongly prefer store purchases according to the 6th graph, but web and catalogue purchases are still too high to be ignored. I couldn't detect any significant difference in purchase method by education level.

After inspecting the 7th graph, the first thing I noticed was that only 17.4% of people accepted an offer in the first and second campaigns. Most customers preferred to accept an offer after the fourth campaign. The same behaviour can be seen at all education levels. The only exception is the basic education level, where customers only took the third or the last campaign.

Finally, the scatter plot shows the number of customers by birth year. This graph helped me identify some of the necessary preprocessing actions. There are 3 customers born before the year 1900, which makes them over 100 years old. They are far from the bell curve and are probably not making any purchases now. Also, age is more convenient to work with than birth year, so I wanted to address that too.

Preprocessing

v. Preprocessing

To get the best findings, I did a lot of preprocessing with RapidMiner, which let me analyze more than the given information. I created new attributes to find out more about customer characteristics, living situations, and spending habits. I also removed some outliers to get better results.

1. There were some missing values in the Income column. There were 24 null values in total, and I replaced all of them with the column mean using the 'Replace Missing Values' operator.

2. In the columns "Z_CostContact" and "Z_Revenue", every customer has the same value, so I dropped them because they contribute nothing to model building. I did this with the 'Select Attributes' operator by choosing the subset type, selecting both columns, and checking the invert selection box so that they are dropped at the same time.

3. The "Dt_Customer" column is the date a customer joined the database. I wanted to add a new column that indicates the number of days a customer has been registered in the firm's database (relative to the most recent customer in the data set). As a start, I used the 'Replace' operator to replace '-' with '.' in the date column. Then I converted the attribute with the 'Nominal to Date' operator. After that, I generated a new attribute called "Engagement Ratio" using the date_diff function, which returns the time between the date a customer started shopping at the store and the last recorded date.

The function that I used was date_diff(Dt_Customer, date_parse_custom("06.12.2014", "dd.MM.yyyy")). This function returns its result in milliseconds, so I needed to convert it to another unit. Therefore, I used the 'Normalize' operator with the range transformation method to assign each customer a number between 0 and 100 as an engagement ratio.

4.a. The customers' birth year was given, so I derived the Age of each customer from their birth year with the 'Generate Attributes' operator.

4.b. The attributes MntFishProducts, MntFruits, MntGoldProds, MntMeatProducts, MntSweetProducts, and MntWines show the total amount of money spent on each category. I created a new attribute, Spent, as the total amount spent by the customer.

4.c. The attributes NumCatalogPurchases, NumStorePurchases, and NumWebPurchases indicate the number of purchases made through each channel. I wanted to know the total number of purchases each customer has made, so I created a new attribute, Purchase Number, by adding these three attributes.

4.d. There were more education categories than I needed, so I reduced the Education categories to three with the 'Replace' operator.

Basic: Basic Education,

Graduation: Graduated from Univ.,

Postgraduate: (2n Cycle + Master + PhD)

4.e. Again, there were more Marital Status categories than I needed; I only wanted to know whether the customer lives with a partner. So, I replaced 'Married' and 'Together' with "2" to indicate that there are two adults in the household, and I replaced 'Divorced', 'Widow', 'Alone', 'YOLO', and 'Absurd' with "1" to indicate that there is one.

4.f. There were separate attributes for kids and teenagers, so to calculate the total number of children in the household, I generated a new attribute by adding them.

4.g. To calculate the total number of people in the household, I generated a new attribute, Family Size, from the number of children and the living situation.

4.h. To indicate the parenthood of each individual, I generated a new attribute with the function

if(Children == 0, "no", "yes")

5. To remove outliers, I used the 'Filter Examples' operator and removed individuals who are older than 100 years and individuals whose income is more than 150,000.

6. To label the categorical feature, I used the 'Set Role' operator and set the Education attribute as the label.
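The steps above were all done with RapidMiner operators. As a rough illustration only, steps 1 and 3 to 5 could be sketched in plain Python as below. The helper names `preprocess` and `make_row`, the sample rows, the output key names, the use of the last recorded year (2014) as the reference for Age, and the choice to compute Family Size as adults plus children are all assumptions made for this sketch; they are not taken from the RapidMiner process.

```python
from datetime import date
from statistics import mean

SPEND_COLS = ["MntWines", "MntFruits", "MntMeatProducts",
              "MntFishProducts", "MntSweetProducts", "MntGoldProds"]
PURCHASE_COLS = ["NumWebPurchases", "NumCatalogPurchases", "NumStorePurchases"]
COUPLE_STATUSES = {"Married", "Together"}
EDUCATION_MAP = {"Basic": "Basic", "Graduation": "Graduation",
                 "2n Cycle": "Postgraduate", "Master": "Postgraduate", "PhD": "Postgraduate"}


def preprocess(rows, last_date=date(2014, 12, 6), max_age=100, max_income=150_000):
    """Hypothetical helper mirroring steps 1 and 3 to 5 on a list of dict rows."""
    # Step 1: replace missing incomes with the mean of the known incomes.
    income_mean = mean(r["Income"] for r in rows if r["Income"] is not None)

    derived = []
    for r in rows:
        children = r["Kidhome"] + r["Teenhome"]                       # step 4.f
        adults = 2 if r["Marital_Status"] in COUPLE_STATUSES else 1   # step 4.e
        derived.append({
            "ID": r["ID"],
            "Income": income_mean if r["Income"] is None else r["Income"],
            "Age": last_date.year - r["Year_Birth"],    # step 4.a (reference year assumed)
            "Spent": sum(r[c] for c in SPEND_COLS),                   # step 4.b
            "PurchaseNumber": sum(r[c] for c in PURCHASE_COLS),       # step 4.c
            "Education": EDUCATION_MAP[r["Education"]],               # step 4.d
            "Children": children,
            "FamilySize": adults + children,            # step 4.g (sum is assumed)
            "IsParent": "no" if children == 0 else "yes",             # step 4.h
            "DaysEnrolled": (last_date - r["Dt_Customer"]).days,      # step 3
        })

    # Step 3: range-normalize the enrollment length to 0..100.
    days = [d["DaysEnrolled"] for d in derived]
    low, high = min(days), max(days)
    for d in derived:
        d["EngagementRatio"] = 100 * (d["DaysEnrolled"] - low) / (high - low) if high > low else 0.0

    # Step 5: drop customers older than max_age or earning more than max_income.
    return [d for d in derived if d["Age"] <= max_age and d["Income"] <= max_income]


def make_row(id_, year, education, status, income, kids, teens, joined, wines=0, meat=0):
    """Build one made-up customer row; unspecified amounts and counts are zero."""
    row = {"ID": id_, "Year_Birth": year, "Education": education, "Marital_Status": status,
           "Income": income, "Kidhome": kids, "Teenhome": teens, "Dt_Customer": joined}
    row.update({c: 0 for c in SPEND_COLS + PURCHASE_COLS})
    row["MntWines"], row["MntMeatProducts"] = wines, meat
    return row


# Made-up sample rows, not taken from the data set.
rows = [
    make_row(1, 1970, "PhD", "Married", 60000, 1, 1, date(2013, 1, 10), wines=300, meat=120),
    make_row(2, 1985, "Graduation", "Divorced", None, 0, 0, date(2014, 6, 1), wines=20, meat=15),
    make_row(3, 1893, "2n Cycle", "Together", 40000, 0, 1, date(2012, 8, 1)),
    make_row(4, 1990, "Basic", "Widow", 20000, 0, 0, date(2014, 12, 6)),
]
customers = preprocess(rows)
```

In this sketch the customer born in 1893 is removed by the age filter, and the customer with the missing income receives the mean of the three known incomes. The normalization runs before the outlier filter, following the order of the numbered steps.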

APPLICATION AND FINDINGS

i. Techniques and Algorithms Used

My main objective was to divide customers into groups in order to understand their income and spending habits, so I used clustering in RapidMiner to identify structures within the data set. First, I needed to decide the scope of the analysis. I planned to make 4 clusters in total. I was torn between behavioral and demographic segmentation, because I wanted to include both the customers' purchasing habits and their identity (such as income and family size).

I started with the 'Set Role' operator and set ID as the id; otherwise, RapidMiner automatically assigns new id numbers, which would create many problems in the visualizations. Then I added the 'Select Attributes' operator and chose the Income, Age, FamilySize, and Spent attributes as the variables to segment on. Finally, I added a 'Multiply' operator to run different types of clustering with the same attributes.

Rapid Miner Part

One of the clustering analyses I chose was k-means. In the operator, I checked the 'add cluster attribute' and 'add as label' checkboxes. I tried running the model with different k values and found that 4 is the best one for my research. As the measure type, I chose only NumericalMeasures. I connected the first output to the Cluster Model Visualizer's mod input, and the second output to a 'Set Role' operator, where I changed the role of the 'label' attribute to cluster. With a 'Multiply' operator, I connected this result to the Cluster Model Visualizer's cluster input, to a 'Write CSV' operator to save the data set as CSV, and directly to the results output.
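To make the idea behind the k-means operator more concrete, here is a minimal sketch of the standard k-means loop (Lloyd's algorithm) in plain Python. It is not RapidMiner's implementation; the function name `kmeans`, the random initialization with a fixed seed, the Euclidean distance, and the four made-up customers in the order (Income, Age, FamilySize, Spent) are assumptions for illustration.

```python
import random
from math import dist


def kmeans(points, k, iterations=100, seed=0):
    """Plain Lloyd's k-means with Euclidean distance; returns (labels, centroids)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)      # k randomly chosen points as starting centroids
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [min(range(k), key=lambda c: dist(p, centroids[c])) for p in points]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, new_labels) if label == c]
            if members:                    # an empty cluster keeps its old centroid
                centroids[c] = tuple(sum(v) / len(members) for v in zip(*members))
        if new_labels == labels:           # stop once the assignments no longer change
            break
        labels = new_labels
    return labels, centroids


# Made-up customers as (Income, Age, FamilySize, Spent).
points = [(20000, 30, 3, 50), (22000, 35, 4, 80),
          (80000, 50, 2, 1500), (85000, 45, 1, 1800)]
labels, centroids = kmeans(points, k=2)
```

Because the attributes stay on their original scales in this sketch, Income has by far the largest range and therefore dominates the Euclidean distance. The four sample customers are split into a low-income pair and a high-income pair.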

The second clustering operator that I used is the Agglomerative Clustering operator. I chose Average Link as the mode because it gave me the best results, and again NumericalMeasures as the measure type.
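For comparison, average-link agglomerative clustering can be sketched as follows. This is a naive illustration of the technique, not the RapidMiner operator: the function name `average_link`, the Euclidean distance, the made-up two-dimensional points, and the choice to stop merging once a target number of clusters is reached are assumptions.

```python
from itertools import combinations
from math import dist


def average_link(points, n_clusters):
    """Naive agglomerative clustering; returns clusters as lists of point indices."""
    clusters = [[i] for i in range(len(points))]    # start with one cluster per point

    def avg_distance(pair):
        # Average linkage: mean distance over all cross-cluster point pairs.
        a, b = clusters[pair[0]], clusters[pair[1]]
        return sum(dist(points[i], points[j]) for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > n_clusters:
        # Merge the two clusters with the smallest average distance.
        a, b = min(combinations(range(len(clusters)), 2), key=avg_distance)
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]                             # a < b, so index a stays valid
    return clusters


# Made-up two-dimensional points.
points = [(1.0, 1.0), (1.5, 1.2), (8.0, 8.0), (8.3, 7.9), (20.0, 1.0)]
clusters = average_link(points, 3)
```

Each merge joins the pair of clusters whose members are closest on average, so the two tight pairs are merged first and the isolated point stays in its own cluster.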

ii. Summarizing The Performance of Clusters

There are 4 clusters in total, and the customers are fairly evenly distributed among them.

Cluster 0: 444 items
Cluster 1: 516 items
Cluster 2: 629 items
Cluster 3: 640 items

Total number of IDs: 2229

At first glance, we can clearly see that the main attribute that affected the clustering is the customers' income level. We can also see the difference in the spending habits of each cluster. In the Matrix Table, we can see the average age, purchase number, household size, and total complaints.

Cluster_0 has the lowest income and spending, and the average age of the cluster is also relatively low. The effects of lower income and spending also show in the average purchase number.
Cluster_1 has the highest income and spending, but not the highest average age. Even though Cluster_1 has the highest spending and average purchase number, it has the lowest household size, which I found a little odd; further analysis may reveal the reasons. I also noticed that this cluster has the lowest number of complaints.

Cluster summarization of RapidMiner with Cluster Model Visualizer

The Cluster Model Visualizer also confirmed most of my findings. Additionally, we can see that Cluster_2 has a 22.7% larger household size than the average.
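A statement such as "22.7% larger than the average" can be read as the relative difference between a cluster's mean and the overall mean. The sketch below shows that calculation under this reading; the function name `cluster_deviation`, the key names, and the sample rows are assumptions, and the sketch does not claim to reproduce how the Cluster Model Visualizer computes its summary.

```python
from collections import defaultdict
from statistics import mean


def cluster_deviation(rows, attribute, cluster_key="cluster"):
    """Return {cluster: percent difference between the cluster mean and the overall mean}."""
    overall = mean(r[attribute] for r in rows)
    groups = defaultdict(list)
    for r in rows:
        groups[r[cluster_key]].append(r[attribute])
    return {c: 100 * (mean(values) - overall) / overall for c, values in groups.items()}


# Made-up rows: two customers per cluster.
rows = [
    {"cluster": "cluster_1", "FamilySize": 2},
    {"cluster": "cluster_1", "FamilySize": 2},
    {"cluster": "cluster_2", "FamilySize": 3},
    {"cluster": "cluster_2", "FamilySize": 3},
]
deviation = cluster_deviation(rows, "FamilySize")
```

With these sample rows the overall mean household size is 2.5, so cluster_2 comes out 20% above the average and cluster_1 20% below it.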

I summarized the clusters as follows:

Cluster 0: low spending & low income

Cluster 2: low spending & average income

Cluster 3: average spending & average income

Cluster 1: high spending & high income

The HeatMap below visualizes these findings more plainly. It was calculated according to the GrandTotal percentage.
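As an illustration of a percent-of-grand-total calculation, each cell of a cluster-by-category matrix can be divided by the sum of all cells. The function name `percent_of_grand_total`, the nested-dictionary layout, and the sample values are assumptions for this sketch and are not taken from the Power BI report.

```python
def percent_of_grand_total(matrix):
    """matrix: {row: {column: value}} -> same shape, each cell as a % of the grand total."""
    grand_total = sum(value for row in matrix.values() for value in row.values())
    return {
        row_label: {col: 100 * value / grand_total for col, value in row.items()}
        for row_label, row in matrix.items()
    }


# Made-up spending totals per cluster and product category.
spending = {
    "cluster_0": {"Wines": 10, "Meat": 30},
    "cluster_1": {"Wines": 100, "Meat": 60},
}
shares = percent_of_grand_total(spending)
```

All shares sum to 100, so a cell's colour intensity in a heat map built this way reflects its weight in the whole table rather than within its own row or column.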

iii. The Findings & Conclusion

Instead of spending money on unnecessary marketing campaigns, a company could separate customers into discrete groups, or segments, based on their shared characteristics, and spending habits.

Right below, detailed visuals of the clusters are presented (all interactive). By using the Slicer, I filtered and analyzed each cluster and tried to answer the questions of this research.

Overview

In the 1st figure, different cards are presented for the data discovery. 'Average household size' gives information about how many people affect the purchases of that cluster. I also included the 'average number of children' to detect customers' product preferences. 'Sum of discount purchases' and 'average number of web visits' help us create more targeted marketing strategies for each cluster.
According to the 2nd figure, we can decide which cluster uses which platform and again use more targeted marketing.
The 3rd figure will help me understand whether education level affected the clusters' habits.
As discussed before, the spending amounts of the clusters are quite different but in the 4th figure, it is also possible to see the clusters' spending habits in terms of products.
Finally, with the 5th graph, a company could decide on how many campaigns they will offer to the different customers.

1. What are the statistical characteristics of the customers?

Cluster_1 has the highest income and spending. Since they have the highest spending, I expected to see a high average household size as well; however, they have the lowest, 1.97, which indicates there are generally between 1 and 3 people in the household. They also have the lowest average number of children, 0.35, which would strongly affect their spending habits. Household size and number of children increase in the order Cluster_0, Cluster_3, Cluster_2.

Cluster_0 has the highest share of basic education, 11.9%. In Cluster_2 there is only one person with basic education, which makes the share 0.16%. In the other clusters, nobody has basic education. Cluster_3 has the highest postgraduate share, yet it still does not have the highest average income. Cluster_2 also has a high share of postgraduate individuals, but it has the second lowest income and spending. I can say that basic education affects customers' habits, but graduate and postgraduate education have no significant effect.

2. What are the spending habits of the customers? & 3. How to make more targeted marketing campaigns?

As displayed in the 1st figure, the sum of discount purchases is 5171. Cluster_1 has the lowest number of discount purchases, and Cluster_3 and Cluster_2 have the largest. Consequently, it is better to offer discounts to these two clusters, since discounts are not essential for Cluster_1.

People generally prefer store purchases according to the funnel graph, followed by web and catalogue purchases. When we filter the results, we can see that Cluster_0 and Cluster_2 rarely prefer catalogue purchases. Cluster_1 also prefers purchasing in the store, but its second preference is catalogue purchases. Cluster_1 and Cluster_3 have the highest catalogue purchases, so it is logical to tailor the catalogues to these clusters.

Cluster_0 and Cluster_2 make around 50% of their purchases in the store. Even though they spend small amounts on web purchases, they visit the website more often. For this reason, the store can add more marketing campaigns aimed at these clusters to its website.

In the 4th figure, we can clearly see that most money is spent on wines and meat products, so most of the shop's revenue depends on these products. Only Cluster_0 spends less money on wines than on meat products. If we filter for postgraduate education in the 3rd figure, we see that postgraduates spend more money on wines than average, so we can offer them some special wine campaigns.

Finally, in the 5th graph, we can see which campaigns were accepted by the customers. Cluster_0 only accepted the last or the 3rd offer, so in the first few campaigns the company can focus on the needs and wants of other clusters. Cluster_1 tends not to accept the 2nd, 3rd, and 4th offers, and since Cluster_1 is the most profitable cluster, the store should tailor its 1st, 5th, and last campaigns to Cluster_1. Cluster_2 also tends not to accept the first 2 offers, so we can say that clusters with lower income and spending do not prefer the first few offers. The preferences of Cluster_3 are more evenly distributed, and they have a higher chance of accepting the 4th offer. Other clusters rarely choose the 4th offer, so it is better to tailor the 4th campaign to Cluster_3.

To sum up, these findings make it easy to see the general characteristics of each cluster, such as their spending habits, which products they usually spend money on, and how many people there are in their household. They can also feed many more visualizations and results that help the company plan its promotion strategies, decide on further steps, and get to know its customers better.

Resources

This analysis is an edited version of my previous term project for the 'Intelligent Systems' lecture at Dokuz Eylul University. All the content displayed here was created by me.

Except the attribute descriptions!

Data set: Customer Personality Analysis- Analysis of company's ideal customers.

https://www.kaggle.com/datasets/imakash3011/customer-personality-analysis