
04/08/2023 • Mehmet Emre Toktay

Bank Customer Segmentation: Analysis with K-Means, DBSCAN, and OPTICS
Table of Contents
  • 1. Introduction
  • 2. Initial Review of the Data Set
  • 3. Preprocessing
  • 4. Exploratory Data Analysis (EDA)
  • 5. Clustering Models
    • 5.1 K-Means
      • 5.1.1 K-Means Normal
      • 5.1.2 Outlier-Adjusted K-Means
      • 5.1.3 K-Means UMAP
      • 5.1.4 K-Means PCA
    • 5.2 DBSCAN
      • 5.2.1 DBSCAN Normal
      • 5.2.2 DBSCAN UMAP
      • 5.2.3 DBSCAN PCA
    • 5.3 OPTICS
      • 5.3.1 OPTICS Normal
      • 5.3.2 OPTICS UMAP
      • 5.3.3 OPTICS PCA
  • 6. Results
    • KMeans
      • Normal
      • UMAP
      • PCA
      • Outlier-free
    • DBSCAN
    • OPTICS
      • Normal:
      • UMAP & PCA:
    • Summary
    • Predicting/interpreting clusters
      • Cluster 0
      • Cluster 1
      • Cluster 2

Bank Customer Segmentation: Analysis with K-Means, DBSCAN, and OPTICS

1. Introduction

In this study, we will use a dataset containing the credit card usage behavior of approximately 9,000 customers over the last six months. Dividing customers into several behavioral groups is necessary to build an effective and efficient credit card marketing strategy.

The objectives of this notebook are as follows:

- Explore the data set using data visualization techniques.

- Perform data preprocessing before using models.

- Group customers into clusters using different clustering models.

- Interpret and analyze the created groups (profiling).

- Provide marketing suggestions based on the profiling results and analyses.

Code
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.cluster import KMeans, DBSCAN, OPTICS
from sklearn.preprocessing import StandardScaler
from pyclustertend import hopkins
from sklearn.metrics import silhouette_score, calinski_harabasz_score, davies_bouldin_score, silhouette_samples
from scipy.spatial.distance import pdist, squareform
from mpl_toolkits.mplot3d import Axes3D
from yellowbrick.cluster import SilhouetteVisualizer
import umap
from sklearn.neighbors import NearestNeighbors
import matplotlib.cm as cm
Code
url = 'https://raw.githubusercontent.com/EmreToktay/pydash/main/Creditcard.csv'
df = pd.read_csv(url)

2. Initial Review of the Data Set

First, let's get acquainted with the dataset at hand:

Code
print(df.shape)
df.head()
(8950, 19)
CUST_ID BALANCE BALANCE_FREQUENCY PURCHASES ONEOFF_PURCHASES INSTALLMENTS_PURCHASES CASH_ADVANCE PURCHASES_FREQUENCY ONEOFF_PURCHASES_FREQUENCY PURCHASES_INSTALLMENTS_FREQUENCY CASH_ADVANCE_FREQUENCY CASH_ADVANCE_TRX PURCHASES_TRX CREDIT_LIMIT PAYMENTS MINIMUM_PAYMENTS PRC_FULL_PAYMENT TENURE City
0 C10001 40.900749 0.818182 95.40 0.00 95.4 0.000000 0.166667 0.000000 0.083333 0.000000 0 2 1000.0 201.802084 139.509787 0.000000 12 Bursa
1 C10002 3202.467416 0.909091 0.00 0.00 0.0 6442.945483 0.000000 0.000000 0.000000 0.250000 4 0 7000.0 4103.032597 1072.340217 0.222222 12 Şanlıurfa
2 C10003 2495.148862 1.000000 773.17 773.17 0.0 0.000000 1.000000 1.000000 0.000000 0.000000 0 12 7500.0 622.066742 627.284787 0.000000 12 Çorum
3 C10004 1666.670542 0.636364 1499.00 1499.00 0.0 205.788017 0.083333 0.083333 0.000000 0.083333 1 1 7500.0 0.000000 NaN 0.000000 12 Balıkesir
4 C10005 817.714335 1.000000 16.00 16.00 0.0 0.000000 0.083333 0.083333 0.000000 0.000000 0 1 1200.0 678.334763 244.791237 0.000000 12 Batman

Some details regarding the columns:

  • BALANCE: The balance amount remaining on the account for purchases

  • BALANCE_FREQUENCY: How frequently the balance is updated, score between 0 and 1 (1 = frequently updated, 0 = not frequently updated)

  • PURCHASES: The amount of shopping made from the account

  • ONEOFF_PURCHASES: The highest shopping amount made at one time

  • INSTALLMENTS_PURCHASES: The amount of shopping made in installments

  • CASH_ADVANCE: Cash Advance

  • PURCHASES_FREQUENCY: How frequently shopping is done, score between 0 and 1 (1 = shopping is done frequently, 0 = shopping is not done frequently)

  • ONEOFF_PURCHASES_FREQUENCY: How frequently one-off purchases are made (1 = frequently done, 0 = not frequently done)

  • PURCHASES_INSTALLMENTS_FREQUENCY: How frequently installment purchases are made (1 = frequently done, 0 = not frequently done)

  • CASH_ADVANCE_FREQUENCY: How frequently cash advances are taken, score between 0 and 1

  • CASH_ADVANCE_TRX: The number of transactions made with “Cash Advance”

  • PURCHASES_TRX: The number of shopping transactions made

  • CREDIT_LIMIT: The credit card limit of the user

  • PAYMENTS: The amount of payment made by the user

  • MINIMUM_PAYMENTS: The amount of minimum payment made by the user

  • PRC_FULL_PAYMENT: The percentage of full payment made by the user

  • TENURE: The tenure of the user's credit card service

Let's examine the data types:

Code
df.info()

                    <class 'pandas.core.frame.DataFrame'>
                    RangeIndex: 8950 entries, 0 to 8949
                    Data columns (total 19 columns):
                     #   Column                            Non-Null Count  Dtype
                    ---  ------                            --------------  -----
                     0   CUST_ID                           8950 non-null   object
                     1   BALANCE                           8950 non-null   float64
                     2   BALANCE_FREQUENCY                 8950 non-null   float64
                     3   PURCHASES                         8950 non-null   float64
                     4   ONEOFF_PURCHASES                  8950 non-null   float64
                     5   INSTALLMENTS_PURCHASES            8950 non-null   float64
                     6   CASH_ADVANCE                      8950 non-null   float64
                     7   PURCHASES_FREQUENCY               8950 non-null   float64
                     8   ONEOFF_PURCHASES_FREQUENCY        8950 non-null   float64
                     9   PURCHASES_INSTALLMENTS_FREQUENCY  8950 non-null   float64
                     10  CASH_ADVANCE_FREQUENCY            8950 non-null   float64
                     11  CASH_ADVANCE_TRX                  8950 non-null   int64
                     12  PURCHASES_TRX                     8950 non-null   int64
                     13  CREDIT_LIMIT                      8949 non-null   float64
                     14  PAYMENTS                          8950 non-null   float64
                     15  MINIMUM_PAYMENTS                  8637 non-null   float64
                     16  PRC_FULL_PAYMENT                  8950 non-null   float64
                     17  TENURE                            8950 non-null   int64
                     18  City                              8950 non-null   object
                    dtypes: float64(14), int64(3), object(2)
                    memory usage: 1.3+ MB

Basic statistical values for each header:

Code
df.describe()
BALANCE BALANCE_FREQUENCY PURCHASES ONEOFF_PURCHASES INSTALLMENTS_PURCHASES CASH_ADVANCE PURCHASES_FREQUENCY ONEOFF_PURCHASES_FREQUENCY PURCHASES_INSTALLMENTS_FREQUENCY CASH_ADVANCE_FREQUENCY CASH_ADVANCE_TRX PURCHASES_TRX CREDIT_LIMIT PAYMENTS MINIMUM_PAYMENTS PRC_FULL_PAYMENT TENURE
count 8950.000000 8950.000000 8950.000000 8950.000000 8950.000000 8950.000000 8950.000000 8950.000000 8950.000000 8950.000000 8950.000000 8950.000000 8949.000000 8950.000000 8637.000000 8950.000000 8950.000000
mean 1564.474828 0.877271 1003.204834 592.437371 411.067645 978.871112 0.490351 0.202458 0.364437 0.135144 3.248827 14.709832 4494.449450 1733.143852 864.206542 0.153715 11.517318
std 2081.531879 0.236904 2136.634782 1659.887917 904.338115 2097.163877 0.401371 0.298336 0.397448 0.200121 6.824647 24.857649 3638.815725 2895.063757 2372.446607 0.292499 1.338331
min 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 50.000000 0.000000 0.019163 0.000000 6.000000
25% 128.281915 0.888889 39.635000 0.000000 0.000000 0.000000 0.083333 0.000000 0.000000 0.000000 0.000000 1.000000 1600.000000 383.276166 169.123707 0.000000 12.000000
50% 873.385231 1.000000 361.280000 38.000000 89.000000 0.000000 0.500000 0.083333 0.166667 0.000000 0.000000 7.000000 3000.000000 856.901546 312.343947 0.000000 12.000000
75% 2054.140036 1.000000 1110.130000 577.405000 468.637500 1113.821139 0.916667 0.300000 0.750000 0.222222 4.000000 17.000000 6500.000000 1901.134317 825.485459 0.142857 12.000000
max 19043.138560 1.000000 49039.570000 40761.250000 22500.000000 47137.211760 1.000000 1.000000 1.000000 1.500000 123.000000 358.000000 30000.000000 50721.483360 76406.207520 1.000000 12.000000

  • Based on the percentile values of each feature in the table above, it may be concluded that CASH_ADVANCE_TRX, PURCHASES_TRX, and TENURE are not truly continuous. However, this hypothesis should be checked, as in the quick check following this list.

  • Looking at the count of each feature, there are some missing values in the CREDIT_LIMIT and MINIMUM_PAYMENTS columns.

  • Looking at the percentiles, the distributions of some features are heavily skewed and require more detailed analysis.

  • In conclusion, visualizing and analyzing the charts can help us understand the data in more depth.
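
As a quick way to test the first hypothesis above, one can count how many distinct values these columns take; integer dtypes with relatively few unique values suggest discrete counts rather than continuous measures. A minimal sketch, assuming df is loaded as above:

Code
# Discrete count-like features have integer dtypes and relatively few unique values
for col in ['CASH_ADVANCE_TRX', 'PURCHASES_TRX', 'TENURE']:
    print(col, df[col].dtype, df[col].nunique(), 'unique values')

# TENURE in particular takes only a handful of values
print(df['TENURE'].value_counts().sort_index())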

3. Preprocessing

Analysis and proportion of null values in the dataset:

Code
df.isna().mean()*100

                    CUST_ID                             0.000000
                    BALANCE                             0.000000
                    BALANCE_FREQUENCY                   0.000000
                    PURCHASES                           0.000000
                    ONEOFF_PURCHASES                    0.000000
                    INSTALLMENTS_PURCHASES              0.000000
                    CASH_ADVANCE                        0.000000
                    PURCHASES_FREQUENCY                 0.000000
                    ONEOFF_PURCHASES_FREQUENCY          0.000000
                    PURCHASES_INSTALLMENTS_FREQUENCY    0.000000
                    CASH_ADVANCE_FREQUENCY              0.000000
                    CASH_ADVANCE_TRX                    0.000000
                    PURCHASES_TRX                       0.000000
                    CREDIT_LIMIT                        0.011173
                    PAYMENTS                            0.000000
                    MINIMUM_PAYMENTS                    3.497207
                    PRC_FULL_PAYMENT                    0.000000
                    TENURE                              0.000000
                    City                                0.000000
                    dtype: float64

There are missing values in MINIMUM_PAYMENTS and CREDIT_LIMIT; we fill them with the column means so the analysis can continue:

Code
df.loc[df['MINIMUM_PAYMENTS'].isnull(), 'MINIMUM_PAYMENTS'] = df['MINIMUM_PAYMENTS'].mean()
df.loc[df['CREDIT_LIMIT'].isnull(), 'CREDIT_LIMIT'] = df['CREDIT_LIMIT'].mean()
Code
df.isna().mean()*100

                    CUST_ID                             0.0
                    BALANCE                             0.0
                    BALANCE_FREQUENCY                   0.0
                    PURCHASES                           0.0
                    ONEOFF_PURCHASES                    0.0
                    INSTALLMENTS_PURCHASES              0.0
                    CASH_ADVANCE                        0.0
                    PURCHASES_FREQUENCY                 0.0
                    ONEOFF_PURCHASES_FREQUENCY          0.0
                    PURCHASES_INSTALLMENTS_FREQUENCY    0.0
                    CASH_ADVANCE_FREQUENCY              0.0
                    CASH_ADVANCE_TRX                    0.0
                    PURCHASES_TRX                       0.0
                    CREDIT_LIMIT                        0.0
                    PAYMENTS                            0.0
                    MINIMUM_PAYMENTS                    0.0
                    PRC_FULL_PAYMENT                    0.0
                    TENURE                              0.0
                    City                                0.0
                    dtype: float64

We drop two columns that are not needed for customer segmentation; these columns will be added back after the analysis.

Code
df = df.drop(columns=['City', 'CUST_ID'])

4. Exploratory Data Analysis (EDA) 🔍

To visualize the data, I used the Sweetviz library. Sweetviz is a Python data visualization library designed for Exploratory Data Analysis (EDA): it quickly analyzes one or two data frames and produces detailed, visually appealing reports.

Code
import sweetviz as sv

# Analyze the data
report = sv.analyze(df)

# The following line renders the report directly in a Jupyter notebook (optional)
report.show_notebook()

  • Except for PURCHASES_FREQUENCY, none of the features has a balanced distribution. Moreover, even PURCHASES_FREQUENCY is not normally distributed!
  • As mentioned in the previous section, some features are significantly skewed (a quick numeric check follows below). As a result, a more detailed examination is needed to identify outliers, and conventional clustering algorithms may not be efficient enough.
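
To put a number on that skewness before deciding how to handle it, one can compute the sample skewness of each column; a small sketch (values far above 0 indicate a long right tail):

Code
# Sample skewness per feature; strongly right-skewed features show large positive values
print(df.skew(numeric_only=True).sort_values(ascending=False))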

Data scaling is often used to ensure machine learning algorithms operate more effectively. Specifically, many algorithms can produce misleading results due to different scales among features. Therefore, it's important to bring all features to a similar scale.

The scaling method we used is StandardScaler. StandardScaler scales each feature to have a mean of 0 and a standard deviation of 1. This is also known as z-score normalization.

Code
scaler = StandardScaler()
df_scaled = scaler.fit_transform(df)

With this code, our data has been successfully scaled and is ready for model training or other analyses.
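
As a quick sanity check, the scaled array should now have per-feature means close to 0 and standard deviations close to 1:

Code
# Each column of df_scaled should have mean ~0 and standard deviation ~1 after StandardScaler
print(np.round(df_scaled.mean(axis=0), 3))
print(np.round(df_scaled.std(axis=0), 3))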

Hopkins Statistic

Before starting the clustering analysis, it is important to check whether the dataset is suitable for clustering. For this purpose, the Hopkins Statistic is used to check whether the dataset is randomly distributed.

The Hopkins Statistic is a test that measures how suitable a dataset is for clustering. A value close to 0.5 indicates that the data has a random distribution, and therefore is not suitable for clustering. On the other hand, if the value is greater than 0.7, the dataset is likely suitable for clustering.

Code
hopkins_score = hopkins(df_scaled, sampling_size=len(df_scaled))
print(hopkins_score)
0.034992587280613066

According to this result, a value of roughly 0.035 is quite low for the Hopkins Statistic, indicating that the dataset does not show a clear cluster structure and thus may not be suitable for clustering.

Having previously looked at the overall structure of the dataset, we saw heavy outliers and skewness, so computing the Hopkins score over the entire dataset could be misleading. Therefore, I applied the Hopkins test again on a random 10% sample of the data:

Code
import numpy as np
from random import sample
from sklearn.neighbors import NearestNeighbors
from numpy.random import uniform
from math import isnan

def hopkins(X):
    d = X.shape[1]
    n = len(X)
    m = int(0.1 * n)
    nbrs = NearestNeighbors(n_neighbors=1).fit(X)

    rand_X = sample(range(0, n, 1), m)

    ujd = []
    wjd = []
    for j in range(0, m):
        u_dist, _ = nbrs.kneighbors(uniform(np.amin(X, axis=0), np.amax(X, axis=0), d).reshape(1, -1), 2, return_distance=True)
        ujd.append(u_dist[0][1])
        w_dist, _ = nbrs.kneighbors(X[rand_X[j]].reshape(1, -1), 2, return_distance=True)
        wjd.append(w_dist[0][1])

    H = sum(ujd) / (sum(ujd) + sum(wjd))
    if isnan(H):
        print(ujd, wjd)
        H = 0

    return H

# Run the Hopkins test
hopkins_value = hopkins(df_scaled)
hopkins_result = 'Result: {:.4f}'.format(hopkins_value)
print('.: Hopkins Test :.')
print(hopkins_result)
if 0.7 < hopkins_value < 0.99:
    print('>> According to the result above, there is a strong clustering tendency (the data contains meaningful clusters)')
    print('.:. Conclusion: H0 accepted .:.')
else:
    print('>> According to the result above, there are no meaningful clusters')
    print('.:. Conclusion: H0 rejected .:.')
.: Hopkins Test :.
                    Result: 0.9672
                    >> According to the result above, there is a strong clustering tendency (the data contains meaningful clusters)
                    .:. Conclusion: H0 accepted .:.

The first Hopkins test, run on the entire dataset, produced a low value for the data's clustering tendency relative to a random distribution. This might indicate that, at first glance, the data does not contain distinct clusters or that separating the clusters will be challenging.

The Hopkins test conducted on a random 10% subset, by contrast, showed that the data has a high tendency for clustering.

Consequently, when analyzing large datasets, there is value in running the test both on the whole dataset and on random subsets; together, the two results give a more complete picture of the dataset's overall structure and clustering tendency.
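
Since a single random sample can be unrepresentative, one way to make the subset check more robust is to repeat it on several random 10% samples and look at the spread of the scores. A sketch using the hopkins function defined above (the sample size and number of repeats are arbitrary choices):

Code
# Repeat the Hopkins test on several random 10% subsets to gauge how stable the score is
rng = np.random.default_rng(42)
subset_scores = []
for _ in range(5):
    idx = rng.choice(len(df_scaled), size=int(0.1 * len(df_scaled)), replace=False)
    subset_scores.append(hopkins(df_scaled[idx]))
print('Hopkins scores on random 10% subsets:', [round(s, 4) for s in subset_scores])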

Next, let's examine the percentage of outliers in each feature of the dataset. To detect outliers, we use the IQR (Interquartile Range) method:

Code
outlier_percentage = {}
for feature in df:
    tempData = df.sort_values(by=feature)[feature]
    Q1, Q3 = tempData.quantile([0.25, 0.75])
    IQR = Q3 - Q1
    Lower_range = Q1 - (1.5 * IQR)
    Upper_range = Q3 + (1.5 * IQR)

    outlier_count = ((tempData < Lower_range) | (tempData > Upper_range)).sum()
    outlier_perc = round((outlier_count / tempData.shape[0]) * 100, 2)
    outlier_percentage[feature] = outlier_perc

outlier_percentage

                     {'BALANCE': 7.77,
                     'BALANCE_FREQUENCY': 16.68,
                     'PURCHASES': 9.03,
                     'ONEOFF_PURCHASES': 11.32,
                     'INSTALLMENTS_PURCHASES': 9.69,
                     'CASH_ADVANCE': 11.51,
                     'PURCHASES_FREQUENCY': 0.0,
                     'ONEOFF_PURCHASES_FREQUENCY': 8.74,
                     'PURCHASES_INSTALLMENTS_FREQUENCY': 0.0,
                     'CASH_ADVANCE_FREQUENCY': 5.87,
                     'CASH_ADVANCE_TRX': 8.98,
                     'PURCHASES_TRX': 8.56,
                     'CREDIT_LIMIT': 2.77,
                     'PAYMENTS': 9.03,
                     'MINIMUM_PAYMENTS': 8.65,
                     'PRC_FULL_PAYMENT': 16.47,
                     'TENURE': 15.26}

According to these results, the percentage of outliers is quite high in some features, for instance BALANCE_FREQUENCY, PRC_FULL_PAYMENT, and TENURE; we need to evaluate how these features affect the dataset and how to handle their outliers. On the other hand, some features, such as PURCHASES_FREQUENCY and PURCHASES_INSTALLMENTS_FREQUENCY, contain no outliers at all.

Outliers can have adverse effects on our analyses and models, so it is important to develop a strategy for handling them. Such a strategy could be deletion, transformation, or some other treatment; one such alternative is sketched below.

It is not always necessary to deal with outliers, and in some cases doing so can even be misleading. However, for the subsequent analyses I also created a dataset in which the outliers have been treated, which gives us the opportunity to compare different results.
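
One such alternative, not used in the rest of this notebook (which instead bins the features into ranges below), is to cap each feature at its IQR fences on a copy of the data. A minimal sketch; df_capped is a hypothetical frame introduced only for illustration:

Code
# Winsorize: clip every feature to [Q1 - 1.5*IQR, Q3 + 1.5*IQR] instead of deleting outliers
df_capped = df.copy()
for feature in df_capped.columns:
    Q1, Q3 = df_capped[feature].quantile([0.25, 0.75])
    IQR = Q3 - Q1
    df_capped[feature] = df_capped[feature].clip(lower=Q1 - 1.5 * IQR, upper=Q3 + 1.5 * IQR)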

Code
def preprocess_data(data):
    feature_boundaries = {
        'BALANCE': [0, 500, 1000, 3000, 5000, 10000],
        'PURCHASES': [0, 500, 1000, 3000, 5000, 10000],
        'ONEOFF_PURCHASES': [0, 500, 1000, 3000, 5000, 10000],
        'INSTALLMENTS_PURCHASES': [0, 500, 1000, 3000, 5000, 10000],
        'CASH_ADVANCE': [0, 500, 1000, 3000, 5000, 10000],
        'CREDIT_LIMIT': [0, 500, 1000, 3000, 5000, 10000],
        'PAYMENTS': [0, 500, 1000, 3000, 5000, 10000],
        'MINIMUM_PAYMENTS': [0, 500, 1000, 3000, 5000, 10000]
    }

    frequency_boundaries = {
        'BALANCE_FREQUENCY': [0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9],
        'PURCHASES_FREQUENCY': [0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9],
        'ONEOFF_PURCHASES_FREQUENCY': [0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9],
        'PURCHASES_INSTALLMENTS_FREQUENCY': [0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9],
        'CASH_ADVANCE_FREQUENCY': [0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9],
        'PRC_FULL_PAYMENT': [0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
    }

    trx_boundaries = {
        'PURCHASES_TRX': [0, 5, 10, 15, 20, 30, 50, 100],
        'CASH_ADVANCE_TRX': [0, 5, 10, 15, 20, 30, 50, 100]
    }

    def assign_range(column, boundaries):
        # Map each value to the 1-based index of the boundary interval it falls into
        new_column = column + '_RANGE'
        data[new_column] = 0
        for idx, boundary in enumerate(boundaries):
            if idx == len(boundaries) - 1:
                data.loc[data[column] > boundary, new_column] = idx + 1
            else:
                data.loc[(data[column] > boundary) & (data[column] <= boundaries[idx + 1]), new_column] = idx + 1

    for column, boundaries in feature_boundaries.items():
        assign_range(column, boundaries)

    for column, boundaries in frequency_boundaries.items():
        assign_range(column, boundaries)

    for column, boundaries in trx_boundaries.items():
        assign_range(column, boundaries)

    columns_to_drop = [
        'BALANCE', 'BALANCE_FREQUENCY', 'PURCHASES',
        'ONEOFF_PURCHASES', 'INSTALLMENTS_PURCHASES', 'CASH_ADVANCE',
        'PURCHASES_FREQUENCY', 'ONEOFF_PURCHASES_FREQUENCY',
        'PURCHASES_INSTALLMENTS_FREQUENCY', 'CASH_ADVANCE_FREQUENCY',
        'CASH_ADVANCE_TRX', 'PURCHASES_TRX', 'CREDIT_LIMIT', 'PAYMENTS',
        'MINIMUM_PAYMENTS', 'PRC_FULL_PAYMENT'
    ]
    # Drop the original numeric columns from the (mutated) input frame and return the binned version
    df_wo = data.drop(columns=columns_to_drop)

    return df_wo
                    
Code
df_wo = preprocess_data(df)

This preprocessing function converts selected columns into categorical ranges. Using predefined boundaries, new _RANGE features are derived, and the original numeric columns are then dropped.

Feature boundaries: value ranges are defined for certain columns. For example, for the BALANCE column the boundaries are 0, 500, 1000, 3000, 5000, 10000; they determine which range each value of the feature falls into.

Applying the boundaries: a helper function named assign_range is applied to each column. It converts the values of a column into a categorical range column based on the defined boundaries. For example, an observation with a BALANCE of 750 gets the value 2 in BALANCE_RANGE, since 750 falls between 500 and 1000 (a quick check follows below).

Removing redundant columns: once the columns have been converted into categorical ranges, the original numeric columns become redundant and are removed from the dataset. In conclusion, this function converts certain columns into categorical ranges and drops the original numeric columns afterwards.
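
To confirm the BALANCE example above: since preprocess_data (as written) adds the _RANGE columns to df itself, rows whose BALANCE lies between 500 and 1000 should all carry a BALANCE_RANGE of 2. A quick check:

Code
# Values in (500, 1000] should have been assigned to range bucket 2
mask = (df['BALANCE'] > 500) & (df['BALANCE'] <= 1000)
print(df.loc[mask, ['BALANCE', 'BALANCE_RANGE']].head())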

5. Clustering Models

In this section, we will try different clustering methods and methodologies to find the most suitable one. For the clustering models, I will use K-Means, DBSCAN, and OPTICS. For K-Means, I will use the unprocessed dataset, the outlier-adjusted dataset, and the PCA- and UMAP-reduced datasets. Since DBSCAN and OPTICS are, by nature, not strongly affected by outliers, their results will be obtained on the normal, PCA, and UMAP datasets.

Code
results = {
    "KMeans": {
        "normal": {},
        "umap": {},
        "pca": {},
        "outlier_free": {}
    },
    "DBSCAN": {
        "normal": {},
        "umap": {},
        "pca": {}
    },
    "OPTICS": {
        "normal": {},
        "umap": {},
        "pca": {}
    }
}

This code constructs a structure to store the results and performance metrics of the different clustering algorithms. Specifically, we will evaluate the KMeans, DBSCAN, and OPTICS algorithms, and we will test them on the normal, UMAP-reduced, PCA-reduced, and (for KMeans) outlier-free datasets. The dictionary mirrors the roadmap of the analysis. Briefly:

KMeans: It is an algorithm to cluster the data into a certain "k" number of clusters.

DBSCAN: It is a density-based clustering algorithm and does not require specifying the number of clusters in advance.

OPTICS: It is a generalized version of DBSCAN, a density-based clustering algorithm.

UMAP: It is a dimension reduction method used to reduce the data to a lower-dimensional space.

PCA: It is the abbreviation for Principal Component Analysis and is used to represent the data with fewer features.

Silhouette Coefficient: A metric measuring the internal cohesion of a cluster and the separation between clusters.

Calinski-Harabasz Index: A score based on the ratio of between-cluster dispersion to within-cluster dispersion; higher values indicate better-defined clusters.

Davies-Bouldin Index: It measures the ratio of intra-cluster similarities to inter-cluster similarities. Lower values indicate better clustering.
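
Since these same three metrics are computed after every model below, a small helper like the following could reduce repetition; this is just a sketch, and the notebook keeps computing them inline instead:

Code
def clustering_scores(X, labels):
    # The three internal validation metrics used throughout this analysis
    return {
        "Silhouette Coefficient": silhouette_score(X, labels),
        "Calinski-Harabasz Index": calinski_harabasz_score(X, labels),
        "Davies-Bouldin Index": davies_bouldin_score(X, labels),
    }

# Hypothetical usage: results["KMeans"]["normal"] = clustering_scores(df_scaled, labels)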

5.1 K-Means

K-Means is a clustering algorithm used to divide the data into 'k' clusters. The algorithm starts by initializing the cluster centers randomly and then iteratively updates each center to be the mean of the data points assigned to it. This process continues until a stopping criterion is met (e.g., the centers no longer move). The elbow method and the silhouette score were used to support the K-Means process.

Elbow Method: One of the toughest decisions for the K-Means algorithm is how many clusters the data should be divided into. The elbow method is a common technique for this decision. It involves calculating the total within-cluster sum of squares (WCSS) for different values of k. As k increases, WCSS decreases; however, after a certain k the reduction becomes insignificant. This point, called the "elbow", helps us determine the optimal k.

Silhouette Score: The silhouette score measures how well the clusters are defined. It takes values between -1 and 1. A value close to 1 indicates that points are similar to their own cluster and different from other clusters, while a value close to -1 indicates that points have been assigned to the wrong cluster. A value around 0 indicates overlapping or poorly separated clusters.

5.1.1 K-Means Normal

Code
k_values = range(1, 11)
wcss = []
silhouette_scores = []

for k in k_values:
    kmeans = KMeans(n_clusters=k, init='k-means++', max_iter=300, n_init=10, random_state=0)
    kmeans.fit(df_scaled)
    wcss.append(kmeans.inertia_)
    if k > 1:
        silhouette_scores.append(silhouette_score(df_scaled, kmeans.labels_))
    else:
        silhouette_scores.append(0)

fig, ax1 = plt.subplots(figsize=(12, 7))

ax1.set_xlabel('Number of Clusters (k)')
ax1.set_ylabel('WCSS', color='tab:blue')
ax1.plot(k_values, wcss, 'o-', color='tab:blue')
ax1.tick_params(axis='y', labelcolor='tab:blue')

ax2 = ax1.twinx()
ax2.set_ylabel('Silhouette Score', color='tab:orange')
ax2.plot(k_values, silhouette_scores, 'o-', color='tab:orange')
ax2.tick_params(axis='y', labelcolor='tab:orange')

fig.tight_layout()
plt.title('Elbow Method and Silhouette Score')
plt.show()

Evaluation of the Graph

Consider the following when evaluating the graph:

Elbow Point: When examining the WCSS graph (blue line), look for the k value where WCSS does not decrease significantly. This point can serve as a clue to determine the optimal k value.

Silhouette Score: When examining the silhouette score graph (orange line), look for the k value with the highest score. A high silhouette score indicates well-defined clusters.

By using these two measurements together, you can determine the most suitable k value for the dataset.

Code
kmeans = KMeans(n_clusters=4, init='k-means++', max_iter=300, n_init=10, random_state=0)
kmeans.fit(df_scaled)
labels = kmeans.labels_

fig = plt.figure(figsize=(10, 8))
ax = fig.add_subplot(111, projection='3d')
colors = ['#1f77b4', '#ff7f0e', '#2ca02c', '#d62728']

for i, color in enumerate(colors):
    ax.scatter(df_scaled[labels == i, 0], df_scaled[labels == i, 1], df_scaled[labels == i, 2],
               c=color, label=f'Cluster {i+1}', s=50)

ax.set_title("3D Scatter Plot of Clusters")
ax.set_xlabel("Feature 1")
ax.set_ylabel("Feature 2")
ax.set_zlabel("Feature 3")
ax.legend()
plt.show()

silhouette_vals = silhouette_samples(df_scaled, labels)

fig, ax = plt.subplots(1, 1, figsize=(8, 6))
y_lower = 10

for i in range(4):
    ith_cluster_silhouette_values = silhouette_vals[labels == i]
    ith_cluster_silhouette_values.sort()

    size_cluster_i = ith_cluster_silhouette_values.shape[0]
    y_upper = y_lower + size_cluster_i

    color = cm.nipy_spectral(float(i) / 4)
    ax.fill_betweenx(np.arange(y_lower, y_upper),
                     0, ith_cluster_silhouette_values,
                     facecolor=color, edgecolor=color, alpha=0.7)

    ax.text(-0.05, y_lower + 0.5 * size_cluster_i, str(i+1))

    y_lower = y_upper + 10

ax.set_title("Silhouette Plot for the Clusters")
ax.set_xlabel("Silhouette Coefficient Values")
ax.set_ylabel("Cluster Label")
ax.set_yticks([])
ax.axvline(x=silhouette_score(df_scaled, labels), color="red", linestyle="--")
plt.show()

cluster_counts = np.bincount(labels)
total_count = len(labels)
percentages = (cluster_counts / total_count) * 100

plt.figure(figsize=(8, 8))
plt.pie(percentages, labels=[f'Cluster {i+1}' for i in range(4)], colors=colors, autopct='%1.1f%%',
        shadow=True, startangle=140)
plt.title("Percentage Distribution of Clusters")
plt.show()

1. 3D Cluster Distribution Graph: This graph shows how the clusters are distributed across three dimensions of the scaled dataset. Each color represents a different cluster.

Observation: If the clusters are distinctly separated from each other, it indicates that the K-Means algorithm has clustered the data well.

2. Silhouette Graph: The silhouette graph evaluates how close each data point is to the other points in its own cluster and how far it is from the nearest other cluster.

Bands: Each band represents a specific cluster. The thickness of a band reflects the number of data points in that cluster, while its horizontal extent shows the silhouette values of those points.

Red Line: The average silhouette score. Clusters whose values reach or exceed this line can be considered well defined.

Observation: If the bands reach or exceed the red line and their thicknesses are similar, the clusters are balanced and well defined.

3. Cluster Distribution Percentages Graph: This pie chart shows the percentage of data points falling into each cluster.

Observation: If the slices are of similar size, the clusters are evenly balanced. If one or more slices are significantly larger or smaller than the others, certain clusters contain many more or fewer data points than the rest.

Code
kmeans = KMeans(n_clusters=3)  
                    labels = kmeans.fit_predict(df_scaled)  
                    
                    
                    results["KMeans"]["normal"]["Silhouette Coefficient"] = silhouette_score(df_scaled, labels)
                    results["KMeans"]["normal"]["Calinski-Harabasz Index"] = calinski_harabasz_score(df_scaled, labels)
                    results["KMeans"]["normal"]["Davies-Bouldin Index"] = davies_bouldin_score(df_scaled, labels)
                    
                    for metric, value in results["KMeans"]["normal"].items():
                        print(f"{metric}: {value:.2f}")
Silhouette Coefficient: 0.25
Calinski-Harabasz Index: 1604.40
Davies-Bouldin Index: 1.60

Evaluation of Results: The meaning of the metrics printed above, and how to evaluate them:

Silhouette Coefficient: Ranges between -1 and 1. A value closer to 1 indicates well-defined clusters, a value closer to -1 indicates wrongly assigned points, and a value near 0 indicates clusters that are close together or overlapping.

Calinski-Harabasz Index: A higher value indicates better-defined clusters. The criterion measures the ratio of between-cluster dispersion to within-cluster dispersion.

Davies-Bouldin Index: Lower values indicate better-defined clusters. The index measures the average similarity between each cluster and its most similar cluster.

By using these metrics, you can compare the quality of clustering results obtained with different algorithms, or with different parameters of the same algorithm, and decide which approach suits your data best.

I will share the general evaluation of these results at the end of the analysis.
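
For that final comparison, one convenient option is to flatten the nested results dictionary into a single table once all the models have been run; a sketch of how that could look:

Code
# Flatten results into a (model, dataset) x metric table for side-by-side comparison
rows = []
for model, variants in results.items():
    for variant, metrics in variants.items():
        if metrics:  # skip variants that have not been computed yet
            rows.append({"Model": model, "Data": variant, **metrics})
print(pd.DataFrame(rows))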

5.1.2 Outlier-Adjusted K-Means

Code
df_wo = preprocess_data(df)
Code
k_values = range(1, 11)
wcss = []
silhouette_scores = []

for k in k_values:
    kmeans = KMeans(n_clusters=k, init='k-means++', max_iter=300, n_init=10, random_state=0)
    kmeans.fit(df_wo)
    wcss.append(kmeans.inertia_)
    if k > 1:
        silhouette_scores.append(silhouette_score(df_wo, kmeans.labels_))
    else:
        silhouette_scores.append(0)

fig, ax1 = plt.subplots(figsize=(12, 7))

ax1.set_xlabel('Number of Clusters (k)')
ax1.set_ylabel('WCSS', color='tab:blue')
ax1.plot(k_values, wcss, 'o-', color='tab:blue')
ax1.tick_params(axis='y', labelcolor='tab:blue')

ax2 = ax1.twinx()
ax2.set_ylabel('Silhouette Score', color='tab:orange')
ax2.plot(k_values, silhouette_scores, 'o-', color='tab:orange')
ax2.tick_params(axis='y', labelcolor='tab:orange')

fig.tight_layout()
plt.title('Elbow Method and Silhouette Score')
plt.show()

Code

kmeans = KMeans(n_clusters=3, init='k-means++', max_iter=300, n_init=10, random_state=0)
kmeans.fit(df_wo)
labels = kmeans.labels_

fig = plt.figure(figsize=(10, 8))
ax = fig.add_subplot(111, projection='3d')
colors = ['#1f77b4', '#ff7f0e', '#2ca02c']

for i, color in enumerate(colors):
    ax.scatter(df_wo[labels == i].iloc[:, 0], df_wo[labels == i].iloc[:, 1], df_wo[labels == i].iloc[:, 2],
               c=color, label=f'Cluster {i+1}', s=50)

ax.set_title("3D Scatter Plot of Clusters")
ax.set_xlabel("Feature 1")
ax.set_ylabel("Feature 2")
ax.set_zlabel("Feature 3")
ax.legend()
plt.show()

silhouette_vals = silhouette_samples(df_wo, labels)

fig, ax = plt.subplots(1, 1, figsize=(8, 6))
y_lower = 10

for i in range(3):
    ith_cluster_silhouette_values = silhouette_vals[labels == i]
    ith_cluster_silhouette_values.sort()

    size_cluster_i = ith_cluster_silhouette_values.shape[0]
    y_upper = y_lower + size_cluster_i

    color = cm.nipy_spectral(float(i) / 3)
    ax.fill_betweenx(np.arange(y_lower, y_upper),
                     0, ith_cluster_silhouette_values,
                     facecolor=color, edgecolor=color, alpha=0.7)

    ax.text(-0.05, y_lower + 0.5 * size_cluster_i, str(i+1))

    y_lower = y_upper + 10

ax.set_title("Silhouette Plot for the Clusters")
ax.set_xlabel("Silhouette Coefficient Values")
ax.set_ylabel("Cluster Label")
ax.set_yticks([])
ax.axvline(x=silhouette_score(df_wo, labels), color="red", linestyle="--")
plt.show()

cluster_counts = np.bincount(labels)
total_count = len(labels)
percentages = (cluster_counts / total_count) * 100

plt.figure(figsize=(8, 8))
plt.pie(percentages, labels=[f'Cluster {i+1}' for i in range(3)], colors=colors, autopct='%1.1f%%',
        shadow=True, startangle=140)
plt.title("Percentage Distribution of Clusters")
plt.show()

Code
kmeans = KMeans(n_clusters=3, init='k-means++', max_iter=300, n_init=10, random_state=0)
labels = kmeans.fit_predict(df_wo)

results["KMeans"]["outlier_free"]["Silhouette Coefficient"] = silhouette_score(df_wo, labels)
results["KMeans"]["outlier_free"]["Calinski-Harabasz Index"] = calinski_harabasz_score(df_wo, labels)
results["KMeans"]["outlier_free"]["Davies-Bouldin Index"] = davies_bouldin_score(df_wo, labels)

for metric, value in results["KMeans"]["outlier_free"].items():
    print(f"{metric}: {value:.2f}")
Silhouette Coefficient: 0.32
Calinski-Harabasz Index: 4345.55
Davies-Bouldin Index: 1.42

5.1.3 K-Means UMAP

UMAP (Uniform Manifold Approximation and Projection)

UMAP (Uniform Manifold Approximation and Projection) is a modern dimensionality reduction technique used for visualizing high-dimensional data sets in a lower-dimensional space. It is known for being fast and scalable, especially for large datasets.

Preservation of Local and Global Structure: UMAP aims to preserve both the local and global structures of the data. This is beneficial, especially for complex data structures.

General Applications: UMAP is not only used for visualization but also for general dimensionality reduction applications.

One of the biggest advantages of UMAP is that it operates faster and has more general-purpose features compared to other dimensionality reduction methods like t-SNE.
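
The balance between local and global structure is controlled mainly by the n_neighbors and min_dist parameters. The projection below uses the library defaults; a parameterized call would look roughly like this (the values are illustrative, not what is run here):

Code
# Larger n_neighbors emphasizes global structure; smaller min_dist packs similar points more tightly
reducer_tuned = umap.UMAP(n_neighbors=30, min_dist=0.1, n_components=2, random_state=42)
# embedding_tuned = reducer_tuned.fit_transform(df_scaled)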

Code
reducer = umap.UMAP()

embedding = reducer.fit_transform(df_scaled)

df_umap = pd.DataFrame(embedding, columns=['UMAP 1', 'UMAP 2'])

plt.figure(figsize=(12, 8))
plt.scatter(df_umap['UMAP 1'], df_umap['UMAP 2'], cmap='Spectral', s=5)
plt.gca().set_aspect('equal', 'datalim')
plt.colorbar(boundaries=np.arange(11)-0.5).set_ticks(np.arange(10))
plt.title('UMAP Projection', fontsize=24)
plt.show()

Code
k_values = range(1, 11)
wcss = []
silhouette_scores = []

for k in k_values:
    kmeans = KMeans(n_clusters=k, init='k-means++', max_iter=300, n_init=10, random_state=0)
    kmeans.fit(df_umap)
    wcss.append(kmeans.inertia_)
    if k > 1:
        silhouette_scores.append(silhouette_score(df_umap, kmeans.labels_))
    else:
        silhouette_scores.append(0)

fig, ax1 = plt.subplots(figsize=(12, 7))

ax1.set_xlabel('Number of Clusters (k)')
ax1.set_ylabel('WCSS', color='tab:blue')
ax1.plot(k_values, wcss, 'o-', color='tab:blue')
ax1.tick_params(axis='y', labelcolor='tab:blue')

ax2 = ax1.twinx()
ax2.set_ylabel('Silhouette Score', color='tab:orange')
ax2.plot(k_values, silhouette_scores, 'o-', color='tab:orange')
ax2.tick_params(axis='y', labelcolor='tab:orange')

fig.tight_layout()
plt.title('Elbow Method and Silhouette Score')
plt.show()

Code

kmeans = KMeans(n_clusters=2, init='k-means++', max_iter=300, n_init=10, random_state=0)
kmeans.fit(df_umap)
labels = kmeans.labels_
umap_array = df_umap.values

fig = plt.figure(figsize=(10, 8))
colors = ['#1f77b4', '#ff7f0e']

for i, color in enumerate(colors):
    plt.scatter(umap_array[labels == i, 0], umap_array[labels == i, 1],
                c=color, label=f'Cluster {i+1}', s=50)

plt.title("2D Scatter Plot of Clusters")
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.legend()
plt.show()

silhouette_vals = silhouette_samples(df_umap, labels)

fig, ax = plt.subplots(1, 1, figsize=(8, 6))
y_lower = 10

for i in range(2):
    ith_cluster_silhouette_values = silhouette_vals[labels == i]
    ith_cluster_silhouette_values.sort()

    size_cluster_i = ith_cluster_silhouette_values.shape[0]
    y_upper = y_lower + size_cluster_i

    color = cm.nipy_spectral(float(i) / 2)
    ax.fill_betweenx(np.arange(y_lower, y_upper),
                     0, ith_cluster_silhouette_values,
                     facecolor=color, edgecolor=color, alpha=0.7)

    ax.text(-0.05, y_lower + 0.5 * size_cluster_i, str(i+1))

    y_lower = y_upper + 10

ax.set_title("Silhouette Plot for the Clusters")
ax.set_xlabel("Silhouette Coefficient Values")
ax.set_ylabel("Cluster Label")
ax.set_yticks([])
# Average silhouette score of the UMAP-based clustering
ax.axvline(x=silhouette_score(df_umap, labels), color="red", linestyle="--")
plt.show()
                    
                    

Code
kmeans_umap = KMeans(n_clusters=2, init='k-means++', max_iter=300, n_init=10, random_state=0)
labels_umap = kmeans_umap.fit_predict(df_umap)

results["KMeans"]["umap"]["Silhouette Coefficient"] = silhouette_score(df_umap, labels_umap)
results["KMeans"]["umap"]["Calinski-Harabasz Index"] = calinski_harabasz_score(df_umap, labels_umap)
results["KMeans"]["umap"]["Davies-Bouldin Index"] = davies_bouldin_score(df_umap, labels_umap)

for metric, value in results["KMeans"]["umap"].items():
    print(f"{metric}: {value:.2f}")
Silhouette Coefficient: 0.42
Calinski-Harabasz Index: 8208.00
Davies-Bouldin Index: 0.95

5.1.4 K-Means PCA

PCA (Principal Component Analysis)

PCA (Principal Component Analysis) is a statistical method used for dimensionality reduction in multivariate data sets by maximizing the variance among variables. Principal components are selected orthogonally to capture the main structure of the data.

Variance Maximization: PCA selects the principal components to capture the maximum variance in the data set.

Low-Dimensional Representation: It represents the data set with fewer features, making it useful for visualization and modeling.

Explained Variance: It provides information on how much variance each component explains in the data set, thus making it easier to select the most important components.

PCA is a popular dimensionality reduction method frequently used in data science and machine learning applications.
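
Before fixing the number of components at two (as done below), it can help to look at how the cumulative explained variance grows with the number of components; a quick sketch:

Code
from sklearn.decomposition import PCA

# Cumulative explained variance as a function of the number of principal components
pca_full = PCA().fit(df_scaled)
cumvar = np.cumsum(pca_full.explained_variance_ratio_)
plt.plot(range(1, len(cumvar) + 1), cumvar, 'o-')
plt.xlabel('Number of components')
plt.ylabel('Cumulative explained variance')
plt.title('PCA Explained Variance')
plt.show()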

Code
from sklearn.decomposition import PCA

pca = PCA(n_components=2)
df_pca = pca.fit_transform(df_scaled)

print(df_pca.shape)

explained_variance = pca.explained_variance_ratio_
print(f"Explained Variance (first two components): {100 * explained_variance.sum():.2f}%")
(8950, 2)
Explained Variance (first two components): 47.59%
Code
k_values = range(1, 11)
wcss = []
silhouette_scores = []

for k in k_values:
    kmeans = KMeans(n_clusters=k, init='k-means++', max_iter=300, n_init=10, random_state=0)
    kmeans.fit(df_pca)
    wcss.append(kmeans.inertia_)
    if k > 1:
        silhouette_scores.append(silhouette_score(df_pca, kmeans.labels_))
    else:
        silhouette_scores.append(0)

fig, ax1 = plt.subplots(figsize=(12, 7))

ax1.set_xlabel('Number of Clusters (k)')
ax1.set_ylabel('WCSS', color='tab:blue')
ax1.plot(k_values, wcss, 'o-', color='tab:blue')
ax1.tick_params(axis='y', labelcolor='tab:blue')

ax2 = ax1.twinx()
ax2.set_ylabel('Silhouette Score', color='tab:orange')
ax2.plot(k_values, silhouette_scores, 'o-', color='tab:orange')
ax2.tick_params(axis='y', labelcolor='tab:orange')

fig.tight_layout()
plt.title('Elbow Method and Silhouette Score')
plt.show()

Code
kmeans = KMeans(n_clusters=3, init='k-means++', max_iter=300, n_init=10, random_state=0)
kmeans.fit(df_pca)
kmeans_labels_pca = kmeans.labels_

fig = plt.figure(figsize=(10, 8))
colors = ['#1f77b4', '#ff7f0e', '#2ca02c']

for i, color in enumerate(colors):
    plt.scatter(df_pca[kmeans_labels_pca == i, 0], df_pca[kmeans_labels_pca == i, 1],
                c=color, label=f'Cluster {i+1}', s=50)

plt.title("2D Scatter Plot of Clusters")
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.legend()
plt.show()

silhouette_vals = silhouette_samples(df_pca, kmeans_labels_pca)

fig, ax = plt.subplots(1, 1, figsize=(8, 6))
y_lower = 10

for i in range(3):
    ith_cluster_silhouette_values = silhouette_vals[kmeans_labels_pca == i]
    ith_cluster_silhouette_values.sort()

    size_cluster_i = ith_cluster_silhouette_values.shape[0]
    y_upper = y_lower + size_cluster_i

    color = cm.nipy_spectral(float(i) / 3)
    ax.fill_betweenx(np.arange(y_lower, y_upper),
                     0, ith_cluster_silhouette_values,
                     facecolor=color, edgecolor=color, alpha=0.7)

    ax.text(-0.05, y_lower + 0.5 * size_cluster_i, str(i+1))

    y_lower = y_upper + 10

ax.set_title("Silhouette Plot for the Clusters")
ax.set_xlabel("Silhouette Coefficient Values")
ax.set_ylabel("Cluster Label")
ax.set_yticks([])
# Average silhouette score of the PCA-based clustering
ax.axvline(x=silhouette_score(df_pca, kmeans_labels_pca), color="red", linestyle="--")
plt.show()

Code
kmeans_pca = KMeans(n_clusters=3, init='k-means++', max_iter=300, n_init=10, random_state=0)
kmeans_labels_pca = kmeans_pca.fit_predict(df_pca)

results["KMeans"]["pca"]["Silhouette Coefficient"] = silhouette_score(df_pca, kmeans_labels_pca)
results["KMeans"]["pca"]["Calinski-Harabasz Index"] = calinski_harabasz_score(df_pca, kmeans_labels_pca)
results["KMeans"]["pca"]["Davies-Bouldin Index"] = davies_bouldin_score(df_pca, kmeans_labels_pca)

for metric, value in results["KMeans"]["pca"].items():
    print(f"{metric}: {value:.2f}")
Silhouette Coefficient: 0.45
Calinski-Harabasz Index: 5337.49
Davies-Bouldin Index: 0.81

5.2 DBSCAN

DBSCAN (Density-Based Spatial Clustering of Applications with Noise)

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a density-based clustering algorithm. It identifies dense regions of data points and assigns points in these regions to the same cluster. Points in non-dense regions are classified as noise.

Working Principle:

- A neighborhood of a certain radius (eps) is defined around each data point. If there are at least min_samples data points within this neighborhood, the point is considered part of a dense region.
- Dense regions are merged to form larger clusters.
- Points outside dense regions that do not have enough neighbors in their vicinity are labeled as noise.

Advantages:

- It makes no assumptions about the shapes of clusters, so clusters do not have to be circular.
- It can automatically distinguish noise.
- The number of clusters does not have to be specified beforehand.

Challenges:

- It may struggle to accurately separate clusters with different densities.
- The eps and min_samples parameters need to be tuned properly.

DBSCAN is a suitable option especially for noisy datasets and clusters with complex structures.
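
To get a feel for how eps, min_samples, and the noise label interact before applying DBSCAN to the bank data, here is a small sketch on a synthetic two-moons dataset (the parameter values are illustrative):

Code
from sklearn.datasets import make_moons

# On two interleaved crescents, DBSCAN should recover the shapes and flag sparse points as noise (-1)
X_toy, _ = make_moons(n_samples=500, noise=0.05, random_state=0)
toy_labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X_toy)
print('Clusters found:', len(set(toy_labels)) - (1 if -1 in toy_labels else 0))
print('Noise points:', (toy_labels == -1).sum())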

5.2.1 DBSCAN Normal

Code
X = df_scaled

def epsilon(X):
    # k-distance plot: sort each point's nearest-neighbor distance and look for the "knee" to choose eps
    neighbors = NearestNeighbors(n_neighbors=2)
    nbrs = neighbors.fit(X)
    distances, indices = nbrs.kneighbors(X)
    distances = np.sort(distances, axis=0)

    distances_1 = distances[:, 1]
    plt.plot(distances_1, color='#5829A7')
    plt.xlabel('Total')
    plt.ylabel('Distance')

    for spine in plt.gca().spines.values():
        spine.set_color('None')

    plt.grid(axis='y', alpha=0.5, color='#9B9A9C', linestyle='dotted')
    plt.grid(axis='x', alpha=0)

    plt.title('DBSCAN Epsilon Value for Scaled Data')
    plt.tight_layout()
    plt.show()

epsilon(X)

Code
dbscan = DBSCAN(eps=2.2, min_samples=5)
                    labels = dbscan.fit_predict(df_scaled)
                    
                    plt.figure(figsize=(10, 8))
                    unique_labels = np.unique(labels)
                    colors = [plt.cm.Spectral(each) for each in np.linspace(0, 1, len(unique_labels))]
                    
                    for k, col in zip(unique_labels, colors):
                        if k == -1:  
                            col = [0.6, 0.6, 0.6, 1]
                        class_member_mask = (labels == k)
                        xy = df_scaled[class_member_mask]
                        plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=tuple(col), markeredgecolor='k', markersize=6)
                    
                    plt.title('DBSCAN Clustering')
                    plt.xlabel('Feature 1')  
                    plt.ylabel('Feature 2')  
                    plt.grid(True)
                    plt.show()

Code
dbscan_normal = DBSCAN(eps=2.2, min_samples=5)
                    labels_normal = dbscan_normal.fit_predict(df_scaled)
                    
                    results["DBSCAN"]["normal"]["Silhouette Coefficient"] = silhouette_score(df_scaled, labels_normal) if len(np.unique(labels_normal)) > 1 else 0
                    results["DBSCAN"]["normal"]["Calinski-Harabasz Index"] = calinski_harabasz_score(df_scaled, labels_normal)
                    results["DBSCAN"]["normal"]["Davies-Bouldin Index"] = davies_bouldin_score(df_scaled, labels_normal)
                    
                    for metric, value in results["DBSCAN"]["normal"].items():
                        print(f"{metric}: {value:.2f}")
Silhouette Coefficient: 0.52
Calinski-Harabasz Index: 939.76
Davies-Bouldin Index: 1.97

5.2.2 DBSCAN UMAP

Code
X = df_umap
                    
                    
                    def epsilon(X):
                        
                        
                        neighbors = NearestNeighbors(n_neighbors=2)
                        nbrs = neighbors.fit(X)
                        distances, indices = nbrs.kneighbors(X)
                        distances = np.sort(distances, axis=0)
                        
                        
                        distances_1 = distances[:, 1]
                        plt.plot(distances_1, color='#5829A7')
                        plt.xlabel('Points (sorted by distance)')
                        plt.ylabel('Distance')
                            
                        for spine in plt.gca().spines.values():
                            spine.set_color('None')
                            
                        plt.grid(axis='y', alpha=0.5, color='#9B9A9C', linestyle='dotted')
                        plt.grid(axis='x', alpha=0)
                        
                        plt.title('DBSCAN Epsilon Value for UMAP Data')
                        plt.tight_layout()
                        plt.show();
                    
                    epsilon(X);

Code
dbscan = DBSCAN(eps=0.15, min_samples=2)
                    labels = dbscan.fit_predict(df_umap.values)  
                    
                    plt.figure(figsize=(10, 8))
                    unique_labels = np.unique(labels)
                    colors = [plt.cm.Spectral(each) for each in np.linspace(0, 1, len(unique_labels))]
                    
                    for k, col in zip(unique_labels, colors):
                        if k == -1: 
                            col = [0.6, 0.6, 0.6, 1]
                        class_member_mask = (labels == k)
                        xy = df_umap.values[class_member_mask] 
                        plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=tuple(col), markeredgecolor='k', markersize=6)
                    
                    plt.title('DBSCAN Clustering')
                    plt.xlabel('UMAP 1')
                    plt.ylabel('UMAP 2')
                    plt.grid(True)
                    plt.show()

Code
dbscan_umap = DBSCAN(eps=2.23, min_samples=2)
                    labels_umap = dbscan_umap.fit_predict(df_umap)
                    
                    results["DBSCAN"]["umap"]["Silhouette Coefficient"] = silhouette_score(df_umap, labels_umap) if len(np.unique(labels_umap)) > 1 else 0
                    results["DBSCAN"]["umap"]["Calinski-Harabasz Index"] = calinski_harabasz_score(df_umap, labels_umap)
                    results["DBSCAN"]["umap"]["Davies-Bouldin Index"] = davies_bouldin_score(df_umap, labels_umap)
                    
                    for metric, value in results["DBSCAN"]["umap"].items():
                        print(f"{metric}: {value:.2f}")
Silhouette Coefficient: 0.25
Calinski-Harabasz Index: 232.31
Davies-Bouldin Index: 0.63

5.2.3 DBSCAN PCA

Code
X = df_pca
                    
                    def epsilon(X):
                        
                        neighbors = NearestNeighbors(n_neighbors=2)
                        nbrs = neighbors.fit(X)
                        distances, indices = nbrs.kneighbors(X)
                        distances = np.sort(distances, axis=0)
                    
                        distances_1 = distances[:, 1]
                        plt.plot(distances_1, color='#5829A7')
                        plt.xlabel('Points (sorted by distance)')
                        plt.ylabel('Distance')
                            
                        for spine in plt.gca().spines.values():
                            spine.set_color('None')
                            
                        plt.grid(axis='y', alpha=0.5, color='#9B9A9C', linestyle='dotted')
                        plt.grid(axis='x', alpha=0)
                        
                        plt.title('DBSCAN Epsilon Value for PCA Data')
                        plt.tight_layout()
                        plt.show();
                    
                    epsilon(X);

Code
dbscan = DBSCAN(eps=1, min_samples=5)
                    labels = dbscan.fit_predict(df_pca)  
                    
                    plt.figure(figsize=(10, 8))
                    unique_labels = np.unique(labels)
                    colors = [plt.cm.Spectral(each) for each in np.linspace(0, 1, len(unique_labels))]
                    
                    for k, col in zip(unique_labels, colors):
                        if k == -1:  
                            col = [0.6, 0.6, 0.6, 1]
                        class_member_mask = (labels == k)
                        xy = df_pca[class_member_mask]  
                        plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=tuple(col), markeredgecolor='k', markersize=6)
                    
                    plt.title('DBSCAN Clustering')
                    plt.xlabel('PCA 1')
                    plt.ylabel('PCA 2')
                    plt.grid(True)
                    plt.show()

Code
dbscan_pca = DBSCAN(eps=1, min_samples=5)
                    labels_pca = dbscan_pca.fit_predict(df_pca)
                    
                    results["DBSCAN"]["pca"]["Silhouette Coefficient"] = silhouette_score(df_pca, labels_pca) if len(np.unique(labels_pca)) > 1 else 0
                    results["DBSCAN"]["pca"]["Calinski-Harabasz Index"] = calinski_harabasz_score(df_pca, labels_pca)
                    results["DBSCAN"]["pca"]["Davies-Bouldin Index"] = davies_bouldin_score(df_pca, labels_pca)
                    
                    for metric, value in results["DBSCAN"]["pca"].items():
                        print(f"{metric}: {value:.2f}")
Silhouette Coefficient: 0.80
Calinski-Harabasz Index: 1347.22
Davies-Bouldin Index: 0.78

5.3 OPTICS

OPTICS (Ordering Points To Identify the Clustering Structure)

OPTICS (Ordering Points To Identify the Clustering Structure) is a density-based clustering algorithm and can be considered as a generalization of DBSCAN. However, it is known for its ability to identify structures of different densities in the dataset, instead of using a single eps value.

Working Principle:

  • A reachability distance is calculated for each data point; it measures how far a point is from the nearest dense region.
  • Ordering the points by these distances produces a reachability plot.
  • Valleys in the plot correspond to density-based clusters.

Advantages:

  • It can detect clusters of varying densities.
  • You do not have to select a single fixed value for the eps parameter.
  • It can automatically distinguish noise.

Challenges:

  • The computational cost can be high, especially for large datasets.
  • Interpreting the results can be slightly more challenging than with DBSCAN.

OPTICS is quite useful for datasets with regions of varying density, since it can detect transitions between clusters of different densities. A small sketch of the reachability plot is shown below.
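The reachability plot mentioned above is not drawn in the notebook itself, so here is an optional sketch of what it could look like for the scaled data; it assumes the df_scaled array used throughout this post and the min_samples=9 setting applied later in this section.

Code
import matplotlib.pyplot as plt
from sklearn.cluster import OPTICS

optics_demo = OPTICS(min_samples=9).fit(df_scaled)

# reachability_ holds each point's reachability distance and ordering_ gives
# the order in which OPTICS processed the points; plotting the distances in
# that order produces the reachability plot, where valleys indicate clusters.
reachability = optics_demo.reachability_[optics_demo.ordering_]

plt.figure(figsize=(10, 4))
plt.plot(reachability, lw=0.7, color='#5829A7')
plt.title('OPTICS Reachability Plot (scaled data)')
plt.xlabel('Points (processing order)')
plt.ylabel('Reachability distance')
plt.show()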

5.3.1 OPTICS Normal

Code
optics = OPTICS(min_samples=5, max_eps=2.2)
                    labels = optics.fit_predict(df_scaled)
                    
                    plt.figure(figsize=(10, 8))
                    unique_labels = np.unique(labels)
                    colors = [plt.cm.Spectral(each) for each in np.linspace(0, 1, len(unique_labels))]
                    
                    for k, col in zip(unique_labels, colors):
                        if k == -1:  
                            col = [0.6, 0.6, 0.6, 1]
                        class_member_mask = (labels == k)
                        xy = df_scaled[class_member_mask]
                        plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=tuple(col), markeredgecolor='k', markersize=6)  
                    
                    plt.title('OPTICS Clustering')
                    plt.xlabel('Feature 1')  
                    plt.ylabel('Feature 2')  
                    plt.grid(True)
                    plt.show()

Code
optics_normal = OPTICS(min_samples=9)
                    labels_normal = optics_normal.fit_predict(df_scaled)
                    
                    results["OPTICS"]["normal"]["Silhouette Coefficient"] = silhouette_score(df_scaled, labels_normal) if len(np.unique(labels_normal)) > 1 else 0
                    results["OPTICS"]["normal"]["Calinski-Harabasz Index"] = calinski_harabasz_score(df_scaled, labels_normal)
                    results["OPTICS"]["normal"]["Davies-Bouldin Index"] = davies_bouldin_score(df_scaled, labels_normal)
                    
                    for metric, value in results["OPTICS"]["normal"].items():
                        print(f"{metric}: {value:.2f}")
Silhouette Coefficient: -0.45
Calinski-Harabasz Index: 11.93
Davies-Bouldin Index: 1.33

5.3.2 OPTICS UMAP

Code
optics = OPTICS(min_samples=2, max_eps=0.15)
                    labels = optics.fit_predict(df_umap.values)  
                    
                    plt.figure(figsize=(10, 8))
                    unique_labels = np.unique(labels)
                    colors = [plt.cm.Spectral(each) for each in np.linspace(0, 1, len(unique_labels))]
                    
                    for k, col in zip(unique_labels, colors):
                        if k == -1:  
                            col = [0.6, 0.6, 0.6, 1]
                        class_member_mask = (labels == k)
                        xy = df_umap.values[class_member_mask]  
                        plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=tuple(col), markeredgecolor='k', markersize=6)
                    
                    plt.title('OPTICS Clustering')
                    plt.xlabel('UMAP 1')
                    plt.ylabel('UMAP 2')
                    plt.grid(True)
                    plt.show()

Code
optics_umap = OPTICS(min_samples=2)
                    labels_umap = optics_umap.fit_predict(df_umap)
                    
                    results["OPTICS"]["umap"]["Silhouette Coefficient"] = silhouette_score(df_umap, labels_umap) if len(np.unique(labels_umap)) > 1 else 0
                    results["OPTICS"]["umap"]["Calinski-Harabasz Index"] = calinski_harabasz_score(df_umap, labels_umap)
                    results["OPTICS"]["umap"]["Davies-Bouldin Index"] = davies_bouldin_score(df_umap, labels_umap)
                    
                    for metric, value in results["OPTICS"]["umap"].items():
                        print(f"{metric}: {value:.2f}")
Silhouette Coefficient: 0.24
Calinski-Harabasz Index: 10.24
Davies-Bouldin Index: 1.43

5.3.3 OPTICS PCA

Code
optics = OPTICS(min_samples=5, max_eps=1)
                    labels = optics.fit_predict(df_pca)
                    
                    plt.figure(figsize=(10, 8))
                    unique_labels = np.unique(labels)
                    colors = [plt.cm.Spectral(each) for each in np.linspace(0, 1, len(unique_labels))]
                    
                    for k, col in zip(unique_labels, colors):
                        if k == -1:  
                            col = [0.6, 0.6, 0.6, 1]
                        
                        class_member_mask = (labels == k)
                        xy = df_pca[class_member_mask]  
                        plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=tuple(col), markeredgecolor='k', markersize=6)
                    
                    plt.title('OPTICS Clustering')
                    plt.xlabel('PCA 1')
                    plt.ylabel('PCA 2')
                    plt.grid(True)
                    plt.show()

Code
optics_pca = OPTICS(min_samples=5)
                    labels_pca = optics_pca.fit_predict(df_pca)
                    
                    results["OPTICS"]["pca"]["Silhouette Coefficient"] = silhouette_score(df_pca, labels_pca) if len(np.unique(labels_pca)) > 1 else 0
                    results["OPTICS"]["pca"]["Calinski-Harabasz Index"] = calinski_harabasz_score(df_pca, labels_pca)
                    results["OPTICS"]["pca"]["Davies-Bouldin Index"] = davies_bouldin_score(df_pca, labels_pca)
                    
                    for metric, value in results["OPTICS"]["pca"].items():
                        print(f"{metric}: {value:.2f}")
Silhouette Coefficient: -0.18
Calinski-Harabasz Index: 10.06
Davies-Bouldin Index: 1.68

6. Results

Code
from IPython.display import display, Markdown
                    
                    def display_results(results):
                        for method, data in results.items():
                            display(Markdown(f"### {method}"))
                            for dataset, metrics in data.items():
                                display(Markdown(f"#### {dataset}"))
                                for metric, value in metrics.items():
                                    display(Markdown(f"- **{metric}**: {value:.2f}"))
                    
                    display_results(results)

KMeans

normal

  • Silhouette Coefficient: 0.25
  • Calinski-Harabasz Index: 1604.40
  • Davies-Bouldin Index: 1.60

umap

  • Silhouette Coefficient: 0.42
  • Calinski-Harabasz Index: 8208.00
  • Davies-Bouldin Index: 0.95

pca

  • Silhouette Coefficient: 0.45
  • Calinski-Harabasz Index: 5337.49
  • Davies-Bouldin Index: 0.81

outlier_free

  • Silhouette Coefficient: 0.32
  • Calinski-Harabasz Index: 4345.55
  • Davies-Bouldin Index: 1.42

DBSCAN

normal

  • Silhouette Coefficient: 0.52
  • Calinski-Harabasz Index: 939.76
  • Davies-Bouldin Index: 1.97

umap

  • Silhouette Coefficient: 0.25
  • Calinski-Harabasz Index: 232.31
  • Davies-Bouldin Index: 0.63

pca

  • Silhouette Coefficient: 0.80
  • Calinski-Harabasz Index: 1347.22
  • Davies-Bouldin Index: 0.78

OPTICS

normal

  • Silhouette Coefficient: -0.45
  • Calinski-Harabasz Index: 11.93
  • Davies-Bouldin Index: 1.33

umap

  • Silhouette Coefficient: 0.24
  • Calinski-Harabasz Index: 10.24
  • Davies-Bouldin Index: 1.43

pca

  • Silhouette Coefficient: -0.18
  • Calinski-Harabasz Index: 10.06
  • Davies-Bouldin Index: 1.68

Analysis of Clustering Results

KMeans

Normal

  • Silhouette Coefficient (SC): 0.25
    This metric ranges between -1 and 1. Higher values indicate that the object is well-fitted to its own cluster and has a weak match with neighboring clusters. The score of 0.25 suggests that the clusters overlap to some extent and there is room for improvement.

  • Calinski-Harabasz Index (CHI): 1604.40
    It is the ratio of between-cluster variance to within-cluster variance. Higher values indicate that clusters are dense and well-separated.

  • Davies-Bouldin Index (DBI): 1.60
    It is the average similarity ratio of each cluster with its most similar cluster. Values close to 0 are better. The score of 1.60 indicates a moderate clustering structure.

UMAP

  • SC: 0.42
    An improvement over the normal data, showing better defined clusters after UMAP is applied.

  • CHI: 8208.00
    A significant increase, indicating that UMAP helped improve the clustering structure.

  • DBI: 0.95
    A decrease, indicating better separation of clusters after UMAP is applied.

PCA

  • SC: 0.45
    Better cluster definition after PCA is applied, compared to UMAP and normal data.

  • CHI: 5337.49
    Lower than UMAP but higher than normal; indicates that PCA also aided in achieving a good clustering structure.

  • DBI: 0.81
    Indicates better separation of clusters compared to using normal data.

Outlier-free (Without Outliers)

Metrics in this category show that removing outliers made the clustering structure slightly better compared to normal data, but not as good as applying dimension reduction techniques like PCA or UMAP.

DBSCAN

For the normal and pca data, DBSCAN generally produced better SC scores than KMeans; with an SC of 0.80, the pca data set in particular shows very well-defined clusters. The DBI scores also point to good cluster separation, especially for the umap and pca data sets. The CHI scores, however, are lower than those of KMeans and drop sharply for the umap data set, indicating less dense clusters or greater cluster overlap there.

OPTICS

Normal:

  • SC: -0.45
    A negative silhouette score indicates that the clusters are not well-defined and overlap significantly. Other metrics also reflect a weak clustering structure.

UMAP & PCA:

Metrics for these data sets are not promising. Negative or low silhouette scores and very low CHI scores indicate poor cluster definition and density.

Summary

KMeans with UMAP or PCA
KMeans combined with UMAP or PCA seems to yield the best results. Although PCA shows a slight advantage in silhouette coefficient, both methods indicate well-defined clusters. This could make it worthwhile to evaluate both approaches for a specific application.

DBSCAN
The combination of DBSCAN with PCA has a fairly high silhouette coefficient, indicating that this approach also defines clusters quite well. However, the results obtained with UMAP are not as successful for DBSCAN.

OPTICS
The performance of OPTICS is generally lower compared to the other two methods depending on the dataset used. Particularly, the negative silhouette coefficients indicate that the clusters are quite poorly defined.


In conclusion, if your primary goal is to obtain well-defined clusters, combining KMeans with PCA or UMAP is recommended. On the other hand, DBSCAN may be worth trying for more complex structures, especially when combined with PCA. Although OPTICS performed poorly on this particular dataset, it may produce different results on other datasets, so it is worth evaluating before ruling it out.
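As a convenience, the nested results dictionary built up in Section 5 can be flattened into a single table and ranked, for example by silhouette coefficient. The snippet below is a small sketch that assumes results keeps the {method: {dataset: {metric: value}}} structure used in the code cells above.

Code
import pandas as pd

# Flatten the nested results dict into one row per (method, dataset) pair.
rows = []
for method, datasets in results.items():
    for dataset, metrics in datasets.items():
        rows.append({"method": method, "dataset": dataset, **metrics})

summary = pd.DataFrame(rows).set_index(["method", "dataset"])

# Rank configurations by silhouette coefficient (higher is better).
print(summary.sort_values("Silhouette Coefficient", ascending=False))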

Code
# Add the cluster labels to the original data set.
                    df['Cluster'] = kmeans_labels_pca
                    
                    # Column means for each cluster (numeric columns only).
                    cluster_means = df.groupby('Cluster').mean(numeric_only=True)
                    
                    # Show the mean values as a table.
                    cluster_means
BALANCE BALANCE_FREQUENCY PURCHASES ONEOFF_PURCHASES INSTALLMENTS_PURCHASES CASH_ADVANCE PURCHASES_FREQUENCY ONEOFF_PURCHASES_FREQUENCY PURCHASES_INSTALLMENTS_FREQUENCY CASH_ADVANCE_FREQUENCY ... PAYMENTS_RANGE MINIMUM_PAYMENTS_RANGE BALANCE_FREQUENCY_RANGE PURCHASES_FREQUENCY_RANGE ONEOFF_PURCHASES_FREQUENCY_RANGE PURCHASES_INSTALLMENTS_FREQUENCY_RANGE CASH_ADVANCE_FREQUENCY_RANGE PRC_FULL_PAYMENT_RANGE PURCHASES_TRX_RANGE CASH_ADVANCE_TRX_RANGE
Cluster
0 789.381289 0.835555 516.904913 264.930837 252.295136 316.614221 0.474579 0.140900 0.348978 0.067312 ... 1.869153 1.404809 8.548250 5.000981 1.566405 3.695944 0.802257 1.672718 2.052993 0.443409
1 3925.835034 0.956794 361.745825 241.267864 120.553232 3791.893111 0.214590 0.106232 0.128552 0.437004 ... 3.077906 2.484480 9.663421 2.309191 1.191722 1.390140 4.780280 0.364577 1.125380 2.623859
2 2284.681931 0.981539 4378.858533 2754.504803 1624.856664 498.773452 0.950954 0.650455 0.768528 0.067055 ... 3.704107 1.886840 9.885163 9.659681 6.824811 7.924560 0.768650 3.176865 6.145013 0.464376

3 rows × 33 columns
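The 33-column table above is hard to scan, so one optional way to profile the clusters (not part of the original notebook) is to standardize each feature across the three clusters and draw the result as a heatmap; the sketch below assumes the cluster_means frame computed in the previous cell.

Code
import matplotlib.pyplot as plt
import seaborn as sns

# Z-score each feature across the clusters so all columns share one scale;
# a constant column would divide by zero, hence the fillna(0).
profile = ((cluster_means - cluster_means.mean()) / cluster_means.std()).fillna(0)

plt.figure(figsize=(6, 12))
sns.heatmap(profile.T, cmap='coolwarm', center=0)
plt.title('Standardized Cluster Profiles')
plt.xlabel('Cluster')
plt.ylabel('Feature')
plt.tight_layout()
plt.show()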

Predicting/interpreting clusters

Cluster 0

  • Balance (BALANCE): The customers of this group carry an average level balance. This means they have neither very high nor very low spending capacity.

  • Purchases (PURCHASES): Customers in this group make purchases at an average level, split between one-off and installment purchases.

  • Cash Advance (CASH_ADVANCE): There is an average level of cash advance usage.

  • Purchase Frequency (PURCHASES_FREQUENCY): These customers shop regularly, but not very frequently.

  • Credit Limit (CREDIT_LIMIT): The credit limit of this group may be lower compared to other groups.

  • They can be evaluated as average users. These users can use their credit cards for both daily shopping and larger purchases. The medium level of cash advance usage shows that they sometimes meet their cash needs from their cards.


Cluster 1

  • Balance (BALANCE): The customers of this group carry quite a high balance, which means they have a high spending capacity.

  • Purchases (PURCHASES): They shop less, which means this group is less active.

  • Cash Advance (CASH_ADVANCE): A high amount of cash advance usage is evident in this group, which shows these customers often have cash needs.

  • Purchase Frequency (PURCHASES_FREQUENCY): These customers shop less frequently.

  • Credit Limit (CREDIT_LIMIT): The credit limit of this group is higher than the other groups.

  • This group seems to consist of customers who use their cards primarily as a source of cash rather than for shopping. The high balances and credit limits indicate a high financial capacity and spending potential, but that potential is currently realized mostly through cash advances.


Cluster 2

  • Balance (BALANCE): The balance of this group is high, but not as much as Cluster 1.

  • Purchases (PURCHASES): Customers in this group make very high amounts of purchases and these purchases are made both at once and in installments.

  • Cash Advance (CASH_ADVANCE): Cash advance usage is slightly above average.

  • Purchase Frequency (PURCHASES_FREQUENCY): This customer group shops frequently.

  • Credit Limit (CREDIT_LIMIT): The credit limit of this group is average compared to the other two groups.

  • This group most likely represents the most engaged cardholders: they shop frequently and spend the most, both one-off and in installments, which suggests the card is their primary payment instrument for everyday as well as larger purchases.

However, to increase the accuracy of these interpretations, additional demographic or behavioral information (such as age, occupation, or income level) would be needed. Such information could help us describe each cluster in more detail and more accurately. Unfortunately, it is not available in the dataset I worked on. Since I want to use pydash in the continuation of the project, I will add age, gender, and city columns (I had added the city earlier). These additions are completely random; they only allow me to proceed and do not affect this analysis. One way of doing this is sketched below.
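For completeness, here is one possible way to attach such randomly generated columns for the dashboard; the column names, value ranges, and seed are illustrative assumptions (City already exists in the data), and the random values carry no signal for the analysis above.

Code
import numpy as np

rng = np.random.default_rng(42)

# Purely synthetic demographics for the dashboard demo; they carry no signal
# and play no role in the clustering analysis.
df['Age'] = rng.integers(18, 70, size=len(df))
df['Gender'] = rng.choice(['Female', 'Male'], size=len(df))

df[['Cluster', 'Age', 'Gender', 'City']].head()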

You can check out the Dash application I built to turn this analysis into an app and dashboard here.
