
04/08/2023 • Mehmet Emre Toktay

Bank Customer Segmentation: Analysis with K-Means, DBSCAN, and OPTICS
Table of Contents
  • 1. Introduction
  • 2. Initial Review of the Data Set
  • 3. Preprocessing
  • 4. Exploratory Data Analysis (EDA)
  • 5. Clustering Models
    • 5.1 K-Means
      • 5.1.1 K-Means Normal
      • 5.1.2 Outlier-Adjusted K-Means
      • 5.1.3 K-Means UMAP
      • 5.1.4 K-Means PCA
    • 5.2 DBSCAN
      • 5.2.1 DBSCAN Normal
      • 5.2.2 DBSCAN UMAP
      • 5.2.3 DBSCAN PCA
    • 5.3 OPTICS
      • 5.3.1 OPTICS Normal
      • 5.3.2 OPTICS UMAP
      • 5.3.3 OPTICS PCA
  • 6. Results
    • KMeans
      • Normal
      • UMAP
      • PCA
      • Outlier-free
    • DBSCAN
    • OPTICS
      • Normal:
      • UMAP & PCA:
    • Summary
    • Predicting/interpreting clusters
      • Cluster 0
      • Cluster 1
      • Cluster 2

Bank Customer Segmentation: Analysis with K-Means, DBSCAN, and OPTICS

1. Introduction

In this study, we will use a dataset containing the credit card usage behavior of approximately 9,000 customers over the last six months. Dividing customers into several behavioral groups is necessary to build an effective and efficient credit card marketing strategy.

The objectives of this notebook are as follows:

- Explore the data set using data visualization techniques.

- Perform data preprocessing before using models.

- Group customers into clusters using different clustering models.

- Interpret and analyze the created groups (profiling).

- Provide marketing suggestions based on the profiling results and analyses.

Code
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.cluster import KMeans, DBSCAN, OPTICS
from sklearn.preprocessing import StandardScaler
from pyclustertend import hopkins
from sklearn.metrics import silhouette_score, calinski_harabasz_score, davies_bouldin_score, silhouette_samples
from scipy.spatial.distance import pdist, squareform
from mpl_toolkits.mplot3d import Axes3D
from yellowbrick.cluster import SilhouetteVisualizer
import umap
from sklearn.neighbors import NearestNeighbors
import matplotlib.cm as cm
Code
url = 'https://raw.githubusercontent.com/EmreToktay/pydash/main/Creditcard.csv'
df = pd.read_csv(url)

2. Initial Review of the Data Set

First, let's get acquainted with the dataset at hand:

Code
print(df.shape)
df.head()
(8950, 19)
CUST_ID BALANCE BALANCE_FREQUENCY PURCHASES ONEOFF_PURCHASES INSTALLMENTS_PURCHASES CASH_ADVANCE PURCHASES_FREQUENCY ONEOFF_PURCHASES_FREQUENCY PURCHASES_INSTALLMENTS_FREQUENCY CASH_ADVANCE_FREQUENCY CASH_ADVANCE_TRX PURCHASES_TRX CREDIT_LIMIT PAYMENTS MINIMUM_PAYMENTS PRC_FULL_PAYMENT TENURE City
0 C10001 40.900749 0.818182 95.40 0.00 95.4 0.000000 0.166667 0.000000 0.083333 0.000000 0 2 1000.0 201.802084 139.509787 0.000000 12 Bursa
1 C10002 3202.467416 0.909091 0.00 0.00 0.0 6442.945483 0.000000 0.000000 0.000000 0.250000 4 0 7000.0 4103.032597 1072.340217 0.222222 12 Şanlıurfa
2 C10003 2495.148862 1.000000 773.17 773.17 0.0 0.000000 1.000000 1.000000 0.000000 0.000000 0 12 7500.0 622.066742 627.284787 0.000000 12 Çorum
3 C10004 1666.670542 0.636364 1499.00 1499.00 0.0 205.788017 0.083333 0.083333 0.000000 0.083333 1 1 7500.0 0.000000 NaN 0.000000 12 Balıkesir
4 C10005 817.714335 1.000000 16.00 16.00 0.0 0.000000 0.083333 0.083333 0.000000 0.000000 0 1 1200.0 678.334763 244.791237 0.000000 12 Batman

Some details regarding the columns:

  • BALANCE: The balance amount remaining on the account for purchases

  • BALANCE_FREQUENCY: How frequently the balance is updated, score between 0 and 1 (1 = frequently updated, 0 = not frequently updated)

  • PURCHASES: The amount of shopping made from the account

  • ONEOFF_PURCHASES: The highest shopping amount made at one time

  • INSTALLMENTS_PURCHASES: The amount of shopping made in installments

  • CASH_ADVANCE: Cash Advance

  • PURCHASES_FREQUENCY: How frequently shopping is done, score between 0 and 1 (1 = shopping is done frequently, 0 = shopping is not done frequently)

  • ONEOFF_PURCHASES_FREQUENCY: How frequently one-off purchases are made (1 = frequently done, 0 = not frequently done)

  • PURCHASES_INSTALLMENTS_FREQUENCY: How frequently installment purchases are made (1 = frequently done, 0 = not frequently done)

  • CASH_ADVANCE_FREQUENCY: How frequently cash advances are taken, score between 0 and 1

  • CASH_ADVANCE_TRX: The number of transactions made with “Cash Advance”

  • PURCHASES_TRX: The number of shopping transactions made

  • CREDIT_LIMIT: The credit card limit of the user

  • PAYMENTS: The amount of payment made by the user

  • MINIMUM_PAYMENTS: The amount of minimum payment made by the user

  • PRC_FULL_PAYMENT: The percentage of full payment made by the user

  • TENURE: The tenure of the user's credit card service

Let's examine the data types:

Code
df.info()

                    <class 'pandas.core.frame.DataFrame'>
                    RangeIndex: 8950 entries, 0 to 8949
                    Data columns (total 19 columns):
                     #   Column                            Non-Null Count  Dtype
                    ---  ------                            --------------  -----
                     0   CUST_ID                           8950 non-null   object
                     1   BALANCE                           8950 non-null   float64
                     2   BALANCE_FREQUENCY                 8950 non-null   float64
                     3   PURCHASES                         8950 non-null   float64
                     4   ONEOFF_PURCHASES                  8950 non-null   float64
                     5   INSTALLMENTS_PURCHASES            8950 non-null   float64
                     6   CASH_ADVANCE                      8950 non-null   float64
                     7   PURCHASES_FREQUENCY               8950 non-null   float64
                     8   ONEOFF_PURCHASES_FREQUENCY        8950 non-null   float64
                     9   PURCHASES_INSTALLMENTS_FREQUENCY  8950 non-null   float64
                     10  CASH_ADVANCE_FREQUENCY            8950 non-null   float64
                     11  CASH_ADVANCE_TRX                  8950 non-null   int64
                     12  PURCHASES_TRX                     8950 non-null   int64
                     13  CREDIT_LIMIT                      8949 non-null   float64
                     14  PAYMENTS                          8950 non-null   float64
                     15  MINIMUM_PAYMENTS                  8637 non-null   float64
                     16  PRC_FULL_PAYMENT                  8950 non-null   float64
                     17  TENURE                            8950 non-null   int64
                     18  City                              8950 non-null   object
                    dtypes: float64(14), int64(3), object(2)
                    memory usage: 1.3+ MB

Basic statistical values for each header:

Code
df.describe()
BALANCE BALANCE_FREQUENCY PURCHASES ONEOFF_PURCHASES INSTALLMENTS_PURCHASES CASH_ADVANCE PURCHASES_FREQUENCY ONEOFF_PURCHASES_FREQUENCY PURCHASES_INSTALLMENTS_FREQUENCY CASH_ADVANCE_FREQUENCY CASH_ADVANCE_TRX PURCHASES_TRX CREDIT_LIMIT PAYMENTS MINIMUM_PAYMENTS PRC_FULL_PAYMENT TENURE
count 8950.000000 8950.000000 8950.000000 8950.000000 8950.000000 8950.000000 8950.000000 8950.000000 8950.000000 8950.000000 8950.000000 8950.000000 8949.000000 8950.000000 8637.000000 8950.000000 8950.000000
mean 1564.474828 0.877271 1003.204834 592.437371 411.067645 978.871112 0.490351 0.202458 0.364437 0.135144 3.248827 14.709832 4494.449450 1733.143852 864.206542 0.153715 11.517318
std 2081.531879 0.236904 2136.634782 1659.887917 904.338115 2097.163877 0.401371 0.298336 0.397448 0.200121 6.824647 24.857649 3638.815725 2895.063757 2372.446607 0.292499 1.338331
min 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 50.000000 0.000000 0.019163 0.000000 6.000000
25% 128.281915 0.888889 39.635000 0.000000 0.000000 0.000000 0.083333 0.000000 0.000000 0.000000 0.000000 1.000000 1600.000000 383.276166 169.123707 0.000000 12.000000
50% 873.385231 1.000000 361.280000 38.000000 89.000000 0.000000 0.500000 0.083333 0.166667 0.000000 0.000000 7.000000 3000.000000 856.901546 312.343947 0.000000 12.000000
75% 2054.140036 1.000000 1110.130000 577.405000 468.637500 1113.821139 0.916667 0.300000 0.750000 0.222222 4.000000 17.000000 6500.000000 1901.134317 825.485459 0.142857 12.000000
max 19043.138560 1.000000 49039.570000 40761.250000 22500.000000 47137.211760 1.000000 1.000000 1.000000 1.500000 123.000000 358.000000 30000.000000 50721.483360 76406.207520 1.000000 12.000000

  • Based on the percentile values of each feature in the table above, it may be concluded that CASH_ADVANCE_TRX, PURCHASES_TRX, and TENURE are not truly continuous. However, this hypothesis should be checked, as in the quick check following this list.

  • Looking at the count of each feature, there are some missing values in the CREDIT_LIMIT and MINIMUM_PAYMENTS columns.

  • Looking at the percentiles, the distributions of some features are heavily skewed and require more detailed analysis.

  • In conclusion, visualizing and analyzing the charts can help us understand the data in more depth.
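
As a quick way to test the first hypothesis above, one can count how many distinct values these columns take; integer dtypes with relatively few unique values suggest discrete counts rather than continuous measures. A minimal sketch, assuming df is loaded as above:

Code
# Discrete count-like features have integer dtypes and relatively few unique values
for col in ['CASH_ADVANCE_TRX', 'PURCHASES_TRX', 'TENURE']:
    print(col, df[col].dtype, df[col].nunique(), 'unique values')

# TENURE in particular takes only a handful of values
print(df['TENURE'].value_counts().sort_index())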

3. Preprocessing

Analysis and proportion of null values in the dataset:

Code
df.isna().mean()*100

                    CUST_ID                             0.000000
                    BALANCE                             0.000000
                    BALANCE_FREQUENCY                   0.000000
                    PURCHASES                           0.000000
                    ONEOFF_PURCHASES                    0.000000
                    INSTALLMENTS_PURCHASES              0.000000
                    CASH_ADVANCE                        0.000000
                    PURCHASES_FREQUENCY                 0.000000
                    ONEOFF_PURCHASES_FREQUENCY          0.000000
                    PURCHASES_INSTALLMENTS_FREQUENCY    0.000000
                    CASH_ADVANCE_FREQUENCY              0.000000
                    CASH_ADVANCE_TRX                    0.000000
                    PURCHASES_TRX                       0.000000
                    CREDIT_LIMIT                        0.011173
                    PAYMENTS                            0.000000
                    MINIMUM_PAYMENTS                    3.497207
                    PRC_FULL_PAYMENT                    0.000000
                    TENURE                              0.000000
                    City                                0.000000
                    dtype: float64

There are missing values in MINIMUM_PAYMENTS and CREDIT_LIMIT; we fill them with the column means so the analysis can continue:

Code
df.loc[df['MINIMUM_PAYMENTS'].isnull(), 'MINIMUM_PAYMENTS'] = df['MINIMUM_PAYMENTS'].mean()
df.loc[df['CREDIT_LIMIT'].isnull(), 'CREDIT_LIMIT'] = df['CREDIT_LIMIT'].mean()
Code
df.isna().mean()*100

                    CUST_ID                             0.0
                    BALANCE                             0.0
                    BALANCE_FREQUENCY                   0.0
                    PURCHASES                           0.0
                    ONEOFF_PURCHASES                    0.0
                    INSTALLMENTS_PURCHASES              0.0
                    CASH_ADVANCE                        0.0
                    PURCHASES_FREQUENCY                 0.0
                    ONEOFF_PURCHASES_FREQUENCY          0.0
                    PURCHASES_INSTALLMENTS_FREQUENCY    0.0
                    CASH_ADVANCE_FREQUENCY              0.0
                    CASH_ADVANCE_TRX                    0.0
                    PURCHASES_TRX                       0.0
                    CREDIT_LIMIT                        0.0
                    PAYMENTS                            0.0
                    MINIMUM_PAYMENTS                    0.0
                    PRC_FULL_PAYMENT                    0.0
                    TENURE                              0.0
                    City                                0.0
                    dtype: float64

We drop two columns that are not needed for customer segmentation; these columns will be added back after the analysis.

Code
df = df.drop(columns=['City', 'CUST_ID'])

4. Exploratory Data Analysis (EDA) 🔍

To visualize the data, I used the Sweetviz library. Sweetviz is a Python data visualization library designed for Exploratory Data Analysis (EDA): it quickly analyzes one or two data frames and produces detailed, visually appealing reports.

Code
import sweetviz as sv

# Analyze the data
report = sv.analyze(df)

# The following line renders the report directly in a Jupyter notebook (optional)
report.show_notebook()

  • Except for PURCHASES_FREQUENCY, none of the features has a balanced distribution. Moreover, even PURCHASES_FREQUENCY is not normally distributed!
  • As mentioned in the previous section, some features are significantly skewed (a quick numeric check follows below). As a result, a more detailed examination is needed to identify outliers, and conventional clustering algorithms may not be efficient enough.
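
To put a number on that skewness before deciding how to handle it, one can compute the sample skewness of each column; a small sketch (values far above 0 indicate a long right tail):

Code
# Sample skewness per feature; strongly right-skewed features show large positive values
print(df.skew(numeric_only=True).sort_values(ascending=False))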

Data scaling is often used to ensure machine learning algorithms operate more effectively. Specifically, many algorithms can produce misleading results due to different scales among features. Therefore, it's important to bring all features to a similar scale.

The scaling method we used is StandardScaler. StandardScaler scales each feature to have a mean of 0 and a standard deviation of 1. This is also known as z-score normalization.

Code
scaler = StandardScaler()
df_scaled = scaler.fit_transform(df)

With this code, our data has been successfully scaled and is ready for model training or other analyses.
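
As a quick sanity check, the scaled array should now have per-feature means close to 0 and standard deviations close to 1:

Code
# Each column of df_scaled should have mean ~0 and standard deviation ~1 after StandardScaler
print(np.round(df_scaled.mean(axis=0), 3))
print(np.round(df_scaled.std(axis=0), 3))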

Hopkins Statistic

Before starting the clustering analysis, it is important to check whether the dataset is suitable for clustering. For this purpose, the Hopkins Statistic is used to check whether the dataset is randomly distributed.

The Hopkins Statistic is a test that measures how suitable a dataset is for clustering. A value close to 0.5 indicates that the data has a random distribution, and therefore is not suitable for clustering. On the other hand, if the value is greater than 0.7, the dataset is likely suitable for clustering.

Code
hopkins_score = hopkins(df_scaled, sampling_size=len(df_scaled))
print(hopkins_score)
0.034992587280613066

According to this result, a value of roughly 0.035 is quite low for the Hopkins Statistic, indicating that the dataset does not show a clear cluster structure and thus may not be suitable for clustering.

Having previously looked at the overall structure of the dataset, we saw heavy outliers and skewness, so computing the Hopkins score over the entire dataset could be misleading. Therefore, I applied the Hopkins test again on a random 10% sample of the data:

Code
import numpy as np
from random import sample
from sklearn.neighbors import NearestNeighbors
from numpy.random import uniform
from math import isnan

def hopkins(X):
    d = X.shape[1]
    n = len(X)
    m = int(0.1 * n)
    nbrs = NearestNeighbors(n_neighbors=1).fit(X)

    rand_X = sample(range(0, n, 1), m)

    ujd = []
    wjd = []
    for j in range(0, m):
        u_dist, _ = nbrs.kneighbors(uniform(np.amin(X, axis=0), np.amax(X, axis=0), d).reshape(1, -1), 2, return_distance=True)
        ujd.append(u_dist[0][1])
        w_dist, _ = nbrs.kneighbors(X[rand_X[j]].reshape(1, -1), 2, return_distance=True)
        wjd.append(w_dist[0][1])

    H = sum(ujd) / (sum(ujd) + sum(wjd))
    if isnan(H):
        print(ujd, wjd)
        H = 0

    return H

# Run the Hopkins test
hopkins_value = hopkins(df_scaled)
hopkins_result = 'Result: {:.4f}'.format(hopkins_value)
print('.: Hopkins Test :.')
print(hopkins_result)
if 0.7 < hopkins_value < 0.99:
    print('>> According to the result above, there is a strong clustering tendency (the data contains meaningful clusters)')
    print('.:. Conclusion: H0 accepted .:.')
else:
    print('>> According to the result above, there are no meaningful clusters')
    print('.:. Conclusion: H0 rejected .:.')
.: Hopkins Test :.
                    Result: 0.9672
                    >> According to the result above, there is a strong clustering tendency (the data contains meaningful clusters)
                    .:. Conclusion: H0 accepted .:.

The first Hopkins test, run on the entire dataset, produced a low value for the data's clustering tendency relative to a random distribution. This might indicate that, at first glance, the data does not contain distinct clusters or that separating the clusters will be challenging.

The Hopkins test conducted on a random 10% subset, by contrast, showed that the data has a high tendency for clustering.

Consequently, when analyzing large datasets, there is value in running the test both on the whole dataset and on random subsets; together, the two results give a more complete picture of the dataset's overall structure and clustering tendency.
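
Since a single random sample can be unrepresentative, one way to make the subset check more robust is to repeat it on several random 10% samples and look at the spread of the scores. A sketch using the hopkins function defined above (the sample size and number of repeats are arbitrary choices):

Code
# Repeat the Hopkins test on several random 10% subsets to gauge how stable the score is
rng = np.random.default_rng(42)
subset_scores = []
for _ in range(5):
    idx = rng.choice(len(df_scaled), size=int(0.1 * len(df_scaled)), replace=False)
    subset_scores.append(hopkins(df_scaled[idx]))
print('Hopkins scores on random 10% subsets:', [round(s, 4) for s in subset_scores])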

Next, let's examine the percentage of outliers in each feature of the dataset. To detect outliers, we use the IQR (Interquartile Range) method:

Code
outlier_percentage = {}
for feature in df:
    tempData = df.sort_values(by=feature)[feature]
    Q1, Q3 = tempData.quantile([0.25, 0.75])
    IQR = Q3 - Q1
    Lower_range = Q1 - (1.5 * IQR)
    Upper_range = Q3 + (1.5 * IQR)

    outlier_count = ((tempData < Lower_range) | (tempData > Upper_range)).sum()
    outlier_perc = round((outlier_count / tempData.shape[0]) * 100, 2)
    outlier_percentage[feature] = outlier_perc

outlier_percentage

                     {'BALANCE': 7.77,
                     'BALANCE_FREQUENCY': 16.68,
                     'PURCHASES': 9.03,
                     'ONEOFF_PURCHASES': 11.32,
                     'INSTALLMENTS_PURCHASES': 9.69,
                     'CASH_ADVANCE': 11.51,
                     'PURCHASES_FREQUENCY': 0.0,
                     'ONEOFF_PURCHASES_FREQUENCY': 8.74,
                     'PURCHASES_INSTALLMENTS_FREQUENCY': 0.0,
                     'CASH_ADVANCE_FREQUENCY': 5.87,
                     'CASH_ADVANCE_TRX': 8.98,
                     'PURCHASES_TRX': 8.56,
                     'CREDIT_LIMIT': 2.77,
                     'PAYMENTS': 9.03,
                     'MINIMUM_PAYMENTS': 8.65,
                     'PRC_FULL_PAYMENT': 16.47,
                     'TENURE': 15.26}

According to these results, the percentage of outliers is quite high in some features, for instance BALANCE_FREQUENCY, PRC_FULL_PAYMENT, and TENURE; we need to evaluate how these features affect the dataset and how to handle their outliers. On the other hand, some features, such as PURCHASES_FREQUENCY and PURCHASES_INSTALLMENTS_FREQUENCY, contain no outliers at all.

Outliers can have adverse effects on our analyses and models, so it is important to develop a strategy for handling them. Such a strategy could be deletion, transformation, or some other treatment; one such alternative is sketched below.

It is not always necessary to deal with outliers, and in some cases doing so can even be misleading. However, for the subsequent analyses I also created a dataset in which the outliers have been treated, which gives us the opportunity to compare different results.
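
One such alternative, not used in the rest of this notebook (which instead bins the features into ranges below), is to cap each feature at its IQR fences on a copy of the data. A minimal sketch; df_capped is a hypothetical frame introduced only for illustration:

Code
# Winsorize: clip every feature to [Q1 - 1.5*IQR, Q3 + 1.5*IQR] instead of deleting outliers
df_capped = df.copy()
for feature in df_capped.columns:
    Q1, Q3 = df_capped[feature].quantile([0.25, 0.75])
    IQR = Q3 - Q1
    df_capped[feature] = df_capped[feature].clip(lower=Q1 - 1.5 * IQR, upper=Q3 + 1.5 * IQR)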

Code
def preprocess_data(data):
    feature_boundaries = {
        'BALANCE': [0, 500, 1000, 3000, 5000, 10000],
        'PURCHASES': [0, 500, 1000, 3000, 5000, 10000],
        'ONEOFF_PURCHASES': [0, 500, 1000, 3000, 5000, 10000],
        'INSTALLMENTS_PURCHASES': [0, 500, 1000, 3000, 5000, 10000],
        'CASH_ADVANCE': [0, 500, 1000, 3000, 5000, 10000],
        'CREDIT_LIMIT': [0, 500, 1000, 3000, 5000, 10000],
        'PAYMENTS': [0, 500, 1000, 3000, 5000, 10000],
        'MINIMUM_PAYMENTS': [0, 500, 1000, 3000, 5000, 10000]
    }

    frequency_boundaries = {
        'BALANCE_FREQUENCY': [0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9],
        'PURCHASES_FREQUENCY': [0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9],
        'ONEOFF_PURCHASES_FREQUENCY': [0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9],
        'PURCHASES_INSTALLMENTS_FREQUENCY': [0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9],
        'CASH_ADVANCE_FREQUENCY': [0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9],
        'PRC_FULL_PAYMENT': [0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
    }

    trx_boundaries = {
        'PURCHASES_TRX': [0, 5, 10, 15, 20, 30, 50, 100],
        'CASH_ADVANCE_TRX': [0, 5, 10, 15, 20, 30, 50, 100]
    }

    def assign_range(column, boundaries):
        # Map each value to the 1-based index of the boundary interval it falls into
        new_column = column + '_RANGE'
        data[new_column] = 0
        for idx, boundary in enumerate(boundaries):
            if idx == len(boundaries) - 1:
                data.loc[data[column] > boundary, new_column] = idx + 1
            else:
                data.loc[(data[column] > boundary) & (data[column] <= boundaries[idx + 1]), new_column] = idx + 1

    for column, boundaries in feature_boundaries.items():
        assign_range(column, boundaries)

    for column, boundaries in frequency_boundaries.items():
        assign_range(column, boundaries)

    for column, boundaries in trx_boundaries.items():
        assign_range(column, boundaries)

    columns_to_drop = [
        'BALANCE', 'BALANCE_FREQUENCY', 'PURCHASES',
        'ONEOFF_PURCHASES', 'INSTALLMENTS_PURCHASES', 'CASH_ADVANCE',
        'PURCHASES_FREQUENCY', 'ONEOFF_PURCHASES_FREQUENCY',
        'PURCHASES_INSTALLMENTS_FREQUENCY', 'CASH_ADVANCE_FREQUENCY',
        'CASH_ADVANCE_TRX', 'PURCHASES_TRX', 'CREDIT_LIMIT', 'PAYMENTS',
        'MINIMUM_PAYMENTS', 'PRC_FULL_PAYMENT'
    ]
    # Drop the original numeric columns from the (mutated) input frame and return the binned version
    df_wo = data.drop(columns=columns_to_drop)

    return df_wo
                    
Code
df_wo = preprocess_data(df)

This preprocessing function converts selected columns into categorical ranges. Using predefined boundaries, new _RANGE features are derived, and the original numeric columns are then dropped.

Feature boundaries: value ranges are defined for certain columns. For example, for the BALANCE column the boundaries are 0, 500, 1000, 3000, 5000, 10000; they determine which range each value of the feature falls into.

Applying the boundaries: a helper function named assign_range is applied to each column. It converts the values of a column into a categorical range column based on the defined boundaries. For example, an observation with a BALANCE of 750 gets the value 2 in BALANCE_RANGE, since 750 falls between 500 and 1000 (a quick check follows below).

Removing redundant columns: once the columns have been converted into categorical ranges, the original numeric columns become redundant and are removed from the dataset. In conclusion, this function converts certain columns into categorical ranges and drops the original numeric columns afterwards.
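
To confirm the BALANCE example above: since preprocess_data (as written) adds the _RANGE columns to df itself, rows whose BALANCE lies between 500 and 1000 should all carry a BALANCE_RANGE of 2. A quick check:

Code
# Values in (500, 1000] should have been assigned to range bucket 2
mask = (df['BALANCE'] > 500) & (df['BALANCE'] <= 1000)
print(df.loc[mask, ['BALANCE', 'BALANCE_RANGE']].head())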

5. Clustering Models

In this section, we will try different clustering methods and methodologies to find the most suitable one. For the clustering models, I will use K-Means, DBSCAN, and OPTICS. For K-Means, I will use the unprocessed dataset, the outlier-adjusted dataset, and the PCA- and UMAP-reduced datasets. Since DBSCAN and OPTICS are, by nature, not strongly affected by outliers, their results will be obtained on the normal, PCA, and UMAP datasets.

Code
results = {
    "KMeans": {
        "normal": {},
        "umap": {},
        "pca": {},
        "outlier_free": {}
    },
    "DBSCAN": {
        "normal": {},
        "umap": {},
        "pca": {}
    },
    "OPTICS": {
        "normal": {},
        "umap": {},
        "pca": {}
    }
}

This code constructs a structure to store the results and performance metrics of the different clustering algorithms. Specifically, we will evaluate the KMeans, DBSCAN, and OPTICS algorithms, and we will test them on the normal, UMAP-reduced, PCA-reduced, and (for KMeans) outlier-free datasets. The dictionary mirrors the roadmap of the analysis. Briefly:

KMeans: It is an algorithm to cluster the data into a certain "k" number of clusters.

DBSCAN: It is a density-based clustering algorithm and does not require specifying the number of clusters in advance.

OPTICS: It is a generalized version of DBSCAN, a density-based clustering algorithm.

UMAP: It is a dimension reduction method used to reduce the data to a lower-dimensional space.

PCA: It is the abbreviation for Principal Component Analysis and is used to represent the data with fewer features.

Silhouette Coefficient: A metric measuring the internal cohesion of a cluster and the separation between clusters.

Calinski-Harabasz Index: A score based on the ratio of between-cluster dispersion to within-cluster dispersion; higher values indicate better-defined clusters.

Davies-Bouldin Index: It measures the ratio of intra-cluster similarities to inter-cluster similarities. Lower values indicate better clustering.
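
Since these same three metrics are computed after every model below, a small helper like the following could reduce repetition; this is just a sketch, and the notebook keeps computing them inline instead:

Code
def clustering_scores(X, labels):
    # The three internal validation metrics used throughout this analysis
    return {
        "Silhouette Coefficient": silhouette_score(X, labels),
        "Calinski-Harabasz Index": calinski_harabasz_score(X, labels),
        "Davies-Bouldin Index": davies_bouldin_score(X, labels),
    }

# Hypothetical usage: results["KMeans"]["normal"] = clustering_scores(df_scaled, labels)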

5.1 K-Means

K-Means is a clustering algorithm used to divide the data into 'k' clusters. The algorithm starts by initializing the cluster centers randomly and then iteratively updates each center to be the mean of the data points assigned to it. This process continues until a stopping criterion is met (e.g., the centers no longer move). The elbow method and the silhouette score were used to support the K-Means process.

Elbow Method: One of the toughest decisions for the K-Means algorithm is how many clusters the data should be divided into. The elbow method is a common technique for this decision. It involves calculating the total within-cluster sum of squares (WCSS) for different values of k. As k increases, WCSS decreases; however, after a certain k the reduction becomes insignificant. This point, called the "elbow", helps us determine the optimal k.

Silhouette Score: The silhouette score measures how well the clusters are defined. It takes values between -1 and 1. A value close to 1 indicates that points are similar to their own cluster and different from other clusters, while a value close to -1 indicates that points have been assigned to the wrong cluster. A value around 0 indicates overlapping or poorly separated clusters.

5.1.1 K-Means Normal

Code
k_values = range(1, 11)
wcss = []
silhouette_scores = []

for k in k_values:
    kmeans = KMeans(n_clusters=k, init='k-means++', max_iter=300, n_init=10, random_state=0)
    kmeans.fit(df_scaled)
    wcss.append(kmeans.inertia_)
    if k > 1:
        silhouette_scores.append(silhouette_score(df_scaled, kmeans.labels_))
    else:
        silhouette_scores.append(0)

fig, ax1 = plt.subplots(figsize=(12, 7))

ax1.set_xlabel('Number of Clusters (k)')
ax1.set_ylabel('WCSS', color='tab:blue')
ax1.plot(k_values, wcss, 'o-', color='tab:blue')
ax1.tick_params(axis='y', labelcolor='tab:blue')

ax2 = ax1.twinx()
ax2.set_ylabel('Silhouette Score', color='tab:orange')
ax2.plot(k_values, silhouette_scores, 'o-', color='tab:orange')
ax2.tick_params(axis='y', labelcolor='tab:orange')

fig.tight_layout()
plt.title('Elbow Method and Silhouette Score')
plt.show()

Evaluation of the Graph

Consider the following when evaluating the graph:

Elbow Point: When examining the WCSS graph (blue line), look for the k value where WCSS does not decrease significantly. This point can serve as a clue to determine the optimal k value.

Silhouette Score: When examining the silhouette score graph (orange line), look for the k value with the highest score. A high silhouette score indicates well-defined clusters.

By using these two measurements together, you can determine the most suitable k value for the dataset.

Code
kmeans = KMeans(n_clusters=4, init='k-means++', max_iter=300, n_init=10, random_state=0)
kmeans.fit(df_scaled)
labels = kmeans.labels_

fig = plt.figure(figsize=(10, 8))
ax = fig.add_subplot(111, projection='3d')
colors = ['#1f77b4', '#ff7f0e', '#2ca02c', '#d62728']

for i, color in enumerate(colors):
    ax.scatter(df_scaled[labels == i, 0], df_scaled[labels == i, 1], df_scaled[labels == i, 2],
               c=color, label=f'Cluster {i+1}', s=50)

ax.set_title("3D Scatter Plot of Clusters")
ax.set_xlabel("Feature 1")
ax.set_ylabel("Feature 2")
ax.set_zlabel("Feature 3")
ax.legend()
plt.show()

silhouette_vals = silhouette_samples(df_scaled, labels)

fig, ax = plt.subplots(1, 1, figsize=(8, 6))
y_lower = 10

for i in range(4):
    ith_cluster_silhouette_values = silhouette_vals[labels == i]
    ith_cluster_silhouette_values.sort()

    size_cluster_i = ith_cluster_silhouette_values.shape[0]
    y_upper = y_lower + size_cluster_i

    color = cm.nipy_spectral(float(i) / 4)
    ax.fill_betweenx(np.arange(y_lower, y_upper),
                     0, ith_cluster_silhouette_values,
                     facecolor=color, edgecolor=color, alpha=0.7)

    ax.text(-0.05, y_lower + 0.5 * size_cluster_i, str(i+1))

    y_lower = y_upper + 10

ax.set_title("Silhouette Plot for the Clusters")
ax.set_xlabel("Silhouette Coefficient Values")
ax.set_ylabel("Cluster Label")
ax.set_yticks([])
ax.axvline(x=silhouette_score(df_scaled, labels), color="red", linestyle="--")
plt.show()

cluster_counts = np.bincount(labels)
total_count = len(labels)
percentages = (cluster_counts / total_count) * 100

plt.figure(figsize=(8, 8))
plt.pie(percentages, labels=[f'Cluster {i+1}' for i in range(4)], colors=colors, autopct='%1.1f%%',
        shadow=True, startangle=140)
plt.title("Percentage Distribution of Clusters")
plt.show()

1. 3D Cluster Distribution Graph: This graph shows how the clusters are distributed across three dimensions of the scaled dataset. Each color represents a different cluster.

Observation: If the clusters are distinctly separated from each other, it indicates that the K-Means algorithm has clustered the data well.

2. Silhouette Graph: The silhouette graph evaluates how close each data point is to the other points in its own cluster and how far it is from the nearest other cluster.

Bands: Each band represents a specific cluster. The thickness of a band reflects the number of data points in that cluster, while its horizontal extent shows the silhouette values of those points.

Red Line: The average silhouette score. Clusters whose values reach or exceed this line can be considered well defined.

Observation: If the bands reach or exceed the red line and their thicknesses are similar, the clusters are balanced and well defined.

3. Cluster Distribution Percentages Graph: This pie chart shows the percentage of data points falling into each cluster.

Observation: If the slices are of similar size, the clusters are evenly balanced. If one or more slices are significantly larger or smaller than the others, certain clusters contain many more or fewer data points than the rest.

Code
kmeans = KMeans(n_clusters=3)  
                    labels = kmeans.fit_predict(df_scaled)  
                    
                    
                    results["KMeans"]["normal"]["Silhouette Coefficient"] = silhouette_score(df_scaled, labels)
                    results["KMeans"]["normal"]["Calinski-Harabasz Index"] = calinski_harabasz_score(df_scaled, labels)
                    results["KMeans"]["normal"]["Davies-Bouldin Index"] = davies_bouldin_score(df_scaled, labels)
                    
                    for metric, value in results["KMeans"]["normal"].items():
                        print(f"{metric}: {value:.2f}")
Silhouette Coefficient: 0.25
Calinski-Harabasz Index: 1604.40
Davies-Bouldin Index: 1.60

Evaluation of Results: The meaning of the metrics printed above, and how to evaluate them:

Silhouette Coefficient: Ranges between -1 and 1. A value closer to 1 indicates well-defined clusters, a value closer to -1 indicates wrongly assigned points, and a value near 0 indicates clusters that are close together or overlapping.

Calinski-Harabasz Index: A higher value indicates better-defined clusters. The criterion measures the ratio of between-cluster dispersion to within-cluster dispersion.

Davies-Bouldin Index: Lower values indicate better-defined clusters. The index measures the average similarity between each cluster and its most similar cluster.

By using these metrics, you can compare the quality of clustering results obtained with different algorithms, or with different parameters of the same algorithm, and decide which approach suits your data best.

I will share the general evaluation of these results at the end of the analysis.
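
For that final comparison, one convenient option is to flatten the nested results dictionary into a single table once all the models have been run; a sketch of how that could look:

Code
# Flatten results into a (model, dataset) x metric table for side-by-side comparison
rows = []
for model, variants in results.items():
    for variant, metrics in variants.items():
        if metrics:  # skip variants that have not been computed yet
            rows.append({"Model": model, "Data": variant, **metrics})
print(pd.DataFrame(rows))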

5.1.2 Outlier-Adjusted K-Means

Code
df_wo = preprocess_data(df)
Code
k_values = range(1, 11)
wcss = []
silhouette_scores = []

for k in k_values:
    kmeans = KMeans(n_clusters=k, init='k-means++', max_iter=300, n_init=10, random_state=0)
    kmeans.fit(df_wo)
    wcss.append(kmeans.inertia_)
    if k > 1:
        silhouette_scores.append(silhouette_score(df_wo, kmeans.labels_))
    else:
        silhouette_scores.append(0)

fig, ax1 = plt.subplots(figsize=(12, 7))

ax1.set_xlabel('Number of Clusters (k)')
ax1.set_ylabel('WCSS', color='tab:blue')
ax1.plot(k_values, wcss, 'o-', color='tab:blue')
ax1.tick_params(axis='y', labelcolor='tab:blue')

ax2 = ax1.twinx()
ax2.set_ylabel('Silhouette Score', color='tab:orange')
ax2.plot(k_values, silhouette_scores, 'o-', color='tab:orange')
ax2.tick_params(axis='y', labelcolor='tab:orange')

fig.tight_layout()
plt.title('Elbow Method and Silhouette Score')
plt.show()

Code

kmeans = KMeans(n_clusters=3, init='k-means++', max_iter=300, n_init=10, random_state=0)
kmeans.fit(df_wo)
labels = kmeans.labels_

fig = plt.figure(figsize=(10, 8))
ax = fig.add_subplot(111, projection='3d')
colors = ['#1f77b4', '#ff7f0e', '#2ca02c']

for i, color in enumerate(colors):
    ax.scatter(df_wo[labels == i].iloc[:, 0], df_wo[labels == i].iloc[:, 1], df_wo[labels == i].iloc[:, 2],
               c=color, label=f'Cluster {i+1}', s=50)

ax.set_title("3D Scatter Plot of Clusters")
ax.set_xlabel("Feature 1")
ax.set_ylabel("Feature 2")
ax.set_zlabel("Feature 3")
ax.legend()
plt.show()

silhouette_vals = silhouette_samples(df_wo, labels)

fig, ax = plt.subplots(1, 1, figsize=(8, 6))
y_lower = 10

for i in range(3):
    ith_cluster_silhouette_values = silhouette_vals[labels == i]
    ith_cluster_silhouette_values.sort()

    size_cluster_i = ith_cluster_silhouette_values.shape[0]
    y_upper = y_lower + size_cluster_i

    color = cm.nipy_spectral(float(i) / 3)
    ax.fill_betweenx(np.arange(y_lower, y_upper),
                     0, ith_cluster_silhouette_values,
                     facecolor=color, edgecolor=color, alpha=0.7)

    ax.text(-0.05, y_lower + 0.5 * size_cluster_i, str(i+1))

    y_lower = y_upper + 10

ax.set_title("Silhouette Plot for the Clusters")
ax.set_xlabel("Silhouette Coefficient Values")
ax.set_ylabel("Cluster Label")
ax.set_yticks([])
ax.axvline(x=silhouette_score(df_wo, labels), color="red", linestyle="--")
plt.show()

cluster_counts = np.bincount(labels)
total_count = len(labels)
percentages = (cluster_counts / total_count) * 100

plt.figure(figsize=(8, 8))
plt.pie(percentages, labels=[f'Cluster {i+1}' for i in range(3)], colors=colors, autopct='%1.1f%%',
        shadow=True, startangle=140)
plt.title("Percentage Distribution of Clusters")
plt.show()

Code
kmeans = KMeans(n_clusters=3, init='k-means++', max_iter=300, n_init=10, random_state=0)
labels = kmeans.fit_predict(df_wo)

results["KMeans"]["outlier_free"]["Silhouette Coefficient"] = silhouette_score(df_wo, labels)
results["KMeans"]["outlier_free"]["Calinski-Harabasz Index"] = calinski_harabasz_score(df_wo, labels)
results["KMeans"]["outlier_free"]["Davies-Bouldin Index"] = davies_bouldin_score(df_wo, labels)

for metric, value in results["KMeans"]["outlier_free"].items():
    print(f"{metric}: {value:.2f}")
Silhouette Coefficient: 0.32
Calinski-Harabasz Index: 4345.55
Davies-Bouldin Index: 1.42

5.1.3 K-Means UMAP

UMAP (Uniform Manifold Approximation and Projection)

UMAP (Uniform Manifold Approximation and Projection) is a modern dimensionality reduction technique used for visualizing high-dimensional data sets in a lower-dimensional space. It is known for being fast and scalable, especially for large datasets.

Preservation of Local and Global Structure: UMAP aims to preserve both the local and global structures of the data. This is beneficial, especially for complex data structures.

General Applications: UMAP is not only used for visualization but also for general dimensionality reduction applications.

One of the biggest advantages of UMAP is that it operates faster and has more general-purpose features compared to other dimensionality reduction methods like t-SNE.
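
The balance between local and global structure is controlled mainly by the n_neighbors and min_dist parameters. The projection below uses the library defaults; a parameterized call would look roughly like this (the values are illustrative, not what is run here):

Code
# Larger n_neighbors emphasizes global structure; smaller min_dist packs similar points more tightly
reducer_tuned = umap.UMAP(n_neighbors=30, min_dist=0.1, n_components=2, random_state=42)
# embedding_tuned = reducer_tuned.fit_transform(df_scaled)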

Code
reducer = umap.UMAP()

embedding = reducer.fit_transform(df_scaled)

df_umap = pd.DataFrame(embedding, columns=['UMAP 1', 'UMAP 2'])

plt.figure(figsize=(12, 8))
plt.scatter(df_umap['UMAP 1'], df_umap['UMAP 2'], cmap='Spectral', s=5)
plt.gca().set_aspect('equal', 'datalim')
plt.colorbar(boundaries=np.arange(11)-0.5).set_ticks(np.arange(10))
plt.title('UMAP Projection', fontsize=24)
plt.show()

Code
k_values = range(1, 11)
wcss = []
silhouette_scores = []

for k in k_values:
    kmeans = KMeans(n_clusters=k, init='k-means++', max_iter=300, n_init=10, random_state=0)
    kmeans.fit(df_umap)
    wcss.append(kmeans.inertia_)
    if k > 1:
        silhouette_scores.append(silhouette_score(df_umap, kmeans.labels_))
    else:
        silhouette_scores.append(0)

fig, ax1 = plt.subplots(figsize=(12, 7))

ax1.set_xlabel('Number of Clusters (k)')
ax1.set_ylabel('WCSS', color='tab:blue')
ax1.plot(k_values, wcss, 'o-', color='tab:blue')
ax1.tick_params(axis='y', labelcolor='tab:blue')

ax2 = ax1.twinx()
ax2.set_ylabel('Silhouette Score', color='tab:orange')
ax2.plot(k_values, silhouette_scores, 'o-', color='tab:orange')
ax2.tick_params(axis='y', labelcolor='tab:orange')

fig.tight_layout()
plt.title('Elbow Method and Silhouette Score')
plt.show()

Code

kmeans = KMeans(n_clusters=2, init='k-means++', max_iter=300, n_init=10, random_state=0)
kmeans.fit(df_umap)
labels = kmeans.labels_
umap_array = df_umap.values

fig = plt.figure(figsize=(10, 8))
colors = ['#1f77b4', '#ff7f0e']

for i, color in enumerate(colors):
    plt.scatter(umap_array[labels == i, 0], umap_array[labels == i, 1],
                c=color, label=f'Cluster {i+1}', s=50)

plt.title("2D Scatter Plot of Clusters")
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.legend()
plt.show()

silhouette_vals = silhouette_samples(df_umap, labels)

fig, ax = plt.subplots(1, 1, figsize=(8, 6))
y_lower = 10

for i in range(2):
    ith_cluster_silhouette_values = silhouette_vals[labels == i]
    ith_cluster_silhouette_values.sort()

    size_cluster_i = ith_cluster_silhouette_values.shape[0]
    y_upper = y_lower + size_cluster_i

    color = cm.nipy_spectral(float(i) / 2)
    ax.fill_betweenx(np.arange(y_lower, y_upper),
                     0, ith_cluster_silhouette_values,
                     facecolor=color, edgecolor=color, alpha=0.7)

    ax.text(-0.05, y_lower + 0.5 * size_cluster_i, str(i+1))

    y_lower = y_upper + 10

ax.set_title("Silhouette Plot for the Clusters")
ax.set_xlabel("Silhouette Coefficient Values")
ax.set_ylabel("Cluster Label")
ax.set_yticks([])
# Average silhouette score of the UMAP-based clustering
ax.axvline(x=silhouette_score(df_umap, labels), color="red", linestyle="--")
plt.show()
                    
                    

Code
kmeans_umap = KMeans(n_clusters=2, init='k-means++', max_iter=300, n_init=10, random_state=0)
labels_umap = kmeans_umap.fit_predict(df_umap)

results["KMeans"]["umap"]["Silhouette Coefficient"] = silhouette_score(df_umap, labels_umap)
results["KMeans"]["umap"]["Calinski-Harabasz Index"] = calinski_harabasz_score(df_umap, labels_umap)
results["KMeans"]["umap"]["Davies-Bouldin Index"] = davies_bouldin_score(df_umap, labels_umap)

for metric, value in results["KMeans"]["umap"].items():
    print(f"{metric}: {value:.2f}")
Silhouette Coefficient: 0.42
Calinski-Harabasz Index: 8208.00
Davies-Bouldin Index: 0.95

5.1.4 K-Means PCA

PCA (Principal Component Analysis)

PCA (Principal Component Analysis) is a statistical method used for dimensionality reduction in multivariate data sets by maximizing the variance among variables. Principal components are selected orthogonally to capture the main structure of the data.

Variance Maximization: PCA selects the principal components to capture the maximum variance in the data set.

Low-Dimensional Representation: It represents the data set with fewer features, making it useful for visualization and modeling.

Explained Variance: It provides information on how much variance each component explains in the data set, thus making it easier to select the most important components.

PCA is a popular dimensionality reduction method frequently used in data science and machine learning applications.
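
Before fixing the number of components at two (as done below), it can help to look at how the cumulative explained variance grows with the number of components; a quick sketch:

Code
from sklearn.decomposition import PCA

# Cumulative explained variance as a function of the number of principal components
pca_full = PCA().fit(df_scaled)
cumvar = np.cumsum(pca_full.explained_variance_ratio_)
plt.plot(range(1, len(cumvar) + 1), cumvar, 'o-')
plt.xlabel('Number of components')
plt.ylabel('Cumulative explained variance')
plt.title('PCA Explained Variance')
plt.show()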

Code
from sklearn.decomposition import PCA

pca = PCA(n_components=2)
df_pca = pca.fit_transform(df_scaled)

print(df_pca.shape)

explained_variance = pca.explained_variance_ratio_
print(f"Explained Variance (first two components): {100 * explained_variance.sum():.2f}%")
(8950, 2)
Explained Variance (first two components): 47.59%
Code
k_values = range(1, 11)
wcss = []
silhouette_scores = []

for k in k_values:
    kmeans = KMeans(n_clusters=k, init='k-means++', max_iter=300, n_init=10, random_state=0)
    kmeans.fit(df_pca)
    wcss.append(kmeans.inertia_)
    if k > 1:
        silhouette_scores.append(silhouette_score(df_pca, kmeans.labels_))
    else:
        silhouette_scores.append(0)

fig, ax1 = plt.subplots(figsize=(12, 7))

ax1.set_xlabel('Number of Clusters (k)')
ax1.set_ylabel('WCSS', color='tab:blue')
ax1.plot(k_values, wcss, 'o-', color='tab:blue')
ax1.tick_params(axis='y', labelcolor='tab:blue')

ax2 = ax1.twinx()
ax2.set_ylabel('Silhouette Score', color='tab:orange')
ax2.plot(k_values, silhouette_scores, 'o-', color='tab:orange')
ax2.tick_params(axis='y', labelcolor='tab:orange')

fig.tight_layout()
plt.title('Elbow Method and Silhouette Score')
plt.show()

Code
kmeans = KMeans(n_clusters=3, init='k-means++', max_iter=300, n_init=10, random_state=0)
kmeans.fit(df_pca)
kmeans_labels_pca = kmeans.labels_

fig = plt.figure(figsize=(10, 8))
colors = ['#1f77b4', '#ff7f0e', '#2ca02c']

for i, color in enumerate(colors):
    plt.scatter(df_pca[kmeans_labels_pca == i, 0], df_pca[kmeans_labels_pca == i, 1],
                c=color, label=f'Cluster {i+1}', s=50)

plt.title("2D Scatter Plot of Clusters")
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.legend()
plt.show()

silhouette_vals = silhouette_samples(df_pca, kmeans_labels_pca)

fig, ax = plt.subplots(1, 1, figsize=(8, 6))
y_lower = 10

for i in range(3):
    ith_cluster_silhouette_values = silhouette_vals[kmeans_labels_pca == i]
    ith_cluster_silhouette_values.sort()

    size_cluster_i = ith_cluster_silhouette_values.shape[0]
    y_upper = y_lower + size_cluster_i

    color = cm.nipy_spectral(float(i) / 3)
    ax.fill_betweenx(np.arange(y_lower, y_upper),
                     0, ith_cluster_silhouette_values,
                     facecolor=color, edgecolor=color, alpha=0.7)

    ax.text(-0.05, y_lower + 0.5 * size_cluster_i, str(i+1))

    y_lower = y_upper + 10

ax.set_title("Silhouette Plot for the Clusters")
ax.set_xlabel("Silhouette Coefficient Values")
ax.set_ylabel("Cluster Label")
ax.set_yticks([])
# Average silhouette score of the PCA-based clustering
ax.axvline(x=silhouette_score(df_pca, kmeans_labels_pca), color="red", linestyle="--")
plt.show()

Code
kmeans_pca = KMeans(n_clusters=3, init='k-means++', max_iter=300, n_init=10, random_state=0)
kmeans_labels_pca = kmeans_pca.fit_predict(df_pca)

results["KMeans"]["pca"]["Silhouette Coefficient"] = silhouette_score(df_pca, kmeans_labels_pca)
results["KMeans"]["pca"]["Calinski-Harabasz Index"] = calinski_harabasz_score(df_pca, kmeans_labels_pca)
results["KMeans"]["pca"]["Davies-Bouldin Index"] = davies_bouldin_score(df_pca, kmeans_labels_pca)

for metric, value in results["KMeans"]["pca"].items():
    print(f"{metric}: {value:.2f}")
Silhouette Coefficient: 0.45
Calinski-Harabasz Index: 5337.49
Davies-Bouldin Index: 0.81

5.2 DBSCAN

DBSCAN (Density-Based Spatial Clustering of Applications with Noise)

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a density-based clustering algorithm. It identifies dense regions of data points and assigns points in these regions to the same cluster. Points in non-dense regions are classified as noise.

Working Principle:

- A neighborhood of a certain radius (eps) is defined around each data point. If there are at least min_samples data points within this neighborhood, the point is considered part of a dense region.
- Dense regions are merged to form larger clusters.
- Points outside dense regions that do not have enough neighbors in their vicinity are labeled as noise.

Advantages:

- It makes no assumptions about the shapes of clusters, so clusters do not have to be circular.
- It can automatically distinguish noise.
- The number of clusters does not have to be specified beforehand.

Challenges:

- It may struggle to accurately separate clusters with different densities.
- The eps and min_samples parameters need to be tuned properly.

DBSCAN is a suitable option especially for noisy datasets and clusters with complex structures.
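
To get a feel for how eps, min_samples, and the noise label interact before applying DBSCAN to the bank data, here is a small sketch on a synthetic two-moons dataset (the parameter values are illustrative):

Code
from sklearn.datasets import make_moons

# On two interleaved crescents, DBSCAN should recover the shapes and flag sparse points as noise (-1)
X_toy, _ = make_moons(n_samples=500, noise=0.05, random_state=0)
toy_labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X_toy)
print('Clusters found:', len(set(toy_labels)) - (1 if -1 in toy_labels else 0))
print('Noise points:', (toy_labels == -1).sum())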

5.2.1 DBSCAN Normal

Code
X = df_scaled

def epsilon(X):
    # k-distance plot: sort each point's nearest-neighbor distance and look for the "knee" to choose eps
    neighbors = NearestNeighbors(n_neighbors=2)
    nbrs = neighbors.fit(X)
    distances, indices = nbrs.kneighbors(X)
    distances = np.sort(distances, axis=0)

    distances_1 = distances[:, 1]
    plt.plot(distances_1, color='#5829A7')
    plt.xlabel('Total')
    plt.ylabel('Distance')

    for spine in plt.gca().spines.values():
        spine.set_color('None')

    plt.grid(axis='y', alpha=0.5, color='#9B9A9C', linestyle='dotted')
    plt.grid(axis='x', alpha=0)

    plt.title('DBSCAN Epsilon Value for Scaled Data')
    plt.tight_layout()
    plt.show()

epsilon(X)

Code
dbscan = DBSCAN(eps=2.2, min_samples=5)
                    labels = dbscan.fit_predict(df_scaled)
                    
                    plt.figure(figsize=(10, 8))
                    unique_labels = np.unique(labels)
                    colors = [plt.cm.Spectral(each) for each in np.linspace(0, 1, len(unique_labels))]
                    
                    for k, col in zip(unique_labels, colors):
                        if k == -1:  
                            col = [0.6, 0.6, 0.6, 1]
                        class_member_mask = (labels == k)
                        xy = df_scaled[class_member_mask]
                        plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=tuple(col), markeredgecolor='k', markersize=6)
                    
                    plt.title('DBSCAN Clustering')
                    plt.xlabel('Feature 1')  
                    plt.ylabel('Feature 2')  
                    plt.grid(True)
                    plt.show()

Code
dbscan_normal = DBSCAN(eps=2.2, min_samples=5)
                    labels_normal = dbscan_normal.fit_predict(df_scaled)
                    
                    results["DBSCAN"]["normal"]["Silhouette Coefficient"] = silhouette_score(df_scaled, labels_normal) if len(np.unique(labels_normal)) > 1 else 0
                    results["DBSCAN"]["normal"]["Calinski-Harabasz Index"] = calinski_harabasz_score(df_scaled, labels_normal)
                    results["DBSCAN"]["normal"]["Davies-Bouldin Index"] = davies_bouldin_score(df_scaled, labels_normal)
                    
                    for metric, value in results["DBSCAN"]["normal"].items():
                        print(f"{metric}: {value:.2f}")
Silhouette Coefficient: 0.52
Calinski-Harabasz Index: 939.76
Davies-Bouldin Index: 1.97

5.2.2 DBSCAN UMAP

Code
X = df_umap
                    
                    
                    def epsilon(X):
                        
                        
                        neighbors = NearestNeighbors(n_neighbors=2)
                        nbrs = neighbors.fit(X)
                        distances, indices = nbrs.kneighbors(X)
                        distances = np.sort(distances, axis=0)
                        
                        
                        distances_1 = distances[:, 1]
                        plt.plot(distances_1, color='#5829A7')
                        plt.xlabel('Points (sorted by distance)')
                        plt.ylabel('Distance')
                            
                        for spine in plt.gca().spines.values():
                            spine.set_color('None')
                            
                        plt.grid(axis='y', alpha=0.5, color='#9B9A9C', linestyle='dotted')
                        plt.grid(axis='x', alpha=0)
                        
                        plt.title('DBSCAN Epsilon Value for UMAP Data')
                        plt.tight_layout()
                        plt.show();
                    
                    epsilon(X);

Code
dbscan = DBSCAN(eps=0.15, min_samples=2)
                    labels = dbscan.fit_predict(df_umap.values)  
                    
                    plt.figure(figsize=(10, 8))
                    unique_labels = np.unique(labels)
                    colors = [plt.cm.Spectral(each) for each in np.linspace(0, 1, len(unique_labels))]
                    
                    for k, col in zip(unique_labels, colors):
                        if k == -1: 
                            col = [0.6, 0.6, 0.6, 1]
                        class_member_mask = (labels == k)
                        xy = df_umap.values[class_member_mask] 
                        plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=tuple(col), markeredgecolor='k', markersize=6)
                    
                    plt.title('DBSCAN Clustering')
                    plt.xlabel('UMAP 1')
                    plt.ylabel('UMAP 2')
                    plt.grid(True)
                    plt.show()

Code
dbscan_umap = DBSCAN(eps=2.23, min_samples=2)
                    labels_umap = dbscan_umap.fit_predict(df_umap)
                    
                    results["DBSCAN"]["umap"]["Silhouette Coefficient"] = silhouette_score(df_umap, labels_umap) if len(np.unique(labels_umap)) > 1 else 0
                    results["DBSCAN"]["umap"]["Calinski-Harabasz Index"] = calinski_harabasz_score(df_umap, labels_umap)
                    results["DBSCAN"]["umap"]["Davies-Bouldin Index"] = davies_bouldin_score(df_umap, labels_umap)
                    
                    for metric, value in results["DBSCAN"]["umap"].items():
                        print(f"{metric}: {value:.2f}")
Silhouette Coefficient: 0.25
Calinski-Harabasz Index: 232.31
Davies-Bouldin Index: 0.63

5.2.3 DBSCAN PCA

Code
X = df_pca
                    
                    def epsilon(X):
                        
                        neighbors = NearestNeighbors(n_neighbors=2)
                        nbrs = neighbors.fit(X)
                        distances, indices = nbrs.kneighbors(X)
                        distances = np.sort(distances, axis=0)
                    
                        distances_1 = distances[:, 1]
                        plt.plot(distances_1, color='#5829A7')
                        plt.xlabel('Points (sorted by distance)')
                        plt.ylabel('Distance')
                            
                        for spine in plt.gca().spines.values():
                            spine.set_color('None')
                            
                        plt.grid(axis='y', alpha=0.5, color='#9B9A9C', linestyle='dotted')
                        plt.grid(axis='x', alpha=0)
                        
                        plt.title('DBSCAN Epsilon Value for PCA Data')
                        plt.tight_layout()
                        plt.show();
                    
                    epsilon(X);

Code
dbscan = DBSCAN(eps=1, min_samples=5)
                    labels = dbscan.fit_predict(df_pca)  
                    
                    plt.figure(figsize=(10, 8))
                    unique_labels = np.unique(labels)
                    colors = [plt.cm.Spectral(each) for each in np.linspace(0, 1, len(unique_labels))]
                    
                    for k, col in zip(unique_labels, colors):
                        if k == -1:  
                            col = [0.6, 0.6, 0.6, 1]
                        class_member_mask = (labels == k)
                        xy = df_pca[class_member_mask]  
                        plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=tuple(col), markeredgecolor='k', markersize=6)
                    
                    plt.title('DBSCAN Clustering')
                    plt.xlabel('PCA 1')
                    plt.ylabel('PCA 2')
                    plt.grid(True)
                    plt.show()

Code
dbscan_pca = DBSCAN(eps=1, min_samples=5)
                    labels_pca = dbscan_pca.fit_predict(df_pca)
                    
                    results["DBSCAN"]["pca"]["Silhouette Coefficient"] = silhouette_score(df_pca, labels_pca) if len(np.unique(labels_pca)) > 1 else 0
                    results["DBSCAN"]["pca"]["Calinski-Harabasz Index"] = calinski_harabasz_score(df_pca, labels_pca)
                    results["DBSCAN"]["pca"]["Davies-Bouldin Index"] = davies_bouldin_score(df_pca, labels_pca)
                    
                    for metric, value in results["DBSCAN"]["pca"].items():
                        print(f"{metric}: {value:.2f}")
Silhouette Coefficient: 0.80
Calinski-Harabasz Index: 1347.22
Davies-Bouldin Index: 0.78

5.3 OPTICS

OPTICS (Ordering Points To Identify the Clustering Structure)

OPTICS (Ordering Points To Identify the Clustering Structure) is a density-based clustering algorithm and can be considered as a generalization of DBSCAN. However, it is known for its ability to identify structures of different densities in the dataset, instead of using a single eps value.

Working Principle:

  • A reachability distance is calculated for each data point; it measures how far a point is from the nearest dense region.
  • Ordering the points by these distances produces a reachability plot.
  • Valleys in the plot correspond to density-based clusters.

Advantages:

  • It can detect clusters of varying densities.
  • You do not have to select a single fixed value for the eps parameter.
  • It can automatically distinguish noise.

Challenges:

  • The computational cost can be high, especially for large datasets.
  • Interpreting the results can be slightly more challenging than with DBSCAN.

OPTICS is quite useful for datasets with regions of varying density, since it can detect transitions between clusters of different densities. A small sketch of the reachability plot is shown below.
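The reachability plot mentioned above is not drawn in the notebook itself, so here is an optional sketch of what it could look like for the scaled data; it assumes the df_scaled array used throughout this post and the min_samples=9 setting applied later in this section.

Code
import matplotlib.pyplot as plt
from sklearn.cluster import OPTICS

optics_demo = OPTICS(min_samples=9).fit(df_scaled)

# reachability_ holds each point's reachability distance and ordering_ gives
# the order in which OPTICS processed the points; plotting the distances in
# that order produces the reachability plot, where valleys indicate clusters.
reachability = optics_demo.reachability_[optics_demo.ordering_]

plt.figure(figsize=(10, 4))
plt.plot(reachability, lw=0.7, color='#5829A7')
plt.title('OPTICS Reachability Plot (scaled data)')
plt.xlabel('Points (processing order)')
plt.ylabel('Reachability distance')
plt.show()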

5.3.1 OPTICS Normal

Code
optics = OPTICS(min_samples=5, max_eps=2.2)
                    labels = optics.fit_predict(df_scaled)
                    
                    plt.figure(figsize=(10, 8))
                    unique_labels = np.unique(labels)
                    colors = [plt.cm.Spectral(each) for each in np.linspace(0, 1, len(unique_labels))]
                    
                    for k, col in zip(unique_labels, colors):
                        if k == -1:  
                            col = [0.6, 0.6, 0.6, 1]
                        class_member_mask = (labels == k)
                        xy = df_scaled[class_member_mask]
                        plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=tuple(col), markeredgecolor='k', markersize=6)  
                    
                    plt.title('OPTICS Clustering')
                    plt.xlabel('Feature 1')  
                    plt.ylabel('Feature 2')  
                    plt.grid(True)
                    plt.show()

Code
optics_normal = OPTICS(min_samples=9)
                    labels_normal = optics_normal.fit_predict(df_scaled)
                    
                    results["OPTICS"]["normal"]["Silhouette Coefficient"] = silhouette_score(df_scaled, labels_normal) if len(np.unique(labels_normal)) > 1 else 0
                    results["OPTICS"]["normal"]["Calinski-Harabasz Index"] = calinski_harabasz_score(df_scaled, labels_normal)
                    results["OPTICS"]["normal"]["Davies-Bouldin Index"] = davies_bouldin_score(df_scaled, labels_normal)
                    
                    for metric, value in results["OPTICS"]["normal"].items():
                        print(f"{metric}: {value:.2f}")
Silhouette Coefficient: -0.45
Calinski-Harabasz Index: 11.93
Davies-Bouldin Index: 1.33

5.3.2 OPTICS UMAP

Code
optics = OPTICS(min_samples=2, max_eps=0.15)
                    labels = optics.fit_predict(df_umap.values)  
                    
                    plt.figure(figsize=(10, 8))
                    unique_labels = np.unique(labels)
                    colors = [plt.cm.Spectral(each) for each in np.linspace(0, 1, len(unique_labels))]
                    
                    for k, col in zip(unique_labels, colors):
                        if k == -1:  
                            col = [0.6, 0.6, 0.6, 1]
                        class_member_mask = (labels == k)
                        xy = df_umap.values[class_member_mask]  
                        plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=tuple(col), markeredgecolor='k', markersize=6)
                    
                    plt.title('OPTICS Clustering')
                    plt.xlabel('UMAP 1')
                    plt.ylabel('UMAP 2')
                    plt.grid(True)
                    plt.show()

Code
optics_umap = OPTICS(min_samples=2)
                    labels_umap = optics_umap.fit_predict(df_umap)
                    
                    results["OPTICS"]["umap"]["Silhouette Coefficient"] = silhouette_score(df_umap, labels_umap) if len(np.unique(labels_umap)) > 1 else 0
                    results["OPTICS"]["umap"]["Calinski-Harabasz Index"] = calinski_harabasz_score(df_umap, labels_umap)
                    results["OPTICS"]["umap"]["Davies-Bouldin Index"] = davies_bouldin_score(df_umap, labels_umap)
                    
                    for metric, value in results["OPTICS"]["umap"].items():
                        print(f"{metric}: {value:.2f}")
Silhouette Coefficient: 0.24
Calinski-Harabasz Index: 10.24
Davies-Bouldin Index: 1.43

5.3.3 OPTICS PCA

Code
optics = OPTICS(min_samples=5, max_eps=1)
                    labels = optics.fit_predict(df_pca)
                    
                    plt.figure(figsize=(10, 8))
                    unique_labels = np.unique(labels)
                    colors = [plt.cm.Spectral(each) for each in np.linspace(0, 1, len(unique_labels))]
                    
                    for k, col in zip(unique_labels, colors):
                        if k == -1:  
                            col = [0.6, 0.6, 0.6, 1]
                        
                        class_member_mask = (labels == k)
                        xy = df_pca[class_member_mask]  
                        plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=tuple(col), markeredgecolor='k', markersize=6)
                    
                    plt.title('OPTICS Clustering')
                    plt.xlabel('PCA 1')
                    plt.ylabel('PCA 2')
                    plt.grid(True)
                    plt.show()

Code
optics_pca = OPTICS(min_samples=5)
                    labels_pca = optics_pca.fit_predict(df_pca)
                    
                    results["OPTICS"]["pca"]["Silhouette Coefficient"] = silhouette_score(df_pca, labels_pca) if len(np.unique(labels_pca)) > 1 else 0
                    results["OPTICS"]["pca"]["Calinski-Harabasz Index"] = calinski_harabasz_score(df_pca, labels_pca)
                    results["OPTICS"]["pca"]["Davies-Bouldin Index"] = davies_bouldin_score(df_pca, labels_pca)
                    
                    for metric, value in results["OPTICS"]["pca"].items():
                        print(f"{metric}: {value:.2f}")
Silhouette Coefficient: -0.18
Calinski-Harabasz Index: 10.06
Davies-Bouldin Index: 1.68

6. Results

Code
from IPython.display import display, Markdown
                    
                    def display_results(results):
                        for method, data in results.items():
                            display(Markdown(f"### {method}"))
                            for dataset, metrics in data.items():
                                display(Markdown(f"#### {dataset}"))
                                for metric, value in metrics.items():
                                    display(Markdown(f"- **{metric}**: {value:.2f}"))
                    
                    display_results(results)

KMeans

normal

  • Silhouette Coefficient: 0.25
  • Calinski-Harabasz Index: 1604.40
  • Davies-Bouldin Index: 1.60

umap

  • Silhouette Coefficient: 0.42
  • Calinski-Harabasz Index: 8208.00
  • Davies-Bouldin Index: 0.95

pca

  • Silhouette Coefficient: 0.45
  • Calinski-Harabasz Index: 5337.49
  • Davies-Bouldin Index: 0.81

outlier_free

  • Silhouette Coefficient: 0.32
  • Calinski-Harabasz Index: 4345.55
  • Davies-Bouldin Index: 1.42

DBSCAN

normal

  • Silhouette Coefficient: 0.52
  • Calinski-Harabasz Index: 939.76
  • Davies-Bouldin Index: 1.97

umap

  • Silhouette Coefficient: 0.25
  • Calinski-Harabasz Index: 232.31
  • Davies-Bouldin Index: 0.63

pca

  • Silhouette Coefficient: 0.80
  • Calinski-Harabasz Index: 1347.22
  • Davies-Bouldin Index: 0.78

OPTICS

normal

  • Silhouette Coefficient: -0.45
  • Calinski-Harabasz Index: 11.93
  • Davies-Bouldin Index: 1.33

umap

  • Silhouette Coefficient: 0.24
  • Calinski-Harabasz Index: 10.24
  • Davies-Bouldin Index: 1.43

pca

  • Silhouette Coefficient: -0.18
  • Calinski-Harabasz Index: 10.06
  • Davies-Bouldin Index: 1.68

Analysis of Clustering Results

KMeans

Normal

  • Silhouette Coefficient (SC): 0.25
    This metric ranges between -1 and 1. Higher values indicate that the object is well-fitted to its own cluster and has a weak match with neighboring clusters. The score of 0.25 suggests that the clusters overlap to some extent and there is room for improvement.

  • Calinski-Harabasz Index (CHI): 1604.40
    It is the ratio of between-cluster variance to within-cluster variance. Higher values indicate that clusters are dense and well-separated.

  • Davies-Bouldin Index (DBI): 1.60
    It is the average similarity ratio of each cluster with its most similar cluster. Values close to 0 are better. The score of 1.60 indicates a moderate clustering structure.

UMAP

  • SC: 0.42
    An improvement over the normal data, showing better defined clusters after UMAP is applied.

  • CHI: 8208.00
    A significant increase, indicating that UMAP helped improve the clustering structure.

  • DBI: 0.95
    A decrease, indicating better separation of clusters after UMAP is applied.

PCA

  • SC: 0.45
    Better cluster definition after PCA is applied, compared to UMAP and normal data.

  • CHI: 5337.49
    Lower than UMAP but higher than normal; indicates that PCA also aided in achieving a good clustering structure.

  • DBI: 0.81
    Indicates better separation of clusters compared to using normal data.

Outlier-free (Without Outliers)

Metrics in this category show that removing outliers made the clustering structure slightly better compared to normal data, but not as good as applying dimension reduction techniques like PCA or UMAP.

DBSCAN

For the normal and pca data, DBSCAN generally produced better SC scores than KMeans; with an SC of 0.80, the pca data set in particular shows very well-defined clusters. The DBI scores also point to good cluster separation, especially for the umap and pca data sets. The CHI scores, however, are lower than those of KMeans and drop sharply for the umap data set, indicating less dense clusters or greater cluster overlap there.

OPTICS

Normal:

  • SC: -0.45
    A negative silhouette score indicates that the clusters are not well-defined and overlap significantly. Other metrics also reflect a weak clustering structure.

UMAP & PCA:

Metrics for these data sets are not promising. Negative or low silhouette scores and very low CHI scores indicate poor cluster definition and density.

Summary

KMeans with UMAP or PCA
KMeans combined with UMAP or PCA seems to yield the best results. Although PCA shows a slight advantage in silhouette coefficient, both methods indicate well-defined clusters. This could make it worthwhile to evaluate both approaches for a specific application.

DBSCAN
The combination of DBSCAN with PCA has a fairly high silhouette coefficient, indicating that this approach also defines clusters quite well. However, the results obtained with UMAP are not as successful for DBSCAN.

OPTICS
The performance of OPTICS is generally lower compared to the other two methods depending on the dataset used. Particularly, the negative silhouette coefficients indicate that the clusters are quite poorly defined.


In conclusion, if your primary goal is to obtain well-defined clusters, combining KMeans with PCA or UMAP is recommended. On the other hand, DBSCAN may be worth trying for more complex structures, especially when combined with PCA. Although OPTICS performed poorly on this particular dataset, it may produce different results on other datasets, so it is worth evaluating before ruling it out.
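As a convenience, the nested results dictionary built up in Section 5 can be flattened into a single table and ranked, for example by silhouette coefficient. The snippet below is a small sketch that assumes results keeps the {method: {dataset: {metric: value}}} structure used in the code cells above.

Code
import pandas as pd

# Flatten the nested results dict into one row per (method, dataset) pair.
rows = []
for method, datasets in results.items():
    for dataset, metrics in datasets.items():
        rows.append({"method": method, "dataset": dataset, **metrics})

summary = pd.DataFrame(rows).set_index(["method", "dataset"])

# Rank configurations by silhouette coefficient (higher is better).
print(summary.sort_values("Silhouette Coefficient", ascending=False))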

Code
# Add the cluster labels to the original data set.
                    df['Cluster'] = kmeans_labels_pca
                    
                    # Column means for each cluster (numeric columns only).
                    cluster_means = df.groupby('Cluster').mean(numeric_only=True)
                    
                    # Show the mean values as a table.
                    cluster_means
BALANCE BALANCE_FREQUENCY PURCHASES ONEOFF_PURCHASES INSTALLMENTS_PURCHASES CASH_ADVANCE PURCHASES_FREQUENCY ONEOFF_PURCHASES_FREQUENCY PURCHASES_INSTALLMENTS_FREQUENCY CASH_ADVANCE_FREQUENCY ... PAYMENTS_RANGE MINIMUM_PAYMENTS_RANGE BALANCE_FREQUENCY_RANGE PURCHASES_FREQUENCY_RANGE ONEOFF_PURCHASES_FREQUENCY_RANGE PURCHASES_INSTALLMENTS_FREQUENCY_RANGE CASH_ADVANCE_FREQUENCY_RANGE PRC_FULL_PAYMENT_RANGE PURCHASES_TRX_RANGE CASH_ADVANCE_TRX_RANGE
Cluster
0 789.381289 0.835555 516.904913 264.930837 252.295136 316.614221 0.474579 0.140900 0.348978 0.067312 ... 1.869153 1.404809 8.548250 5.000981 1.566405 3.695944 0.802257 1.672718 2.052993 0.443409
1 3925.835034 0.956794 361.745825 241.267864 120.553232 3791.893111 0.214590 0.106232 0.128552 0.437004 ... 3.077906 2.484480 9.663421 2.309191 1.191722 1.390140 4.780280 0.364577 1.125380 2.623859
2 2284.681931 0.981539 4378.858533 2754.504803 1624.856664 498.773452 0.950954 0.650455 0.768528 0.067055 ... 3.704107 1.886840 9.885163 9.659681 6.824811 7.924560 0.768650 3.176865 6.145013 0.464376

3 rows × 33 columns
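The 33-column table above is hard to scan, so one optional way to profile the clusters (not part of the original notebook) is to standardize each feature across the three clusters and draw the result as a heatmap; the sketch below assumes the cluster_means frame computed in the previous cell.

Code
import matplotlib.pyplot as plt
import seaborn as sns

# Z-score each feature across the clusters so all columns share one scale;
# a constant column would divide by zero, hence the fillna(0).
profile = ((cluster_means - cluster_means.mean()) / cluster_means.std()).fillna(0)

plt.figure(figsize=(6, 12))
sns.heatmap(profile.T, cmap='coolwarm', center=0)
plt.title('Standardized Cluster Profiles')
plt.xlabel('Cluster')
plt.ylabel('Feature')
plt.tight_layout()
plt.show()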

Predicting/interpreting clusters

Cluster 0

  • Balance (BALANCE): The customers of this group carry an average level balance. This means they have neither very high nor very low spending capacity.

  • Purchases (PURCHASES): Customers in this group make purchases at an average level, split between one-off and installment purchases.

  • Cash Advance (CASH_ADVANCE): There is an average level of cash advance usage.

  • Purchase Frequency (PURCHASES_FREQUENCY): These customers shop regularly, but not very frequently.

  • Credit Limit (CREDIT_LIMIT): The credit limit of this group may be lower compared to other groups.

  • They can be evaluated as average users. These users can use their credit cards for both daily shopping and larger purchases. The medium level of cash advance usage shows that they sometimes meet their cash needs from their cards.


Cluster 1

  • Balance (BALANCE): The customers of this group carry quite a high balance, which means they have a high spending capacity.

  • Purchases (PURCHASES): They shop less, which means this group is less active.

  • Cash Advance (CASH_ADVANCE): A high amount of cash advance usage is evident in this group, which shows these customers often have cash needs.

  • Purchase Frequency (PURCHASES_FREQUENCY): These customers shop less frequently.

  • Credit Limit (CREDIT_LIMIT): The credit limit of this group is higher than the other groups.

  • This group seems to consist of customers who use their cards primarily as a source of cash rather than for shopping. The high balances and credit limits indicate a high financial capacity and spending potential, but that potential is currently realized mostly through cash advances.


Cluster 2

  • Balance (BALANCE): The balance of this group is high, but not as much as Cluster 1.

  • Purchases (PURCHASES): Customers in this group make very high amounts of purchases and these purchases are made both at once and in installments.

  • Cash Advance (CASH_ADVANCE): Cash advance usage is slightly above average.

  • Purchase Frequency (PURCHASES_FREQUENCY): This customer group shops frequently.

  • Credit Limit (CREDIT_LIMIT): The credit limit of this group is average compared to the other two groups.

  • This group most likely represents the most engaged cardholders: they shop frequently and spend the most, both one-off and in installments, which suggests the card is their primary payment instrument for everyday as well as larger purchases.

However, to increase the accuracy of these interpretations, additional demographic or behavioral information (such as age, occupation, or income level) would be needed. Such information could help us describe each cluster in more detail and more accurately. Unfortunately, it is not available in the dataset I worked on. Since I want to use pydash in the continuation of the project, I will add age, gender, and city columns (I had added the city earlier). These additions are completely random; they only allow me to proceed and do not affect this analysis. One way of doing this is sketched below.
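For completeness, here is one possible way to attach such randomly generated columns for the dashboard; the column names, value ranges, and seed are illustrative assumptions (City already exists in the data), and the random values carry no signal for the analysis above.

Code
import numpy as np

rng = np.random.default_rng(42)

# Purely synthetic demographics for the dashboard demo; they carry no signal
# and play no role in the clustering analysis.
df['Age'] = rng.integers(18, 70, size=len(df))
df['Gender'] = rng.choice(['Female', 'Male'], size=len(df))

df[['Cluster', 'Age', 'Gender', 'City']].head()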

You can check out the Dash application I built to turn this analysis into an app and dashboard here.
