04/08/2023 • Mehmet Emre Toktay Bank Customer Segmentation: Analysis with K-Means, DBSCAN, and OPTICS Table of Contents 1. Introduction 2. Initial Review of the Data Set 3. Preprocessing 4. Exploratory Data Analysis (EDA) 5. Clustering Models 5.1 K-Means 5.1.1 K-Means Normal 5.1.2 K-Means Outlier Adjusted K-Means 5.1.3 K-Means UMAP 5.1.4 K-Means PCA 5.2 DBSCAN 5.2.1 DBSCAN Normal 5.2.2 DBSCAN UMAP 5.2.3 DBSCAN PCA 5.3 OPTICS 5.3.1 OPTICS Normal 5.3.2 OPTICS UMAP 5.3.3 OPTICS PCA 6. Results KMeans Normal UMAP PCA Outlier-free DBSCAN OPTICS Normal: UMAP & PCA: Summary Predicting/interpreting clusters Cluster 0 Cluster 1 Cluster 2 Bank Customer Segmentation: Analysis with K-Means, DBSCAN, and OPTICS 1. Introduction This study uses a data set containing the credit card usage behavior of approximately 9,000 customers over the last six months. Segmenting customers into behavioral groups is necessary for an effective and efficient credit card marketing strategy. The objectives of this notebook are as follows: - Explore the data set using data visualization techniques. - Perform data preprocessing before using models. - Group customers into clusters using different clustering models. - Interpret and analyze the created groups (profiling). - Provide marketing suggestions based on the profiling results and analyses. Code import numpy as np import pandas as pd import matplotlib.pyplot as plt import seaborn as sns from sklearn.cluster import KMeans, DBSCAN, OPTICS from sklearn.preprocessing import StandardScaler from pyclustertend import hopkins from sklearn.metrics import silhouette_score, calinski_harabasz_score, davies_bouldin_score, silhouette_samples from scipy.spatial.distance import pdist, squareform from mpl_toolkits.mplot3d import Axes3D from yellowbrick.cluster import SilhouetteVisualizer import umap from sklearn.neighbors import NearestNeighbors import matplotlib.cm as cm Code url = 'https://raw.githubusercontent.com/EmreToktay/pydash/main/Creditcard.csv' df = pd.read_csv(url) 2.
Initial Review of the Data Set First, let's get acquainted with the data set we have at hand; Code print(df.shape) df.head() (8950, 19) CUST_ID BALANCE BALANCE_FREQUENCY PURCHASES ONEOFF_PURCHASES INSTALLMENTS_PURCHASES CASH_ADVANCE PURCHASES_FREQUENCY ONEOFF_PURCHASES_FREQUENCY PURCHASES_INSTALLMENTS_FREQUENCY CASH_ADVANCE_FREQUENCY CASH_ADVANCE_TRX PURCHASES_TRX CREDIT_LIMIT PAYMENTS MINIMUM_PAYMENTS PRC_FULL_PAYMENT TENURE City 0 C10001 40.900749 0.818182 95.40 0.00 95.4 0.000000 0.166667 0.000000 0.083333 0.000000 0 2 1000.0 201.802084 139.509787 0.000000 12 Bursa 1 C10002 3202.467416 0.909091 0.00 0.00 0.0 6442.945483 0.000000 0.000000 0.000000 0.250000 4 0 7000.0 4103.032597 1072.340217 0.222222 12 Şanlıurfa 2 C10003 2495.148862 1.000000 773.17 773.17 0.0 0.000000 1.000000 1.000000 0.000000 0.000000 0 12 7500.0 622.066742 627.284787 0.000000 12 Çorum 3 C10004 1666.670542 0.636364 1499.00 1499.00 0.0 205.788017 0.083333 0.083333 0.000000 0.083333 1 1 7500.0 0.000000 NaN 0.000000 12 Balıkesir 4 C10005 817.714335 1.000000 16.00 16.00 0.0 0.000000 0.083333 0.083333 0.000000 0.000000 0 1 1200.0 678.334763 244.791237 0.000000 12 Batman Some detail regarding the headers; BALANCE: The available usage of your credit card BALANCE_FREQUENCY: How frequently the balance is updated, score between 0 and 1 (1 = frequently updated, 0 = not frequently updated) PURCHASES: The amount of shopping made from the account ONEOFF_PURCHASES: The highest shopping amount made at one time INSTALLMENTS_PURCHASES: The amount of shopping made in installments CASH_ADVANCE: Cash Advance PURCHASES_FREQUENCY: How frequently shopping is done, score between 0 and 1 (1 = shopping is done frequently, 0 = shopping is not done frequently) ONEOFF_PURCHASES_FREQUENCY: How frequently one-off purchases are made (1 = frequently done, 0 = not frequently done) PURCHASES_INSTALLMENTS_FREQUENCY: How frequently installment purchases are made (1 = frequently done, 0 = not frequently done) CASH_ADVANCE_FREQUENCY:How frequently the cash taken in advance is paid CASH_ADVANCE_TRX: The number of transactions made with “Cash Advance” PURCHASES_TRX: The number of shopping transactions made CREDIT_LIMIT: The credit card limit of the user PAYMENTS: The amount of payment made by the user MINIMUM_PAYMENTS: The amount of minimum payment made by the user PRC_FULL_PAYMENT: The percentage of full payment made by the user TENURE: The tenure of the user's credit card service Let's examine the data types; Code df.info() <class 'pandas.core.frame.DataFrame'> RangeIndex: 8950 entries, 0 to 8949 Data columns (total 19 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 CUST_ID 8950 non-null object 1 BALANCE 8950 non-null float64 2 BALANCE_FREQUENCY 8950 non-null float64 3 PURCHASES 8950 non-null float64 4 ONEOFF_PURCHASES 8950 non-null float64 5 INSTALLMENTS_PURCHASES 8950 non-null float64 6 CASH_ADVANCE 8950 non-null float64 7 PURCHASES_FREQUENCY 8950 non-null float64 8 ONEOFF_PURCHASES_FREQUENCY 8950 non-null float64 9 PURCHASES_INSTALLMENTS_FREQUENCY 8950 non-null float64 10 CASH_ADVANCE_FREQUENCY 8950 non-null float64 11 CASH_ADVANCE_TRX 8950 non-null int64 12 PURCHASES_TRX 8950 non-null int64 13 CREDIT_LIMIT 8949 non-null float64 14 PAYMENTS 8950 non-null float64 15 MINIMUM_PAYMENTS 8637 non-null float64 16 PRC_FULL_PAYMENT 8950 non-null float64 17 TENURE 8950 non-null int64 18 City 8950 non-null object dtypes: float64(14), int64(3), object(2) memory usage: 1.3+ MB Basic statistical values for each header: Code df.describe() 
BALANCE BALANCE_FREQUENCY PURCHASES ONEOFF_PURCHASES INSTALLMENTS_PURCHASES CASH_ADVANCE PURCHASES_FREQUENCY ONEOFF_PURCHASES_FREQUENCY PURCHASES_INSTALLMENTS_FREQUENCY CASH_ADVANCE_FREQUENCY CASH_ADVANCE_TRX PURCHASES_TRX CREDIT_LIMIT PAYMENTS MINIMUM_PAYMENTS PRC_FULL_PAYMENT TENURE count 8950.000000 8950.000000 8950.000000 8950.000000 8950.000000 8950.000000 8950.000000 8950.000000 8950.000000 8950.000000 8950.000000 8950.000000 8949.000000 8950.000000 8637.000000 8950.000000 8950.000000 mean 1564.474828 0.877271 1003.204834 592.437371 411.067645 978.871112 0.490351 0.202458 0.364437 0.135144 3.248827 14.709832 4494.449450 1733.143852 864.206542 0.153715 11.517318 std 2081.531879 0.236904 2136.634782 1659.887917 904.338115 2097.163877 0.401371 0.298336 0.397448 0.200121 6.824647 24.857649 3638.815725 2895.063757 2372.446607 0.292499 1.338331 min 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 50.000000 0.000000 0.019163 0.000000 6.000000 25% 128.281915 0.888889 39.635000 0.000000 0.000000 0.000000 0.083333 0.000000 0.000000 0.000000 0.000000 1.000000 1600.000000 383.276166 169.123707 0.000000 12.000000 50% 873.385231 1.000000 361.280000 38.000000 89.000000 0.000000 0.500000 0.083333 0.166667 0.000000 0.000000 7.000000 3000.000000 856.901546 312.343947 0.000000 12.000000 75% 2054.140036 1.000000 1110.130000 577.405000 468.637500 1113.821139 0.916667 0.300000 0.750000 0.222222 4.000000 17.000000 6500.000000 1901.134317 825.485459 0.142857 12.000000 max 19043.138560 1.000000 49039.570000 40761.250000 22500.000000 47137.211760 1.000000 1.000000 1.000000 1.500000 123.000000 358.000000 30000.000000 50721.483360 76406.207520 1.000000 12.000000 Based on the percentile values of each feature in the table above, it may be concluded that CASH_ADVANCE_TRX, PURCHASES_TRX and TENURE features may not be continuous. However, they should be checked to prove this hypothesis. Looking at the count of each feature, there are some missing values in the CREDIT_LIMIT and MINIMUM_PAYMENTS columns. Looking at the percentiles, the distributions of some features are heavily skewed and require more detailed analysis. In conclusion, visualizing and analyzing the charts can help in understanding the data more in-depth. 3. 
Preprocessing Analysis and proportion of null values in the dataset: Code df.isna().mean()*100 CUST_ID 0.000000 BALANCE 0.000000 BALANCE_FREQUENCY 0.000000 PURCHASES 0.000000 ONEOFF_PURCHASES 0.000000 INSTALLMENTS_PURCHASES 0.000000 CASH_ADVANCE 0.000000 PURCHASES_FREQUENCY 0.000000 ONEOFF_PURCHASES_FREQUENCY 0.000000 PURCHASES_INSTALLMENTS_FREQUENCY 0.000000 CASH_ADVANCE_FREQUENCY 0.000000 CASH_ADVANCE_TRX 0.000000 PURCHASES_TRX 0.000000 CREDIT_LIMIT 0.011173 PAYMENTS 0.000000 MINIMUM_PAYMENTS 3.497207 PRC_FULL_PAYMENT 0.000000 TENURE 0.000000 City 0.000000 dtype: float64 There are empty values in the Minimum_payments and Credit_Limits, and we are addressing these for the continuation of the analysis; Code df.loc[(df['MINIMUM_PAYMENTS'].isnull()==True),'MINIMUM_PAYMENTS']=df['MINIMUM_PAYMENTS'].mean() df.loc[(df['CREDIT_LIMIT'].isnull()==True),'CREDIT_LIMIT']=df['CREDIT_LIMIT'].mean() Code df.isna().mean()*100 CUST_ID 0.0 BALANCE 0.0 BALANCE_FREQUENCY 0.0 PURCHASES 0.0 ONEOFF_PURCHASES 0.0 INSTALLMENTS_PURCHASES 0.0 CASH_ADVANCE 0.0 PURCHASES_FREQUENCY 0.0 ONEOFF_PURCHASES_FREQUENCY 0.0 PURCHASES_INSTALLMENTS_FREQUENCY 0.0 CASH_ADVANCE_FREQUENCY 0.0 CASH_ADVANCE_TRX 0.0 PURCHASES_TRX 0.0 CREDIT_LIMIT 0.0 PAYMENTS 0.0 MINIMUM_PAYMENTS 0.0 PRC_FULL_PAYMENT 0.0 TENURE 0.0 City 0.0 dtype: float64 We are dropping two headers that are not necessary for Customer Segmentation; these headers will be added back after the analysis. Code df = df.drop(columns=['City', 'CUST_ID']) 4. Exploratory Data Analysis (EDA) 🔍 To visualize the data, I used the Sweetviz library. Regarding Sweetviz; it is a data visualization library for Python. It is designed for the purpose of Exploratory Data Analysis (EDA). It quickly visualizes one or two data frames, producing detailed and visually appealing reports. Code import sweetviz as sv # Veriyi analiz edin report = sv.analyze(df) # Raporu doğrudan Jupyter Notebook'ta göstermek için aşağıdaki kodu kullanabilirsiniz (isteğe bağlı) report.show_notebook() Except for PURCHASES_FREQUENCY, none of the headers have a balanced distribution. Moreover, even PURCHASES_FREQUENCY does not have a normal distribution! As mentioned in the previous section, there is a significant skewness in some headers. As a result, a more detailed examination is needed to identify outliers. Also, conventional clustering algorithms may not be efficient enough. Data scaling is often used to ensure machine learning algorithms operate more effectively. Specifically, many algorithms can produce misleading results due to different scales among features. Therefore, it's important to bring all features to a similar scale. The scaling method we used is StandardScaler. StandardScaler scales each feature to have a mean of 0 and a standard deviation of 1. This is also known as z-score normalization. Code scaler = StandardScaler() df_scaled = scaler.fit_transform(df) With this code, our data has been successfully scaled and is ready for model training or other analyses. Hopkins Statistic Before starting the clustering analysis, it is important to check whether the dataset is suitable for clustering. For this purpose, the Hopkins Statistic is used to check whether the dataset is randomly distributed. The Hopkins Statistic is a test that measures how suitable a dataset is for clustering. A value close to 0.5 indicates that the data has a random distribution, and therefore is not suitable for clustering. On the other hand, if the value is greater than 0.7, the dataset is likely suitable for clustering. 
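For reference, the classic formulation of the Hopkins statistic is

H = Σᵢ uᵢ / (Σᵢ uᵢ + Σᵢ wᵢ)

where the uᵢ are nearest-neighbour distances from uniformly generated synthetic points to the real data, and the wᵢ are nearest-neighbour distances from randomly sampled real points to the rest of the data. When the data is clustered, real points have much closer neighbours than the synthetic ones, so H approaches 1; for a random distribution both sums are comparable and H stays near 0.5.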
Code hopkins_score = hopkins(df_scaled, sampling_size=len(df_scaled)) print(hopkins_score) 0.034992587280613066 At first glance, 0.0351 looks very low and could be read as the dataset having no clear cluster structure. Keep in mind, however, that pyclustertend reports the statistic with the ratio inverted relative to the classic formulation above, so under its convention values near 0 actually point towards a clustering tendency and values near 0.5 towards a random distribution. In addition, we had already seen heavy outliers and skewness in the data, so relying on a single score computed over the entire data set can be misleading. As a cross-check, I applied the Hopkins test again, this time with the classic formulation, on a random 10% sample of the data; Code import numpy as np from random import sample from sklearn.neighbors import NearestNeighbors from numpy.random import uniform from math import isnan def hopkins(X): d = X.shape[1] n = len(X) m = int(0.1 * n) nbrs = NearestNeighbors(n_neighbors=1).fit(X) rand_X = sample(range(0, n, 1), m) ujd = [] wjd = [] for j in range(0, m): u_dist, _ = nbrs.kneighbors(uniform(np.amin(X,axis=0),np.amax(X,axis=0),d).reshape(1, -1), 2, return_distance=True) ujd.append(u_dist[0][1]) w_dist, _ = nbrs.kneighbors(X[rand_X[j]].reshape(1, -1), 2, return_distance=True) wjd.append(w_dist[0][1]) H = sum(ujd) / (sum(ujd) + sum(wjd)) if isnan(H): print(ujd, wjd) H = 0 return H # Run the Hopkins test hopkins_value = hopkins(df_scaled) hopkins_result = 'Result: {:.4f}'.format(hopkins_value) print('.: Hopkins Test :.') print(hopkins_result) if 0.7 < hopkins_value < 0.99: print('>> Based on the result above, there is a strong clustering tendency (the data contains meaningful clusters)') print('.:. Conclusion: H0 accepted .:.') else: print('>> Based on the result above, there are no meaningful clusters') print('.:. Conclusion: H0 rejected .:.') .: Hopkins Test :. Result: 0.9672 >> Based on the result above, there is a strong clustering tendency (the data contains meaningful clusters) .:. Conclusion: H0 accepted .:. The test on the full dataset with pyclustertend returned a value close to 0, while the test on the random 10% subset with the classic formulation returned a value close to 1; the two are essentially complementary (0.035 ≈ 1 − 0.967). Read with each implementation's convention in mind, both results point in the same direction: the data shows a strong clustering tendency. For large datasets it is still worth running the test both on the full data and on random subsamples, since the two views together give a more complete picture of the data's structure and clustering tendency. It is time to examine the proportion of outliers in the dataset. To detect outliers, we used the IQR (interquartile range) method
; Code outlier_percentage = {} for feature in df: tempData = df.sort_values(by=feature)[feature] Q1, Q3 = tempData.quantile([0.25, 0.75]) IQR = Q3 - Q1 Lower_range = Q1 - (1.5 * IQR) Upper_range = Q3 + (1.5 * IQR) outlier_count = ((tempData < Lower_range) | (tempData > Upper_range)).sum() outlier_perc = round((outlier_count / tempData.shape[0]) * 100, 2) outlier_percentage[feature] = outlier_perc outlier_percentage {'BALANCE': 7.77, 'BALANCE_FREQUENCY': 16.68, 'PURCHASES': 9.03, 'ONEOFF_PURCHASES': 11.32, 'INSTALLMENTS_PURCHASES': 9.69, 'CASH_ADVANCE': 11.51, 'PURCHASES_FREQUENCY': 0.0, 'ONEOFF_PURCHASES_FREQUENCY': 8.74, 'PURCHASES_INSTALLMENTS_FREQUENCY': 0.0, 'CASH_ADVANCE_FREQUENCY': 5.87, 'CASH_ADVANCE_TRX': 8.98, 'PURCHASES_TRX': 8.56, 'CREDIT_LIMIT': 2.77, 'PAYMENTS': 9.03, 'MINIMUM_PAYMENTS': 8.65, 'PRC_FULL_PAYMENT': 16.47, 'TENURE': 15.26} According to these results, in some features; for instance, BALANCE_FREQUENCY, PRC_FULL_PAYMENT, and TENURE the percentage of outliers is quite high. It's necessary to evaluate the effects of these features on the dataset and how to handle these outliers. On the other hand, in some features; there are no outliers in PURCHASES_FREQUENCY and PURCHASES_INSTALLMENTS_FREQUENCY. Outliers could have adverse effects on our analyses and modelings, hence, it's important to develop strategies regarding these values. These strategies could be in the form of deletion, transformation, or handling them via another method. It's not always obligatory to deal with outliers as in some cases it could even be misleading. However, in the subsequent analyses, I created a dataset where actions have been taken on outliers, thus had the opportunity to examine different results. Code def preprocess_data(data): feature_boundaries = { 'BALANCE': [0, 500, 1000, 3000, 5000, 10000], 'PURCHASES': [0, 500, 1000, 3000, 5000, 10000], 'ONEOFF_PURCHASES': [0, 500, 1000, 3000, 5000, 10000], 'INSTALLMENTS_PURCHASES': [0, 500, 1000, 3000, 5000, 10000], 'CASH_ADVANCE': [0, 500, 1000, 3000, 5000, 10000], 'CREDIT_LIMIT': [0, 500, 1000, 3000, 5000, 10000], 'PAYMENTS': [0, 500, 1000, 3000, 5000, 10000], 'MINIMUM_PAYMENTS': [0, 500, 1000, 3000, 5000, 10000] } frequency_boundaries = { 'BALANCE_FREQUENCY': [0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9], 'PURCHASES_FREQUENCY': [0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9], 'ONEOFF_PURCHASES_FREQUENCY': [0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9], 'PURCHASES_INSTALLMENTS_FREQUENCY': [0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9], 'CASH_ADVANCE_FREQUENCY': [0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9], 'PRC_FULL_PAYMENT': [0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9] } trx_boundaries = { 'PURCHASES_TRX': [0, 5, 10, 15, 20, 30, 50, 100], 'CASH_ADVANCE_TRX': [0, 5, 10, 15, 20, 30, 50, 100] } def assign_range(column, boundaries): new_column = column + '_RANGE' data[new_column] = 0 for idx, boundary in enumerate(boundaries): if idx == len(boundaries) - 1: data.loc[data[column] > boundary, new_column] = idx + 1 else: data.loc[(data[column] > boundary) & (data[column] <= boundaries[idx + 1]), new_column] = idx + 1 for column, boundaries in feature_boundaries.items(): assign_range(column, boundaries) for column, boundaries in frequency_boundaries.items(): assign_range(column, boundaries) for column, boundaries in trx_boundaries.items(): assign_range(column, boundaries) columns_to_drop = [ 'BALANCE', 'BALANCE_FREQUENCY', 'PURCHASES', 'ONEOFF_PURCHASES', 'INSTALLMENTS_PURCHASES', 'CASH_ADVANCE', 'PURCHASES_FREQUENCY', 
'ONEOFF_PURCHASES_FREQUENCY', 'PURCHASES_INSTALLMENTS_FREQUENCY', 'CASH_ADVANCE_FREQUENCY', 'CASH_ADVANCE_TRX', 'PURCHASES_TRX', 'CREDIT_LIMIT', 'PAYMENTS', 'MINIMUM_PAYMENTS', 'PRC_FULL_PAYMENT' ] df_wo = df.drop(columns=columns_to_drop) return df_wo Code df_wo = preprocess_data(df) This data preprocessing function has been created to categorize certain columns into categorical ranges. Using predefined boundaries for these columns, new features have been derived based on these boundaries. While retaining the numerical values of the relevant features, new columns have been created to determine the categorical ranges corresponding to these values. Feature Boundaries: Boundaries representing value ranges have been determined for certain columns. For example, for the ‘BALANCE’ column, the boundaries have been defined as: 0, 500, 1000, 3000, 5000, 10000. These boundaries are used to determine in which range the values of a particular feature fall. Implementation of Boundaries: A helper function named assign_range has been used for each column. This function takes the values of a particular column and converts these values into a categorical column based on the defined boundaries. For example, an observation with a value of 750 in the ‘BALANCE’ column will have a value of 2 in the ‘BALANCE_RANGE’ column, as this value falls between 500 and 1000. Removal of Unnecessary Columns: The original numerical columns have become redundant once the columns are converted into categorical ranges. Therefore, these columns have been removed from the dataset. In conclusion, this function is used to convert certain columns into categorical ranges, and it removes the original numerical columns from the dataset following this conversion. 5. Clustering Models In the review, we will try different clustering methods with different methodologies to find the most ideal one. For clustering models, I will use the K-Means, DBSCAN, and OPTICS methods. For K-means, I will use the unprocessed dataset, the dataset processed for outliers, and the datasets applied with PCA and UMAP. As DBSCAN and OPTICS are not significantly affected by outliers by nature, results will be obtained using the normal, PCA, and UMAP datasets. Code results = { "KMeans": { "normal": {}, "umap": {}, "pca": {}, "outlier_free": {} }, "DBSCAN": { "normal": {}, "umap": {}, "pca": {} }, "OPTICS": { "normal": {}, "umap": {}, "pca": {} } } This code constructs a structure to store the results and performance metrics of different clustering algorithms. Specifically, we are evaluating the results of KMeans, DBSCAN and OPTICS clustering algorithms. Additionally, we will also test these algorithms on normal, UMAP reduced, PCA reduced, and outlier-free datasets. The schema of the roadmap I made for the analysis can be seen in this code. To briefly mention. KMeans: It is an algorithm to cluster the data into a certain "k" number of clusters. DBSCAN: It is a density-based clustering algorithm and does not require specifying the number of clusters in advance. OPTICS: It is a generalized version of DBSCAN, a density-based clustering algorithm. UMAP: It is a dimension reduction method used to reduce the data to a lower-dimensional space. PCA: It is the abbreviation for Principal Component Analysis and is used to represent the data with fewer features. Silhouette Coefficient: A metric measuring the internal cohesion of a cluster and the separation between clusters. Calinski-Harabasz Index: A score measuring the quality of cluster dispersion. 
Davies-Bouldin Index: It measures the ratio of intra-cluster similarities to inter-cluster similarities. Lower values indicate better clustering. 5.1 K-Means K-Means is a clustering algorithm used to divide the data into ‘k’ number of clusters. The algorithm operates by randomly initializing each cluster center, and then iteratively updating these centers to be the average of the data points in the clusters. This process continues until a defined criterion (e.g., a situation where the centers do not move) is met. The elbow method and Silhouette score were used as supportive to the K-Means process; Elbow Method One of the toughest decisions for the K-means algorithm is determining how many clusters the data should be divided into. The elbow method is a common technique used for this decision. This method involves calculating the total within-cluster sum of squares (WCSS) for different k values. As the value of k increases, WCSS decreases. However, after a certain k value, this reduction amount becomes insignificant. This point called the "elbow" helps us to determine the optimal k value. Silhouette Score The Silhouette score measures how well the clusters are defined. This score takes values between -1 and 1. A value close to 1 indicates that the points are similar within their clusters and different from other clusters, while a value close to -1 indicates that these points are clustered incorrectly. A value of 0 indicates uncertainty about how far apart the clusters are from each other. 5.1.1 K-Means Normal Code k_values = range(1, 11) wcss = [] silhouette_scores = [] for k in k_values: kmeans = KMeans(n_clusters=k, init='k-means++', max_iter=300, n_init=10, random_state=0) kmeans.fit(df_scaled) wcss.append(kmeans.inertia_) if k > 1: silhouette_scores.append(silhouette_score(df_scaled, kmeans.labels_)) else: silhouette_scores.append(0) fig, ax1 = plt.subplots(figsize=(12, 7)) ax1.set_xlabel('Küme Sayısı (k)') ax1.set_ylabel('WCSS', color='tab:blue') ax1.plot(k_values, wcss, 'o-', color='tab:blue') ax1.tick_params(axis='y', labelcolor='tab:blue') ax2 = ax1.twinx() ax2.set_ylabel('Silhouette Skoru', color='tab:orange') ax2.plot(k_values, silhouette_scores, 'o-', color='tab:orange') ax2.tick_params(axis='y', labelcolor='tab:orange') fig.tight_layout() plt.title('Dirsek Yöntemi ve Silhouette Skoru') plt.show() Evaluation of the Graph Consider the following when evaluating the graph: Elbow Point: When examining the WCSS graph (blue line), look for the k value where WCSS does not decrease significantly. This point can serve as a clue to determine the optimal k value. Silhouette Score: When examining the silhouette score graph (orange line), look for the k value with the highest score. A high silhouette score indicates well-defined clusters. By using these two measurements together, you can determine the most suitable k value for the dataset. 
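If you prefer to pick k programmatically rather than reading it off the chart, a minimal sketch such as the one below can be used. It reuses the wcss and silhouette_scores lists computed above; the elbow estimate via second differences is a rough heuristic of my own, not part of the original analysis.

Code
import numpy as np

# k = 1 is excluded because the silhouette score is undefined for a single cluster
candidate_ks = list(k_values)[1:]

# k with the highest silhouette score
best_k_silhouette = candidate_ks[int(np.argmax(silhouette_scores[1:]))]

# Rough "elbow" estimate: the k where the WCSS curve bends the most
# (largest second difference of consecutive WCSS values)
second_diff = np.diff(wcss, n=2)
best_k_elbow = list(k_values)[int(np.argmax(second_diff)) + 1]

print(f"Best k by silhouette score: {best_k_silhouette}")
print(f"Approximate elbow point: {best_k_elbow}")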
Code kmeans = KMeans(n_clusters=4, init='k-means++', max_iter=300, n_init=10, random_state=0) kmeans.fit(df_scaled) labels = kmeans.labels_ fig = plt.figure(figsize=(10, 8)) ax = fig.add_subplot(111, projection='3d') colors = ['#1f77b4', '#ff7f0e', '#2ca02c', '#d62728'] for i, color in enumerate(colors): ax.scatter(df_scaled[labels == i, 0], df_scaled[labels == i, 1], df_scaled[labels == i, 2], c=color, label=f'Cluster {i+1}', s=50) ax.set_title("3D Scatter Plot of Clusters") ax.set_xlabel("Feature 1") ax.set_ylabel("Feature 2") ax.set_zlabel("Feature 3") ax.legend() plt.show() silhouette_vals = silhouette_samples(df_scaled, labels) fig, ax = plt.subplots(1, 1, figsize=(8, 6)) y_lower = 10 for i in range(4): ith_cluster_silhouette_values = silhouette_vals[labels == i] ith_cluster_silhouette_values.sort() size_cluster_i = ith_cluster_silhouette_values.shape[0] y_upper = y_lower + size_cluster_i color = cm.nipy_spectral(float(i) / 4) ax.fill_betweenx(np.arange(y_lower, y_upper), 0, ith_cluster_silhouette_values, facecolor=color, edgecolor=color, alpha=0.7) ax.text(-0.05, y_lower + 0.5 * size_cluster_i, str(i+1)) y_lower = y_upper + 10 ax.set_title("Silhouette Plot for the Clusters") ax.set_xlabel("Silhouette Coefficient Values") ax.set_ylabel("Cluster Label") ax.set_yticks([]) ax.axvline(x=silhouette_score(df_scaled, labels), color="red", linestyle="--") plt.show() cluster_counts = np.bincount(labels) total_count = len(labels) percentages = (cluster_counts / total_count) * 100 plt.figure(figsize=(8, 8)) plt.pie(percentages, labels=[f'Cluster {i+1}' for i in range(4)], colors=colors, autopct='%1.1f%%', shadow=True, startangle=140) plt.title("Percentage Distribution of Clusters") plt.show() 1. 3D Cluster Distribution Graph This graph shows how clusters are distributed in a 3-dimensional space in the scaled dataset. Each color represents a different cluster. Observation: If the clusters are distinctly separated from each other, it indicates that the K-Means algorithm has clustered well. 2. Silhouette Graph The silhouette graph evaluates how close each data point is to other data points in its own cluster, and how far it is from the nearest other cluster. Bands: Each band represents a specific cluster. The width of the band represents the number of data points in that cluster, while its height represents the silhouette score. Red Line: Average silhouette score. We can say that clusters with values close to or above this line are well-defined. Observation: If the bands are near or above the red line and the band widths are similar, it indicates that the clusters are balanced and well-defined. 3. Cluster Distribution Percentages Graph This pie chart shows the percentage distribution of data points falling into each cluster. Observation: If the slices are of similar size, it indicates a balanced distribution of clusters. However, if you see one or more slices significantly larger or smaller than the others, it indicates that certain clusters have more or fewer data points compared to others. 
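Before fixing the final number of clusters for the rest of the analysis, it can also help to look at all three validation metrics side by side for a few candidate k values. The small sketch below does this on the scaled data; the candidate range is my own choice and the printed values will vary between runs.

Code
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, calinski_harabasz_score, davies_bouldin_score

# Compare internal validation metrics for a few candidate cluster counts
for k in [2, 3, 4, 5]:
    km = KMeans(n_clusters=k, init='k-means++', n_init=10, random_state=0)
    candidate_labels = km.fit_predict(df_scaled)
    print(
        f"k={k} | "
        f"Silhouette: {silhouette_score(df_scaled, candidate_labels):.2f} | "
        f"Calinski-Harabasz: {calinski_harabasz_score(df_scaled, candidate_labels):.0f} | "
        f"Davies-Bouldin: {davies_bouldin_score(df_scaled, candidate_labels):.2f}"
    )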
Code kmeans = KMeans(n_clusters=3) labels = kmeans.fit_predict(df_scaled) results["KMeans"]["normal"]["Silhouette Coefficient"] = silhouette_score(df_scaled, labels) results["KMeans"]["normal"]["Calinski-Harabasz Index"] = calinski_harabasz_score(df_scaled, labels) results["KMeans"]["normal"]["Davies-Bouldin Index"] = davies_bouldin_score(df_scaled, labels) for metric, value in results["KMeans"]["normal"].items(): print(f"{metric}: {value:.2f}") Silhouette Coefficient: 0.25, Calinski-Harabasz Index: 1604.40, Davies-Bouldin Index: 1.60 Evaluation of Results You can find the meaning and how to evaluate the results obtained when the code is run below: Silhouette Coefficient: This value ranges between -1 and 1. A value closer to 1 indicates well-defined clusters, while a value closer to -1 indicates wrongly defined clusters. 0 indicates that clusters are close to each other or overlapping. Calinski-Harabasz Index: A higher value indicates better defined clusters. This criterion measures the ratio between the density and distribution of clusters. Davies-Bouldin Index: Lower values indicate better defined clusters. This index measures the similarity ratio for each cluster and takes the average of this ratio. By using these metrics, you can compare the quality of clustering results obtained with different algorithms or with different parameters of the same algorithm, and decide which approach is most suitable for your data. I will share the general evaluation of these results at the end of the analysis. 5.1.2 K-Means Outlier Adjusted K-Means Code df_wo = preprocess_data(df) Code k_values = range(1, 11) wcss = [] silhouette_scores = [] for k in k_values: kmeans = KMeans(n_clusters=k, init='k-means++', max_iter=300, n_init=10, random_state=0) kmeans.fit(df_wo) wcss.append(kmeans.inertia_) if k > 1: silhouette_scores.append(silhouette_score(df_wo, kmeans.labels_)) else: silhouette_scores.append(0) fig, ax1 = plt.subplots(figsize=(12, 7)) ax1.set_xlabel('Küme Sayısı (k)') ax1.set_ylabel('WCSS', color='tab:blue') ax1.plot(k_values, wcss, 'o-', color='tab:blue') ax1.tick_params(axis='y', labelcolor='tab:blue') ax2 = ax1.twinx() ax2.set_ylabel('Silhouette Skoru', color='tab:orange') ax2.plot(k_values, silhouette_scores, 'o-', color='tab:orange') ax2.tick_params(axis='y', labelcolor='tab:orange') fig.tight_layout() plt.title('Dirsek Yöntemi ve Silhouette Skoru') plt.show() Code kmeans = KMeans(n_clusters=3, init='k-means++', max_iter=300, n_init=10, random_state=0) kmeans.fit(df_wo) labels = kmeans.labels_ fig = plt.figure(figsize=(10, 8)) ax = fig.add_subplot(111, projection='3d') colors = ['#1f77b4', '#ff7f0e', '#2ca02c'] for i, color in enumerate(colors): ax.scatter(df_wo[labels == i].iloc[:, 0], df_wo[labels == i].iloc[:, 1], df_wo[labels == i].iloc[:, 2], c=color, label=f'Cluster {i+1}', s=50) ax.set_title("3D Scatter Plot of Clusters") ax.set_xlabel("Feature 1") ax.set_ylabel("Feature 2") ax.set_zlabel("Feature 3") ax.legend() plt.show() silhouette_vals = silhouette_samples(df_wo, labels) fig, ax = plt.subplots(1, 1, figsize=(8, 6)) y_lower = 10 for i in range(3): ith_cluster_silhouette_values = silhouette_vals[labels == i] ith_cluster_silhouette_values.sort() size_cluster_i = ith_cluster_silhouette_values.shape[0] y_upper = y_lower + size_cluster_i color = cm.nipy_spectral(float(i) / 3) ax.fill_betweenx(np.arange(y_lower, y_upper), 0, ith_cluster_silhouette_values, facecolor=color, edgecolor=color, alpha=0.7) ax.text(-0.05, y_lower + 0.5 * size_cluster_i, str(i+1)) y_lower = y_upper + 10 
ax.set_title("Silhouette Plot for the Clusters") ax.set_xlabel("Silhouette Coefficient Values") ax.set_ylabel("Cluster Label") ax.set_yticks([]) ax.axvline(x=silhouette_score(df_wo, labels), color="red", linestyle="--") plt.show() cluster_counts = np.bincount(labels) total_count = len(labels) percentages = (cluster_counts / total_count) * 100 plt.figure(figsize=(8, 8)) plt.pie(percentages, labels=[f'Cluster {i+1}' for i in range(3)], colors=colors, autopct='%1.1f%%', shadow=True, startangle=140) plt.title("Percentage Distribution of Clusters") plt.show() Code kmeans = KMeans(n_clusters=3, init='k-means++', max_iter=300, n_init=10, random_state=0) labels = kmeans.fit_predict(df_wo) results["KMeans"]["outlier_free"]["Silhouette Coefficient"] = silhouette_score(df_wo, labels) results["KMeans"]["outlier_free"]["Calinski-Harabasz Index"] = calinski_harabasz_score(df_wo, labels) results["KMeans"]["outlier_free"]["Davies-Bouldin Index"] = davies_bouldin_score(df_wo, labels) for metric, value in results["KMeans"]["outlier_free"].items(): print(f"{metric}: {value:.2f}") Silhouette Coefficient: 0.32 Calinski-Harabasz Index: 4345.55 Davies-Bouldin Index: 1.42 5.1.3 K-Means UMAP UMAP (Uniform Manifold Approximation and Projection) UMAP (Uniform Manifold Approximation and Projection) is a modern dimensionality reduction technique used for visualizing high-dimensional data sets in a lower-dimensional space. It is known for being fast and scalable, especially for large datasets. Preservation of Local and Global Structure: UMAP aims to preserve both the local and global structures of the data. This is beneficial, especially for complex data structures. General Applications: UMAP is not only used for visualization but also for general dimensionality reduction applications. One of the biggest advantages of UMAP is that it operates faster and has more general-purpose features compared to other dimensionality reduction methods like t-SNE. 
Code reducer = umap.UMAP() embedding = reducer.fit_transform(df_scaled) df_umap = pd.DataFrame(embedding, columns=['UMAP 1', 'UMAP 2']) plt.figure(figsize=(12, 8)) plt.scatter(df_umap['UMAP 1'], df_umap['UMAP 2'], cmap='Spectral', s=5) plt.gca().set_aspect('equal', 'datalim') plt.colorbar(boundaries=np.arange(11)-0.5).set_ticks(np.arange(10)) plt.title('UMAP Projection', fontsize=24); plt.show() Code k_values = range(1, 11) wcss = [] silhouette_scores = [] for k in k_values: kmeans = KMeans(n_clusters=k, init='k-means++', max_iter=300, n_init=10, random_state=0) kmeans.fit(df_umap) wcss.append(kmeans.inertia_) if k > 1: silhouette_scores.append(silhouette_score(df_umap, kmeans.labels_)) else: silhouette_scores.append(0) fig, ax1 = plt.subplots(figsize=(12, 7)) ax1.set_xlabel('Küme Sayısı (k)') ax1.set_ylabel('WCSS', color='tab:blue') ax1.plot(k_values, wcss, 'o-', color='tab:blue') ax1.tick_params(axis='y', labelcolor='tab:blue') ax2 = ax1.twinx() ax2.set_ylabel('Silhouette Skoru', color='tab:orange') ax2.plot(k_values, silhouette_scores, 'o-', color='tab:orange') ax2.tick_params(axis='y', labelcolor='tab:orange') fig.tight_layout() plt.title('Dirsek Yöntemi ve Silhouette Skoru') plt.show() Code kmeans = KMeans(n_clusters=2, init='k-means++', max_iter=300, n_init=10, random_state=0) kmeans.fit(df_umap) labels = kmeans.labels_ umap_array = df_umap.values fig = plt.figure(figsize=(10, 8)) colors = ['#1f77b4', '#ff7f0e'] for i, color in enumerate(colors): plt.scatter(umap_array[labels == i, 0], umap_array[labels == i, 1], c=color, label=f'Cluster {i+1}', s=50) plt.title("2D Scatter Plot of Clusters") plt.xlabel("Feature 1") plt.ylabel("Feature 2") plt.legend() plt.show() silhouette_vals = silhouette_samples(df_umap, labels) fig, ax = plt.subplots(1, 1, figsize=(8, 6)) y_lower = 10 for i in range(2): ith_cluster_silhouette_values = silhouette_vals[labels == i] ith_cluster_silhouette_values.sort() size_cluster_i = ith_cluster_silhouette_values.shape[0] y_upper = y_lower + size_cluster_i color = cm.nipy_spectral(float(i) / 2) ax.fill_betweenx(np.arange(y_lower, y_upper), 0, ith_cluster_silhouette_values, facecolor=color, edgecolor=color, alpha=0.7) ax.text(-0.05, y_lower + 0.5 * size_cluster_i, str(i+1)) y_lower = y_upper + 10 ax.set_title("Silhouette Plot for the Clusters") ax.set_xlabel("Silhouette Coefficient Values") ax.set_ylabel("Cluster Label") ax.set_yticks([]) ax.axvline(x=silhouette_score(df_scaled, labels), color="red", linestyle="--") plt.show() Code kmeans_umap = KMeans(n_clusters=2, init='k-means++', max_iter=300, n_init=10, random_state=0) labels_umap = kmeans_umap.fit_predict(df_umap) results["KMeans"]["umap"]["Silhouette Coefficient"] = silhouette_score(df_umap, labels_umap) results["KMeans"]["umap"]["Calinski-Harabasz Index"] = calinski_harabasz_score(df_umap, labels_umap) results["KMeans"]["umap"]["Davies-Bouldin Index"] = davies_bouldin_score(df_umap, labels_umap) for metric, value in results["KMeans"]["umap"].items(): print(f"{metric}: {value:.2f}") Silhouette Coefficient: 0.42 Calinski-Harabasz Index: 8208.00 Davies-Bouldin Index: 0.95 5.1.4 K-Means PCA PCA (Principal Component Analysis) PCA (Principal Component Analysis) is a statistical method used for dimensionality reduction in multivariate data sets by maximizing the variance among variables. Principal components are selected orthogonally to capture the main structure of the data. Variance Maximization: PCA selects the principal components to capture the maximum variance in the data set. 
Low-Dimensional Representation: It represents the data set with fewer features, making it useful for visualization and modeling. Explained Variance: It provides information on how much variance each component explains in the data set, thus making it easier to select the most important components. PCA is a popular dimensionality reduction method frequently used in data science and machine learning applications. Code from sklearn.decomposition import PCA pca = PCA(n_components=2) df_pca = pca.fit_transform(df_scaled) print(df_pca.shape) explained_variance = pca.explained_variance_ratio_ print(f"Explained Variance (first two components): {100 * explained_variance.sum():.2f}%") (8950, 2) Explained Variance (first two components): 47.59% Code k_values = range(1, 11) wcss = [] silhouette_scores = [] for k in k_values: kmeans = KMeans(n_clusters=k, init='k-means++', max_iter=300, n_init=10, random_state=0) kmeans.fit(df_pca) wcss.append(kmeans.inertia_) if k > 1: silhouette_scores.append(silhouette_score(df_pca, kmeans.labels_)) else: silhouette_scores.append(0) fig, ax1 = plt.subplots(figsize=(12, 7)) ax1.set_xlabel('Küme Sayısı (k)') ax1.set_ylabel('WCSS', color='tab:blue') ax1.plot(k_values, wcss, 'o-', color='tab:blue') ax1.tick_params(axis='y', labelcolor='tab:blue') ax2 = ax1.twinx() ax2.set_ylabel('Silhouette Skoru', color='tab:orange') ax2.plot(k_values, silhouette_scores, 'o-', color='tab:orange') ax2.tick_params(axis='y', labelcolor='tab:orange') fig.tight_layout() plt.title('Dirsek Yöntemi ve Silhouette Skoru') plt.show() Code kmeans = KMeans(n_clusters=3, init='k-means++', max_iter=300, n_init=10, random_state=0) kmeans.fit(df_pca) kmeans_labels_pca = kmeans.labels_ fig = plt.figure(figsize=(10, 8)) colors = ['#1f77b4', '#ff7f0e', '#2ca02c'] for i, color in enumerate(colors): plt.scatter(df_pca[kmeans_labels_pca == i, 0], df_pca[kmeans_labels_pca == i, 1], c=color, label=f'Cluster {i+1}', s=50) plt.title("2D Scatter Plot of Clusters") plt.xlabel("Feature 1") plt.ylabel("Feature 2") plt.legend() plt.show() silhouette_vals = silhouette_samples(df_pca, kmeans_labels_pca) fig, ax = plt.subplots(1, 1, figsize=(8, 6)) y_lower = 10 for i in range(3): ith_cluster_silhouette_values = silhouette_vals[kmeans_labels_pca == i] ith_cluster_silhouette_values.sort() size_cluster_i = ith_cluster_silhouette_values.shape[0] y_upper = y_lower + size_cluster_i color = cm.nipy_spectral(float(i) / 3) ax.fill_betweenx(np.arange(y_lower, y_upper), 0, ith_cluster_silhouette_values, facecolor=color, edgecolor=color, alpha=0.7) ax.text(-0.05, y_lower + 0.5 * size_cluster_i, str(i+1)) y_lower = y_upper + 10 ax.set_title("Silhouette Plot for the Clusters") ax.set_xlabel("Silhouette Coefficient Values") ax.set_ylabel("Cluster Label") ax.set_yticks([]) ax.axvline(x=silhouette_score(df_scaled, kmeans_labels_pca), color="red", linestyle="--") plt.show() Code kmeans_pca = KMeans(n_clusters=3, init='k-means++', max_iter=300, n_init=10, random_state=0) kmeans_labels_pca = kmeans_pca.fit_predict(df_pca) results["KMeans"]["pca"]["Silhouette Coefficient"] = silhouette_score(df_pca, kmeans_labels_pca) results["KMeans"]["pca"]["Calinski-Harabasz Index"] = calinski_harabasz_score(df_pca, kmeans_labels_pca) results["KMeans"]["pca"]["Davies-Bouldin Index"] = davies_bouldin_score(df_pca, kmeans_labels_pca) for metric, value in results["KMeans"]["pca"].items(): print(f"{metric}: {value:.2f}") Silhouette Coefficient: 0.45 Calinski-Harabasz Index: 5337.49 Davies-Bouldin Index: 0.81 5.2 DBSCAN DBSCAN (Density-Based Spatial Clustering 
of Applications with Noise) DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a density-based clustering algorithm. It identifies dense regions of data points and assigns points in these regions to the same cluster. Points in non-dense regions are classified as noise. Working Principle: - A neighborhood of a certain radius (eps) is defined around each data point. If there are more than min_samples data points within this neighborhood, the point is considered part of a dense region. - Dense regions are merged to form broad clusters. - Points outside dense regions and without enough neighbors in their vicinity are labeled as noise. Advantages: - It makes no assumptions about the shapes of clusters. Therefore, clusters do not have to be circular. - It can automatically distinguish noise. - You do not have to specify the number of clusters beforehand. Challenges: - It may struggle to accurately separate clusters with different densities. - The eps and min_samples parameters need to be properly tuned. DBSCAN is a suitable option especially for noisy datasets and clusters with complex structures. 5.2.1 DBSCAN Normal Code X = df_scaled def epsilon(X): neighbors = NearestNeighbors(n_neighbors=2) nbrs = neighbors.fit(X) distances, indices = nbrs.kneighbors(X) distances = np.sort(distances, axis=0) distances_1 = distances[:, 1] plt.plot(distances_1, color='#5829A7') plt.xlabel('Total') plt.ylabel('Distance') for spine in plt.gca().spines.values(): spine.set_color('None') plt.grid(axis='y', alpha=0.5, color='#9B9A9C', linestyle='dotted') plt.grid(axis='x', alpha=0) plt.title('DBSCAN Epsilon Value for Scaled Data') plt.tight_layout() plt.show(); epsilon(X); Code dbscan = DBSCAN(eps=2.2, min_samples=5) labels = dbscan.fit_predict(df_scaled) plt.figure(figsize=(10, 8)) unique_labels = np.unique(labels) colors = [plt.cm.Spectral(each) for each in np.linspace(0, 1, len(unique_labels))] for k, col in zip(unique_labels, colors): if k == -1: col = [0.6, 0.6, 0.6, 1] class_member_mask = (labels == k) xy = df_scaled[class_member_mask] plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=tuple(col), markeredgecolor='k', markersize=6) plt.title('DBSCAN Clustering') plt.xlabel('Feature 1') plt.ylabel('Feature 2') plt.grid(True) plt.show() Code dbscan_normal = DBSCAN(eps=2.2, min_samples=5) labels_normal = dbscan_normal.fit_predict(df_scaled) results["DBSCAN"]["normal"]["Silhouette Coefficient"] = silhouette_score(df_scaled, labels_normal) if len(np.unique(labels_normal)) > 1 else 0 results["DBSCAN"]["normal"]["Calinski-Harabasz Index"] = calinski_harabasz_score(df_scaled, labels_normal) results["DBSCAN"]["normal"]["Davies-Bouldin Index"] = davies_bouldin_score(df_scaled, labels_normal) for metric, value in results["DBSCAN"]["normal"].items(): print(f"{metric}: {value:.2f}") Silhouette Coefficient: 0.52 Calinski-Harabasz Index: 939.76 Davies-Bouldin Index: 1.97 5.2.2 DBSCAN UMAP Code X = df_umap def epsilon(X): neighbors = NearestNeighbors(n_neighbors=2) nbrs = neighbors.fit(X) distances, indices = nbrs.kneighbors(X) distances = np.sort(distances, axis=0) distances_1 = distances[:, 1] plt.plot(distances_1, color='#5829A7') plt.xlabel('Total') plt.ylabel('Distance') for spine in plt.gca().spines.values(): spine.set_color('None') plt.grid(axis='y', alpha=0.5, color='#9B9A9C', linestyle='dotted') plt.grid(axis='x', alpha=0) plt.title('DBSCAN Epsilon Value for UMAP Data') plt.tight_layout() plt.show(); epsilon(X); Code dbscan = DBSCAN(eps=0.15, min_samples=2) labels = 
dbscan.fit_predict(df_umap.values) plt.figure(figsize=(10, 8)) unique_labels = np.unique(labels) colors = [plt.cm.Spectral(each) for each in np.linspace(0, 1, len(unique_labels))] for k, col in zip(unique_labels, colors): if k == -1: col = [0.6, 0.6, 0.6, 1] class_member_mask = (labels == k) xy = df_umap.values[class_member_mask] plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=tuple(col), markeredgecolor='k', markersize=6) plt.title('DBSCAN Clustering') plt.xlabel('UMAP 1') plt.ylabel('UMAP 2') plt.grid(True) plt.show() Code dbscan_umap = DBSCAN(eps=2.23, min_samples=2) labels_umap = dbscan_umap.fit_predict(df_umap) results["DBSCAN"]["umap"]["Silhouette Coefficient"] = silhouette_score(df_umap, labels_umap) if len(np.unique(labels_umap)) > 1 else 0 results["DBSCAN"]["umap"]["Calinski-Harabasz Index"] = calinski_harabasz_score(df_umap, labels_umap) results["DBSCAN"]["umap"]["Davies-Bouldin Index"] = davies_bouldin_score(df_umap, labels_umap) for metric, value in results["DBSCAN"]["umap"].items(): print(f"{metric}: {value:.2f}") Silhouette Coefficient: 0.25 Calinski-Harabasz Index: 232.31 Davies-Bouldin Index: 0.63 5.2.3 DBSCAN PCA Code X = df_pca def epsilon(X): neighbors = NearestNeighbors(n_neighbors=2) nbrs = neighbors.fit(X) distances, indices = nbrs.kneighbors(X) distances = np.sort(distances, axis=0) distances_1 = distances[:, 1] plt.plot(distances_1, color='#5829A7') plt.xlabel('Total') plt.ylabel('Distance') for spine in plt.gca().spines.values(): spine.set_color('None') plt.grid(axis='y', alpha=0.5, color='#9B9A9C', linestyle='dotted') plt.grid(axis='x', alpha=0) plt.title('DBSCAN Epsilon Value for PCA Data') plt.tight_layout() plt.show(); epsilon(X); Code dbscan = DBSCAN(eps=1, min_samples=5) labels = dbscan.fit_predict(df_pca) plt.figure(figsize=(10, 8)) unique_labels = np.unique(labels) colors = [plt.cm.Spectral(each) for each in np.linspace(0, 1, len(unique_labels))] for k, col in zip(unique_labels, colors): if k == -1: col = [0.6, 0.6, 0.6, 1] class_member_mask = (labels == k) xy = df_pca[class_member_mask] plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=tuple(col), markeredgecolor='k', markersize=6) plt.title('DBSCAN Clustering') plt.xlabel('PCA 1') plt.ylabel('PCA 2') plt.grid(True) plt.show() Code dbscan_pca = DBSCAN(eps=1, min_samples=5) labels_pca = dbscan_pca.fit_predict(df_pca) results["DBSCAN"]["pca"]["Silhouette Coefficient"] = silhouette_score(df_pca, labels_pca) if len(np.unique(labels_pca)) > 1 else 0 results["DBSCAN"]["pca"]["Calinski-Harabasz Index"] = calinski_harabasz_score(df_pca, labels_pca) results["DBSCAN"]["pca"]["Davies-Bouldin Index"] = davies_bouldin_score(df_pca, labels_pca) for metric, value in results["DBSCAN"]["pca"].items(): print(f"{metric}: {value:.2f}") Silhouette Coefficient: 0.80 Calinski-Harabasz Index: 1347.22 Davies-Bouldin Index: 0.78 5.3 OPTICS OPTICS (Ordering Points To Identify the Clustering Structure) OPTICS (Ordering Points To Identify the Clustering Structure) is a density-based clustering algorithm and can be considered as a generalization of DBSCAN. However, it is known for its ability to identify structures of different densities in the dataset, instead of using a single eps value. Working Principle: - The reachability distance is calculated for each data point. This determines how close a point is to another point at a certain density. - Based on these distances, a reachability graph is created. - Valleys in the graph indicate density-based clusters. Advantages: - It can detect clusters of varying densities. 
- You do not have to select a fixed value for the eps parameter. - It can automatically distinguish noise. Challenges: - The computational cost can be high, especially for large datasets. - Interpretation of the results might be slightly more challenging compared to DBSCAN. OPTICS is quite useful for datasets with regions of varying density. It is successful in detecting transitions between clusters of different densities. 5.3.1 OPTICS Normal Code optics = OPTICS(min_samples=5, max_eps=2.2) labels = optics.fit_predict(df_scaled) plt.figure(figsize=(10, 8)) unique_labels = np.unique(labels) colors = [plt.cm.Spectral(each) for each in np.linspace(0, 1, len(unique_labels))] for k, col in zip(unique_labels, colors): if k == -1: col = [0.6, 0.6, 0.6, 1] class_member_mask = (labels == k) xy = df_scaled[class_member_mask] plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=tuple(col), markeredgecolor='k', markersize=6) plt.title('OPTICS Clustering') plt.xlabel('Feature 1') plt.ylabel('Feature 2') plt.grid(True) plt.show() Code optics_normal = OPTICS(min_samples=9) labels_normal = optics_normal.fit_predict(df_scaled) results["OPTICS"]["normal"]["Silhouette Coefficient"] = silhouette_score(df_scaled, labels_normal) if len(np.unique(labels_normal)) > 1 else 0 results["OPTICS"]["normal"]["Calinski-Harabasz Index"] = calinski_harabasz_score(df_scaled, labels_normal) results["OPTICS"]["normal"]["Davies-Bouldin Index"] = davies_bouldin_score(df_scaled, labels_normal) for metric, value in results["OPTICS"]["normal"].items(): print(f"{metric}: {value:.2f}") Silhouette Coefficient: -0.45 Calinski-Harabasz Index: 11.93 Davies-Bouldin Index: 1.33 5.3.2 OPTICS UMAP Code optics = OPTICS(min_samples=2, max_eps=0.15) labels = optics.fit_predict(df_umap.values) plt.figure(figsize=(10, 8)) unique_labels = np.unique(labels) colors = [plt.cm.Spectral(each) for each in np.linspace(0, 1, len(unique_labels))] for k, col in zip(unique_labels, colors): if k == -1: col = [0.6, 0.6, 0.6, 1] class_member_mask = (labels == k) xy = df_umap.values[class_member_mask] plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=tuple(col), markeredgecolor='k', markersize=6) plt.title('OPTICS Clustering') plt.xlabel('UMAP 1') plt.ylabel('UMAP 2') plt.grid(True) plt.show() Code optics_umap = OPTICS(min_samples=2) labels_umap = optics_umap.fit_predict(df_umap) results["OPTICS"]["umap"]["Silhouette Coefficient"] = silhouette_score(df_umap, labels_umap) if len(np.unique(labels_umap)) > 1 else 0 results["OPTICS"]["umap"]["Calinski-Harabasz Index"] = calinski_harabasz_score(df_umap, labels_umap) results["OPTICS"]["umap"]["Davies-Bouldin Index"] = davies_bouldin_score(df_umap, labels_umap) for metric, value in results["OPTICS"]["umap"].items(): print(f"{metric}: {value:.2f}") Silhouette Coefficient: 0.24 Calinski-Harabasz Index: 10.24 Davies-Bouldin Index: 1.43 5.3.3 OPTICS PCA Code optics = OPTICS(min_samples=5, max_eps=1) labels = optics.fit_predict(df_pca) plt.figure(figsize=(10, 8)) unique_labels = np.unique(labels) colors = [plt.cm.Spectral(each) for each in np.linspace(0, 1, len(unique_labels))] for k, col in zip(unique_labels, colors): if k == -1: col = [0.6, 0.6, 0.6, 1] class_member_mask = (labels == k) xy = df_pca[class_member_mask] plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=tuple(col), markeredgecolor='k', markersize=6) plt.title('OPTICS Clustering') plt.xlabel('PCA 1') plt.ylabel('PCA 2') plt.grid(True) Code optics_pca = OPTICS(min_samples=5) labels_pca = optics_pca.fit_predict(df_pca) results["OPTICS"]["pca"]["Silhouette 
Coefficient"] = silhouette_score(df_pca, labels_pca) if len(np.unique(labels_pca)) > 1 else 0 results["OPTICS"]["pca"]["Calinski-Harabasz Index"] = calinski_harabasz_score(df_pca, labels_pca) results["OPTICS"]["pca"]["Davies-Bouldin Index"] = davies_bouldin_score(df_pca, labels_pca) for metric, value in results["OPTICS"]["pca"].items(): print(f"{metric}: {value:.2f}") Silhouette Coefficient: -0.18 Calinski-Harabasz Index: 10.06 Davies-Bouldin Index: 1.68 6. Results Code from IPython.display import display, Markdown def display_results(results): for method, data in results.items(): display(Markdown(f"### {method}")) for dataset, metrics in data.items(): display(Markdown(f"#### {dataset}")) for metric, value in metrics.items(): display(Markdown(f"- **{metric}**: {value:.2f}")) display_results(results) KMeans normal Silhouette Coefficient: 0.25 Calinski-Harabasz Index: 1604.40 Davies-Bouldin Index: 1.60 umap Silhouette Coefficient: 0.42 Calinski-Harabasz Index: 8208.00 Davies-Bouldin Index: 0.95 pca Silhouette Coefficient: 0.45 Calinski-Harabasz Index: 5337.49 Davies-Bouldin Index: 0.81 outlier_free Silhouette Coefficient: 0.32 Calinski-Harabasz Index: 4345.55 Davies-Bouldin Index: 1.42 DBSCAN normal Silhouette Coefficient: 0.52 Calinski-Harabasz Index: 939.76 Davies-Bouldin Index: 1.97 umap Silhouette Coefficient: 0.25 Calinski-Harabasz Index: 232.31 Davies-Bouldin Index: 0.63 pca Silhouette Coefficient: 0.80 Calinski-Harabasz Index: 1347.22 Davies-Bouldin Index: 0.78 OPTICS normal Silhouette Coefficient: -0.45 Calinski-Harabasz Index: 11.93 Davies-Bouldin Index: 1.33 umap Silhouette Coefficient: 0.24 Calinski-Harabasz Index: 10.24 Davies-Bouldin Index: 1.43 pca Silhouette Coefficient: -0.18 Calinski-Harabasz Index: 10.06 Davies-Bouldin Index: 1.68 Analysis of Clustering Results KMeans Normal Silhouette Coefficient (SC): 0.25 This metric ranges between -1 and 1. Higher values indicate that the object is well-fitted to its own cluster and has a weak match with neighboring clusters. The score of 0.25 suggests that the clusters overlap to some extent and there is room for improvement. Calinski-Harabasz Index (CHI): 1604.40 It is the ratio of within-cluster variance to between-cluster variance. Higher values indicate that clusters are dense and well-separated. Davies-Bouldin Index (DBI): 1.60 It is the average similarity ratio of each cluster with its most similar cluster. Values close to 0 are better. The score of 1.60 indicates a moderate clustering structure. UMAP SC: 0.41 An improvement showing better defined clusters compared to the normal data after UMAP is applied. CHI: 7902.77 A significant increase indicating that UMAP helped in improving the clustering structure. DBI: 0.97 A decrease indicating better separation of clusters after UMAP is applied. PCA SC: 0.45 Better cluster definition after PCA is applied, compared to UMAP and normal data. CHI: 5337.49 Lower than UMAP but higher than normal; indicates that PCA also aided in achieving a good clustering structure. DBI: 0.81 Indicates better separation of clusters compared to using normal data. Outlier-free (Without Outliers) Metrics in this category show that removing outliers made the clustering structure slightly better compared to normal data, but not as good as applying dimension reduction techniques like PCA or UMAP. DBSCAN DBSCAN generally produced better SC scores for normal and pca data clusters compared to KMeans. Especially with a 0.80 SC, the pca data set shows very well defined clusters. 
DBI scores also reflect better cluster separation, especially for the pca data set, compared to KMeans. However, the CHI score, while being better for the normal data set, drops significantly for the UMAP data set. This indicates less dense clusters or greater cluster overlap. OPTICS Normal: SC: -0.45 A negative silhouette score indicates that the clusters are not well-defined and overlap significantly. Other metrics also reflect a weak clustering structure. UMAP & PCA: Metrics for these data sets are not promising. Negative or low silhouette scores and very low CHI scores indicate poor cluster definition and density. Summary KMeans with UMAP or PCA KMeans combined with UMAP or PCA seems to yield the best results. Although PCA shows a slight advantage in silhouette coefficient, both methods indicate well-defined clusters. This could make it worthwhile to evaluate both approaches for a specific application. DBSCAN The combination of DBSCAN with PCA has a fairly high silhouette coefficient, indicating that this approach also defines clusters quite well. However, the results obtained with UMAP are not as successful for DBSCAN. OPTICS The performance of OPTICS is generally lower compared to the other two methods depending on the dataset used. Particularly, the negative silhouette coefficients indicate that the clusters are quite poorly defined. In conclusion, if your primary goal is to obtain well-defined clusters, it is recommended to combine KMeans with PCA or UMAP. On the other hand, it might be worth trying DBSCAN to deal with more complex structures, especially when combined with PCA. Despite the lower overall performance of OPTICS for this particular dataset, it might produce different results on different datasets, therefore, it's important to evaluate before using it. Code # Orijinal veri setinize etiketleri ekleyin. df['Cluster'] = kmeans_labels_pca # Her bir küme için sütun ortalamalarını alın. cluster_means = df.groupby('Cluster').mean() # Ortalama değerleri tablo olarak göster. cluster_means BALANCE BALANCE_FREQUENCY PURCHASES ONEOFF_PURCHASES INSTALLMENTS_PURCHASES CASH_ADVANCE PURCHASES_FREQUENCY ONEOFF_PURCHASES_FREQUENCY PURCHASES_INSTALLMENTS_FREQUENCY CASH_ADVANCE_FREQUENCY ... PAYMENTS_RANGE MINIMUM_PAYMENTS_RANGE BALANCE_FREQUENCY_RANGE PURCHASES_FREQUENCY_RANGE ONEOFF_PURCHASES_FREQUENCY_RANGE PURCHASES_INSTALLMENTS_FREQUENCY_RANGE CASH_ADVANCE_FREQUENCY_RANGE PRC_FULL_PAYMENT_RANGE PURCHASES_TRX_RANGE CASH_ADVANCE_TRX_RANGE Cluster 0 789.381289 0.835555 516.904913 264.930837 252.295136 316.614221 0.474579 0.140900 0.348978 0.067312 ... 1.869153 1.404809 8.548250 5.000981 1.566405 3.695944 0.802257 1.672718 2.052993 0.443409 1 3925.835034 0.956794 361.745825 241.267864 120.553232 3791.893111 0.214590 0.106232 0.128552 0.437004 ... 3.077906 2.484480 9.663421 2.309191 1.191722 1.390140 4.780280 0.364577 1.125380 2.623859 2 2284.681931 0.981539 4378.858533 2754.504803 1624.856664 498.773452 0.950954 0.650455 0.768528 0.067055 ... 3.704107 1.886840 9.885163 9.659681 6.824811 7.924560 0.768650 3.176865 6.145013 0.464376 3 rows × 33 columns Predicting/interpreting clusters Cluster 0 Balance (BALANCE): The customers of this group carry an average level balance. This means they have neither very high nor very low spending capacity. Purchases (PURCHASES): Customers in this group make purchases at an average level, however, they make these purchases both at once and in installments. Cash Advance (CASH_ADVANCE): There is an average level of cash advance usage. 
Purchase Frequency (PURCHASES_FREQUENCY): These customers shop regularly, but not very frequently. Credit Limit (CREDIT_LIMIT): The credit limit of this group tends to be lower than that of the other groups. They can be regarded as average users: they use their credit cards for both daily shopping and larger purchases, and the moderate level of cash advance usage shows that they sometimes meet their cash needs from their cards. Cluster 1 Balance (BALANCE): The customers of this group carry quite a high balance, which means they have a high spending capacity. Purchases (PURCHASES): They shop less, which means this group is less active. Cash Advance (CASH_ADVANCE): Cash advance usage is very high in this group, which shows these customers often have cash needs. Purchase Frequency (PURCHASES_FREQUENCY): These customers shop less frequently. Credit Limit (CREDIT_LIMIT): The credit limit of this group is higher than in the other groups. Given the low purchase activity and heavy cash advance usage, this group appears to use the card less for shopping and more as a source of cash. High balances and credit limits indicate that this group has a high financial capacity and spending potential. Cluster 2 Balance (BALANCE): The balance of this group is high, but not as high as in Cluster 1. Purchases (PURCHASES): Customers in this group make very high amounts of purchases, both one-off and in installments. Cash Advance (CASH_ADVANCE): Cash advance usage is moderate. Purchase Frequency (PURCHASES_FREQUENCY): This customer group shops frequently. Credit Limit (CREDIT_LIMIT): The credit limit of this group is average compared to the other two groups. This group most likely represents the most engaged shoppers: customers who use their cards frequently and for large purchase volumes rather than for cash needs. However, to increase the accuracy of these interpretations, additional demographic or behavioral information (such as age, occupation, or income level) would be needed. Such additional information could help us describe each cluster in more detail and more accurately. Unfortunately, it is not available in the dataset I worked on. Since I want to use pydash in the continuation of the project, I will add age, gender, and city columns (I had added the city earlier). These additions are completely random; they do not affect this analysis and are only needed so I can proceed with the dashboard (a small sketch of how they could be generated is given at the end of this post). You can check the dash application I made to turn this analysis process into an app and dashboard here.
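As noted above, the demographic fields used by the dashboard are generated at random rather than taken from real data. A minimal sketch of how this could be done is shown below; the column names (AGE, GENDER), the value ranges, and the output file name are my own assumptions for illustration.

Code
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical demographic columns, generated at random purely for the dashboard;
# they carry no information about the clusters found above
df['AGE'] = rng.integers(18, 75, size=len(df))
df['GENDER'] = rng.choice(['Female', 'Male'], size=len(df))

# Persist the enriched table for the Dash application (file name is an assumption)
df.to_csv('creditcard_with_clusters.csv', index=False)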