Calinski–Harabasz index
The Calinski–Harabasz index (CHI), also known as the Variance Ratio Criterion (VRC), is a metric for evaluating clustering algorithms, introduced by Tadeusz Caliński and Jerzy Harabasz in 1974.[1] It is an internal evaluation metric, where the assessment of the clustering quality is based solely on the dataset and the clustering results, and not on external, ground-truth labels.
Definition
[edit]Given a data set of n points: {x1, ..., xn}, and the assignment of these points to k clusters: {C1, ..., Ck}, the Calinski–Harabasz (CH) Index is defined as the ratio of the between-cluster separation (BCSS) to the within-cluster dispersion (WCSS), normalized by their number of degrees of freedom:
BCSS (Between-Cluster Sum of Squares) is the weighted sum of squared Euclidean distances between each cluster centroid (mean) and the overall data centroid (mean):
where ni is the number of points in cluster Ci, ci is the centroid of Ci, and c is the overall centroid of the data. BCSS measures how well the clusters are separated from each other (the higher the better).
WCSS (Within-Cluster Sum of Squares) is the sum of squared Euclidean distances between the data points and their respective cluster centroids:
WCSS measures the compactness or cohesiveness of the clusters (the smaller the better). Minimizing the WCSS is the objective of centroid-based clustering algorithms such as k-means.
Explanation
[edit]The numerator of the CH index is the between-cluster separation (BCSS) divided by its degrees of freedom. The number of degrees of freedom of BCSS is k - 1, since fixing the centroids of k - 1 clusters also determines the kth centroid, as its value makes the weighted sum of all centroids match the overall data centroid.
The denominator of the CH index is the within-cluster dispersion (WCSS) divided by its degrees of freedom. The number of degrees of freedom of WCSS is n - k, since fixing the centroid of each cluster reduces the degrees of freedom by one. This is because given a centroid ci of cluster Ci, the assignment of ni - 1 points to that cluster also determines the assignment of the nith point, since the overall mean of the points assigned to the cluster should be equal to ci.
Dividing both the BCSS and WCSS by their degrees of freedom helps to normalize the values, making them comparable across different numbers of clusters. Without this normalization, the CH index could be artificially inflated for higher values of k, making it hard to determine whether an increase in the index value is due to genuinely better clustering or just due to the increased number of clusters.
A higher value of CH indicates a better clustering, because it means that the data points are more spread out between clusters than they are within clusters.
Although there is no satisfactory probabilistic foundation to support the use of CH index, the criterion has some desirable mathematical properties as shown in.[1] For example, in the special case of equal distances between all pairs of points, the CH index is equal to 1. In addition, it is analogous to the F-test statistic in univariate analysis.
Liu et al. [2] discuss the effectiveness of using CH index for cluster evaluation relative to other internal clustering evaluation metrics. Maulik and Bandyopadhyay [3] evaluate the performance of three clustering algorithms using four cluster validity indices, including Davies–Bouldin index, Dunn index, Calinski–Harabasz index and a newly developed index. Wang et al. [4] have suggested an improved index for clustering validation based on Silhouette indexing and Calinski–Harabasz index.
Finding the optimal number of clusters
[edit]Similar to other clustering evaluation metrics such as Silhouette score, the CH index can be used to find the optimal number of clusters k in algorithms like k-means, where the value of k is not known a priori. This can be done by following these steps:
- Perform clustering for different values of k.
- Compute the CH index for each clustering result.
- The value of k that yields the maximum CH index is chosen as the optimal number of clusters.
Implementations
[edit]The scikit-learn Python library provides an implementation of this metric in the sklearn.metrics module.[5]
R provides a similar implementation in its fpc package.[6]
See also
[edit]References
[edit]- ^ a b Caliński, Tadeusz; Harabasz, Jerzy (1974). "A dendrite method for cluster analysis". Communications in Statistics. 3 (1): 1–27. doi:10.1080/03610927408827101.
- ^ Yanchi, Liu; Zhongmou, Li; Hui, Xiong; Xuedong, Gao; Junjie, Wu (2010). "Understanding of Internal Clustering Validation Measures". 2010 IEEE International Conference on Data Mining. pp. 911–916. doi:10.1109/ICDM.2010.35. ISBN 978-1-4244-9131-5. S2CID 8298336.
- ^ Maulik, Ujjwal, and Sanghamitra Bandyopadhyay. "Performance evaluation of some clustering algorithms and validity indices." IEEE Transactions on pattern analysis and machine intelligence 24, no. 12 (2002): 1650-1654.
- ^ Wang, Xu, and Yusheng Xu. "An improved index for clustering validation based on Silhouette index and Calinski–Harabasz index." In IOP Conference Series: Materials Science and Engineering, vol. 569, no. 5, p. 052024. IOP Publishing, 2019.
- ^ "sklearn.metrics.calinski_harabasz_score". scikit-learn. Retrieved 2023-10-29.
- ^ "R: Calinski–Harabasz index". search.r-project.org. Retrieved 2023-10-29.