A review on k-means clustering
School of Computer Science and Software Engineering, The University of Western Australia, Crawley 6009, Australia Show
* Author to whom correspondence should be addressed. Original submission received: 29 May 2020/Revised: 29 July 2020/Accepted: 7 August 2020/Published: 12 August 2020 Abstract: The k-means clustering algorithm is considered one of the most powerful and popular data mining algorithms in the research community. However, despite its popularity, the algorithm has certain limitations, including problems associated with random initialization of the centroids which leads to unexpected convergence. Additionally, such a clustering algorithm requires the number of clusters to be defined beforehand, which is responsible for different cluster shapes and outlier effects. A fundamental problem of the k-means algorithm is its inability to handle various data types. This paper provides a structured and synoptic overview of research conducted on the k-means algorithm to overcome such shortcomings. Variants of the k-means algorithms including their recent developments are discussed, where their effectiveness is investigated based on the experimental analysis of a variety of datasets. The detailed experimental analysis along with a thorough comparison among different k-means clustering algorithms differentiates our work compared to other existing survey papers. Furthermore, it outlines a clear and thorough understanding of the k-means algorithm along with its different research directions. 1. IntroductionThe advancements in computing, along with the rapid growth and availability of data repositories, have often emphasized the task of gaining meaningful insights from the data. This encourages taking appropriate measures based on knowledge discovery approaches. Machine Learning (ML) can be broadly classified into two: supervised and unsupervised learning. In the case of supervised learning, a function is learned that maps a given input to an output based on the available input–output pairs. These learning algorithms thus require the availability of labeled data with the desired output value []. The availability of labeled data represents an ideal scenario; however, such datasets are often expensive and challenging to obtain. For instance, in the intrusion detection domain, zero-day attacks are rare instances and obtaining labels for them is expensive. Hence, when the labels of the datasets are unavailable, unsupervised learning approaches are typically used []. Under such a learning framework, the algorithm has no prior knowledge of the true labels of the dataset and tries to draw inferences from the dataset itself. A more recent and popular set of algorithms referred to as semi-supervised learning consists of algorithms that lie in between supervised and unsupervised learning. Such a learning framework makes use of labeled and unlabeled data for better inference. Existing research on semi-supervised learning corroborates that adding a skimpy amount of labeled data together with a large amount of unlabeled data produces considerable improvements in the accuracy of the predictive model. This manuscript primarily deals with different variations of the k-means algorithm, which falls under the family of unsupervised learning. Therefore, this paper will focus only on unsupervised learning algorithms. Clustering algorithms exploit the underlying structure of the data distribution and define rules for grouping the data with similar characteristics []. This process results in the partition of a given dataset according to the clustering criteria without any prior knowledge about the dataset. In an ideal clustering scenario, each cluster consists of similar data instances that are quite dissimilar from the instances in other clusters. Such a dissimilarity measure relies on the underlying data and objective of the algorithm. Clustering is central to many data-driven applications and is considered an interesting and important task in machine learning. It is also studied in statistics, pattern recognition, computational geometry, bioinformatics, optimization, image processing, and in a variety of other fields [,,,]. A plethora of clustering techniques have been invented in the last decade, which are being applied in a wide range of application domains []. shows the recent applications on k-means clustering [] in different application domains. This survey studies the problems of and solutions to partition-based clustering, and more specifically the widely used k-means algorithm [], which has been listed among the top 10 clustering algorithms for data analysis []. Due to their popularity and ease of use, k-means clustering methods are being used in conjunction with deep learning for tasks such as image segmentation and handwriting recognition []. A more recent work in [] used a fully connected deep convolution neural network along with k-means and performed pixel matching between a segmented image and a convoluted image. This overcomes the problem that exists when useful information from images is lost by repeated convolution of the images []. Although the k-means clustering algorithm itself performs well with compact and hyper-spherical clusters, we are interested in highlighting its limitations and suggesting solutions. The primary focus is given to two unavoidable problems of the k-means algorithm: (i) assignment of centroids and number of clusters and (ii) ability to handle different types of data. Although researchers have proposed variants () of k-means algorithms to overcome these impediments, they are however domain specific and do not generalize well. For example, a k-means algorithm which can handle categorical data might perform poorly because of the initialization process used. outlines a comparative analysis of existing surveys on clustering and highlights the contributions of this paper. The key contributions of this paper are listed below.
Paper RoadmapOur paper is structured as follows. and address the problems of the k-means algorithm and review the existing solutions for them. showcases an experimental analysis of the representative algorithms (from and ) on six benchmark datasets widely used by the machine learning and data mining research community. concludes the paper, summarizing the key insights regarding the k-means algorithm. 2. k-means Variants for Solving the Problem of InitializationThe k-means algorithm depends on the value of k; which always needs to be specified in order to perform any clustering analysis. Clustering with different k values will eventually produce different results. Different initialization problems that were analyzed in recent studies did not consider the problem where the algorithm only converges to a poor local minima. In [], an alternative approach was adopted to prevent the k-means algorithm from being easily affected by noise and outlier values. The authors presented a modified k-means algorithm based on self-paced learning theory. Self-paced learning theory is used in order to select competing training subset that helps the k-means algorithm to build an initial cluster model. The generalization ability of the algorithm is then improved by subsequently adding training subsets found by self-paced learning until the model reaches an optimal performance or all the training samples have been used. The authors proposed this algorithm and demonstrated its performances on several real datasets. The authors in [] proposed an improvement to the vanilla k-means algorithm that prevents it from getting stuck to a local minima. The improved algorithm incorporates cuckoo search along with the k-means algorithm. The method incorporates a modified cuckoo search algorithm, which helps to reach a better solution since the search step size factor is being changed. Overall, the proposed algorithm is robust as well as fast in reaching a near-optimal solution. In [], the authors discussed criteria concerning the stability and the performance of the k-means. Theoretical results of algorithm stability were proven as the number of instances approaches infinity. The authors in [] analyzed the stability of the k-means clustering algorithm in a more practical scenario, the parameter (cluster number) is chosen by stability-based model selection. The primary focus was towards drawing random sample points to explain the effect on the stability of the algorithm. An initial estimation method for computing the covariance matrix for the k-means algorithm using Mahalanobis distance was proposed in []. It involves finding a group of points comprised of neighbors with a high density that represent the centroids of the selected clusters. These provide an approximate estimate of the covariance matrix, which is updated successively using the proposed algorithm. Ball et al. [] proposed the ISODATA algorithm to estimate k by parameter tweaking. A similar strategy is also adopted in adaptive resonance theory [], which generates new clusters []. Since this approach requires a predefined threshold, it is generally avoided. In x-means clustering [], the BIC (Bayesian information criterion) or “Schwarz criterion” [,] is utilised to find the number of k clusters. For a given set of data, a family of different models initialized with different k values is used to compute the posterior probability distribution. These distributions are then used to find the score of the different models. This algorithm was utilized in identifying anomalous patterns in [,]. Bradley et al. [] outlined methods for identifying a more fine grained starting condition that relies on the estimation of the distribution modes. The algorithm initially chooses small samples from the dataset and applies k-means clustering. These temporary sets are then further clustered using the k-means to find better initialization. This algorithm’s computational complexity is significantly lower compared to vanilla k-means and, more importantly, is scalable to large datasets. displays the methodology for obtaining refined initial points. In [], the author empirically compared four initialization methods. These are as follows: (i) RANDOM, (ii) FA [], (iii) MA [], and (iv) KA []. The RANDOM method divides the dataset randomly into k number of clusters and is one of the most popular initialization methods. The FA approach [] randomly chooses k instances from the given dataset and assigns the remaining instances to the nearest cluster centers. The MA approach [] randomly selects k instances of the dataset, similar to basic k-means. The KA approach [] provides a method where the initial clustering is done by selecting representative instances successively until k instances are found. The first representative instance is the most centrally located in the dataset. The remaining representative instances are chosen based on the heuristic that presents rules of choosing instances with higher numbers compared to others. Based on their experimental analysis, the RANDOM and KA resulted in superior performance compared to the other two. Recently, in [], a co-clustering algorithm is proposed for dealing with sparse and high dimensional data that outperforms the basic k-means. These solutions can only handle numerical data. Hence such methods would not be suitable for datasets consisting of mixed attributes, for instance, categorical and binary. In the next section, we discuss the research progress dealing with mixed types of data for clustering. 3. k-means Variants for Solving the Problem of Data IssueIn many application domains, datasets include attributes that are of mixed data types []. Therefore, a distance calculation scheme is necessary that is able to compute distances between instances having mixed types of data. This section discusses existing research on k-means to find similarity or distance metrics between categorical attributes. Although the algorithms discussed in the previous section solve the initialization problem, they still rely on numerical data for distance calculation. Therefore, it is necessary to explore the distance calculation mechanism for the k-means clustering algorithm and our investigation as follows. Wang et al. [] introduced an extended form of the fuzzy k-means (xFKM) algorithm for datasets with non-numerical attributes, such as categorical data. The centroids constitute an extended form to retain as much clustering information as possible. A computational cost comparison was made among the k-means, fuzzy k-means and xFKM. They showed that the xFKM is most suitable for datasets with categorical attributes, where the attributes vary in a medium and acceptable range of diverse values. Amir et al. [] offered a cost function and distance measure for clustering datasets with mixed data (datasets with numerical and categorical data) based on co-occurrences of values. In [], a kernel function based on “hamming distance” [] was proposed for embedding categorical data. The kernel-k-means provides an add-on to the k-means clustering that is designed to find clusters in a feature space where distances are calculated via kernel functions. Huang et al. in [] provided a variant of the k-means algorithm, where a dissimilarity measure is used as a metric. The algorithm combines the k-means and k-modes clustering for data having mixed attributes. In [], the convergence of different versions of the modified k-modes algorithms was analyzed. The authors proved that the modified algorithms fail to converge to a local minima without degrading the original k-mode algorithm. To handle this issue, the authors presented two algorithms, MKM-NOF and MKM-NDM, that apply different methods for representing clusters by weighted cluster prototypes. In [], an “ellipsoidal k-means” algorithm was proposed that extends the “spherical k-means” algorithm for selecting features in a sparse and high-dimensional dataset. In [], an entropy weighting k-means (EWKM) clustering algorithm was presented for clustering sparse datasets with high dimensions. The algorithm reduces the dispersion within the cluster and increases the negative weight entropy. Existing solutions for handling mixed data certainly enhanced the capability of the k-means; however, as the solutions do not address the initialization problem, several issues with k-means still remain. Among many proposed measures [,,], the overlap measure is considered to be the easiest and most computationally efficient method to find similarity between two categorical attributes []. One of the shortcomings of this method is inability to differentiate between the distinct attribute values, however, focuses on whether two categorical attributes attain equal values or not. However, after a thorough experimental analysis in [] on a number of datasets, the authors came to the conclusion that “No single measure is always superior or inferior. This is to be expected since each dataset has different characteristics”. Hence, the overlap measure (for categorical attributes) and Euclidean distance (for numerical attributes) can be combined into a mixed distance measure, as also used in [,]. 4. Performance Evaluation of k-means Representative VariantsWe performed an experimental analysis to investigate different versions of the k-means algorithms for different datasets, especially for the initialization and mixed-data-type problems (GitHub Link: shorturl.at/CR245). All datasets are available in te UCI machine learning repository []. briefly summarizes the datasets. The next subsections discuss the evaluation metrics used, results analysis and, last but not least, the analysis of computational complexity. 4.1. Metrics Used for Experimental AnalysisTo evaluate the performance of different algorithms, the following metrics [,,] were chosen:
4.2. ResultsExperimental analysis with k-means [], x-means [], constrained k-means [], k-prototype [], and kernel k-means [] was performed, and a performance comparison in terms of the metrics described above was made. For each of the experiments, a 10 fold cross validation scheme was used to split the dataset, and the reported results consist of average and standard deviation of the scores of these metrics across 10 validation folds. addresses the algorithms that deal with initialization issues, while encompasses mixed data type oriented methods. summarizes the mean and the standard deviation across five validation folds and compares the performance between regular k-means, x-means, and constrained k-means algorithms in terms of different metrics for the Wisconsin Diagnostic Breast cancer, KDD Cup 1999 (10%) and Epileptic Seizure datasets. In , it is shown that the algorithm performance varied with different datasets. For example, the k-means performed best with the KDD Cup datasets and x-means had the worst accuracy. However, the x-means performed better than the other two in the Epileptic dataset. The constrained-k-means performed best in the Wisconsin dataset. In terms of ARI score, the constrained-k-means seemed to perform in a consistent manner. The key takeaway from these results is that there is no algorithm that will provide a consistent solution regardless of the dataset. compares the performance between the k-prototype and kernel-k-means algorithms. These algorithms are suitable for datasets that consist of mixed attributes. In , it is reflected that the k-prototype performed better using the Credit Approval and Post Operative datasets when considering the accuracy of clustering. However, kernel-k-means performed best using the Cleveland Heart Disease dataset. In terms of ARI score comparison, the kernel-k-means algorithm consistently performed better than k-prototype using all three datasets containing mixed data. 4.3. Computational Complexity AnalysisIn addition to the comparison of these variants of k-means algorithms in terms of different metrics on different datasets, a time and space complexity comparison of these algorithms was also carried out. Both the time and space complexities of these algorithms depend on the size n of the datasets. Finding an optimal solution for the k-means algorithm is hard in the Euclidean space for both the binary and multi-class clustering problems. The regular k-means algorithm have a time complexity of O ( n 2 ) , where n is the size of the input data. However, it can be optimized to be linear and be in the order of O ( n ) using certain heuristics mentioned in [,]. By contrast, the time complexity for constrained-k-means is of the order O ( k n ) , where k represents the amount of clusters. It reflects that the constrained-k-means has a lower time complexity than the regular k-means when the dataset size grows. The x-means algorithm, on the other hand, was mainly proposed in order to address the scalability issue of the regular k-means algorithm for large datasets, since x-means involves a progressive increase in the number of clusters within a user-supplied range ( k m i n , k m a x ) . The time complexity of the x-means algorithm is therefore in the order of O ( n log k m a x ) . k m a x is the maximum number clusters possible in a dataset. The space complexity for all these k-means variants is O ( ( n + k ) d ) , where d is the number of features in a dataset. shows the complexities of the different clustering algorithms. 5. ConclusionsIn a wide range of application domains, data analysis tasks heavily rely on clustering. This paper focused on the popular k-means algorithm and the issues of initialization and inability to handle data with mixed types of features. Unlike other review or survey papers, this paper contains both a critical analysis of the existing literature and an experimental analysis on half a dozen benchmark datasets to demonstrate the performances of different variants of k-means. The experimental analysis divulged that there is no universal solution for the problems of the k-means algorithm; rather each of the existing variants of the algorithm is either application-specific or data-specific. Our future research will focus on developing a robust k-means algorithm that can address both problems simultaneously. This paper will also help the data mining research community to design and develop newer types of clustering algorithms that can address the research issues around Big Data [,]. Author ContributionsConceptualization, M.A.; Data curation, R.S. and M.A.; Formal analysis, M.A., R.S., and S.M.S.I.; Funding acquisition, M.A., R.S.; Investigation, M.A., R.S., and S.M.S.I.; Methodology, M.A; Project administration, M.A. and S.M.S.I.; Supervision, M.A. and S.M.S.I.; Validation, R.S. and M.A.; Visualization, M.A. and R.S.; and Writing—original draft, M.A. and R.S.; Writing—review and editing, M.A., R.S. and S.M.S.I. All authors have read and agreed to the published version of the manuscript. Authorship must be limited to those who have contributed substantially to the work reported. FundingThis research received no external funding. Conflicts of InterestThe authors declare no conflict of interest. References
Figure 1. A simple taxonomy of variants of k-means algorithm. Figure 1. A simple taxonomy of variants of k-means algorithm. Figure 2. Refined initial instances, adapted from []. Figure 2. Refined initial instances, adapted from []. Figure 3. Performance of k-means variants addressing initialization issue. Figure 3. Performance of k-means variants addressing initialization issue. Figure 4. Performance of k-means variants addressing mixed data problem. Figure 4. Performance of k-means variants addressing mixed data problem. Table 1. Applications of variants of the k-means algorithm in different application domains. Table 1. Applications of variants of the k-means algorithm in different application domains. ReferenceApplicationAlgorithm[]Face detectionSymmetry-based version of k-means (SBKM).[]Mobile storage positioningPotential k-means.[]Load patternHierarchical k-means (H-Kmeans).[]Wireless sensor networksDistributed k-means and fuzzy c-means.[]Partial multiview dataWeighted k-means.[]Mobile healthk-means implemented with CORDIC.[]Endpoint detectionk-means for realtime detection.[]Big dataPrivacy preserving k-means.[]Multiview datak-means.[]Wind power forecastingk-means with bagging neural network.[]Social tagsk-means based on latent semantic analysis.[]Sensing for IGBT currentk-means with neural network.[]Image segmentation.Kernel k-means Nystrom approximation.[]Image compressionk-means cuckoo optimization.[]Sound source angle estimation.Neural network based on global k-means.[]Shape recognitionFuzzy k-means clustering ensemble (FKMCE).[]Signal processingCompressive k-means clustering (CKM).[]Text processingVanilla k-means.[]High dimensional data processingFast adaptive k-means (FAKM).[]Computational complexityMultiple kernel k-means with late fusion.[]Image processingA hybrid parallelization of k-means algorithm.[]Adaptive clusteringFuzzy k-means with S-distance.[]DDoS detectionSemi-supervised k-means algorithm with hybrid feature.[]OptimizationNon alternating stochastic k-means.[]Data SummarizationModified x-means. Table 2. Comparison with existing surveys. Table 2. Comparison with existing surveys. SurveyInitializationData TypesApplicationsExperimentsYang []✓×××Filippone []✓×××Rai []✓×××This paper✓✓✓✓ Table 3. Summary of datasets. Table 3. Summary of datasets. DatasetSummaryCleveland Heart DiseaseWidely used by machine learning researchers. The goal is to detect the presence of heart disease in a patient.KDD-Cup 1999(10%)Contains standard network traffic that contains different types of cyber attacks simulated in a military network.Wisconsin Diagnostic Breast CancerIncludes features calculated from the images of fine needle aspirate of breast mass.Epileptic Seizure RecognitionCommonly used for feature epileptic seizure prediction.Credit ApprovalContains a mix of attributes, which makes it interesting to be used with k-means for mixed attributes.PostoperativeContains both categorical and integer values. The missing values are replaced with an average. Table 4. Comparison of different algorithms addressing initialization issue. Table 4. Comparison of different algorithms addressing initialization issue. Metrick-meansConstrained k-meansx-meansWisconsin Diagnostic Breast CancerAccuracy 0.223 ± 0.310 0.596 ± 0.406 0.086 ± 0.042 ARI 0.690 ± 0.134 0.682 ± 0.13 0.683 ± 0.128 KDD Cup 1999Accuracy 0.195 ± 0.077 0.118 ± 0.087 0.045 ± 0.034 ARI 0.004 ± 0.007 0.107 ± 0.059 0.085 ± 0.169 Epileptic SeizureAccuracy 0.101 ± 0.060 0.099 ± 0.012 0.102 ± 0.053 ARI 0.005 ± 0.001 0.002 ± 0.001 0.002 ± 0.002 Table 5. Comparison of k-prototype and Kernel-k-means algorithm datasets with mixed data types. Table 5. Comparison of k-prototype and Kernel-k-means algorithm datasets with mixed data types. Metrick-prototypeKernel-k-meansCredit ApprovalAccuracy 0.456 ± 0.061 0.437 ± 0.283 ARI 0.004 ± 0.005 0.044 ± 0.092 Cleveland Heart DiseaseAccuracy 0.462 ± 0.043 0.590 ± 0.080 ARI 0.003 ± 0.001 0.017 ± 0.041 Post OperativeAccuracy 0.462 ± 0.043 0.590 ± 0.080 ARI 0.003 ± 0.001 0.017 ± 0.041 Table 6. Complexities of different variants of k-means. Table 6. Complexities of different variants of k-means. Complexityk-meansConstrained-k-meansx-meansTime O ( n 2 ) O ( k n ) O ( n log k m a x ) Space O ( ( n + k ) d ) O ( ( n + k ) d ) O ( ( n + k ) d ) © 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/). Share and CiteMDPI and ACS Style Ahmed, M.; Seraj, R.; Islam, S.M.S. The k-means Algorithm: A Comprehensive Survey and Performance Evaluation. Electronics 2020, 9, 1295. https://doi.org/10.3390/electronics9081295 AMA Style Ahmed M, Seraj R, Islam SMS. The k-means Algorithm: A Comprehensive Survey and Performance Evaluation. Electronics. 2020; 9(8):1295. https://doi.org/10.3390/electronics9081295 Chicago/Turabian Style Ahmed, Mohiuddin, Raihan Seraj, and Syed Mohammed Shamsul Islam. 2020. "The k-means Algorithm: A Comprehensive Survey and Performance Evaluation" Electronics 9, no. 8: 1295. https://doi.org/10.3390/electronics9081295 Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here. How would you describe kK-means is a centroid-based clustering algorithm, where we calculate the distance between each data point and a centroid to assign it to a cluster. The goal is to identify the K number of groups in the dataset. What is the conclusion of K clustering?Stopping Criteria for K-Means Clustering Even after multiple iterations, if the obtained centroids are same for all the clusters, it can be concluded that the algorithm is not learning any new pattern and gives a sign to stop its execution/training to a dataset. How do you evaluate kThe average of the distance between data points and their cluster centroids can be used to evaluate the performance of a k-means clustering model in the absence of ground truth. The within-cluster sum of squares (WCSS), often known as this metric, is used to establish the ideal number of clusters in the dataset. How do you interpret the results of kInterpreting the meaning of k-means clusters boils down to characterizing the clusters. A Parallel Coordinates Plot allows us to see how individual data points sit across all variables. By looking at how the values for each variable compare across clusters, we can get a sense of what each cluster represents. |