Number of Clusters in K-Means Clustering

Abstract

This paper investigates the number of clusters in K-Means clustering and analyzes the methods used to determine it. The purpose is to obtain the correct number of clusters for a given document with the fewest errors. A comparison of the different methods is therefore necessary to establish the most accurate way of obtaining the correct number of clusters. By analyzing studies conducted by various researchers, this paper gains insight into the methods as well as their strengths and weaknesses. Clusters in K-Means have clearly defined variables that do not overlap: each cluster must contain at least one item, and items must not be shared between clusters. A higher number of clusters favors accuracy, since the features are narrowed down to more specific ones. Data may be clustered using the features of the items. This analysis also shows the strengths and weaknesses of the number of clusters generated in K-Means clustering.

Introduction

Clustering is a widespread practice in science for presenting knowledge understandably. In most fields of study, clustering makes comprehensive data manageable. Clusters can be formed using different approaches, and the approaches differ on the question of which items belong in the same group. The problem in clustering is that accuracy may be compromised, or many errors may have to be tolerated (Davies and Bouldin 226). There are various ways of obtaining the number of clusters, among them K-Means clustering. K-Means clustering is no exception to this problem, since the user must declare the number of clusters (Boutsinas et al. 143). K-Means clustering is a method whose purpose is to group data into a certain number of clusters depending on specific features. He et al. (1) say that clustering unveils more information and portrays more features. K-Means clustering, therefore, uses the average of the items in a cluster to represent it. This paper identifies the different methods that have been applied in attempts to obtain the number of clusters in K-Means clustering and explains how they are used. Moreover, it evaluates the strengths and weaknesses of obtaining the number of clusters using K-Means clustering.

Methodology

Various researchers have noted the challenges of choosing the number of clusters in K-Means clustering. The different criteria and methods used for choosing the number of clusters have been documented by various scholars prior to this investigation. This documented material on methods of obtaining the number of clusters will inform the present investigation into the numbering of clusters in K-Means clustering. The information presented in this report has been gathered from secondary sources.

Number of Clusters in K-Means

The number of clusters depends on the method used. These methods are discussed below.

Methods of clustering

Salvador & Chan (574) note that K-Means clustering generates a point at the middle of each cluster; this middle point is the average (mean) of the cluster. The accuracy of K-Means clustering is reflected in the number of clusters: the higher the number of clusters, the more accurate the result is likely to be. The numbering is based on the kind of data and the size of the document.
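The mean-centred update described above can be sketched as a minimal Lloyd's-algorithm loop. This is an illustrative implementation, not the cited authors' code; the two-blob toy data and the deterministic farthest-point initialisation are assumptions made for the example:

```python
import numpy as np

def kmeans(X, k, n_iter=100):
    # Farthest-point initialisation: start from X[0], then repeatedly add
    # the point farthest from all chosen centroids (deterministic).
    centroids = [X[0]]
    for _ in range(k - 1):
        d = np.min([np.linalg.norm(X - c, axis=1) for c in centroids], axis=0)
        centroids.append(X[d.argmax()])
    centroids = np.array(centroids)
    for _ in range(n_iter):
        # Assign each point to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each centroid as the average of its assigned points.
        new_centroids = np.array(
            [X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
             for j in range(k)])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Two well-separated toy blobs: rows 0-19 near (0, 0), rows 20-39 near (5, 5).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.1, (20, 2)), rng.normal(5.0, 0.1, (20, 2))])
labels, centroids = kmeans(X, k=2)
```

Each centroid ends up as the mean of its cluster, which is the "middle point" the text refers to.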

Information criteria

An information criterion can shed light on the number of clusters, as revealed by Lim & Lee (935). The number of clusters is determined by weighing the information retained against that lost during data collection; the method assumes that some information is lost in the process of obtaining the data. Clusters are formed after considering the features of each cluster in relation to the available data.
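An information-criterion selection can be sketched as follows: fit the clustering for several candidate values of k, then balance goodness of fit (sum of squared errors) against a complexity penalty. The SSE values below are hypothetical, and the simplified BIC-style formula is an assumption for illustration, not the criterion from Lim & Lee:

```python
import math

def bic_score(sse, n, k):
    # Simplified BIC-style score: a fit term (log of mean squared error)
    # plus a penalty that grows with the number of clusters k.
    return n * math.log(sse / n) + k * math.log(n)

# Hypothetical within-cluster SSE for k = 1..5 on a dataset of n = 100 points.
sse_by_k = {1: 500.0, 2: 120.0, 3: 60.0, 4: 58.0, 5: 57.0}
n = 100

# Choose the k with the lowest score: fit improves with k, penalty rises.
best_k = min(sse_by_k, key=lambda k: bic_score(sse_by_k[k], n, k))
```

With these numbers the score keeps falling until k = 3 and then turns upward, so the criterion picks three clusters.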

Rule of thumb

This method is widely used, although its accuracy is limited and it may fail to fit some situations. It is simple and easy to apply. One of the rule's strengths is that it changes at a relatively constant rate. The number of clusters is obtained by taking the square root of half the number of items.
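The rule just described, k ≈ √(n/2), is trivial to compute. The helper below is a small illustration written for this paper, not taken from any of the cited works:

```python
import math

def rule_of_thumb_k(n_items):
    """Rule-of-thumb estimate: the square root of half the number of items."""
    return max(1, round(math.sqrt(n_items / 2)))

# For 200 items: sqrt(200 / 2) = sqrt(100) = 10 clusters.
k = rule_of_thumb_k(200)
```

As the text notes, this gives only a rough starting value; it ignores the structure of the data entirely.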

Text

According to Salvador & Chan (579), there are several methods one can adopt to obtain the number of clusters. They include a cross-validation method, whose partitioning attempts to keep each item in a single cluster rather than letting items overlap. The second is a probability-based estimation method. Another method obtains the number of clusters by calculating the approximation error and applying the results to a time series. In addition, Bezdek & Pal (301) identify resampling as a further method, similar to consensus clustering, where the user groups items while paying attention to which groups specific items fit.

Williams (1) notes that the number of clusters can be determined from the size of the document: the clusters must be proportional to the size of the entire document. Additionally, different grouping methods result in different clusters. Many clusters imply that each cluster has few constituents. The consistency of the clusters depends on the method used to number them.

In K-Means clustering, each cluster must have a minimum of one item. The number of clusters is denoted k. Usually, items do not overlap, and in most cases no hierarchical order is used. Items in a cluster closely resemble one another (Boutsinas et al. 143).

Cutting et al. (319) mention that clustering should be guided by the need to make access to data easy. Partitioning of the data must therefore consider the number of clusters and the sub-divisions within each partition. This enables the user to gain access to the appropriate data as required. The number of clusters also influences the size of each partition and hence affects the accuracy.

Determining the number of clusters can be guided by the type of data, as propounded by Kaufman & Rousseeuw (4). Clusters can be based on intervals, where each item belongs to an interval, or on dissimilarities, which focus on how far apart two items are. Alternatively, Pal & Bezdek (370) mention that grouping can be based on similarities, which consider the resemblance of the items. In some cases the data is best grouped in a binary system; ratios can be effective for other data types; and mixed-type items form yet another basis for clustering.

Steinley & Brusco (1) argue that the point of clustering is to obtain accurate results using the right number of clusters. Errors can be reduced by categorizing each item in its own cluster; many clusters may be appropriate when the features do not overlap. Boutsinas et al. (143) argue that broad clusters may have increased overlaps, which compromise the accuracy of the numbers. They also note that overlapping may be inevitable, so the user must recognize situations likely to produce overlaps and consider other clustering methods in those cases.

Chiang and Mirkin (6) argue that a number of clusters may be obtained and yet the results may differ, since the clusters may not show the expected outcome. An error in finding the probable number of clusters can undermine the whole process.

Variance

The number of clusters in K-Means can be obtained by applying variance, converted into a percentage. When the percentage of variance explained is plotted against the number of clusters, the first few clusters add large gains, but at some point the gain drops sharply and subsequent points stop increasing; the number of clusters is chosen at this point, where the graph shows an elbow. This is effective for choosing groups whose data vary.
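The elbow procedure can be sketched numerically as follows. The SSE values per k are hypothetical (in practice they come from running K-Means at each k), and the 5-point gain threshold is an arbitrary choice for illustration:

```python
# Hypothetical within-cluster SSE for k = 1..6.
sse = {1: 1000.0, 2: 400.0, 3: 150.0, 4: 140.0, 5: 135.0, 6: 132.0}

# Percentage of variance explained relative to a single cluster.
explained = {k: 100.0 * (1 - s / sse[1]) for k, s in sse.items()}

def elbow(explained, min_gain=5.0):
    # Return the last k before adding a cluster gains < min_gain points:
    # the point where the curve flattens into an elbow.
    ks = sorted(explained)
    for prev, k in zip(ks, ks[1:]):
        if explained[k] - explained[prev] < min_gain:
            return prev
    return ks[-1]

best_k = elbow(explained)
```

Here the explained percentage jumps from 0 to 60 to 85 and then flattens, so the elbow falls at k = 3.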

Cross-validation

After several candidate numbers of clusters have been selected, they can be tested. In line with Jain & Dubes (4), the document is divided into different sets, each tested separately for its value. The values obtained are used to generate an average for each set, and the results determine which sets have the least chance of causing errors. The number of clusters is chosen based on its ability to produce accurate clusters with minimal errors.
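A held-out test of this kind can be sketched as below. The two-blob toy data, the deterministic farthest-point initialisation, and the use of held-out SSE as the score are all assumptions made for illustration, not Jain & Dubes' procedure:

```python
import numpy as np

def fit_centroids(train, k, n_iter=50):
    # Farthest-point initialisation, then plain Lloyd iterations.
    cent = [train[0]]
    for _ in range(k - 1):
        d = np.min([np.linalg.norm(train - c, axis=1) for c in cent], axis=0)
        cent.append(train[d.argmax()])
    cent = np.array(cent)
    for _ in range(n_iter):
        lab = np.linalg.norm(train[:, None] - cent[None, :], axis=2).argmin(axis=1)
        for j in range(k):
            if np.any(lab == j):
                cent[j] = train[lab == j].mean(axis=0)
    return cent

def heldout_sse(train, test, k):
    # Score a held-out set by its squared distance to the fitted centroids.
    cent = fit_centroids(train, k)
    d = np.linalg.norm(test[:, None] - cent[None, :], axis=2)
    return float((d.min(axis=1) ** 2).sum())

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.2, (30, 2)), rng.normal(4.0, 0.2, (30, 2))])
rng.shuffle(X)
train, test = X[:40], X[40:]
scores = {k: heldout_sse(train, test, k) for k in (1, 2, 3)}
```

One caveat: raw held-out SSE keeps falling as k grows, so in practice the averaged comparison across several sets (as the text describes) or a penalised score is needed rather than a simple minimum.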

Silhouette

This method relates the cohesion of items within one cluster to their separation from items in other clusters, and thus identifies the natural clustering. If the mean silhouette value of a clustering is close to 1, the number of clusters is correct; a value nearing -1 indicates a weak clustering and is considered incorrect. Thus, the number of clusters with the highest mean value is chosen as the most accurate (Jain & Dubes 4).
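The silhouette value can be computed directly from its definition, as in the sketch below; the four-point toy data is made up for illustration, and this is not code from the cited source:

```python
import numpy as np

def silhouette_mean(X, labels):
    """Mean silhouette: (b - a) / max(a, b) per point, where a is the mean
    distance to points in the same cluster and b the smallest mean distance
    to any other cluster. Values near +1 mean tight, well-separated clusters."""
    D = np.linalg.norm(X[:, None] - X[None, :], axis=2)
    scores = []
    for i, li in enumerate(labels):
        same = (labels == li)
        if same.sum() < 2:
            scores.append(0.0)  # singleton clusters get a neutral score
            continue
        a = D[i, same].sum() / (same.sum() - 1)   # excludes the point itself
        b = min(D[i, labels == lj].mean() for lj in set(labels) if lj != li)
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))

X = np.array([[0.0, 0.0], [0.0, 0.1], [5.0, 5.0], [5.0, 5.1]])
good = silhouette_mean(X, np.array([0, 0, 1, 1]))  # matches the two pairs
bad = silhouette_mean(X, np.array([0, 1, 0, 1]))   # mixes the pairs up
```

The correct two-cluster labelling scores close to 1, while the mixed-up labelling scores negative, which is exactly the decision rule the text describes.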

Use of topics and content

A general approach to numbering clusters is to find the general topics together with the kind of content included (Ray & Turi 1). This method makes retrieving data from a comprehensive document simple and saves time, as pointed out by Liu & Gong (191), which matters because text data can be enormous. Pal & Pal (1277) add that the use of themes may also be an alternative to the use of keywords, which in most cases is limiting for people who are not conversant with the keywords or terminology used in different studies.

Hierarchical clustering

Boutsinas et al. (143) indicate that hierarchical clustering is another method that can be used. The means are arranged in a hierarchical order so that groups can be deduced from them. A hierarchy applies in some cases but not in many, because determining the level at which the hierarchy should be split can be a daunting task. Moreover, hierarchical clustering is not suitable for large amounts of data that may have further divisions under a classification. In line with Liu & Yang (689), hierarchical clustering benefits people interested in general information.

Evolutionary algorithm

Lu and Traore (1) suggest that an evolutionary algorithm is the best way of numbering the clusters in K-Means. This method is believed to have the potential to lower the level of error. It incorporates a Gaussian Mixture Model (GMM) and the Expectation-Maximization (EM) algorithm to obtain the number of clusters.

Liu & Gong (191) propose a three-step method intended to be effective. The first step sets features that are representative of the document: frequent words unique to the document are considered, with the criterion based on the number of times the unique words appear. This is not fully effective, since the words may not be purely unique; moreover, repeated unique words such as names of people or places may belong to different contexts in diverse topics and may therefore be misleading. To overcome this weakness, significant words, terminologies, and paired words are used instead. The next step is to use the GMM (Gaussian Mixture Model) together with the EM (Expectation-Maximization) algorithm. A shortcoming of the GMM and EM algorithm is that all members of the same cluster are treated alike.
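The GMM/EM step can be illustrated with a minimal one-dimensional EM loop. This is a pure-Python sketch on toy numbers, far simpler than Liu & Gong's document-feature setting, and the data is invented for the example:

```python
import math

def em_gmm_1d(data, k, n_iter=200):
    """One-dimensional GMM fitted with EM: the E-step computes soft cluster
    responsibilities, the M-step re-estimates weights, means and variances."""
    n = len(data)
    lo, hi = min(data), max(data)
    means = [lo + (hi - lo) * (j + 0.5) / k for j in range(k)]
    variances = [1.0] * k
    weights = [1.0 / k] * k
    for _ in range(n_iter):
        # E-step: responsibility of each component for each point.
        resp = []
        for x in data:
            p = [w / math.sqrt(2 * math.pi * v) * math.exp(-(x - m) ** 2 / (2 * v))
                 for w, m, v in zip(weights, means, variances)]
            s = sum(p)
            resp.append([pj / s for pj in p])
        # M-step: re-estimate parameters from the responsibilities.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / n
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            variances[j] = max(1e-6, sum(r[j] * (x - means[j]) ** 2
                                         for r, x in zip(resp, data)) / nj)
    return weights, means, variances

# Two obvious one-dimensional groups, around 0 and around 5.
data = [0.0, 0.1, -0.1, 5.0, 5.1, 4.9]
weights, means, variances = em_gmm_1d(data, k=2)
```

With k = 2 the fitted means settle near the two group centres; running this for several k and comparing an information criterion is how the number of components, and hence clusters, would then be chosen.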

Strengths of K-Means clustering

K-Means clustering reduces errors through its use of the squared sum of error. Kim & Yamashita (70) mention that the numbers obtained from K-Means clustering are effective in analyzing documents that contain spatial information; one of the most effective uses of the method is in the analysis of traffic, and it has been applied in traffic-safety research. Besides being useful for spatial data, Tibshirani & Walther (525) add that the numbering in K-Means clustering has superior predictive ability: actual observations are used to omit biases, while unlabelled information is treated as unverified. Owing to this predictive ability, the numbering method has been adopted in cancer research, where the grouping is based on K-Means clustering. Furthermore, K-Means clustering numbers clusters consistently.

Weaknesses of K-Means clustering

K-Means clustering is an effective method but is limited to relatively manageable amounts of data; large documents may be too complicated for such a simplistic method. The method is also limited by the initial selection of the clusters, which may introduce inaccuracy and divergent results.

Magidson & Vermunt (38) indicate that the numbering in K-Means clustering is predetermined, and this introduces bias. They propose that the user consider probability-based methods that avoid misplacing the clusters. The method itself offers no assistance in numbering the clusters, and no diagnostic statistical method is employed, so the numbering may fail to be accurate.

When numbering the clusters in K-Means, the user must standardize the variables to ensure that there are no overlaps; this is done in an attempt to distribute the items among the clusters. However, Magidson & Vermunt (38) indicate that standardization is limiting because the variation in the standardized variables is unknown.

Conclusion

Various scholars have propounded various methods of clustering, all of which differ in how the number of clusters is selected. The differences exist because of the nature of the data and the comprehensiveness of the document; some methods suit certain documents depending on the purpose. The number of clusters in K-Means clustering is dependable and consistent: K-Means clustering ensures that each cluster has at least one item and does not allow overlapping. The clustering methods discussed include information criteria, the rule of thumb, text-based methods, variance, cross-validation, the silhouette, the use of topics and content, hierarchical clustering, and the evolutionary algorithm. Obtaining the number of clusters in K-Means clustering reduces the level of error through the use of the squared sum of error, and it is effective for documents that contain spatial data. However, it may be limiting when large amounts of data are to be clustered. The algorithm has been successfully used in determining the safety of roads as well as in cancer research, and it has predictive capability.

Recommendation

Choosing the number of clusters in K-Means clustering is guided by the necessity of obtaining the right numbers that will yield accurate results with minimal errors. It is crucial to note what kind of data is to be clustered and to select a suitable way of numbering the clusters using K-Means clustering.

According to Magidson & Vermunt (38), biases in finding the number of clusters can be reduced by greater use of probability methods, which avoid misplacing the clusters. In addition, they propose that diagnostic statistical programs can be used to determine the number of clusters. When it comes to standardization of variables, the user may choose to avoid standardization that could lead to crowding of items in a certain cluster; alternatively, unchanging variables can be used.

Treating all members of the same cluster identically can be avoided. It is therefore necessary to form clusters based on features and to obtain the number of clusters with the least possible error. Reviewing the data before choosing the clustering method is essential. Moreover, the numbers of clusters obtained from K-Means clustering are reliable.

Works Cited

Bezdek, James and Pal, Nikhil. Some new indexes of cluster validity. IEEE Trans. Systems, Man, and Cybernetics, Part B, 1998: 28, pp. 301-315.

Boutsinas, Basilis, Tasoulis, Dimitris, and Vrahatis, Michael. Estimating the number of clusters using a windowing technique. Journal of Pattern Recognition and Image Analysis, 2006: 16 (2) 143-154.

Chiang, Mark and Mirkin, Boris. Intelligent choice of the number of clusters in K-Means clustering: An experimental study with different cluster spreads. Journal of Classification, 2010: 27 (1) 3-40.

Cutting, Douglass, Karger, David, Pedersen, Jan, and Tukey, John. Scatter/Gather: A cluster-based approach to browsing large document collections. ACM SIGIR, 1992, pp. 318-329.

Davies, David and Bouldin, Donald. A cluster separation measure. IEEE Trans. Pattern Anal. Machine Intell., 1979: vol. 1, pp. 224-227.

Jain, Anil and Dubes, Richard. Algorithms for Clustering Data. New Jersey: Prentice Hall, 1988.

Kaufman, Leonard and Rousseeuw, Peter. Finding Groups in Data: An Introduction to Cluster Analysis, Wiley Series in Probability and Mathematical Statistics. New York: A Wiley-Interscience Publication, 1990.

Kim, Karl & Yamashita, Eric. Using a k-means clustering algorithm to examine Patterns of pedestrian involved crashes in Honolulu, Hawaii. Journal of Advanced Transportation, 2007: 41, (1) 69–89

Liu, Xin and Gong, Yihong. Document clustering with cluster refinement and model selection capabilities. In Proc. of ACM SIGIR, 2002, pp. 191-198.

Liu, Jie and Yang, Yihong. Multiresolution color image segmentation. IEEE Trans., 1994: 16, 689-700.

Lim, Yee-Wei and Lee, Su Kim. On the color image segmentation algorithm based on the thresholding and the fuzzy c-means techniques. Pattern Recognition, 1990: 23, 935-952.

Lu, Wei and Traore, Issa. Determining the optimal number of clusters using a new evolutionary algorithm. In Proc. of the 17th IEEE International Conference on Tools with Artificial Intelligence (ICTAI 05), 2005.

Magidson, Jay and Vermunt, Jeroen. Latent class models for clustering: A comparison with K-means. Canadian Journal of Marketing Research, 2002: 20, 37-44.

Pal, Nikhil and Bezdek, James. On cluster validity for the fuzzy c-means model. IEEE Trans. Fuzzy Systems, 1995: 3, 370-379.

Pal, Nikhil and Pal, Sankar. A review on image segmentation techniques. Pattern Recognition, 1993: 26, 1277-1294.

Ray, Siddheswar. & Turi, Rose. Determination of Number of Clusters in K-Means Clustering and Application in Colour Image Segmentation, 1999.

Salvador, Stan and Chan, Philip. Determining the number of clusters/segments in hierarchical clustering/segmentation algorithms. In Proc. of the 16th IEEE International Conference on Tools with Artificial Intelligence (ICTAI 04), 2004: 576-584.

Steinley, Doug and Brusco, Michael. Choosing the number of clusters in K-means clustering. Psychological Methods, 2011.

Tibshirani, Robert. & Walther, Guenther. Cluster Validation by Prediction Strength. Journal of Computational and Graphical Statistics, 2005: 14, (3) 511-528.

Williams, Graham. Number of clusters, 2010.
