
Fast modified global \(k\)-means algorithm for incremental cluster construction. (English) Zbl 1213.68514

Summary: The \(k\)-means algorithm and its variations are known to be fast clustering algorithms. However, they are sensitive to the choice of starting points and are inefficient for solving clustering problems in large datasets. Recently, incremental approaches have been developed to resolve difficulties with the choice of starting points. The global \(k\)-means and the modified global \(k\)-means algorithms are based on such an approach. They iteratively add one cluster center at a time. Numerical experiments show that these algorithms considerably improve on the \(k\)-means algorithm. However, they require storing the whole affinity matrix or recomputing it at each iteration, which makes both algorithms time-consuming and memory-demanding for clustering even moderately large datasets. In this paper, a new version of the modified global \(k\)-means algorithm is proposed. We introduce an auxiliary cluster function to generate a set of starting points lying in different parts of the dataset. We exploit information gathered in previous iterations of the incremental algorithm to eliminate the need to compute or store the whole affinity matrix, thereby reducing computational effort and memory usage. Results of numerical experiments on six standard datasets demonstrate that the new algorithm is more efficient than the global and the modified global \(k\)-means algorithms.
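The incremental scheme described above can be illustrated with a minimal sketch: solve the 1-center problem exactly (the centroid), then repeatedly add one more center chosen from a set of candidate starting points and refine with Lloyd's \(k\)-means. This is a simplified illustration, not the authors' algorithm: the paper selects candidates via an auxiliary cluster function and reuses distances from earlier iterations, whereas this sketch simply samples candidate points from the data (the function names, the `n_candidates` parameter, and the random sampling are all illustrative assumptions).

```python
import numpy as np

def lloyd(X, centers, iters=100):
    """Standard Lloyd's k-means refinement; returns centers and total cost."""
    for _ in range(iters):
        # squared distances from every point to every center
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        # recompute each center as the mean of its assigned points
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centers[j] for j in range(len(centers))])
        if np.allclose(new, centers):
            break
        centers = new
    return centers, d.min(axis=1).sum()

def incremental_kmeans(X, k, n_candidates=10, seed=0):
    """Global-k-means-style incremental construction (simplified sketch):
    start from the optimal single center and add one center at a time,
    trying several candidate starting points and keeping the best."""
    rng = np.random.default_rng(seed)
    centers = X.mean(axis=0, keepdims=True)  # exact solution for k = 1
    for _ in range(2, k + 1):
        best = None
        # candidates drawn at random from the data; the paper instead uses
        # an auxiliary cluster function to pick promising starting points
        for c in X[rng.choice(len(X), size=n_candidates, replace=False)]:
            trial, cost = lloyd(X, np.vstack([centers, c]))
            if best is None or cost < best[1]:
                best = (trial, cost)
        centers = best[0]
    return centers
```

Keeping the previously found centers fixed as the first \(k-1\) starting points is what distinguishes this incremental construction from plain \(k\)-means with random restarts: each new subproblem inherits a good partial solution, so only the placement of the new center must be searched.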

MSC:

68T10 Pattern recognition, speech recognition

Software:

UCI-ml
