×

Resampling approach for cluster model selection. (English) Zbl 1237.62084

Summary: In cluster analysis, selecting the number of clusters is an “ill-posed” problem of crucial importance. We propose a re-sampling method for assessing cluster stability. Our model suggests that samples’ occurrences in clusters can be considered as realizations of the same random variable in the case of the “true” number of clusters. Thus, similarity between different cluster solutions is measured by means of compound and simple probability metrics. Compound criteria result in validation rules employing the stability content of clusters. Simple probability metrics, in particular those based on kernels, provide more flexible geometrical criteria. We analyze several applications of probability metrics combined with methods intended to simulate cluster occurrences. Numerical experiments are provided to demonstrate and compare the different metrics and simulation approaches.

MSC:

62H30 Classification and discrimination; cluster analysis (statistical aspects)
68T05 Learning and adaptive systems in artificial intelligence
65C60 Computational problems in statistics (MSC2010)

Software:

KernSmooth
PDFBibTeX XMLCite
Full Text: DOI

References:

[1] Aizerman, M. A., Braverman, E. M., & Rozono, L. I. (1964). Theoretical foundations of the potential function method in pattern recognition learning. Automation and Remote Control, 25, 821-837. · Zbl 0151.24701
[2] Anderson, N. H., Hall, P., & Titterington, M. (1994). Two-sample test statistics for measuring discrepancies between two multivariate probability density functions using kernel-based density estimates. Journal of Multivariate Analysis, 50(1), 41-54. · Zbl 0798.62055 · doi:10.1006/jmva.1994.1033
[3] Aronszajn, N. (1950). Theory of reproducing kernels. Transactions of the American Mathematical Society, 68. · Zbl 0037.20701
[4] Baringhaus, L., & Franz, C. (2004). On a new multivariate two-sample test. Journal of Multivariate Analysis, 88(1), 190-206. · Zbl 1035.62052 · doi:10.1016/S0047-259X(03)00079-4
[5] Barzily, Z., Volkovich, Z., Akteke-Ozturk, B., & Weber, G.-W. (2009). On a minimal spanning tree approach in the cluster validation problem. Informatica, 20(2), 187-202. · Zbl 1194.68199
[6] Belopolskaya, Ya., Klebanov, L., & Volkovich, V. (2005). Characterization of elliptic distributions. Journal of Mathematical Sciences, 127(1), 1682-1686. · Zbl 1120.62034 · doi:10.1007/s10958-005-0128-9
[7] Ben-Hur, A.; Guyon, I.; Brownstein, M. J. (ed.); Khodursky, A. (ed.), Detecting stable clusters using principal component analysis, 159-182 (2003), Clifton
[8] Ben-Hur, A., Horn, D., Siegelmann, H. T., & Vapnik, V. (2001). Support vector clustering. Journal of Machine Learning Research, 2, 125-137. · Zbl 1002.68598
[9] Ben-Hur, A.; Elisseeff, A.; Guyon, I., A stability based method for discovering structure in clustered data, 6-17 (2002)
[10] Berg, C., Christensen, J. P. R., & Ressel, P. (1984). Harmonic analysis on semigroups. Berlin: Springer. · Zbl 0619.43001 · doi:10.1007/978-1-4612-1128-0
[11] Breckenridge, J. (1989). Replicating cluster analysis: method, consistency and validity. Multivariate Behavioral Research, 24, 147-161. · doi:10.1207/s15327906mbr2402_1
[12] Calinski, R., & Harabasz, J. (1974). A dendrite method for cluster analysis. Communications in Statistics, 3, 1-27. · Zbl 0273.62010 · doi:10.1080/03610927408827101
[13] Celeux, G., & Govaert, G. (1992). A classification EM algorithm for clustering and two stochastic versions. Computational Statistics & Data Analysis, 14(3), 15, 315-332. · Zbl 0937.62605 · doi:10.1016/0167-9473(92)90042-E
[14] Chakravarthy, S. V., & Ghosh, J. (1996). Scale-based clustering using the radial basis function network. IEEE Transactions on Neural Networks, 7(5), 1250-1261. · doi:10.1109/72.536318
[15] Cheng, R., & Milligan, G. W. (1996). Measuring the influence of individual data points in a cluster analysis. Journal of Classification, 13, 315-335. · Zbl 1059.62554 · doi:10.1007/BF01246105
[16] Conover, W. J., Johnson, M. E., & Johnson, M. M. (1981). Comparative study of tests of homogeneity of variances, with applications to the outer continental shelf bidding data. Technometrics, 23, 351-361. · doi:10.2307/1268225
[17] Cortes, C., & Vapnik, V. (1995). Support-vector networks. Machine Learning, 273-297. · Zbl 0831.68098
[18] Cuevas, A., Febrero, M., & Fraiman, R. (2000). Estimating the number of clusters. Canadian Journal of Statistics, 28(2), 367-382. · Zbl 0981.62054 · doi:10.2307/3315985
[19] Cuevas, A., Febrero, M., & Fraiman, R. (2001). Cluster analysis: a further approach based on density estimation. Computational Statistics & Data Analysis, 28, 441-459. · Zbl 1053.62537 · doi:10.1016/S0167-9473(00)00052-9
[20] Dhillon, I. S., & Modha, D. S. (2001). Concept decompositions for large sparse text data using clustering. Machine Learning, 42(1), 143-175. Also appears as IBM Research Report RJ 10147, July 1999. · Zbl 0970.68167 · doi:10.1023/A:1007612920971
[21] Dhillon, I.; Kogan, J.; Nicholas, C., Feature selection and document clustering, 73-100 (2003), Berlin
[22] Dudoit, S., & Fridlyand, J. (2002). A prediction-based resampling method for estimating the number of clusters in a dataset. Genome Biology, 3(7), 0036. · doi:10.1186/gb-2002-3-7-research0036
[23] Dunn, J. C. (1974). Well separated clusters and optimal fuzzy partitions. Journal of Cybernetics, 4, 95-104. · Zbl 0304.68093 · doi:10.1080/01969727408546059
[24] Duran, B. S. (1976). A survey of nonparametric tests for scale. Communications in Statistics. Theory and Methods, 5, 1287-1312. · Zbl 0362.62044 · doi:10.1080/03610927608827443
[25] Feng, Y.; Hamerly, G., PG-means: learning the number of clusters in data (2006)
[26] Filippone, M., Camastra, F., Masulli, F., & Rovetta, S. (2008). A survey of kernel and spectral methods for clustering. Pattern Recognition, 41(1), 176-190. · Zbl 1122.68530 · doi:10.1016/j.patcog.2007.05.018
[27] Friedman, J. H., & Rafsky, L. C. (1979). Multivariate generalizations of the Wolfowitz and Smirnov two-sample tests. Annals of Statistics, 7, 697-717. · Zbl 0423.62034 · doi:10.1214/aos/1176344722
[28] Girolami, M. (2002). Mercer kernel-based clustering in feature space. IEEE Transactions on Neural Networks, 13(3), 780-784. · doi:10.1109/TNN.2002.1000150
[29] Gokcay, E., & Principe, J. C. (2002). Information theoretic clustering. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(2), 158-171. · doi:10.1109/34.982897
[30] Gordon, A. D. (1994). Identifying genuine clusters in a classification. Computational Statistics & Data Analysis, 18, 561-581. · Zbl 0900.62311 · doi:10.1016/0167-9473(94)90085-X
[31] Gordon, A. D. (1999). Classification. Boca Raton: Chapman and Hall/CRC. · Zbl 0929.62068
[32] Gretton, A.; Borgwardt, K.; Rasch, M.; Schölkopf, B.; Smola, A., A kernel method for the two-sample-problem, No. 19, 513-520 (2007), Cambridge
[33] Gretton, A.; Borgwardt, K.; Rasch, M.; Schölkopf, B.; Smola, A., A kernel approach to comparing distributions, 1637-1641 (2007)
[34] Gretton, A., Borgwardt, K. M., Rasch, M. J., Schölkopf, B., & Smola, A. J. (2008a). A kernel method for the two-sample problem. CoRR. arXiv:0805.2368. DBLP, http://dblp.uni-trier.de. · Zbl 1283.62095
[35] Gretton, A., Borgwardt, K. M., Rasch, M. J., Schölkopf, B., & Smola, A. (2008b). A kernel method for the two-sample problem. Journal of Machine Learning Research, 4, 1-10.
[36] Hall, P., & Tajvidi, N. (2002). Permutation tests for equality of distributions in high-dimensional settings. Biometrika, 89(2), 359-374. · Zbl 1017.62040 · doi:10.1093/biomet/89.2.359
[37] Hamerly, G.; Elkan, Ch., Learning the k in k-means, 281-288 (2003)
[38] Hartigan, J. A. (1975). Clustering algorithms. New York: Wiley. · Zbl 0372.62040
[39] Hartigan, J. A. (1981). Consistency of single linkage for high-density clusters. Journal of the American Statistical Association, 76, 388-394. · Zbl 0468.62053 · doi:10.2307/2287840
[40] Hartigan, J. A. (1985). Statistical theory in clustering. Journal of Classification, 2, 63-76. · Zbl 0575.62058 · doi:10.1007/BF01908064
[41] Haussler, D. (1999). Convolution kernels on discrete structures (UCSC-CRL-9910). Department of Computer Science University of California at Santa Cruz.
[42] Henze, N. (1988). A multivariate two-sample test based on the number of nearest neighbor type coincidences. Annals of Statistics, 16, 772-783. · Zbl 0645.62062 · doi:10.1214/aos/1176350835
[43] Hubert, L., & Schultz, J. (1976). Quadratic assignment as a general data-analysis strategy. British Journal of Mathematical & Statistical Psychology, 76, 190-241. · Zbl 0356.92027 · doi:10.1111/j.2044-8317.1976.tb00714.x
[44] Jain, A., & Dubes, R. (1988). Algorithms for clustering data. New Jersey: Englewood Cliffs/Prentice-Hall. · Zbl 0665.62061
[45] Jain, A. K., & Moreau, J. V. (1987). Bootstrap technique in cluster analysis. Pattern Recognition, 20(5), 547-568. · doi:10.1016/0031-3203(87)90081-1
[46] Kass, R. E. (1995). A reference Bayesian test for nested hypotheses and its relationship to the Schwarz criterion. Journal of the American Statistical Association, 90, 928-934. · Zbl 0851.62020 · doi:10.2307/2291327
[47] Kaufman, L., & Rousseeuw, P. J. (1990). Finding groups in data. New York: Wiley. · Zbl 1345.62009 · doi:10.1002/9780470316801
[48] Klebanov, L. (2003). One class of distribution free multivariate tests. SPb. Math. Society, Preprint, 03.
[49] Klebanov, L. B. (2005). N-distances and their applications. Charsel University in Prague, The Karolinum Press.
[50] Klebanov, L., Kozubowskii, T., Rachev, S., & Volkovich, V. (2001). Characterization of distributions symmetric with respect to a group of transformations and testing of corresponding statistical hypothesis. Statistics & Probability Letters, 53, 241-247. · Zbl 0982.62054 · doi:10.1016/S0167-7152(01)00011-6
[51] Kogan, J., Nicholas, C., & Volkovich, V. (2003a). Text mining with information-theoretical clustering. Computing in Science and Engineering, 52-59.
[52] Kogan, J.; Nicholas, C.; Volkovich, V.; Berry, M. W. (ed.); Pottenger, W. M. (ed.), Text mining with hybrid clustering schemes, 5-16 (2003)
[53] Kogan, J.; Teboulle, M.; Nicholas, C., Optimization approach to generating families of k-means like algorithms (2003)
[54] Krzanowski, W., & Lai, Y. (1985). A criterion for determining the number of groups in a dataset using sum of squares clustering. Biometrics, 44, 23-34. · Zbl 0707.62122 · doi:10.2307/2531893
[55] Kuhn, H. (1955). The Hungarian method for the assignment problem. Naval Research Logistics Quarterly, 2, 83-97. · Zbl 0143.41905 · doi:10.1002/nav.3800020109
[56] Lange, T., Braun, M., Roth, V., & Buhmann, J. M. (2003). Stability-based model selection. Advances in Neural Information Processing Systems, 15. http://citeseer.ist.psu.edu/700728.html. · Zbl 1013.68941
[57] Lange, T., Roth, V., Braun, M., & Buhmann, J. M. (2004). Stability-based validation of clustering solutions. Neural Computation, 15(6), 1299-1323. · Zbl 1089.68100 · doi:10.1162/089976604773717621
[58] Levine, E., & Domany, E. (2001). Resampling method for unsupervised estimation of cluster validity. Neural Computation, 13, 2573-2593. · Zbl 0993.68113 · doi:10.1162/089976601753196030
[59] Lukacs, E. (1970). Characteristic functions. Duxbury: Griffin. · Zbl 0201.20404
[60] Milligan, G., & Cooper, M. (1985). An examination of procedures for determining the number of clusters in a data set. Psychometrika, 50, 159-179. · doi:10.1007/BF02294245
[61] Mufti, G. B.; Bertrand, P.; Moubarki, L., Determining the number of groups from measures of cluster validity, 404-414 (2005)
[62] Parzen, E. (1962). On the estimation of a probability density function and the mode. Annals of Mathematical Statistics, 32, 1065-1076. · Zbl 0116.11302 · doi:10.1214/aoms/1177704472
[63] Pascual, D., Pla, F., & Sanche, J. S. (2010). Cluster validation using information stability measures. Pattern Recognition Letters, 31, 454-461. · doi:10.1016/j.patrec.2009.07.009
[64] Pelleg, D.; Moore, A., X-means: Extending K-means with efficient estimation of the number of clusters, 727-734 (2000), San Francisco
[65] Principe, J.; Xu, D.; Fisher, J., Information theoretic learning, 265-319 (2002), New York
[66] Rachev, S. T. (1991). Wiley series in probability and mathematical statistics. Probability metrics and the stability of stochastic models. Chichester: Wiley. · Zbl 0744.60004
[67] Rose, K., Gurewitz, E., & Fox, G. C. (1990). Statistical mechanics and phase transitions in clustering. Physical Review Letters, 65(8), 945-948. · doi:10.1103/PhysRevLett.65.945
[68] Rose, K., Gurewitz, E., & Fox, G. C. (1993). Constrained clustering as an optimization method. IEEE Transactions on Pattern Analysis and Machine Intelligence, 15(8), 785-794. · doi:10.1109/34.236251
[69] Robert, J., & Torbjorn, E. (2008). A new information theoretic analysis of sum-of-squared-error kernel clustering. Neurocomputing, 72(1-3), 23-31.
[70] Rosenbaum, P. (2005). An exact distribution-free test comparing two multivariate distributions based on adjacency. Journal of the Royal Statistical Society. Series B, Statistical Methodology, 67(4), 515-530. · Zbl 1095.62053 · doi:10.1111/j.1467-9868.2005.00513.x
[71] Roth, V., Lange, T., Braun, M., & Buhmann, J. (2002). A resampling approach to cluster validation. COMPSTAT, available at http://www.cs.uni-bonn.De/braunm. · Zbl 1439.62034
[72] Sato, K. (1999). Levy processes and infinitely divisible distributions. Cambridge: Cambridge University Press. · Zbl 0973.60001
[73] Schoenberg, I. J. (1938). Metric spaces and positive definite functions. Transactions of the American Mathematical Society, 44(3), 522-536. · JFM 64.0617.02 · doi:10.1090/S0002-9947-1938-1501980-0
[74] Schölkopf, B., The kernel trick for distances, 301-307 (2000)
[75] Schölkopf, B., & Smola, A. J. (2002). Learning with kernels. New York: MIT Press.
[76] Schölkopf, B., Smola, A. J., & Muller, K.-R. (1998). Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10(5), 1299-1319. · doi:10.1162/089976698300017467
[77] Steinwart, I. (2001). On the influence of the kernel on the consistency of support vector machines. Journal of Machine Learning Research, 2, 67-93. · Zbl 1009.68143
[78] Still, S., & Bialek, W. (2004). How many clusters? An information-theoretic perspective. Neural Computation, 16(12), 2483-2506. · Zbl 1062.82045 · doi:10.1162/0899766042321751
[79] Strehl, A., & Ghosh, J. (2002). Cluster ensembles—a knowledge reuse framework for combining multiple partitions. Journal of Machine Learning Research, 3, 583-617. · Zbl 1084.68759
[80] Stuetzle, W. (2003). Estimating the cluster tree of a density by analyzing the minimal spanning tree of a sample. Journal of Classification, 20(5), 25-47. · Zbl 1055.62075 · doi:10.1007/s00357-003-0004-6
[81] Sugar, C., & James, G. (2003). Finding the number of clusters in a data set: an information theoretic approach. Journal of the American Statistical Association, 98, 750-763. · Zbl 1046.62064 · doi:10.1198/016214503000000666
[82] Tibshirani, R., & Walther, G. (2005). Cluster validation by prediction strength. Journal of Computational and Graphical Statistics, 14(3), 511-528. · doi:10.1198/106186005X59243
[83] Tibshirani, R., Walther, G., & Hastie, T. (2001). Estimating the number of clusters via the gap statistic. Journal of the Royal Statistical Society. Series B, Statistical Methodology, 63(2), 411-423. · Zbl 0979.62046 · doi:10.1111/1467-9868.00293
[84] Tishby, N.; Pereira, F. C.; Bialek, W.; Hajek, B. (ed.); Sreenivas, R. S. (ed.), The information bottleneck method, 368-377 (2000)
[85] Volkovich, Z.; Barzily, Z., On application of probability metrics in the cluster stability problem, Lisbon, Portugal
[86] Volkovich, V.; Kogan, J.; Nicholas, C.; Dhillon, I. (ed.); Kogan, J. (ed.), k-means initialization by sampling large datasets, 17-22 (2004)
[87] Volkovich, Z., Barzily, Z., & Sureanu, P. (2005). The Levy-Khinchine representations and functional algebras of test functions. Journal of Pure and Applied Mathematics, 25(1), 103-121. · Zbl 1095.60009
[88] Volkovich, Z., Barzily, Z., & Morozensky, L. (2008). A statistical model of cluster stability. Pattern Recognition, 41(7), 2174-2188. · Zbl 1138.68519 · doi:10.1016/j.patcog.2008.01.008
[89] Volkovich, Z.; Barzily, Z.; Avros, R.; Toledano-Kitai, D., On application of the K-nearest neighbors approach for cluster validation, Vilnius, Lithuania · Zbl 1237.62084
[90] Volkovich, Z.; Barzily, Z.; Weber, G.-W.; Toledano-Kitai, D., Cluster stability estimation based on a minimal spanning trees approach, Bali, Indonesia · Zbl 1237.62084
[91] Volkovich, Z.; Weber, G.-W.; Avros, R., On an adjacency cluster merit, Gold Coast, Australia
[92] Wand, M. P., & Jones, M. C. (1995). Kernel smoothing. London: Chapman and Hall. · Zbl 0854.62043
[93] Wishart, D.; Cole, A. J. (ed.), Mode analysis: a generalization of nearest neighbor which reduces chaining effects, No. 76, 282-311 (1969), London
[94] Zech, G., & Aslan, B. (2005). New test for the multivariate two-sample problem based on the concept of minimum energy. Journal of Statistical Computation and Simulation, 75(2), 109-119. · Zbl 1096.62037 · doi:10.1080/00949650410001661440
[95] Zinger, A. A., Kakosyan, A. V., & Klebanov, L. B. (1989). Characterization of distributions by means of the mean values of statistics in connection with some probability metrics. Stability Problems for Stochastic Models, VNIISI, 47-55.
[96] Zolotarev, V. M. (1997). Modern theory of summation of random variable. Leiden: Brill Academic. · Zbl 0907.60001 · doi:10.1515/9783110936537
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. In some cases that data have been complemented/enhanced by data from zbMATH Open. This attempts to reflect the references listed in the original paper as accurately as possible without claiming completeness or a perfect matching.