
Building initial partitions through sampling techniques. (English) Zbl 1135.90042

Summary: A variety of iterative clustering algorithms require an initial partition of a dataset as an input parameter. As a rule, a good choice of the initial partition is essential for building a high-quality final partition. In this note, we generate initial partitions by using small samples of the data. Numerical experiments with k-means-like clustering algorithms are reported.
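The sampling-based initialization described in the summary can be sketched as follows. This is a minimal illustration, not the authors' exact procedure: the function names, the 10% sample fraction, and the use of plain Lloyd-style k-means on the sample are all assumptions made for the sketch.

```python
# Sketch (assumed, not the paper's exact algorithm): run k-means on a small
# random sample of the data, then use the resulting sample centroids to
# build an initial partition of the full dataset.
import random


def dist2(a, b):
    # squared Euclidean distance between two points (tuples)
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iters=20, seed=0):
    # plain Lloyd iterations; returns the final centroids
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda j: dist2(p, centroids[j]))
            clusters[i].append(p)
        for j, c in enumerate(clusters):
            if c:  # keep the old centroid if a cluster goes empty
                centroids[j] = tuple(sum(x) / len(c) for x in zip(*c))
    return centroids


def sample_init_partition(points, k, sample_frac=0.1, seed=0):
    # 1. cluster a small random sample (sample_frac is an assumed default)
    rng = random.Random(seed)
    n = max(k, int(sample_frac * len(points)))
    sample = rng.sample(points, n)
    centroids = kmeans(sample, k, seed=seed)
    # 2. assign every point of the full dataset to its nearest sample centroid
    partition = [[] for _ in range(k)]
    for p in points:
        i = min(range(k), key=lambda j: dist2(p, centroids[j]))
        partition[i].append(p)
    return partition
```

The partition returned by `sample_init_partition` can then be handed to any iterative k-means-like algorithm as its starting point; the cost of the extra clustering pass is small because it runs only on the sample.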

MSC:

90C27 Combinatorial optimization

References:

[1] Forgy, E., Cluster analysis of multivariate data: Efficiency vs. interpretability of classifications, Biometrics, 21, 3, 768 (1965)
[2] P. Berkhin, J.D. Becher, Learning simple relations: Theory and applications, in: Proceedings of the Second SIAM International Conference on Data Mining, 2002, pp. 420-436.
[3] Dhillon, I. S.; Modha, D. S., Concept decompositions for large sparse text data using clustering, Machine Learning, 42, 1, 143-175 (2001) · Zbl 0970.68167
[4] Inderjit S. Dhillon, Subramanyam Mallela, Rahul Kumar, Enhanced word clustering for hierarchical text classification, in: Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-2002), 2002, pp. 191-200. · Zbl 1102.68545
[5] J. Kogan, M. Teboulle, C. Nicholas, The entropic geometric means algorithm: An approach for building small clusters for large text datasets, in: D. Boley et al. (Eds.), Proceedings of the Workshop on Clustering Large Data Sets (held in conjunction with the Third IEEE International Conference on Data Mining), 2003, pp. 63-71.
[6] Kogan, J.; Teboulle, M.; Nicholas, C., Data driven similarity measures for \(k\)-means like clustering algorithms, Information Retrieval, 8, 331-349 (2005)
[7] Banerjee, A.; Merugu, S.; Dhillon, I. S.; Ghosh, J., Clustering with Bregman divergences, (Proceedings of the 2004 SIAM International Conference on Data Mining (2004), SIAM), 234-245 · Zbl 1190.62117
[8] Sugar, C. A.; Gareth, J. M., Finding the number of clusters in a dataset: An information-theoretic approach, Journal of the American Statistical Association, 98, 463, 750-763 (2003) · Zbl 1046.62064
[9] Auslender, A.; Teboulle, M.; Ben-Tiba, S., Interior proximal and multiplier methods based on second order homogeneous kernels, Mathematics of Operations Research, 24, 645-668 (1999) · Zbl 1039.90518
[10] Teboulle, M., Entropic proximal mappings with application to nonlinear programming, Mathematics of Operations Research, 17, 670-690 (1992) · Zbl 0766.90071
[11] Teboulle, M., Convergence of proximal-like algorithms, SIAM Journal on Optimization, 7, 1069-1083 (1997) · Zbl 0890.90151
[12] I. Davidson, A. Satyanarayana, Speeding up \(k\)
[13] Epter, S.; Krishnamoorthy, M.; Zaki, M., Clusterability detection and cluster initialization, (Dhillon, I. S.; Kogan, J., Proceedings of the Workshop on Clustering High Dimensional Data and its Applications at the Second SIAM International Conference on Data Mining (2002), SIAM), 47-58
[14] Duda, R. O.; Hart, P. E.; Stork, D. G., Pattern Classification (2000), John Wiley & Sons
[15] Dhillon, I. S.; Guan, Y.; Kogan, J., Refining clusters in high-dimensional text data, (Dhillon, I. S.; Kogan, J., Proceedings of the Workshop on Clustering High Dimensional Data and its Applications at the Second SIAM International Conference on Data Mining (2002), SIAM), 71-82
[16] Dhillon, I. S.; Kogan, J.; Nicholas, C., Feature selection and document clustering, (Berry, M. W., A Comprehensive Survey of Text Mining (2003), Springer-Verlag), 73-100
[17] Teboulle, M., On \(ϕ\)-divergence and its applications, (Phillips, F. Y.; Rousseau, J., Systems and Management Science by Extremal Methods—Research Honoring Abraham Charnes at Age 70 (1992), Kluwer Academic Publishers: Kluwer Academic Publishers Norwell, MA), 255-273
[18] Csiszar, I., Information-type measures of difference of probability distributions and indirect observations, Studia Scientiarum Mathematicarum Hungarica, 2, 299-318 (1967) · Zbl 0157.25802
[19] Celeux, G.; Govaert, G., A classification EM algorithm for clustering and two stochastic versions, Computational Statistics and Data Analysis, 14, 3, 315-332 (1992) · Zbl 0937.62605
[20] Fraley, C.; Raftery, A. E., How many clusters? Which clustering method? Answers via model-based cluster analysis, The Computer Journal, 41, 8, 578-588 (1998) · Zbl 0920.68038
[21] Feller, W., An Introduction to Probability Theory and its Applications, 1 (1968), Wiley: Wiley New York · Zbl 0155.23101
[22] Korolyuk, V. S.; Portenko, N. I.; Skorokhod, A. V.; Turbin, A. F., The Handbook on Probability Theory and Mathematical Statistics (1978), Science: Science Kiev
[23] Dhillon, I. S.; Guan, Y.; Kogan, J., Iterative clustering of high dimensional text data augmented by local search, (Proceedings of the 2002 IEEE International Conference on Data Mining (2002), IEEE Computer Society Press), 131-138
[24] Johnson, N. L.; Kotz, S.; Balakrishnan, N., Continuous Univariate Distributions, vol. 2 (1995), John Wiley: John Wiley New York · Zbl 0821.62001
[25] Porter, M. F., An algorithm for suffix stripping, Program, 14, 130-137 (1980)
[26] E. Chisholm, T. Kolda, New term weighting formulas for the vector space method in information retrieval, Report ORNL/TM-13756, Computer Science and Mathematics Division, Oak Ridge National Laboratory, 1999.
[27] Berry, M.; Browne, M., Understanding Search Engines (1999), SIAM · Zbl 0996.68500
[28] Lumelskii, Ya. P., Confidence limits for linear functions of unknown parameters, Theory of Probability and its Applications, 14, 364-367 (1969)