
Boosting the margin: a new explanation for the effectiveness of voting methods. (English) Zbl 0929.62069

Summary: One of the surprising recurring phenomena observed in experiments with boosting is that the test error of the generated classifier usually does not increase as its size becomes very large, and often is observed to decrease even after the training error reaches zero. We show that this phenomenon is related to the distribution of margins of the training examples with respect to the generated voting classification rule, where the margin of an example is simply the difference between the number of correct votes and the maximum number of votes received by any incorrect label.
We show that techniques used in the analysis of Vapnik’s support vector classifiers and of neural networks with small weights can be applied to voting methods to relate the margin distribution to the test error. We also show theoretically and experimentally that boosting is especially effective at increasing the margins of the training examples. Finally, we compare our explanation to those based on the bias-variance decomposition.
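To make the margin quantity concrete, here is a minimal sketch (not part of the original paper) of how the margin distribution of a weighted voting classifier could be computed. The function name, the array layout, and the normalization of the voting weights to sum to one are illustrative assumptions.

    import numpy as np

    def margin_distribution(votes, weights, labels, n_classes):
        # votes:   (n_examples, n_rounds) array of class indices predicted by the base classifiers
        # weights: (n_rounds,) nonnegative voting weights; normalized here so margins lie in [-1, 1]
        # labels:  (n_examples,) true class indices
        votes = np.asarray(votes)
        labels = np.asarray(labels)
        weights = np.asarray(weights, dtype=float)
        weights = weights / weights.sum()
        n_examples = votes.shape[0]
        # weighted vote received by each (example, label) pair
        tally = np.zeros((n_examples, n_classes))
        for t in range(votes.shape[1]):
            tally[np.arange(n_examples), votes[:, t]] += weights[t]
        correct = tally[np.arange(n_examples), labels]
        # exclude the correct label before taking the maximum over the incorrect labels
        tally[np.arange(n_examples), labels] = -np.inf
        wrong = tally.max(axis=1)
        # margin = weighted vote for the correct label minus the largest weighted vote
        # for any incorrect label; positive exactly when the example is classified correctly
        return correct - wrong

A positive margin means the combined vote classifies the example correctly, and larger margins correspond to more confident classifications; the generalization bounds in the paper are stated in terms of how much of this training-set distribution falls below a given threshold.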

MSC:

62H30 Classification and discrimination; cluster analysis (statistical aspects)
65C60 Computational problems in statistics (MSC2010)
68T05 Learning and adaptive systems in artificial intelligence

Software:

C4.5; AdaBoost.MH; UCI-ml

References:

[1] BARRON, A. R. 1993. Universal approximation bounds for superpositions of a sigmoidal function. IEEE Trans. Inform. Theory 39 930–945. · Zbl 0818.68126 · doi:10.1109/18.256500
[2] BARTLETT, P. L. 1998. The sample complexity of pattern classification with neural networks: the size of the weights is more important than the size of the network. IEEE Trans. Inform. Theory 44 525–536. · Zbl 0901.68177 · doi:10.1109/18.661502
[3] BAUER, E. and KOHAVI, R. 1997. An empirical comparison of voting classification algorithms: bagging, boosting, and variants. Unpublished manuscript.
[4] BAUM, E. B. and HAUSSLER, D. 1989. What size net gives valid generalization? Neural Computation 1 151–160. · Zbl 0743.68083 · doi:10.1007/BF01810851
[5] BOSER, B. E., GUYON, I. M. and VAPNIK, V. N. 1992. A training algorithm for optimal margin classifiers. In Proceedings of the Fifth Annual ACM Workshop on Computational Learning Theory 144–152. ACM, New York.
[6] BREIMAN, L. 1996. Bagging predictors. Machine Learning 24 123–140. · Zbl 0858.68080 · doi:10.1214/aos/1032181158
[7] BREIMAN, L. 1997. Prediction games and arcing classifiers. Technical Report 504, Dept. Statistics, Univ. California, Berkeley. · Zbl 0934.62064 · doi:10.1214/aos/1024691079
[8] BREIMAN, L. 1998. Arcing classifiers (with discussion). Ann. Statist. 26 801–849. · Zbl 0934.62064 · doi:10.1214/aos/1024691079
[9] BREIMAN, L., FRIEDMAN, J. H., OLSHEN, R. A. and STONE, C. J. 1984. Classification and Regression Trees. Wadsworth, Belmont, CA. · Zbl 0541.62042
[10] CORTES, C. and VAPNIK, V. 1995. Support-vector networks. Machine Learning 20 273–297. · Zbl 0831.68098
[11] DEVROYE, L. 1982. Bounds for the uniform deviation of empirical measures. J. Multivariate Anal. 12 72–79. · Zbl 0492.60006 · doi:10.1016/0047-259X(82)90083-5
[12] DIETTERICH, T. G. 1998. An experimental comparison of three methods for constructing ensembles of decision trees: bagging, boosting, and randomization. Unpublished manuscript.
[13] DIETTERICH, T. G. and BAKIRI, G. 1995. Solving multiclass learning problems via error-correcting output codes. J. Artificial Intelligence Res. 2 263–286. · Zbl 0900.68358
[14] DONAHUE, M. J., GURVITS, L., DARKEN, C. and SONTAG, E. 1997. Rates of convex approximation in non-Hilbert spaces. Constr. Approx. 13 187–220. · Zbl 0876.41016 · doi:10.1007/BF02678464
[15] DRUCKER, H. 1997. Improving regressors using boosting techniques. In Machine Learning: Proceedings of the Fourteenth International Conference 107–115. Morgan Kaufmann, San Francisco.
[16] DRUCKER, H. and CORTES, C. 1996. Boosting decision trees. Advances in Neural Information Processing Systems 8 479–485.
[17] FREUND, Y. 1995. Boosting a weak learning algorithm by majority. Inform. Comput. 121 256–285. · Zbl 0833.68109 · doi:10.1006/inco.1995.1136
[18] FREUND, Y. and SCHAPIRE, R. E. 1996. Experiments with a new boosting algorithm. In Machine Learning: Proceedings of the Thirteenth International Conference 148–156. Morgan Kaufmann, San Francisco.
[19] FREUND, Y. and SCHAPIRE, R. E. 1996. Game theory, on-line prediction and boosting. In Proceedings of the Ninth Annual Conference on Computational Learning Theory 325–332.
[20] FREUND, Y. and SCHAPIRE, R. E. 1997. A decision-theoretic generalization of on-line learning and an application to boosting. J. Comput. System Sci. 55 119–139. · Zbl 0880.68103 · doi:10.1006/jcss.1997.1504
[21] FREUND, Y. and SCHAPIRE, R. E. 1998. Adaptive game playing using multiplicative weights. Games and Economic Behavior. · Zbl 0964.91007 · doi:10.1006/game.1999.0738
[22] FRIEDMAN, J. H. 1998. On bias, variance, 0/1-loss, and the curse-of-dimensionality. Available at http://stat.stanford.edu/~jhf.
[23] GROVE, A. J. and SCHUURMANS, D. 1998. Boosting in the limit: maximizing the margin of learned ensembles. In Proceedings of the Fifteenth National Conference on Artificial Intelligence. AAAI Press, Menlo Park, CA.
[24] JONES, L. K. 1992. A simple lemma on greedy approximation in Hilbert space and convergence rates for projection pursuit regression and neural network training. Ann. Statist. 20 608–613. · Zbl 0746.62060 · doi:10.1214/aos/1176348546
[25] KOHAVI, R. and WOLPERT, D. H. 1996. Bias plus variance decomposition for zero-one loss functions. In Machine Learning: Proceedings of the Thirteenth International Conference 275–283. Morgan Kaufmann, San Francisco.
[26] KONG, E. B. and DIETTERICH, T. G. 1995. Error-correcting output coding corrects bias and variance. In Proceedings of the Twelfth International Conference on Machine Learning 313–321. Morgan Kaufmann, San Francisco.
[28] LEE, W. S., BARTLETT, P. L. and WILLIAMSON, R. C. 1996. Efficient agnostic learning of neural networks with bounded fan-in. IEEE Trans. Inform. Theory 42 2118–2132. · Zbl 0874.68253 · doi:10.1109/18.556601
[29] LEE, W. S., BARTLETT, P. L. and WILLIAMSON, R. C. 1998. The importance of convexity in learning with squared loss. IEEE Trans. Inform. Theory. · Zbl 0935.68091 · doi:10.1109/18.705577
[30] MACLIN, R. and OPITZ, D. 1997. An empirical evaluation of bagging and boosting. In Proceedings of the Fourteenth National Conference on Artificial Intelligence 546–551. · Zbl 0924.68159
[31] MERZ, C. J. and MURPHY, P. M. 1998. UCI repository of machine learning databases. Available at http://www.ics.uci.edu/~mlearn/MLRepository.html.
[32] QUINLAN, J. R. 1996. Bagging, boosting, and C4.5. In Proceedings of the Thirteenth National Conference on Artificial Intelligence 725–730.
[33] QUINLAN, J. R. 1993. C4.5: Programs for Machine Learning. Morgan Kaufmann, San Mateo. · Zbl 0900.68112
[34] SAUER, N. 1972. On the density of families of sets. J. Combin. Theory Ser. A 13 145–147. · Zbl 0248.05005 · doi:10.1016/0097-3165(72)90019-2
[35] SCHAPIRE, R. E. 1990. The strength of weak learnability. Machine Learning 5 197–227.
[36] SCHAPIRE, R. E. 1997. Using output codes to boost multiclass learning problems. In Machine Learning: Proceedings of the Fourteenth International Conference 313–321. Morgan Kaufmann, San Francisco.
[37] SCHAPIRE, R. E. and SINGER, Y. 1998. Improved boosting algorithms using confidence-rated predictions. In Proceedings of the Eleventh Annual Conference on Computational Learning Theory. · Zbl 0945.68194 · doi:10.1023/A:1007614523901
[38] SCHWENK, H. and BENGIO, Y. 1998. Training methods for adaptive boosting of neural networks for character recognition. Advances in Neural Information Processing Systems 10 647–653. MIT Press.
[39] TIBSHIRANI, R. 1996. Bias, variance and prediction error for classification rules. Technical Report, Univ. Toronto.
[40] VAPNIK, V. N. 1995. The Nature of Statistical Learning Theory. Springer, New York. · Zbl 0833.62008
[41] VAPNIK, V. N. and CHERVONENKIS, A. YA. 1971. On the uniform convergence of relative frequencies of events to their probabilities. Theory Probab. Appl. 16 264–280. · Zbl 0247.60005 · doi:10.1137/1116025
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. In some cases that data have been complemented/enhanced by data from zbMATH Open. This attempts to reflect the references listed in the original paper as accurately as possible without claiming completeness or a perfect matching.