
Out-of-bag estimation of the optimal sample size in bagging. (English) Zbl 1191.68592

Summary: The performance of \(m\)-out-of-\(n\) bagging with and without replacement is analyzed as a function of the sampling ratio \(m/n\). Standard bagging uses resampling with replacement to generate bootstrap samples of the same size as the original training set \((m_{wr}=n)\), whereas without-replacement methods typically use half-samples \((m_{wor}=n/2)\). These choices of sample size are arbitrary and need not be optimal in terms of the classification performance of the ensemble. We propose to use the out-of-bag estimate of the generalization accuracy to select a near-optimal value for the sampling ratio. Ensembles of classifiers trained on independent samples whose size is chosen so that the out-of-bag error of the ensemble is as low as possible generally outperform standard bagging and can be built efficiently.
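The selection procedure described in the summary lends itself to a simple grid search. The sketch below is an illustration only, not the authors' code: it scans a set of candidate sampling ratios, fits a bagging ensemble of decision trees for each ratio, and keeps the ratio with the lowest out-of-bag error. It uses scikit-learn's BaggingClassifier with with-replacement sampling; the dataset and ratio grid are illustrative assumptions.

```python
# Minimal sketch (not the authors' code): choose the bagging sampling ratio
# m/n by out-of-bag (OOB) error, using scikit-learn's BaggingClassifier with
# with-replacement sampling. Dataset and ratio grid are illustrative.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

ratios = np.linspace(0.1, 1.0, 10)      # candidate sampling ratios m/n
oob_errors = []
for r in ratios:
    # The default base estimator is a decision tree; oob_score=True records
    # the out-of-bag accuracy, which is converted to an error estimate here.
    ens = BaggingClassifier(n_estimators=100, max_samples=float(r),
                            bootstrap=True, oob_score=True, random_state=0)
    ens.fit(X_train, y_train)
    oob_errors.append(1.0 - ens.oob_score_)

best = ratios[int(np.argmin(oob_errors))]   # near-optimal ratio by OOB error
print(f"selected m/n = {best:.1f}, OOB error = {min(oob_errors):.3f}")
```

For without-replacement subsampling, which the paper also considers, scikit-learn's built-in oob_score is not available when bootstrap=False, so the out-of-bag bookkeeping would have to be done manually (each classifier votes only on the training points left out of its subsample).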

MSC:

68T10 Pattern recognition, speech recognition

Software:

UCI-ml

References:

[1] Breiman, L., Bagging predictors, Machine Learning, 24, 2, 123-140 (1996) · Zbl 0858.68080
[2] J.R. Quinlan, Bagging, boosting, and C4.5, in: Proceedings of the 13th National Conference on Artificial Intelligence, Cambridge, MA, 1996, pp. 725-730.
[3] Opitz, D.; Maclin, R., Popular ensemble methods: an empirical study, Journal of Artificial Intelligence Research, 11, 169-198 (1999) · Zbl 0924.68159
[4] Bauer, E.; Kohavi, R., An empirical comparison of voting classification algorithms: bagging, boosting, and variants, Machine Learning, 36, 1-2, 105-139 (1999)
[5] Dietterich, T. G., An experimental comparison of three methods for constructing ensembles of decision trees: bagging, boosting, and randomization, Machine Learning, 40, 2, 139-157 (2000)
[6] Webb, G. I., Multiboosting: a technique for combining boosting and wagging, Machine Learning, 40, 2, 159-196 (2000)
[7] R. Caruana, A. Niculescu-Mizil, An empirical comparison of supervised learning algorithms, in: ICML ’06: Proceedings of the 23rd International Conference on Machine Learning, ACM Press, New York, NY, USA, 2006, pp. 161-168. doi: http://doi.acm.org/10.1145/1143844.1143865
[8] Efron, B.; Tibshirani, R. J., An Introduction to the Bootstrap (1994), Chapman & Hall, CRC: Chapman & Hall, CRC New York, Boca Raton, FL
[9] Bühlmann, P.; Yu, B., Analyzing bagging, Annals of Statistics, 30, 927-961 (2002) · Zbl 1029.62037
[10] Buja, A.; Stuetzle, W., Observations on bagging, Statistica Sinica, 16, 323-351 (2006) · Zbl 1096.62034
[11] Friedman, J. H.; Hall, P., On bagging and nonlinear estimation, Journal of Statistical Planning and Inference, 137, 3, 669-683 (2007) · Zbl 1104.62047
[12] Hartigan, J., Using subsample values as typical values, Journal of the American Statistical Association, 64, 1303-1317 (1969)
[13] L. Breiman, Out-of-bag estimation, Technical Report, Statistics Department, University of California, 1996.
[14] Swanepoel, J. W.H., A note on proving that the (modified) bootstrap works, Communications in Statistics—Theory and Methods, 15, 3193-3203 (1986) · Zbl 0623.62041
[15] Bickel, P. J.; Götze, F.; van Zwet, W. R., Resampling fewer than \(n\) observations: gains, losses, and remedies for losses, Statistica Sinica, 7, 1-31 (1997) · Zbl 0927.62043
[16] Chung, K.-H.; Lee, S. M.S., Optimal bootstrap sample size in construction of percentile confidence bounds, Scandinavian Journal of Statistics, 28, 225-239 (2001) · Zbl 0965.62026
[17] Politis, D.; Romano, J. P.; Wolf, M., Subsampling, Springer Series in Statistics (1999), Springer: Springer Berlin
[18] Davison, A. C.; Hinkley, D. V.; Young, G. A., Recent developments in bootstrap methodology, Statistical Science, 18, 141-157 (2003) · Zbl 1331.62179
[19] Breiman, L., Pasting small votes for classification in large databases and on-line, Machine Learning, 36, 1-2, 85-103 (1999)
[20] Bühlmann, P., Bagging, subagging and bragging for improving some prediction algorithms, (Akritas, M. G.; Politis, D. N., Recent Advances and Trends in Nonparametric Statistics (2003), Elsevier: Elsevier New York), 19-34
[21] M. Terabe, T. Washio, H. Motoda, The effect of subsampling rate on subagging performance, in: Proceedings of ECML2001/PKDD2001 Workshop on Active Learning, Database Sampling, and Experimental Design: Views on Instance Selection, 2001, pp. 48-55. · Zbl 1029.68907
[22] Hall, P.; Samworth, R. J., Properties of bagged nearest neighbour classifiers, Journal of the Royal Statistical Society Series B, 67, 3, 363-379 (2005) · Zbl 1069.62051
[23] P.J. McCarthy, Replication: an approach to the analysis of data from complex surveys, Vital Health Statistics, Public Health Service Publication 14 (1979).
[24] B. Efron, The jackknife, the bootstrap, and other resampling plans, Society for Industrial and Applied Mathematics, CBMS-NSF Monographs 38 (1982). · Zbl 0496.62036
[25] A. Asuncion, D. Newman, UCI machine learning repository, 2007. URL: ⟨http://www.ics.uci.edu/~mlearn/MLRepository.html⟩
[26] Breiman, L.; Friedman, J. H.; Olshen, R. A.; Stone, C. J., Classification and Regression Trees (1984), Chapman & Hall: Chapman & Hall New York · Zbl 0541.62042
[27] Breiman, L., Arcing classifiers, Annals of Statistics, 26, 3, 801-849 (1998) · Zbl 0934.62064
[28] Demšar, J., Statistical comparisons of classifiers over multiple data sets, Journal of Machine Learning Research, 7, 1-30 (2006) · Zbl 1222.68184
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. In some cases these data have been complemented or enhanced with data from zbMATH Open. The list attempts to reflect the references in the original paper as accurately as possible, without claiming completeness or a perfect match.