
Block coordinate descent algorithms for large-scale sparse multiclass classification. (English) Zbl 1293.68216

Summary: Over the past decade, \(\ell_1\) regularization has emerged as a powerful way to learn classifiers with implicit feature selection. More recently, mixed-norm (e.g., \(\ell_1/\ell_2\)) regularization has been used to select entire groups of features. In this paper, we propose a novel direct multiclass formulation specifically designed for large-scale and high-dimensional problems such as document classification. Based on a multiclass extension of the squared hinge loss, our formulation employs \(\ell_1/\ell_2\) regularization to force the weights corresponding to the same feature to be simultaneously zero across all classes, resulting in compact and fast-to-evaluate multiclass models. For optimization, we employ two globally convergent variants of block coordinate descent, one with line search [P. Tseng and S. Yun, Math. Program. 117, No. 1–2 (B), 387–423 (2009; Zbl 1166.90016)] and the other without line search [P. Richtárik and M. Takáč, Math. Program. 144, No. 1–2 (A), 1–38 (2014; Zbl 1301.65051); “Parallel coordinate descent methods for big data optimization”, Preprint, arXiv:1212.0873]. We present the two variants in a unified manner and develop the core components needed to solve our formulation efficiently. The result is a pair of block coordinate descent algorithms specifically tailored to our multiclass formulation. Experimentally, we show that block coordinate descent compares favorably with other solvers such as FOBOS, FISTA and SpaRSA. Furthermore, we show that our formulation yields very compact multiclass models and outperforms \(\ell_1/\ell_2\)-regularized multiclass logistic regression in terms of training speed, while achieving comparable test accuracy.
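
To make the ingredients of this formulation concrete, the following is a minimal NumPy sketch of the two building blocks the summary describes: the group soft-thresholding proximal operator induced by the \(\ell_1/\ell_2\) penalty on each feature's row of weights, and a fixed-step proximal block coordinate descent sweep over feature rows. The particular multiclass squared hinge loss, the constant step size, and names such as `bcd_epoch` are illustrative assumptions, not the paper's exact algorithm, which uses per-block step-size constants or a line search and efficient partial-gradient updates.

```python
import numpy as np

def group_soft_threshold(v, tau):
    """Proximal operator of tau * ||v||_2: shrink the whole block toward zero,
    or set it exactly to zero when its norm is below the threshold."""
    norm = np.linalg.norm(v)
    return np.zeros_like(v) if norm <= tau else (1.0 - tau / norm) * v

def multiclass_squared_hinge_grad(W, X, y):
    """Gradient of one common multiclass squared hinge loss,
    (1/n) sum_i sum_{r != y_i} max(0, 1 - (s_{i,y_i} - s_{i,r}))^2,
    with respect to W (n_features x n_classes). The paper's exact extension
    may differ in details such as scaling."""
    n = X.shape[0]
    S = X @ W                                    # scores, shape (n, n_classes)
    M = np.maximum(0.0, 1.0 - (S[np.arange(n), y][:, None] - S))
    M[np.arange(n), y] = 0.0                     # no margin against the true class
    G = M.copy()
    G[np.arange(n), y] = -M.sum(axis=1)          # true-class column collects -sum of slacks
    return (2.0 / n) * (X.T @ G)

def bcd_epoch(W, X, y, lam, step, rng):
    """One sweep of proximal block coordinate descent. Each block is one row of
    W (one feature's weights across all classes): take a gradient step on the
    smooth loss, then apply group soft-thresholding. A fixed step size stands in
    for the per-block constants / line search used in the paper."""
    for j in rng.permutation(W.shape[0]):
        # Recomputing the full gradient per block is wasteful but keeps the
        # sketch short; an efficient solver maintains residuals incrementally.
        grad_j = multiclass_squared_hinge_grad(W, X, y)[j]
        W[j] = group_soft_threshold(W[j] - step * grad_j, step * lam)
    return W

# Tiny synthetic usage example.
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 30))
y = rng.integers(0, 4, size=100)
W = np.zeros((30, 4))
for _ in range(50):
    W = bcd_epoch(W, X, y, lam=0.05, step=0.1, rng=rng)
print("selected feature rows:", int(np.count_nonzero(np.linalg.norm(W, axis=1))))
```

Zeroing an entire row of W at once is what removes a feature from the model for every class simultaneously, which is where the compact, fast-to-evaluate models reported in the experiments come from.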

MSC:

68T05 Learning and adaptive systems in artificial intelligence
62H30 Classification and discrimination; cluster analysis (statistical aspects)

References:

[1] Bach, F. R., Jenatton, R., Mairal, J., & Obozinski, G. (2012). Optimization with sparsity-inducing penalties. Foundations and Trends in Machine Learning, 4(1), 1-106. · Zbl 06064248 · doi:10.1561/2200000015
[2] Bakin, S. (1999). Adaptive regression and model selection in data mining problems. Ph.D. thesis, Australian National University.
[3] Beck, A., & Teboulle, M. (2009). A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2, 183-202. · Zbl 1175.94009 · doi:10.1137/080716542
[4] Bertsekas, D. P. (1999). Nonlinear programming. Belmont: Athena Scientific. · Zbl 1015.90077
[5] Boser, B. E.; Guyon, I. M.; Vapnik, V. N., A training algorithm for optimal margin classifiers, 144-152 (1992)
[6] Chang, K. W., Hsieh, C. J., & Lin, C. J. (2008). Coordinate descent method for large-scale l2-loss linear support vector machines. Journal of Machine Learning Research, 9, 1369-1398. · Zbl 1225.68157
[7] Combettes, P., & Wajs, V. (2005). Signal recovery by proximal forward-backward splitting. Multiscale Modeling & Simulation, 4, 1168-1200. · Zbl 1179.94031 · doi:10.1137/050626090
[8] Crammer, K., & Singer, Y. (2002). On the algorithmic implementation of multiclass kernel-based vector machines. Journal of Machine Learning Research, 2, 265-292. · Zbl 1037.68110
[9] Dredze, M.; Crammer, K.; Pereira, F., Confidence-weighted linear classification, 264-271 (2008) · doi:10.1145/1390156.1390190
[10] Duchi, J.; Singer, Y., Boosting with structural sparsity, 297-304 (2009)
[11] Duchi, J., & Singer, Y. (2009b). Efficient online and batch learning using forward backward splitting. Journal of Machine Learning Research, 10, 2899-2934. · Zbl 1235.62151
[12] Elisseeff, A.; Weston, J., A kernel method for multi-labelled classification, 681-687 (2001)
[13] Fan, R. E., & Lin, C. J. (2007). A study on threshold selection for multi-label classification. Tech. rep., National Taiwan University.
[14] Friedman, J., Hastie, T., Höfling, H., & Tibshirani, R. (2007). Pathwise coordinate optimization. The Annals of Applied Statistics, 1, 302-332. · Zbl 1378.90064 · doi:10.1214/07-AOAS131
[15] Friedman, J., Hastie, T., & Tibshirani, R. (2010a). A note on the group lasso and a sparse group lasso. Tech. Rep. arXiv:1001.0736.
[16] Friedman, J. H., Hastie, T., & Tibshirani, R. (2010b). Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33, 1-22.
[17] Fu, W. J. (1998). Penalized regressions: the bridge versus the lasso. Journal of Computational and Graphical Statistics, 7, 397-416.
[18] Lee, Y., Lin, Y., & Wahba, G. (2004). Multicategory support vector machines: theory and application to the classification of microarray data and satellite radiance data. Journal of the American Statistical Association, 99, 67-81. · Zbl 1089.62511 · doi:10.1198/016214504000000098
[19] Mangasarian, O. (2002). A finite Newton method for classification. Optimization Methods and Software, 17, 913-929. · Zbl 1065.90078 · doi:10.1080/1055678021000028375
[20] Meier, L., Van de Geer, S., & Bühlmann, P. (2008). The group lasso for logistic regression. Journal of the Royal Statistical Society. Series B. Statistical Methodology, 70(1), 53-71. · Zbl 1400.62276 · doi:10.1111/j.1467-9868.2007.00627.x
[21] Obozinski, G., Taskar, B., & Jordan, M. I. (2010). Joint covariate selection and joint subspace selection for multiple classification problems. Statistics and Computing, 20(2), 231-252. · doi:10.1007/s11222-008-9111-x
[22] Qin, Z., Scheinberg, K., & Goldfarb, D. (2010). Efficient block-coordinate descent algorithms for the group lasso. Tech. rep., Columbia University. · Zbl 1275.90059
[23] Richtárik, P., & Takáč, M. (2012a). Iteration complexity of randomized block-coordinate descent methods for minimizing a composite function. Mathematical Programming, 1-38. · Zbl 1301.65051
[24] Richtárik, P., & Takáč, M. (2012b). Parallel coordinate descent methods for big data optimization. Tech. Rep. arXiv:1212.0873. · Zbl 1342.90102
[25] Rifkin, R., & Klautau, A. (2004). In defense of one-vs-all classification. Journal of Machine Learning Research, 5, 101-141. · Zbl 1222.68287
[26] Shalev-Shwartz, S., Singer, Y., Srebro, N., & Cotter, A. (2010). Pegasos: primal estimated sub-gradient solver for SVM. Mathematical Programming, 1-28. · Zbl 1211.90239
[27] Shevade, S. K., & Keerthi, S. S. (2003). A simple and efficient algorithm for gene selection using sparse logistic regression. Bioinformatics, 19(17), 2246-2253. · doi:10.1093/bioinformatics/btg308
[28] Tseng, P., & Yun, S. (2009). A coordinate gradient descent method for nonsmooth separable minimization. Mathematical Programming, 117, 387-423. · Zbl 1166.90016 · doi:10.1007/s10107-007-0170-0
[29] Weinberger, K.; Dasgupta, A.; Langford, J.; Smola, A.; Attenberg, J., Feature hashing for large scale multitask learning, 1113-1120 (2009)
[30] Weston, J.; Watkins, C., Support vector machines for multi-class pattern recognition, 219-224 (1999)
[31] Wright, S. J. (2012). Accelerated block-coordinate relaxation for regularized optimization. SIAM Journal on Optimization, 22, 159-186. · Zbl 1357.49105 · doi:10.1137/100808563
[32] Wright, S. J., Nowak, R. D., & Figueiredo, M. A. T. (2009). Sparse reconstruction by separable approximation. IEEE Transactions on Signal Processing, 57(7), 2479-2493. · Zbl 1391.94442 · doi:10.1109/TSP.2009.2016892
[33] Yuan, M., & Lin, Y. (2006). Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society, Series B, 68, 49-67. · Zbl 1141.62030 · doi:10.1111/j.1467-9868.2005.00532.x
[34] Yuan, G. X., Chang, K. W., Hsieh, C. J., & Lin, C. J. (2010). A comparison of optimization methods and software for large-scale l1-regularized linear classification. Journal of Machine Learning Research, 11, 3183-3234. · Zbl 1242.62065
[35] Yuan, G. X.; Ho, C. H.; Lin, C. J., An improved glmnet for l1-regularized logistic regression, 33-41 (2011)
[36] Zhang, H. H., Liu, Y., Wu, Y., & Zhu, J. (2006). Variable selection for multicategory svm via sup-norm regularization. Electronic Journal of Statistics, 2, 149-167. · Zbl 1135.62056 · doi:10.1214/08-EJS122
[37] Zhao, P., & Yu, B. (2006). On model selection consistency of lasso. Journal of Machine Learning Research, 7, 2541-2563. · Zbl 1222.62008
[38] Zou, H., & Hastie, T. (2005). Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society, Series B, 67, 301-320. · Zbl 1069.62054 · doi:10.1111/j.1467-9868.2005.00503.x