
Context-adaptive pre-processing scheme for robust speech recognition in fast-varying noise environment. (English) Zbl 1217.68196

Summary: Based on the observation that dissimilar speech enhancement algorithms perform differently for different types of interference and noise conditions, we propose a context-adaptive speech pre-processing scheme, which adaptively selects the most advantageous speech enhancement algorithm for each condition. The selection process is based on an unsupervised clustering of the acoustic feature space and a subsequent mapping function that identifies the most appropriate speech enhancement channel for each audio input corresponding to unknown environmental conditions. Experiments performed on the MoveOn motorcycle speech and noise database validate the practical value of the proposed scheme for speech enhancement and demonstrate a significant improvement in speech recognition accuracy compared to that of the best-performing individual speech enhancement algorithm, expressed as an accuracy gain of 3.3% in terms of word recognition rate. The advance offered in the present work reaches beyond the specifics of the present application and can benefit spoken interfaces operating in fast-varying noise environments.
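
The scheme the summary describes (offline clustering of the acoustic feature space, a mapping from clusters to the best-performing enhancement algorithm, and online routing of each input to its mapped channel) can be illustrated with a short sketch. The following Python fragment is a minimal illustration under assumed names and placeholder data; extract_features, ENHANCERS and dev_accuracy are hypothetical stand-ins, not the authors' implementation.

```python
# Minimal illustrative sketch (not the authors' implementation) of the
# context-adaptive pre-processing idea: cluster the acoustic feature space
# offline, learn which enhancement algorithm works best per cluster on
# development data, then route each incoming segment to that channel.
# All names and the toy feature definition below are hypothetical placeholders.

import numpy as np
from sklearn.cluster import KMeans

# Placeholder enhancement channels standing in for the individual
# speech enhancement algorithms compared in the paper.
ENHANCERS = {
    "spectral_subtraction": lambda audio: audio,
    "mmse_log_spectral":    lambda audio: audio,
    "signal_subspace":      lambda audio: audio,
}

def extract_features(segment: np.ndarray) -> np.ndarray:
    """Toy acoustic features for one audio segment (stand-in for the real
    front-end): log-energy and amplitude spread."""
    return np.array([np.log(np.mean(segment ** 2) + 1e-12), np.std(segment)])

# ---- Offline: unsupervised clustering of the acoustic feature space ----
def fit_clusters(train_features: np.ndarray, n_clusters: int = 8) -> KMeans:
    return KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(train_features)

def learn_mapping(kmeans: KMeans, dev_segments, dev_accuracy) -> dict:
    """Map each cluster to the enhancer giving the highest recognition
    accuracy on development data; dev_accuracy[name] is an array with the
    ASR accuracy of enhancer `name` for every development segment."""
    feats = np.vstack([extract_features(s) for s in dev_segments])
    labels = kmeans.predict(feats)
    mapping = {}
    for c in range(kmeans.n_clusters):
        idx = np.where(labels == c)[0]
        if idx.size == 0:                      # empty cluster: fall back
            mapping[c] = next(iter(ENHANCERS))
            continue
        mapping[c] = max(ENHANCERS, key=lambda n: float(np.mean(dev_accuracy[n][idx])))
    return mapping

# ---- Online: route each input of unknown condition to its channel ----
def enhance(segment: np.ndarray, kmeans: KMeans, mapping: dict) -> np.ndarray:
    cluster = int(kmeans.predict(extract_features(segment).reshape(1, -1))[0])
    return ENHANCERS[mapping[cluster]](segment)
```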

MSC:

68T10 Pattern recognition, speech recognition

Software:

C4.5

References:

[1] S. Bohm, J. Koolwaaij, M. Luther, B. Souville, M. Wagner, M. Wibbels, Introducing IYOUIT, in: Proceedings of the International Semantic Web Conference (ISWC’08), vol. 5318 of LNCS, Springer-Verlag, 2008, pp. 804–817.
[2] U. Gartner, W. Konig, T. Wittig, Evaluation of manual vs. speech input when using a driver information system in real traffic, in: Proceedings of Driving Assessment 2001: The First International Driving Symposium on Human Factors in Driver Assessment, Training and Vehicle Design, CO, USA, 2001, pp. 7–13.
[3] Berton, A.; Buhler, D.; Minker, W.: SmartKom-Mobile Car: user interaction with mobile services in a car environment, SmartKom: foundations of multimodal dialogue systems, 523-537 (2006)
[4] Cohen, I.; Berdugo, B.: Speech enhancement for non-stationary noise environments, Signal processing 81, No. 11, 2403-2418 (2001) · Zbl 0985.94009 · doi:10.1016/S0165-1684(01)00128-1
[5] A. Moreno, B. Lindberg, C. Draxler, G. Richard, K. Choukri, S. Euler, J. Allen, SPEECHDAT-CAR: a large speech database for automotive environments, in: Proceedings of the LREC 2000, Athens, Greece, 2000.
[6] J.H.L. Hansen, X. Zhang, M. Akbacak, U. Yapanel, B. Pellom, W. Ward, CU-Move: advances in in-vehicle speech systems for route navigation, in: Proceedings of the IEEE Workshop on DSP in Mobile and Vehicular Systems, Nagoya, Japan, 2003, pp. 19–45.
[7] B. Lee, M. Hasegawa-Johnson, C. Goudeseune, AVICAR: audio–visual speech corpus in a car environment, in: Proceedings of the ICSLP 2004, Jeju Island, Korea, 2004, pp. 2489–2492.
[8] M. Kaiser, H. Mögele, F. Schiel, Bikers accessing the web: the SmartWeb motorbike corpus, in: Proceedings of the LREC 2006, Genoa, Italy, 2006.
[9] T. Winkler, T. Kostoulas, R. Adderley, C. Bonkowski, T. Ganchev, J. Kohler, N. Fakotakis, The MoveOn motorcycle speech corpus, in: Proceedings of the LREC 2008. Marrakech, Morocco, 2008.
[10] Hansen, J. H. L.; Clements, M. A.: Constrained iterative speech enhancement with application to speech recognition, IEEE transactions on signal processing 39, No. 4, 795-805 (1991)
[11] Lockwood, P.; Boudy, J.: Experiments with a nonlinear spectral subtractor (NSS), HMMs and the projection, for robust speech recognition in cars, Speech communication 11, No. 2–3, 215-228 (1992)
[12] Visser, E.; Otsuka, M.; Lee, T. W.: A spatio-temporal speech enhancement scheme for robust speech recognition in noisy environments, Speech communication 41, No. 2–3, 393-407 (2003)
[13] Dal Degan, N.; Prati, C.: Acoustic noise analysis and speech enhancement techniques for mobile radio applications, Signal processing 15, No. 1, 43-56 (1988)
[14] Li, J.; Sakamoto, S.; Hongo, S.; Akagi, M.; Suzuki, Y.: Adaptive \({\beta}\)-order generalized spectral subtraction for speech enhancement, Signal processing 88, No. 11, 2764-2776 (2008) · Zbl 1151.94383 · doi:10.1016/j.sigpro.2008.06.005
[15] M. Berouti, R. Schwartz, J. Makhoul, Enhancement of speech corrupted by acoustic noise, in: Proceedings of the IEEE ICASSP’79, Washington, DC, USA, 1979, pp. 208–211.
[16] Martin, R.: Noise power spectral density estimation based on optimal smoothing and minimum statistics, IEEE transactions on speech and audio processing 9, No. 5, 504-512 (2001)
[17] S. Kamath, P. Loizou, A multi-band spectral subtraction method for enhancing speech corrupted by colored noise, in: Proceedings of the ICASSP-2002, Orlando, USA, vol. 4, 2002, pp. 4164–4167.
[18] Ephraim, Y.; Malah, D.: Speech enhancement using a minimum mean square error log-spectral amplitude estimator, IEEE transactions on acoustics, speech, signal processing 33, 443-445 (1985)
[19] Loizou, P.: Speech enhancement based on perceptually motivated Bayesian estimators of the speech magnitude spectrum, IEEE transactions on speech and audio processing 13, No. 5, 857-869 (2005)
[20] Hu, Y.; Loizou, P.: Speech enhancement by wavelet thresholding the multitaper spectrum, IEEE transactions on speech and audio processing 12, No. 1, 59-67 (2004)
[21] Hu, Y.; Loizou, P.: A generalized subspace approach for enhancing speech corrupted by coloured noise, IEEE transactions on speech and audio processing 11, 334-341 (2003)
[22] Jabloun, F.; Champagne, B.: Incorporating the human hearing properties in the signal subspace approach for speech enhancement, IEEE transactions on speech and audio processing 11, No. 6, 700-708 (2003)
[23] Gannot, S.; Burshtein, D.; Weinstein, E.: Iterative and sequential Kalman filter-based speech enhancement algorithms, IEEE transactions on speech and audio processing 6, No. 4, 373-385 (1998)
[24] Mporas, I.; Kocsis, O.; Ganchev, T.; Fakotakis, N.: Robust speech interaction in motorcycle environment, Expert systems with applications 37, No. 3, 1827-1835 (2010)
[25] B. Sarama, A. Khan, Refined Detailed Technical Specification of MoveOn System (D14), url: <http://www.m0ve0n.net>, 2008.
[26] T. Winkler, T. Ganchev, T. Kostoulas, I. Mporas, A. Lazaridis, S. Ntalampiras, A. Badii, R. Adderley, C. Bonkowski, MoveOn Deliverable D.5: Report on Audio Databases, Noise Processing Environment, ASR and TTS Modules, 2007.
[27] V. Krishnan, P.S. Whitehead, D.V. Anderson, M.A. Clements, A framework for estimation of clean speech by fusion of outputs from multiple speech enhancement systems, in: Proceedings of the Interspeech-2005, 2005, pp. 2317–2320.
[28] Hu, Y.; Loizou, P.: Subjective evaluation and comparison of speech enhancement algorithms, Speech communication 49, 588-601 (2007)
[29] Baum, L. E.; Petrie, T.; Soules, G.; Weiss, N.: A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains, Annals of mathematical statistics 41, No. 1, 164-171 (1970) · Zbl 0188.49603 · doi:10.1214/aoms/1177697196
[30] Ntalampiras, S.; Ganchev, T.; Potamitis, I.; Fakotakis, N.: Objective comparison of speech enhancement algorithms under real world conditions, Proceedings of the first international conference on pervasive technologies related to assistive environments (PETRA-2008) 282, 1-5 (2008)
[31] Loizou, P.: Speech enhancement: theory and practice, (2007)
[32] Young, S.; Evermann, G.; Gales, M.; Hain, T.; Kershaw, D.; Moore, G.; Odell, J.; Ollason, D.; Povey, D.; Valtchev, V.; Woodland, P.: The HTK book (for HTK version 3.3), (2005)
[33] Keerthi, S. S.; Shevade, S. K.; Bhattacharyya, C.; Murthy, K. R. K.: Improvements to Platt’s SMO algorithm for SVM classifier design, Neural computation 13, No. 3, 637-649 (2001) · Zbl 1085.68629 · doi:10.1162/089976601300014493
[34] Witten, I. H.; Frank, E.: Data mining: practical machine learning tools and techniques (2nd ed., Morgan Kaufmann series in data management systems), (2005) · Zbl 1076.68555
[35] Frank, E.; Wang, Y.; Inglis, S.; Holmes, G.; Witten, I. H.: Using model trees for classification, Machine learning 32, No. 1, 63-76 (1998) · Zbl 0901.68167 · doi:10.1023/A:1007421302149
[36] Mitchell, T. M.: Machine learning, (1997) · Zbl 0913.68167
[37] Breiman, L.: Bagging predictors, Machine learning 24, No. 2, 123-140 (1996) · Zbl 0858.68080
[38] Quinlan, R.: C4.5: programs for machine learning, (1993)
[39] Y. Freund, R.E. Schapire, Experiments with a new boosting algorithm, in: Proceedings of the 13th International Conference on Machine Learning, San Francisco, USA, 1996, pp. 148–156.
[40] Aha, D.; Kibler, D.: Instance-based learning algorithms, Machine learning 6, 37-66 (1991) · Zbl 0709.68044
[41] Gauvain, J. L.; Lee, C.: Maximum a posteriori estimation for multivariate Gaussian mixture observations of Markov chains, IEEE transactions on speech and audio processing 2, 291-299 (1994)
[42] H. Hoge, C. Draxler, H. Van den Heuvel, F.T. Johansen, E. Sanders, H.S. Tropf, SpeechDat multilingual speech databases for teleservices: across the finish line, in: Proceedings of the Eurospeech 1999, Budapest, Hungary, 1999, pp. 2699–2702.
[43] Wells, J. C.: SAMPA computer readable phonetic alphabet, Handbook of standards and resources for spoken language systems (Part IV, section B) (1997)
[44] Wilcoxon, F.: Individual comparisons by ranking methods, Biometrics bulletin 1, 80-83 (1945)
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. In some cases that data have been complemented/enhanced by data from zbMATH Open. This attempts to reflect the references listed in the original paper as accurately as possible without claiming completeness or a perfect matching.