ADAPTIVE NOISE CANCELLATION FOR ROBUST SPEECH RECOGNITION IN NOISY ENVIRONMENTS

Davit S. Karamyan

doi:10.46991/PYSU:A.2024.58.1.022

Authors

Davit S. Karamyan, Russian-Armenian University (RAU), Armenia

Keywords:

automatic speech recognition, noise cancellation, noise robustness, domain adaptation

Abstract

In this paper, we address the challenges of combining noise cancellation and automatic speech recognition (ASR) models. When these models are combined directly, word recognition performance often suffers because noise cancellation changes the distribution of the input data that the ASR model receives. To overcome this limitation, we propose a novel method for combining these models that improves the ability of the ASR model to perform well in noisy environments. The key feature of the proposed method is a mechanism to control the aggressiveness of noise reduction. This mechanism enables us to customize the noise reduction process according to the specific requirements of the ASR model, without any retraining. As a result, our method is applicable to any ASR model, which facilitates its use in practical scenarios.
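The abstract does not specify how the aggressiveness control is implemented. As an illustration only, the sketch below shows one generic way such a control could look: a linear blend between the noisy input and the output of a noise cancellation model, governed by a single parameter. The names `blend`, `pick_alpha`, `alpha`, and `error_fn` are hypothetical and are not taken from the paper; `error_fn` stands in for any score computed on a frozen ASR model, such as word error rate on a development set.

```python
from typing import Callable, List, Sequence


def blend(noisy: Sequence[float], denoised: Sequence[float], alpha: float) -> List[float]:
    """Mix two equal-length signals sample by sample.

    alpha = 0.0 returns the noisy input unchanged (no noise reduction);
    alpha = 1.0 returns the fully denoised signal.
    This is an assumed, generic control, not the paper's mechanism.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if len(noisy) != len(denoised):
        raise ValueError("signals must have equal length")
    return [alpha * d + (1.0 - alpha) * n for n, d in zip(noisy, denoised)]


def pick_alpha(
    noisy: Sequence[float],
    denoised: Sequence[float],
    error_fn: Callable[[List[float]], float],
    candidates: Sequence[float] = (0.0, 0.25, 0.5, 0.75, 1.0),
) -> float:
    """Return the candidate alpha whose blended signal gives the lowest error.

    error_fn is a hypothetical callback; in practice it might run a frozen
    ASR model on the blended audio and return its error rate.
    """
    return min(candidates, key=lambda a: error_fn(blend(noisy, denoised, a)))
```

Because only `alpha` is tuned, the ASR model itself stays untouched, which matches the abstract's point that no retraining is required.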
