Convolutional Recurrent Neural Network With Attention Gates For Real-time Single-channel Speech Enhancement

Abstract

 

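Noise is present everywhere in today's indoor and outdoor environments, and it degrades both speech quality and automatic speech recognition. Product development, for example of smart speakers, therefore has to take real-time speech enhancement performance into account. Traditional speech enhancement algorithms reduce stationary noise, such as air-conditioner noise, well, but their effect on non-stationary noise, such as wind noise, is limited. With deep learning now widely adopted, speech enhancement benefits from it and can handle non-stationary noise effectively.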

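This paper proposes a Convolutional Recurrent Neural Network (CRNN) model with Attention Gates (AG) for speech enhancement. The model combines the strengths of the Convolutional Neural Network (CNN), such as powerful feature extraction, with attention gates that enhance important features and suppress irrelevant parts, and with the strengths of the Long Short-Term Memory (LSTM) network, such as dynamic modeling of time series. The model can therefore effectively estimate the Complex Ratio Mask (CRM) and so obtain better speech quality. Because the proposed model has only 2.3M parameters and low computational complexity, it can achieve real-time speech enhancement.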

Keywords: Deep Learning, Real-time Speech Enhancement, Convolutional Recurrent Neural Network

 

Convolutional Recurrent Neural Network With Attention Gates For Real-time Single-channel Speech Enhancement

Abstract

 

Noise is pervasive in today's indoor and outdoor environments, and it degrades both speech quality and automatic speech recognition. Products such as smart speakers therefore need to deliver good real-time speech enhancement. Traditional speech enhancement algorithms suppress stationary noise, such as air-conditioner noise, well, but they are of limited use against non-stationary noise, such as wind noise. With deep learning now widely adopted, speech enhancement benefits from it and can deal with non-stationary noise effectively.

This paper proposes a convolutional recurrent neural network (CRNN) model with attention gates (AG) for speech enhancement. The model combines the advantages of the convolutional neural network (CNN), such as powerful feature extraction, with attention gates that enhance important features and suppress irrelevant parts, and with the advantages of the long short-term memory (LSTM) network, such as dynamic modeling of time series. As a result, the model can effectively estimate the complex ratio mask (CRM) and thereby obtain better speech quality. Because the proposed model has only 2.3M parameters and low computational complexity, it can achieve real-time speech enhancement.
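To make the role of the complex ratio mask concrete, the following minimal sketch shows how such a mask is applied to a noisy spectrogram. It is an illustration under stated assumptions, not the model described here: the spectrograms are plain nested lists of complex STFT bins (frames by frequency bins), the function names `apply_complex_ratio_mask` and `ideal_complex_ratio_mask` are hypothetical, and the ideal mask computed from the known clean signal stands in for the mask that a trained network would estimate.

```python
def ideal_complex_ratio_mask(clean_spec, noisy_spec, eps=1e-8):
    """Ideal mask M = S / Y per bin, computed as S * conj(Y) / (|Y|^2 + eps).

    Stand-in (assumption) for the mask a trained network would estimate;
    eps only guards against division by zero in silent bins.
    """
    return [
        [s * y.conjugate() / (abs(y) ** 2 + eps) for s, y in zip(s_frame, y_frame)]
        for s_frame, y_frame in zip(clean_spec, noisy_spec)
    ]


def apply_complex_ratio_mask(noisy_spec, mask):
    """Enhanced bin = mask * noisy bin (complex multiplication).

    Because the product is complex, the mask changes both the magnitude
    and the phase of every time-frequency bin.
    """
    if len(noisy_spec) != len(mask):
        raise ValueError("mask and spectrogram must have the same number of frames")
    enhanced = []
    for y_frame, m_frame in zip(noisy_spec, mask):
        if len(y_frame) != len(m_frame):
            raise ValueError("mask and spectrogram must have the same number of bins")
        enhanced.append([m * y for m, y in zip(m_frame, y_frame)])
    return enhanced


# Toy data (made up for illustration): one frame with two frequency bins.
clean = [[1 + 2j, 0.5 - 1j]]
noise = [[0.3 - 0.1j, -0.2 + 0.4j]]
noisy = [[s + n for s, n in zip(sf, nf)] for sf, nf in zip(clean, noise)]

mask = ideal_complex_ratio_mask(clean, noisy)
enhanced = apply_complex_ratio_mask(noisy, mask)
```

With the ideal mask, `enhanced` matches `clean` up to the small error introduced by `eps`. In practice the mask comes from the network, and the enhanced spectrogram is converted back to a waveform with an inverse STFT.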

Keywords: Deep Learning, Real-time Speech Enhancement, Convolutional Recurrent Neural Network