Voice Conversion Based on Generative Attentional Networks with a Non-parallel Corpus

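Abstract

Voice conversion (VC) is a relatively complex technique whose goal is to convert the timbre and pitch of the source speaker while preserving the speech content, so that the output sounds as if it were spoken by the target speaker.

This paper uses a non-parallel corpus as training data and proposes a cycle-consistent generative adversarial network (Cycle-GAN) with an attention mechanism for voice conversion. During conversion, the model assigns more weight to the differences between the speakers' features, so that conversion focuses on where the speakers differ and the more similar segments are preserved. We add attention modules to the architecture and introduce new loss functions to update the network. Because GAN training suffers from instability, we modify the discriminator loss so that real samples and generated samples receive different weights during discrimination.

We apply the above method to the spectral envelope (timbre), and we also try using a GAN to convert the fundamental frequency (pitch), analyzing and comparing it with the original conversion method. Finally, the experimental results show that the proposed voice conversion architecture outperforms the baseline system in Mel-cepstral distortion (MCD) and mean opinion score (MOS).

Keywords: voice conversion, generative adversarial network, attention mechanism, non-parallel corpus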

Spectrum and Prosody Transformation for Non-parallel Voice Conversion with Generative Attentional Networks

Abstract

Voice conversion (VC) is a complex technique that converts the timbre and pitch of a source speaker while preserving the speech content, so that the output sounds as if it were spoken by the target speaker.

This paper uses a non-parallel corpus as training data and proposes a cycle-consistent generative adversarial network (Cycle-GAN) with an attention mechanism for voice conversion. During conversion, the model gives more weight to the features in which the speakers differ, so that the conversion concentrates on those differences and the more similar segments are retained. We add attention modules to the architecture, together with new loss functions used to update the network. Because GAN training is often unstable, we modify the discriminator loss so that real samples and generated samples are given different weights.
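
As an illustration of giving real and generated samples different weights in the discriminator loss, the following is a minimal sketch and not the implementation used in this work. It assumes a least-squares GAN objective, which the abstract does not specify; the function name `weighted_discriminator_loss` and the weights `w_real` and `w_fake`, including their default values, are hypothetical.

```python
def weighted_discriminator_loss(real_scores, fake_scores, w_real=1.0, w_fake=0.5):
    """Least-squares discriminator loss with separate weights per term.

    real_scores: discriminator outputs on real target-speaker features
    fake_scores: discriminator outputs on generator outputs
    w_real, w_fake: hypothetical weights (assumed values, not from the paper)
    """
    if not real_scores or not fake_scores:
        raise ValueError("both score lists must be non-empty")
    # Real samples should be scored close to 1.
    real_term = sum((s - 1.0) ** 2 for s in real_scores) / len(real_scores)
    # Generated samples should be scored close to 0.
    fake_term = sum(s ** 2 for s in fake_scores) / len(fake_scores)
    return w_real * real_term + w_fake * fake_term
```

Setting `w_real` and `w_fake` to the same value recovers the usual unweighted least-squares discriminator loss, so the two weights are the only change this sketch introduces.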

The methods above are applied to the spectral envelope. We also try converting the fundamental frequency with a GAN and compare the result with the original conversion method. Experimental results show that the proposed voice conversion architecture outperforms the baseline system in both Mel-cepstral distortion (MCD) and mean opinion score (MOS).
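
For reference, MCD between reference and converted mel-cepstral frames is commonly computed as sketched below. The sketch rests on assumptions not stated in the abstract: the two sequences are already time-aligned, the 0th (energy) coefficient is excluded, and the function name `mel_cepstral_distortion` is illustrative.

```python
import math


def mel_cepstral_distortion(ref_frames, conv_frames):
    """Average MCD in dB over time-aligned frames.

    Each frame is a sequence of mel-cepstral coefficients. Index 0 (energy)
    is skipped, a common convention that is assumed here.
    """
    if len(ref_frames) != len(conv_frames) or not ref_frames:
        raise ValueError("need the same non-zero number of frames")
    scale = 10.0 / math.log(10.0)
    total = 0.0
    for ref, conv in zip(ref_frames, conv_frames):
        # Squared Euclidean distance over coefficients 1..D.
        sq_dist = sum((r - c) ** 2 for r, c in zip(ref[1:], conv[1:]))
        total += scale * math.sqrt(2.0 * sq_dist)
    return total / len(ref_frames)
```

A lower value means the converted frames lie closer to the reference frames in the mel-cepstral domain; identical sequences give 0 dB.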

Keywords: Voice conversion, Generative Adversarial Networks, Attention, Non-parallel data