基於時頻分離卷積壓縮網路之聲音事件定位與偵測

摘要

隨著人工智慧的技術發展日益盛行，更多的領域紛紛朝以機器取代或輔助人力的方向進行研究。在音訊領域中，聲音事件定位與偵測便是其中之一。近期的研究主要透過深度學習的方法，使機器具有人耳聽覺的能力，以辨別環境中各種突發之聲音事件及其所處之位置與移動軌跡。

本論文提出時頻分離卷積壓縮網路(Time-Frequency Separable Convolutional Compression Network, TFSCCN)作為聲音事件定位與偵測的系統架構，透過不同維度大小的1-D卷積核，分別對時間與頻率成分進行特徵提取，用於捕捉同一時間下不同聲音事件的頻率分布，或者在連續時間中，聲音事件的持續時間以及相位或延遲的變化。同時，透過控制通道數降維與升維的時機點，大幅降低模型的參數量。另外，模型結合多頭自注意力機制(Multi-head self-attention)來獲取時間序列特徵中的全局與局部資訊，以及透過雙分支追蹤技術來對相同或相異的重疊聲音事件進行有效的定位與偵測。實驗結果表明，在DCASE 2020 Task 3的評估機制中與Baseline相比，偵測的錯誤率下降了37%，而角度定位誤差則降低了14°。另外，與其他以降低參數量方法為目的所建構的網路模型相比，TFSCCN不僅具有最少的參數量，同時也具有最佳的聲音事件定位與偵測的表現。

關鍵字 : 聲音事件定位與偵測、時頻分離卷積壓縮網路、多頭自注意力機制、雙分支追蹤

Sound Event Localization and Detection Based on Time-Frequency Separable Convolutional Compression Network

Abstract

As artificial intelligence becomes more prevalent, more fields are studying how machines can replace or assist human labor. In the audio field, sound event localization and detection is one such fast-growing research topic. Recent work mainly uses deep learning to give machines a capability comparable to human hearing, so that they can identify the sound events that occur in an environment and estimate their spatial locations and movement trajectories.

In this work, we propose a Time-Frequency Separable Convolutional Compression Network (TFSCCN) as the system architecture for sound event localization and detection. It uses 1-D convolution kernels of different sizes to extract features along the time and frequency axes separately. This captures the frequency distribution of different sound events at a given time, as well as each event's duration and its phase or delay changes over continuous time. These cues support both distinguishing event classes and tracking spatial location and movement trajectory. We also greatly reduce the number of model parameters by controlling where the number of channels is reduced and expanded. In addition, the model uses multi-head self-attention to obtain global and local information from the time-series features, and a dual-branch tracking technique to localize and detect overlapping sound events of the same or different classes.
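To give a rough sense of why separating the time and frequency kernels and compressing the channels can shrink a model, the sketch below counts learnable parameters for two hypothetical convolution blocks. The block layout (a 1x1 channel reduction, a 1-D time kernel, a 1-D frequency kernel, then a 1x1 channel expansion), the channel counts, and the kernel size are all illustrative assumptions and are not taken from the TFSCCN architecture itself.

```python
def conv2d_params(c_in, c_out, k_t, k_f, bias=True):
    """Learnable parameters of a 2-D convolution with a k_t x k_f kernel."""
    return c_in * c_out * k_t * k_f + (c_out if bias else 0)


def full_block_params(c_in, c_out, k):
    """A single ordinary k x k convolution over the time-frequency plane."""
    return conv2d_params(c_in, c_out, k, k)


def separable_bottleneck_params(c_in, c_out, k, c_mid):
    """Hypothetical block: reduce channels, filter time and frequency
    separately with 1-D kernels, then expand channels again."""
    reduce = conv2d_params(c_in, c_mid, 1, 1)    # 1x1 conv: shrink channels
    time = conv2d_params(c_mid, c_mid, k, 1)     # k x 1 kernel along time
    freq = conv2d_params(c_mid, c_mid, 1, k)     # 1 x k kernel along frequency
    expand = conv2d_params(c_mid, c_out, 1, 1)   # 1x1 conv: restore channels
    return reduce + time + freq + expand


# Assumed sizes: 64 channels in and out, kernel size 3, 16 bottleneck channels.
full = full_block_params(64, 64, 3)
sep = separable_bottleneck_params(64, 64, 3, 16)
print(f"full 3x3 block:        {full} parameters")
print(f"separable, compressed: {sep} parameters")
print(f"reduction factor:      {full / sep:.1f}x")
```

With these assumed sizes the separable, compressed block needs roughly a tenth of the parameters of the ordinary 3x3 block. The actual savings in any real network depend on where the channel reduction and expansion are placed.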

Experimental results show that, under the DCASE 2020 Task 3 evaluation metrics, TFSCCN reduces the detection error rate by 37% and the localization error by 14° relative to the baseline. Moreover, compared with other networks designed to reduce the number of parameters, TFSCCN has both the fewest parameters and the best sound event localization and detection performance.

Keywords: sound event localization and detection, time-frequency separable convolutional compression network, multi-head self-attention, dual-branch tracking