DeepVQE: Real Time Deep Voice Quality Enhancement for Joint Acoustic Echo Cancellation, Noise Suppression and Dereverberation
Ando Saabas
Evgenii Indenbom
Ristea Nicolae Catalin
Tanel Parnamaa
Jegor Guzvin
Ross Cutler
Microsoft Corporation
Paper link




Abstract

Acoustic echo cancellation (AEC), noise suppression (NS) and dereverberation (DR) are an integral part of modern full-duplex communication systems. As the demand for teleconferencing systems increases, addressing these tasks is required for an effective and efficient online meeting experience. Most prior research proposes solutions for these tasks separately, combining them with digital signal processing (DSP) based components, resulting in complex pipelines that are often impractical to deploy in real-world applications. This paper proposes a real-time cross-attention deep model, named DeepVQE, based on residual convolutional neural networks (CNNs) and recurrent neural networks (RNNs) to simultaneously address AEC, NS, and DR. We conduct several ablation studies to analyze the contributions of different components of our model to the overall performance. DeepVQE achieves state-of-the-art performance on non-personalized tracks from the ICASSP 2023 Acoustic Echo Cancellation Challenge and ICASSP 2023 Deep Noise Suppression Challenge test sets, showing that a single model can handle both tasks with excellent performance. Moreover, the model runs in real-time and has been successfully deployed in production for one of the major communication platforms.


Official DEMO




2021 AEC Challenge test set (FEST-GEN and FEST-HD)


Microphone:
Far end:
Align-CRUSE:

AECMOS: 4.18, ERLE: 43.97

Align feature map

DeepVQE:

AECMOS: 4.69, ERLE: 55.59

Align feature map


Microphone:
Far end:
Align-CRUSE:

AECMOS: 4.73, ERLE: 74.74

Align feature map

DeepVQE:

AECMOS: 4.65, ERLE: 79.92

Align feature map


Microphone:
Far end:
Align-CRUSE:

AECMOS: 4.47, ERLE: 17.36

Align feature map

DeepVQE:

AECMOS: 4.80, ERLE: 12.97

Align feature map


Microphone:
Far end:
Align-CRUSE:

AECMOS: 4.48, ERLE: 46.57

Align feature map

DeepVQE:

AECMOS: 4.45, ERLE: 54.89

Align feature map
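For readers unfamiliar with the ERLE figures quoted with each sample, the following is a minimal sketch of the usual definition: the ratio of microphone power to residual power after enhancement, expressed in dB. The function name `erle_db` and the `eps` guard are illustrative assumptions. This is not the evaluation script behind the numbers on this page, and ERLE is normally computed only on segments where the far end alone is active.

```python
import math


def erle_db(mic, enhanced, eps=1e-12):
    """Echo return loss enhancement in dB (illustrative, hypothetical helper).

    mic      -- microphone samples (echo present)
    enhanced -- model output samples for the same segment
    eps      -- small constant (an assumption) to avoid division by zero
    """
    if not mic or not enhanced:
        raise ValueError("both signals must be non-empty")
    # Mean power of each signal over the segment.
    mic_power = sum(x * x for x in mic) / len(mic)
    residual_power = sum(x * x for x in enhanced) / len(enhanced)
    # Higher values mean more echo energy was removed.
    return 10.0 * math.log10((mic_power + eps) / (residual_power + eps))


# Usage: a residual with 1/10 of the amplitude gives roughly 20 dB.
mic = [1.0, -1.0] * 4
enhanced = [0.1, -0.1] * 4
print(round(erle_db(mic, enhanced), 2))
```

Under this definition, a larger ERLE means more echo energy was removed, while AECMOS reflects estimated perceptual quality, so the two metrics need not move together across samples.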


Synthetic test set with 0.3-0.5 s delays (LD-M)


Because this test set is synthetically created, the true (label) delay is known. We report it alongside the delay estimated from the align feature map.
Microphone:
Far end:
Align-CRUSE:

AECMOS: 4.70, ERLE: 48.35

Align feature map

True delay: 305ms, Estimated average delay: 315ms

DeepVQE:

AECMOS: 4.72, ERLE: 54.10

Align feature map

True delay: 305ms, Estimated average delay: 302ms
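As a rough illustration of how an "estimated average delay" could be read off an align feature map, the sketch below takes the strongest delay bin in each frame and averages the result. The map layout (one row per frame, one column per candidate delay in frames), the 10 ms hop size, and the name `estimate_average_delay_ms` are all assumptions made for this example. The exact procedure used for the numbers above is not stated on this page.

```python
def estimate_average_delay_ms(align_map, hop_ms=10.0):
    """Average per-frame delay estimate in milliseconds (hypothetical helper).

    align_map -- list of frames; each frame is a list of attention weights,
                 where index d is assumed to mean a delay of d frames
    hop_ms    -- assumed frame hop in milliseconds
    """
    if not align_map:
        raise ValueError("align_map must contain at least one frame")
    # For each frame, pick the delay bin with the largest weight (argmax).
    per_frame_delay = [
        max(range(len(frame)), key=frame.__getitem__) for frame in align_map
    ]
    # Convert the mean delay from frames to milliseconds.
    return hop_ms * sum(per_frame_delay) / len(per_frame_delay)


# Usage: three frames whose peaks sit at delay bins 2, 3 and 4.
toy_map = [
    [0.0, 0.1, 0.8, 0.1, 0.0],
    [0.0, 0.0, 0.2, 0.7, 0.1],
    [0.0, 0.0, 0.1, 0.2, 0.7],
]
print(estimate_average_delay_ms(toy_map))
```

A weighted average over delay bins would be a softer alternative to the argmax used here; which of the two better matches the reported estimates is not specified.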


Synthetic test set with 0.5-1.0 s delays (LD-H)


Because this test set is synthetically created, the true (label) delay is known. We report it alongside the delay estimated from the align feature map.
Microphone:
Far end:
Align-CRUSE:

AECMOS: 4.63, ERLE: 28.02

Align feature map

True delay: 763ms, Estimated average delay: 739ms

DeepVQE:

AECMOS: 4.79, ERLE: 64.65

Align feature map

True delay: 763ms, Estimated average delay: 752ms


Additional information