DeepVQE: Real Time Deep Voice Quality Enhancement for Joint Acoustic Echo Cancellation, Noise Suppression and Dereverberation

Ando Saabas

Evgenii Indenbom

Ristea Nicolae Catalin

Tanel Parnamaa

Jegor Guzvin

Ross Cutler

Microsoft Corporation

Paper link

Abstract

Acoustic echo cancellation (AEC), noise suppression (NS) and dereverberation (DR) are an integral part of modern full-duplex communication systems. As the demand for teleconferencing systems increases, addressing these tasks is required for an effective and efficient online meeting experience. Most prior research proposes solutions for these tasks separately, combining them with digital signal processing (DSP) based components, resulting in complex pipelines that are often impractical to deploy in real-world applications. This paper proposes a real-time cross-attention deep model, named DeepVQE, based on residual convolutional neural networks (CNNs) and recurrent neural networks (RNNs) to simultaneously address AEC, NS, and DR. We conduct several ablation studies to analyze the contributions of different components of our model to the overall performance. DeepVQE achieves state-of-the-art performance on non-personalized tracks from the ICASSP 2023 Acoustic Echo Cancellation Challenge and ICASSP 2023 Deep Noise Suppression Challenge test sets, showing that a single model can handle both tasks with excellent performance. Moreover, the model runs in real-time and has been successfully deployed in production for one of the major communication platforms.

Official DEMO

2021 AEC Challenge test set (FEST-GEN and FEST-HD)

Microphone:
Far end:
Align-CRUSE:	AECMOS: 4.18, ERLE: 43.97
		Align feature map
DeepVQE:	AECMOS: 4.69, ERLE: 55.59
		Align feature map

Microphone:
Far end:
Align-CRUSE:	AECMOS: 4.73, ERLE: 74.74
		Align feature map
DeepVQE:	AECMOS: 4.65, ERLE: 79.92
		Align feature map

Microphone:
Far end:
Align-CRUSE:	AECMOS: 4.47, ERLE: 17.36
		Align feature map
DeepVQE:	AECMOS: 4.80, ERLE: 12.97
		Align feature map

Microphone:
Far end:
Align-CRUSE:	AECMOS: 4.48, ERLE: 46.57
		Align feature map
DeepVQE:	AECMOS: 4.45, ERLE: 54.89
		Align feature map

Synthetic (0.3-0.5)s delays test set (LD-M)

This test set is synthetically created, so for each sample we include the true (label) delay and the delay estimated from the align feature map.

Microphone:
Far end:
Align-CRUSE:	AECMOS: 4.70, ERLE: 48.35
		Align feature map (true delay: 305 ms, estimated average delay: 315 ms)
DeepVQE:	AECMOS: 4.72, ERLE: 54.10
		Align feature map (true delay: 305 ms, estimated average delay: 302 ms)

Synthetic (0.5-1.0)s delays test set (LD-H)

This test set is synthetically created, so for each sample we include the true (label) delay and the delay estimated from the align feature map.

Microphone:
Far end:
Align-CRUSE:	AECMOS: 4.63, ERLE: 28.02
		Align feature map (true delay: 763 ms, estimated average delay: 739 ms)
DeepVQE:	AECMOS: 4.79, ERLE: 64.65
		Align feature map (true delay: 763 ms, estimated average delay: 752 ms)

Additional information

We included samples only for DeepVQE and Align-CRUSE.
The align feature map represents the delay distributions (between 0 and 1 second) for 20 ms frames. More precisely, for a given frame index, the vertical axis shows the probability distribution over delays.
To make the align feature map easier to read, we show only the middle 500 frames of each sample.
The estimated average delay is computed by taking the highest-probability delay for each frame in the sample and averaging these values. It is reported only for the synthetic data sets, since those are the samples for which we have the ground truth.
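
The notes above can be illustrated with a minimal Python sketch. It is not the authors' code. It assumes the align feature map is available as a list of frames, each frame being a list of probabilities over delay bins, and that the bins uniformly cover 0 to 1 second. The function names, the `max_delay_ms` parameter and the bin layout are hypothetical and chosen only for illustration.

```python
def estimated_average_delay_ms(align_map, max_delay_ms=1000.0):
    """Average, over frames, of the highest-probability delay.

    align_map: list of frames; each frame is a list of probabilities
               over delay bins (assumed layout, not from the paper).
    max_delay_ms: delay covered by the bins (assumed uniform spacing).
    """
    if not align_map:
        raise ValueError("align_map must contain at least one frame")
    num_bins = len(align_map[0])
    bin_width_ms = max_delay_ms / num_bins
    per_frame_delays = []
    for frame in align_map:
        if len(frame) != num_bins:
            raise ValueError("all frames must have the same number of bins")
        # Index of the most probable delay bin for this frame.
        best_bin = max(range(num_bins), key=frame.__getitem__)
        per_frame_delays.append(best_bin * bin_width_ms)
    return sum(per_frame_delays) / len(per_frame_delays)


def middle_frames(align_map, count=500):
    """Return the middle `count` frames, as used for visualization."""
    start = max(0, (len(align_map) - count) // 2)
    return align_map[start:start + count]


# Toy example: 2 frames, 4 delay bins of 250 ms each.
toy_map = [
    [0.1, 0.7, 0.1, 0.1],  # most probable delay: bin 1 (250 ms)
    [0.1, 0.2, 0.6, 0.1],  # most probable delay: bin 2 (500 ms)
]
print(estimated_average_delay_ms(toy_map))  # 375.0
```

In this sketch the average is taken over all frames in the sample, as described above, while `middle_frames` only selects the frames shown in the plot.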