![]() |
With recent research advances, deep learning models have become an attractive choice for acoustic echo cancellation (AEC) in real-time teleconferencing applications. Since acoustic echo is one of the major sources of poor audio quality, a wide variety of deep models have been proposed. However, an important but often omitted requirement for good echo cancellation quality is the synchronization of the microphone and far-end signals. The alignment module is typically implemented with classical algorithms based on cross-correlation, as a separate functional block with known design limitations. In our work, we propose a deep learning architecture with built-in cross-attention-based alignment, which is able to handle unaligned inputs, improving echo cancellation performance while simplifying the communication pipeline. Moreover, we show that our approach achieves significant improvements for difficult delay estimation cases on real recordings from the AEC Challenge data set.
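
To make the idea of attention-based alignment concrete, the following is a minimal sketch and not the Align-CRUSE architecture itself. It assumes that each microphone frame acts as a query over a causal window of past far-end frames, that similarity is a scaled dot product on raw feature vectors (a real model would use learned projections), and that the softmax weights over candidate delays form a soft alignment map. The names `soft_align` and `max_delay` are hypothetical.

```python
import math


def soft_align(mic, far, max_delay):
    """Softly align far-end frames to microphone frames (illustrative only).

    mic, far: lists of equal-length feature vectors, one per time frame.
    max_delay: largest candidate delay, in frames.
    Returns (aligned, weights): aligned[t] is the attention-weighted far-end
    frame for time t, and weights[t][d] is the weight given to delay d.
    """
    dim = len(mic[0])
    scale = 1.0 / math.sqrt(dim)  # standard scaled dot-product attention
    aligned, weights = [], []
    for t, query in enumerate(mic):
        # Score every candidate delay d against the far-end frame at t - d.
        scores = []
        for d in range(max_delay + 1):
            if t - d >= 0:
                key = far[t - d]
                scores.append(scale * sum(q * k for q, k in zip(query, key)))
            else:
                scores.append(float("-inf"))  # frame not available yet
        # Numerically stable softmax over the delay axis.
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        w = [e / total for e in exps]
        # Weighted sum of the delayed far-end frames.
        frame = [0.0] * dim
        for d, wd in enumerate(w):
            if wd > 0.0:
                for i, value in enumerate(far[t - d]):
                    frame[i] += wd * value
        aligned.append(frame)
        weights.append(w)
    return aligned, weights
```

In this sketch the per-frame argmax (or expectation) of `weights[t]` plays the role of a delay estimate, while `aligned` is what a downstream echo canceller would consume in place of the unaligned far-end signal.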
Method | FEST MOS | DT Echo MOS | DT Other MOS | Inference time per frame (ms) | Measurement device |
---|---|---|---|---|---|
#1 Place AEC Challenge | 4.34 | 4.36 | 4.23 | 2.62 | i5-4300 CPU @1.9GHz |
#2 Place AEC Challenge | 4.44 | 4.44 | 3.90 | 4.385 | Xeon(R) CPU E5-2640@2.50GHz |
#1 Rank AEC Challenge | 4.59 | 4.69 | 4.18 | N/A | N/A |
CRUSE (online) | 4.55 | 4.42 | 4.07 | 0.216 | i7 11370H@3.3 GHz |
Align-CRUSE | 4.67 | 4.45 | 4.07 | 0.218 | i7 11370H@3.3 GHz |

Method | SRR (dB) | SIG | BAK | OVRL | Inference time per frame (ms) | Measurement device |
---|---|---|---|---|---|---|
Noisy | 15.24 | 3.13 | 3.26 | 2.67 | - | - |
#1 Place AEC Challenge | 35.87 | 4.39 | 4.59 | 4.11 | 2.62 | i5-4300 CPU @1.9GHz |
#2 Place AEC Challenge | 34.79 | 3.78 | 4.21 | 3.52 | 4.385 | Xeon(R) CPU E5-2640@2.50GHz |
#1 Rank AEC Challenge | 34.37 | 4.37 | 4.58 | 4.07 | N/A | N/A |
CRUSE (online) | 36.54 | 4.41 | 4.54 | 4.11 | 0.216 | i7 11370H@3.3 GHz |
Align-CRUSE | 37.47 | 4.39 | 4.55 | 4.10 | 0.218 | i7 11370H@3.3 GHz |

For a more comprehensive view of our model, we tested its reverb and noise suppression capability on the blind near-end test set from the AEC Challenge. We additionally report the speech-to-reverberation ratio (SRR) and the DNSMOS P.835 signal (SIG), background (BAK), and overall (OVRL) scores. The SRR is estimated in decibels with an internal deep model. The table above shows that Align-CRUSE is competitive with state-of-the-art methods for both noise suppression and AEC. For reverb removal, our model obtains the best result, with the highest SRR (37.47 dB). This indicates that our model can jointly enhance audio quality and cancel echoes.

Considering both inference time and quality, we conclude that Align-CRUSE is a state-of-the-art method for low-complexity AEC that also attains very good results on the reverb and noise reduction tasks.
Microphone: |
![]() |
|
Far end: |
![]() |
|
CRUSE (online): |
AECMOS: 1.35, ERLE: 4.92 |
![]() |
Align-CRUSE: |
AECMOS: 4.83, ERLE: 70.55 |
![]() |
Align feature map ![]() |
Microphone: |
![]() |
|
Far end: |
![]() |
|
CRUSE (online): |
AECMOS: 3.07, ERLE: 31.48 |
![]() |
Align-CRUSE: |
AECMOS: 4.18, ERLE: 43.97 |
![]() |
Align feature map ![]() |
Microphone: |
![]() |
|
Far end: |
![]() |
|
CRUSE (online): |
AECMOS: 4.63, ERLE: 55.76 |
![]() |
Align-CRUSE: |
AECMOS: 4.73, ERLE: 74.74 |
![]() |
Align feature map ![]() |
Microphone: |
![]() |
|
Far end: |
![]() |
|
CRUSE (online): |
AECMOS: 4.44, ERLE: 15.91 |
![]() |
Align-CRUSE: |
AECMOS: 4.47, ERLE: 17.36 |
![]() |
Align feature map ![]() |
Microphone: |
![]() |
|
Far end: |
![]() |
|
CRUSE (online): |
AECMOS: 4.40, ERLE: 44.03 |
![]() |
Align-CRUSE: |
AECMOS: 4.48, ERLE: 46.57 |
![]() |
Align feature map ![]() |
Microphone: |
![]() |
|
Far end: |
![]() |
|
CRUSE (online): |
AECMOS: 2.68, ERLE: 12.99 |
![]() |
Align-CRUSE: |
AECMOS: 4.70, ERLE: 48.35 |
![]() |
Align feature map ![]() True delay: 305ms, Estimated average delay: 315ms |
Microphone: |
![]() |
|
Far end: |
![]() |
|
CRUSE (online): |
AECMOS: 3.45, ERLE: 23.75 |
![]() |
Align-CRUSE: |
AECMOS: 4.72, ERLE: 56.39 |
![]() |
Align feature map ![]() True delay: 355ms, Estimated average delay: 361ms |
Microphone: |
![]() |
|
Far end: |
![]() |
|
CRUSE (online): |
AECMOS: 1.97, ERLE: 1.00 |
![]() |
Align-CRUSE: |
AECMOS: 4.80, ERLE: 62.66 |
![]() |
Align feature map ![]() True delay: 916ms, Estimated average delay: 905ms |
Microphone: |
![]() |
|
Far end: |
![]() |
|
CRUSE (online): |
AECMOS: 2.80, ERLE: 0.88 |
![]() |
Align-CRUSE: |
AECMOS: 4.63, ERLE: 28.02 |
![]() |
Align feature map ![]() True delay: 763ms, Estimated average delay: 739ms |
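
The samples above report ERLE (echo return loss enhancement) next to AECMOS. The page does not state how ERLE was computed, so the sketch below only shows the commonly used definition: the ratio, in decibels, of microphone signal power to processed output power, which is meaningful on segments where only far-end echo is present. The function name `erle_db` and the `eps` guard are hypothetical.

```python
import math


def erle_db(mic, out, eps=1e-12):
    """Echo return loss enhancement in dB (standard definition).

    mic: microphone samples containing echo (far-end single talk assumed).
    out: the same segment after echo cancellation.
    eps: small constant that avoids division by zero for silent output.
    """
    power_mic = sum(x * x for x in mic) / len(mic)
    power_out = sum(x * x for x in out) / len(out)
    return 10.0 * math.log10((power_mic + eps) / (power_out + eps))
```

Under this definition, attenuating the echo amplitude by a factor of 10 corresponds to an ERLE of about 20 dB, so higher values mean stronger echo suppression.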
Microphone: |
![]() |
|
Far end: |
![]() |
|
Align-CRUSE: |
![]() |
|
Align feature map ![]() |
Microphone: |
![]() |
|
Far end: |
![]() |
|
Align-CRUSE: |
![]() |
|
Align feature map ![]() |