End-to-End Simultaneous Dysarthric Speech Reconstruction with Frame-Level Adaptor and Multiple Wait-k Knowledge Distillation

Abstract

Dysarthric speech reconstruction (DSR) typically employs a cascaded system that combines automatic speech recognition (ASR) and sentence-level text-to-speech (TTS) to convert dysarthric speech into speech with normal prosody. Because dysarthric patients speak more slowly, such systems have an excessively long response time, which makes them unacceptable in long-speech scenarios. Cascaded DSR systems based on streaming ASR and incremental TTS can reduce these delays. However, patients with varying degrees of dysarthria pronounce the same text very differently, which makes ASR less robust and limits the intelligibility of the reconstructed speech. In addition, incremental TTS predicts prosodic features poorly because of its limited receptive field. In this study, we propose an end-to-end simultaneous DSR system with two key innovations: 1) A frame-level adaptor module is introduced to bridge ASR and TTS. By employing explicit-implicit semantic information fusion and joint module training, it makes TTS more tolerant of errors in the ASR outputs. 2) A multiple wait-k autoregressive TTS module is designed to mitigate prosodic degradation via multi-view knowledge distillation. Our system has an average response time of 1.03 seconds on a Tesla A100 GPU, with an average real-time factor (RTF) of 0.71. On UASpeech, it achieves a mean opinion score (MOS) of 4.67 and a 23.51% relative improvement in word error rate (WER) compared with the state of the art (SOTA).
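
The wait-k idea named in the abstract can be illustrated with a minimal sketch. It assumes the standard wait-k schedule from simultaneous translation, in which the t-th output unit is generated after min(k + t - 1, N) of the N input units have arrived; this page does not state how the proposed TTS module defines its units or applies k, so the schedule may differ in the actual system. The function names `wait_k_visible` and `wait_k_schedule` and the lengths 12 and 6 are hypothetical and chosen only for illustration.

```python
from typing import List


def wait_k_visible(step: int, k: int, source_len: int) -> int:
    """Number of input units visible when emitting output unit `step` (1-based).

    Assumption: the standard wait-k schedule, min(k + step - 1, source_len).
    """
    if step < 1 or k < 1:
        raise ValueError("step and k must both be >= 1")
    return min(k + step - 1, source_len)


def wait_k_schedule(k: int, source_len: int, target_len: int) -> List[int]:
    """Visible-input counts for every output step of one utterance."""
    return [wait_k_visible(t, k, source_len) for t in range(1, target_len + 1)]


# Illustrative lengths only: 12 input units, 6 output units.
# The k values mirror the wait-1, wait-10, and wait-20 systems in the sample tables.
schedules = {k: wait_k_schedule(k, source_len=12, target_len=6) for k in (1, 10, 20)}
for k, schedule in schedules.items():
    print(f"wait-{k}: {schedule}")
```

A small k starts output almost immediately but sees little context at each step, whereas a large k approaches full-sentence synthesis at the cost of latency. This is the trade-off that the wait-1, wait-10, and wait-20 samples below let a listener compare.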

Diagram of Model

Fig: End-to-end simultaneous DSR system

Chinese Commercial Test Set

intelligibility-low

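Sample | Original | ASR-TTS(E2) | Conformer RNNT(pinyin) | E2E-SDSR wait-1(E7) | E2E-SDSR wait-10(E10) | E2E-SDSR wait-20(E8) | Label
178_M_1 | [audio] | [audio] | n i d e q ue x ian sh i sh en m e | [audio] | [audio] | [audio] | 你的缺点是什么 (What is your shortcoming?)
178_M_2 | [audio] | [audio] | w o x ian ch i f an l e | [audio] | [audio] | [audio] | 我先吃饭了 (I'll go and eat first)
106_W_1 | [audio] | [audio] | t a h ua b an j iu x iang w o m en d e sh ou | [audio] | [audio] | [audio] | 它花瓣就像我们的手 (Its petals are just like our hands)
106_W_2 | [audio] | [audio] | j iu d iao zh e x ie b u y in r en zh u m u d e j iao g uan | [audio] | [audio] | [audio] | 揪掉这些不引人注目的小花 (Pluck off these inconspicuous little flowers)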

intelligibility-middle

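Sample | Original | ASR-TTS(E2) | Conformer RNNT(pinyin) | E2E-SDSR wait-1(E7) | E2E-SDSR wait-10(E10) | E2E-SDSR wait-20(E8) | Label
20_M_1 | [audio] | [audio] | sh eng h uo h en p in d an y ou h en n an | [audio] | [audio] | [audio] | 生活很平淡又很难 (Life is very plain and also very hard)
20_M_2 | [audio] | [audio] | y i g e r en zh u d uo j iu l e k en d ing h ui l ei l e | [audio] | [audio] | [audio] | 一个人主动久了肯定会累了 (Anyone who keeps taking the initiative for a long time is bound to get tired)
39_W_1 | [audio] | [audio] | x ian z ai b u sh i sh uo ch i y i a | [audio] | [audio] | [audio] | 现在不是双十一啊 (It isn't Double Eleven right now)
39_W_2 | [audio] | [audio] | k an l ai zh en d e sh i sh ou b u d ao j ie j ie d e zh u f u l e | [audio] | [audio] | [audio] | 看来真的是收不到姐姐的祝福了 (It seems I really won't receive my older sister's blessing)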

UASpeech Test Set

intelligibility-middle

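Sample | Original | E2E-DSR(E12) | ASA-DSR(E13) | Unit-DSR(E14) | Conformer RNNT(phone) | E2E-SDSR wait-10(E10) | Label
M05_1 | [audio] | [audio] | [audio] | [audio] | AE2 D V AH0 N T EY1 JH AH0 S | [audio] | advantageous
M05_2 | [audio] | [audio] | [audio] | [audio] | D AW1 N W ER0 D | [audio] | downward
F04_1 | [audio] | [audio] | [audio] | [audio] | K AH0 M AE1 N D | [audio] | command
F04_2 | [audio] | [audio] | [audio] | [audio] | B AE1 K S P EY1 S | [audio] | backspace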

intelligibility-low

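Sample | Original | E2E-DSR(E12) | ASA-DSR(E13) | Unit-DSR(E14) | Conformer RNNT(phone) | E2E-SDSR wait-10(E10) | Label
M07_1 | [audio] | [audio] | [audio] | [audio] | P AE1 R AH0 G R AE2 F | [audio] | paragraph
M07_2 | [audio] | [audio] | [audio] | [audio] | W AA1 CH AH0 Z | [audio] | watches
F02_1 | [audio] | [audio] | [audio] | [audio] | EH0 S K EY1 P | [audio] | escape
F02_2 | [audio] | [audio] | [audio] | [audio] | S EH1 N T ER0 N S | [audio] | sentence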

Note: Red highlighting in the Conformer RNNT column marks incorrectly recognized pinyin or phones.
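
Incorrectly recognized units of the kind mentioned in the note can be located by aligning a recognized sequence against a reference sequence with edit distance. The sketch below is only an illustration, not the evaluation code behind this page. The helper `align_errors` is hypothetical, and the reference pinyin for sample 178_M_1 is an assumption: it is our own transcription of the label 你的缺点是什么, split into initials and finals in the same way as the table.

```python
from typing import List, Tuple

Op = Tuple[str, str, str]  # (operation, reference unit, recognized unit)


def align_errors(ref: List[str], hyp: List[str]) -> Tuple[int, List[Op]]:
    """Levenshtein alignment of two unit sequences.

    Returns the edit distance and the aligned operations:
    "ok", "sub" (substitution), "del" (missing in hyp), "ins" (extra in hyp).
    """
    n, m = len(ref), len(hyp)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = i
    for j in range(m + 1):
        dp[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j - 1] + cost, dp[i - 1][j] + 1, dp[i][j - 1] + 1)

    # Trace back from the bottom-right corner to recover one optimal alignment.
    ops: List[Op] = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            if dp[i][j] == dp[i - 1][j - 1] + cost:
                ops.append(("ok" if cost == 0 else "sub", ref[i - 1], hyp[j - 1]))
                i, j = i - 1, j - 1
                continue
        if i > 0 and dp[i][j] == dp[i - 1][j] + 1:
            ops.append(("del", ref[i - 1], ""))
            i -= 1
        else:
            ops.append(("ins", "", hyp[j - 1]))
            j -= 1
    ops.reverse()
    return dp[n][m], ops


# Recognized pinyin for sample 178_M_1, copied from the table.
hyp = "n i d e q ue x ian sh i sh en m e".split()
# ASSUMPTION: reference pinyin transcribed by us from the label, not taken from the page.
ref = "n i d e q ue d ian sh i sh en m e".split()

distance, ops = align_errors(ref, hyp)
errors = [op for op in ops if op[0] != "ok"]
error_rate = distance / len(ref)
print(errors, f"{error_rate:.3f}")
```

Under that assumed reference, the alignment reports a single substitution (the initial `d` recognized as `x`) out of 14 units.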
