
Speech synthesis for low-resource languages and dialects

13 December @ 9:00 am - 5:00 pm

The Speech synthesis for low-resource languages and dialects workshop day is organized by LIUM, Le Mans Université.
Program
10:00 Introduction
10:15 Kévin Vythelingum (Voxygen) Speech synthesis with a foreign accent from a low-resource speaker using self-supervised model representations
11:15 Emmett Strickland (MoDyCo) Experimental and corpus-based phonetics in Nigerian Pidgin: Challenges and perspectives
12:15 Lunch
13:30 Marc Evrard and Philippe Boula de Mareuil (LISN) Speech synthesis for the Belgian Walloon accent
14:30 Imen Laouirine, Fethi Bougares (Elyadata) Transfer Learning based Tunisian Arabic Text-to-Speech System
15:30 Ana Montalvo (CENATAV) Speech synthesis for the Cuban Spanish accent
16:30 Round table and discussion
17:00 End of day
Venue: Bâtiment IC2 (Institut Claude Chappe), LIUM, Le Mans Université, 72000 Le Mans. Room 210, second floor
Video conference link: GDR TAL // TTS for low resource languages, dialects and accents

Organizer: LIUM (Marie Tahon)

Many different neural architectures for speech synthesis are now available off the shelf. However, choosing the best architecture for a given application is not always easy. In particular, the limitations and drawbacks of pre-trained models are not well defined. This can be crucial for specific applications such as healthcare or human-robot interaction, or for low-resource languages, dialects, and accents. Indeed, a speech synthesis system typically comprises a text-processing module (i.e., phonetization), an encoder that predicts a time-frequency representation, and a vocoder that generates the speech signal itself. Building these modules requires collecting audio data and obtaining its linguistic or phonetic transcription. To do so, NLP tools (such as the phonetizer) must be adapted to the specific languages. Evaluating synthetic speech is the final bottleneck: it is not always easy to find native speakers able to accurately evaluate a synthetic speech signal in their own language, because acculturation to synthetic speech is not uniform across languages.
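The three-module pipeline described above (phonetizer, encoder, vocoder) can be sketched as follows; the lexicon, frame sizes, and module internals are toy placeholders for illustration, not a real synthesis system:

```python
# Toy sketch of the three TTS modules described above: phonetizer,
# acoustic encoder, and vocoder. All data and internals are
# illustrative placeholders, not a real synthesis system.

# 1) Text processing: grapheme-to-phoneme conversion via a toy lexicon
LEXICON = {"bonjour": ["b", "o~", "Z", "u", "R"]}

def phonetize(text):
    phonemes = []
    for word in text.lower().split():
        # fall back to letters for out-of-lexicon words
        phonemes.extend(LEXICON.get(word, list(word)))
    return phonemes

# 2) Encoder: predicts a time-frequency representation
#    (here, one dummy frame of n_bins values per phoneme)
def encode(phonemes, n_bins=4):
    return [[float(i + j) for j in range(n_bins)] for i in range(len(phonemes))]

# 3) Vocoder: generates the speech signal from the frames
#    (here, it simply flattens the frames into a 1-D "signal")
def vocode(frames):
    return [value for frame in frames for value in frame]

def synthesize(text):
    return vocode(encode(phonetize(text)))
```

In a real system, each of these modules is where low-resource languages hit obstacles: the lexicon or phonetizer rules must exist for the language, and the encoder and vocoder must be trained on enough recorded speech.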
The goal of this workshop day is to get an overview of 1) the difficulties in collecting, processing, and managing low-resource speech data, 2) how robust existing architectures are to low-resource languages, and 3) evaluation protocols when native speakers are scarce.

Detailed program:
Kévin Vythelingum (Voxygen) Speech synthesis with a foreign accent from a low-resource speaker using self-supervised model representations
Self-supervised pretrained models such as Wav2Vec [1], HuBERT [2], and WavLM [3] exhibit excellent performance on many speech tasks such as speech enhancement, automatic speech recognition, and speaker diarization. This shows that the representations of these models carry both language and speaker information. In particular, the authors of kNN-VC [4] demonstrate the voice-conversion capabilities of WavLM features. Regarding text-to-speech, it is often difficult to model speakers with underrepresented characteristics, such as a specific accent. To address this problem, we investigate the use of WavLM features to transfer the accent of speakers to a generic text-to-speech model in a low-resource scenario.
[1] Baevski, Alexei, et al. “wav2vec 2.0: A framework for self-supervised learning of speech representations.” Advances in neural information processing systems 33 (2020): 12449-12460.
[2] Hsu, Wei-Ning, et al. “HuBERT: Self-supervised speech representation learning by masked prediction of hidden units.” IEEE/ACM Transactions on Audio, Speech, and Language Processing 29 (2021): 3451-3460.
[3] Chen, Sanyuan, et al. “WavLM: Large-scale self-supervised pre-training for full stack speech processing.” IEEE Journal of Selected Topics in Signal Processing 16.6 (2022): 1505-1518.
[4] Baas, Matthew, Benjamin van Niekerk, and Herman Kamper. “Voice conversion with just nearest neighbors.” arXiv preprint arXiv:2305.18975 (2023).
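As an illustration of the kNN-VC idea cited above [4], the sketch below replaces each source feature frame with the average of its k nearest target-speaker frames. The toy 2-D vectors stand in for real features; in the paper, WavLM representations are matched and a neural vocoder resynthesizes the audio:

```python
# Toy sketch of kNN-VC-style conversion [4]: each source feature frame
# is replaced by the mean of its k nearest frames from the target
# speaker. Real systems match WavLM features and use a neural vocoder;
# here frames are plain lists of floats.

def squared_distance(a, b):
    # squared Euclidean distance between two feature vectors
    return sum((x - y) ** 2 for x, y in zip(a, b))

def knn_convert(source_frames, target_frames, k=2):
    converted = []
    for frame in source_frames:
        # rank target frames by proximity to the current source frame
        nearest = sorted(target_frames, key=lambda t: squared_distance(frame, t))[:k]
        # average the k nearest neighbours, dimension by dimension
        converted.append([sum(dim) / k for dim in zip(*nearest)])
    return converted

# Example: the source frame [0, 0] maps to the mean of the two
# closest target frames, [0, 1] and [1, 0], i.e. [0.5, 0.5].
result = knn_convert([[0.0, 0.0]], [[0.0, 1.0], [1.0, 0.0], [5.0, 5.0]], k=2)
```

Because every converted frame is a combination of genuine target-speaker frames, the output stays in the target speaker's (or accent's) feature space, which is what makes this attractive in low-resource settings.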
Emmett Strickland (MoDyCo) Experimental and corpus-based phonetics in Nigerian Pidgin: Challenges and perspectives
This talk will present ongoing research on the role of pitch and duration in Nigerian Pidgin, a low-resource language of West Africa. The presentation will describe a novel syntactic treebank that combines traditional morphosyntactic annotations with a wide range of phonetic features describing the segmental and suprasegmental properties of each syllable. This treebank is then used to shed light on the prosody of certain syntactic constructions, with a focus on preverbal markers of tense, aspect, and mood (TAM). Finally, the presentation will describe efforts to implement perception experiments to validate the findings from the corpus exploration, carried out using a pitch-controllable text-to-speech system trained on pre-existing field recordings. This portion of the presentation will notably highlight the difficulties of building a task-specific TTS system from a noisy corpus of spontaneous speech that was not recorded with speech synthesis in mind.
Imen Laouirine, Fethi Bougares (Elyadata) Transfer Learning based Tunisian Arabic Text-to-Speech System. 
Being labeled a low-resource language, the Tunisian dialect has no prior TTS research. At Elyadata, we collected a mono-speaker speech corpus of more than 3 hours from a male speaker, sampled at 44.1 kHz, called TunArTTS. This corpus was processed, manually diacritized, and used to initiate the development of end-to-end TTS systems for the Tunisian dialect. Various TTS systems, trained from scratch or via transfer learning, were experimented with and compared. The TunArTTS corpus is publicly available for research purposes, along with a demo of the baseline TTS system.
Ana Montalvo (CENATAV) Speech synthesis for the Cuban Spanish accent
To be defined
Marc Evrard and Philippe Boula de Mareuil (LISN) Speech synthesis for the Belgian Walloon accent
We present a text-to-speech system for Walloon, a minority language spoken in Belgium and France. For this project, we used an audio corpus derived from a translation of Le Petit Prince, recorded by a native speaker. The corpus was segmented into sentences and phonetized by an automatic rule-based system developed in-house specifically for Walloon. The synthesis system is based on the VITS architecture (Variational Inference with adversarial learning for end-to-end Text-to-Speech). Several models were trained under different conditions: single speaker, phonetic or graphemic transcription, and with or without fine-tuning from a model pre-trained on a French-language corpus. An objective evaluation has been carried out, and a perceptual evaluation campaign with native speakers is currently underway. As things stand, the objective evaluation does not reveal a clear trend across the different models. Perceptually, however, fine-tuned models seem to be preferred only when the training condition corresponds to the reduced corpus.

Details

Date:
13 December
Time:
9:00 am - 5:00 pm

Venue

LIUM
Avenue Olivier Messiaen
LE MANS, 72085 France