Posted by Eliya Nachmani, Research Scientist, and Michelle Tadmor Ramanovich, Software Engineer, Google Research
Speech-to-speech translation (S2ST) is a type of machine translation that converts speech in one language directly into speech in another. This technology has the potential to break down language barriers and facilitate communication between people from different cultures and backgrounds. Previously, we introduced Translatotron 1 and Translatotron 2, the first models able to directly translate speech between two languages. However, they were trained in supervised settings with parallel speech data.
The scarcity of parallel speech data is a major challenge in this field, so much so that most public datasets are semi- or fully synthesized from text. This poses an additional challenge in learning to translate and reconstruct speech attributes that are not represented in the text and are thus not reflected in the synthesized training data.
Here we present Translatotron 3, a novel unsupervised speech-to-speech translation architecture. In Translatotron 3, we show that it is possible to learn a speech-to-speech translation task from monolingual data alone. This method opens the door not only to translation between more language pairs but also to the translation of non-textual speech attributes such as pauses, speaking rates, and speaker identity. Because our method does not rely on any direct supervision in the target language, it is well suited to preserving paralinguistic characteristics (e.g., tone, emotion) of the source speech across translation.
To enable speech-to-speech translation without parallel data, we use back-translation, a technique from unsupervised machine translation (UMT) in which synthetic translations of monolingual source data serve as pseudo-parallel training pairs, removing the need for bilingual datasets. Experimental results on speech-to-speech translation tasks between Spanish and English show that Translatotron 3 outperforms a baseline cascade system.
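As a rough illustration of the back-translation idea (not the actual Translatotron 3 implementation), the sketch below uses toy numeric stand-ins for the two translation directions; the function names, shapes, and loss are assumptions made for this example.

```python
import numpy as np

# Toy stand-ins for the two translation directions; in Translatotron 3 these
# would be speech models operating on spectrograms. Everything here is
# illustrative only.
def toy_translate_src_to_tgt(x):
    return x + 1.0  # placeholder "source -> target" mapping

def toy_translate_tgt_to_src(y):
    return y - 1.0  # placeholder "target -> source" mapping

def back_translation_loss(source_batch, s2t, t2s):
    """Round trip: synthesize a target-side translation of monolingual source
    data, then require the reverse model to reconstruct the original source,
    yielding a training signal without any parallel data."""
    synthetic_target = s2t(source_batch)   # pseudo-parallel pair (source, synthetic_target)
    reconstructed = t2s(synthetic_target)
    return float(np.mean((reconstructed - source_batch) ** 2))

source_batch = np.random.randn(4, 80)      # e.g., 4 frames of an 80-bin spectrogram
print(back_translation_loss(source_batch, toy_translate_src_to_tgt, toy_translate_tgt_to_src))
```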
Translatotron 3 addresses the problem of unsupervised S2ST, eliminating the requirement for bilingual speech datasets. Its design incorporates three key aspects: pre-training the entire model as a masked autoencoder with SpecAugment, unsupervised embedding mapping based on multilingual unsupervised embeddings (MUSE), and a reconstruction loss based on back-translation. The model is trained using a combination of unsupervised MUSE embedding loss, reconstruction loss, and S2S back-translation loss.
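A minimal sketch of how the three loss terms named above could be combined is shown below; the individual loss functions and weights are assumptions for illustration, not the published formulation.

```python
import numpy as np

def muse_loss(encoder_out, muse_embeddings):
    """Pull encoder outputs toward pretrained multilingual (MUSE) embeddings
    so that source- and target-language speech share one embedding space."""
    return float(np.mean((encoder_out - muse_embeddings) ** 2))

def reconstruction_loss(decoded_spec, input_spec):
    """Auto-encoding term: each decoder reconstructs its own language's input."""
    return float(np.mean((decoded_spec - input_spec) ** 2))

def total_loss(l_muse, l_recon, l_bt, w_muse=1.0, w_recon=1.0, w_bt=1.0):
    # Weighted sum of the three terms; the weights here are placeholders.
    return w_muse * l_muse + w_recon * l_recon + w_bt * l_bt

# Example with random tensors standing in for real model outputs:
enc, emb = np.random.randn(10, 64), np.random.randn(10, 64)
spec, dec = np.random.randn(10, 80), np.random.randn(10, 80)
print(total_loss(muse_loss(enc, emb), reconstruction_loss(dec, spec), l_bt=0.5))
```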
Translatotron 3 employs a shared encoder to encode both the source and target languages and has separate decoders for each language. The training methodology consists of two parts: auto-encoding with reconstruction and a back-translation term. The network is trained to auto-encode the input to a multilingual embedding space using the MUSE loss and reconstruction loss, and then further trained to translate the input spectrogram using the back-translation loss.
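The structural sketch below, assuming toy linear layers in place of the real spectrogram encoder and decoders, shows the shared-encoder / per-language-decoder layout and the two kinds of training passes; all class and variable names are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

class ToyLinear:
    """Stand-in for a real spectrogram encoder or decoder network."""
    def __init__(self, d_in, d_out):
        self.w = rng.normal(scale=0.1, size=(d_in, d_out))
    def __call__(self, x):
        return x @ self.w

shared_encoder = ToyLinear(80, 64)            # encodes speech from either language
decoders = {"en": ToyLinear(64, 80),          # one decoder per language
            "es": ToyLinear(64, 80)}

def autoencode_loss(spec, lang):
    """Auto-encoding pass: encode into the shared (MUSE-aligned) space and
    reconstruct the input with that language's own decoder."""
    recon = decoders[lang](shared_encoder(spec))
    return float(np.mean((recon - spec) ** 2))

def round_trip_loss(spec, src_lang, tgt_lang):
    """Back-translation pass: decode into the other language, re-encode, and
    decode back to the source language, comparing against the original input."""
    synthetic = decoders[tgt_lang](shared_encoder(spec))
    round_trip = decoders[src_lang](shared_encoder(synthetic))
    return float(np.mean((round_trip - spec) ** 2))

spec_en = rng.normal(size=(10, 80))           # 10 frames of an 80-bin "spectrogram"
print(autoencode_loss(spec_en, "en"), round_trip_loss(spec_en, "en", "es"))
```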
To evaluate the performance of Translatotron 3, we conducted experiments on English and Spanish using various datasets. Translatotron 3 outperformed the baseline cascade system in translation quality, speaker similarity, and speech quality. It achieved speech naturalness similar to that of the ground truth audio samples.
Overall, Translatotron 3 demonstrates the potential of unsupervised speech-to-speech translation and paves the way for more efficient and accurate translation between languages, while preserving important speech attributes.