Posted by Eliya Nachmani, Research Scientist, and Alon Levkovitch, Student Researcher, Google Research
The goal of natural language processing (NLP) is to develop computational models that can understand and generate natural language. By capturing the statistical patterns and structures of text, language models can predict and generate coherent and meaningful sequences of words. Enabled by the highly successful Transformer architecture and trained on large amounts of text (with proportionate compute and model size), large language models (LLMs) have demonstrated remarkable success in NLP tasks. However, modeling spoken human language remains a challenging frontier.
Spoken dialog systems have conventionally been built as a cascade of automatic speech recognition (ASR), natural language understanding (NLU), response generation, and text-to-speech (TTS) systems. To date, however, there have been few capable end-to-end systems for modeling spoken language: i.e., single models that can take speech as input and directly generate its continuation as speech output.
Today we present a new approach for spoken language modeling, called Spectron, published in “Spoken Question Answering and Speech Continuation Using Spectrogram-Powered LLM.” Spectron is the first spoken language model that is trained end-to-end to directly process spectrograms as both input and output, instead of learning discrete speech representations. Using only a pre-trained text language model, it can be fine-tuned to generate high-quality, semantically accurate spoken language. Furthermore, the proposed model improves upon direct initialization in retaining the knowledge of the original LLM, as demonstrated on spoken question answering datasets.
We show that a pre-trained speech encoder and a language model decoder enable end-to-end training and state-of-the-art performance without sacrificing representational fidelity. Key to this is a novel end-to-end training objective that implicitly supervises speech recognition, text continuation, and conditional speech synthesis in a joint manner. A new spectrogram regression loss also supervises the model to match the higher-order derivatives of the spectrogram in the time and frequency domains. Because these derivatives aggregate information from multiple frames at once, they express rich, longer-range information about the shape of the signal.
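As an illustration, the sketch below shows one plausible form of such a derivative-matching loss in PyTorch: it penalizes mismatches in the raw frames as well as in the first- and second-order finite differences along both the time and frequency axes. The choice of an L1 penalty and the equal weighting of terms are assumptions for illustration, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def spectrogram_regression_loss(pred, target):
    """Sketch of a derivative-matching spectrogram loss (norm and
    weighting assumed). pred, target: (batch, time, freq) spectrograms."""
    loss = F.l1_loss(pred, target)  # match the raw frames
    for dim in (1, 2):        # 1 = time axis, 2 = frequency axis
        for n in (1, 2):      # first- and second-order finite differences
            loss = loss + F.l1_loss(
                torch.diff(pred, n=n, dim=dim),
                torch.diff(target, n=n, dim=dim),
            )
    return loss
```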
Our overall scheme is summarized in the following figure:
The Spectron model connects the encoder of a speech recognition model with a pre-trained Transformer-based decoder language model. During training, each speech utterance is split into a prompt and its continuation. The model then reconstructs the full transcript (prompt and continuation) along with the continuation’s speech features. At inference, only a prompt is provided; the model generates the prompt’s transcription, the text continuation, and the speech continuation.
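In code, the training-time split might look like the following sketch (the function and variable names here are hypothetical):

```python
def make_training_example(spectrogram, transcript_tokens, split_frame):
    """Hypothetical sketch of the training-time prompt/continuation split."""
    prompt_spec = spectrogram[:split_frame]        # fed to the speech encoder
    continuation_spec = spectrogram[split_frame:]  # regression target, also fed
                                                   # (teacher-forced) through the pre-net
    # The cross-entropy target is the full transcript: prompt + continuation text.
    return prompt_spec, continuation_spec, transcript_tokens
```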
Speech encoder: The speech encoder is a 600M-parameter conformer encoder pre-trained on large-scale data (12M hours). It takes the spectrogram of the source speech as input, generating a hidden representation that incorporates both linguistic and acoustic information.
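For a concrete sense of the interface, the snippet below instantiates a small Conformer from torchaudio as a stand-in: it consumes a batch of spectrogram frames and produces hidden states of the same time resolution. The actual Spectron encoder is a 600M-parameter model pre-trained on 12M hours of speech; every hyperparameter below is a placeholder.

```python
import torch
import torchaudio

# Placeholder-sized Conformer; all sizes are assumptions for illustration.
encoder = torchaudio.models.Conformer(
    input_dim=128,                  # mel bins per spectrogram frame
    num_heads=4,
    ffn_dim=512,
    num_layers=4,
    depthwise_conv_kernel_size=31,
)
frames = torch.randn(2, 400, 128)   # (batch, time, mel bins)
lengths = torch.tensor([400, 320])  # valid frames per utterance
hidden, hidden_lengths = encoder(frames, lengths)  # (batch, time, input_dim)
```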
Language model: We use a 350M or 1B parameter decoder language model (for the continuation and question-answering tasks, respectively) trained in the manner of PaLM 2. The model receives the encoded features of the prompt as a prefix. Note that this is the only connection between the speech encoder and the LM decoder; i.e., there is no cross-attention between the encoder and the decoder.
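Concretely, this prefix conditioning amounts to concatenation along the sequence axis, as in the minimal sketch below (tensor names assumed); the decoder then attends to the speech features through its ordinary causal self-attention rather than a separate cross-attention module.

```python
import torch

def decoder_inputs(encoder_states, text_embeddings):
    # encoder_states: (batch, T_speech, d_model), projected speech features
    # text_embeddings: (batch, T_text, d_model), embedded transcript tokens
    # The speech features are simply prepended as a prefix.
    return torch.cat([encoder_states, text_embeddings], dim=1)
```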
Acoustic projection layers: To enable the language model decoder to model spectrogram frames, we employ a “pre-net” that projects the ground truth spectrogram speech continuations into the language model dimension, and a “post-net” that projects the LM output back to the spectrogram dimension. Both are two-layer multi-layer perceptrons.
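A minimal sketch of the two projection networks follows; the dimensions and the ReLU activation are placeholder assumptions, not values from the paper.

```python
import torch.nn as nn

def two_layer_mlp(in_dim, hidden_dim, out_dim):
    return nn.Sequential(
        nn.Linear(in_dim, hidden_dim),
        nn.ReLU(),  # activation choice is an assumption
        nn.Linear(hidden_dim, out_dim),
    )

N_MELS, D_MODEL, D_HIDDEN = 128, 1536, 1024          # placeholder dimensions
pre_net = two_layer_mlp(N_MELS, D_HIDDEN, D_MODEL)   # spectrogram -> LM dimension
post_net = two_layer_mlp(D_MODEL, D_HIDDEN, N_MELS)  # LM dimension -> spectrogram
```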
Training objective: The training methodology of Spectron uses two distinct loss functions: (i) cross-entropy loss, employed for both speech recognition and transcript continuation, and (ii) regression loss, employed for speech continuation. During training, all parameters are updated (speech encoder, projection layer, LM, pre-net, and post-net).
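Putting the two objectives together, a training step might combine them as in the sketch below; the relative weighting is an assumption, and `spectrogram_regression_loss` refers to the derivative-matching sketch above.

```python
import torch.nn.functional as F

def spectron_loss(text_logits, text_targets, pred_spec, target_spec, reg_weight=1.0):
    # (i) cross-entropy over the full transcript (recognition + continuation);
    # text_logits: (batch, T_text, vocab), text_targets: (batch, T_text)
    ce = F.cross_entropy(text_logits.transpose(1, 2), text_targets)
    # (ii) spectrogram regression with derivative matching (defined earlier)
    reg = spectrogram_regression_loss(pred_spec, target_spec)
    return ce + reg_weight * reg  # relative weighting is an assumption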
Performance: To empirically evaluate the performance of the proposed approach, we conducted experiments on the Libri-Light dataset. We compare our method against existing spoken language models: AudioLM, GSLM, TWIST, and SpeechGPT. Performance metrics include log-perplexity, mean opinion score (MOS), and speaker similarity. Our method outperforms other models in terms of log-perplexity and speaker similarity, and performs competitively in terms of MOS.
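For reference, log-perplexity is simply the mean negative log-likelihood of the target tokens under a language model, as in the snippet below; the exact evaluation pipeline (e.g., which model scores the transcribed continuations) is described in the paper rather than here.

```python
import torch.nn.functional as F

def log_perplexity(logits, targets):
    # Mean negative log-likelihood of the targets.
    # logits: (batch, T, vocab), targets: (batch, T)
    return F.cross_entropy(logits.transpose(1, 2), targets, reduction="mean")
```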
Question answering: For the question answering task, we evaluate on the LLama Questions dataset and the WebQuestions dataset. Our method achieves high accuracy on both.
Overall, Spectron presents a novel approach for spoken language modeling that achieves state-of-the-art performance in speech continuation and question answering tasks. By directly processing spectrograms, it bypasses the need for discrete speech representations and demonstrates the potential for end-to-end systems in modeling spoken language.