Advancing Speech Accessibility with Personal Voice

Personal Voice, a voice replication feature introduced in May 2023, is available on iOS 17. It is designed to assist people at risk of losing their ability to speak, such as those diagnosed with ALS or other conditions that progressively impair speech. Personal Voice lets users create a synthesized version of their own voice and use it across communication settings, including FaceTime, phone calls, assistive communication apps, and in-person conversations.

To create a personalized voice, users read aloud a randomized set of text prompts, recording 150 sentences using the latest iPhone, iPad, or Mac. The recorded audio is then processed and the voice model is fine-tuned overnight on the device while it is charging, locked, and connected to Wi-Fi. The Wi-Fi connection is used solely to download the pretrained model asset. The next day, the user can type what they want to say and use the Live Speech text-to-speech (TTS) feature to communicate in a voice that closely resembles their own. All model training and inference run entirely on the device, which keeps the user's voice data private and secure.

The Personal Voice feature is based on three machine learning approaches. The first approach involves a typical neural TTS system that converts text into speech. It consists of text processing, acoustic modeling, and vocoder modeling. Apple researchers used the Open SLR LibriTTS dataset, which includes 300 hours of recordings from 1000 speakers with diverse speech styles and accents, to develop Personal Voice. The acoustic model and vocoder model are fine-tuned on the device to replicate the target speaker’s voice accurately.
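The three stages described above can be pictured as a chain of functions. The sketch below is only a data-flow illustration and not Apple's implementation: the function names, the 80 mel bins, the 256-sample hop, and the fixed five frames per token are all assumptions, and each stage is a placeholder that returns zeros where a real system would run a trained network.

```python
from typing import List

Tokens = List[str]
MelFrames = List[List[float]]  # one list of mel-bin values per frame
Waveform = List[float]

N_MELS = 80       # assumed number of mel bins per frame
HOP_LENGTH = 256  # assumed audio samples generated per mel frame


def process_text(text: str) -> Tokens:
    """Placeholder front end: lowercase and keep letters and spaces as tokens."""
    return [ch for ch in text.lower() if ch.isalpha() or ch == " "]


def acoustic_model(tokens: Tokens, frames_per_token: int = 5) -> MelFrames:
    """Placeholder acoustic model: a fixed number of zero frames per token.

    A real model would predict durations and mel values with a network.
    """
    n_frames = len(tokens) * frames_per_token
    return [[0.0] * N_MELS for _ in range(n_frames)]


def vocoder(mel: MelFrames) -> Waveform:
    """Placeholder vocoder: HOP_LENGTH silent samples for every mel frame."""
    return [0.0] * (len(mel) * HOP_LENGTH)


def synthesize(text: str) -> Waveform:
    """Chain the stages: text -> tokens -> mel frames -> waveform."""
    return vocoder(acoustic_model(process_text(text)))


tokens = process_text("Hi there")
mel = acoustic_model(tokens)
audio = synthesize("Hi there")
```

The point of the sketch is the interface between stages: the acoustic model and the vocoder only communicate through the mel frames, which is why each can be pretrained and fine-tuned as a separate model.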

The second machine learning approach is voice model pretraining and fine-tuning. The acoustic model is based on the FastSpeech2 architecture, with a speaker ID included so that it learns voice information during pretraining. The vocoder is based on WaveRNN, and both models are pretrained on the Open SLR LibriTTS dataset. During fine-tuning, only the decoder and variance adapters of the acoustic model are updated, while the vocoder undergoes full model adaptation. The fine-tuning stage runs entirely on the user’s Apple device, and limiting the acoustic-model updates to a subset of its parameters helps keep training fast.
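Partial fine-tuning of this kind amounts to marking some parameter groups as trainable and freezing the rest. The following is a minimal, framework-free sketch of that selection step. The group names and parameter counts are hypothetical and chosen only for illustration; they do not describe the actual Personal Voice model.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class ParamGroup:
    name: str            # dotted module path, e.g. "variance_adapter.pitch"
    num_params: int      # hypothetical size, for illustration only
    trainable: bool = False


def mark_trainable(groups: List[ParamGroup], prefixes: Sequence[str]) -> None:
    """Unfreeze groups whose name equals a prefix or sits under it; freeze the rest."""
    for g in groups:
        g.trainable = any(
            g.name == p or g.name.startswith(p + ".") for p in prefixes
        )


def trainable_fraction(groups: List[ParamGroup]) -> float:
    """Share of all parameters that will receive updates."""
    total = sum(g.num_params for g in groups)
    updated = sum(g.num_params for g in groups if g.trainable)
    return updated / total


# Hypothetical layout of a FastSpeech2-style acoustic model.
acoustic = [
    ParamGroup("speaker_embedding", 256_000),
    ParamGroup("encoder", 10_000_000),
    ParamGroup("variance_adapter.duration", 400_000),
    ParamGroup("variance_adapter.pitch", 400_000),
    ParamGroup("variance_adapter.energy", 400_000),
    ParamGroup("decoder", 10_000_000),
]

# Only the decoder and variance adapters are updated during fine-tuning.
mark_trainable(acoustic, ["decoder", "variance_adapter"])
updated_names = [g.name for g in acoustic if g.trainable]
fraction = trainable_fraction(acoustic)
```

With these made-up sizes, roughly half of the acoustic model's parameters are updated. Full adaptation of the vocoder would correspond to marking every one of its groups as trainable.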

The third machine learning approach is on-device speech recording enhancement. Users can record their voice samples wherever they choose, but these recordings may contain unwanted sounds. To ensure the best voice quality, speech augmentation techniques are applied to the target-speaker data. This includes filtering out noisy recordings, isolating the voice, augmenting the Mel spectrum (a frequency representation of sound), and recovering the audio signal from the enhanced Mel spectrum. These enhancements significantly improve the quality of the generated voice.
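Two of the ideas in this step can be made concrete with a little arithmetic: screening recordings by signal-to-noise ratio (SNR), and mapping frequencies onto the mel scale. The sketch below is an assumption-laden illustration rather than the method used in Personal Voice. It uses the common formula mel = 2595 * log10(1 + f / 700), and the 15 dB threshold is a hypothetical value.

```python
import math
from typing import List, Sequence


def mean_power(samples: Sequence[float]) -> float:
    """Average squared amplitude of a non-empty segment."""
    return sum(s * s for s in samples) / len(samples)


def snr_db(speech: Sequence[float], noise: Sequence[float]) -> float:
    """SNR in decibels; assumes the noise segment is not all zeros."""
    return 10.0 * math.log10(mean_power(speech) / mean_power(noise))


def hz_to_mel(f_hz: float) -> float:
    """Convert a frequency in Hz to mels (2595 * log10(1 + f / 700))."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(mel: float) -> float:
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_band_edges(n_bands: int, f_min: float, f_max: float) -> List[float]:
    """Edges in Hz of n_bands bands spaced evenly on the mel scale."""
    lo, hi = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (hi - lo) / (n_bands + 1)
    return [mel_to_hz(lo + i * step) for i in range(n_bands + 2)]


# Toy segments: speech amplitude 1.0, background noise amplitude 0.1.
speech = [1.0, -1.0] * 100
noise = [0.1, -0.1] * 100

MIN_SNR_DB = 15.0  # hypothetical acceptance threshold
recording_snr = snr_db(speech, noise)
keep_recording = recording_snr >= MIN_SNR_DB

edges = mel_band_edges(8, 0.0, 8000.0)
widths = [b - a for a, b in zip(edges, edges[1:])]
```

Because the bands are evenly spaced in mels, their widths in Hz grow with frequency, which gives a mel spectrum finer resolution at low frequencies than at high ones.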

Overall, the Personal Voice feature aims to provide individuals at risk of losing their ability to speak with a powerful tool for communication. It allows users to create their own synthesized voice that closely resembles their natural voice, ensuring a more personalized and comfortable communication experience.
