Voice synthesis has made significant advancements since the Speak & Spell toy from 1978, which impressed people with its ability to read words aloud using an electronic voice. Today, with the help of deep-learning AI models, software can create realistic-sounding voices and even mimic existing voices with just a small sample of audio.
In line with this progress, OpenAI recently introduced Voice Engine, a text-to-speech AI model that can generate synthetic voices based on a 15-second segment of recorded audio. Audio samples of Voice Engine in action are available on the company’s website.
Users can type text into Voice Engine and get an AI-generated voice result, but OpenAI has decided not to release the technology widely yet. The company originally planned a pilot program in which developers could sign up for the Voice Engine API, but it scaled back those plans due to ethical considerations.
OpenAI explains, “In line with our approach to AI safety and our voluntary commitments, we are choosing to preview but not widely release this technology at this time. We hope this preview of Voice Engine both underscores its potential and also motivates the need to bolster societal resilience against the challenges brought by ever more convincing generative models.”
Voice cloning technology is not new; several AI voice synthesis models have been available since 2022. Still, OpenAI’s approach to potentially releasing its voice tech to the public is noteworthy, as is the company’s cautious stance toward a full release.
OpenAI highlights the benefits of its voice technology, such as providing reading assistance, enabling global reach for creators, supporting non-verbal individuals, and aiding patients in recovering their voice after speech-impairing conditions.
However, the ability to clone a voice from just 15 seconds of recorded audio raises concerns about misuse, which has already been seen in phone scams and election campaign robocalls. Voice-cloning technology has even been used to break into bank accounts protected by voice authentication.
Recognizing the possible risks, OpenAI is taking a cautious approach and testing the technology with select partners to address security issues. For instance, HeyGen, a video synthesis company, has been using the model to translate a speaker’s voice into other languages while maintaining the original vocal sound.