OpenAI, the artificial intelligence company that introduced ChatGPT to the world in November 2022, is enhancing the chatbot app’s conversational abilities.
The latest update to the ChatGPT mobile apps for iOS and Android allows users to speak their queries to the chatbot and receive responses in its own synthesized voice. Additionally, the new version of ChatGPT includes visual features. Users can upload or take a photo within the app, and it will provide a description of the image and offer further context, similar to Google’s Lens feature.
These new capabilities show that OpenAI is treating its artificial intelligence models as products to be regularly updated, and that ChatGPT is becoming more of a consumer app competing with Siri and Alexa.
Improving the ChatGPT app’s appeal could give OpenAI an advantage in the race against other AI companies, such as Google, Anthropic, Inflection AI, and Midjourney. It would also provide OpenAI with more user data to help train its powerful AI engines. And incorporating audio and visual data into the machine learning models behind ChatGPT could contribute to OpenAI’s long-term goal of creating more human-like intelligence.
OpenAI’s language models, including the latest GPT-4, were developed using large volumes of text from various sources on the web. Many AI experts believe that to create more advanced AI, it may be necessary to feed algorithms with audio and visual information, in addition to text, just as animal and human intelligence rely on various types of sensory data.
Google’s upcoming major AI model, Gemini, is rumored to be “multimodal,” meaning it will handle more than just text, potentially supporting video, images, and voice inputs. Trevor Darrell, a professor at UC Berkeley and a cofounder of Prompt AI, a startup combining natural language with image generation and manipulation, suggests that multimodal models are likely to outperform models trained on a single modality. He states, “If we build a model using just language, no matter how powerful it is, it will only learn language.”
OpenAI’s new voice generation technology, developed in-house, also presents licensing opportunities. Spotify, for example, plans to use OpenAI’s speech synthesis algorithms to pilot a feature that translates podcasts into additional languages, mimicking the original podcaster’s voice using AI.
The updated ChatGPT app includes icons for headphones as well as photo and camera functions. The voice and visual features convert incoming audio or images to text using speech or image recognition, which the chatbot then uses to generate a response. The app replies by voice or text, depending on the mode the user has chosen. When a WIRED writer spoke to the new ChatGPT and asked whether it could “hear” her, the app replied, “I can’t hear you, but I can read and respond to your text messages,” because voice queries are converted to text before the model processes them. It responds in one of five voices: Juniper, Ember, Sky, Cove, or Breeze.
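The pipeline described above can be sketched in a few lines. This is an illustrative toy, not OpenAI’s actual implementation: every function name here is a hypothetical stand-in, and the recognition and synthesis steps are stubbed out with placeholders. The point is the shape of the flow — any input modality is reduced to text before the language model sees it, which is why the app says it can read but not hear.

```python
# Illustrative sketch of the app's input/output flow. All functions are
# hypothetical stand-ins; real speech recognition, image recognition, and
# text-to-speech systems would replace the placeholder bodies.

def transcribe_speech(audio: bytes) -> str:
    """Stand-in for a speech-recognition step."""
    return audio.decode("utf-8")  # placeholder: pretend the audio is its transcript

def describe_image(image: bytes) -> str:
    """Stand-in for an image-recognition step."""
    return "a photo of " + image.decode("utf-8")  # placeholder description

def generate_reply(prompt: str) -> str:
    """Stand-in for the text-only language model."""
    return f"You said: {prompt}"

def synthesize_voice(text: str, voice: str = "Juniper") -> str:
    """Stand-in for text-to-speech in one of the five named voices."""
    return f"[{voice} speaking] {text}"

def chat_turn(payload: bytes, mode: str = "text", voice: str = "Juniper") -> str:
    # 1. Convert whatever the user provides into text.
    if mode == "voice":
        prompt = transcribe_speech(payload)
    elif mode == "image":
        prompt = describe_image(payload)
    else:
        prompt = payload.decode("utf-8")
    # 2. The language model itself only ever works on text.
    reply = generate_reply(prompt)
    # 3. Answer in the same modality the user chose.
    return synthesize_voice(reply, voice) if mode == "voice" else reply
```

Note how the model never touches audio or pixels directly in this design; that separation is exactly what multimodal models like the rumored Gemini aim to remove by training on the raw modalities themselves.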