Complexity is a challenge for AI models that must run in real time on devices such as headphones, which have limited computing power and battery life. To work within these constraints, the neural networks had to be small and energy-efficient, so the team used knowledge distillation, an AI compression technique. This meant taking a large AI model trained on millions of voices (the “teacher”) and using it to train a smaller model (the “student”) to imitate its behavior and performance.
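The teacher-student idea can be illustrated with a toy sketch. It is not the team's training code: the `teacher` function here is a hypothetical stand-in for a large pretrained model, the student is a two-parameter linear model, and the loss is a plain squared error between the two outputs, all chosen only to keep the example self-contained.

```python
import random

def teacher(x):
    # Hypothetical stand-in for a large pretrained model.
    return 3.0 * x + 0.5

def train_student(steps=2000, lr=0.05, seed=0):
    """Fit a tiny student (w, b) to reproduce the teacher's outputs."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    for _ in range(steps):
        x = rng.uniform(-1.0, 1.0)
        target = teacher(x)      # the teacher's output acts as the label
        err = (w * x + b) - target
        # Gradient step on 0.5 * err**2 with respect to w and b
        w -= lr * err * x
        b -= lr * err
    return w, b

w, b = train_student()
```

The student never sees ground-truth labels, only the teacher's outputs, which is the core of distillation. In a real system the student would be a compact neural network and the loss would be chosen to suit the audio task.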
The student was then trained to extract vocal patterns of specific voices from surrounding noise captured by microphones on noise-canceling headphones. To activate the Target Speech Hearing system, the wearer holds down a button on the headphones for a few seconds while facing the person they want to focus on. During this “enrollment” process, the system captures an audio sample from both headphones to extract the speaker’s vocal characteristics, even in the presence of other speakers and noises.
These characteristics are fed into a second neural network running on a microcontroller computer connected to the headphones via a USB cable. This network runs continuously, keeping the chosen voice separate from others and playing it back to the listener. Once the system locks onto a speaker, it continues to prioritize that person’s voice, even if the wearer looks away. The more audio the system gathers from a given speaker, the better it becomes at isolating that voice.
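The two-stage flow described here, a one-time enrollment followed by continuous processing conditioned on the enrolled voice, can be sketched in a deliberately simplified form. The sketch below swaps the neural networks for an averaged feature vector and a cosine-similarity gate; the function names, feature vectors, threshold, and gain values are all illustrative assumptions, not details of the actual system.

```python
import math

def cosine(a, b):
    # Cosine similarity between two equal-length feature vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def enroll(frames):
    # Average the feature vectors captured during enrollment
    # into a single "voice embedding" (toy stand-in).
    n = len(frames)
    return [sum(col) / n for col in zip(*frames)]

def stream_gains(embedding, chunks, threshold=0.8):
    # For each incoming chunk, keep it (gain 1.0) if it resembles
    # the enrolled voice, otherwise attenuate it (gain 0.1).
    for features in chunks:
        yield 1.0 if cosine(embedding, features) >= threshold else 0.1

embedding = enroll([[1.0, 0.0], [0.8, 0.2]])
gains = list(stream_gains(embedding, [[1.0, 0.1], [0.0, 1.0]]))
```

Here the first chunk closely matches the enrolled embedding and is kept, while the second is attenuated. The real system separates overlapping voices within the same audio rather than gating whole chunks, so this only conveys the enroll-then-condition structure.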
Currently, the system can only successfully enroll a targeted speaker when their voice is the dominant one present. However, the team aims to make it work even when the loudest voice in a specific direction is not the target speaker.
Sefik Emre Eskimez, a senior researcher at Microsoft who works on speech and AI, acknowledges how difficult it is to isolate a single voice in a noisy environment. He believes that solving this problem could open up many applications, particularly in meetings.
While speech separation research is often more theoretical than practical, this work has tangible real-world applications, according to Samuele Cornell, a researcher at Carnegie Mellon University’s Language Technologies Institute. Cornell views this as a positive step forward in the field, providing a new perspective on the topic.