Multichannel Voice Trigger Detection Based on Transform-average-concatenate

This paper was accepted at the HSCMA workshop at ICASSP 2024.

Voice triggering (VT) lets users activate their devices by speaking a trigger phrase. A front-end system is commonly used for speech enhancement and/or separation, and it produces multiple enhanced and/or separated signals. Traditional VT systems accept only single-channel audio input, so one of these channels must be selected. Channel selection, however, discards the unused channels, which may contain information valuable for VT. In this study, we introduce multichannel acoustic models for VT, in which the front-end output is fed directly into the VT model. We adopt a transform-average-concatenate (TAC) block and extend it by incorporating the channel chosen by conventional channel selection, which lets the model focus on a specific speaker when multiple speakers are present. The proposed approach reduces the false rejection rate by 30% compared to the standard channel selection method.
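
To make the transform-average-concatenate idea concrete, the following is a minimal NumPy sketch of a plain TAC block as it is commonly described: a shared transform applied to each channel, an average across channels, and a concatenation of that average back onto every channel. It is an illustration under stated assumptions, not the model from this paper. The function name `tac_block`, the weight names, the layer sizes, the ReLU nonlinearity, and the residual connection are all choices made for this example, and the paper's extension that incorporates the selected channel is not shown.

```python
import numpy as np


def relu(x):
    return np.maximum(x, 0.0)


def tac_block(x, w_transform, w_average, w_concat):
    """Apply one illustrative TAC block.

    x has shape (channels, frames, features). All weights are shared
    across channels, so the block works for any number of channels.
    """
    # Transform: the same linear layer is applied to every channel.
    h = relu(x @ w_transform)                        # (C, T, H)

    # Average: pool over the channel axis, then transform the mean.
    h_mean = relu(h.mean(axis=0) @ w_average)        # (T, H)

    # Concatenate: attach the cross-channel summary to each channel.
    h_mean_per_channel = np.broadcast_to(h_mean, h.shape)
    h_cat = np.concatenate([h, h_mean_per_channel], axis=-1)  # (C, T, 2H)

    # Project back to the input size and add a residual connection
    # (an assumption of this sketch).
    return x + relu(h_cat @ w_concat)                # (C, T, F)


# Hypothetical sizes: 3 front-end channels, 5 frames, 8 features.
rng = np.random.default_rng(0)
C, T, F, H = 3, 5, 8, 16
x = rng.standard_normal((C, T, F))
w_transform = rng.standard_normal((F, H)) * 0.1
w_average = rng.standard_normal((H, H)) * 0.1
w_concat = rng.standard_normal((2 * H, F)) * 0.1

y = tac_block(x, w_transform, w_average, w_concat)
```

Because the only cross-channel operation is a mean, reordering the input channels reorders the outputs in the same way. This is why a block of this kind can consume all front-end outputs without first picking one of them.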
