We address stereo singing voice cancellation, a task within music source separation: estimating the instrumental background from a stereo mix. Our goal is to match the performance of large, state-of-the-art source separation networks with a small, efficient model originally designed for real-time speech separation. Such a model is particularly useful when memory and compute are limited and the singing voice must be processed with minimal look-ahead.
To achieve this, we adapt an existing mono model to stereo input. Fine-tuning the model parameters and expanding the training set further improve output quality. In addition, we introduce a new metric that detects inconsistencies in attenuation between channels, highlighting the benefit of a dedicated stereo model.
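As an illustration of the kind of inter-channel consistency metric described above, the following sketch compares per-channel attenuation (the energy ratio of the estimated background to the mix, in dB) frame by frame and reports the mean absolute left/right discrepancy. This is a hypothetical formulation for intuition only, not the paper's actual metric; the function names and frame parameters are assumptions.

```python
import numpy as np

def channel_attenuation_db(mix, est, eps=1e-12):
    # Per-channel attenuation: energy ratio of estimate vs. mix, in dB.
    # mix, est: shape (channels, samples).
    return 10 * np.log10((np.sum(est ** 2, axis=-1) + eps) /
                         (np.sum(mix ** 2, axis=-1) + eps))

def interchannel_attenuation_inconsistency(mix, est, frame=4096, hop=2048):
    """Mean absolute L/R attenuation difference across frames, in dB.

    mix, est: arrays of shape (2, num_samples) holding the stereo mix
    and the estimated instrumental background. A value near 0 means
    both channels were attenuated consistently; larger values indicate
    the kind of inter-channel mismatch the metric is meant to expose.
    (Illustrative sketch, not the metric proposed in the paper.)
    """
    n = mix.shape[1]
    diffs = []
    for start in range(0, n - frame + 1, hop):
        m = mix[:, start:start + frame]
        e = est[:, start:start + frame]
        att = channel_attenuation_db(m, e)  # shape (2,): left, right
        diffs.append(abs(att[0] - att[1]))
    return float(np.mean(diffs)) if diffs else 0.0
```

For example, an estimate that scales both channels equally scores near 0 dB, while one that attenuates the right channel twice as much as the left scores about 6 dB, flagging a spatially inconsistent separation.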
We evaluate our approach with objective offline metrics and a comprehensive MUSHRA trial, demonstrating the effectiveness of our techniques in rigorous listening tests.