In this paper, we explore the use of Federated Learning (FL) to train End-to-End Automatic Speech Recognition (ASR) models. We aim to close the word error rate gap between FL-trained models and their centrally trained counterparts. We investigate several factors that can affect this gap, including adaptive optimizers, varying the Connectionist Temporal Classification (CTC) loss weight to modify the loss characteristics, model initialization through a seed start, carrying over the modeling setup from centralized training to FL, and FL-specific hyperparameters for ASR under heterogeneous data distributions. We highlight the effectiveness of certain optimizers in inducing smoothness and discuss the applicability of algorithms and trends from prior FL work. Figure 1 illustrates the overlap of central model updates for the Yogi and Adam optimizers over the first 50 aggregation rounds. The wider diagonal white beam for Yogi compared to Adam demonstrates the additional smoothing achieved by Yogi, which reduces the impact of heterogeneity among client updates.
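To make the role of the server-side adaptive optimizer concrete, the following is a minimal sketch of FedOpt-style aggregation contrasting the Adam and Yogi second-moment updates; it is an illustrative toy, not the paper's actual training setup, and all names, hyperparameters, and the simple mean aggregation are assumptions.

```python
# Minimal sketch: server-side adaptive optimization in federated averaging
# (FedOpt-style), contrasting the Adam and Yogi second-moment updates.
# Hyperparameters and the toy data below are illustrative only.
import numpy as np

def server_update(x, delta, m, v, lr=1e-2, beta1=0.9, beta2=0.99,
                  tau=1e-3, optimizer="yogi"):
    """One aggregation round: apply the averaged client update `delta`
    to the server model `x` using an adaptive server optimizer."""
    m = beta1 * m + (1.0 - beta1) * delta
    if optimizer == "adam":
        # Adam: exponential moving average of the squared update.
        v = beta2 * v + (1.0 - beta2) * delta**2
    else:
        # Yogi: additive, sign-controlled second-moment update; v changes
        # more slowly, which smooths the effect of heterogeneous client updates.
        v = v - (1.0 - beta2) * np.sign(v - delta**2) * delta**2
    x = x + lr * m / (np.sqrt(v) + tau)
    return x, m, v

# Toy usage: aggregate pseudo-gradients from a few clients per round.
rng = np.random.default_rng(0)
x = rng.normal(size=10)                 # server model parameters
m, v = np.zeros(10), np.full(10, 1e-6)  # optimizer state
for round_idx in range(50):
    client_deltas = [rng.normal(scale=0.1, size=10) for _ in range(4)]
    delta = np.mean(client_deltas, axis=0)  # simple FedAvg-style aggregation
    x, m, v = server_update(x, delta, m, v, optimizer="yogi")
```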