In artificial intelligence and machine learning, speech recognition models are transforming the way people interact with technology. Building on advances in Natural Language Processing, Natural Language Understanding, and Natural Language Generation, they have paved the way for applications in almost every industry. Because they translate spoken language into text, these models are essential to smooth communication between humans and machines.
Speech recognition has progressed rapidly in recent years, and OpenAI's Whisper series has set a high standard. OpenAI introduced the Whisper series of audio transcription models in late 2022, and the models have since gained popularity and attention across the AI community, from students and scholars to researchers and developers.
Whisper is a pre-trained model for automatic speech recognition (ASR) and speech translation. It is a Transformer-based encoder-decoder model, also known as a sequence-to-sequence model. It was trained on 680,000 hours of labeled speech data and generalizes well across many datasets and domains without requiring fine-tuning.
The Whisper models stand out for their adaptability: they were trained on either English-only or multilingual data. The English-only models were trained on speech recognition alone, so they predict transcriptions in the same language as the audio. The multilingual models were trained on both speech recognition and speech translation, so they can either transcribe the audio in its own language or produce text in a language other than that of the audio. This dual capability allows the models to be used for several purposes and increases their adaptability to different linguistic settings.
Significant variations of the Whisper series include Whisper v2, Whisper v3, and Distil Whisper. Whisper v3 is an upgraded version trained on a larger dataset, while Distil Whisper is a distilled, simplified version that is faster and smaller. Examining each model’s overall Word Error Rate (WER), the proportion of reference words that are transcribed incorrectly, reveals a seemingly paradoxical finding: the larger models have noticeably higher WER than the smaller ones.
A thorough evaluation traced this discrepancy to the large models’ multilingualism, which frequently causes them to misidentify the language based on the speaker’s accent and then transcribe in the wrong language. Once these mis-transcriptions are removed, the results become more clear-cut: the large v2 and v3 models have the lowest WER, while the Distil models have the highest.
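The effect of removing mis-transcriptions can be illustrated with a small sketch. It assumes WER is computed as word-level edit distance divided by the number of reference words, and that each sample records the language the model detected. The field names (`reference`, `hypothesis`, `detected_lang`) and the sample sentences are hypothetical and are not taken from the evaluation discussed here.

```python
def word_edit_distance(ref_words, hyp_words):
    """Minimum number of word substitutions, insertions, and deletions."""
    # prev[j] holds the distance between the reference prefix processed so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(samples):
    """Corpus-level WER: total word errors divided by total reference words."""
    errors = 0
    ref_words_total = 0
    for sample in samples:
        ref = sample["reference"].lower().split()
        hyp = sample["hypothesis"].lower().split()
        errors += word_edit_distance(ref, hyp)
        ref_words_total += len(ref)
    if ref_words_total == 0:
        raise ValueError("no reference words to score against")
    return errors / ref_words_total


def drop_misidentified(samples, expected_lang="en"):
    """Keep only samples whose detected language matches the expected one."""
    return [s for s in samples if s["detected_lang"] == expected_lang]


# Hypothetical samples: the second one was transcribed in the wrong language.
samples = [
    {"reference": "the cat sat on the mat",
     "hypothesis": "the cat sat on a mat",
     "detected_lang": "en"},
    {"reference": "good morning everyone",
     "hypothesis": "guten morgen allerseits",
     "detected_lang": "de"},
]

wer_all = wer(samples)                           # 4 errors over 9 words
wer_filtered = wer(drop_misidentified(samples))  # 1 error over 6 words
print(f"WER with misidentified samples:    {wer_all:.3f}")
print(f"WER without misidentified samples: {wer_filtered:.3f}")
```

In this toy data a single wrong-language transcription counts every word as an error, which is why a multilingual model can look worse on overall WER than its per-word accuracy on correctly identified audio would suggest.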
The English-only models consistently avoid the error of transcribing English audio into a non-English language. The large-v3 model, which had access to a more extensive audio dataset, has a lower language misidentification rate than its predecessors. The Distil models performed well across different speakers, and their evaluation produced several further findings:
Distil models may fail to recognize successive sentence segments, as shown by poor length ratios between the output and label.
The Distil models sometimes perform better than the base versions, especially when it comes to punctuation insertion. In this regard, the Distil medium model stands out in particular.
The base Whisper models may omit verbal repetitions by the speaker, but this is not observed in the Distil models.
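The length-ratio check mentioned in the first finding can be sketched as follows. This is only an illustration: it assumes the ratio is the number of words in the model output divided by the number of words in the label, and the threshold of 0.8 is an arbitrary value chosen for the example, not one reported in the evaluation.

```python
def length_ratio(hypothesis: str, reference: str) -> float:
    """Word count of the model output divided by word count of the label."""
    ref_len = len(reference.split())
    if ref_len == 0:
        raise ValueError("reference must contain at least one word")
    return len(hypothesis.split()) / ref_len


def flag_possible_omissions(pairs, threshold=0.8):
    """Return indices of (hypothesis, reference) pairs with a low length ratio.

    A low ratio suggests the model may have skipped part of the audio.
    The default threshold is an assumption made for this sketch.
    """
    return [i for i, (hyp, ref) in enumerate(pairs)
            if length_ratio(hyp, ref) < threshold]


# Hypothetical outputs: the first one drops the second half of the sentence.
pairs = [
    ("we met on monday", "we met on monday and then again on friday"),
    ("thanks for joining", "thanks for joining"),
]
flagged = flag_possible_omissions(pairs)
print(flagged)
```

A ratio well below 1 does not prove that a segment was skipped, but it is a cheap signal for picking out transcripts worth inspecting by hand.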
Following a recent Twitter thread by Omar Sanseviero, here is a comparison of the Whisper models and guidance on which one to use in which situation.
Whisper v3: Optimal for Known Languages – If the language is known and language identification is reliable, it is better to opt for the Whisper v3 model.
Whisper v2: Robust for Unknown Languages – Whisper v2 shows improved dependability if the language is unknown or if Whisper v3’s language identification is not reliable.
Whisper v3 Large: English Excellence – Whisper v3 Large is a good default option if the audio is always in English and neither memory nor inference performance is a concern.
Distilled Whisper: Speed and Efficiency – Distilled Whisper is a better choice if memory or inference performance is important and the audio is in English. It is six times faster, 49% smaller, and performs within 1% WER of Whisper v2. Despite occasional difficulties, it performs almost as well as the slower models.
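The guidance above can be condensed into a small decision helper. This is a sketch of one possible reading of the recommendations, not an official selection rule; the function name, its parameters, and the returned labels are all hypothetical.

```python
def choose_whisper_model(english_only: bool,
                         resource_constrained: bool,
                         language_known: bool = True,
                         language_id_reliable: bool = True) -> str:
    """Map the recommendations to a model label (labels are illustrative)."""
    if english_only:
        # English audio: trade accuracy headroom for speed only when needed.
        return "distil-whisper" if resource_constrained else "whisper-large-v3"
    if language_known and language_id_reliable:
        # Known language with dependable language identification.
        return "whisper-large-v3"
    # Unknown language, or v3 language identification is not trusted.
    return "whisper-large-v2"


print(choose_whisper_model(english_only=True, resource_constrained=True))
print(choose_whisper_model(english_only=False, resource_constrained=False,
                           language_known=False))
```

The helper checks the English-only case first because that is where memory and speed constraints change the recommendation; for other audio, the choice between v3 and v2 depends only on how far language identification can be trusted.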
In conclusion, the Whisper models have significantly advanced the field of audio transcription and are available for anyone to use. The choice between Whisper v2, Whisper v3, and Distilled Whisper depends on the particular requirements of the application, so an informed decision requires careful consideration of factors such as language identification, speed, and model efficiency.