Mistral AI Introduces Mixtral 8x7B: a Sparse Mixture of Experts (SMoE) Language Model Transforming Machine Learning

January 14, 2024
in AI Technology

In recent research, a team from Mistral AI has presented Mixtral 8x7B, a language model based on a new Sparse Mixture of Experts (SMoE) architecture with open weights. Released under the Apache 2.0 license, Mixtral is a decoder-only model built as a sparse mixture-of-experts network.

The team has shared that Mixtral's feed-forward block selects from eight distinct groups of parameters. At every layer, for every token, a router network chooses two of these groups (the "experts") to process the token and combines their outputs additively. Because only a fraction of the total parameters is active for any given token, this method effectively increases the model's parameter count while keeping cost and latency under control.
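
This top-2-of-8 routing can be made concrete with a short sketch. The following is a minimal, illustrative PyTorch implementation of such a block, not Mistral's actual code; the layer sizes and the simple two-layer MLP expert are assumptions for clarity:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoEBlock(nn.Module):
    """Minimal sketch of a sparse MoE feed-forward block (8 experts, top-2)."""

    def __init__(self, d_model, d_ff, n_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):  # x: (tokens, d_model)
        # The router scores all experts, and only the top-k are kept per token.
        weights, idx = self.router(x).topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)  # normalize over the selected experts
        out = torch.zeros_like(x)
        # Each selected expert processes its tokens; results combine additively.
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e
                if mask.any():
                    out[mask] += weights[mask, k, None] * expert(x[mask])
        return out
```

Only two of the eight expert MLPs run for each token, which is why the active compute per token is far smaller than the total parameter count would suggest.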

Mixtral has been pre-trained on multilingual data with a 32k-token context size. It performs on par with or better than Llama 2 70B and GPT-3.5 on a number of benchmarks. One of its main advantages is its efficient use of parameters, which permits faster inference at small batch sizes and higher throughput at large batch sizes.
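
To make the parameter-efficiency point concrete: per the Mixtral paper, only about 13B of the model's roughly 47B total parameters are active for any given token, since just two of the eight experts run per token. A back-of-the-envelope check:

```python
# Approximate figures from the Mixtral 8x7B paper.
total_params = 46.7e9   # all eight experts plus shared attention/embeddings
active_params = 12.9e9  # two experts per token plus the shared parameters
print(f"Active fraction per token: {active_params / total_params:.0%}")  # ~28%
```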

Mixtral substantially outperformed Llama 2 70B in tests covering multilingual understanding, code generation, and mathematics. Experiments have shown that Mixtral can reliably retrieve information from its 32k-token context window, regardless of the length of the sequence and the position of the information within it.
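
Long-context retrieval of this kind is commonly measured with a synthetic "passkey retrieval" task, in which a secret value is buried at an arbitrary position in a long filler text and the model is asked to repeat it. A minimal sketch of how such a prompt can be constructed (the exact prompt format used by the authors may differ):

```python
import random

def make_passkey_prompt(n_filler=2000, seed=0):
    """Builds a long prompt hiding a passkey at a random position."""
    rng = random.Random(seed)
    passkey = rng.randint(10_000, 99_999)
    filler = "The grass is green. The sky is blue. " * n_filler
    pos = rng.randrange(len(filler))
    text = filler[:pos] + f" The passkey is {passkey}. " + filler[pos:]
    return text + "\nWhat is the passkey?", passkey

prompt, key = make_passkey_prompt()
# Accuracy of the model's answer vs. `key`, across positions and lengths,
# measures how well it retrieves information anywhere in its context window.
```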

To guarantee a fair and accurate assessment, the team re-ran all benchmarks through their own evaluation pipeline when comparing the Mixtral and Llama models in detail. The assessment covers a wide range of tasks grouped into categories such as math, code, reading comprehension, commonsense reasoning, world knowledge, and popular aggregated benchmarks.

The evaluation spans the following task categories and shot settings:

  • Commonsense reasoning (0-shot): ARC-Easy, ARC-Challenge, HellaSwag, WinoGrande, PIQA, SIQA, OpenBookQA, and CommonsenseQA
  • World knowledge (5-shot): TriviaQA and NaturalQuestions
  • Reading comprehension (0-shot): BoolQ and QuAC
  • Math: GSM8K and MATH
  • Code: HumanEval and MBPP
  • Popular aggregated benchmarks: AGIEval, BBH, and MMLU

The study has also presented Mixtral 8x7B – Instruct, a chat model optimized to follow instructions via supervised fine-tuning followed by direct preference optimization (DPO). On human evaluation benchmarks, Mixtral – Instruct has performed better than GPT-3.5 Turbo, Claude-2.1, Gemini Pro, and the Llama 2 70B chat model. Benchmarks such as BBQ and BOLD have shown reduced biases and a more balanced sentiment profile.
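
DPO trains the policy directly on preference pairs without fitting a separate reward model. A minimal sketch of the DPO objective, following Rafailov et al. (2023); this is the published loss in general form, not Mistral's training code:

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """DPO loss over a batch of preference pairs.

    Each argument is a tensor of summed log-probabilities of the chosen /
    rejected responses under the policy and a frozen reference model.
    """
    # Implicit rewards: log-ratio of policy to reference for each response.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Maximize the margin between chosen and rejected responses.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```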

In order to promote broad accessibility and a diverse range of applications, both Mixtral 8x7B and Mixtral 8x7B – Instruct have been licensed under the Apache 2.0 license, allowing both commercial and academic use. The team has also modified the vLLM project, adding MegaBlocks CUDA kernels for efficient inference.
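
Because the weights are open, the model can be served with vLLM directly. A minimal usage sketch, assuming a multi-GPU host with enough memory and the Hugging Face model ID mistralai/Mixtral-8x7B-Instruct-v0.1:

```python
from vllm import LLM, SamplingParams

# Load Mixtral-Instruct; tensor_parallel_size shards the model across GPUs.
llm = LLM(model="mistralai/Mixtral-8x7B-Instruct-v0.1", tensor_parallel_size=2)

params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["Explain sparse mixture-of-experts in two sentences."], params)
print(outputs[0].outputs[0].text)
```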

In conclusion, this study highlights the exceptional performance of Mixtral 8x7B through a thorough comparison with the Llama models on a wide range of benchmarks. Mixtral excels across a variety of tasks, from math and code problems to reading comprehension, commonsense reasoning, and general knowledge.

Check out the Paper and Code. All credit for this research goes to the researchers of this project.

Tanya Malhotra is a final-year undergraduate at the University of Petroleum & Energy Studies, Dehradun, pursuing a BTech in Computer Science Engineering with a specialization in Artificial Intelligence and Machine Learning. She is a Data Science enthusiast with strong analytical and critical-thinking skills, along with an ardent interest in acquiring new skills, leading groups, and managing work in an organized manner.
