Thursday, May 8, 2025
News PouroverAI

Meet Medusa: An Efficient Machine Learning Framework for Accelerating Large Language Models (LLMs) Inference with Multiple Decoding Heads

January 26, 2024
in AI Technology
Reading Time: 4 mins read


Large Language Models (LLMs), among the most recent advances in Artificial Intelligence (AI), have shown great improvement in language generation. With model sizes reaching billions of parameters, these models are moving into every domain, from healthcare and finance to education.

Although these models have shown impressive capabilities, the growth in model size has increased inference latency, which poses a problem for real-world applications. Memory-bound operations are the main bottleneck in LLM inference: during auto-regressive decoding, every step has to move all model parameters from High Bandwidth Memory (HBM) to the accelerator’s cache, yet produces only a single token.
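To make the memory-bound argument concrete, here is a rough back-of-the-envelope sketch. The model size, precision, and bandwidth figures are hypothetical and chosen only for illustration; the estimate ignores the KV cache, activations, and kernel overheads.

```python
def min_step_latency_ms(num_params, bytes_per_param, hbm_bandwidth_bytes_per_s):
    """Lower bound on one decoding step: every weight is read from HBM once."""
    return num_params * bytes_per_param / hbm_bandwidth_bytes_per_s * 1e3


def flops_per_weight_byte(tokens_per_step, bytes_per_param):
    """Rough arithmetic intensity: about 2 FLOPs (multiply + add) per parameter per token."""
    return 2 * tokens_per_step / bytes_per_param


# Hypothetical setup: 7B parameters, 2 bytes each (fp16), 2 TB/s of HBM bandwidth.
latency = min_step_latency_ms(7e9, 2, 2e12)
one_token = flops_per_weight_byte(1, 2)
four_tokens = flops_per_weight_byte(4, 2)
print(f"step latency floor: {latency:.1f} ms")
print(f"FLOPs per weight byte: {one_token} (1 token) vs {four_tokens} (4 tokens)")
```

Under these assumptions the weight traffic alone puts a floor on the step time regardless of how many tokens the step processes, which is why handling several tokens per step raises arithmetic intensity almost for free.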

Researchers have been working on these limitations. One direction is to decrease the number of decoding steps and increase the arithmetic intensity of the decoding process. Speculative decoding, in which a smaller draft model produces a series of tokens that the larger original model then verifies, has been suggested. However, incorporating a draft model into a distributed system is difficult.
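The draft-then-verify idea can be sketched with toy models. This is a simplified greedy variant, not the rejection-sampling scheme used in practice: the "models" are hypothetical functions mapping a token list to the next token, and the verification loop stands in for what a real system would do in a single forward pass of the large model.

```python
from typing import Callable, List

NextToken = Callable[[List[int]], int]


def speculative_step(prefix: List[int], draft_next: NextToken,
                     target_next: NextToken, k: int) -> List[int]:
    """Return the tokens produced by one draft-then-verify step (greedy variant)."""
    # 1. The small draft model proposes k tokens autoregressively.
    proposal, ctx = [], list(prefix)
    for _ in range(k):
        token = draft_next(ctx)
        proposal.append(token)
        ctx.append(token)

    # 2. The large model checks each proposed token against its own choice.
    accepted, ctx = [], list(prefix)
    for token in proposal:
        expected = target_next(ctx)
        if token != expected:
            accepted.append(expected)  # first mismatch: keep the large model's token
            return accepted
        accepted.append(token)
        ctx.append(token)

    # 3. All k proposals matched, so the large model contributes one extra token.
    accepted.append(target_next(ctx))
    return accepted


# Toy models: the target counts upward modulo 10; the draft errs after a 3.
def target(ctx: List[int]) -> int:
    return (ctx[-1] + 1) % 10


def draft(ctx: List[int]) -> int:
    return 0 if ctx[-1] == 3 else (ctx[-1] + 1) % 10


step_output = speculative_step([0], draft, target, k=4)
print(step_output)
```

In this greedy variant the output always matches what the large model would have generated on its own; the draft only changes how many tokens are produced per step.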

To overcome these challenges, a team of researchers in a recent study has presented MEDUSA, an efficient approach that enhances LLM inference by adding extra decoding heads to the backbone model to predict multiple subsequent tokens in parallel. Because these heads predict several tokens simultaneously, they avoid the difficulties of speculative decoding.
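A minimal sketch of what such extra heads might look like, in plain Python. The residual feed-forward form, the dimensions, and the random weights are assumptions for illustration and are not described in this article; the point is only that every head reads the same last hidden state and outputs its own distribution for a different future position.

```python
import math
import random


def matvec(matrix, vector):
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def silu(vector):
    return [x / (1.0 + math.exp(-x)) for x in vector]


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def medusa_head(hidden_state, w_residual, w_vocab):
    """One extra head: residual feed-forward block, then a vocabulary projection."""
    transformed = silu(matvec(w_residual, hidden_state))
    mixed = [t + h for t, h in zip(transformed, hidden_state)]
    return softmax(matvec(w_vocab, mixed))


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


# Hypothetical toy sizes; real models use thousands of hidden units.
HIDDEN, VOCAB, NUM_HEADS = 4, 6, 3
rng = random.Random(0)
heads = [(random_matrix(HIDDEN, HIDDEN, rng), random_matrix(VOCAB, HIDDEN, rng))
         for _ in range(NUM_HEADS)]

# Head k guesses the token k + 1 positions after the one the ordinary LM head predicts.
last_hidden = [0.1, -0.2, 0.3, 0.05]
predictions = [medusa_head(last_hidden, w1, w2) for w1, w2 in heads]
top1 = [max(range(VOCAB), key=p.__getitem__) for p in predictions]
print(top1)
```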

Unlike speculative decoding, MEDUSA doesn’t require a separate draft model, so it can be integrated easily into current LLM systems, even in distributed settings. The team has shared that MEDUSA builds several candidate continuations in each decoding step and verifies them concurrently using a tree-based attention mechanism. By utilizing parallel processing, MEDUSA lowers the number of necessary decoding steps while introducing very little overhead in terms of single-step latency.
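One way to picture the tree-based verification is as an attention mask over candidate tokens. The sketch below is an illustration under simplifying assumptions: each head contributes a fixed list of top choices, the tree is the full Cartesian product of those choices, and attention to the shared prompt prefix (which every node would also see) is left out.

```python
from itertools import product


def build_tree(top_tokens_per_head):
    """List every tree node as a path of choice indices, shallowest first."""
    paths = []
    for depth in range(1, len(top_tokens_per_head) + 1):
        ranges = [range(len(top_tokens_per_head[d])) for d in range(depth)]
        paths.extend(product(*ranges))
    return paths


def tree_attention_mask(paths):
    """mask[i][j] is True when node i may attend to node j (itself or an ancestor)."""
    index = {path: i for i, path in enumerate(paths)}
    size = len(paths)
    mask = [[False] * size for _ in range(size)]
    for path, i in index.items():
        for depth in range(1, len(path) + 1):
            mask[i][index[path[:depth]]] = True
    return mask


# Hypothetical top choices: 2 tokens from the first head, 3 from the second.
paths = build_tree([[5, 9], [2, 3, 7]])
mask = tree_attention_mask(paths)
for path, row in zip(paths, mask):
    print(path, "".join("x" if allowed else "." for allowed in row))
```

Because each candidate token only attends to its own ancestors, all six two-token continuations can be scored in a single forward pass without interfering with each other.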

MEDUSA rests on two key insights. First, the MEDUSA heads generate multiple candidate continuations, which are verified simultaneously. Second, an acceptance procedure selects suitable candidates. The team notes that the rejection sampling strategy used in speculative decoding can be effectively replaced by a temperature-based threshold to handle deviations.
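A threshold-based acceptance rule could look roughly like the following. The entropy-dependent form min(epsilon, delta * exp(-H)) and the default values of `epsilon` and `delta` are assumptions for illustration and are not stated in this article; the probabilities stand in for what the backbone model assigns at each candidate position.

```python
import math


def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0)


def accepts(probs, token, epsilon=0.09, delta=0.3):
    """Accept a candidate token if the backbone gives it enough probability.

    The bar is lower when the distribution is flat (high entropy), since many
    tokens are then plausible. epsilon and delta are hypothetical values.
    """
    threshold = min(epsilon, delta * math.exp(-entropy(probs)))
    return probs[token] > threshold


def accepted_prefix_len(step_probs, candidate, epsilon=0.09, delta=0.3):
    """Length of the longest prefix of the candidate in which every token is accepted."""
    count = 0
    for probs, token in zip(step_probs, candidate):
        if not accepts(probs, token, epsilon, delta):
            break
        count += 1
    return count


peaked = [0.97, 0.01, 0.01, 0.01]
flat = [0.25, 0.25, 0.25, 0.25]
kept = accepted_prefix_len([peaked, flat, peaked], candidate=[0, 2, 1])
print(kept)
```

Among several candidate continuations, the one with the longest accepted prefix would be kept for the step.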

The study proposes two methods for fine-tuning the MEDUSA heads, as follows.

MEDUSA-1: This allows lossless inference acceleration by fine-tuning the MEDUSA heads directly on top of a frozen backbone LLM. MEDUSA-1 is recommended when incorporating MEDUSA into an existing model or in settings with limited computational resources. It uses less memory and can be made even more efficient by applying quantization techniques.

MEDUSA-2: This method fine-tunes the MEDUSA heads and the backbone LLM together. While it offers a greater speedup and improved prediction accuracy for the MEDUSA heads, it requires a special training recipe to preserve the backbone model’s capabilities. MEDUSA-2 is appropriate when resources are plentiful, as it permits joint training of the MEDUSA heads and the backbone model without sacrificing output quality or next-token prediction ability.
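The difference between the two recipes can be sketched through their training objectives. This is an illustration under stated assumptions: the per-head decay weight and the `lambda0` balance factor are hypothetical, and a real implementation would compute these losses over batches with an autodiff framework rather than on single probabilities.

```python
import math


def medusa_heads_loss(head_probs, targets, decay=0.8):
    """Weighted cross-entropy over the extra heads.

    head_probs[k] is head k's distribution over the vocabulary and targets[k]
    is the true token at that head's position. Later heads get smaller weights
    because distant tokens are harder to predict (decay is an assumed value).
    """
    return sum((decay ** (k + 1)) * -math.log(head_probs[k][targets[k]])
               for k in range(len(targets)))


def joint_loss(next_token_prob, head_probs, targets, lambda0=0.2, decay=0.8):
    """Ordinary next-token loss plus a down-weighted heads loss."""
    lm_loss = -math.log(next_token_prob)
    return lm_loss + lambda0 * medusa_heads_loss(head_probs, targets, decay)


head_probs = [[0.5, 0.5], [0.25, 0.75]]
targets = [0, 1]

# Frozen backbone: only the heads' loss is optimized.
heads_only = medusa_heads_loss(head_probs, targets)
# Joint training: the backbone's own next-token loss stays in the objective.
combined = joint_loss(0.9, head_probs, targets)
print(round(heads_only, 4), round(combined, 4))
```

Keeping the next-token loss in the joint objective is one plausible way to protect the backbone's original prediction ability while the heads are trained alongside it.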

The research has also suggested several extensions to enhance or broaden the use of MEDUSA. These include a typical acceptance scheme to increase the acceptance rate without sacrificing generation quality and a self-distillation method for cases where no training data is available. The team has shared that MEDUSA was evaluated on models of different sizes and training protocols. The results demonstrate that MEDUSA-1 can accelerate inference by more than 2.2× without sacrificing generation quality. Moreover, the speedup improves to 2.3-3.6× with MEDUSA-2.
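As a rough guide to how such speedups arise, the wall-clock gain can be thought of as the average number of tokens accepted per decoding step divided by the relative cost of each step. The function and the numbers below are hypothetical and are not measurements from the study.

```python
def estimated_speedup(tokens_per_step, step_latency_ratio):
    """Wall-clock speedup over one-token-per-step decoding.

    tokens_per_step: average tokens accepted per decoding step.
    step_latency_ratio: latency of one step with extra heads and tree
    verification, relative to a plain decoding step (1.0 = no overhead).
    """
    if step_latency_ratio <= 0:
        raise ValueError("step_latency_ratio must be positive")
    return tokens_per_step / step_latency_ratio


# Hypothetical: 2.5 tokens accepted per step, each step 25% slower than baseline.
speedup = estimated_speedup(2.5, 1.25)
print(f"estimated speedup: {speedup:.2f}x")
```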

Check out the Paper and Github. All credit for this research goes to the researchers of this project.


Tanya Malhotra is a final-year undergraduate at the University of Petroleum & Energy Studies, Dehradun, pursuing a BTech in Computer Science Engineering with a specialization in Artificial Intelligence and Machine Learning. She is a Data Science enthusiast with good analytical and critical thinking skills, along with an ardent interest in acquiring new skills, leading groups, and managing work in an organized manner.

Copyright © 2023 PouroverAI News.