This AI Research Introduces Fast and Expressive LLM Inference with RadixAttention and SGLang

January 24, 2024
in AI Technology

Advanced prompting techniques, control flow, interaction with external environments, chained generation calls, and other complex tasks are expanding how Large Language Models (LLMs) are used. Effective methods for developing and running such programs, however, are severely lacking. LMSYS ORG presents SGLang, a Structured Generation Language for LLMs that co-designs the backend runtime system and the frontend language. SGLang makes interactions with LLMs faster and more controllable.

Backend: Automatic KV Cache Reuse with RadixAttention

LLM programs often chain many calls that share long prompt prefixes, creating opportunities to reuse the KV cache across calls. To exploit these reuse opportunities systematically, the team introduces RadixAttention, a new technique for automatic KV cache reuse at runtime. Rather than discarding the KV cache when a generation request completes, RadixAttention keeps the cache for both prompts and generation results in a radix tree, a data structure that supports efficient prefix search, insertion, and eviction. To improve the cache hit rate, the researchers pair a Least Recently Used (LRU) eviction policy with a cache-aware scheduling policy. An SGLang program can be eagerly executed through an interpreter or traced as a dataflow graph and run with a graph executor; in the latter case, compiler optimizations such as code movement, instruction selection, and auto-tuning become possible.
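
To make the mechanism concrete, here is a minimal, self-contained Python sketch of a prefix KV cache with LRU eviction. It is illustrative only: it uses a plain trie (one token per edge) and string placeholders for KV tensors, whereas SGLang's actual RadixAttention operates on a radix tree over GPU memory. All class and method names here are hypothetical.

```python
import time

class _Node:
    """One trie node; SGLang's real radix tree merges token runs per edge."""
    __slots__ = ("children", "kv", "last_access")

    def __init__(self):
        self.children = {}      # token id -> _Node
        self.kv = None          # placeholder for one token's KV cache entry
        self.last_access = 0.0  # timestamp consulted by LRU eviction

class PrefixKVCache:
    """Toy prefix cache illustrating RadixAttention-style reuse (not SGLang's code)."""

    def __init__(self, capacity=1024):
        self.root = _Node()
        self.capacity = capacity  # max number of cached token entries
        self.size = 0

    def match_prefix(self, tokens):
        """Return how many leading tokens of `tokens` already have cached KV."""
        node, hit, now = self.root, 0, time.monotonic()
        for t in tokens:
            child = node.children.get(t)
            if child is None:
                break
            child.last_access = now
            node, hit = child, hit + 1
        return hit

    def insert(self, tokens, kv_entries):
        """Store KV entries for `tokens`, sharing nodes with any cached prefix."""
        # Evict cold leaves first so the new entries fit; the real system
        # also pins prefixes that running requests still need.
        while self.size + len(tokens) > self.capacity and self.size > 0:
            self._evict_lru_leaf()
        node, now = self.root, time.monotonic()
        for t, kv in zip(tokens, kv_entries):
            child = node.children.get(t)
            if child is None:
                child = _Node()
                node.children[t] = child
                self.size += 1
            child.kv = kv
            child.last_access = now
            node = child

    def _evict_lru_leaf(self):
        """Remove the least recently used leaf, mirroring the paper's LRU policy."""
        best = None  # (last_access, parent, token)
        stack = [(self.root, None, None)]
        while stack:
            node, parent, token = stack.pop()
            if parent is not None and not node.children:
                if best is None or node.last_access < best[0]:
                    best = (node.last_access, parent, token)
            for t, c in node.children.items():
                stack.append((c, node, t))
        if best is not None:
            del best[1].children[best[2]]
            self.size -= 1

cache = PrefixKVCache(capacity=8)
cache.insert([1, 2, 3, 4], ["kv1", "kv2", "kv3", "kv4"])
print(cache.match_prefix([1, 2, 3, 9]))  # -> 3: the shared prefix is reused
```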

Frontend: Easy LLM Programming with SGLang

On the frontend, the team presents SGLang as a domain-specific language embedded in Python. Complex prompting techniques, control flow, multi-modality, decoding constraints, and external interaction can all be expressed concisely with it. An SGLang function can be run through local models, OpenAI, Anthropic, and Gemini.
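
To give a flavor of the frontend, here is a small multi-turn program in the style of SGLang's published examples. The decorator, the chat-role helpers, and the gen primitive follow the public API of that era; treat the exact names and backend setup as illustrative, since the API may have changed across versions.

```python
import sglang as sgl

@sgl.function
def multi_turn_qa(s, question_1, question_2):
    # Each sgl.gen call asks the model to fill in a named variable.
    s += sgl.system("You are a helpful assistant.")
    s += sgl.user(question_1)
    s += sgl.assistant(sgl.gen("answer_1", max_tokens=256))
    s += sgl.user(question_2)
    s += sgl.assistant(sgl.gen("answer_2", max_tokens=256))

# Any supported backend works: a local model, OpenAI, Anthropic, or Gemini.
sgl.set_default_backend(sgl.OpenAI("gpt-3.5-turbo"))

state = multi_turn_qa.run(
    question_1="What is the capital of France?",
    question_2="List two landmarks there.",
)
print(state["answer_1"])
print(state["answer_2"])
```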

As the team notes, much of SGLang's syntax takes cues from Guidance. SGLang additionally handles batching and intra-program parallelism and introduces new primitives, which together make it considerably more expressive than prior interfaces.
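
A hedged sketch of those primitives, again modeled on the public examples: fork launches parallel generation branches within one program, and run_batch executes many program instances at once (exact semantics may differ by version).

```python
import sglang as sgl

@sgl.function
def tip_suggestion(s, topic):
    s += "Here are two short tips about " + topic + ".\n"
    # Intra-program parallelism: both forked branches can run concurrently.
    forks = s.fork(2)
    for i, f in enumerate(forks):
        f += f"Tip {i + 1}: " + sgl.gen("tip", max_tokens=64, stop="\n")
    # Join the branches back into the main prompt state.
    s += "Tip 1: " + forks[0]["tip"] + "\nTip 2: " + forks[1]["tip"] + "\n"

# Batching: submit several independent program instances; the runtime
# batches their requests against the backend.
states = tip_suggestion.run_batch([{"topic": "cooking"}, {"topic": "hiking"}])
```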

The researchers measured the throughput their system achieved on the following typical LLM workloads:

MMLU: a 5-shot, multiple-choice, multi-task benchmark.

HellaSwag: a 20-shot, multiple-choice sentence-completion benchmark.

ReAct Agent: an agent task based on prompt traces taken from the original ReAct paper.

Tree-of-Thought: a GSM-8K problem-solving prompt based on custom tree searches.

JSON Decode: parsing a Wikipedia article and returning its data in JSON format (a constrained-decoding sketch follows this list).

Chat (short): a synthetic chat benchmark in which each conversation consists of four turns with brief LLM outputs.

Chat (long): the same synthetic chat benchmark with four turns per conversation and long LLM outputs.

DSPy RAG: a retrieval-augmented generation pipeline from the DSPy tutorial.

LLaVA Bench: running the vision-language model LLaVA v1.5 on the LLaVA-in-the-wild benchmark.
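
For the JSON Decode workload, constrained decoding is the relevant frontend feature. The sketch below assumes the regex-constrained gen parameter shown in SGLang's examples; the regex and field names are made up for illustration.

```python
import sglang as sgl

# Hypothetical regex forcing a tiny JSON schema on the model's output.
CITY_SCHEMA = r'\{"name": "[\w ]+", "population": [0-9]+\}'

@sgl.function
def city_info(s, city):
    s += f"Give information about {city} as JSON.\n"
    # The regex constraint restricts decoding to strings matching the schema.
    s += sgl.gen("json_output", max_tokens=64, regex=CITY_SCHEMA)
```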

Using the Llama-7B and Mixtral-8x7B models on NVIDIA A10G GPUs, the team applied SGLang to typical LLM workloads such as agent, reasoning, extraction, chat, and few-shot learning tasks, with Hugging Face TGI v1.3.0, Guidance v0.1.8, and vLLM v0.2.5 as baselines. SGLang outperforms these systems by a factor of up to five in throughput. It also performed well in latency tests, especially time-to-first-token, where a prefix cache hit is very useful. Current systems handle sophisticated LLM programs poorly; while developing the SGLang runtime, the team observed a critical optimization opportunity: KV cache reuse. By reusing the KV cache, prompts that share the same prefix can share the intermediate KV cache, saving both memory and computation. Many such reuse opportunities arise in complicated programs that involve many LLM calls. The automatic KV cache reuse of RadixAttention, the interpreter's ability to provide intra-program parallelism, and the co-design of the frontend and backend systems all contribute to these benefits.
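
To run such local-model workloads yourself, the SGLang README of the time documented launching a runtime server and pointing the frontend at its endpoint; the pattern below follows that documentation, though the flags and model path are examples and may have changed since.

```python
import sglang as sgl

# First, start a local runtime server (command from the README of that era):
#   python -m sglang.launch_server \
#       --model-path meta-llama/Llama-2-7b-chat-hf --port 30000

# Then connect the frontend to the local endpoint instead of a hosted API.
sgl.set_default_backend(sgl.RuntimeEndpoint("http://localhost:30000"))
```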

Check out the Code and Blog. All credit for this research goes to the researchers of this project.

Tags: Expressive, Fast, Inference, Introduces, LLM, RadixAttention, Research, SGLang