KV-Runahead: Scalable Causal LLM Inference by Parallel Key-Value Cache Generation

Large language model (LLM) inference has two phases: the prompt (or prefill) phase, which outputs the first token, and the extension (or decoding) phase, which generates subsequent tokens. In this work, we propose KV-Runahead, a parallelization scheme that accelerates the prompt phase. The key observation is that the extension phase generates tokens faster than the prompt phase because of the key-value cache (KV-cache). KV-Runahead therefore parallelizes the prompt phase by coordinating multiple processes to populate the KV-cache, which reduces the time-to-first-token (TTFT). Repurposing the KV-cache has two benefits. First, because the KV-cache is designed around the causal attention map, computation is minimized automatically. Second, because the KV-cache already exists for the extension phase, KV-Runahead is easy to implement. We also propose context-level load-balancing to handle the uneven KV-cache generation caused by causal attention and to optimize TTFT. Our experiments show that KV-Runahead achieves speedups of over 1.4× and 1.6× for Llama 7B and Falcon 7B respectively, compared with existing parallelization schemes such as tensor or sequential parallelization, in which keys and values are generated locally and exchanged via all-gather collectives.
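To make the load-balancing point concrete, the following minimal sketch counts how many (query, key) attention pairs each process would handle when a prompt is split into contiguous chunks under a causal mask. It is an illustration only, not the paper's implementation: the function names are hypothetical, the cost model counts attention pairs and ignores the KV-cache hand-off between processes, and the square-root partition rule is an assumption chosen for simplicity, since the abstract does not say how the partition is selected.

```python
import math


def causal_pairs(start, end):
    """Count (query, key) pairs for queries at positions [start, end).

    Under a causal mask, the query at 0-based position t attends to
    keys 0..t, which is t + 1 keys.
    """
    return (end * (end + 1) - start * (start + 1)) // 2


def even_boundaries(n_tokens, n_procs):
    """Split the context into equal-length chunks (hypothetical helper)."""
    return [round(n_tokens * k / n_procs) for k in range(n_procs + 1)]


def balanced_boundaries(n_tokens, n_procs):
    """Split the context so each chunk has roughly equal attention work.

    Assumption: cumulative work up to position e grows like e^2 / 2,
    so boundary k is placed near n_tokens * sqrt(k / n_procs).
    """
    return [round(n_tokens * math.sqrt(k / n_procs)) for k in range(n_procs + 1)]


def per_process_work(bounds):
    """Attention pairs handled by each process for the given boundaries."""
    return [causal_pairs(s, e) for s, e in zip(bounds, bounds[1:])]


if __name__ == "__main__":
    n_tokens, n_procs = 1024, 4
    for name, split in (("even", even_boundaries), ("balanced", balanced_boundaries)):
        bounds = split(n_tokens, n_procs)
        work = per_process_work(bounds)
        print(name, bounds, work, "max:", max(work))
```

With an even split, later chunks attend to a longer prefix, so the last process carries the most work and bounds the TTFT. Giving earlier processes longer chunks, as the balanced split does, evens out the per-process load under this simplified cost model.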
