Paged Attention & Prefix Caching Now Available in MAX Serve

AI
LLM
MAX
Performance
Announcing state-of-the-art optimizations for LLM inference in MAX Serve - originally published on Modular’s blog.
Author

Ehsan M. Kermani

Published

November 1, 2024

I wrote a post on the Modular blog about two optimizations that just landed in MAX Serve: paged attention and prefix caching. Both target the KV cache in multi-head attention, but from different angles. Paged attention cuts the memory the cache wastes, and prefix caching avoids recomputing keys and values for prompt prefixes the server has already processed.
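To make the two ideas concrete, here is a toy sketch in Python. It is not MAX Serve's implementation or API: `ToyPagedKVCache`, `BLOCK_SIZE`, and `allocate` are hypothetical names I made up for illustration, and the "cache" only tracks block ids rather than real key/value tensors. The paging part is that each prompt gets a table of fixed-size blocks instead of one contiguous buffer. The prefix caching part is that a full block whose entire token prefix has been seen before is shared rather than allocated again.

```python
BLOCK_SIZE = 4  # tokens per KV block; hypothetical value chosen for readability


class ToyPagedKVCache:
    """Toy block allocator: tracks block ids only, not real key/value tensors."""

    def __init__(self, num_blocks):
        self.free_blocks = list(range(num_blocks))
        self.prefix_index = {}  # full token prefix (as a tuple) -> block id

    def allocate(self, tokens):
        """Return (block_table, reused_tokens) for one prompt."""
        block_table, reused, can_reuse = [], 0, True
        for start in range(0, len(tokens), BLOCK_SIZE):
            end = start + BLOCK_SIZE
            is_full = end <= len(tokens)
            # Key on the whole prefix, since a block's keys and values
            # depend on every token that came before it.
            key = tuple(tokens[:end])
            if can_reuse and is_full and key in self.prefix_index:
                # Prefix caching: share the block computed for an earlier prompt.
                block = self.prefix_index[key]
                reused += BLOCK_SIZE
            else:
                # Paging: take any free block; blocks need not be contiguous.
                can_reuse = False
                if not self.free_blocks:
                    raise MemoryError("out of KV blocks")
                block = self.free_blocks.pop(0)
                if is_full:
                    self.prefix_index[key] = block
            block_table.append(block)
        return block_table, reused


# Two prompts that share an 8-token system prompt.
system_prompt = [1, 2, 3, 4, 5, 6, 7, 8]
cache = ToyPagedKVCache(num_blocks=8)
table_a, reused_a = cache.allocate(system_prompt + [9, 10])
table_b, reused_b = cache.allocate(system_prompt + [11, 12, 13])
print(table_a, reused_a)
print(table_b, reused_b)
```

The second prompt reuses the two blocks that cover the shared system prompt and only needs one new block for its own tail. A real server also has to free blocks, evict cached prefixes under memory pressure, and run the attention kernel over non-contiguous blocks, all of which this sketch leaves out.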

In the article I explain what each one does, how to turn them on, and what the speedups look like in practice.

Read it here: Paged Attention & Prefix Caching Now Available in MAX Serve.