---
id: 20260509-T0-07
title: "Faster Hybrid-Architecture LLM Inference: Sparse Prefix Caching Doubles Serving Efficiency for State-Space Models"
title_en: "Sparse Prefix Caching Accelerates Hybrid LLM Serving via State Reuse"
url: https://ai.daily.yangsir.net/daily/20260509-T0-07
issue_date: 2026-05-09
publish_date: 2026-05-08T04:00:00.000Z
source_name: "arXiv cs.LG (ML)"
source_url: https://arxiv.org/abs/2605.05219
---

# Faster Hybrid-Architecture LLM Inference: Sparse Prefix Caching Doubles Serving Efficiency for State-Space Models

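Existing prefix caching schemes assume that every token's key/value pairs are densely reused, but state-space models such as Mamba change that premise: a recurrent layer can resume computation from a single stored state and does not need a per-token cache. The researchers propose Sparse Prefix Caching, a scheme designed specifically for serving hybrid (Transformer+SSM) and purely recurrent models. It substantially reduces cache overhead and inference latency, which makes it directly relevant to server-side deployments of hybrid-architecture LLMs.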

## English Version

**Sparse Prefix Caching Accelerates Hybrid LLM Serving via State Reuse**

Existing prefix caching assumes dense per-token key/value reuse, but state-space models change this: recurrent layers can resume from a single stored state. Researchers propose Sparse Prefix Caching for hybrid (Transformer+SSM) and recurrent LLM serving, slashing cache overhead and inference latency for production deployments.
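
To make the state-reuse idea concrete, here is a minimal Python sketch. It is not the paper's algorithm: the scalar recurrence, the coefficients `A` and `B`, the `SparseStateCache` class, and the fixed-stride checkpoint policy are all hypothetical choices made for illustration. It only shows that a recurrent layer can resume from one stored state per checkpoint instead of one cache entry per token.

```python
from typing import Dict, List, Tuple

# Hypothetical coefficients for a toy scalar recurrence (not from the paper).
A, B = 0.9, 0.1


def step(state: float, token: int) -> float:
    """One recurrent update: h_t = A * h_{t-1} + B * x_t."""
    return A * state + B * token


class SparseStateCache:
    """Stores one recurrent state every `stride` tokens, keyed by the prefix."""

    def __init__(self, stride: int = 4) -> None:
        self.stride = stride
        self._states: Dict[Tuple[int, ...], float] = {}

    def run(self, tokens: List[int]) -> Tuple[float, int]:
        """Process `tokens`; return (final_state, number_of_steps_computed)."""
        start, state = 0, 0.0
        # Look for the longest cached prefix whose length is a multiple of stride.
        longest = len(tokens) - len(tokens) % self.stride
        for n in range(longest, 0, -self.stride):
            key = tuple(tokens[:n])
            if key in self._states:
                start, state = n, self._states[key]
                break
        # Resume the recurrence from the checkpoint and add new checkpoints.
        for i in range(start, len(tokens)):
            state = step(state, tokens[i])
            if (i + 1) % self.stride == 0:
                self._states[tuple(tokens[: i + 1])] = state
        return state, len(tokens) - start


def run_from_scratch(tokens: List[int]) -> float:
    """Reference result computed without any cache."""
    state = 0.0
    for t in tokens:
        state = step(state, t)
    return state


cache = SparseStateCache(stride=4)
_, first_steps = cache.run([1, 2, 3, 4, 5, 6, 7, 8, 9])
shared_prefix_prompt = [1, 2, 3, 4, 5, 6, 7, 8, 10, 11]
resumed_state, resumed_steps = cache.run(shared_prefix_prompt)
```

In this sketch the second prompt shares its first eight tokens with the first one, so the cache resumes from the checkpoint stored at position 8 and recomputes only the last two tokens. The stride trades memory (fewer stored states) against the number of tokens that must be recomputed after the nearest checkpoint.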

---

**Source**: [arXiv cs.LG (ML)](https://arxiv.org/abs/2605.05219)

**Details page**: https://ai.daily.yangsir.net/daily/20260509-T0-07

---

*智语观潮 · Daily — https://ai.daily.yangsir.net/llms.txt*