---
id: 20260514-T0-03
title: "SOMA用小模型接管多轮对话上下文，LLM推理成本大幅降低"
title_en: "SOMA Cuts LLM Serving Costs in Multi-Turn Dialogs Using Small Models"
url: https://ai.daily.yangsir.net/daily/20260514-T0-03
issue_date: 2026-05-14
publish_date: 2026-05-13T04:00:00.000Z
category: research
source_name: "arXiv cs.CL (NLP)"
source_url: https://arxiv.org/abs/2605.11317
---

# SOMA Cuts LLM Serving Costs in Multi-Turn Dialogs Using Small Models

In multi-turn dialog, the standard practice is to concatenate the full conversation history and send it to the LLM on every turn, so GPU memory use and inference latency climb steeply as the number of turns grows. The paper proposes SOMA, which uses a small language model to process and maintain the conversational context state and passes only the necessary information to the large model. This division of labor between small and large models substantially reduces the LLM's input token count and GPU memory footprint. Teams building dialog systems can adopt a similar architecture to lower API and inference costs without sacrificing conversation quality.
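To make the cost argument concrete, here is a minimal sketch that compares the input size of full-history prompting against a compact-state approach. It is not SOMA's actual method, which this brief does not detail. Every name in it (`FullHistoryDialog`, `CompactStateDialog`, `stub_small_model`, `count_tokens`) is hypothetical, the "small model" is a trivial stub that keeps only the most recent words, and tokens are approximated by whitespace-separated words.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


def count_tokens(text: str) -> int:
    """Crude stand-in for a tokenizer: counts whitespace-separated words."""
    return len(text.split())


def stub_small_model(state: str, user_msg: str, reply: str, max_words: int = 30) -> str:
    """Hypothetical placeholder for the small model.

    A real system would summarize or otherwise compress the context here.
    This stub only keeps the most recent `max_words` words.
    """
    words = f"{state} user: {user_msg} assistant: {reply}".split()
    return " ".join(words[-max_words:])


@dataclass
class FullHistoryDialog:
    """Baseline: every prompt carries the whole conversation so far."""
    history: List[str] = field(default_factory=list)

    def build_prompt(self, user_msg: str) -> str:
        return "\n".join(self.history + [f"user: {user_msg}"])

    def record(self, user_msg: str, reply: str) -> None:
        self.history += [f"user: {user_msg}", f"assistant: {reply}"]


@dataclass
class CompactStateDialog:
    """Sketch: a small model maintains a bounded context state for the LLM."""
    update_state: Callable[[str, str, str], str] = stub_small_model
    state: str = ""

    def build_prompt(self, user_msg: str) -> str:
        return f"context: {self.state}\nuser: {user_msg}"

    def record(self, user_msg: str, reply: str) -> None:
        self.state = self.update_state(self.state, user_msg, reply)


def simulate(turns: int = 20) -> Tuple[int, int]:
    """Return total LLM input tokens for (full history, compact state)."""
    full, compact = FullHistoryDialog(), CompactStateDialog()
    full_tokens = compact_tokens = 0
    for i in range(turns):
        user_msg = f"question number {i} about the order status"
        reply = f"answer number {i} with some details"  # canned LLM reply
        full_tokens += count_tokens(full.build_prompt(user_msg))
        compact_tokens += count_tokens(compact.build_prompt(user_msg))
        full.record(user_msg, reply)
        compact.record(user_msg, reply)
    return full_tokens, compact_tokens


if __name__ == "__main__":
    full_total, compact_total = simulate(20)
    print(f"full history: {full_total} tokens, compact state: {compact_total} tokens")
```

In this toy setup the full-history prompt grows with every turn, so total input grows roughly quadratically in the number of turns, while the compact prompt stays bounded by the state size. Whether answer quality holds up depends entirely on how well the small model preserves the relevant context, which the stub above does not attempt to model.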

---

**Source**: [arXiv cs.CL (NLP)](https://arxiv.org/abs/2605.11317)

**Details page**: https://ai.daily.yangsir.net/daily/20260514-T0-03

---

*智语观潮 · Daily — https://ai.daily.yangsir.net/llms.txt*