---
id: 20260324-T0-22
title: "MoE模型推理提速：预测专家策略减少内存占用"
title_en: "Speculating Experts Accelerates MoE Inference by Reducing Memory Load"
url: https://ai.daily.yangsir.net/daily/20260324-T0-22
issue_date: 2026-03-24
publish_date: 2026-03-23T04:00:00.000Z
category: research
source_name: "arXiv cs.LG (ML)"
source_url: https://arxiv.org/abs/2603.19289
---

# Faster MoE Inference: Predicting Experts to Reduce Memory Usage

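An arXiv study proposes an expert-prediction strategy that significantly speeds up Mixture-of-Experts (MoE) inference in memory-constrained environments. MoE models expand the capacity of large language models through sparse activation, but loading expert weights becomes the bottleneck. The new method predicts which experts will be activated next and loads their weights ahead of time, reducing peak memory usage by up to 40%. Tests show a 2.1x inference speedup for GPT-2XL with 8GB of GPU memory.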

## English Version

**Speculating Experts Accelerates MoE Inference by Reducing Memory Load**

New research proposes a speculation strategy that accelerates Mixture-of-Experts inference under memory constraints. By predicting which experts will be activated next, the method pre-loads weights, reducing peak memory usage by 40%. Testing shows GPT-2XL achieves 2.1x faster inference on 8GB GPUs.
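
The idea of predicting upcoming experts and loading them early can be illustrated with a toy simulation. This is a minimal sketch under stated assumptions, not the paper's method: `ExpertCache`, `same_index_predictor`, `run`, and the routing trace are all hypothetical names and data invented for illustration, and the "loads" are counted rather than actually overlapped with compute.

```python
from collections import OrderedDict


class ExpertCache:
    """Toy LRU cache standing in for limited fast memory (hypothetical)."""

    def __init__(self, capacity):
        self.capacity = capacity        # max experts resident at once
        self.resident = OrderedDict()   # expert key -> placeholder weights
        self.blocking_loads = 0         # loads that stall the forward pass
        self.prefetch_loads = 0         # loads issued ahead of use

    def _load(self, key):
        # Stand-in for copying weights from slow memory into fast memory.
        self.resident[key] = f"weights{key}"
        while len(self.resident) > self.capacity:
            self.resident.popitem(last=False)  # evict least recently used

    def prefetch(self, keys):
        for key in keys:
            if key in self.resident:
                self.resident.move_to_end(key)
            else:
                self._load(key)
                self.prefetch_loads += 1

    def get(self, key):
        if key not in self.resident:
            self._load(key)
            self.blocking_loads += 1
        self.resident.move_to_end(key)
        return self.resident[key]


def same_index_predictor(layer, experts):
    # Hypothetical heuristic: guess the next layer routes to the same indices.
    return [(layer + 1, e) for e in experts]


def run(routing, capacity, predictor=None):
    cache = ExpertCache(capacity)
    for layer, experts in enumerate(routing):
        for e in experts:
            cache.get((layer, e))  # a miss here would stall real inference
        if predictor is not None and layer + 1 < len(routing):
            cache.prefetch(predictor(layer, experts))
    return cache


# Made-up routing trace: expert indices chosen at each of four layers.
routing = [[0, 1], [0, 1], [2, 3], [2, 3]]
baseline = run(routing, capacity=4)
speculative = run(routing, capacity=4, predictor=same_index_predictor)
print(baseline.blocking_loads, speculative.blocking_loads)  # 8 4
```

On this made-up trace, blocking loads drop from 8 to 4, while two of the six prefetches are wasted on a wrong guess at the third layer. A real system would presumably issue the prefetch asynchronously so that it overlaps with the current layer's computation, and would use a learned or statistical predictor instead of the same-index heuristic assumed here.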

---

**Source**: [arXiv cs.LG (ML)](https://arxiv.org/abs/2603.19289)

**Details page**: https://ai.daily.yangsir.net/daily/20260324-T0-22

---

*智语观潮 · Daily — https://ai.daily.yangsir.net/llms.txt*