---
id: 20260514-T0-07
title: "On-Policy Distillation暗藏三大坑：机制分析揭示训练不稳定的根源"
title_en: "Study Exposes Three Pitfalls in On-Policy Distillation for LLMs"
url: https://ai.daily.yangsir.net/daily/20260514-T0-07
issue_date: 2026-05-14
publish_date: 2026-05-13T04:00:00.000Z
category: research
source_name: "arXiv cs.AI"
source_url: https://arxiv.org/abs/2605.11182
---

# Three Hidden Pitfalls in On-Policy Distillation: Mechanistic Analysis Reveals the Root Causes of Training Instability

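On-policy distillation (OPD) and self-distillation (OPSD) are popular methods for LLM post-training: they provide dense token-level supervision on trajectories sampled from the model's own policy. The study finds, however, that current practice has several widely overlooked defects that cause training instability and even performance degradation. The paper systematically analyzes the mechanisms behind these defects and provides corresponding fixes. Teams that optimize models with distillation should re-examine their current training pipelines for the same problems, to avoid wasting compute on ineffective training.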
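To make "dense token-level supervision on on-policy trajectories" concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the paper's method: the per-token reverse KL divergence is one commonly used distillation signal and may differ from the objective the paper studies, and `toy_student`, `toy_teacher`, and `on_policy_token_losses` are hypothetical names invented for this example.

```python
import math
import random


def softmax(logits):
    """Convert a list of logits into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reverse_kl(student_probs, teacher_probs, eps=1e-12):
    """KL(student || teacher) at a single token position."""
    return sum(
        p * (math.log(p + eps) - math.log(q + eps))
        for p, q in zip(student_probs, teacher_probs)
    )


def on_policy_token_losses(student_step, teacher_step, prompt, max_new_tokens, rng):
    """Sample a trajectory from the student and score every position.

    student_step / teacher_step are hypothetical callables that map a list
    of token ids to next-token logits. The trajectory is drawn from the
    student (on-policy), and the teacher supplies one loss per token.
    """
    tokens = list(prompt)
    losses = []
    for _ in range(max_new_tokens):
        p = softmax(student_step(tokens))
        q = softmax(teacher_step(tokens))
        losses.append(reverse_kl(p, q))  # dense signal: one value per token
        next_token = rng.choices(range(len(p)), weights=p, k=1)[0]
        tokens.append(next_token)  # the student's own sample extends the prefix
    return tokens, losses


VOCAB = 4  # toy vocabulary size (assumption for the demo)


def toy_student(tokens):
    # Stand-in for a student model: logits depend on the prefix length.
    return [0.1 * ((len(tokens) + i) % VOCAB) for i in range(VOCAB)]


def toy_teacher(tokens):
    # Stand-in for a teacher model: logits depend on the last token.
    return [0.5 * ((tokens[-1] + i) % VOCAB) for i in range(VOCAB)]


rng = random.Random(0)
trajectory, losses = on_policy_token_losses(toy_student, toy_teacher, [0], 5, rng)
mean_loss = sum(losses) / len(losses)
```

In a real pipeline the mean of these per-token values would typically be minimized with respect to the student's parameters. The point of the sketch is only that the prefix at every step comes from the student's own samples, so any instability in sampling feeds directly into the supervision signal.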

## English Version

**Study Exposes Three Pitfalls in On-Policy Distillation for LLMs**

On-policy distillation and self-distillation are popular post-training methods for LLMs, but this study identifies overlooked defects causing training instability and performance degradation. The paper systematically analyzes how these pitfalls emerge and provides fixes. Teams using distillation methods should audit their training pipelines for these issues to avoid wasting compute on ineffective training runs.

---

**Source**: [arXiv cs.AI](https://arxiv.org/abs/2605.11182)

**Details page**: https://ai.daily.yangsir.net/daily/20260514-T0-07

---

*智语观潮 · Daily — https://ai.daily.yangsir.net/llms.txt*