---
id: 20260509-T0-04
title: "把结果反馈拆成分步信号，LLM推理强化学习学得更快更准"
title_en: "Outcome-to-Process Supervision Converts Sparse Rewards into Step-Level Signals"
url: https://ai.daily.yangsir.net/daily/20260509-T0-04
issue_date: 2026-05-09
publish_date: 2026-05-08T04:00:00.000Z
source_name: "arXiv cs.LG (ML)"
source_url: https://arxiv.org/abs/2605.05226
---

# Splitting Outcome Feedback into Step-Level Signals Lets LLM Reasoning RL Learn Faster and More Accurately

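The core difficulty in using reinforcement learning to train LLM reasoning is that feedback arrives only at the end of the sequence (outcome supervision). That signal is too coarse: the model cannot tell which steps were right and which were wrong. The researchers propose a new paradigm that internalizes the end-of-sequence outcome feedback into fine-grained, process-level supervision signals. Every reasoning step then receives a learning signal, so training efficiency and reasoning accuracy improve together. The effect is pronounced on tasks that require long reasoning chains, such as math and code generation.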

## English Version

**Outcome-to-Process Supervision Converts Sparse Rewards into Step-Level Signals**

The core challenge in RL for LLM reasoning is sparse outcome-level feedback. Researchers propose internalizing outcome supervision into process supervision, converting end-of-sequence feedback into fine-grained step-level signals. Each reasoning step gets a learning signal, boosting both training efficiency and accuracy on long-chain reasoning tasks like math and code generation.
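
The contrast between a single end-of-sequence reward and per-step signals can be made concrete with a toy sketch. This summary does not describe the paper's actual mechanism, so the code below swaps in a common stand-in, value-difference credit, in which each step is rewarded by the change in estimated success probability. The function names and the `prefix_values` estimates are hypothetical and exist only for illustration.

```python
from typing import List


def sparse_rewards(num_steps: int, outcome: float) -> List[float]:
    """Outcome supervision: only the final step carries a reward."""
    if num_steps < 1:
        raise ValueError("num_steps must be at least 1")
    return [0.0] * (num_steps - 1) + [outcome]


def step_level_rewards(prefix_values: List[float], outcome: float) -> List[float]:
    """Value-difference credit (an assumed stand-in, not the paper's method).

    prefix_values[t] is a hypothetical estimate of the success probability
    just before step t is taken. The observed outcome serves as the value
    after the last step, so each step is credited with the change it caused.
    """
    if not prefix_values:
        raise ValueError("prefix_values must contain one estimate per step")
    targets = prefix_values + [outcome]
    return [targets[t + 1] - targets[t] for t in range(len(prefix_values))]


# A three-step trajectory that ends in failure (outcome = 0.0).
values_before_each_step = [0.5, 0.75, 0.25]  # made-up estimates
print(sparse_rewards(3, 0.0))
print(step_level_rewards(values_before_each_step, 0.0))
```

With sparse rewards, all three steps see the same uninformative zero. With value differences, the second step, which dropped the estimated success probability from 0.75 to 0.25, receives the largest penalty. The per-step rewards sum to the outcome minus the initial estimate, so the total signal stays tied to the final result.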

---

**Source**: [arXiv cs.LG (ML)](https://arxiv.org/abs/2605.05226)

**Details page**: https://ai.daily.yangsir.net/daily/20260509-T0-04

---

*智语观潮 · Daily — https://ai.daily.yangsir.net/llms.txt*