---
id: 20260414-T0-04
title: "SPPO：针对长程推理任务的序列级PPO算法"
title_en: "SPPO: Sequence-Level PPO for Long-Horizon Reasoning Tasks"
url: https://ai.daily.yangsir.net/daily/20260414-T0-04
issue_date: 2026-04-14
publish_date: 2026-04-13T04:00:00.000Z
category: research
source_name: "arXiv cs.AI"
source_url: https://arxiv.org/abs/2604.08865
---

# SPPO: Sequence-Level PPO for Long-Horizon Reasoning Tasks

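Standard PPO suffers from unstable temporal credit assignment on long-horizon reasoning tasks. To address this, researchers propose SPPO (Sequence-Level PPO), which optimizes at the sequence level to improve the alignment of large language models on complex reasoning tasks. The paper reports that SPPO handles problems requiring multi-step reasoning more effectively. It is available as arXiv:2604.08865v1.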

## English Version

**SPPO: Sequence-Level PPO for Long-Horizon Reasoning Tasks**

To address unstable temporal credit assignment in standard PPO on long-horizon reasoning tasks, researchers propose SPPO (Sequence-Level PPO). The method optimizes at the sequence level to improve LLM alignment on complex reasoning tasks, and the paper reports that it handles multi-step reasoning problems more effectively (arXiv:2604.08865v1).
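
The summary above does not spell out SPPO's objective, so the following is only a minimal sketch of the general contrast between token-level and sequence-level PPO-style clipping, not the paper's method. The function names, the length-normalized sequence ratio, and the clipping range `eps=0.2` are all assumptions made for illustration.

```python
import math


def clipped_term(ratio, advantage, eps=0.2):
    """PPO clipped surrogate for a single ratio/advantage pair."""
    clipped_ratio = max(min(ratio, 1.0 + eps), 1.0 - eps)
    return min(ratio * advantage, clipped_ratio * advantage)


def token_level_objective(new_logps, old_logps, advantages, eps=0.2):
    """Token-level variant: every token has its own ratio and its own advantage."""
    terms = [
        clipped_term(math.exp(new - old), adv, eps)
        for new, old, adv in zip(new_logps, old_logps, advantages)
    ]
    return sum(terms) / len(terms)


def sequence_level_objective(new_logps, old_logps, seq_advantage, eps=0.2):
    """Sequence-level variant (illustrative assumption, not the paper's formula).

    A single ratio is formed for the whole response as the geometric mean of
    the per-token ratios, and it is paired with one advantage for the sequence.
    """
    mean_log_ratio = sum(n - o for n, o in zip(new_logps, old_logps)) / len(new_logps)
    return clipped_term(math.exp(mean_log_ratio), seq_advantage, eps)


if __name__ == "__main__":
    # Hypothetical per-token log-probabilities for a 4-token response.
    old = [-1.2, -0.7, -2.0, -0.9]
    new = [-1.0, -0.8, -1.5, -0.9]
    print(token_level_objective(new, old, advantages=[0.1, -0.3, 0.8, 0.2]))
    print(sequence_level_objective(new, old, seq_advantage=0.5))
```

In the sequence-level variant a whole response shares one ratio and one advantage, so no per-token advantage estimates are needed. That is one plausible reading of "optimizes at the sequence level"; the paper itself should be consulted for the actual formulation.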

---

**Source**: [arXiv cs.AI](https://arxiv.org/abs/2604.08865)

**Details page**: https://ai.daily.yangsir.net/daily/20260414-T0-04

---

*智语观潮 · Daily — https://ai.daily.yangsir.net/llms.txt*