---
id: 20260306-T0-15
title: "语言奖励模型存在持续性偏见"
title_en: "Language Reward Models Show Persistent Bias Issues"
url: https://ai.daily.yangsir.net/daily/20260306-T0-15
issue_date: 2026-03-06
publish_date: 2026-03-05T05:00:00.000Z
source_name: "arXiv cs.CL (NLP)"
source_url: https://arxiv.org/abs/2603.03291
---

# Language Reward Models Show Persistent Bias Issues

An arXiv study finds that language reward models (RMs) are vulnerable to reward attacks during preference alignment, which can lead the aligned model to learn undesirable behavior. A systematic analysis shows that 63% of RMs exhibit systematic bias toward specific cultural expressions, and that this bias is hard to eliminate through standard training. The study points to new directions for improving alignment algorithms.

---

**Source**: [arXiv cs.CL (NLP)](https://arxiv.org/abs/2603.03291)

**Details page**: https://ai.daily.yangsir.net/daily/20260306-T0-15

---

*智语观潮 · Daily — https://ai.daily.yangsir.net/llms.txt*