---
id: 20260302-T0-05
title: "基于表征擦除的偏好优化降低LLM有毒输出"
title_en: "Representation Erasure Preference Optimization Reduces Toxic LLM Outputs"
url: https://ai.daily.yangsir.net/daily/20260302-T0-05
issue_date: 2026-03-02
publish_date: 2026-03-02T05:00:00.000Z
source_name: "arXiv cs.LG (ML)"
source_url: https://arxiv.org/abs/2602.23391
---

# Representation Erasure Preference Optimization Reduces Toxic LLM Outputs

Researchers propose a representation erasure preference optimization method that lowers the probability of toxic outputs from large language models. The method reduces the rate of harmful content generation by 40% while maintaining model performance, outperforming the conventional DPO and NPO algorithms and offering a new approach to deploying AI safely.
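For context on the two baselines named above, here is a minimal sketch of their per-example losses. It assumes DPO refers to Direct Preference Optimization and NPO to Negative Preference Optimization, and that sequence log-probabilities under the policy and a frozen reference model are already available. The function names and the `beta` default are illustrative. This does not reproduce the representation erasure method itself, which the summary does not describe.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """DPO: widen the policy-vs-reference margin of the preferred
    response over the rejected (e.g. toxic) one."""
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    return -log_sigmoid(beta * (chosen_margin - rejected_margin))


def npo_loss(policy_logp: float, ref_logp: float, beta: float = 0.1) -> float:
    """NPO: uses only the undesired response and pushes its
    likelihood below the reference model's."""
    log_ratio = policy_logp - ref_logp
    return -(2.0 / beta) * log_sigmoid(-beta * log_ratio)


# Hypothetical log-probabilities, for illustration only.
print(dpo_loss(-4.0, -6.0, -5.0, -5.0))
print(npo_loss(-9.0, -5.0))
```

Both losses shrink as the policy assigns less probability to the undesired response relative to the reference model. DPO additionally needs a preferred response for each prompt, while NPO does not.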

---

**Source**: [arXiv cs.LG (ML)](https://arxiv.org/abs/2602.23391)

**Details page**: https://ai.daily.yangsir.net/daily/20260302-T0-05

---

*智语观潮 · Daily — https://ai.daily.yangsir.net/llms.txt*