---
id: 20260508-T0-14
title: "Can Fine-tuning Samples Collapse LLM Safety? New Method Quantifies the Risk"
title_en: "Fine-tuning Samples May Collapse LLM Safety, New Scoring Method Quantifies Risk"
url: https://ai.daily.yangsir.net/daily/20260508-T0-14
issue_date: 2026-05-08
publish_date: 2026-05-07T04:00:00.000Z
category: research
source_name: "arXiv cs.AI"
source_url: https://arxiv.org/abs/2605.04572
---

# Can Fine-tuning Samples Collapse LLM Safety? New Method Quantifies the Risk

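A Stanford team found that fine-tuning an LLM on only a small number of benign samples can strip away safety behaviors trained on millions of preference examples. The researchers propose a method that maps parameter dynamics to risk scores, identifying the key samples that cause safety collapse. In experiments, the method gave early warning of 85% of safety degradations, helping developers screen fine-tuning data. The paper, posted on arXiv, offers a new tool for safe fine-tuning of AI models.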

## English Version

**Fine-tuning Samples May Collapse LLM Safety, New Scoring Method Quantifies Risk**

Stanford researchers found that fine-tuning an LLM on just a few benign samples can erase safety behaviors trained on millions of preference examples. Their method quantifies safety degradation by linking parameter dynamics to sample-level risk scores, pinpointing the samples that drive the collapse. In experiments it gave early warning of 85% of safety collapses, helping developers filter risky fine-tuning data. The paper is published on arXiv.
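The summary does not spell out how parameter dynamics become a per-sample risk score. As a rough illustration only, the sketch below uses a generic, swapped-in technique (gradient-alignment scoring), not the paper's method: each fine-tuning sample is scored by the cosine similarity between the parameter update it would induce (the negative of its loss gradient) and a reference "safety-degrading" direction. The names `risk_scores`, `flag_risky`, the `unsafe_direction` vector (imagined here as something estimated from a few harmful examples), and the threshold value are all hypothetical assumptions, and the toy 3-dimensional vectors stand in for real model gradients.

```python
import math
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    if nu == 0.0 or nv == 0.0:
        return 0.0
    return dot / (nu * nv)


def risk_scores(sample_grads: Dict[str, Vector],
                unsafe_direction: Vector) -> Dict[str, float]:
    """Hypothetical proxy: score each sample by how closely its update
    aligns with an assumed safety-degrading direction in parameter space."""
    # A gradient-descent step moves parameters along -grad, so compare
    # the update direction (-grad) against the unsafe direction.
    return {
        sid: cosine([-g for g in grad], unsafe_direction)
        for sid, grad in sample_grads.items()
    }


def flag_risky(scores: Dict[str, float],
               threshold: float) -> List[Tuple[str, float]]:
    """Return samples whose score exceeds the threshold, highest first."""
    flagged = [(sid, s) for sid, s in scores.items() if s > threshold]
    return sorted(flagged, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Toy per-sample gradients (hypothetical values, not real model data).
    grads = {
        "sample_a": [-1.0, 0.0, 0.0],   # update points straight along unsafe dir
        "sample_b": [0.0, 1.0, 0.0],    # update is orthogonal to unsafe dir
        "sample_c": [-0.5, -0.5, 0.0],  # update partially aligned
    }
    unsafe = [1.0, 0.0, 0.0]  # assumed safety-degrading direction
    for sid, score in flag_risky(risk_scores(grads, unsafe), threshold=0.5):
        print(f"{sid}: {score:.3f}")
    # prints:
    # sample_a: 1.000
    # sample_c: 0.707
```

Comparing the update direction rather than the raw gradient keeps the sign intuitive: a positive score means training on that sample would push parameters toward the assumed unsafe direction. A real pipeline would need per-sample gradients from the actual model and a principled way to estimate the reference direction, neither of which this sketch attempts.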

---

**Source**: [arXiv cs.AI](https://arxiv.org/abs/2605.04572)

**Details page**: https://ai.daily.yangsir.net/daily/20260508-T0-14

---

*智语观潮 · Daily — https://ai.daily.yangsir.net/llms.txt*