---
id: 20260528-T0-08
title: "GEM：打破传统分类缺陷，用几何熵实现LLM最优数据配比"
title_en: "GEM Uses Geometric Entropy for Optimal LLM Data Mixing"
url: https://ai.daily.yangsir.net/daily/20260528-T0-08
issue_date: 2026-05-28
publish_date: 2026-05-27T04:00:00.000Z
category: research
source_name: "arXiv cs.LG (ML)"
source_url: https://arxiv.org/abs/2605.26121
---

# GEM: Moving Past Flawed Traditional Taxonomies with Geometric Entropy for Optimal LLM Data Mixing

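LLM pre-training results increasingly depend on the data mixture rather than on raw data volume. Existing methods have clear shortcomings: manual categorization is prone to ontological misalignment, and Euclidean clustering cannot accurately handle the embedding space. The paper proposes GEM (Geometric Entropy Mixing), which uses geometric entropy to optimize how data is combined and sidesteps the limits of traditional categorization. The study reports that the method yields a better data mixing strategy, directly improving pre-training efficiency and final model performance. Data engineers can apply it to pre-training data pipelines to reduce trial-and-error costs.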

## English Version

**GEM Uses Geometric Entropy for Optimal LLM Data Mixing**

LLM pre-training efficacy increasingly depends on data composition rather than sheer volume. The paper introduces GEM (Geometric Entropy Mixing), which uses geometric entropy to optimize the data mixture, sidestepping the flaws of human-defined taxonomies and Euclidean clustering. Data engineers can apply it to pre-training pipelines to reduce trial-and-error costs.
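
The summary does not say why Euclidean clustering falls short in an embedding space. One commonly cited reason, which may or may not be the one the paper has in mind, is that Euclidean distance is sensitive to vector norm while semantic similarity is often carried by direction. The sketch below is a toy illustration of that general point only; it is not GEM's algorithm, and the vectors and function names are hypothetical.

```python
import math

def euclidean(a, b):
    """Straight-line distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def angular(a, b):
    """Angle in radians between two vectors (ignores their norms)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    # Clamp to [-1, 1] to guard against floating-point drift before acos.
    cos = max(-1.0, min(1.0, dot / (norm_a * norm_b)))
    return math.acos(cos)

# Hypothetical 2-D "embeddings", chosen only to make the effect visible.
doc_a = [1.0, 0.0]
doc_b = [10.0, 0.0]  # same direction as doc_a, much larger norm
doc_c = [0.0, 1.0]   # orthogonal to doc_a, same norm

# Euclidean distance groups doc_a with doc_c (the orthogonal vector),
# while the angular view groups doc_a with doc_b (the same direction).
print("euclidean a-b:", euclidean(doc_a, doc_b))
print("euclidean a-c:", euclidean(doc_a, doc_c))
print("angular   a-b:", angular(doc_a, doc_b))
print("angular   a-c:", angular(doc_a, doc_c))
```

A clustering step built on the first metric would place `doc_a` with `doc_c`, while one built on the second would place it with `doc_b`, so the choice of geometry can change which documents end up in the same mixture bucket.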

---

**Source**: [arXiv cs.LG (ML)](https://arxiv.org/abs/2605.26121)

**Detail page**: https://ai.daily.yangsir.net/daily/20260528-T0-08

---

*智语观潮 · Daily — https://ai.daily.yangsir.net/llms.txt*