---
id: 20260506-T0-09
title: "分词器粒度直接影响模型计算效率，最佳词汇量远大于当前主流设置"
title_en: "Token Granularity Directly Impacts Model Efficiency, Optimal Vocabulary Size Found Larger"
url: https://ai.daily.yangsir.net/daily/20260506-T0-09
issue_date: 2026-05-06
publish_date: 2026-05-05T04:00:00.000Z
category: research
source_name: "arXiv cs.CL (NLP)"
source_url: https://arxiv.org/abs/2605.01188
---

# Tokenizer Granularity Directly Affects Model Compute Efficiency; Optimal Vocabulary Size Is Far Larger Than Current Mainstream Settings

This study systematically examines how a tokenizer's information granularity affects the computational efficiency of large language models. Scaling laws are already widely used to optimize data volume and model size, but the specific effect of the token, the basic unit of data, on compute efficiency has lacked in-depth study. The study finds that the choice of tokenizer directly affects both compute efficiency and final model performance. When training a new model, developers can use these conclusions to choose a compute-optimal tokenization strategy and vocabulary size rather than blindly reusing existing standard configurations.
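
To make the trade-off concrete, here is a minimal back-of-the-envelope sketch. It is not the paper's method. It assumes the common `6 * N * D` approximation for training FLOPs, and it assumes that `N` counts the non-embedding weights plus an output projection of size `vocab_size * d_model`. The function name, the model dimensions, and the `(vocab_size, bytes_per_token)` pairs are all hypothetical placeholders, not measured values.

```python
def estimate_training_flops(non_embedding_params, vocab_size, d_model,
                            corpus_bytes, bytes_per_token):
    """Return (tokens, flops) under the rough 6 * N * D approximation.

    Assumption: N counts the non-embedding weights plus the output
    projection (vocab_size * d_model); the input embedding lookup is
    treated as free.
    """
    # A coarser tokenizer (more bytes per token) yields fewer tokens
    # for the same corpus.
    tokens = corpus_bytes / bytes_per_token
    # A larger vocabulary adds parameters to the output projection.
    params = non_embedding_params + vocab_size * d_model
    return tokens, 6 * params * tokens


if __name__ == "__main__":
    # Hypothetical (vocab_size, bytes_per_token) pairs, NOT measurements.
    candidates = [(32_000, 4.0), (128_000, 4.5), (512_000, 5.0)]
    for vocab_size, bytes_per_token in candidates:
        tokens, flops = estimate_training_flops(
            non_embedding_params=1_000_000_000,  # hypothetical 1B-weight body
            vocab_size=vocab_size,
            d_model=2048,                        # hypothetical hidden size
            corpus_bytes=10**12,                 # hypothetical 1 TB corpus
            bytes_per_token=bytes_per_token,
        )
        print(f"vocab={vocab_size:>7,}  tokens={tokens:.3e}  flops={flops:.3e}")
```

The sketch captures only the cost side: a larger vocabulary shortens sequences but enlarges the output layer. Which setting is compute-optimal also depends on the loss reached per FLOP, which is an empirical question and is not modeled here.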

---

**Source**: [arXiv cs.CL (NLP)](https://arxiv.org/abs/2605.01188)

**Details page**: https://ai.daily.yangsir.net/daily/20260506-T0-09

---

*智语观潮 · Daily — https://ai.daily.yangsir.net/llms.txt*