---
id: 20260603-T0-13
title: "LLM交互推理新基准：多轮游戏测试方案发布"
title_en: "New Benchmark for LLM Interactive Reasoning"
url: https://ai.daily.yangsir.net/daily/20260603-T0-13
issue_date: 2026-06-03
publish_date: 2026-06-02T04:00:00.000Z
category: research
source_name: "arXiv cs.AI"
source_url: https://arxiv.org/abs/2606.00103
---

# New Benchmark for LLM Interactive Reasoning: Multi-Turn Game Test Suite Released

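Researchers have released a multi-level interactive reasoning benchmark built on executable games, designed to test an LLM's ability to actively acquire evidence. The method divides the reasoning process into three phases (query, integration, and update) and uses hidden environments to evaluate the LLM's strategic decision-making. The benchmark has been validated on the Codeforces dataset, with a reported 20% improvement in accuracy.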

## English Version

**New Benchmark for LLM Interactive Reasoning**

Researchers introduce a hierarchical benchmark for LLM interactive reasoning built on executable games. It evaluates active evidence acquisition across three phases: query, integration, and update. The benchmark was validated on Codeforces, with a reported 20% accuracy improvement over traditional methods.
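
To make the query, integration, and update phases concrete, here is a minimal sketch of such a loop against a toy hidden environment. It is an illustration only and not the benchmark's actual code: the number-guessing game, `HiddenNumberGame`, `Belief`, and every function name below are hypothetical, and a fixed binary-search policy stands in for the LLM.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


class HiddenNumberGame:
    """Hypothetical hidden environment: the secret is only observable via feedback."""

    def __init__(self, secret: int, low: int, high: int) -> None:
        if not low <= secret <= high:
            raise ValueError("secret must lie within [low, high]")
        self._secret = secret
        self.low = low
        self.high = high
        self.queries = 0  # number of queries issued so far

    def ask(self, guess: int) -> str:
        """Return 'higher', 'lower', or 'correct' relative to the secret."""
        self.queries += 1
        if guess < self._secret:
            return "higher"
        if guess > self._secret:
            return "lower"
        return "correct"


@dataclass
class Belief:
    """The agent's current hypothesis: the secret lies in [low, high]."""
    low: int
    high: int


def choose_query(belief: Belief) -> int:
    # Query phase: pick the guess that splits the remaining range in half.
    return (belief.low + belief.high) // 2


def integrate(guess: int, feedback: str) -> Tuple[Optional[int], Optional[int]]:
    # Integration phase: turn raw feedback into (new lower bound, new upper bound).
    if feedback == "higher":
        return guess + 1, None
    if feedback == "lower":
        return None, guess - 1
    return guess, guess


def update(belief: Belief, constraint: Tuple[Optional[int], Optional[int]]) -> Belief:
    # Update phase: tighten the belief with the new constraint.
    new_low, new_high = constraint
    low = belief.low if new_low is None else max(belief.low, new_low)
    high = belief.high if new_high is None else min(belief.high, new_high)
    return Belief(low, high)


def play(env: HiddenNumberGame, max_turns: int) -> Tuple[Optional[int], int]:
    """Run the loop; return (answer or None if the turn budget ran out, turns used)."""
    belief = Belief(env.low, env.high)
    for turn in range(1, max_turns + 1):
        guess = choose_query(belief)
        feedback = env.ask(guess)
        belief = update(belief, integrate(guess, feedback))
        if feedback == "correct":
            return guess, turn
    return None, max_turns
```

With a secret of 37 in the range 1 to 100, `play(HiddenNumberGame(37, 1, 100), max_turns=10)` finds the answer in three queries (50, 25, 37). In an evaluation of this kind, the policy inside `choose_query` would be replaced by model calls, and the turn count gives a simple measure of how efficiently evidence was gathered.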

---

**Source**: [arXiv cs.AI](https://arxiv.org/abs/2606.00103)

**Details page**: https://ai.daily.yangsir.net/daily/20260603-T0-13

---

*智语观潮 · Daily — https://ai.daily.yangsir.net/llms.txt*