---
id: 20260606-T0-03
title: "SentinelBench：首个长时间运行AI代理评测基准"
title_en: "SentinelBench: Benchmark for Long-Running AI Agents"
url: https://ai.daily.yangsir.net/daily/20260606-T0-03
issue_date: 2026-06-06
publish_date: 2026-06-05T04:00:00.000Z
category: research
source_name: "arXiv cs.AI"
source_url: https://arxiv.org/abs/2606.05342
---

# SentinelBench: Benchmark for Long-Running AI Agents

SentinelBench, posted on arXiv, is described as the first benchmark aimed at long-running AI agents. Conventional AI agents are built around an uninterrupted sequence of actions, while many real-world tasks stretch over hours. The benchmark measures how agents perform on such persistent tasks, for example repeatedly refreshing a page or searching for alternatives, and so covers a gap in existing evaluations. Developers can use it to tune agents for long-duration tasks.
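
To make the idea of a persistent task concrete, here is a minimal sketch of a monitoring loop that keeps re-checking a condition (such as refreshing a page) and falls back to alternatives once its time budget runs out. It is not taken from SentinelBench: the function name `monitor_until`, its parameters, and the fallback strategy are all assumptions chosen for illustration.

```python
import time
from typing import Callable, Optional, Sequence


def monitor_until(
    check: Callable[[], Optional[str]],
    alternatives: Sequence[Callable[[], Optional[str]]] = (),
    interval_s: float = 60.0,
    timeout_s: float = 4 * 3600.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Optional[str]:
    """Poll `check` until it yields a result or the time budget is spent.

    Hypothetical helper: `check` stands in for "refresh the page and look
    for the target state"; each entry in `alternatives` stands in for
    "search for a substitute". Both return None when nothing was found.
    """
    deadline = clock() + timeout_s
    while clock() < deadline:
        result = check()
        if result is not None:
            return result
        # Wait before the next refresh instead of acting continuously.
        sleep(interval_s)

    # Budget exhausted: try each fallback once, in order.
    for alternative in alternatives:
        result = alternative()
        if result is not None:
            return result
    return None
```

The `sleep` and `clock` parameters are injectable so that an hours-long wait can be simulated in a test without actually waiting; in real use the defaults would apply.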

---

**Source**: [arXiv cs.AI](https://arxiv.org/abs/2606.05342)

**Details page**: https://ai.daily.yangsir.net/daily/20260606-T0-03

---

*智语观潮 · Daily — https://ai.daily.yangsir.net/llms.txt*