---
id: 20260510-T0-02
title: "PRISM: Letting Embodied Agents Think While They Look in Multimodal Environments, Narrowing the Perception-Decision Gap"
title_en: "PRISM: Bridging the Perception-Decision Gap in Multimodal Agents"
url: https://ai.daily.yangsir.net/daily/20260510-T0-02
issue_date: 2026-05-10
publish_date: 2026-05-09T04:00:00.000Z
category: research
source_name: "arXiv cs.AI"
source_url: https://arxiv.org/abs/2605.05407
---

# PRISM: Letting Embodied Agents Think While They Look in Multimodal Environments, Narrowing the Perception-Decision Gap

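Current standalone vision-language models (VLMs) suffer from a disconnect between perception, reasoning, and decision-making in embodied tasks: the model often overlooks key visual information and makes wrong decisions as a result. PRISM proposes a sequential decision-making framework that interleaves perception and reasoning, so the agent integrates visual input and logical reasoning in step as it handles complex multimodal environments. The method narrows the performance gap of a single VLM on multi-step tasks and noticeably improves the task completion rate of embodied agents in complex environments.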

## English Version

**PRISM: Bridging the Perception-Decision Gap in Multimodal Agents**

Standalone Vision-Language Models (VLMs) often struggle in embodied tasks because perception, reasoning, and decision-making are disconnected, so the model frequently misses critical visual cues and makes wrong decisions. PRISM introduces a sequential decision-making framework that interleaves perception with reasoning, letting the agent integrate visual inputs and logical reasoning together as it works through complex multimodal environments. The approach narrows the performance gap of a single VLM on multi-step tasks and noticeably improves embodied agents' task completion rate in complex environments.
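
To make the idea of interleaving perception with reasoning concrete, here is a minimal Python sketch. It is not PRISM's implementation: the summary does not describe the method's internals, so `ToyEnv`, `perceive`, `interleaved_policy`, and the key-and-door task are all hypothetical stand-ins. The only point it illustrates is the control flow, where each reasoning step decides what to look at next before the agent commits to an action.

```python
from dataclasses import dataclass


@dataclass
class ToyEnv:
    """Hypothetical 1-D corridor: fetch the key, then reach the door."""
    agent: int = 0
    key: int = 2
    door: int = 4
    has_key: bool = False

    def observe(self) -> dict:
        # Stand-in for a raw image: a dict of everything currently visible.
        return {"agent": self.agent, "key": self.key,
                "door": self.door, "has_key": self.has_key}

    def step(self, action: str) -> bool:
        if action == "right":
            self.agent += 1
        elif action == "left":
            self.agent -= 1
        elif action == "pick" and self.agent == self.key:
            self.has_key = True
        # The episode ends once the agent stands on the door holding the key.
        return self.has_key and self.agent == self.door


def perceive(obs: dict, query: str):
    """Targeted perception: answer one question about the observation."""
    return obs[query]


def move_toward(src: int, dst: int) -> str:
    return "right" if dst > src else "left"


def interleaved_policy(obs: dict, trace: list) -> str:
    """Alternate perception queries and reasoning before choosing an action."""
    has_key = perceive(obs, "has_key")
    trace.append(f"look: has_key={has_key}")
    agent = perceive(obs, "agent")
    if not has_key:
        # Reasoning picks the next thing worth looking at: the key.
        key = perceive(obs, "key")
        trace.append(f"think: need key at {key}, agent at {agent}")
        return "pick" if agent == key else move_toward(agent, key)
    # With the key in hand, the door becomes the relevant visual target.
    door = perceive(obs, "door")
    trace.append(f"think: holding key, door at {door}, agent at {agent}")
    return move_toward(agent, door)


def run_episode(max_steps: int = 20):
    env, trace, actions = ToyEnv(), [], []
    for _ in range(max_steps):
        action = interleaved_policy(env.observe(), trace)
        actions.append(action)
        if env.step(action):
            return True, actions, trace
    return False, actions, trace


if __name__ == "__main__":
    done, actions, trace = run_episode()
    print(done, actions)
    print("\n".join(trace))
```

In this toy setup the policy never reads the door position until reasoning has established that the key is held, which mirrors the summary's claim that perception should be driven by the current state of reasoning instead of being consumed once up front.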

---

**Source**: [arXiv cs.AI](https://arxiv.org/abs/2605.05407)

**Details page**: https://ai.daily.yangsir.net/daily/20260510-T0-02

---

*智语观潮 · Daily — https://ai.daily.yangsir.net/llms.txt*