CVPR 2025 Embodied AI Workshop Challenge:
Social Mobile Manipulation Challenge
Sat February 1st - Thu May 1st, 2025
SYSU, Guangzhou, China
The 1st Social Mobile Manipulation Challenge focuses on developing embodied AI agents capable of
performing long sequences of complex tasks through social interactions. These tasks involve reasoning
about human intentions and planning within dynamic, multi-agent environments (Figure 1). The challenge
includes two interaction modes: human-robot and robot-robot. The goal is to advance the community's
capabilities in areas like human intention reasoning, long-term task planning, and effective social
interaction between embodied agents.
Figure 1: The dynamic interaction environment that agents will navigate in both human-robot and
robot-robot tasks.
Video 1: Two Go2 robots collaborate to fetch a bottled beverage, interacting with humans along the way.
Benchmark: Open World Social Mobile Manipulation.
In this benchmark, we design an open-world social mobile manipulation setting. It includes two
interaction modes: hierarchical interaction and horizontal interaction. The former simulates
interaction among embodied agents with a hierarchical knowledge structure, and the latter simulates
interaction among embodied agents with equal knowledge acquisition capabilities.
Hierarchical interaction:
Hierarchical interaction tasks simulate environments in which agents hold different levels of
knowledge. For example, administrators (such as salespersons) clearly know more about the
environment than ordinary agents do. Encouraging agents to converse with administrators can help
them better understand user intentions and improve task execution success rates.
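The hierarchical mode described above can be illustrated with a minimal sketch. It is not the challenge's API: the `Agent` class, the `locate_with_admin` helper, and the object and shelf names are all hypothetical, and the "dialogue" is reduced to a single lookup against the administrator's knowledge.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Agent:
    """Hypothetical agent that maps object names to known locations."""
    name: str
    knowledge: dict[str, str] = field(default_factory=dict)

    def locate(self, obj: str) -> str | None:
        return self.knowledge.get(obj)


def locate_with_admin(agent: Agent, admin: Agent, obj: str) -> str | None:
    """Look up obj locally; if unknown, ask the administrator."""
    location = agent.locate(obj)
    if location is None:
        # Stand-in for one dialogue turn with the administrator (assumption).
        location = admin.locate(obj)
        if location is not None:
            agent.knowledge[obj] = location  # remember the answer
    return location


# The administrator knows more about the environment than the ordinary agent.
admin = Agent("salesperson", {"beverage": "shelf_3", "snacks": "shelf_1"})
robot = Agent("robot", {"snacks": "shelf_1"})
result = locate_with_admin(robot, admin, "beverage")
```

In this sketch the ordinary agent only consults the administrator when its own knowledge is insufficient, and it keeps the answer for later use.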
Horizontal interaction:
Horizontal interaction tasks simulate a “passer-by interaction scene”. No administrator with a
“God's perspective” is present, and all agents have equal access to scene knowledge. Specifically,
the scene contains multiple agents of equal status. Each agent builds its own scene graph
independently and shares knowledge with the others through social dialogue, improving the
efficiency and success rate of task completion.
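As a rough illustration of the horizontal mode, the sketch below models each agent's scene graph as a set of (subject, relation, object) triples and models social dialogue as an exchange of the triples the other agent lacks. The `PeerAgent` class, the `share_with` method, and the example triples are hypothetical and are not taken from the benchmark.

```python
from typing import Iterable, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, relation, object)


class PeerAgent:
    """Hypothetical equal-status agent holding its own scene graph."""

    def __init__(self, name: str, observations: Iterable[Triple]) -> None:
        self.name = name
        self.scene_graph: Set[Triple] = set(observations)

    def share_with(self, other: "PeerAgent") -> int:
        """Send the triples that `other` lacks; return how many were new."""
        new_triples = self.scene_graph - other.scene_graph
        other.scene_graph |= new_triples
        return len(new_triples)


# Each agent has observed a different part of the scene.
a = PeerAgent("go2_a", [("bottle", "on", "table")])
b = PeerAgent("go2_b", [("table", "in", "kitchen")])

# One round of dialogue in each direction.
sent_ab = a.share_with(b)
sent_ba = b.share_with(a)
```

After one exchange in each direction, both agents hold the same merged scene graph, which is the sense in which no agent is privileged in this mode.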