
MesaTask: Towards Task-Driven Tabletop Scene Generation via 3D Spatial Reasoning

Jinkun Hao1*, Naifu Liang2*, Zhen Luo3,4*, Xudong Xu2‡, Weipeng Zhong2, Ran Yi1, Yichen Jin5, Zhaoyang Lyu2, Feng Zheng4, Lizhuang Ma1✉️, Jiangmiao Pang2
1Shanghai Jiao Tong University, 2Shanghai AI Laboratory, 3SII, 4Southern University of Science and Technology, 5Peking University
* equal contribution, ‡ project lead, ✉️ corresponding author
NeurIPS 2025 Spotlight

💡 Abstract

The ability of robots to interpret human instructions and execute manipulation tasks necessitates the availability of task-relevant tabletop scenes for training. However, traditional methods for creating these scenes rely on time-consuming manual layout design or purely randomized layouts, which are limited in terms of plausibility or alignment with the tasks. In this paper, we formulate a novel task, namely task-oriented tabletop scene generation, which poses significant challenges due to the substantial gap between high-level task instructions and the tabletop scenes. To support research on such a challenging task, we introduce MesaTask-10K, a large-scale dataset comprising approximately 10,700 synthetic tabletop scenes with manually crafted layouts that ensure realistic layouts and intricate inter-object relations. To bridge the gap between tasks and scenes, we propose a Spatial Reasoning Chain that decomposes the generation process into object inference, spatial interrelation reasoning, and scene graph construction for the final 3D layout. We present MesaTask, an LLM-based framework that utilizes this reasoning chain and is further enhanced with DPO algorithms to generate physically plausible tabletop scenes that align well with given task descriptions. Extensive experiments demonstrate the superior performance of MesaTask compared to baselines in generating task-conforming tabletop scenes with realistic layouts.

MesaTask Teaser Image

MesaTask generates 3D tabletop scenes directly from human instructions. To support this task, we construct a large-scale dataset of tabletop scenes, MesaTask-10K.

🎨 Interactive Scene Gallery

An interactive 3D scene viewer on the project page lets you select a scene, read its task description, and click on objects in the scene to view their details.

🤖 Manipulation Demo

Demonstration of robotic manipulation on our generated tabletop scenes.

📍 Task 1: Place the banana on the plate.

Banana task demo: View 1 and View 2.
📍 Task 2: Organize the office table by moving the mouse to the middle.

Mouse task demo: View 1 and View 2.

Our MesaTask framework generates sim-ready tabletop scenes that support robotic manipulation tasks.

🗂️ Data Curation

Dataset Overview

The dataset construction pipeline. First, an LLM is used to generate diverse tabletop scene descriptions, including relevant object lists and spatial relations. Conditioned on the scene description, a text-to-image model synthesizes reference images, from which coarse 3D layouts are built using depth estimation, object detection, and 3D asset retrieval. These layouts are refined through human annotations and physical simulation to ensure spatial plausibility, yielding high-quality 3D layouts.
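To make the image-to-layout lifting step concrete, below is a minimal, hedged sketch of how a detected object can be back-projected to a coarse 3D position through an estimated depth map. The camera intrinsics, image size, depth value, and object are illustrative assumptions, not values from the MesaTask curation pipeline.

```python
# Minimal sketch: place a detected object in 3D by back-projecting the centre
# of its 2D box through a monocular depth estimate (illustrative values only).
import numpy as np

def backproject(u: float, v: float, depth: float, K: np.ndarray) -> np.ndarray:
    """Pixel (u, v) with depth d -> 3D point in the camera frame."""
    fx, fy = K[0, 0], K[1, 1]
    cx, cy = K[0, 2], K[1, 2]
    return np.array([(u - cx) * depth / fx, (v - cy) * depth / fy, depth])

# Assumed intrinsics for a 640x480 reference image (not the pipeline's values).
K = np.array([[600.0,   0.0, 320.0],
              [  0.0, 600.0, 240.0],
              [  0.0,   0.0,   1.0]])

# Suppose object detection found a "mug" with box centre (350, 260) and the
# depth estimate there is 0.9 m; the retrieved mug asset would be placed at
# this coarse 3D position before simulation and human refinement.
mug_position = backproject(350, 260, 0.9, K)
print(mug_position)  # approx [0.045, 0.03, 0.9]
```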

⚙️ Method

Framework Overview


🔍 Overview of our MesaTask Framework

1) Task-to-Scene Generation (upper-left)

Given a task instruction, we extract detailed task information including environment, sub-goals, and task-relevant objects. A structured spatial reasoning chain performs object list completion, interrelation inference, and scene graph construction, which guides the generation of 3D layouts. Final scenes are obtained via 3D asset retrieval.
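For illustration, the sketch below shows one possible intermediate scene-graph representation produced by the reasoning chain, with task-relevant objects as nodes and pairwise spatial relations as edges. The schema and relation vocabulary are assumptions for this example, not the exact format used by MesaTask.

```python
# Hedged sketch of a scene-graph intermediate: completed object list plus
# (subject, relation, object) triples that a layout model can turn into poses.
from dataclasses import dataclass, field

@dataclass
class SceneGraph:
    objects: list[str] = field(default_factory=list)                    # completed object list
    relations: list[tuple[str, str, str]] = field(default_factory=list)  # (subj, relation, obj)

    def add_relation(self, subj: str, rel: str, obj: str) -> None:
        for name in (subj, obj):
            if name not in self.objects:
                self.objects.append(name)
        self.relations.append((subj, rel, obj))

# Example for "Place the banana on the plate": the reasoning chain completes the
# object list with context objects and infers their interrelations.
g = SceneGraph()
g.add_relation("banana", "left_of", "plate")
g.add_relation("plate", "on_top_of", "table")
g.add_relation("napkin", "next_to", "plate")
print(g.objects)    # ['banana', 'plate', 'table', 'napkin']
print(g.relations)
```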

2) Reasoning Data Construction (bottom)

Based on the scene graphs and descriptions in our MesaTask-10K dataset, a multimodal LLM is leveraged to produce task instructions, detailed task information, and complete object lists and interrelations.
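As a rough illustration, a (prompt, reasoning) training pair could be serialized from a scene graph as sketched below; the template and field names are assumptions for this example, not the dataset's released format.

```python
# Hedged sketch: serialize an instruction plus its reasoning-chain targets
# (object list and interrelations) into a supervised fine-tuning example.
import json

def make_training_pair(instruction: str, objects: list[str],
                       relations: list[tuple[str, str, str]]) -> dict:
    reasoning = {
        "object_list": objects,
        "interrelations": [f"{s} {r} {o}" for s, r, o in relations],
    }
    return {
        "prompt": f"Task: {instruction}\nInfer the tabletop scene graph.",
        "response": json.dumps(reasoning, indent=2),
    }

pair = make_training_pair(
    "Place the banana on the plate.",
    ["banana", "plate", "table"],
    [("banana", "left_of", "plate"), ("plate", "on_top_of", "table")],
)
print(pair["prompt"])
```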

3) DPO Data Construction (upper right)

To enable DPO training, we generate negative examples by randomly perturbing object positions or relations and removing key objects from normal layouts.
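The sketch below illustrates one way such rejected layouts could be derived from a plausible ("chosen") layout: shift an object's position to break its spatial relations, or drop a key object entirely. The layout schema and perturbation scale are illustrative assumptions, not the exact procedure used in our DPO data construction.

```python
# Hedged sketch of DPO negative construction: perturb a plausible layout into a
# "rejected" counterpart for preference optimization.
import copy
import random

def make_rejected(layout: dict, max_shift: float = 0.3) -> dict:
    bad = copy.deepcopy(layout)
    mode = random.choice(["shift", "drop"])
    if mode == "shift" and bad["objects"]:
        obj = random.choice(bad["objects"])
        obj["position"] = [p + random.uniform(-max_shift, max_shift)
                           for p in obj["position"]]              # break spatial relations
    elif bad["objects"]:
        bad["objects"].pop(random.randrange(len(bad["objects"])))  # remove a key object
    return bad

chosen = {"objects": [
    {"name": "banana", "position": [0.10, 0.00, 0.76]},
    {"name": "plate",  "position": [0.25, 0.00, 0.75]},
]}
rejected = make_rejected(chosen)
# DPO then trains the layout LLM to prefer `chosen` over `rejected`
# for the same task instruction.
```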

BibTeX

# here is the BibTeX