
MesaTask: Towards Task-Driven Tabletop Scene Generation via 3D Spatial Reasoning

Jinkun Hao1*, Naifu Liang2*, Zhen Luo3,4*, Xudong Xu2‡, Weipeng Zhong2, Ran Yi1, Yichen Jin5, Zhaoyang Lyu2, Feng Zheng4, Lizhuang Ma1✉️, Jiangmiao Pang2
1Shanghai Jiao Tong University, 2Shanghai AI Laboratory, 3SII, 4Southern University of Science and Technology, 5Peking University
* equal contribution, ‡ project lead, ✉️ corresponding author
NeurIPS 2025 Spotlight

💡 Abstract

The ability of robots to interpret human instructions and execute manipulation tasks necessitates the availability of task-relevant tabletop scenes for training. However, traditional methods for creating these scenes rely on time-consuming manual layout design or purely randomized layouts, which are limited in terms of plausibility or alignment with the tasks. In this paper, we formulate a novel task, namely task-oriented tabletop scene generation, which poses significant challenges due to the substantial gap between high-level task instructions and the tabletop scenes. To support research on such a challenging task, we introduce MesaTask-10K, a large-scale dataset comprising approximately 10,700 synthetic tabletop scenes with manually crafted layouts that ensure realistic layouts and intricate inter-object relations. To bridge the gap between tasks and scenes, we propose a Spatial Reasoning Chain that decomposes the generation process into object inference, spatial interrelation reasoning, and scene graph construction for the final 3D layout. We present MesaTask, an LLM-based framework that utilizes this reasoning chain and is further enhanced with DPO algorithms to generate physically plausible tabletop scenes that align well with given task descriptions. Extensive experiments demonstrate the superior performance of MesaTask compared to baselines in generating task-conforming tabletop scenes with realistic layouts.

MesaTask Teaser Image

MesaTask generates 3D tabletop scenes directly from human instructions. To support this task, we construct a large-scale dataset of tabletop scenes, MesaTask-10K.

🎨 Interactive Scene Gallery

An interactive 3D scene viewer on the project page lets you select a scene, read its task description, and click on objects in the scene to view their details.

🤖 Manipulation Demo

Demonstration of robotic manipulation on our generated tabletop scenes.

📍 Task 1: Place the banana on the plate.

Banana task demo: View 1 and View 2.
📍 Task 2: Organize the office table by moving the mouse to the middle.

Mouse task demo: View 1 and View 2.

Our MesaTask framework generates sim-ready tabletop scenes that support robotic manipulation tasks.

🗂️ Data Curation

Dataset Overview

The dataset construction pipeline. First, an LLM is used to generate diverse tabletop scene descriptions, including relevant object lists and spatial relations. Conditioned on the scene description, a text-to-image model synthesizes reference images, from which coarse 3D layouts are built using depth estimation, object detection, and 3D asset retrieval. These layouts are refined through human annotations and physical simulation to ensure spatial plausibility, yielding high-quality 3D layouts.
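To make the image-to-layout lifting step concrete, below is a minimal, hedged sketch of how a detected object can be back-projected to a coarse 3D position through an estimated depth map. The camera intrinsics, image size, depth value, and object are illustrative assumptions, not values from the MesaTask curation pipeline.

```python
# Minimal sketch: place a detected object in 3D by back-projecting the centre
# of its 2D box through a monocular depth estimate (illustrative values only).
import numpy as np

def backproject(u: float, v: float, depth: float, K: np.ndarray) -> np.ndarray:
    """Pixel (u, v) with depth d -> 3D point in the camera frame."""
    fx, fy = K[0, 0], K[1, 1]
    cx, cy = K[0, 2], K[1, 2]
    return np.array([(u - cx) * depth / fx, (v - cy) * depth / fy, depth])

# Assumed intrinsics for a 640x480 reference image (not the pipeline's values).
K = np.array([[600.0,   0.0, 320.0],
              [  0.0, 600.0, 240.0],
              [  0.0,   0.0,   1.0]])

# Suppose object detection found a "mug" with box centre (350, 260) and the
# depth estimate there is 0.9 m; the retrieved mug asset would be placed at
# this coarse 3D position before simulation and human refinement.
mug_position = backproject(350, 260, 0.9, K)
print(mug_position)  # approx [0.045, 0.03, 0.9]
```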

⚙️ Method

Framework Overview


🔍 Overview of our MesaTask Framework

1) Task-to-Scene Generation (upper-left)

Given a task instruction, we extract detailed task information including environment, sub-goals, and task-relevant objects. A structured spatial reasoning chain performs object list completion, interrelation inference, and scene graph construction, which guides the generation of 3D layouts. Final scenes are obtained via 3D asset retrieval.
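For illustration, the sketch below shows one possible intermediate scene-graph representation produced by the reasoning chain, with task-relevant objects as nodes and pairwise spatial relations as edges. The schema and relation vocabulary are assumptions for this example, not the exact format used by MesaTask.

```python
# Hedged sketch of a scene-graph intermediate: completed object list plus
# (subject, relation, object) triples that a layout model can turn into poses.
from dataclasses import dataclass, field

@dataclass
class SceneGraph:
    objects: list[str] = field(default_factory=list)                    # completed object list
    relations: list[tuple[str, str, str]] = field(default_factory=list)  # (subj, relation, obj)

    def add_relation(self, subj: str, rel: str, obj: str) -> None:
        for name in (subj, obj):
            if name not in self.objects:
                self.objects.append(name)
        self.relations.append((subj, rel, obj))

# Example for "Place the banana on the plate": the reasoning chain completes the
# object list with context objects and infers their interrelations.
g = SceneGraph()
g.add_relation("banana", "left_of", "plate")
g.add_relation("plate", "on_top_of", "table")
g.add_relation("napkin", "next_to", "plate")
print(g.objects)    # ['banana', 'plate', 'table', 'napkin']
print(g.relations)
```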

2) Reasoning Data Construction (bottom)

Based on the scene graphs and descriptions in our MesaTask-10K dataset, a multimodal LLM is leveraged to produce task instructions, detailed task information, and complete object lists and interrelations.
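As a rough illustration, a (prompt, reasoning) training pair could be serialized from a scene graph as sketched below; the template and field names are assumptions for this example, not the dataset's released format.

```python
# Hedged sketch: serialize an instruction plus its reasoning-chain targets
# (object list and interrelations) into a supervised fine-tuning example.
import json

def make_training_pair(instruction: str, objects: list[str],
                       relations: list[tuple[str, str, str]]) -> dict:
    reasoning = {
        "object_list": objects,
        "interrelations": [f"{s} {r} {o}" for s, r, o in relations],
    }
    return {
        "prompt": f"Task: {instruction}\nInfer the tabletop scene graph.",
        "response": json.dumps(reasoning, indent=2),
    }

pair = make_training_pair(
    "Place the banana on the plate.",
    ["banana", "plate", "table"],
    [("banana", "left_of", "plate"), ("plate", "on_top_of", "table")],
)
print(pair["prompt"])
```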

3) DPO Data Construction (upper right)

To enable DPO training, we generate negative examples by randomly perturbing object positions or relations and removing key objects from normal layouts.
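The sketch below illustrates one way such rejected layouts could be derived from a plausible ("chosen") layout: shift an object's position to break its spatial relations, or drop a key object entirely. The layout schema and perturbation scale are illustrative assumptions, not the exact procedure used in our DPO data construction.

```python
# Hedged sketch of DPO negative construction: perturb a plausible layout into a
# "rejected" counterpart for preference optimization.
import copy
import random

def make_rejected(layout: dict, max_shift: float = 0.3) -> dict:
    bad = copy.deepcopy(layout)
    mode = random.choice(["shift", "drop"])
    if mode == "shift" and bad["objects"]:
        obj = random.choice(bad["objects"])
        obj["position"] = [p + random.uniform(-max_shift, max_shift)
                           for p in obj["position"]]              # break spatial relations
    elif bad["objects"]:
        bad["objects"].pop(random.randrange(len(bad["objects"])))  # remove a key object
    return bad

chosen = {"objects": [
    {"name": "banana", "position": [0.10, 0.00, 0.76]},
    {"name": "plate",  "position": [0.25, 0.00, 0.75]},
]}
rejected = make_rejected(chosen)
# DPO then trains the layout LLM to prefer `chosen` over `rejected`
# for the same task instruction.
```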

BibTeX

# here is the BibTeX