Kinema4D:
Kinematic4D World Modeling for Spatiotemporal Embodied Simulation

1S-Lab, Nanyang Technological University    2SSE, CUHKSZ   
Corresponding Author

We propose Kinema4D, a new action-conditioned 4D generative robotic simulator. Given an initial world image with a robot in a canonical setup space and an action sequence, our method generates future robot-world interactions in 4D space.

Demo video of our work.

Abstract

Simulating robot-world interactions is a cornerstone of Embodied AI. Recently, a few works have shown promise in leveraging video generation to transcend the rigid visual/physical constraints of traditional simulators. However, they primarily operate in 2D space or are guided by static environmental cues, ignoring the fundamental reality that robot-world interactions are inherently 4D spatiotemporal events that require precise interactive modeling. To restore this 4D essence while ensuring precise robot control, we introduce Kinema4D, a new action-conditioned 4D generative robotic simulator that disentangles the robot-world interaction into: i) Precise 4D representation of robot controls: we drive a URDF-based 3D robot via kinematics, producing a precise 4D robot control trajectory. ii) Generative 4D modeling of environmental reactions: we project the 4D robot trajectory into a pointmap as a spatiotemporal visual signal, controlling the generative model to synthesize complex environments' reactive dynamics into synchronized RGB/pointmap sequences. To facilitate training, we curate a large-scale dataset called Robo4D-200k, comprising 201,426 robot interaction episodes with high-quality 4D annotations. Extensive experiments demonstrate that our method effectively simulates physically-plausible, geometry-consistent, and embodiment-agnostic interactions that faithfully mirror diverse real-world dynamics. For the first time, it shows potential zero-shot transfer capability, providing a high-fidelity foundation for advancing next-generation embodied simulation.


Method Overview


1) Kinematics Control: Given a 3D robot with its URDF in an initial canonical setup space, and an action sequence, we drive the 3D robot via kinematics to produce a 4D robot trajectory, which is then projected into a pointmap sequence. This process re-represents raw actions as a spatiotemporal visual signal. 2) 4D Generative World Modeling: This signal and the initial main-view world image are sent to a shared VAE encoder, then fused with an occupancy-aligned robot mask and noise, which are denoised by a Diffusion Transformer to generate a full future 4D (pointmap+RGB) world sequence.
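The kinematics-control step can be sketched in miniature. The snippet below is our illustrative toy, not the paper's code: it drives a planar two-link arm with forward kinematics and splats each frame's 3D link points into a pointmap, i.e. an H x W x 3 image storing a 3D point per pixel. The link lengths and camera intrinsics are assumptions chosen for the example.

```python
# Toy sketch (not the paper's implementation): forward kinematics of a
# 2-link arm, with each frame's 3D points projected into a "pointmap"
# image, mirroring the re-representation of actions as a visual signal.
import numpy as np

def fk_two_link(theta1, theta2, l1=0.3, l2=0.25):
    """Forward kinematics of a planar 2-link arm; returns base/elbow/tip 3D points."""
    p0 = np.array([0.0, 0.0, 1.0])  # assumed base pose, 1 m in front of the camera
    p1 = p0 + np.array([l1 * np.cos(theta1), l1 * np.sin(theta1), 0.0])
    p2 = p1 + np.array([l2 * np.cos(theta1 + theta2),
                        l2 * np.sin(theta1 + theta2), 0.0])
    return np.stack([p0, p1, p2])

def project_to_pointmap(points, K, h=64, w=64):
    """Splat camera-frame 3D points into an H x W x 3 pointmap."""
    pm = np.zeros((h, w, 3), dtype=np.float32)
    for p in points:
        uvw = K @ p                                  # pinhole projection
        u, v = int(uvw[0] / uvw[2]), int(uvw[1] / uvw[2])
        if 0 <= v < h and 0 <= u < w:
            pm[v, u] = p                             # store the 3D point itself
    return pm

K = np.array([[60.0, 0.0, 32.0],   # assumed intrinsics for a 64x64 view
              [0.0, 60.0, 32.0],
              [0.0, 0.0, 1.0]])

# An "action sequence" of joint angles yields a 4D trajectory: T pointmaps.
actions = [(0.1 * t, 0.05 * t) for t in range(4)]
trajectory = [project_to_pointmap(fk_two_link(t1, t2), K) for t1, t2 in actions]
print(len(trajectory), trajectory[0].shape)  # 4 (64, 64, 3)
```

In the actual method this projection is done per frame from a full URDF kinematic chain, so the pointmap sequence carries both the robot's geometry and its motion in a form a visual generative model can condition on.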


Dataset Overview


Our dataset provides a comprehensive data foundation by aggregating diverse real-world demonstrations, including DROID, Bridge, and RT-1. We further incorporate the LIBERO simulator to synthesize a vast array of success and failure cases. Each episode captures a complete robot-world interaction (e.g., pick-and-place), providing the continuous information necessary for robust reasoning. The 4D point clouds viewed from various camera frustums are shown here, demonstrating the spatial precision of our pseudo-annotations.
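To make the episode structure concrete, here is a hypothetical sketch of what one Robo4D-200k-style record could look like. The field names and shapes are our illustrative assumptions, not the dataset's actual schema: each episode pairs an action sequence with synchronized RGB frames and per-pixel 3D pseudo-annotations, plus a source tag and a success/failure label.

```python
# Hypothetical episode record; field names/shapes are illustrative
# assumptions, not the real Robo4D-200k schema.
from dataclasses import dataclass
import numpy as np

@dataclass
class Episode:
    source: str           # e.g. "DROID", "Bridge", "RT-1", "LIBERO"
    success: bool         # synthesized rollouts include success/failure labels
    actions: np.ndarray   # (T, action_dim) action sequence
    rgb: np.ndarray       # (T, H, W, 3) main-view frames
    pointmaps: np.ndarray # (T, H, W, 3) per-pixel 3D pseudo-annotations

    def duration(self) -> int:
        """Number of timesteps in the episode."""
        return self.actions.shape[0]

# A tiny synthetic episode with T=8 steps and 64x64 views.
ep = Episode(
    source="LIBERO",
    success=False,
    actions=np.zeros((8, 7)),
    rgb=np.zeros((8, 64, 64, 3), dtype=np.uint8),
    pointmaps=np.zeros((8, 64, 64, 3), dtype=np.float32),
)
print(ep.duration())  # 8
```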

4D Qualitative Comparison with TesserAct [ICCV 2025]


1. Simulation of successful task completion. Unlike TesserAct, which hallucinates outcomes, our method faithfully reflects Ground-Truth (GT) executions, producing physically-plausible results:

Ours
GT
TesserAct

Ours and GT manipulate the same object => similar outcome; TesserAct generates a non-existent object => wrong outcome.

Ours
GT
TesserAct

Ours and GT make contact and manipulate => similar outcome; TesserAct manipulates without contact => wrong outcome.

Ours
GT
TesserAct

Ours and GT grasp and manipulate the same object => similar outcome; TesserAct grasps two objects => wrong outcome.

Ours
GT
TesserAct

Ours and GT manipulate the same object => similar outcome; TesserAct generates a non-existent object => wrong outcome.


2. Simulation of failed task completion (More interesting/reasonable!!!). Our method precisely reflects "near-miss" failure cases.
Specifically, our model correctly interprets the spatial gap between the gripper and the object, even when their RGB textures overlap in 2D views.

Ours
GT
TesserAct

Both ours and GT fail to grasp the object; TesserAct still manipulates it.

Ours
GT
TesserAct

Both ours and GT fail to grasp the object; TesserAct still manipulates it.

Ours
GT
TesserAct

Both ours and GT fail to grasp the object; TesserAct still manipulates it and hallucinates the outcome.

Ours
GT
TesserAct

Both ours and GT fail to grasp the object; TesserAct still manipulates it.

Qualitative Results of Policy Evaluation


1. Evaluation in a real-world environment (first-time out-of-distribution test!!!).
To compare intuitively with GT, we also reconstruct the GT and present both 4D and 2D results.
Our results align with real outcomes, accurately synthesizing both successful rollouts and "near-miss" failures.

Ours (4D)
Ours (2D)
GT (2D)
GT (4D)

Successful task completion rollouts.

Ours (4D)
Ours (2D)
GT (2D)
GT (4D)

"Near-miss" failure cases. Our model correctly interprets the spatial gap between the gripper and the object, even when their RGB textures overlap in 2D views.


2. Evaluation in LIBERO simulation environment.
Our method generates outcomes similar to the Ground-Truth for both successful and "near-miss" failed task completions.

Ours
GT

Successfully pick and place.

Ours
GT

Successfully pick and place.

Ours
GT

Successfully pick and place.

Ours
GT

Failed task completion.

Ours
GT

Failed task completion.

Ours
GT

Failed task completion.

Extensive Results Showcase


Our method simulates physically-plausible and geometrically-consistent interactions between complex robot actions and diverse objects, across various spatial constraints and different embodiments.

Ours
GT

Drag a deformable object.

Ours
GT

Pick and place under complex spatial constraints.

Ours
GT

Pick and place of a small object.

Ours
GT

Pick and place on another object.

Ours
GT

Drag a large deformable object.

Ours
GT

Fold a deformable object.

Ours
GT

Pick and move into closer view.

Ours
GT

Subtle contact and open a shelf.

Ours
GT

Subtle pick and place a tiny object.

Ours
GT

Open the door to a new world.

Ours
GT

Close a microwave oven.

Ours
GT

Close a microwave oven.

Ours
GT

Pick and place a small object.

Ours
GT

Pick and place a small object.

Ours
GT

Both fail to grasp the object.

Ours
GT

Pick and place a small object.

Ours
GT

Pick and place a small object.

Ours
GT

Pick and place a small object.

Ours
GT

Open a drawer.

Ours
GT

Pick and place a transparent object.

Ours
GT

Pick and place a small object.

BibTeX


@article{xu2026kinema4d,
title={Kinema4D: Kinematic4D World Modeling for Spatiotemporal Embodied Simulation},
author={Xu, Mutian and Zhang, Tianbao and Liu, Tianqi and Chen, Zhaoxi and Han, Xiaoguang and Liu, Ziwei},
journal={arXiv preprint arXiv:2603.16669},
year={2026}
}