MemoBench: Benchmarking World Modeling in Dynamically Changing Environments

Overview

MemoBench is a diagnostic benchmark for evaluating memory consistency in world generation models. Each clip follows a disappear-and-reappear structure: a target object is visible, the camera pans away so that the object leaves the frame, and then the camera returns, requiring the model to faithfully recover the object's appearance, position, and state.

360 ground-truth video clips

1080p resolution (60 FPS synthetic / 30 FPS real)

Models benchmarked

Automated metrics

4 VQA dimensions

Three-Phase Paradigm

Every benchmark clip is divided into three phases:

Visible (V): Target object is fully in view.
Disappeared (D): Camera has panned away; target is completely out of the field of view.
Reappear (R): Camera returns; the model must recover the target's updated state.
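The three-phase split can be illustrated with a minimal sketch. It assumes a hypothetical per-frame visibility flag for the target (the function name `split_phases` and the input format are illustrative, not part of the benchmark's released tooling) and treats a frame as either fully visible or fully out of view.

```python
from typing import List, Tuple


def split_phases(visible: List[bool]) -> Tuple[range, range, range]:
    """Split frame indices into Visible, Disappeared, and Reappear phases.

    `visible[i]` is True when the target is in view at frame i.
    This flag is a hypothetical input used only for illustration.
    """
    n = len(visible)
    if n == 0 or not visible[0]:
        raise ValueError("clip must start with the target visible")
    # The first frame where the target is out of view starts the D phase.
    try:
        d_start = visible.index(False)
    except ValueError:
        raise ValueError("target never disappears") from None
    # The first visible frame after that starts the R phase.
    try:
        r_start = visible.index(True, d_start)
    except ValueError:
        raise ValueError("target never reappears") from None
    return range(0, d_start), range(d_start, r_start), range(r_start, n)


v_phase, d_phase, r_phase = split_phases(
    [True, True, True, False, False, True, True]
)
```

Phase-restricted metrics can then index frames with `r_phase` (for example) instead of the whole clip.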

Dataset

196 synthetic clips (14 scenes, 5 categories)

164 real-world clips (30 processes, 7 categories)

Synthetic Data

Rendered in Unreal Engine 5 with diverse 3D scenes and animated target objects. Includes per-frame RGB, metric depth, camera intrinsics, and camera-to-world poses.

Synthetic data: 14 scenes across 5 categories with UE5-rendered ground truth.
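Because the synthetic clips ship with camera intrinsics and camera-to-world poses, one can check whether a known 3D point lies inside the field of view at a given frame. The sketch below is only an illustration: it assumes a pinhole camera with an OpenCV-style axis convention (x right, y down, z forward) and a rigid camera-to-world matrix, neither of which is stated on this page, so exported UE5 poses may need a convention change first. The function name `in_field_of_view` is hypothetical.

```python
import numpy as np


def in_field_of_view(point_world, cam_to_world, K, width, height):
    """Return True if a 3D world point projects inside the image bounds.

    Assumes a pinhole camera looking along +z (OpenCV-style convention)
    and a rigid 4x4 camera-to-world matrix; both are assumptions here.
    """
    cam_to_world = np.asarray(cam_to_world, dtype=np.float64)
    K = np.asarray(K, dtype=np.float64)
    R = cam_to_world[:3, :3]
    t = cam_to_world[:3, 3]
    # Invert the rigid transform: p_cam = R^T (p_world - t).
    x, y, z = R.T @ (np.asarray(point_world, dtype=np.float64) - t)
    if z <= 0:
        return False  # point is behind (or level with) the camera plane
    # Pinhole projection to pixel coordinates.
    u = K[0, 0] * x / z + K[0, 2]
    v = K[1, 1] * y / z + K[1, 2]
    return bool(0 <= u < width and 0 <= v < height)


K = np.array([[1000.0, 0.0, 960.0],
              [0.0, 1000.0, 540.0],
              [0.0, 0.0, 1.0]])
# Camera at the origin looking along +z sees a point 5 units ahead.
print(in_field_of_view([0.0, 0.0, 5.0], np.eye(4), K, 1920, 1080))
```

A check like this is one way to label frames as belonging to the Disappeared phase when the target's 3D position is known.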

Real-World Data

Captured in controlled indoor settings covering 30 physical-state-change processes (e.g., dissolving, melting, pouring) across categories that depend on viscosity, elasticity, and thermal conductivity.

Real-world data: 30 processes across 7 categories captured in controlled indoor settings.

Evaluation Metrics

We combine automated metrics with VQA-based evaluation to capture both low-level fidelity and high-level semantic correctness.

General Video Quality

Visual Quality — AestheticScore + CLIP-IQA+ averaged (0–100)

Motion Smoothness — RAFT optical-flow warp stability (V+R phases)

Object Identity Consistency — DINOv2 patch-token similarity, first frame vs. R-phase

Geo3D Consistency — Depth Anything V2 cosine similarity between consecutive depth maps
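Two of the metrics above reduce to cosine similarity once the features are extracted. The sketch below assumes the depth maps and patch tokens have already been produced (by Depth Anything V2 and DINOv2 respectively) and are passed in as NumPy arrays. How the benchmark aggregates and rescales these values is not specified here, so the averaging choices and function names are assumptions.

```python
import numpy as np


def cosine_similarity(a, b, eps=1e-8):
    """Cosine similarity between two arrays, flattened to vectors."""
    a = np.asarray(a, dtype=np.float64).ravel()
    b = np.asarray(b, dtype=np.float64).ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + eps))


def depth_consistency(depth_maps):
    """Mean cosine similarity between consecutive depth maps of shape (T, H, W)."""
    sims = [
        cosine_similarity(depth_maps[t], depth_maps[t + 1])
        for t in range(len(depth_maps) - 1)
    ]
    return float(np.mean(sims))


def patch_token_similarity(tokens_ref, tokens_query, eps=1e-8):
    """Mean cosine similarity between matching rows of two (N, D) token arrays.

    Comparing patches position by position is an assumption of this sketch.
    """
    ref = np.asarray(tokens_ref, dtype=np.float64)
    qry = np.asarray(tokens_query, dtype=np.float64)
    ref = ref / (np.linalg.norm(ref, axis=1, keepdims=True) + eps)
    qry = qry / (np.linalg.norm(qry, axis=1, keepdims=True) + eps)
    return float(np.mean(np.sum(ref * qry, axis=1)))
```

For Object Identity Consistency, `tokens_ref` would come from the first frame and `tokens_query` from an R-phase frame.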

Memory-Specific Metrics

Object Reappearance Score (ORS) — SAM-3 text-prompted detection rate × confidence in R-phase

Pixel-Level Fidelity — PSNR, SSIM, LPIPS against ground-truth video

Camera Controllability — ATE rotation RMSE via MapAnything pose estimation
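As a rough illustration of two of these metrics, the sketch below computes PSNR with its standard definition and an ORS-style score. The ORS part is one plausible reading of "detection rate × confidence": the fraction of R-phase frames with a detection, multiplied by the mean confidence over those detected frames. The detection threshold, the input format, and the function names are assumptions, and the detector itself (SAM-3) is not called here.

```python
import numpy as np


def object_reappearance_score(confidences, threshold=0.5):
    """Detection rate times mean confidence over R-phase frames.

    `confidences` holds one top detection confidence per R-phase frame,
    with None when nothing was detected. The threshold is an assumption.
    """
    conf = np.array(
        [0.0 if c is None else c for c in confidences], dtype=np.float64
    )
    if conf.size == 0:
        return 0.0
    detected = conf >= threshold
    if not detected.any():
        return 0.0
    detection_rate = detected.mean()
    return float(detection_rate * conf[detected].mean())


def psnr(pred, target, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two same-shaped arrays."""
    pred = np.asarray(pred, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)
    mse = np.mean((pred - target) ** 2)
    if mse == 0:
        return float("inf")  # identical inputs
    return float(10.0 * np.log10(max_value ** 2 / mse))


# Two of four frames pass the threshold, with mean confidence 0.8.
score = object_reappearance_score([0.9, None, 0.7, 0.2])
```

With this reading, a model that never regenerates the target scores 0, and one that regenerates it in every R-phase frame scores its mean detection confidence.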

VQA Dimensions

Instruction Following — Does the video execute spatiotemporal instructions from the prompt?

Object & Background Consistency — Are foreground/background elements stable across frames?

Continuity of Memory — Does the model maintain object identity after disappearance?

Physics Adherence — Are locomotion, gravity, and lighting physically plausible?

VQA evaluation pipeline: for each clip, we generate dimension-specific questions, filter against ground-truth video and failure cases, then query a VLM to produce per-dimension scores.
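The final scoring step of such a pipeline can be sketched as follows. This is an assumption about the aggregation, not the benchmark's documented procedure: each retained question carries a dimension label and an expected answer, the VLM's answer is compared against it, and the per-dimension score is the percentage of matches. The record layout and the function name are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def per_dimension_scores(
    records: Iterable[Tuple[str, str, str]]
) -> Dict[str, float]:
    """Percentage of VLM answers that match the expected answer, per dimension.

    Each record is (dimension, expected_answer, vlm_answer); this layout
    is a hypothetical one chosen for illustration.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for dimension, expected, answer in records:
        total[dimension] += 1
        # Compare case-insensitively and ignore surrounding whitespace.
        if answer.strip().lower() == expected.strip().lower():
            correct[dimension] += 1
    return {d: 100.0 * correct[d] / total[d] for d in total}


scores = per_dimension_scores([
    ("Continuity of Memory", "yes", "Yes"),
    ("Continuity of Memory", "yes", "no"),
    ("Physics Adherence", "no", "no"),
])
```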

Detailed VQA Pipeline

The three filtering stages of the pipeline are demonstrated on real example clips.

Automated Evaluation Results

The latest ranking and numerical results are available online.

Model	VisQual ↑	MotSmooth ↑	ObjConsist ↑	3DConsist ↑	ORS ↑	PSNR ↑	SSIM ↑	LPIPS ↓	CamCtrl ↑	ImgReward ↑
CI2V Models
LingBot-World	47.4	57.6	59.0	88.2	0.381	14.41	0.490	0.482	37.4	36.7
Wan2.2	40.0	54.0	50.7	84.5	0.328	13.76	0.469	0.529	29.8	26.1
FantasyWorld	51.0	55.2	47.6	88.7	0.276	13.23	0.427	0.571	27.2	30.7
3D-based Models
Matrix-Game 2.0	61.2	83.6	46.5	93.7	0.157	13.49	0.376	0.550	17.3	22.3
Stable Virtual Camera	43.3	63.1	59.5	88.5	0.294	15.36	0.523	0.455	65.2	22.3
I2V Models
Open-SoRA	49.7	68.3	47.2	89.7	0.182	12.54	0.384	0.566	16.8	31.3
LTX-Video	44.9	84.4	81.6	94.1	0.330	13.42	0.455	0.518	17.1	37.1
CogVideoX	40.1	59.8	54.0	94.0	0.251	12.07	0.480	0.592	12.0	34.9

VQA-based Evaluation Results

Model	Inst.Fol. ↑	Obj.&Bkg. ↑	Cont.Mem. ↑	Phys.Adh. ↑
CI2V Models
LingBot-World	64.2	44.4	42.1	53.6
Wan2.2	50.6	30.2	36.8	38.9
FantasyWorld	50.5	25.6	37.1	33.6
HunyuanWorldPlay	61.6	66.4	55.6	63.6
HunyuanGameCraft	41.6	71.6	48.4	61.0
3D-based Models
Matrix-Game 2.0	37.5	12.7	36.5	21.8
Stable Virtual Camera	49.7	23.8	29.6	33.3
I2V Models
Open-SoRA	43.2	66.8	48.3	59.7
LTX-Video	41.0	76.6	57.0	63.5
CogVideoX	40.5	52.4	42.7	42.8

Model Performance Overview

Radar charts summarizing per-model strengths across automated metrics and VQA dimensions.

Automated metrics (7 axes, normalized). A larger enclosed area indicates stronger overall performance.

VQA dimensions (4 axes). CI2V models lead Instruction Following; I2V models lead the other dimensions, largely because their cameras remain inactive.

Human Validation

We validate our VQA pipeline by comparing VLM-generated answers with human judgments from 30 respondents (Ph.D.-level researchers and experienced AI engineers) on 96 questions across 12 scenes.

Overall agreement rate: 92.9%

Cohen's kappa (majority vote): κ = 0.85

Human responses collected

Validated questions

Per-scene, per-dimension agreement rate between human majority answers and VLM-generated ground-truth answers.
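For reference, the two agreement statistics reported above can be computed as in the sketch below. It uses the standard definitions of raw agreement and Cohen's kappa for two label sequences, with a simple majority vote to collapse the human responses per question. The tie-breaking behavior and the toy labels are illustrative assumptions, not the study's data.

```python
from collections import Counter
from typing import List, Sequence


def majority_vote(votes: Sequence[str]) -> str:
    """Most common label; ties resolve to the label seen first."""
    return Counter(votes).most_common(1)[0][0]


def agreement_rate(a: Sequence[str], b: Sequence[str]) -> float:
    """Fraction of positions where the two label sequences agree."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohens_kappa(a: Sequence[str], b: Sequence[str]) -> float:
    """Cohen's kappa: (p_o - p_e) / (1 - p_e)."""
    n = len(a)
    p_o = agreement_rate(a, b)
    count_a, count_b = Counter(a), Counter(b)
    # Chance agreement from the two marginal label distributions.
    p_e = sum(count_a[l] * count_b[l] for l in set(a) | set(b)) / n ** 2
    if p_e == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (p_o - p_e) / (1.0 - p_e)


# Toy example: three human votes per question, then compare with VLM answers.
human: List[str] = [
    majority_vote(v)
    for v in (["yes", "yes", "no"], ["yes", "no", "yes"],
              ["no", "no", "no"], ["no", "yes", "no"])
]
vlm = ["yes", "no", "no", "no"]
kappa = cohens_kappa(human, vlm)
```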


Qualitative Comparisons

Barnyard (Synthetic)