MemoBench

Benchmarking World Modeling in Dynamically Changing Environments

¹Harvard University, ²MIT, ³MIT-IBM Watson AI Lab, ⁴Boston University, ⁵Google, ⁶Johns Hopkins University, ⁷Carnegie Mellon University
ECCV 2026

Teaser: MemoBench evaluates memory consistency in world generation models through a disappear-and-reappear paradigm.

Qualitative Comparisons

Compare generated videos across all 8 models.

Overview

MemoBench is a diagnostic benchmark for evaluating memory consistency in world generation models. Each clip follows a disappear-and-reappear structure: a target object is visible, the camera pans away causing it to disappear, and then the camera returns, requiring the model to faithfully recover the object's appearance, position, and state.

  • 360 ground-truth video clips
  • 1080p resolution (60 FPS synthetic / 30 FPS real)
  • 8 models benchmarked
  • 10 automated metrics
  • 4 VQA dimensions

Three-Phase Paradigm

Every benchmark clip is divided into three phases:

  • Visible (V): Target object is fully in view.
  • Disappeared (D): Camera has panned away; target is completely out of the field of view.
  • Reappear (R): Camera returns; the model must recover the target's updated state.
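
The three phases above can be recovered from a per-frame visibility flag. The following is a minimal sketch, not the benchmark's own tooling: the `visible` list and the `split_phases` helper are hypothetical names, and the code assumes each clip contains exactly one disappear-and-reappear event.

```python
from typing import List, Tuple


def split_phases(visible: List[bool]) -> Tuple[range, range, range]:
    """Split frame indices into Visible (V), Disappeared (D), Reappear (R).

    Assumes the clip starts with the target in view and contains exactly
    one contiguous run of out-of-view frames, followed by in-view frames.
    """
    n = len(visible)
    if n == 0 or not visible[0]:
        raise ValueError("clip must start with the target visible")
    # First out-of-view frame marks the end of the V phase.
    d_start = next((i for i, v in enumerate(visible) if not v), None)
    if d_start is None:
        raise ValueError("target never disappears")
    # First in-view frame after the disappearance marks the start of R.
    r_start = next((i for i in range(d_start, n) if visible[i]), None)
    if r_start is None:
        raise ValueError("target never reappears")
    if not all(visible[r_start:]):
        raise ValueError("more than one disappearance in clip")
    return range(0, d_start), range(d_start, r_start), range(r_start, n)


flags = [True, True, False, False, False, True, True]
v_phase, d_phase, r_phase = split_phases(flags)
print(list(v_phase), list(d_phase), list(r_phase))
```

Returning index ranges rather than copies of the frames keeps the helper cheap and lets phase-restricted metrics slice the video themselves.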

Dataset

  • 196 synthetic clips (14 scenes, 5 categories)
  • 164 real-world clips (30 processes, 7 categories)

Synthetic Data

Synthetic clips are rendered in Unreal Engine 5 with diverse 3D scenes and animated target objects. Each clip includes per-frame RGB, metric depth, camera intrinsics, and camera-to-world poses.
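
Per-frame intrinsics and camera-to-world poses make it possible to check whether a target is inside the field of view in a given frame. The sketch below is illustrative only: it assumes an OpenCV-style pinhole camera (x right, y down, z forward), which may differ from the axis convention of the released poses, and the function name and argument layout are hypothetical.

```python
from typing import Sequence

Matrix3 = Sequence[Sequence[float]]


def in_field_of_view(
    p_world: Sequence[float],
    r_c2w: Matrix3,          # 3x3 rotation block of the camera-to-world pose
    t_c2w: Sequence[float],  # translation (camera center in world coordinates)
    fx: float, fy: float, cx: float, cy: float,  # pinhole intrinsics
    width: int, height: int,
) -> bool:
    """Return True if a 3D world point projects inside the image."""
    # Invert the pose: p_cam = R^T (p_world - t).
    d = [p_world[i] - t_c2w[i] for i in range(3)]
    x, y, z = (sum(r_c2w[i][j] * d[i] for i in range(3)) for j in range(3))
    if z <= 0:  # point is behind (or level with) the camera plane
        return False
    u = fx * x / z + cx
    v = fy * y / z + cy
    return 0 <= u < width and 0 <= v < height


identity = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
# Camera panned 90 degrees about its vertical axis (now looking along world +x).
panned = [[0, 0, 1], [0, 1, 0], [-1, 0, 0]]
target = [0.0, 0.0, 5.0]
intrinsics = dict(fx=1000.0, fy=1000.0, cx=960.0, cy=540.0, width=1920, height=1080)

print(in_field_of_view(target, identity, [0, 0, 0], **intrinsics))
print(in_field_of_view(target, panned, [0, 0, 0], **intrinsics))
```

Testing a single point is a simplification; a fuller check would project the target's bounding volume and could use the metric depth to account for occlusion.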


Synthetic data: 14 scenes across 5 categories with UE5-rendered ground truth.

Real-World Data

Real-world clips are captured in controlled indoor settings and cover 30 physical-state-change processes (e.g., dissolving, melting, pouring) across categories that depend on viscosity, elasticity, and thermal conductivity.


Real-world data: 30 processes across 7 categories captured in controlled indoor settings.

Evaluation Metrics

We combine automated metrics with VQA-based evaluation to capture both low-level fidelity and high-level semantic correctness.

General Video Quality

Visual Quality — AestheticScore + CLIP-IQA+ averaged (0–100)

Motion Smoothness — RAFT optical-flow warp stability (V+R phases)

Object Identity Consistency — DINOv2 patch-token similarity, first frame vs. R-phase

Geo3D Consistency — Depth Anything V2 cosine similarity between consecutive depth maps
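
To make the Geo3D Consistency idea concrete, the sketch below computes the cosine similarity between consecutive depth maps and averages it over a clip. It is a pure-Python illustration under stated assumptions: the depth maps are taken as already estimated (no Depth Anything V2 call is made), averaging over frame pairs is an assumed aggregation, and `depth_consistency` is a hypothetical name.

```python
import math
from itertools import chain
from typing import List, Sequence

DepthMap = Sequence[Sequence[float]]  # rows of per-pixel depth values


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat an all-zero map as dissimilar to anything
    return dot / (norm_a * norm_b)


def depth_consistency(depth_maps: List[DepthMap]) -> float:
    """Mean cosine similarity between consecutive flattened depth maps."""
    if len(depth_maps) < 2:
        raise ValueError("need at least two frames")
    flat = [list(chain.from_iterable(d)) for d in depth_maps]
    sims = [cosine_similarity(p, q) for p, q in zip(flat, flat[1:])]
    return sum(sims) / len(sims)


static_clip = [[[1.0, 2.0], [3.0, 4.0]]] * 3
jumping_clip = [[[1.0, 0.0], [0.0, 0.0]], [[0.0, 1.0], [0.0, 0.0]]]
print(depth_consistency(static_clip))   # close to 1: geometry is stable
print(depth_consistency(jumping_clip))  # 0.0: depth structure changed entirely
```

A production version would operate on NumPy or PyTorch tensors; plain lists are used here only to keep the example dependency-free.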

Memory-Specific Metrics

Object Reappearance Score (ORS) — SAM-3 text-prompted detection rate × confidence in R-phase

Pixel-Level Fidelity — PSNR, SSIM, LPIPS against ground-truth video

Camera Controllability — ATE rotation RMSE via MapAnything pose estimation
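
The memory-specific metrics above can be sketched in a few lines each. The code below is a hedged illustration, not the benchmark's implementation: it takes detector confidences and estimated rotations as given (no SAM-3 or MapAnything calls), assumes ORS is the detection rate multiplied by the mean confidence of detected frames, and assumes the rotation error is the RMSE of per-frame geodesic angles. All function names are hypothetical.

```python
import math
from typing import List, Optional, Sequence

Matrix3 = Sequence[Sequence[float]]


def object_reappearance_score(confidences: Sequence[Optional[float]]) -> float:
    """Detection rate times mean confidence over R-phase frames.

    One entry per R-phase frame: the detector confidence, or None when
    the target was not detected (assumed input format).
    """
    if not confidences:
        raise ValueError("need at least one R-phase frame")
    hits = [c for c in confidences if c is not None]
    if not hits:
        return 0.0
    detection_rate = len(hits) / len(confidences)
    return detection_rate * (sum(hits) / len(hits))


def psnr(pred: Sequence[float], target: Sequence[float], max_val: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB between two flattened frames."""
    if len(pred) != len(target) or not pred:
        raise ValueError("frames must be non-empty and the same size")
    mse = sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)
    return math.inf if mse == 0 else 10.0 * math.log10(max_val ** 2 / mse)


def rotation_angle(r_gt: Matrix3, r_est: Matrix3) -> float:
    """Geodesic angle in radians between two 3x3 rotation matrices."""
    # trace(R_gt^T R_est) equals the element-wise inner product of the two.
    trace = sum(r_gt[i][j] * r_est[i][j] for i in range(3) for j in range(3))
    return math.acos(max(-1.0, min(1.0, (trace - 1.0) / 2.0)))


def rotation_rmse(gt: List[Matrix3], est: List[Matrix3]) -> float:
    """Root-mean-square of per-frame rotation angle errors (radians)."""
    if len(gt) != len(est) or not gt:
        raise ValueError("trajectories must be non-empty and the same length")
    squared = [rotation_angle(a, b) ** 2 for a, b in zip(gt, est)]
    return math.sqrt(sum(squared) / len(squared))


identity = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
quarter_turn_z = [[0, -1, 0], [1, 0, 0], [0, 0, 1]]
print(object_reappearance_score([0.9, None, 0.7, 0.8]))
print(psnr([10, 20], [12, 18]))
print(math.degrees(rotation_rmse([identity, identity], [identity, quarter_turn_z])))
```

Clamping the cosine before `acos` guards against values that drift slightly outside [-1, 1] through floating-point error in estimated poses.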

VQA Dimensions

Instruction Following — Does the video execute spatiotemporal instructions from the prompt?

Object & Background Consistency — Are foreground/background elements stable across frames?

Continuity of Memory — Does the model maintain object identity after disappearance?

Physics Adherence — Are locomotion, gravity, and lighting physically plausible?

VQA Pipeline

VQA evaluation pipeline: for each clip, we generate dimension-specific questions, filter against ground-truth video and failure cases, then query a VLM to produce per-dimension scores.
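
The final scoring step of such a pipeline can be sketched as follows. This is an assumption-laden illustration: it supposes each retained question has a single expected answer, that the VLM's answer is compared by case-insensitive exact match, and that a dimension's score is the percentage of matching answers. `VQAItem` and `per_dimension_scores` are hypothetical names, and question generation, filtering, and the VLM query itself are out of scope.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class VQAItem:
    dimension: str  # e.g. "Continuity of Memory"
    expected: str   # answer derived from the ground-truth video
    predicted: str  # answer the VLM gave for the generated video


def per_dimension_scores(items: Iterable[VQAItem]) -> Dict[str, float]:
    """Percentage of correctly answered questions for each VQA dimension."""
    correct: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for item in items:
        total[item.dimension] += 1
        if item.predicted.strip().lower() == item.expected.strip().lower():
            correct[item.dimension] += 1
    return {dim: 100.0 * correct[dim] / total[dim] for dim in total}


answers = [
    VQAItem("Continuity of Memory", "yes", "Yes"),
    VQAItem("Continuity of Memory", "yes", "no"),
    VQAItem("Physics Adherence", "no", "no"),
]
scores = per_dimension_scores(answers)
print(scores)
```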

Detailed VQA Pipeline

Walk through the three filtering stages on real example clips.

Automated Evaluation Results

See our leaderboard for the latest ranking and numerical results.
Model | VisQual ↑ | MotSmooth ↑ | ObjConsist ↑ | 3DConsist ↑ | ORS ↑ | PSNR ↑ | SSIM ↑ | LPIPS ↓ | CamCtrl ↑ | ImgReward ↑
CI2V Models
LingBot-World | 47.4 | 57.6 | 59.0 | 88.2 | 0.381 | 14.41 | 0.490 | 0.482 | 37.4 | 36.7
Wan2.2 | 40.0 | 54.0 | 50.7 | 84.5 | 0.328 | 13.76 | 0.469 | 0.529 | 29.8 | 26.1
FantasyWorld | 51.0 | 55.2 | 47.6 | 88.7 | 0.276 | 13.23 | 0.427 | 0.571 | 27.2 | 30.7
3D-based Models
Matrix-Game 2.0 | 61.2 | 83.6 | 46.5 | 93.7 | 0.157 | 13.49 | 0.376 | 0.550 | 17.3 | 22.3
Stable Virtual Camera | 43.3 | 63.1 | 59.5 | 88.5 | 0.294 | 15.36 | 0.523 | 0.455 | 65.2 | 22.3
I2V Models
Open-SoRA | 49.7 | 68.3 | 47.2 | 89.7 | 0.182 | 12.54 | 0.384 | 0.566 | 16.8 | 31.3
LTX-Video | 44.9 | 84.4 | 81.6 | 94.1 | 0.330 | 13.42 | 0.455 | 0.518 | 17.1 | 37.1
CogVideoX | 40.1 | 59.8 | 54.0 | 94.0 | 0.251 | 12.07 | 0.480 | 0.592 | 12.0 | 34.9

VQA-based Evaluation Results

Model | Inst.Fol. ↑ | Obj.&Bkg. ↑ | Cont.Mem. ↑ | Phys.Adh. ↑
CI2V Models
LingBot-World | 64.2 | 44.4 | 42.1 | 53.6
Wan2.2 | 50.6 | 30.2 | 36.8 | 38.9
FantasyWorld | 50.5 | 25.6 | 37.1 | 33.6
HunyuanWorldPlay | 61.6 | 66.4 | 55.6 | 63.6
HunyuanGameCraft | 41.6 | 71.6 | 48.4 | 61.0
3D-based Models
Matrix-Game 2.0 | 37.5 | 12.7 | 36.5 | 21.8
Stable Virtual Camera | 49.7 | 23.8 | 29.6 | 33.3
I2V Models
Open-SoRA | 43.2 | 66.8 | 48.3 | 59.7
LTX-Video | 41.0 | 76.6 | 57.0 | 63.5
CogVideoX | 40.5 | 52.4 | 42.7 | 42.8

Model Performance Overview

Radar charts summarizing per-model strengths across automated metrics and VQA dimensions.

Automated Metrics Radar Chart

Automated metrics (7 axes, normalized). A larger enclosed area indicates stronger overall performance.

VQA Dimensions Radar Chart

VQA dimensions (4 axes). CI2V models lead Instruction Following; I2V models lead the other dimensions, largely because they keep the camera nearly static.

Human Validation

We validate our VQA pipeline by comparing VLM-generated answers with human judgments from 30 respondents (Ph.D.-level researchers and experienced AI engineers) on 96 questions across 12 scenes.

  • 92.9% overall agreement rate
  • κ = 0.85 (Cohen's kappa, majority vote)
  • 30 human responses collected
  • 96 validated questions
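
For reference, the agreement rate and Cohen's kappa reported here can be computed from two aligned answer lists. The sketch below uses the standard kappa formula, (p_o - p_e) / (1 - p_e); the answer data is made up for illustration, the helper names are hypothetical, and the tie-breaking in `majority_vote` (first most common answer wins) is an assumption rather than the study's documented rule.

```python
from collections import Counter
from typing import List, Sequence


def majority_vote(votes: Sequence[str]) -> str:
    """Most common answer among respondents (ties go to the first seen)."""
    return Counter(votes).most_common(1)[0][0]


def agreement_rate(a: Sequence[str], b: Sequence[str]) -> float:
    """Fraction of questions on which the two raters give the same answer."""
    if len(a) != len(b) or not a:
        raise ValueError("answer lists must be non-empty and the same length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohens_kappa(a: Sequence[str], b: Sequence[str]) -> float:
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(a)
    p_observed = agreement_rate(a, b)
    counts_a, counts_b = Counter(a), Counter(b)
    labels = set(counts_a) | set(counts_b)
    p_expected = sum(counts_a[k] * counts_b[k] for k in labels) / (n * n)
    if p_expected == 1:
        return 1.0  # both raters always give the same single label
    return (p_observed - p_expected) / (1 - p_expected)


# Illustrative data only: per-question votes from three respondents.
votes_per_question: List[List[str]] = [
    ["yes", "yes", "no"], ["yes", "yes", "yes"], ["no", "no", "yes"],
    ["no", "no", "no"], ["yes", "no", "yes"], ["no", "yes", "no"],
    ["yes", "yes", "yes"], ["yes", "yes", "no"],
]
human = [majority_vote(v) for v in votes_per_question]
vlm = ["yes", "yes", "no", "yes", "yes", "no", "yes", "no"]
print(agreement_rate(human, vlm), cohens_kappa(human, vlm))
```

Kappa is lower than the raw agreement rate whenever some agreement would be expected by chance, which is why both numbers are worth reporting together.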
Human-VLM Agreement Heatmap

Per-scene, per-dimension agreement rate between human majority answers and VLM-generated ground-truth answers.