Furniture assembly as a spatio-temporal stress test

FLAT-PACKBENCH

Evaluating Spatio-Temporal Understanding in Large Vision-Language Models through Furniture Assembly

Aditya Chetan, Eric Cai, Peeyush Kushwaha, Bharath Raj Nagoor Kani, Utkarsh Mall, Qianqian Wang,
Noah Snavely, Bharath Hariharan

✨ CVPR 2026 ✨

Overview

Assembly videos expose gaps in visual temporal reasoning.

Flat-Pack Bench is a multiple-choice benchmark for fine-grained spatio-temporal understanding. Each question asks a model to reason about parts, contact events, assembly order, final connectivity, or object identity across time in real furniture assembly videos.

Benchmark

Four question families target complementary assembly skills.

TLOC 103 questions

Temporal Localization

Identify which connection happened first, last, next, or most recently relative to an assembly state.

TORD 155 questions

Temporal Ordering

Recover the exact order in which parts or part-pairs become physically connected.

MATE 87 questions

Mating

Determine whether two labeled parts are connected in the fully assembled furniture.

TRACK 257 questions

Tracking

Match labeled parts across two states, requiring object identity through occlusion and motion.

Prompt construction

Questions combine a video with visual prompts that label the relevant parts. Non-tracking questions use one annotated image, while tracking questions use an Image A to Image B correspondence prompt.

Dataset Viewer

Switch between curated examples and the full benchmark browser.

View

Video

0 curated samples Use arrow keys

Evaluation

Models are scored as multiple-choice video question answerers.

Inputs

The evaluation compares keyframe videos, trimmed videos, and visual prompt construction variants such as separate images, collages, and concatenated prompt media.

Outputs

Models answer with an option label. We report micro-average accuracy and per-category accuracy for TORD, TLOC, TRACK, and MATE.

Results

Current LVLMs remain far below human performance.

Human 94.18

micro-average accuracy

Open models 41.03

InternVL3-78B in the paper's main benchmark setting

Closed models 37.71

OpenAI GPT-5 in the paper's main benchmark setting

Citation

BibTeX

TBA