Furniture assembly as a spatio-temporal stress test
Evaluating Spatio-Temporal Understanding in Large Vision-Language Models through Furniture Assembly
Overview
Flat-Pack Bench is a multiple-choice benchmark for fine-grained spatio-temporal understanding. Each question asks a model to reason about parts, contact events, assembly order, final connectivity, or object identity across time in real furniture assembly videos.
Benchmark
TLOC (Temporal Localization): Identify which connection happened first, last, next, or most recently relative to an assembly state.
TORD: Recover the exact order in which parts or part-pairs become physically connected.
MATE: Determine whether two labeled parts are connected in the fully assembled furniture.
TRACK: Match labeled parts across two states, which requires maintaining object identity through occlusion and motion.
Questions combine a video with visual prompts that label the relevant parts. Non-tracking questions use one annotated image, while tracking questions use an Image A to Image B correspondence prompt.
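As a rough illustration of the question layout described above, the sketch below models one item and renders its multiple-choice text. The `Question` class, its field names, the file paths, and the instruction wording are all hypothetical; the released data format may differ.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Question:
    """Hypothetical record for one benchmark item (field names are illustrative)."""
    category: str             # "TORD", "TLOC", "TRACK", or "MATE"
    video: str                # path to the assembly video
    prompt_images: List[str]  # one annotated image, or [image_a, image_b] for TRACK
    question: str
    options: List[str]
    answer: str               # gold option label, e.g. "B"


def expected_image_count(category: str) -> int:
    # Tracking questions use an Image A to Image B pair; all others use one image.
    return 2 if category == "TRACK" else 1


def validate(q: Question) -> None:
    """Raise ValueError if the visual prompt does not match the category."""
    want = expected_image_count(q.category)
    if len(q.prompt_images) != want:
        raise ValueError(f"{q.category} expects {want} prompt image(s)")


def format_prompt(q: Question) -> str:
    """Render the question and its lettered options as plain text."""
    labels = [chr(ord("A") + i) for i in range(len(q.options))]
    lines = [q.question]
    lines += [f"{label}. {text}" for label, text in zip(labels, q.options)]
    lines.append("Answer with the option label only.")
    return "\n".join(lines)


example = Question(
    category="MATE",
    video="videos/example.mp4",
    prompt_images=["prompts/example.png"],
    question="Are parts 1 and 2 connected in the fully assembled furniture?",
    options=["Yes", "No"],
    answer="A",
)
validate(example)
print(format_prompt(example))
```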
Evaluation
The evaluation compares keyframe videos, trimmed videos, and visual prompt construction variants such as separate images, collages, and concatenated prompt media.
Models answer with an option label. We report micro-average accuracy and per-category accuracy for TORD, TLOC, TRACK, and MATE.
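The scoring described above can be sketched as follows. This is a minimal illustration, not the official evaluation script: the function names, the record layout, the default of four options, and the lenient label parsing are assumptions. Micro-average accuracy here pools every question, so categories with more questions carry more weight.

```python
from collections import defaultdict
from typing import Dict, Iterable, Optional, Tuple


def extract_label(response: str, n_options: int = 4) -> Optional[str]:
    """Return the option letter from a short reply such as "B", "(b)" or "B.".

    Assumes the model replies with the label only; anything else yields None.
    """
    cleaned = response.strip().strip("().: ").upper()
    valid = {chr(ord("A") + i) for i in range(n_options)}
    return cleaned if cleaned in valid else None


def score(records: Iterable[Tuple[str, str, str]]) -> Tuple[float, Dict[str, float]]:
    """Compute (micro-average accuracy, per-category accuracy).

    Each record is (category, gold_label, model_response).
    """
    correct: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for category, gold, response in records:
        total[category] += 1
        if extract_label(response) == gold:
            correct[category] += 1
    n = sum(total.values())
    if n == 0:
        raise ValueError("no records to score")
    micro = sum(correct.values()) / n
    per_category = {c: correct[c] / total[c] for c in total}
    return micro, per_category


records = [
    ("TORD", "A", "A"),
    ("TORD", "B", "C"),
    ("TLOC", "C", "(C)"),
    ("MATE", "A", "b"),
    ("TRACK", "D", "D."),
]
micro, per_category = score(records)
```

An unparseable reply counts as wrong in this sketch, which keeps the denominator equal to the number of questions.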
Results
Headline numbers: micro-average accuracy for InternVL3-78B and for OpenAI GPT-5, each in the paper's main benchmark setting.
| Model | Prompt | Video | Micro | 95% CI | TORD | TLOC | TRACK | MATE |
|---|---|---|---|---|---|---|---|---|
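
The page does not say how the 95% CI column is computed. As one common option for a binomial accuracy, the sketch below uses the Wilson score interval; this choice of method and the name `wilson_interval` are assumptions, and the paper may use a different procedure such as bootstrapping.

```python
import math
from typing import Tuple


def wilson_interval(correct: int, total: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a proportion; z = 1.96 gives roughly 95% coverage."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= correct <= total:
        raise ValueError("correct must lie between 0 and total")
    p = correct / total
    denom = 1.0 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return center - half, center + half


# Example: 50 correct answers out of 100 questions.
low, high = wilson_interval(50, 100)
print(f"{low:.3f} to {high:.3f}")
```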
Citation
TBA