Cognition and State Modeling

Action Genome

Uses a spatio-temporal scene graph to infer object states and contact relations from short video clips.

Tool Introduction

Core parameters, trigger timing, and visual before/after demo references.

Short Explanation

Use Action Genome-style state modeling when the planner needs a concrete physical state instead of a static-frame guess.

Input: Short video clip + target object list
Output: State graph and temporal relations
Trigger Timing: On demand, after the required input files and configuration are prepared.
Runtime: Local GPU
Before: Short video clip + target object list

Prepare the short video clip, target object list, and configuration expected by the original project.

After: State graph and temporal relations

Read the produced state graph, temporal relations, and any accompanying visualization or prediction file.

Preset Example

A quick-run style example for the documentation page.

Input: tools/action-genome/examples/mock_video.mp4
Prompt: objects_of_interest: door, cup, person
Expected: Object state timelines and contact relations such as open, closed, empty, full, sitting, or standing.

Parameters And Output

Readable controls and the meaning of each returned artifact.

Parameter Explanation

temporal_video_buffer (file): tools/action-genome/examples/mock_video.mp4

Short clip or buffered frames used for temporal state inference.

objects_of_interest (text)

Object list whose states and relations should be tracked.

feature_tensor (path)

Optional precomputed features for the scene graph model.

bbox_tensor (path)

Optional detected object boxes aligned with the video frames.
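
The four parameters above can be gathered into one small container before a run. The sketch below is illustrative only: the class name `ActionGenomeInputs` and the helper `parse_objects` are hypothetical, and the comma-separated object format is an assumption based on the preset prompt, not a documented schema.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import List, Optional


@dataclass
class ActionGenomeInputs:
    """Hypothetical container mirroring the four documented parameters."""
    temporal_video_buffer: Path
    objects_of_interest: List[str]
    feature_tensor: Optional[Path] = None  # optional precomputed features
    bbox_tensor: Optional[Path] = None     # optional detected object boxes


def parse_objects(text: str) -> List[str]:
    """Split 'door, cup, person' into names, dropping blanks and duplicates.

    A leading 'objects_of_interest:' label is tolerated (assumed format).
    """
    _, _, tail = text.rpartition(":")
    names: List[str] = []
    for item in tail.split(","):
        name = item.strip()
        if name and name not in names:
            names.append(name)
    return names


inputs = ActionGenomeInputs(
    temporal_video_buffer=Path("tools/action-genome/examples/mock_video.mp4"),
    objects_of_interest=parse_objects("objects_of_interest: door, cup, person"),
)
```

Leaving `feature_tensor` and `bbox_tensor` as `None` reflects that both are described as optional.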

Output Explanation

object_states

Frame spans labeled with object states.

contact_relations

Temporal relations between objects and actors.

state_timeline

Ordered state transitions that downstream planners can audit.
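
One way to picture how `object_states` and `state_timeline` relate is to derive the ordered transitions from the frame spans. This is a minimal sketch under assumed field names (`obj`, `state`, `start_frame`, `end_frame`); the actual output schema is not specified here, and `build_state_timeline` is a hypothetical helper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StateSpan:
    """One assumed object_states entry: an object holds a state over a frame span."""
    obj: str
    state: str
    start_frame: int
    end_frame: int  # inclusive


def build_state_timeline(spans):
    """Sort spans by start frame and emit one record per change of state per object."""
    last_state = {}
    timeline = []
    for span in sorted(spans, key=lambda s: (s.start_frame, s.obj)):
        previous = last_state.get(span.obj)
        if previous != span.state:
            timeline.append(
                {
                    "frame": span.start_frame,
                    "object": span.obj,
                    "from": previous,  # None for the first observed state
                    "to": span.state,
                }
            )
            last_state[span.obj] = span.state
    return timeline


spans = [
    StateSpan("door", "closed", 0, 40),
    StateSpan("cup", "empty", 0, 90),
    StateSpan("door", "open", 41, 90),
]
timeline = build_state_timeline(spans)
```

Each record names the frame where the change starts, so a planner can audit a transition such as door closed to open against the source clip.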

How To Use

Official resources, deployment steps, academic context, citation, and source-reported benchmark numbers.

Deployment Notes

  1. Clone or download the official Action Genome resources.
  2. Prepare video clips, object boxes, and model features in the expected format.
  3. Run the state-graph wrapper with repository-relative video and object paths.
  4. Save state timelines and relation graphs under tools/action-genome/runs/.

Relative Path Example

python tools/action-genome/run.py --video tools/action-genome/examples/mock_video.mp4 --objects tools/action-genome/examples/objects.json --output tools/action-genome/runs/state_timeline.json
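
The same invocation can be assembled from Python, which avoids quoting mistakes when paths change. The sketch uses only the three flags shown above (`--video`, `--objects`, `--output`); the helper name `build_command` is hypothetical, and the command is built but not executed here.

```python
import shlex
from pathlib import PurePosixPath

TOOL_DIR = PurePosixPath("tools/action-genome")


def build_command(video, objects, output):
    """Return the argv list for the wrapper, using repository-relative paths."""
    return [
        "python",
        str(TOOL_DIR / "run.py"),
        "--video", str(video),
        "--objects", str(objects),
        "--output", str(output),
    ]


cmd = build_command(
    video=TOOL_DIR / "examples" / "mock_video.mp4",
    objects=TOOL_DIR / "examples" / "objects.json",
    output=TOOL_DIR / "runs" / "state_timeline.json",
)
print(shlex.join(cmd))
# To actually run it from the repository root:
#   import subprocess; subprocess.run(cmd, check=True)
```

Passing a list rather than a single string to `subprocess.run` means no shell is involved, so paths containing spaces need no extra escaping.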

Expected Result Shape

{
  "tool": "action-genome",
  "status": "ok",
  "scene_state": [
    {
      "label": "Spatio-temporal scene graph state modeling",
      "score": 0.87,
      "output": "State graph and temporal relations"
    }
  ],
  "timing": {
    "runtime": "The submitted wrapper is described as interactive; the official paper focuses on dataset/task metrics rather than wall-clock wrapper latency.",
    "device": "documented in source benchmark when available"
  },
  "artifacts": {
    "visualization": "tools/action-genome/runs/visualization.png",
    "raw_predictions": "tools/action-genome/runs/predictions.json"
  }
}
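
A downstream consumer might check the status and pull out the fields it needs before trusting the result. The following is a sketch that assumes the payload keeps the keys shown above; `summarize_result` is a hypothetical helper, not part of the wrapper.

```python
import json


def summarize_result(raw_json):
    """Parse a result payload; raise if the run did not report status 'ok'."""
    result = json.loads(raw_json)
    if result.get("status") != "ok":
        raise RuntimeError(f"action-genome run failed: status={result.get('status')!r}")
    # Keep (label, score) pairs so low-confidence states can be filtered later.
    states = [(entry["label"], entry["score"]) for entry in result.get("scene_state", [])]
    return {
        "tool": result.get("tool"),
        "states": states,
        "artifacts": result.get("artifacts", {}),
    }


example = """
{
  "tool": "action-genome",
  "status": "ok",
  "scene_state": [
    {"label": "Spatio-temporal scene graph state modeling", "score": 0.87,
     "output": "State graph and temporal relations"}
  ],
  "artifacts": {
    "visualization": "tools/action-genome/runs/visualization.png",
    "raw_predictions": "tools/action-genome/runs/predictions.json"
  }
}
"""
summary = summarize_result(example)
```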

Academic Info

Paper identity and contribution summary.

Title: Action Genome: Actions as Compositions of Spatio-temporal Scene Graphs
Authors: Jingwei Ji, Ranjay Krishna, Li Fei-Fei, Juan Carlos Niebles
Venue: CVPR 2020
Contribution: Represents actions as evolving object relations, reducing static-image guesses about physical state during manipulation tasks.

Citation

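@inproceedings{actiongenome2020,
  title={Action Genome: Actions as Compositions of Spatio-temporal Scene Graphs},
  author={Ji, Jingwei and Krishna, Ranjay and Fei-Fei, Li and Niebles, Juan Carlos},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2020},
  url={https://arxiv.org/abs/1912.06992}
}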

Benchmark

Only compact, source-reported numbers are shown here.

Dataset | Metric | Value | Runtime | Source
Action Genome / Charades | Dataset scale | 10K videos, 0.4M objects, 1.7M visual relationships | Official dataset annotation scale | Official CVPR 2020 paper
Action Genome / Charades few-shot action recognition | mAP with 10 examples | 42.7% | Few-shot action recognition experiment | Official CVPR 2020 paper
Action Genome detailed annotation | Frame/object/relation coverage | 234K video frames, 476K object bounding boxes, 1.72M relationships, 157 action categories | Official dataset statistics | Official CVPR 2020 paper

Artifacts

Official CVPR 2020 paper, project page, repository link, mock video input, feature tensor shape, bbox tensor shape, and state timeline output.

Demo Images

Visual references from the original tool.