Cognition and State Modeling

Action Genome

Uses a spatio-temporal scene graph to infer object states and contact relations from short video clips.

Tool Introduction

Core parameters, trigger timing, and visual before/after demo references.

Short Explanation

Use Action Genome-style state modeling when the planner needs a concrete physical state instead of a static-frame guess.

Input: Short video clip + target object list

Output: State graph and temporal relations

Trigger Timing: On demand, after the required input files and configuration are prepared.

Runtime: Local GPU

Before: Short video clip + target object list

Prepare the short video clip and the target object list in the format expected by the original project.

After: State graph and temporal relations

Read the produced state graph and temporal relations, along with any other documented artifact.

Preset Example

A quick-run style example for the documentation page.

Input: tools/action-genome/examples/mock_video.mp4

Prompt: objects_of_interest: door, cup, person

Expected: Object state timelines (such as open, closed, empty, full, sitting, or standing) and contact relations.
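
The prompt line in the preset could be turned into a clean object list before it reaches the tool. The following is a minimal sketch, assuming the prompt is a single `objects_of_interest: a, b, c` line; the helper name `parse_objects_of_interest` is hypothetical and is not part of the Action Genome repository.

```python
from __future__ import annotations


def parse_objects_of_interest(prompt: str) -> list[str]:
    """Parse an 'objects_of_interest: a, b, c' line into a list of names."""
    key, sep, value = prompt.partition(":")
    if not sep or key.strip() != "objects_of_interest":
        raise ValueError(f"expected 'objects_of_interest: ...', got {prompt!r}")
    # Trim whitespace, lowercase, and drop empty or repeated names while
    # keeping the order in which the names first appear.
    seen: dict[str, None] = {}
    for name in value.split(","):
        name = name.strip().lower()
        if name:
            seen.setdefault(name, None)
    return list(seen)


objects = parse_objects_of_interest("objects_of_interest: door, cup, person")
print(objects)  # ['door', 'cup', 'person']
```

Normalizing the list up front keeps later state and relation lookups keyed by one spelling per object.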

Parameters And Output

Readable controls and the meaning of each returned artifact.

Parameter Explanation

temporal_video_buffer (file): tools/action-genome/examples/mock_video.mp4

Short clip or buffered frames used for temporal state inference.

objects_of_interest (text)

Object list whose states and relations should be tracked.

feature_tensor (path)

Optional precomputed features for the scene graph model.

bbox_tensor (path)

Optional detected object boxes aligned with the video frames.
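
The four parameters above can be grouped into one validated request object. This is a sketch under stated assumptions: the class name `StateModelingRequest` and the rule that paths must be repository-relative are illustrative choices, not something the original project defines.

```python
from __future__ import annotations

from dataclasses import dataclass
from pathlib import PurePosixPath
from typing import Optional


@dataclass(frozen=True)
class StateModelingRequest:
    """Hypothetical container mirroring the documented parameters."""

    temporal_video_buffer: str            # short clip or buffered frames
    objects_of_interest: tuple[str, ...]  # objects whose states are tracked
    feature_tensor: Optional[str] = None  # optional precomputed features
    bbox_tensor: Optional[str] = None     # optional per-frame object boxes

    def __post_init__(self) -> None:
        if not self.objects_of_interest:
            raise ValueError("objects_of_interest must not be empty")
        # Assumption: every path is given relative to the repository root.
        for field_name in ("temporal_video_buffer", "feature_tensor", "bbox_tensor"):
            value = getattr(self, field_name)
            if value is not None and PurePosixPath(value).is_absolute():
                raise ValueError(f"{field_name} must be a relative path: {value}")


request = StateModelingRequest(
    temporal_video_buffer="tools/action-genome/examples/mock_video.mp4",
    objects_of_interest=("door", "cup", "person"),
)
print(request.feature_tensor)  # None, because the tensor inputs are optional
```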

Output Explanation

object_states

Frame spans labeled with object states.

contact_relations

Temporal relations between objects and actors.

state_timeline

Ordered state transitions that downstream planners can audit.
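
To illustrate how the three outputs relate, the sketch below derives an ordered `state_timeline` from `object_states` spans. The `StateSpan` record, its field names, and the `build_state_timeline` helper are assumptions made for this example; the actual output schema of the wrapper may differ.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class StateSpan:
    """Hypothetical record: one object holding one state over a frame span."""

    obj: str
    state: str
    start_frame: int
    end_frame: int  # inclusive


def build_state_timeline(spans: list[StateSpan]) -> list[tuple[int, str, str, str]]:
    """Return (frame, object, previous_state, new_state) transitions in frame order."""
    by_object: dict[str, list[StateSpan]] = {}
    for span in sorted(spans, key=lambda s: (s.obj, s.start_frame)):
        by_object.setdefault(span.obj, []).append(span)
    transitions = []
    for obj, obj_spans in by_object.items():
        # Compare each span with the next one for the same object.
        for prev, cur in zip(obj_spans, obj_spans[1:]):
            if prev.state != cur.state:
                transitions.append((cur.start_frame, obj, prev.state, cur.state))
    return sorted(transitions)


spans = [
    StateSpan("door", "closed", 0, 11),
    StateSpan("door", "open", 12, 30),
    StateSpan("cup", "full", 0, 19),
    StateSpan("cup", "empty", 20, 30),
]
for transition in build_state_timeline(spans):
    print(transition)
# (12, 'door', 'closed', 'open')
# (20, 'cup', 'full', 'empty')
```

Keeping transitions as plain tuples sorted by frame makes the timeline easy for a downstream planner to audit line by line.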

How To Use

Official resources, deployment steps, academic context, citation, and source-reported benchmark numbers.

Resources

GitHub: https://github.com/JingweiJ/ActionGenome
Code Download: https://github.com/JingweiJ/ActionGenome/archive/refs/heads/main.zip
Project Page: https://www.actiongenome.org/
Paper: https://arxiv.org/abs/1912.06992

Deployment Notes

Clone or download the official Action Genome resources.
Prepare video clips, object boxes, and model features in the expected format.
Run the state-graph wrapper with repository-relative video and object paths.
Save state timelines and relation graphs under tools/action-genome/runs/.

Relative Path Example

python tools/action-genome/run.py --video tools/action-genome/examples/mock_video.mp4 --objects tools/action-genome/examples/objects.json --output tools/action-genome/runs/state_timeline.json
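
When the wrapper is launched from another script, the same command can be assembled as an argument list instead of a shell string. The sketch below only reuses the script path and the three flags shown in the command above; the helper name `build_run_command` is hypothetical, and the wrapper itself is not executed here.

```python
import shlex


def build_run_command(video, objects, output, python="python"):
    """Assemble the wrapper invocation as an argv list (no shell quoting issues)."""
    return [
        python,
        "tools/action-genome/run.py",
        "--video", video,
        "--objects", objects,
        "--output", output,
    ]


cmd = build_run_command(
    video="tools/action-genome/examples/mock_video.mp4",
    objects="tools/action-genome/examples/objects.json",
    output="tools/action-genome/runs/state_timeline.json",
)
print(shlex.join(cmd))
# To actually run it from the repository root, one option is:
#   subprocess.run(cmd, check=True)
```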

Expected Result Shape

{
  "tool": "action-genome",
  "status": "ok",
  "scene_state": [
    {
      "label": "Spatio-temporal scene graph state modeling",
      "score": 0.87,
      "output": "State graph and temporal relations"
    }
  ],
  "timing": {
    "runtime": "The submitted wrapper is described as interactive; the official paper focuses on dataset/task metrics rather than wall-clock wrapper latency.",
    "device": "documented in source benchmark when available"
  },
  "artifacts": {
    "visualization": "tools/action-genome/runs/visualization.png",
    "raw_predictions": "tools/action-genome/runs/predictions.json"
  }
}
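
A caller may want to check a result document before trusting it. The following sketch validates the top-level keys shown in the shape above and extracts the label and score pairs; the function name `summarize_result` and the choice to raise on a non-"ok" status are assumptions for illustration.

```python
from __future__ import annotations

import json

REQUIRED_KEYS = ("tool", "status", "scene_state", "timing", "artifacts")


def summarize_result(raw: str) -> list[tuple[str, float]]:
    """Validate a result document and return (label, score) pairs."""
    result = json.loads(raw)
    missing = [key for key in REQUIRED_KEYS if key not in result]
    if missing:
        raise ValueError(f"result is missing keys: {missing}")
    if result["status"] != "ok":
        raise RuntimeError(f"tool reported status {result['status']!r}")
    return [(entry["label"], float(entry["score"])) for entry in result["scene_state"]]


example = json.dumps({
    "tool": "action-genome",
    "status": "ok",
    "scene_state": [
        {"label": "Spatio-temporal scene graph state modeling", "score": 0.87},
    ],
    "timing": {},
    "artifacts": {},
})
print(summarize_result(example))
# [('Spatio-temporal scene graph state modeling', 0.87)]
```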

Academic Info

Paper identity and contribution summary.

Title: Action Genome: Actions as Compositions of Spatio-temporal Scene Graphs

Authors: Jingwei Ji, Ranjay Krishna, Li Fei-Fei, Juan Carlos Niebles

Venue: CVPR 2020

Contribution: Represents actions as evolving object relations, reducing static-image guesses about physical state during manipulation tasks.

Citation

@misc{actiongenome2020,
  title={Action Genome: Actions as Compositions of Spatio-temporal Scene Graphs},
  author={Ji, Jingwei and Krishna, Ranjay and Fei-Fei, Li and Niebles, Juan Carlos},
  year={2020},
  note={CVPR 2020},
  url={https://arxiv.org/abs/1912.06992}
}

Benchmark

Only compact, source-reported numbers are shown here.

Dataset	Metric	Value	Context	Source
Action Genome / Charades	Dataset scale	10K videos, 0.4M objects, 1.7M visual relationships	Official dataset annotation scale	Official CVPR 2020 paper
Action Genome / Charades few-shot action recognition	mAP with 10 examples	42.7%	Few-shot action recognition experiment	Official CVPR 2020 paper
Action Genome detailed annotation	Frame/object/relation coverage	234K video frames, 476K object bounding boxes, 1.72M relationships, 157 action categories	Official dataset statistics	Official CVPR 2020 paper
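
As a rough sense of annotation density, the rounded counts in the last table row can be divided out. The sketch below uses only those rounded figures, so the ratios are approximate and are not numbers reported by the paper.

```python
# Rounded counts taken from the table row above.
frames = 234_000
boxes = 476_000
relationships = 1_720_000

boxes_per_frame = boxes / frames
relationships_per_frame = relationships / frames
relationships_per_box = relationships / boxes

print(f"boxes per frame:         {boxes_per_frame:.1f}")
print(f"relationships per frame: {relationships_per_frame:.1f}")
print(f"relationships per box:   {relationships_per_box:.1f}")
```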

Artifacts

Official CVPR 2020 paper, project page, repository link, mock video input, feature tensor shape, bbox tensor shape, and state timeline output.

Demo Images

Visual references from the original tool.