Perception and Grounding

Cutie

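Cutie is a video object segmentation framework built around an object-centric memory design. It targets more consistent and robust mask propagation than earlier XMem-style pipelines at a comparable or higher speed, and it supports both scripted runs and an interactive GUI workflow.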

Tool Introduction

Core parameters, trigger timing, and visual before/after demo references.

Short Explanation

Provide a video and an initial object mask. Cutie then propagates that mask through the later frames, producing one segmentation mask per frame.

Input: Video frames + initial mask

Output: Tracked object masks

Trigger Timing: Triggered on demand after the required input files and configuration are prepared.

Runtime: Python / PyTorch / interactive GUI

Before: Video frames + initial mask

Prepare the ordered video frames and the first-frame mask that marks the object to track.

After: Tracked object masks

Read the per-frame masks and overlay previews written to the output folder.

Preset Example

A quick-run style example for the documentation page.

Input: tools/cutie/examples/images/ + tools/cutie/examples/masks/00000.png

Prompt: Track the selected foreground object through the sequence

Expected: A folder of per-frame masks and overlay previews for the tracked object.

Parameters And Output

Readable controls and the meaning of each returned artifact.

Parameter Explanation

video (file)

Input video or ordered frame folder.

initial_mask (file)

First-frame mask that defines the object identity to propagate.

num_objects (number, default 1)

Number of object identities tracked in the interactive demo.

output_dir (path)

Destination for masks, overlays, and logs.
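
The four parameters above can be grouped into a small configuration object. The sketch below is illustrative only: `CutieRunConfig` and `interactive_demo_argv` are hypothetical names that are not part of the Cutie repository, and the command it builds uses only the `--video` and `--num_objects` flags shown elsewhere on this page.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import List


@dataclass(frozen=True)
class CutieRunConfig:
    """Hypothetical container for the documented parameters (not a Cutie API)."""

    video: Path            # input video or ordered frame folder
    initial_mask: Path     # first-frame mask that defines the object identity
    output_dir: Path       # destination for masks, overlays, and logs
    num_objects: int = 1   # number of object identities to track

    def __post_init__(self) -> None:
        if self.num_objects < 1:
            raise ValueError("num_objects must be at least 1")


def interactive_demo_argv(
    cfg: CutieRunConfig,
    script: str = "tools/cutie/interactive_demo.py",
) -> List[str]:
    """Build the argument list for the interactive GUI command on this page."""
    return [
        "python",
        script,
        "--video",
        cfg.video.as_posix(),
        "--num_objects",
        str(cfg.num_objects),
    ]
```

The returned list can be passed to `subprocess.run` once the official environment is installed. The `initial_mask` and `output_dir` fields are kept only for bookkeeping here, since this page does not document command-line flags for them.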

Output Explanation

mask

Per-frame segmentation mask for each tracked object.

object_id

Stable identity label assigned to the object across the sequence.

overlay

Preview image showing the mask on top of the video frame.
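
To make the relationship between the three outputs concrete, the following sketch derives object IDs from a mask and renders an overlay with NumPy. It assumes the mask is stored as an integer index map in which 0 is background and each positive value is one object ID; that storage format, and the helper names `object_ids` and `overlay`, are assumptions for illustration rather than Cutie's own code.

```python
import numpy as np


def object_ids(index_mask: np.ndarray) -> list:
    """Return the non-zero labels in an index mask (0 is assumed to be background)."""
    return [int(i) for i in np.unique(index_mask) if i != 0]


def overlay(
    frame: np.ndarray,
    index_mask: np.ndarray,
    object_id: int,
    color=(255, 0, 0),
    alpha: float = 0.5,
) -> np.ndarray:
    """Blend `color` over the pixels of an H x W x 3 frame that belong to `object_id`."""
    out = frame.astype(np.float32)          # astype returns a copy, so `frame` is untouched
    selected = index_mask == object_id      # H x W boolean map for this object
    tint = np.asarray(color, dtype=np.float32)
    out[selected] = (1.0 - alpha) * out[selected] + alpha * tint
    return out.round().astype(np.uint8)
```

Calling `overlay` once per ID returned by `object_ids` gives one preview per tracked object; pixels outside the selected object keep their original values.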

How To Use

Official resources, deployment steps, academic context, citation, and source-reported benchmark numbers.

Resources

GitHub: https://github.com/hkchengrex/Cutie
Code Download: https://github.com/hkchengrex/Cutie/archive/refs/heads/main.zip
Project Page: https://hkchengrex.github.io/Cutie/
arXiv: https://arxiv.org/abs/2310.12982
Model Weights: https://github.com/hkchengrex/Cutie#download-the-model
Interactive Demo: https://github.com/hkchengrex/Cutie#interactive-demo
Colab: https://colab.research.google.com/drive/1yo43XTbjxuWA7XgCUO9qxAi7wBI6HzvP?usp=sharing

Deployment Notes

Install the official Cutie environment and download pretrained weights with the repository script.
Prepare frames and first-frame masks using the example folder structure.
Use scripting_demo.py for reproducible examples or interactive_demo.py for manual annotation workflows.
Save the propagated masks under tools/cutie/examples or a dedicated runs folder.
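
Before running either demo, it can help to confirm that the frames and the first-frame mask line up. The check below is a minimal sketch under two assumptions: frame filenames are zero-padded so that lexicographic order equals temporal order, and the first-frame mask is a PNG sharing the first frame's stem (as in `masks/00000.png`). The function name `pair_first_frame` is hypothetical.

```python
from pathlib import Path
from typing import List, Tuple

FRAME_SUFFIXES = {".jpg", ".jpeg", ".png"}


def pair_first_frame(images_dir: Path, masks_dir: Path) -> Tuple[List[Path], Path]:
    """Return the ordered frames and the mask that matches the first frame."""
    frames = sorted(
        p for p in images_dir.iterdir() if p.suffix.lower() in FRAME_SUFFIXES
    )
    if not frames:
        raise FileNotFoundError(f"no frames found in {images_dir}")

    # Assumption: the mask reuses the first frame's stem with a .png extension.
    first_mask = masks_dir / f"{frames[0].stem}.png"
    if not first_mask.is_file():
        raise FileNotFoundError(f"expected first-frame mask at {first_mask}")
    return frames, first_mask
```

For the example layout on this page, the call would be `pair_first_frame(Path("tools/cutie/examples/images"), Path("tools/cutie/examples/masks"))`.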

Relative Path Example

# Relative-path local entry for the Cutie tool folder
python tools/cutie/scripting_demo.py

# Add/delete object workflow:
python tools/cutie/scripting_demo_add_del_objects.py

# Interactive GUI:
python tools/cutie/interactive_demo.py \
  --video tools/cutie/examples/example.mp4 \
  --num_objects 1

# Suggested repository layout:
# tools/cutie/README.md
# tools/cutie/scripting_demo.py
# tools/cutie/interactive_demo.py
# tools/cutie/examples/images/
# tools/cutie/examples/masks/

# This page documents the path. The static page does not execute Cutie.

Expected Result Shape

{
  "tool": "cutie",
  "status": "ok",
  "masks": [
    {
      "label": "Video object segmentation",
      "score": 0.87,
      "output": "Tracked object masks"
    }
  ],
  "timing": {
    "runtime": "Cutie-base reports 36.4 FPS on V100; Cutie-small with MOSE training reports 45.5 FPS. The paper states +8.7 J&F over XMem and +4.2 J&F over DeAOT on MOSE while being 3x faster than DeAOT.",
    "device": "documented in source benchmark when available"
  },
  "artifacts": {
    "visualization": "tools/cutie/runs/visualization.png",
    "raw_predictions": "tools/cutie/runs/predictions.json"
  }
}
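
A consumer of this result shape might validate it before reading the artifacts. The sketch below assumes the payload arrives as a JSON string with exactly the top-level keys shown above; `summarize_result` and the summary fields it returns are hypothetical and are not produced by Cutie itself.

```python
import json
from typing import Any, Dict

REQUIRED_KEYS = ("tool", "status", "masks", "timing", "artifacts")


def summarize_result(payload: str) -> Dict[str, Any]:
    """Parse a result payload and return a compact summary, or raise on problems."""
    result = json.loads(payload)

    missing = [key for key in REQUIRED_KEYS if key not in result]
    if missing:
        raise KeyError(f"result is missing keys: {missing}")
    if result["status"] != "ok":
        raise RuntimeError(f"run reported status {result['status']!r}")

    return {
        "tool": result["tool"],
        "num_masks": len(result["masks"]),
        # `default=None` covers a payload whose "masks" list is empty.
        "best_score": max((m["score"] for m in result["masks"]), default=None),
        "artifacts": sorted(result["artifacts"].values()),
    }
```

Failing early on a missing key or a non-"ok" status keeps downstream code from reading artifact paths that were never written.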

Academic Info

Paper identity and contribution summary.

Title: Putting the Object Back into Video Object Segmentation

Authors: Ho Kei Cheng, Seoung Wug Oh, Brian Price, Joon-Young Lee, Alexander Schwing

Venue: CVPR 2024 Highlight / arXiv:2310.12982

Contribution: Adds a stronger object-centric memory design for video object segmentation, improving temporal consistency and interactive control over previous XMem-style pipelines.

Citation

@misc{cutie2024,
  title={Putting the Object Back into Video Object Segmentation},
  author={Ho Kei Cheng and Seoung Wug Oh and Brian Price and Joon-Young Lee and Alexander Schwing},
  year={2024},
  note={CVPR 2024 Highlight / arXiv:2310.12982},
  url={https://arxiv.org/abs/2310.12982}
}

Benchmark

Only compact, source-reported numbers are shown here.

Dataset	Metric	Value	Runtime	Source
MOSE validation	J&F	68.3 for Cutie-base with MOSE training	36.4 FPS on V100	CVPR 2024 paper
DAVIS-2017 / YouTubeVOS-2019	J&F / G	DAVIS val 88.8, DAVIS test 85.3, YouTubeVOS G 86.5	Cutie-small 45.5 FPS	Cutie paper
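
For readers unfamiliar with the metric column: in the DAVIS convention, J is the region similarity (intersection over union between predicted and ground-truth masks), F is a boundary accuracy score, and J&F is their mean. The sketch below computes J for a single binary mask pair and averages it with a given F value. It is not the official evaluation toolkit, and the boundary measure F is deliberately left out.

```python
import numpy as np


def region_similarity(pred: np.ndarray, gt: np.ndarray) -> float:
    """Jaccard index (IoU) between two binary masks of the same shape."""
    pred = pred.astype(bool)
    gt = gt.astype(bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        # Both masks are empty; treated here as a perfect match.
        return 1.0
    return float(np.logical_and(pred, gt).sum() / union)


def j_and_f(j: float, f: float) -> float:
    """Mean of region similarity J and boundary accuracy F."""
    return (j + f) / 2.0
```

Benchmark figures such as those in the table are averages over all objects and frames of a dataset, so a single-pair value from this sketch is not directly comparable to them.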

Artifacts

Cutie paper, MOSE/DAVIS/YouTubeVOS tables, scripting demo, interactive GUI, pretrained model download script, example frames, and masks.
