Cognition and State Modeling

STM

Maintains a space-time feature memory so a VLM can recover target appearance under deformation, blur, or lighting changes.

Tool Introduction

Core parameters, trigger timing, and visual before/after demo references.

Short Explanation

Use STM when a target has changed appearance but earlier clean frames can still anchor the current prediction.

Input: Current frame features + memory bank query
Output: Memory readout, attention map, confidence score
Trigger Timing: Triggered on demand after the required input files and configuration are prepared.
Runtime: Local GPU
Before: Current frame features + memory bank query

Prepare the reference image, reference mask, query frame, and model weights expected by the STM backend.

After: Memory readout, attention map, confidence score

Read the produced memory readout, attention map, and confidence score, along with any saved mask or visualization.

Preset Example

A quick-run style example for the documentation page.

Input: tools/stm/examples/mock_query_img.png
Prompt: target_object_id: selected object; memory: reference image and mask
Expected: A predicted foreground region, bounding box, feature readout, attention weights, and confidence score.

Parameters And Output

Readable controls and the meaning of each returned artifact.

Parameter Explanation

ref_img (file): tools/stm/examples/mock_ref_img.png

Reference frame where the target is clearly observed.

ref_mask (file): tools/stm/examples/mock_ref_mask.png

Target mask associated with the reference frame.

query_img (file): tools/stm/examples/mock_query_img.png

Current frame to segment or verify.

weights (path): tools/stm/weights/STM_weights.pth

STM model weights used by the local backend.
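
Before running the tool it can help to confirm that all four inputs are in place. The following is a minimal sketch, not part of the wrapper itself: the helper name find_missing_inputs and the EXAMPLE_PARAMS dictionary are hypothetical, and the paths are simply the example values listed above.

```python
from pathlib import Path

# Example paths copied from the parameter list; adjust to your checkout.
EXAMPLE_PARAMS = {
    "ref_img": "tools/stm/examples/mock_ref_img.png",
    "ref_mask": "tools/stm/examples/mock_ref_mask.png",
    "query_img": "tools/stm/examples/mock_query_img.png",
    "weights": "tools/stm/weights/STM_weights.pth",
}


def find_missing_inputs(params, root="."):
    """Return the parameter names whose files do not exist under root.

    Hypothetical helper: it only checks that each path is a regular file
    and does not inspect image sizes or the weights format.
    """
    base = Path(root)
    return [name for name, rel in params.items() if not (base / rel).is_file()]
```

An empty list means every input was found; otherwise the returned names say which parameters still need a file.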

Output Explanation

retrieved_feature_map

Memory-conditioned feature map used for the current frame.

visual_attention_weights

Attention weights over stored space-time memory.

confidence_score

Confidence of the current target prediction.
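
To make the first two outputs more concrete, here is a small NumPy sketch of a soft-attention memory read in the spirit of the space-time memory idea: every query location compares its key against all stored memory keys, and the resulting weights blend the stored values. The function name, the flattened array shapes, and the use of a plain softmax are assumptions made for illustration; this is not the code of the official model, and it does not cover how confidence_score is derived.

```python
import numpy as np


def memory_read(query_key, memory_key, memory_value):
    """Soft-attention read over a flattened memory (illustrative sketch).

    query_key:    (Nq, Ck) keys for the current-frame locations
    memory_key:   (Nm, Ck) keys for all stored space-time locations
    memory_value: (Nm, Cv) values stored alongside the memory keys
    Returns (readout, weights) with shapes (Nq, Cv) and (Nq, Nm).
    """
    scores = query_key @ memory_key.T                # dot-product similarity
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)  # softmax over memory
    readout = weights @ memory_value                 # weighted sum of values
    return readout, weights
```

In this sketch, readout plays the role of retrieved_feature_map and weights plays the role of visual_attention_weights; each row of weights sums to 1 across the memory locations.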

How To Use

Official resources, deployment steps, academic context, citation, and source-reported benchmark numbers.

Deployment Notes

  1. Clone or download the official STM implementation.
  2. Install the PyTorch dependencies and place weights under tools/stm/weights/.
  3. Prepare reference images, masks, and query frames using repository-relative paths.
  4. Run the wrapper and save masks, boxes, and attention outputs under tools/stm/runs/.

Relative Path Example

python tools/stm/run.py \
  --ref-img tools/stm/examples/mock_ref_img.png \
  --ref-mask tools/stm/examples/mock_ref_mask.png \
  --query-img tools/stm/examples/mock_query_img.png \
  --weights tools/stm/weights/STM_weights.pth \
  --output tools/stm/runs/prediction.json
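
When the command is launched from another script, building the argument list in code avoids quoting mistakes. The sketch below assembles the same flags shown in the command above; the helper name build_stm_command is hypothetical, and the flags are taken from this page rather than verified against the wrapper.

```python
import shlex


def build_stm_command(ref_img, ref_mask, query_img, weights, output):
    """Return the argv list for the STM wrapper command shown on this page."""
    return [
        "python", "tools/stm/run.py",
        "--ref-img", ref_img,
        "--ref-mask", ref_mask,
        "--query-img", query_img,
        "--weights", weights,
        "--output", output,
    ]


cmd = build_stm_command(
    "tools/stm/examples/mock_ref_img.png",
    "tools/stm/examples/mock_ref_mask.png",
    "tools/stm/examples/mock_query_img.png",
    "tools/stm/weights/STM_weights.pth",
    "tools/stm/runs/prediction.json",
)
# shlex.join renders the list as a single shell-safe string for logging.
print(shlex.join(cmd))
```

The list can be passed directly to subprocess.run(cmd, check=True) from the repository root once the inputs and weights are in place.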

Expected Result Shape

{
  "tool": "stm",
  "status": "ok",
  "results": [
    {
      "label": "Space-time visual memory",
      "score": 0.87,
      "output": "Memory readout, attention map, confidence score"
    }
  ],
  "timing": {
    "runtime": "The official paper reports 0.16 s/frame on DAVIS-2016; the submitted spreadsheet separately described about 20 ms for the local wrapper.",
    "device": "documented in source benchmark when available"
  },
  "artifacts": {
    "visualization": "tools/stm/runs/visualization.png",
    "raw_predictions": "tools/stm/runs/predictions.json"
  }
}
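
A consumer of this JSON will usually want to check the top-level keys and pick out the highest-scoring result. The following sketch assumes the file follows the shape shown above; the function name summarize_result and the choice to raise ValueError on problems are illustrative, not part of the wrapper.

```python
import json

# Top-level keys taken from the result shape shown above.
REQUIRED_KEYS = ("tool", "status", "results", "timing", "artifacts")


def summarize_result(text):
    """Parse a result document and return (label, score) of the best result.

    Raises ValueError if a top-level key is missing, the status is not
    "ok", or the results list is empty.
    """
    data = json.loads(text)
    missing = [key for key in REQUIRED_KEYS if key not in data]
    if missing:
        raise ValueError(f"missing keys: {missing}")
    if data["status"] != "ok":
        raise ValueError(f"unexpected status: {data['status']!r}")
    if not data["results"]:
        raise ValueError("results list is empty")
    best = max(data["results"], key=lambda item: item["score"])
    return best["label"], best["score"]
```

For the example document above, this would return the single entry with label "Space-time visual memory" and score 0.87.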

Academic Info

Paper identity and contribution summary.

Title: Video Object Segmentation using Space-Time Memory Networks
Authors: Seoung Wug Oh, Joon-Young Lee, Ning Xu, Seon Joo Kim
Venue: ICCV 2019
Contribution: Reads and writes a space-time memory bank so target features from earlier frames can guide later-frame segmentation and tracking.

Citation

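@inproceedings{stm2019,
  title={Video Object Segmentation using Space-Time Memory Networks},
  author={Oh, Seoung Wug and Lee, Joon-Young and Xu, Ning and Kim, Seon Joo},
  booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
  year={2019},
  url={https://arxiv.org/abs/1904.00607}
}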

Benchmark

Only compact, source-reported numbers are shown here.

Dataset | Metric | Value | Runtime | Source
YouTube-VOS validation | Overall / seen J / seen F / unseen J / unseen F | 79.4 / 79.7 / 84.2 / 72.8 / 80.9 | Official STM evaluation | Official ICCV 2019 paper, Table 1
DAVIS-2016 validation | J / F | 88.7 / 89.9 with YouTube-VOS pretraining | 0.16 s/frame | Official ICCV 2019 paper, Table 2
DAVIS-2017 validation | J / F | 79.2 / 84.3 with YouTube-VOS pretraining | Official STM evaluation | Official ICCV 2019 paper, Table 3
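
As a quick sanity check on the YouTube-VOS row, the sketch below assumes the overall score is the unweighted mean of the four per-split values (seen J, seen F, unseen J, unseen F). That assumption is consistent with the numbers in the row; the helper name mean_score is illustrative.

```python
def mean_score(values):
    """Unweighted arithmetic mean of a sequence of scores."""
    values = list(values)
    return sum(values) / len(values)


# Per-split values copied from the YouTube-VOS validation row.
ytvos = {"seen J": 79.7, "seen F": 84.2, "unseen J": 72.8, "unseen F": 80.9}
overall = round(mean_score(ytvos.values()), 1)
print(overall)  # 79.4, matching the reported overall score
```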

Artifacts

Official ICCV 2019 paper, repository link, weights path, mock input JSON, and local prediction output from the submitted spreadsheet.

Demo Images

Visual references from the original tool.