Cognition and State Modeling

STM

Maintains a space-time feature memory so a VLM can recover target appearance under deformation, blur, or lighting changes.

Tool Introduction

Core parameters, trigger timing, and visual before/after demo references.

Short Explanation

Use STM when a target has changed appearance but earlier clean frames can still anchor the current prediction.

Input: Current frame features + memory bank query

Output: Memory readout, attention map, confidence score

Trigger Timing: Triggered on demand after the required input files and configuration are prepared.

Runtime: Local GPU

Before: Current frame features + memory bank query

Prepare the reference image, its target mask, the query frame, and the model weights listed under Parameter Explanation.

After: Memory readout, attention map, confidence score

Read the produced feature readout, attention map, and confidence score, along with any saved mask or bounding-box visualization.

Preset Example

A quick-run style example for the documentation page.

Input: tools/stm/examples/mock_query_img.png

Prompt: target_object_id: selected object; memory: reference image and mask

Expected: A predicted foreground region, bounding box, feature readout, attention weights, and confidence score.
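
The bounding box in the expected output can be derived from the predicted foreground region. A minimal sketch, assuming the mask is a 2D grid of 0/1 values and the box uses inclusive (x_min, y_min, x_max, y_max) pixel coordinates. The function name `mask_to_bbox` is hypothetical and is not part of the STM repository or the local wrapper.

```python
def mask_to_bbox(mask):
    """Return (x_min, y_min, x_max, y_max) of the nonzero pixels, or None for an empty mask."""
    # Row indices that contain at least one foreground pixel.
    rows = [y for y, row in enumerate(mask) if any(row)]
    if not rows:
        return None
    # Column indices of every foreground pixel across all rows.
    cols = [x for row in mask for x, value in enumerate(row) if value]
    return min(cols), rows[0], max(cols), rows[-1]


# Tiny illustrative mask (not real STM output).
example_mask = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 0, 1, 0],
]
bbox = mask_to_bbox(example_mask)
```

Returning `None` for an empty mask lets the caller distinguish "target not found" from a degenerate one-pixel box.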

Parameters And Output

Readable controls and the meaning of each returned artifact.

Parameter Explanation

ref_img (file): tools/stm/examples/mock_ref_img.png

Reference frame where the target is clearly observed.

ref_mask (file): tools/stm/examples/mock_ref_mask.png

Target mask associated with the reference frame.

query_img (file): tools/stm/examples/mock_query_img.png

Current frame to segment or verify.

weights (path): tools/stm/weights/STM_weights.pth

STM model weights used by the local backend.

Output Explanation

retrieved_feature_map

Memory-conditioned feature map used for the current frame.

visual_attention_weights

Attention weights over stored space-time memory.

confidence_score

Confidence of the current target prediction.
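
The relationship between the attention weights and the retrieved feature can be illustrated with a small memory-read sketch. It follows the space-time memory read as formulated in the STM paper (dot-product similarity between the query key and every memory key, softmax over memory locations, weighted sum of memory values, concatenated with the query value), but it is only an illustration for a single query location using plain Python lists. The function name `memory_read` and the toy numbers are assumptions; the actual model operates on PyTorch tensors, and how the local wrapper derives `confidence_score` is not documented here, so it is left out.

```python
import math


def memory_read(query_key, query_value, memory_keys, memory_values):
    """Read from memory for one query location.

    Returns (feature, weights): the query value concatenated with the
    attention-weighted memory value, and the attention weights themselves.
    """
    # Dot-product similarity between the query key and each memory key.
    scores = [sum(q * k for q, k in zip(query_key, key)) for key in memory_keys]
    # Numerically stable softmax over all memory locations.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Weighted sum of memory values, one output entry per value channel.
    channels = len(memory_values[0])
    readout = [
        sum(w * value[c] for w, value in zip(weights, memory_values))
        for c in range(channels)
    ]
    return list(query_value) + readout, weights


# Two memory locations; the query key closely matches the first one.
feature, weights = memory_read(
    query_key=[5.0, 0.0],
    query_value=[1.0],
    memory_keys=[[1.0, 0.0], [0.0, 1.0]],
    memory_values=[[10.0], [20.0]],
)
```

In this toy case nearly all attention falls on the first memory location, so the retrieved part of `feature` stays close to that location's stored value.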

How To Use

Official resources, deployment steps, academic context, citation, and source-reported benchmark numbers.

Resources

GitHub: https://github.com/seoungwugoh/STM
Code Download: https://github.com/seoungwugoh/STM/archive/refs/heads/main.zip
Project Page: https://seoungwugoh.github.io/STM/
Paper: https://arxiv.org/abs/1904.00607

Deployment Notes

Clone or download the official STM implementation.
Install the PyTorch dependencies and place weights under tools/stm/weights/.
Prepare reference images, masks, and query frames using repository-relative paths.
Run the wrapper and save masks, boxes, and attention outputs under tools/stm/runs/.

Relative Path Example

python tools/stm/run.py --ref-img tools/stm/examples/mock_ref_img.png --ref-mask tools/stm/examples/mock_ref_mask.png --query-img tools/stm/examples/mock_query_img.png --weights tools/stm/weights/STM_weights.pth --output tools/stm/runs/prediction.json
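
The same invocation can be driven from Python. The sketch below only assembles the flags shown in the command above and, optionally, runs it and loads the JSON written to `--output`. The helper names `build_stm_command` and `run_stm` are hypothetical, and the assumption that the wrapper writes a JSON file at the `--output` path is taken from the example path, not from verified wrapper behavior.

```python
import json
import subprocess
from pathlib import Path


def build_stm_command(ref_img, ref_mask, query_img, weights, output):
    """Assemble the argument list for the wrapper command shown above."""
    return [
        "python", "tools/stm/run.py",
        "--ref-img", str(ref_img),
        "--ref-mask", str(ref_mask),
        "--query-img", str(query_img),
        "--weights", str(weights),
        "--output", str(output),
    ]


def run_stm(command, output):
    """Run the wrapper and load its JSON output; raises if the process fails."""
    subprocess.run(command, check=True)
    return json.loads(Path(output).read_text(encoding="utf-8"))


command = build_stm_command(
    ref_img="tools/stm/examples/mock_ref_img.png",
    ref_mask="tools/stm/examples/mock_ref_mask.png",
    query_img="tools/stm/examples/mock_query_img.png",
    weights="tools/stm/weights/STM_weights.pth",
    output="tools/stm/runs/prediction.json",
)
# To execute: result = run_stm(command, "tools/stm/runs/prediction.json")
```

Passing the arguments as a list avoids shell quoting problems when paths contain spaces.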

Expected Result Shape

{
  "tool": "stm",
  "status": "ok",
  "results": [
    {
      "label": "Space-time visual memory",
      "score": 0.87,
      "output": "Memory readout, attention map, confidence score"
    }
  ],
  "timing": {
    "runtime": "The official paper reports 0.16 s/frame on DAVIS-2016; the submitted spreadsheet separately described about 20 ms for the local wrapper.",
    "device": "documented in source benchmark when available"
  },
  "artifacts": {
    "visualization": "tools/stm/runs/visualization.png",
    "raw_predictions": "tools/stm/runs/prediction.json"
  }
}
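
A consumer of this result can check the status and pick the highest-scoring entry before using any artifact. A minimal sketch, assuming the field names shown in the result shape above; the function name `summarize_result` and the error handling policy are illustrative choices, not part of the wrapper.

```python
import json


def summarize_result(payload):
    """Return (label, score) of the best result, raising if the run did not succeed."""
    if payload.get("status") != "ok":
        raise ValueError(f"STM run failed with status {payload.get('status')!r}")
    results = payload.get("results", [])
    if not results:
        raise ValueError("STM run returned no results")
    # Pick the entry with the highest score.
    best = max(results, key=lambda item: item["score"])
    return best["label"], best["score"]


# Trimmed copy of the result shape above, used as sample input.
sample = json.loads("""
{
  "tool": "stm",
  "status": "ok",
  "results": [
    {"label": "Space-time visual memory", "score": 0.87}
  ]
}
""")
label, score = summarize_result(sample)
```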

Academic Info

Paper identity and contribution summary.

Title: Video Object Segmentation using Space-Time Memory Networks

Authors: Seoung Wug Oh, Joon-Young Lee, Ning Xu, Seon Joo Kim

Venue: ICCV 2019

Contribution: Reads and writes a space-time memory bank so target features from earlier frames can guide later-frame segmentation and tracking.

Citation

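@inproceedings{stm2019,
  title={Video Object Segmentation using Space-Time Memory Networks},
  author={Oh, Seoung Wug and Lee, Joon-Young and Xu, Ning and Kim, Seon Joo},
  booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
  year={2019},
  url={https://arxiv.org/abs/1904.00607}
}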

Benchmark

Only compact, source-reported numbers are shown here.

Dataset	Metric	Value	Runtime	Source
YouTube-VOS validation	Overall / seen J / seen F / unseen J / unseen F	79.4 / 79.7 / 84.2 / 72.8 / 80.9	Official STM evaluation	Official ICCV 2019 paper, Table 1
DAVIS-2016 validation	J / F	88.7 / 89.9 with YouTube-VOS pretraining	0.16 s/frame	Official ICCV 2019 paper, Table 2
DAVIS-2017 validation	J / F	79.2 / 84.3 with YouTube-VOS pretraining	Official STM evaluation	Official ICCV 2019 paper, Table 3
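
A few derived figures follow directly from the rows above. The sketch below recomputes the YouTube-VOS overall score as the plain average of its four components (the benchmark's usual definition), the mean of J and F for the two DAVIS rows, and the throughput implied by 0.16 s/frame. The derived J and F means are arithmetic on the table values only, not additional source-reported numbers.

```python
def mean(values):
    """Arithmetic mean of a non-empty sequence of numbers."""
    return sum(values) / len(values)


# YouTube-VOS: overall is the average of seen J, seen F, unseen J, unseen F.
youtube_vos_overall = mean([79.7, 84.2, 72.8, 80.9])

# DAVIS: mean of region similarity (J) and contour accuracy (F).
davis16_jf = mean([88.7, 89.9])
davis17_jf = mean([79.2, 84.3])

# Throughput implied by the reported per-frame runtime on DAVIS-2016.
frames_per_second = 1 / 0.16
```

The recomputed YouTube-VOS average comes out at approximately 79.4, matching the Overall value in the table, and 0.16 s/frame corresponds to 6.25 frames per second.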

Artifacts

Official ICCV 2019 paper, repository link, weights path, mock input JSON, and local prediction output from the submitted spreadsheet.

Demo Images

Visual references from the original tool.