Short Explanation
Use STM when a target has changed appearance but earlier clean frames can still anchor the current prediction.
STM maintains a space-time feature memory so that a VLM can recover the target's appearance under deformation, blur, or lighting changes.
Core parameters, trigger timing, and visual before/after demo references.
Prepare the inputs the original project expects: a reference frame in which the target is clearly visible, the target mask for that frame, and the current query frame.
Read the documented outputs: the retrieved feature map, the attention weights over memory, and the confidence score, along with any saved visualization.
A quick-run style example for the documentation page.
Readable controls and the meaning of each returned artifact.
| Input | Type | Example value | Description |
|---|---|---|---|
| ref_img | file | tools/stm/examples/mock_ref_img.png | Reference frame where the target is clearly observed. |
| ref_mask | file | tools/stm/examples/mock_ref_mask.png | Target mask associated with the reference frame. |
| query_img | file | tools/stm/examples/mock_query_img.png | Current frame to segment or verify. |
| weights | path | tools/stm/weights/STM_weights.pth | STM model weights used by the local backend. |
| Output | Description |
|---|---|
| retrieved_feature_map | Memory-conditioned feature map used for the current frame. |
| visual_attention_weights | Attention weights over stored space-time memory. |
| confidence_score | Confidence of the current target prediction. |
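As a rough illustration of how a memory-conditioned readout and its attention weights relate, the sketch below performs a soft attention read for a single query location over a toy memory. It is a minimal pure-Python sketch, not the tool's implementation: the function names `softmax` and `memory_read` and the toy keys and values are hypothetical, and mapping `readout` to `retrieved_feature_map` and `weights` to `visual_attention_weights` is an assumption. The `confidence_score` output is not modeled here.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def memory_read(query_key, memory_keys, memory_values):
    """Soft attention read for one query location (illustrative only).

    query_key:     key vector of the query location
    memory_keys:   one key vector per stored memory location
    memory_values: one value vector per stored memory location
    Returns (readout, weights): the weighted sum of memory values and
    the attention weights that produced it.
    """
    # Dot-product similarity between the query key and every memory key.
    scores = [sum(q * k for q, k in zip(query_key, key)) for key in memory_keys]
    # Normalize similarities into weights that sum to 1 over the memory.
    weights = softmax(scores)
    dim = len(memory_values[0])
    # Weighted sum of memory values, one output channel at a time.
    readout = [
        sum(w * value[d] for w, value in zip(weights, memory_values))
        for d in range(dim)
    ]
    return readout, weights


# Toy memory with three stored locations (hypothetical numbers).
memory_keys = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
memory_values = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
query_key = [5.0, 0.0]  # most similar to the first memory key

readout, weights = memory_read(query_key, memory_keys, memory_values)
print("attention weights:", [round(w, 3) for w in weights])
print("readout:", [round(r, 3) for r in readout])
```

Because the toy values are one-hot, the readout equals the attention weights, which makes it easy to see that the first memory location dominates. In a real model the same read would run for every query location over every stored frame.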
Official resources, deployment steps, academic context, citation, and source-reported benchmark numbers.
python tools/stm/run.py --ref-img tools/stm/examples/mock_ref_img.png --ref-mask tools/stm/examples/mock_ref_mask.png --query-img tools/stm/examples/mock_query_img.png --weights tools/stm/weights/STM_weights.pth --output tools/stm/runs/prediction.json
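The same invocation can be assembled from Python, which helps when the paths change per frame. The sketch below only builds the argument list and summarizes a report of the shape shown in the sample output on this page; it does not run the tool. The helper names `build_stm_command` and `summarize_report` are hypothetical, and the flags are taken verbatim from the command above.

```python
import json
import shlex


def build_stm_command(ref_img, ref_mask, query_img, weights, output):
    """Return the argument list for the quick-run command on this page."""
    return [
        "python", "tools/stm/run.py",
        "--ref-img", ref_img,
        "--ref-mask", ref_mask,
        "--query-img", query_img,
        "--weights", weights,
        "--output", output,
    ]


def summarize_report(report):
    """Pick the highest-scoring result from a report dict (assumed shape)."""
    if report.get("status") != "ok":
        raise RuntimeError(f"STM run did not succeed: {report.get('status')!r}")
    best = max(report["results"], key=lambda item: item["score"])
    return {
        "label": best["label"],
        "score": best["score"],
        "visualization": report["artifacts"]["visualization"],
    }


cmd = build_stm_command(
    ref_img="tools/stm/examples/mock_ref_img.png",
    ref_mask="tools/stm/examples/mock_ref_mask.png",
    query_img="tools/stm/examples/mock_query_img.png",
    weights="tools/stm/weights/STM_weights.pth",
    output="tools/stm/runs/prediction.json",
)
print(shlex.join(cmd))

# A trimmed copy of the sample report from this page, used as test data.
sample_report = json.loads("""
{
  "tool": "stm",
  "status": "ok",
  "results": [{"label": "Space-time visual memory", "score": 0.87}],
  "artifacts": {"visualization": "tools/stm/runs/visualization.png"}
}
""")
summary = summarize_report(sample_report)
print(summary)
```

To execute the command, the list can be passed to `subprocess.run(cmd, check=True)` from the repository root, and the file given to `--output` can then be loaded with `json.load` and handed to `summarize_report`.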
{
"tool": "stm",
"status": "ok",
"results": [
{
"label": "Space-time visual memory",
"score": 0.87,
"output": "Memory readout, attention map, confidence score"
}
],
"timing": {
"runtime": "The official paper reports 0.16 s/frame on DAVIS-2016; the submitted spreadsheet separately described about 20 ms for the local wrapper.",
"device": "documented in source benchmark when available"
},
"artifacts": {
"visualization": "tools/stm/runs/visualization.png",
"raw_predictions": "tools/stm/runs/predictions.json"
}
}
Paper identity and contribution summary.
@misc{stm2019,
title={Video Object Segmentation using Space-Time Memory Networks},
author={Oh, Seoung Wug and Lee, Joon-Young and Xu, Ning and Kim, Seon Joo},
year={2019},
note={ICCV 2019},
url={https://arxiv.org/abs/1904.00607}
}
Only compact, source-reported numbers are shown here.
| Dataset | Metric | Value | Runtime / evaluation note | Source |
|---|---|---|---|---|
| YouTube-VOS validation | Overall / seen J / seen F / unseen J / unseen F | 79.4 / 79.7 / 84.2 / 72.8 / 80.9 | Official STM evaluation | Official ICCV 2019 paper, Table 1 |
| DAVIS-2016 validation | J / F | 88.7 / 89.9 with YouTube-VOS pretraining | 0.16 s/frame | Official ICCV 2019 paper, Table 2 |
| DAVIS-2017 validation | J / F | 79.2 / 84.3 with YouTube-VOS pretraining | Official STM evaluation | Official ICCV 2019 paper, Table 3 |
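For a quick comparison across rows, the J and F values above can be combined into a single number, and the per-frame runtime can be converted to frames per second. The sketch below assumes the common DAVIS convention that the combined J&F score is the arithmetic mean of J and F; the derived numbers are computed here from the table and are not additional source-reported results.

```python
def jf_mean(j, f):
    """Mean of region similarity J and contour accuracy F (assumed convention)."""
    return (j + f) / 2


# J and F values copied from the DAVIS rows of the table above.
davis_scores = {
    "DAVIS-2016 validation": (88.7, 89.9),
    "DAVIS-2017 validation": (79.2, 84.3),
}

jf_by_dataset = {
    name: round(jf_mean(j, f), 2) for name, (j, f) in davis_scores.items()
}

# Runtime reported for DAVIS-2016 in the table, converted to throughput.
seconds_per_frame = 0.16
frames_per_second = 1 / seconds_per_frame

for name, value in jf_by_dataset.items():
    print(f"{name}: J&F mean = {value}")
print(f"Throughput at {seconds_per_frame} s/frame: {frames_per_second:.2f} fps")
```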
Official ICCV 2019 paper, repository link, weights path, mock input JSON, and local prediction output from the submitted spreadsheet.
Visual references from the original tool.