Short Explanation
Upload an image, provide the object vocabulary you want to find, and YOLO-World returns labeled bounding boxes with confidence scores.
YOLO-World is a real-time open-vocabulary object detector that uses image inputs and text prompts to localize arbitrary object categories.
Core parameters, trigger timing, and visual before/after demo references.
Prepare the input image and the comma-separated prompt vocabulary expected by the original project.
Read the produced visualization and the predicted boxes, labels, and scores.
A quick-run style example for the documentation page.
Readable controls and the meaning of each returned artifact.

| Parameter | Type | Default | Description |
|---|---|---|---|
| image | file | | The RGB image that will be scanned for the requested object names. |
| prompts | text | person,bus,car | Comma-separated vocabulary. The detector only reports objects matching this user vocabulary. |
| threshold | slider | 0.05 | Minimum confidence score retained in the visualization. Raising it removes weak detections. |
| topk | number | 100 | Maximum number of boxes kept before visualization or export. |
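
The `prompts` control takes a comma-separated string, so a caller usually has to turn it into a clean list of class names first. The sketch below is one way to do that; `parse_prompts` is a hypothetical helper, and the whitespace trimming and case-insensitive de-duplication are assumptions made for illustration, not documented behavior of the tool.

```python
from typing import List


def parse_prompts(raw: str) -> List[str]:
    """Split a comma-separated vocabulary into clean, unique class names.

    Assumption: empty entries are dropped and duplicates are removed
    case-insensitively, keeping the first spelling seen.
    """
    seen = set()
    names = []
    for part in raw.split(","):
        name = part.strip()
        key = name.lower()
        if name and key not in seen:
            seen.add(key)
            names.append(name)
    return names


if __name__ == "__main__":
    # The default vocabulary from the parameter list.
    print(parse_prompts("person,bus,car"))
```

Keeping the vocabulary short and specific tends to make the returned labels easier to read, since only names in this list can be reported.
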

| Field | Description |
|---|---|
| bbox | The predicted box coordinates around each detected object. |
| label | The matched text category from the prompt vocabulary. |
| score | Detection confidence; higher values indicate stronger text-image matching. |
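
To make the effect of `threshold` and `topk` on the returned `bbox`, `label`, and `score` fields concrete, here is a minimal sketch of the post-filtering step. `filter_detections` is a hypothetical helper, the sample detections are invented values, and the `[x1, y1, x2, y2]` box layout and the inclusive `>=` comparison are assumptions; the tool's own implementation may differ.

```python
from typing import Dict, List


def filter_detections(detections: List[Dict], threshold: float = 0.05,
                      topk: int = 100) -> List[Dict]:
    """Drop detections below `threshold`, then keep the `topk` highest scores."""
    kept = [d for d in detections if d["score"] >= threshold]
    kept.sort(key=lambda d: d["score"], reverse=True)
    return kept[:topk]


# Invented example detections; bbox is assumed to be [x1, y1, x2, y2] in pixels.
sample = [
    {"bbox": [5.0, 60.0, 600.0, 420.0], "label": "bus", "score": 0.87},
    {"bbox": [12.0, 40.0, 210.0, 380.0], "label": "person", "score": 0.91},
    {"bbox": [300.0, 200.0, 340.0, 240.0], "label": "car", "score": 0.03},
]

if __name__ == "__main__":
    for det in filter_detections(sample, threshold=0.05, topk=2):
        print(det["label"], det["score"], det["bbox"])
```

With the defaults from the parameter list, the low-scoring `car` box is removed by the threshold before the `topk` cap is applied.
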
Official resources, deployment steps, academic context, citation, and source-reported benchmark numbers.
# Relative-path local entry for the YOLO-World tool folder
python tools/yolo-world/demo/image_demo.py \
  tools/yolo-world/demo/sample_images/bus.jpg \
  tools/yolo-world/configs/pretrain/yolo_world_v2_xl.py \
  tools/yolo-world/weights/yolo_world_v2_xl_obj365v1_goldg_pretrain-5daf1395.pth \
  "person,bus,car" \
  --topk 100 \
  --threshold 0.05 \
  --output-dir tools/yolo-world/runs/detect
# Use this as a documentation path. The static page does not execute the model.
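
When the command above is scripted rather than typed, assembling it as an argument list avoids shell-quoting problems with the prompt string. The sketch below only builds and prints the command and does not run the model. `build_demo_command` is a hypothetical helper, and the positional argument order is copied from the command above, so it is worth confirming against the script's `--help` output before relying on it.

```python
import shlex
from typing import List

TOOL_ROOT = "tools/yolo-world"  # relative tool folder used in the command above


def build_demo_command(image: str, prompts: List[str], topk: int = 100,
                       threshold: float = 0.05,
                       output_dir: str = f"{TOOL_ROOT}/runs/detect") -> List[str]:
    """Return the demo invocation as an argv list (order mirrors the command above)."""
    return [
        "python",
        f"{TOOL_ROOT}/demo/image_demo.py",
        image,
        f"{TOOL_ROOT}/configs/pretrain/yolo_world_v2_xl.py",
        f"{TOOL_ROOT}/weights/yolo_world_v2_xl_obj365v1_goldg_pretrain-5daf1395.pth",
        ",".join(prompts),
        "--topk", str(topk),
        "--threshold", str(threshold),
        "--output-dir", output_dir,
    ]


cmd = build_demo_command(f"{TOOL_ROOT}/demo/sample_images/bus.jpg",
                         ["person", "bus", "car"])

if __name__ == "__main__":
    # Print a shell-safe rendering; pass `cmd` to subprocess.run to execute it.
    print(shlex.join(cmd))
```
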
{
  "tool": "yolo-world",
  "status": "ok",
  "results": [
    {
      "label": "Open-vocabulary detection",
      "score": 0.87,
      "output": "Bounding boxes, labels, scores"
    }
  ],
  "timing": {
    "runtime": "52.0 FPS on one NVIDIA V100 without TensorRT for the re-parameterized YOLO-World-L; the original non-re-parameterized version is reported at 17.6 FPS.",
    "device": "documented in source benchmark when available"
  },
  "artifacts": {
    "visualization": "tools/yolo-world/runs/visualization.png",
    "raw_predictions": "tools/yolo-world/runs/predictions.json"
  }
}
Paper identity and contribution summary.
@misc{yoloworld2024,
  title={YOLO-World: Real-Time Open-Vocabulary Object Detection},
  author={Tianheng Cheng and Lin Song and Yixiao Ge and Wenyu Liu and Xinggang Wang and Ying Shan},
  year={2024},
  note={CVPR 2024 / arXiv:2401.17270},
  url={https://arxiv.org/abs/2401.17270}
}
Only compact, source-reported numbers are shown here.
| Dataset | Metric | Value | Runtime | Source |
|---|---|---|---|---|
| LVIS minival zero-shot | Fixed AP / AP_r / AP_c / AP_f | 35.4 / 27.6 / 34.1 / 38.0 | 52.0 FPS on V100 | CVPR 2024 paper |
| COCO val2017 fine-tuning | Box AP | 44.9 AP for YOLO-World-L 640 | YOLO-style real-time inference | Official repository model card |
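
The throughput figures quoted on this page (52.0 FPS for the re-parameterized YOLO-World-L and 17.6 FPS for the original version) can be easier to compare as per-frame latency. The sketch below is plain arithmetic on those two reported numbers; `latency_ms` is a hypothetical helper and no new measurement is implied.

```python
def latency_ms(fps: float) -> float:
    """Convert frames per second into average milliseconds per frame."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return 1000.0 / fps


REPARAM_FPS = 52.0   # reported for the re-parameterized YOLO-World-L
ORIGINAL_FPS = 17.6  # reported for the original, non-re-parameterized version

if __name__ == "__main__":
    print(f"{latency_ms(REPARAM_FPS):.1f} ms vs {latency_ms(ORIGINAL_FPS):.1f} ms "
          f"per frame, {REPARAM_FPS / ORIGINAL_FPS:.2f}x speedup")
    # prints: 19.2 ms vs 56.8 ms per frame, 2.95x speedup
```
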
Official paper, speed-accuracy figure, LVIS/COCO evaluation tables, weights, configs, demo scripts, and ONNX export notes.
Visual references from the original tool.