Perception and Grounding

YOLO-World

YOLO-World is a real-time open-vocabulary object detector that uses image inputs and text prompts to localize arbitrary object categories.

Tool Introduction

Core parameters, trigger timing, and visual before/after demo references.

Short Explanation

Upload an image, provide the object vocabulary you want to find, and YOLO-World returns labeled bounding boxes with confidence scores.

Input: Image + text prompts

Output: Bounding boxes, labels, scores

Trigger Timing: Triggered on demand from the source demo or local example command.

Runtime: Python / ONNX / demo

Before: Image + text prompts

Prepare the input image and the comma-separated text prompts expected by the original project.

After: Bounding boxes, labels, scores

Read the annotated visualization and the predicted boxes, labels, and scores.

Preset Example

A quick-run style example for the documentation page.

Input: tools/yolo-world/demo/sample_images/bus.jpg

Prompt: person,bus,car

Expected: An annotated image plus JSON-style detections containing boxes, labels, and scores.

Parameters And Output

Readable controls and the meaning of each returned artifact.

Parameter Explanation

image (file)

The RGB image that will be scanned for the requested object names.

prompts (text; preset: person,bus,car)

Comma-separated vocabulary. The detector only reports objects matching this user vocabulary.
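As a minimal sketch of how a comma-separated vocabulary might be normalized before it is handed to the detector, the helper below trims whitespace, drops empty entries, and removes case-insensitive duplicates. The function name `parse_prompts` is hypothetical and is not part of the YOLO-World code base; the real demo may treat the string differently.

```python
from typing import List


def parse_prompts(raw: str) -> List[str]:
    """Split a comma-separated vocabulary into clean, unique class names.

    Hypothetical helper for illustration only.
    """
    seen = set()
    names = []
    for part in raw.split(","):
        name = part.strip()
        key = name.lower()
        # Skip empty entries (e.g. from ",,") and case-insensitive repeats.
        if name and key not in seen:
            seen.add(key)
            names.append(name)
    return names


vocabulary = parse_prompts("person, bus,car,, Bus")
print(vocabulary)  # ['person', 'bus', 'car']
```

Keeping the first spelling of each name preserves the order in which the user listed the categories, which is convenient when labels are later mapped back to prompt indices.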

threshold (slider; preset: 0.05)

Minimum confidence score retained in the visualization. Raising it removes weak detections.

topk (number; preset: 100)

Maximum number of boxes kept before visualization or export.
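The combined effect of threshold and topk can be sketched as a small post-processing step. This is an illustration under stated assumptions, not the project's own code: the function name `filter_detections`, the dictionary layout, and the inclusive comparison (`>=`) are all hypothetical choices.

```python
from typing import Dict, List


def filter_detections(detections: List[Dict], threshold: float = 0.05,
                      topk: int = 100) -> List[Dict]:
    """Keep detections scoring at least `threshold`, then the `topk` best.

    Hypothetical helper; the inclusive comparison is an assumption.
    """
    kept = [d for d in detections if d["score"] >= threshold]
    # Highest-confidence detections first, so truncation drops the weakest.
    kept.sort(key=lambda d: d["score"], reverse=True)
    return kept[:topk]


detections = [
    {"label": "person", "score": 0.91},
    {"label": "bus", "score": 0.87},
    {"label": "car", "score": 0.03},     # below the 0.05 preset, removed
    {"label": "person", "score": 0.42},
]

for det in filter_detections(detections, threshold=0.05, topk=2):
    print(det["label"], det["score"])
```

Raising `threshold` shrinks the candidate set first; `topk` then caps how many of the remaining boxes reach visualization or export.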

Output Explanation

bbox

The predicted box coordinates around each detected object.

label

The matched text category from the prompt vocabulary.

score

Detection confidence; higher values indicate stronger text-image matching.
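One way to hold the three output fields together is a small record type. The sketch below assumes the box is stored as [x1, y1, x2, y2] in pixels; this page does not state the coordinate convention, so check the actual prediction file before relying on it. The class name `Detection` is hypothetical.

```python
import json
from dataclasses import asdict, dataclass
from typing import List


@dataclass
class Detection:
    bbox: List[float]  # ASSUMED layout: [x1, y1, x2, y2] in pixels
    label: str         # matched text category from the prompt vocabulary
    score: float       # detection confidence

    def area(self) -> float:
        """Box area in square pixels under the assumed xyxy layout."""
        x1, y1, x2, y2 = self.bbox
        return max(0.0, x2 - x1) * max(0.0, y2 - y1)


det = Detection(bbox=[10.0, 20.0, 110.0, 220.0], label="bus", score=0.87)
record = json.dumps(asdict(det))  # JSON-style detection, as in the preset example
print(record)
print(det.area())  # 20000.0
```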

How To Use

Official resources, deployment steps, academic context, citation, and source-reported benchmark numbers.

Resources

GitHub: https://github.com/AILab-CVC/YOLO-World
Hugging Face Demo: https://huggingface.co/spaces/stevengrove/YOLO-World
Code Download: https://github.com/AILab-CVC/YOLO-World/archive/refs/heads/main.zip
arXiv: https://arxiv.org/abs/2401.17270
YOLO-World Model Card: https://github.com/AILab-CVC/YOLO-World#model-card
YOLO-World Hugging Face: https://huggingface.co/wondervictor/YOLO-World

Deployment Notes

Clone the official repository with submodules, then install the editable package and its MMYOLO/MMDetection-style dependencies.
Download one of the official YOLO-World weights from the model card or Hugging Face links.
Run the image demo with a relative image path, config path, checkpoint path, and comma-separated vocabulary.
Export annotated images and prediction JSON under tools/yolo-world/runs/ for the catalog workflow.

Relative Path Example

# Relative-path local entry for the YOLO-World tool folder
python tools/yolo-world/demo/image_demo.py \
  tools/yolo-world/demo/sample_images/bus.jpg \
  tools/yolo-world/configs/pretrain/yolo_world_v2_xl.py \
  tools/yolo-world/weights/yolo_world_v2_xl_obj365v1_goldg_pretrain-5daf1395.pth \
  "person,bus,car" \
  --topk 100 \
  --threshold 0.05 \
  --output-dir tools/yolo-world/runs/detect

# Use this as a documentation path. The static page does not execute the model.
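When the same invocation has to be assembled from a script, building it as an argument list avoids shell-quoting mistakes. The sketch below only constructs and prints the command; it does not run the model. The helper name `build_demo_command` is hypothetical, and the positional order simply mirrors the example command on this page, so confirm it against the script you actually have.

```python
import shlex
from pathlib import PurePosixPath

# Folder layout taken from the example command on this page.
TOOL_ROOT = PurePosixPath("tools/yolo-world")


def build_demo_command(image, config, checkpoint, prompts,
                       topk=100, threshold=0.05, output_dir="runs/detect"):
    """Return the demo invocation as an argument list (hypothetical helper)."""
    return [
        "python",
        str(TOOL_ROOT / "demo" / "image_demo.py"),
        str(TOOL_ROOT / image),
        str(TOOL_ROOT / config),
        str(TOOL_ROOT / checkpoint),
        ",".join(prompts),          # comma-separated vocabulary
        "--topk", str(topk),
        "--threshold", str(threshold),
        "--output-dir", str(TOOL_ROOT / output_dir),
    ]


cmd = build_demo_command(
    image="demo/sample_images/bus.jpg",
    config="configs/pretrain/yolo_world_v2_xl.py",
    checkpoint="weights/yolo_world_v2_xl_obj365v1_goldg_pretrain-5daf1395.pth",
    prompts=["person", "bus", "car"],
)
print(shlex.join(cmd))  # shell-safe string for logging; nothing is executed
```

The list form can be passed directly to `subprocess.run(cmd, check=True)` once the weights and config are in place.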

Expected Result Shape

{
  "tool": "yolo-world",
  "status": "ok",
  "results": [
    {
      "label": "Open-vocabulary detection",
      "score": 0.87,
      "output": "Bounding boxes, labels, scores"
    }
  ],
  "timing": {
    "runtime": "52.0 FPS on one NVIDIA V100 without TensorRT for the re-parameterized YOLO-World-L; the original non-re-parameterized version is reported at 17.6 FPS.",
    "device": "documented in source benchmark when available"
  },
  "artifacts": {
    "visualization": "tools/yolo-world/runs/visualization.png",
    "raw_predictions": "tools/yolo-world/runs/predictions.json"
  }
}
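A consumer of this payload might check the top-level keys before reading results. The following is a minimal sketch that assumes the shape shown above; the function name `summarize_result` and the error handling are hypothetical and not part of any YOLO-World API.

```python
import json

# Top-level keys taken from the result shape shown on this page.
REQUIRED_KEYS = ("tool", "status", "results", "timing", "artifacts")


def summarize_result(payload_text):
    """Validate the payload shape and return (label, score) pairs.

    Hypothetical helper for illustration only.
    """
    payload = json.loads(payload_text)
    missing = [key for key in REQUIRED_KEYS if key not in payload]
    if missing:
        raise ValueError(f"missing keys: {missing}")
    if payload["status"] != "ok":
        raise RuntimeError(f"tool reported status {payload['status']!r}")
    return [(item["label"], item["score"]) for item in payload["results"]]


sample = json.dumps({
    "tool": "yolo-world",
    "status": "ok",
    "results": [{"label": "Open-vocabulary detection", "score": 0.87,
                 "output": "Bounding boxes, labels, scores"}],
    "timing": {},
    "artifacts": {"raw_predictions": "tools/yolo-world/runs/predictions.json"},
})
print(summarize_result(sample))
```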


Academic Info

Paper identity and contribution summary.

Title: YOLO-World: Real-Time Open-Vocabulary Object Detection

Authors: Tianheng Cheng, Lin Song, Yixiao Ge, Wenyu Liu, Xinggang Wang, Ying Shan

Venue: CVPR 2024 / arXiv:2401.17270

Contribution: Connects YOLO-style real-time detection with open-vocabulary text conditioning, making prompt-driven detection practical for fast perception and grounding workflows.

Citation

@misc{yoloworld2024,
  title={YOLO-World: Real-Time Open-Vocabulary Object Detection},
  author={Tianheng Cheng and Lin Song and Yixiao Ge and Wenyu Liu and Xinggang Wang and Ying Shan},
  year={2024},
  note={CVPR 2024 / arXiv:2401.17270},
  url={https://arxiv.org/abs/2401.17270}
}

Benchmark

Only compact, source-reported numbers are shown here.

Dataset	Metric	Value	Runtime	Source
LVIS minival zero-shot	Fixed AP / AP_r / AP_c / AP_f	35.4 / 27.6 / 34.1 / 38.0	52.0 FPS on V100	CVPR 2024 paper
COCO val2017 fine-tuning	Box AP	44.9 AP for YOLO-World-L 640	YOLO-style real-time inference	Official repository model card
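To relate the throughput figures quoted on this page to a per-image budget, FPS can be converted to latency as 1000 / FPS milliseconds. The short sketch below applies that arithmetic to the two reported values (52.0 FPS re-parameterized, 17.6 FPS original); the function name `latency_ms` is an illustrative choice, and the numbers describe the source's V100 setup rather than any other hardware.

```python
def latency_ms(fps: float) -> float:
    """Convert frames per second to milliseconds per frame."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return 1000.0 / fps


reparam_ms = latency_ms(52.0)    # re-parameterized YOLO-World-L
original_ms = latency_ms(17.6)   # non-re-parameterized version
speedup = 52.0 / 17.6

print(f"{reparam_ms:.1f} ms vs {original_ms:.1f} ms per image "
      f"({speedup:.2f}x faster)")
# 19.2 ms vs 56.8 ms per image (2.95x faster)
```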

Artifacts

Official paper, speed-accuracy figure, LVIS/COCO evaluation tables, weights, configs, demo scripts, and ONNX export notes.

Demo Images

Visual references from the original tool.