Perception and Grounding

Grounding DINO

Text-conditioned detector that grounds natural language prompts to image regions.

Tool Introduction

Core parameters, trigger timing, and visual before/after demo references.

Short Explanation

Given an image and one or more text phrases, Grounding DINO returns bounding boxes grounded to those phrases, each with a confidence score.

Input: Image + text prompt

Output: Bounding boxes + labels + scores

Trigger timing: Run on demand once the input image, text prompt, config file, and checkpoint are prepared.

Runtime: Python / PyTorch

Before: Image + text prompt

Prepare an RGB input image and a dot-separated text prompt, plus the config file and checkpoint expected by the original project.

After: Bounding boxes + labels + scores

Inspect the annotated visualization and the predicted boxes, matched phrases, and confidence scores.

Preset Example

A quick-run style example for the documentation page.

Input: tools/grounding-dino/examples/input.jpg

Prompt: mug . cup . bottle

Expected: Annotated image and box/label/score predictions.
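The prompt above follows a lowercase, dot-separated format. The sketch below is a minimal, library-free way to build and normalize such prompts, assuming the model expects a lowercase caption that ends with a period (the repository's inference utilities appear to apply this normalization; verify against your checkout). The helper names build_prompt, preprocess_caption, and split_categories are hypothetical.

```python
def build_prompt(categories: list[str]) -> str:
    """Join category phrases into a dot-separated prompt (hypothetical helper)."""
    cleaned = [c.strip().lower() for c in categories if c.strip()]
    return " . ".join(cleaned) + " ."


def preprocess_caption(caption: str) -> str:
    """Lowercase and strip the prompt, then make sure it ends with a period."""
    result = caption.lower().strip()
    return result if result.endswith(".") else result + "."


def split_categories(caption: str) -> list[str]:
    """Recover the individual phrases from a dot-separated prompt."""
    return [part.strip() for part in caption.split(".") if part.strip()]


prompt = preprocess_caption("Mug . Cup . Bottle")
print(prompt)                    # mug . cup . bottle.
print(split_categories(prompt))  # ['mug', 'cup', 'bottle']
```

Keeping the separator explicit makes it easy to see which phrases the detector can return as labels.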

Parameters And Output

Readable controls and the meaning of each returned artifact.

Parameter Explanation

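image (file)

Input RGB image.

text_prompt (text, default: mug . cup . bottle)

Category words or phrases separated by " . ", for example mug . cup . bottle.

box_threshold (slider, default: 0.35)

Minimum confidence for a predicted box: a box is kept only if its highest prompt-token score exceeds this value.

text_threshold (slider, default: 0.25)

Minimum token-level score for a prompt word to be included in a kept box's matched phrase.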
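The two thresholds act at different levels: box_threshold decides which candidate boxes survive, and text_threshold decides which prompt words form each surviving box's phrase. The sketch below illustrates that two-stage filter in plain Python. It is a simplification that assumes one score per prompt word (the real model scores subword tokens from its text encoder), and the names filter_queries and Detection are hypothetical. The reported score is the box's highest token score.

```python
from dataclasses import dataclass


@dataclass
class Detection:
    box: tuple[float, float, float, float]  # normalized (cx, cy, w, h)
    phrase: str
    score: float


def filter_queries(tokens, token_scores, boxes, box_threshold=0.35, text_threshold=0.25):
    """Two-stage filter: keep boxes by best token score, then build phrases."""
    detections = []
    for scores, box in zip(token_scores, boxes):
        best = max(scores)
        if best <= box_threshold:
            continue  # box_threshold drops low-confidence candidate boxes
        # text_threshold picks the prompt words (ignoring ".") matched by this box
        words = [t for t, s in zip(tokens, scores) if s > text_threshold and t != "."]
        detections.append(Detection(box=box, phrase=" ".join(words), score=best))
    return detections


tokens = ["mug", ".", "cup", ".", "bottle", "."]
token_scores = [
    [0.62, 0.05, 0.30, 0.02, 0.01, 0.03],  # strong "mug", weaker "cup"
    [0.10, 0.02, 0.20, 0.01, 0.15, 0.02],  # nothing above box_threshold
    [0.05, 0.01, 0.04, 0.02, 0.48, 0.10],  # "bottle"
]
boxes = [(0.30, 0.50, 0.20, 0.30), (0.60, 0.40, 0.10, 0.20), (0.80, 0.50, 0.10, 0.40)]

for det in filter_queries(tokens, token_scores, boxes):
    print(det.phrase, det.score)
# mug cup 0.62
# bottle 0.48
```

The first box shows how a low text_threshold can attach more than one category word to a single box; raising text_threshold to 0.35 in this toy example leaves only "mug".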

Output Explanation

boxes

Predicted region coordinates. The repository's Python predict helper returns them as normalized (cx, cy, w, h) values in [0, 1].

phrases

Matched text phrases for each box.

scores

Confidence values for grounded detections.
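For drawing or cropping, normalized boxes usually need converting to pixel corners. A minimal sketch, assuming boxes arrive as normalized (cx, cy, w, h) values in [0, 1]; to_pixel_xyxy is a hypothetical name, and clamping to the image bounds is a design choice of this sketch, not something the model does.

```python
def _clamp(value: float, upper: float) -> float:
    """Limit a coordinate to the range [0, upper]."""
    return min(max(value, 0.0), float(upper))


def to_pixel_xyxy(box, width, height):
    """Convert a normalized (cx, cy, w, h) box to clamped pixel (x0, y0, x1, y1)."""
    cx, cy, w, h = box
    x0 = (cx - w / 2) * width
    y0 = (cy - h / 2) * height
    x1 = (cx + w / 2) * width
    y1 = (cy + h / 2) * height
    # Clamp so boxes that spill past the border stay inside the image.
    return (_clamp(x0, width), _clamp(y0, height), _clamp(x1, width), _clamp(y1, height))


corners = to_pixel_xyxy((0.5, 0.5, 0.2, 0.4), 640, 480)
print(tuple(round(v, 1) for v in corners))  # (256.0, 144.0, 384.0, 336.0)
```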

How To Use

Official resources, deployment steps, academic context, citation, and source-reported benchmark numbers.

Resources

GitHub: https://github.com/IDEA-Research/GroundingDINO
Code download: https://github.com/IDEA-Research/GroundingDINO/archive/refs/heads/main.zip
arXiv: https://arxiv.org/abs/2303.05499
Checkpoint links: https://github.com/IDEA-Research/GroundingDINO#weights

Deployment Notes

Install the Grounding DINO dependencies and, if required, build the optional CUDA extensions.
Download the official checkpoints and config files.
Run the image demo with a text prompt and thresholds.
Save visualizations and prediction JSON under tools/grounding-dino/runs/.
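The last step can be sketched with the standard library alone. This assumes detections are already available as (box, phrase, score) tuples with normalized (cx, cy, w, h) boxes. The JSON field names (phrase, score, box_cxcywh) and the helper name save_predictions are hypothetical, chosen for this page rather than taken from the repository.

```python
import json
import tempfile
from pathlib import Path


def save_predictions(run_dir, detections):
    """Write (box, phrase, score) tuples to <run_dir>/predictions.json."""
    run_dir = Path(run_dir)
    run_dir.mkdir(parents=True, exist_ok=True)  # create runs/ on first use
    payload = [
        {"phrase": phrase, "score": round(float(score), 4), "box_cxcywh": [float(v) for v in box]}
        for box, phrase, score in detections
    ]
    out_path = run_dir / "predictions.json"
    out_path.write_text(json.dumps(payload, indent=2), encoding="utf-8")
    return out_path


# Usage: write into a throwaway directory so the example leaves nothing behind.
with tempfile.TemporaryDirectory() as tmp:
    out = save_predictions(Path(tmp) / "tools/grounding-dino/runs",
                           [((0.5, 0.5, 0.2, 0.4), "mug", 0.62)])
    saved = json.loads(out.read_text(encoding="utf-8"))

print(saved[0]["phrase"], saved[0]["score"])  # mug 0.62
```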

Relative Path Example

python demo/inference_on_a_image.py -c tools/grounding-dino/config/GroundingDINO_SwinT_OGC.py -p tools/grounding-dino/weights/groundingdino_swint_ogc.pth -i tools/grounding-dino/examples/input.jpg -t "mug . cup . bottle" -o tools/grounding-dino/runs
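When the demo is launched from Python (for example, from a batch script), passing an argument list avoids shell-quoting problems with the space-separated prompt. A minimal sketch, assuming it runs from the GroundingDINO repository root (the script path is relative) and using only the flags shown in the command above; build_demo_command is a hypothetical helper. Check the script's --help for any threshold flags before adding them.

```python
import shlex


def build_demo_command(config, checkpoint, image, prompt, out_dir):
    """Assemble the demo invocation as an argv list (no shell involved)."""
    return [
        "python", "demo/inference_on_a_image.py",
        "-c", str(config),
        "-p", str(checkpoint),
        "-i", str(image),
        "-t", prompt,
        "-o", str(out_dir),
    ]


cmd = build_demo_command(
    "tools/grounding-dino/config/GroundingDINO_SwinT_OGC.py",
    "tools/grounding-dino/weights/groundingdino_swint_ogc.pth",
    "tools/grounding-dino/examples/input.jpg",
    "mug . cup . bottle",
    "tools/grounding-dino/runs",
)
print(shlex.join(cmd))  # shell-safe rendering, useful for logs
# To launch inside a GroundingDINO checkout: import subprocess; subprocess.run(cmd, check=True)
```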

Expected Result Shape

{
  "tool": "grounding-dino",
  "status": "ok",
  "results": [
    {
      "label": "Open-set object detection",
      "score": 0.87,
      "output": "Bounding boxes + labels + scores"
    }
  ],
  "timing": {
    "runtime": "The repository exposes lighter Swin-T and stronger Swin-B checkpoints; the benchmark section highlights explicit AP numbers rather than a single official latency figure.",
    "device": "documented in source benchmark when available"
  },
  "artifacts": {
    "visualization": "tools/grounding-dino/runs/visualization.png",
    "raw_predictions": "tools/grounding-dino/runs/predictions.json"
  }
}
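A produced result file can be checked against this shape before downstream use. The sketch below is a hypothetical validator (validate_result is not part of the repository) that only checks the keys shown above and that each score lies in [0, 1]; it does not check box geometry. The example values are illustrative.

```python
def validate_result(doc: dict) -> list[str]:
    """Return a list of problems; an empty list means the shape matches."""
    required_top = ("tool", "status", "results", "timing", "artifacts")
    required_artifacts = ("visualization", "raw_predictions")
    problems = [f"missing key: {key}" for key in required_top if key not in doc]
    for i, item in enumerate(doc.get("results", [])):
        if "label" not in item:
            problems.append(f"results[{i}] has no label")
        score = item.get("score")
        if not isinstance(score, (int, float)) or not 0.0 <= score <= 1.0:
            problems.append(f"results[{i}] has invalid score: {score!r}")
    artifacts = doc.get("artifacts", {})
    problems += [f"missing artifact: {key}" for key in required_artifacts if key not in artifacts]
    return problems


example = {
    "tool": "grounding-dino",
    "status": "ok",
    "results": [{"label": "mug", "score": 0.87, "output": "Bounding boxes + labels + scores"}],
    "timing": {"runtime": "...", "device": "..."},
    "artifacts": {
        "visualization": "tools/grounding-dino/runs/visualization.png",
        "raw_predictions": "tools/grounding-dino/runs/predictions.json",
    },
}
print(validate_result(example))  # []
print(validate_result({"tool": "grounding-dino", "results": [{"score": 1.3}]}))
```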


Academic Info

Paper identity and contribution summary.

Title: Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection

Authors: Shilong Liu, Zhaoyang Zeng, Tianhe Ren, et al.

Venue: arXiv:2303.05499

Contribution: Combines the DINO transformer-based detector with grounded pre-training so that detection can be conditioned on arbitrary text phrases (open-set, phrase-conditioned detection).

Citation

@misc{groundingdino2023,
  title={Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection},
  author={Shilong Liu and Zhaoyang Zeng and Tianhe Ren and others},
  year={2023},
  note={arXiv:2303.05499},
  url={https://arxiv.org/abs/2303.05499}
}

Benchmark

Only compact, source-reported numbers are shown here.

Dataset / setting	Metric	Value	Checkpoint	Source
COCO zero-shot (expected result of the official evaluation script)	box AP	48.5	GroundingDINO-T (Swin-T)	Official README
COCO zero-shot	box AP	48.4	GroundingDINO-T (Swin-T)	Official model table
COCO fine-tuned	box AP	57.2	GroundingDINO-T (Swin-T)	Official model table
COCO object detection	box AP	56.7	GroundingDINO-B (Swin-B)	Official model table

Artifacts

Official config files, pretrained weights, benchmark table, and demo outputs.
