Perception and Grounding

Grounding DINO

Text-conditioned detector that grounds natural language prompts to image regions.

Tool Introduction

Core parameters, trigger timing, and visual before/after demo references.

Short Explanation

Given an image and one or more text phrases, Grounding DINO returns a bounding box, a matched phrase, and a confidence score for each grounded region.

Input: Image + text prompt
Output: Bounding boxes + labels + scores
Trigger Timing: On demand, after the required input files and configuration are prepared.
Runtime: Python / PyTorch
Before: Image + text prompt

Prepare the input image, the text prompt naming the objects to find, and the model config and checkpoint.

After: Bounding boxes + labels + scores

Read the annotated visualization and the predicted boxes, labels, and scores.

Preset Example

A quick-run style example for the documentation page.

Input: tools/grounding-dino/examples/input.jpg
Prompt: mug . cup . bottle
Expected: Annotated image and box/label/score predictions.

Parameters And Output

Readable controls and the meaning of each returned artifact.

Parameter Explanation

image (file)

Input RGB image.

text_prompt (text, default: mug . cup . bottle)

Dot-separated category words or phrases.
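
As a minimal sketch of the dot-separated format, the helper below joins a list of category phrases into a prompt such as the preset "mug . cup . bottle". The function name build_text_prompt is hypothetical, and trimming whitespace and lowercasing are assumptions made here for consistency, not documented requirements.

```python
def build_text_prompt(categories):
    """Join category phrases into a dot-separated prompt string."""
    # Drop empty entries and normalize spacing and case (an assumption, see above).
    cleaned = [c.strip().lower() for c in categories if c and c.strip()]
    return " . ".join(cleaned)


prompt = build_text_prompt(["Mug", " cup ", "", "bottle"])
print(prompt)  # mug . cup . bottle
```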

box_threshold (slider, default: 0.35)

Minimum confidence for predicted boxes.

text_threshold (slider, default: 0.25)

Minimum text similarity for a phrase to be matched to a box.
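
The interplay of the two thresholds can be illustrated in plain Python. This is a simplified sketch, not the library's code: it assumes each candidate box carries one similarity score per prompt token, keeps a box only when its best score exceeds box_threshold, and builds the matched phrase from tokens scoring above text_threshold. The function name filter_detections and the sample scores are hypothetical.

```python
def filter_detections(token_scores_per_box, box_threshold=0.35, text_threshold=0.25):
    """Return (phrase, score) for every box that passes box_threshold."""
    kept = []
    for token_scores in token_scores_per_box:
        if not token_scores:
            continue
        best = max(token_scores.values())
        if best <= box_threshold:
            continue  # box is not confident enough
        # The phrase is made of every token above the text threshold.
        phrase = " ".join(t for t, s in token_scores.items() if s > text_threshold)
        kept.append((phrase, best))
    return kept


candidates = [
    {"mug": 0.82, "cup": 0.10, "bottle": 0.05},
    {"mug": 0.30, "cup": 0.28, "bottle": 0.02},  # best score below 0.35, dropped
    {"mug": 0.04, "cup": 0.31, "bottle": 0.62},
]
print(filter_detections(candidates))  # [('mug', 0.82), ('cup bottle', 0.62)]
```

Raising box_threshold removes weak boxes, while raising text_threshold shortens the phrase attached to each surviving box.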

Output Explanation

boxes

Predicted region coordinates.

phrases

Matched text phrases for each box.

scores

Confidence values for grounded detections.
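
Downstream drawing or cropping code usually needs pixel corner coordinates. The sketch below assumes boxes arrive as normalized (cx, cy, w, h) values in the range 0 to 1; this page does not state the format, so check what your version returns before relying on it. The function name is hypothetical.

```python
def cxcywh_norm_to_xyxy_pixels(box, image_width, image_height):
    """Convert a normalized (cx, cy, w, h) box to pixel (x_min, y_min, x_max, y_max)."""
    cx, cy, w, h = box
    x_min = (cx - w / 2) * image_width
    y_min = (cy - h / 2) * image_height
    x_max = (cx + w / 2) * image_width
    y_max = (cy + h / 2) * image_height
    # Round to one decimal to hide floating-point noise.
    return tuple(round(v, 1) for v in (x_min, y_min, x_max, y_max))


print(cxcywh_norm_to_xyxy_pixels((0.5, 0.5, 0.2, 0.4), 640, 480))
# (256.0, 144.0, 384.0, 336.0)
```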

How To Use

Official resources, deployment steps, academic context, citation, and source-reported benchmark numbers.

Deployment Notes

  1. Install Grounding DINO dependencies and build optional CUDA extensions if required.
  2. Download official checkpoints and config files.
  3. Run the image demo with a text prompt and thresholds.
  4. Save visualizations and prediction JSON under tools/grounding-dino/runs/.
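
Step 4 can be sketched as a small helper that writes the three output lists to a predictions.json file. The record layout (box, phrase, score) and the function name save_predictions are assumptions for illustration; the example writes to a temporary directory instead of tools/grounding-dino/runs/ so it leaves the project tree untouched.

```python
import json
import tempfile
from pathlib import Path


def save_predictions(boxes, phrases, scores, out_dir):
    """Write one JSON record per detection and return the file path."""
    if not (len(boxes) == len(phrases) == len(scores)):
        raise ValueError("boxes, phrases, and scores must have the same length")
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)
    records = [
        {"box": list(b), "phrase": p, "score": float(s)}
        for b, p, s in zip(boxes, phrases, scores)
    ]
    target = out_path / "predictions.json"
    target.write_text(json.dumps({"tool": "grounding-dino", "results": records}, indent=2))
    return target


with tempfile.TemporaryDirectory() as tmp:
    path = save_predictions([(0.5, 0.5, 0.2, 0.4)], ["mug"], [0.87], Path(tmp) / "runs")
    saved = json.loads(path.read_text())
    print(saved["results"][0]["phrase"])  # mug
```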

Relative Path Example

python demo/inference_on_a_image.py -c tools/grounding-dino/config/GroundingDINO_SwinT_OGC.py -p tools/grounding-dino/weights/groundingdino_swint_ogc.pth -i tools/grounding-dino/examples/input.jpg -t "mug . cup . bottle" -o tools/grounding-dino/runs
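
When the demo is launched from another script, building the argument list programmatically avoids shell-quoting mistakes with the dotted prompt. The sketch below reuses the flags from the command above. The --box_threshold and --text_threshold option names are assumed to match the demo script, so confirm them with the script's --help output; build_demo_command is a hypothetical helper.

```python
import shlex


def build_demo_command(config, checkpoint, image, prompt, output_dir,
                       box_threshold=None, text_threshold=None):
    """Return the demo invocation as an argument list for subprocess.run."""
    cmd = [
        "python", "demo/inference_on_a_image.py",
        "-c", config,
        "-p", checkpoint,
        "-i", image,
        "-t", prompt,
        "-o", output_dir,
    ]
    # Threshold flags are assumptions; they are only added when requested.
    if box_threshold is not None:
        cmd += ["--box_threshold", str(box_threshold)]
    if text_threshold is not None:
        cmd += ["--text_threshold", str(text_threshold)]
    return cmd


cmd = build_demo_command(
    "tools/grounding-dino/config/GroundingDINO_SwinT_OGC.py",
    "tools/grounding-dino/weights/groundingdino_swint_ogc.pth",
    "tools/grounding-dino/examples/input.jpg",
    "mug . cup . bottle",
    "tools/grounding-dino/runs",
    box_threshold=0.35,
    text_threshold=0.25,
)
print(shlex.join(cmd))  # shell-safe form; pass cmd itself to subprocess.run
```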

Expected Result Shape

{
  "tool": "grounding-dino",
  "status": "ok",
  "results": [
    {
      "label": "Open-set object detection",
      "score": 0.87,
      "output": "Bounding boxes + labels + scores"
    }
  ],
  "timing": {
    "runtime": "The repository exposes lighter Swin-T and stronger Swin-B checkpoints; the benchmark section highlights explicit AP numbers rather than a single official latency figure.",
    "device": "documented in source benchmark when available"
  },
  "artifacts": {
    "visualization": "tools/grounding-dino/runs/visualization.png",
    "raw_predictions": "tools/grounding-dino/runs/predictions.json"
  }
}
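
A consumer of this result shape might check the status field before reading detections and artifact paths. The following is a minimal sketch using a trimmed copy of the JSON above; summarize_result and its min_score parameter are hypothetical names, not part of the tool.

```python
import json

RESULT_JSON = """
{
  "tool": "grounding-dino",
  "status": "ok",
  "results": [
    {"label": "Open-set object detection", "score": 0.87,
     "output": "Bounding boxes + labels + scores"}
  ],
  "artifacts": {
    "visualization": "tools/grounding-dino/runs/visualization.png",
    "raw_predictions": "tools/grounding-dino/runs/predictions.json"
  }
}
"""


def summarize_result(payload, min_score=0.0):
    """Return ((label, score) pairs, artifacts dict) for a successful run."""
    if payload.get("status") != "ok":
        raise RuntimeError(f"run failed with status {payload.get('status')!r}")
    detections = [
        (r["label"], r["score"])
        for r in payload.get("results", [])
        if r["score"] >= min_score
    ]
    return detections, payload.get("artifacts", {})


detections, artifacts = summarize_result(json.loads(RESULT_JSON), min_score=0.5)
print(detections)
print(artifacts["raw_predictions"])
```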

Academic Info

Paper identity and contribution summary.

Title: Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection
Authors: Shilong Liu, Zhaoyang Zeng, Tianhe Ren, et al.
Venue: arXiv:2303.05499
Contribution: Combines detector pretraining and language grounding to support open-set phrase-conditioned detection.

Citation

@misc{groundingdino2023,
  title={Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection},
  author={Shilong Liu and Zhaoyang Zeng and Tianhe Ren and others},
  year={2023},
  note={arXiv:2303.05499},
  url={https://arxiv.org/abs/2303.05499}
}

Benchmark

Only compact, source-reported numbers are shown here.

Dataset | Metric | Value | Runtime | Source
COCO zero-shot evaluation | box AP | 48.5 expected from the official evaluation script; 48.4 zero-shot / 57.2 fine-tune for GroundingDINO-T | Swin-T checkpoint | Official README and model table
COCO object detection checkpoints | box AP | 56.7 for GroundingDINO-B | Swin-B checkpoint | Official model table

Artifacts

Official config files, pretrained weights, benchmark table, and demo outputs.

Demo Images

Visual references from the original tool.