Perception and Grounding

Depth Anything

Foundation depth model for robust relative depth prediction from a single RGB image.

Tool Introduction

Core parameters, trigger timing, and visual before/after demo references.

Short Explanation

Given one RGB image, Depth Anything predicts a dense relative depth map.

Input: RGB image
Output: Depth map
Trigger Timing: Triggered on demand after the required input files and configuration are prepared.
Runtime: Python / PyTorch
Before: RGB image

Prepare a single RGB image and choose the encoder configuration expected by the original project.

After: Depth map

Read the predicted relative depth map and its colorized visualization.

Preset Example

A quick-run style example for the documentation page.

Input: tools/depth-anything/examples/input.jpg
Prompt: encoder: vitl
Expected: A normalized depth map image aligned with the input frame.
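
A normalized depth map is commonly produced by min-max scaling the raw relative prediction to the 8-bit range. The sketch below assumes the raw prediction is available as a 2D NumPy array; the function name `normalize_depth` and the sample values are hypothetical and are not taken from the official repository.

```python
import numpy as np

def normalize_depth(depth: np.ndarray) -> np.ndarray:
    """Min-max scale a relative depth map to uint8 values in [0, 255]."""
    depth = depth.astype(np.float32)
    d_min, d_max = float(depth.min()), float(depth.max())
    if d_max - d_min < 1e-8:
        # Constant map: avoid division by zero and return all zeros.
        return np.zeros(depth.shape, dtype=np.uint8)
    scaled = (depth - d_min) / (d_max - d_min) * 255.0
    return scaled.astype(np.uint8)

# Synthetic stand-in for a model prediction (values are arbitrary).
raw = np.array([[0.5, 1.0], [1.5, 2.5]], dtype=np.float32)
normalized = normalize_depth(raw)
```

Because the scaling uses the per-image minimum and maximum, the result is only comparable within one image, which is consistent with the prediction being relative rather than metric depth.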

Parameters And Output

Readable controls and the meaning of each returned artifact.

Parameter Explanation

img_path (file)

Input RGB image file.

encoder (select, default: vitl)

Backbone variant (vits, vitb, vitl).

outdir (path)

Directory for exported depth maps.

Output Explanation

depth_map

Predicted per-pixel relative depth.

vis_depth

Colorized depth map for visualization.
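
One way to produce such a colorized view is to push an 8-bit depth image through a 256-entry color lookup table. The sketch below uses NumPy only and a hypothetical two-color ramp; the function name `colorize_depth` and the chosen colors are assumptions for illustration, and the original project may use a different colormap.

```python
import numpy as np

def colorize_depth(depth_u8: np.ndarray) -> np.ndarray:
    """Map an 8-bit depth image (H, W) to an RGB image (H, W, 3) via a lookup table."""
    # Hypothetical ramp: dark blue for value 0, yellow for value 255.
    low = np.array([0, 0, 128], dtype=np.float32)
    high = np.array([255, 255, 0], dtype=np.float32)
    t = np.linspace(0.0, 1.0, 256, dtype=np.float32)[:, None]
    lut = ((1.0 - t) * low + t * high).astype(np.uint8)  # shape (256, 3)
    # Integer-array indexing applies the table to every pixel at once.
    return lut[depth_u8]

# Synthetic 2x2 depth image standing in for a normalized prediction.
depth_u8 = np.array([[0, 128], [255, 64]], dtype=np.uint8)
vis = colorize_depth(depth_u8)
```

A perceptually ordered colormap from an imaging library is usually a better choice for reports; the linear ramp here only demonstrates the lookup mechanism.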

How To Use

Official resources, deployment steps, academic context, citation, and source-reported benchmark numbers.

Deployment Notes

  1. Install dependencies and model checkpoints per the official README.
  2. Select an encoder checkpoint that matches resource constraints.
  3. Run image inference with repository-relative paths.
  4. Save outputs under tools/depth-anything/runs/ for downstream tasks.
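
Step 2 can be sketched as a lookup against the inference times this page quotes from the official README. The names `LATENCY_MS` and `pick_encoder` are hypothetical, and the mapping of Small / Base / Large to `vits` / `vitb` / `vitl` is an assumption based on the encoder names listed under Parameter Explanation.

```python
# README-reported inference times in milliseconds, as quoted on this page.
# Assumption: Small / Base / Large correspond to vits / vitb / vitl.
LATENCY_MS = {
    "V100": {"vits": 12, "vitb": 13, "vitl": 20},
    "A100": {"vits": 8, "vitb": 9, "vitl": 13},
    "RTX4090 TensorRT": {"vits": 3, "vitb": 6, "vitl": 12},
}

def pick_encoder(device: str, budget_ms: float):
    """Return the largest encoder whose reported latency fits the budget, or None."""
    latencies = LATENCY_MS[device]
    for encoder in ("vitl", "vitb", "vits"):  # largest first
        if latencies[encoder] <= budget_ms:
            return encoder
    return None

choice = pick_encoder("A100", budget_ms=10)
```

Latency is only one constraint; GPU memory and accuracy requirements may also affect the choice and are not modeled here.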

Relative Path Example

python run.py --img-path tools/depth-anything/examples/input.jpg --encoder vitl --outdir tools/depth-anything/runs
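
When the command is launched from a script, it can help to assemble and validate the arguments first. The sketch below only builds the argument list for the command shown above; `build_command` and `VALID_ENCODERS` are hypothetical helper names, and the flags are the ones documented on this page.

```python
VALID_ENCODERS = ("vits", "vitb", "vitl")

def build_command(img_path: str, encoder: str = "vitl",
                  outdir: str = "tools/depth-anything/runs") -> list:
    """Build the argument list for run.py without executing it."""
    if encoder not in VALID_ENCODERS:
        raise ValueError(f"encoder must be one of {VALID_ENCODERS}, got {encoder!r}")
    return [
        "python", "run.py",
        "--img-path", img_path,
        "--encoder", encoder,
        "--outdir", outdir,
    ]

cmd = build_command("tools/depth-anything/examples/input.jpg")
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` from the directory that the relative paths are resolved against.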

Expected Result Shape

{
  "tool": "depth-anything",
  "status": "ok",
  "depth_map": [
    {
      "label": "Monocular depth estimation",
      "score": 0.87,
      "output": "Depth map"
    }
  ],
  "timing": {
    "runtime": "The official README reports inference time on V100 / A100 / RTX4090 TensorRT as 12 / 8 / 3 ms for Small, 13 / 9 / 6 ms for Base, and 20 / 13 / 12 ms for Large.",
    "device": "documented in source benchmark when available"
  },
  "artifacts": {
    "visualization": "tools/depth-anything/runs/visualization.png",
    "raw_predictions": "tools/depth-anything/runs/predictions.json"
  }
}
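
A downstream task could consume a result of this shape as follows. This is a minimal sketch that assumes the result is available as a JSON string with the `status` and `artifacts` keys shown above; `read_artifacts` is a hypothetical helper, not part of the original project.

```python
import json

def read_artifacts(result_json: str) -> dict:
    """Return the artifact paths from a result document, failing on a non-ok status."""
    result = json.loads(result_json)
    if result.get("status") != "ok":
        raise RuntimeError(f"depth-anything run failed: {result.get('status')!r}")
    return dict(result.get("artifacts", {}))

# Trimmed example document using the keys and paths shown above.
example = json.dumps({
    "tool": "depth-anything",
    "status": "ok",
    "artifacts": {
        "visualization": "tools/depth-anything/runs/visualization.png",
        "raw_predictions": "tools/depth-anything/runs/predictions.json",
    },
})
artifacts = read_artifacts(example)
```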

Academic Info

Paper identity and contribution summary.

Title: Depth Anything: Unleashing the Power of Large-Scale Unlabeled Data
Authors: Lihe Yang, Bingyi Kang, Zilong Huang, et al.
Venue: CVPR 2024 / arXiv:2401.10891
Contribution: Builds a scalable depth foundation model using large-scale pseudo-labeled and unlabeled data.

Citation

@misc{depthanything2024,
  title={Depth Anything: Unleashing the Power of Large-Scale Unlabeled Data},
  author={Lihe Yang and Bingyi Kang and Zilong Huang and others},
  year={2024},
  note={CVPR 2024 / arXiv:2401.10891},
  url={https://arxiv.org/abs/2401.10891}
}

Benchmark

Only compact, source-reported numbers are shown here.

Dataset | Metric | Value | Runtime | Source
KITTI zero-shot benchmark | AbsRel / delta1 | 0.076 / 0.947 for Depth Anything-L; 0.080 / 0.939 for Depth Anything-B | Large and Base encoders | Official README
NYUv2 zero-shot benchmark | AbsRel / delta1 | 0.043 / 0.981 for Depth Anything-L; 0.046 / 0.979 for Depth Anything-B | Large and Base encoders | Official README
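
For reference, AbsRel is the mean absolute relative error (lower is better) and delta1 is the fraction of pixels whose prediction-to-ground-truth ratio, taken in whichever direction is larger, stays below 1.25 (higher is better). The sketch below computes both on flat lists of valid, positive depth values. It assumes the prediction has already been aligned to the ground-truth scale, which this page does not describe, and the sample values are synthetic.

```python
def abs_rel(pred, gt):
    """Mean of |pred - gt| / gt over all valid pixels."""
    return sum(abs(p - g) / g for p, g in zip(pred, gt)) / len(gt)

def delta1(pred, gt, threshold=1.25):
    """Fraction of pixels with max(pred / gt, gt / pred) below the threshold."""
    hits = sum(1 for p, g in zip(pred, gt) if max(p / g, g / p) < threshold)
    return hits / len(gt)

# Synthetic values for illustration only; not benchmark data.
gt = [1.0, 2.0, 4.0, 8.0]
pred = [1.1, 2.0, 3.0, 8.0]

error = abs_rel(pred, gt)
accuracy = delta1(pred, gt)
```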

Artifacts

Official checkpoints, run script, and zero-shot benchmark tables from the README.

Demo Images

Visual references from the original tool.