Leaderboard

Grouped ranking for real tools only. Draft templates are excluded.

Rankings are grouped into four tool categories. The values shown are source-reported benchmark numbers in their native metrics (AP, J&F, REL, success rate, APE/RMSE, runtime), so they are not combined into one universal score. Hotness is calculated as likes + saves from the current engagement data.
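As a rough illustration of the hotness formula and the per-category grouping, the sketch below computes likes + saves and orders entries inside each category. The `Entry` class, the `rank_by_category` helper, and the engagement counts are hypothetical placeholders, and ordering each group by hotness is an assumption, since this page does not state how rank positions are assigned.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Entry:
    tool: str
    category: str
    metric: str  # source-reported metric string, kept verbatim and never merged
    likes: int   # placeholder engagement count, not real data
    saves: int   # placeholder engagement count, not real data

    @property
    def hotness(self) -> int:
        # Hotness as defined on this page: likes + saves.
        return self.likes + self.saves


def rank_by_category(entries):
    """Group entries by category, then order each group by hotness (descending).

    Ordering by hotness is an assumption made for this sketch.
    """
    groups = defaultdict(list)
    for entry in entries:
        groups[entry.category].append(entry)
    return {
        category: sorted(group, key=lambda e: e.hotness, reverse=True)
        for category, group in groups.items()
    }


entries = [
    Entry("YOLO-World", "Perception and Grounding", "44.9 AP", likes=12, saves=5),
    Entry("Cutie", "Perception and Grounding", "88.8 J&F", likes=4, saves=3),
    Entry("OctoMap", "Cognition and State Modeling", "98.79% accuracy", likes=9, saves=2),
]

for category, ranked in rank_by_category(entries).items():
    for rank, entry in enumerate(ranked, start=1):
        print(f"{category} #{rank}: {entry.tool} ({entry.metric}, hotness {entry.hotness})")
```

The metric string travels with each entry untouched, which mirrors the point above: AP, J&F, and accuracy values are displayed side by side but never converted into a shared score.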

Perception and Grounding

Detection, depth, segmentation, video masks, and spatial grounding are ranked only within their comparable task scope.

Task-specific perception metric
| Rank | Tool | Task | Dataset | Metric | Speed / Runtime | Artifacts |
| --- | --- | --- | --- | --- | --- | --- |
| #1 | YOLO-World | Open-vocabulary detection | COCO val2017 fine-tuning | 44.9 AP (Box AP: 44.9 AP for YOLO-World-L 640) | YOLO-style real-time inference | Paper, Repo, Weights, Demo |
| #2 | MiDaS | Monocular depth estimation | 6-dataset zero-shot benchmark | 0.1137 WHDR (DIW WHDR / ETH3D AbsRel / Sintel AbsRel / TUM / KITTI / NYUv2 zero-shot error: 0.1137 / 0.0659 / 0.2366 / 6.13 / 11.56 / 1.86 for MiDaS v3.1 BEiT-L-512) | 345M params, 5.7 FPS on RTX 3090 | Paper, Repo, Weights, Demo |
| #3 | LINGO-Space | Language-conditioned spatial grounding | Single referring expression, 12 tabletop tasks | 80.0 Success (Success score range: 80.0 to 100.0 across the 12 reported tasks) | Source paper result | Paper, Repo, Weights, Demo |
| #4 | Feature_Squeezer | Adversarial example detection | Target model baselines | 99.43% accuracy (Top-1 accuracy: MNIST 99.43%; CIFAR-10 94.84%; ImageNet MobileNet 68.36% top-1 / 88.25% top-5) | Classifier baselines used for detection evaluation | Paper, Repo, Demo, Benchmark |
| #5 | Zero-DCE | Low-light image enhancement | Full-reference low-light enhancement comparison | 16.57 PSNR (PSNR / SSIM / MAE: 16.57 / 0.59 / 98.78) | PyTorch GPU | Paper, Repo, Weights, Demo |
| #6 | Dense Object Nets | Dense visual correspondence for manipulation | Robot-collected object correspondence, standard-SO | 93% (Image-pair match precision: 93% of image pairs have normalized pixel error under 13% of the image diagonal) | Descriptor correspondence evaluation | Paper, Demo, Benchmark |
| #7 | Cutie | Video object segmentation | DAVIS-2017 / YouTubeVOS-2019 | 88.8 J&F (J&F / G: DAVIS val 88.8, DAVIS test 85.3, YouTubeVOS G 86.5) | Cutie-small 45.5 FPS | Paper, Repo, Weights, Demo |
| #8 | FastSAM | Promptable segmentation | LVIS v1 | 57.1 AR@1000 (BBox AR@1000 / AR_s / AR_m / AR_l: 57.1 / 44.3 / 77.1 / 85.3) | 68M parameters | Paper, Repo, Weights, Demo |
| #9 | Grounding DINO | Open-set object detection | COCO zero-shot evaluation | 48.5 AP (box AP: 48.5 expected from the official evaluation script; 48.4 zero-shot / 57.2 fine-tune for GroundingDINO-T) | Swin-T checkpoint | Paper, Repo, Weights, Demo |
| #10 | Restormer | Image restoration | Real image denoising, SIDD / DND | 40.02 PSNR (PSNR / SSIM: 40.02 / 0.960 on SIDD; 40.03 / 0.956 on DND) | Task-specific denoising model | Paper, Repo, Weights, Demo |
| #11 | DeblurGANv2 | Image deblurring | GoPro test | 29.55 PSNR (PSNR / SSIM: 29.55 / 0.934 for InceptionResNet-v2) | fpn_inception.h5 | Paper, Repo, Weights, Demo |
| #12 | ZoeDepth | Metric depth estimation | NYU Depth V2 | 0.955 delta1 (delta1 / REL / RMSE / log10: 0.955 / 0.075 / 0.270 / 0.032 for ZoeD-M12-N) | 42M-345M parameters depending on backbone | Paper, Repo, Weights, Demo |
| #13 | Depth Anything | Monocular depth estimation | NYUv2 zero-shot benchmark | 0.043 AbsRel (AbsRel / delta1: 0.043 / 0.981 for Depth Anything-L; 0.046 / 0.979 for Depth Anything-B) | Large and Base encoders | Paper, Repo, Weights, Demo |
| #14 | CLAHE_Filter | Local contrast enhancement | CLAHE is an image-processing primitive rather than a learned model; OpenCV and the original paper do not provide a single canonical cross-dataset benchmark number for this wrapper. | No numeric benchmark. No universal official numeric benchmark is copied here because contrast improvement is image-dependent and usually evaluated as part of a downstream perception pipeline. | Interactive according to the submitted spreadsheet; exact runtime depends on resolution, tile grid size, color space conversion, and CPU/GPU backend. | Paper, Repo, Demo |
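For readers comparing the depth rows, the following is a minimal sketch of the commonly used definitions of AbsRel (mean of |pred - gt| / gt) and delta1 (share of pixels with max(pred/gt, gt/pred) below 1.25). The `depth_metrics` helper and the sample values are hypothetical, and the masking rule is an assumption; each listed paper applies its own evaluation protocol (valid-depth masks, depth caps, scale alignment for zero-shot models), so this will not reproduce the table numbers.

```python
def depth_metrics(pred, gt):
    """Return (AbsRel, delta1) for paired depth values.

    Assumption: pairs with non-positive ground truth or prediction are
    skipped; published protocols differ in how they mask pixels.
    """
    pairs = [(p, g) for p, g in zip(pred, gt) if g > 0 and p > 0]
    if not pairs:
        raise ValueError("no valid depth pairs to evaluate")
    # AbsRel: mean absolute error relative to the ground-truth depth.
    abs_rel = sum(abs(p - g) / g for p, g in pairs) / len(pairs)
    # delta1: fraction of pairs whose ratio error stays below 1.25.
    delta1 = sum(max(p / g, g / p) < 1.25 for p, g in pairs) / len(pairs)
    return abs_rel, delta1


# Illustrative values only; the last pair has no ground truth and is skipped.
gt = [1.0, 2.0, 4.0, 0.0]
pred = [1.1, 2.0, 6.0, 3.0]
abs_rel, delta1 = depth_metrics(pred, gt)
print(f"AbsRel={abs_rel:.3f}, delta1={delta1:.3f}")
```

Lower AbsRel and higher delta1 are better, which is why the two depth entries above headline different numbers that cannot be sorted on one axis.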

Cognition and State Modeling

State tools use map quality, reconstruction quality, trajectory accuracy, memory retrieval, and relation modeling as primary evidence.

State quality / reconstruction / memory metric
| Rank | Tool | Task | Dataset | Metric | Speed / Runtime | Artifacts |
| --- | --- | --- | --- | --- | --- | --- |
| #1 | OctoMap | 3D occupancy mapping | New College, 10 cm resolution | 98.79% accuracy (Accuracy / cross-validation: 98.79% / 98.46%) | Occupancy map evaluation | Paper, Repo, Demo, Benchmark |
| #2 | query_historical_action_timeline | Historical action timeline query | GTEA | 87.5 F1@10 (F1@10 / F1@25 / F1@50; Edit; Acc: 87.5 / 85.4 / 74.6; 81.4; 79.2) | MS-TCN with fine-tuning | Paper, Repo, Demo, Benchmark |
| #3 | STM | Space-time visual memory | YouTube-VOS validation | 79.4 overall (Overall / seen J / seen F / unseen J / unseen F: 79.4 / 79.7 / 84.2 / 72.8 / 80.9) | Official STM evaluation | Paper, Repo, Demo, Benchmark |
| #4 | sentence-transformers | Sentence embedding | SBERT model table, 14 sentence-embedding datasets | 68.06 (all-MiniLM-L6-v2 sentence performance: 68.06) | 14,200 sentences/sec on V100 | Paper, Repo, Weights, Demo |
| #5 | Action Genome | Spatio-temporal scene graph state modeling | Action Genome / Charades few-shot action recognition | 42.7 mAP (mAP with 10 examples: 42.7%) | Few-shot action recognition experiment | Paper, Repo, Demo, Benchmark |
| #6 | Hydra | 3D scene graph construction | SidPac Floor 3-4 | 75.3 ms (Component timing: Objects 75.3±37.0 ms, places 4.2±2.1 ms, rooms 15.0±14.6 ms) | Online graph construction | Paper, Repo, Weights, Demo |
| #7 | retrieve_past_visual_state_faiss | Visual memory retrieval | Billion-scale similarity search | 8.5x (Nearest-neighbor search implementation speedup: 8.5x faster than the previously reported state of the art) | GPU nearest-neighbor search implementation | Paper, Repo, Demo, Benchmark |
| #8 | DUSt3R | Geometric 3D reconstruction | DTU zero-shot MVS | 2.677 mm accuracy (Accuracy / completeness / overall: 2.677 mm / 0.805 mm / 1.741 mm) | Multi-view global alignment | Paper, Repo, Weights, Demo |
| #9 | R3LIVE | RGB-colored LIV mapping | HKUST campus loops | 0.093 m drift (Loop drift: 0.093 m, 0.154 m, 0.164 m, 0.102 m over 1.19-1.52 km trajectories) | Real-time mapping pipeline | Paper, Repo, Weights, Demo |
| #10 | query_3d_scene_graph | Queryable 3D scene memory | 3RScan / 3DSSG, full scene with GT instances | 0.70 object R@5 (Object / predicate recall: Object R@5 0.70, R@10 0.80; predicate R@3 0.97, R@5 0.99) | Upstream graph prediction, not local wrapper timing | Paper, Repo, Demo, Benchmark |
| #11 | FAST-LIVO2 | LiDAR-inertial-visual odometry | Airborne mapping public sequences | 0.64 m APE (APE RMSE: 0.64 m / 0.27 m vs R3LIVE 2.76 m / 0.52 m) | 17.13 ms LiDAR + 12.90 ms image average | Paper, Repo, Weights, Demo |
| #12 | reMap | Queryable semantic mapping | The bundled deployment README verifies service-level functionality on a demo scene, but it does not provide a source benchmark dataset or paper-reported numeric evaluation table. | No numeric benchmark. No official numeric benchmark was found in the bundled public reMap materials, so the page leaves this as a deployment-validated tool rather than inventing a score. | Interactive ROS service calls are shown in the deployment notes, but no source-reported latency number is given. | Repo, Weights, Demo |

Reasoning and Planning

Reasoning tools are compared by task success, plan quality, safety validation, and action-selection evidence.

Task success / plan quality
| Rank | Tool | Task | Dataset | Metric | Speed / Runtime | Artifacts |
| --- | --- | --- | --- | --- | --- | --- |
| #1 | Language2LTL | Natural language to temporal-logic validation | AP detection, Circuit / Navigation / Office email | 98.84% (AP-detect accuracy: 98.84±0.41% / 99.03±0.53% / 100.00±0.00%) | Upstream AP detection benchmark | Paper, Repo, Demo, Benchmark |
| #2 | Scan, Materialize, Simulate | Physically grounded scene materialization | Quadrotor landing, 4 scenes x 10 starts | 100% success (Landing success rate, SMS vs visual prompting: 100% / 80% / 90% / 90% vs 50% / 50% / 60% / 50%) | Genesis optimization averages 8.2 s/iteration on RTX 4090 | Paper, Repo, Weights, Demo |
| #3 | ActPerMoMa | Active perception for mobile manipulation | Simple scenes, 500 episodes | 95.4% success (Success / abort / grasp-failure rate: 95.4% / 1.4% / 3.2%) | dtotal 3.59±1.69 m; vtotal 12.67±5.39 | Paper, Repo, Demo, Benchmark |
| #4 | PhysVLM-AVR | Active visual reasoning | CLEVR-AVR | 84.2% accuracy (Accuracy: 84.2%) | Source paper result | Paper, Repo, Weights, Demo |
| #5 | VIRF | Safety-verified task reasoning | SafeAgentBench | 0.0% HAR (HAR / GCR / Avg correction iterations: 0.0% / 77.3% / 1.1) | Source paper result | Paper, Repo, Demo, Benchmark |
| #6 | OMPL | Motion planning | OMPL ships benchmarking infrastructure and Planner Arena-style comparison workflows, but the library does not expose one canonical official benchmark number for all planners and problem classes. | No numeric benchmark. No single numeric score is copied here because OMPL performance depends on the selected planner, state space, validity checker, robot geometry, and timeout. | Deployment-specific; use OMPL benchmark logs for planning time, success rate, solution length, and path quality on the target planning problem. | Repo, Demo |

Execution and Control

Execution tools are compared by grasp success, trajectory feasibility, control stability, runtime, and monitoring quality.

Success rate / control quality
| Rank | Tool | Task | Dataset | Metric | Speed / Runtime | Artifacts |
| --- | --- | --- | --- | --- | --- | --- |
| #1 | AnyGrasp | 6-DoF grasp perception | Real bin-picking benchmark | 93.3% success (Attempt-centric success: 93.3% AnyGrasp vs 72.2% DexNet 4.0; object completion 99.8%) | 100 ms prediction, <200 ms decision time | Paper, Repo, Weights, Demo |
| #2 | TAPIR | Point tracking for visual servoing | TAP-Vid benchmark | 60.2 AJ (Average Jaccard: 60.2 / 62.9 / 88.3 / 73.3 on Kinetics / DAVIS / Kubric / RGB-Stacking) | TAPIR | Paper, Repo, Demo, Benchmark |
| #3 | monitor_dynamic_disturbance | Dynamic disturbance monitoring | TAP-Vid / RoboTAP official evaluation | 67.8 delta_avg^vis (delta_avg^vis: 67.8 / 76.9 / 78.0 / 85.0 on Kinetics / DAVIS / RoboTAP / RGB-S) | CoTracker3 offline | Paper, Repo, Demo, Benchmark |
| #4 | R3M | Post-action success verification | Franka Kitchen / MetaWorld / Adroit | 53.1% success (R3M ablation success rate: 53.1±2.7% / 69.2±2.0% / 65.0±1.7%; all domains 62.4±1.3%) | Downstream behavior cloning evaluation | Paper, Repo, Demo, Benchmark |
| #5 | Ruckig | Jerk-limited trajectory generation | 7-DoF online trajectory generation | 19.8 us (Mean / worst calculation time: 19.8±0.2 us / 123±13 us) | Intel i7-8700K, single thread | Paper, Repo, Demo, Benchmark |
| #6 | Pinocchio | Rigid-body dynamics and kinematics | 7-DoF arm to 36-DoF humanoid rigid-body derivative benchmarks | 3 us (Analytical derivative computation cost: 3 microseconds to 17 microseconds) | Pinocchio C++ implementation | Repo, Demo, Benchmark |
| #7 | Nav2 | ROS 2 navigation | Nav2 is a ROS 2 navigation framework; the official docs do not publish one canonical benchmark dataset or score for the whole stack. | No numeric benchmark. No single official source-reported numeric benchmark is used here because results depend on robot platform, planner/controller plugin, map, costmap settings, localization, and behavior tree. | Deployment-specific; measure action success rate, path length, recovery count, and controller loop timing on the target robot. | Repo, Demo |
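For deployment-specific rows such as Nav2, the quantities named above (action success rate, path length, recovery count, controller loop timing) can be aggregated from per-run logs. The sketch below assumes a hypothetical `Episode` record and `summarize` helper; neither is part of Nav2 or any other listed tool, and the sample values are illustrative only.

```python
from dataclasses import dataclass, field
from statistics import mean


@dataclass
class Episode:
    # Hypothetical per-run log record; field names are assumptions.
    succeeded: bool
    path_length_m: float
    recoveries: int
    loop_times_ms: list = field(default_factory=list)


def summarize(episodes):
    """Aggregate deployment measurements over a list of episodes."""
    if not episodes:
        raise ValueError("no episodes to summarize")
    successes = [e for e in episodes if e.succeeded]
    loop_times = [t for e in episodes for t in e.loop_times_ms]
    return {
        "success_rate": len(successes) / len(episodes),
        # Path length is averaged over successful runs only.
        "mean_path_length_m": mean(e.path_length_m for e in successes) if successes else None,
        "total_recoveries": sum(e.recoveries for e in episodes),
        "mean_loop_ms": mean(loop_times) if loop_times else None,
        "max_loop_ms": max(loop_times) if loop_times else None,
    }


# Illustrative values only, not measurements from any robot.
episodes = [
    Episode(True, 10.0, 0, [20.0, 30.0]),
    Episode(False, 4.0, 2, [40.0]),
    Episode(True, 14.0, 1, [10.0]),
]
summary = summarize(episodes)
print(summary)
```

Averaging path length over successful runs only is a design choice for this sketch: a failed run that stops early would otherwise make the planner look more efficient than it is.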