Leaderboard
Grouped ranking for real tools only. Draft templates are excluded.
Tools are ranked within each of the four tool categories. The values shown are source-reported benchmark numbers, so metrics such as AP, J&F, REL, success rate, APE/RMSE, and runtime are never combined into one universal score. Hotness is calculated as likes + saves from the current engagement data.
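The grouping and hotness rules above can be sketched in a few lines of Python. This is a minimal illustration, not the page's actual implementation: the `Entry` fields, the `grouped_ranking` helper, the tool names, and the engagement counts are all hypothetical. Ordering each group by hotness is also an assumption, since the page does not state the ordering criterion.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    # Hypothetical record layout; field names are illustrative only.
    name: str
    category: str
    likes: int
    saves: int
    is_template: bool = False

    @property
    def hotness(self) -> int:
        # Hotness as described on this page: likes + saves.
        return self.likes + self.saves


def grouped_ranking(entries):
    """Group real tools by category and drop templates.

    Sorting each group by descending hotness is an assumption made
    for this sketch; ties are broken by name for determinism.
    """
    groups = defaultdict(list)
    for entry in entries:
        if entry.is_template:
            continue  # templates are excluded from the ranking
        groups[entry.category].append(entry)
    return {
        category: sorted(items, key=lambda e: (-e.hotness, e.name))
        for category, items in groups.items()
    }


# Made-up engagement data, for illustration only.
entries = [
    Entry("tool_a", "Perception and Grounding", likes=12, saves=5),
    Entry("tool_b", "Perception and Grounding", likes=20, saves=1),
    Entry("tool_c", "Execution and Control", likes=3, saves=3),
    Entry("draft_x", "Execution and Control", likes=50, saves=50, is_template=True),
]

ranking = grouped_ranking(entries)
for category, tools in ranking.items():
    for rank, tool in enumerate(tools, start=1):
        print(f"{category} #{rank} {tool.name} hotness={tool.hotness}")
```

Benchmark metrics are deliberately absent from the sort key: values such as AP and AbsRel are not comparable, so the sketch only groups entries and never merges their scores.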
Perception and Grounding
Detection, depth, segmentation, video masks, and spatial grounding are ranked only within their comparable task scope.
| Rank | Tool | Task | Dataset | Metric | Speed / Runtime | Artifacts |
|---|---|---|---|---|---|---|
| #1 | YOLO-World | Open-vocabulary detection | COCO val2017 fine-tuning | 44.9 AP - Box AP: 44.9 AP for YOLO-World-L 640 | YOLO-style real-time inference | Paper, Repo, Weights, Demo |
| #2 | MiDaS | Monocular depth estimation | 6-dataset zero-shot benchmark | 0.1137 WHDR - DIW WHDR / ETH3D AbsRel / Sintel AbsRel / TUM / KITTI / NYUv2 zero-shot error: 0.1137 / 0.0659 / 0.2366 / 6.13 / 11.56 / 1.86 for MiDaS v3.1 BEiT-L-512 | 345M params, 5.7 FPS on RTX 3090 | Paper, Repo, Weights, Demo |
| #3 | LINGO-Space | Language-conditioned spatial grounding | Single referring expression, 12 tabletop tasks | 80.0 Success - Success score range: 80.0 to 100.0 across the 12 reported tasks | Source paper result | Paper, Repo, Weights, Demo |
| #4 | Feature_Squeezer | Adversarial example detection | Target model baselines | 99.43 accuracy - Top-1 accuracy: MNIST 99.43%; CIFAR-10 94.84%; ImageNet MobileNet 68.36% top-1 / 88.25% top-5 | Classifier baselines used for detection evaluation | Paper, Repo, Demo, Benchmark |
| #5 | Zero-DCE | Low-light image enhancement | Full-reference low-light enhancement comparison | 16.57 PSNR - PSNR / SSIM / MAE: 16.57 / 0.59 / 98.78 | PyTorch GPU | Paper, Repo, Weights, Demo |
| #6 | Dense Object Nets | Dense visual correspondence for manipulation | Robot-collected object correspondence, standard-SO | 93 - Image-pair match precision: 93% of image pairs have normalized pixel error under 13% of the image diagonal | Descriptor correspondence evaluation | Paper, Demo, Benchmark |
| #7 | Cutie | Video object segmentation | DAVIS-2017 / YouTubeVOS-2019 | 88.8 J&F - J&F / G: DAVIS val 88.8, DAVIS test 85.3, YouTubeVOS G 86.5 | Cutie-small 45.5 FPS | Paper, Repo, Weights, Demo |
| #8 | FastSAM | Promptable segmentation | LVIS v1 | 57.1 AR@1000 - BBox AR@1000 / AR_s / AR_m / AR_l: 57.1 / 44.3 / 77.1 / 85.3 | 68M parameters | Paper, Repo, Weights, Demo |
| #9 | Grounding DINO | Open-set object detection | COCO zero-shot evaluation | 48.5 AP - box AP: 48.5 expected from the official evaluation script; 48.4 zero-shot / 57.2 fine-tune for GroundingDINO-T | Swin-T checkpoint | Paper, Repo, Weights, Demo |
| #10 | Restormer | Image restoration | Real image denoising, SIDD / DND | 40.02 PSNR - PSNR / SSIM: 40.02 / 0.960 on SIDD; 40.03 / 0.956 on DND | Task-specific denoising model | Paper, Repo, Weights, Demo |
| #11 | DeblurGANv2 | Image deblurring | GoPro test | 29.55 PSNR - PSNR / SSIM: 29.55 / 0.934 for InceptionResNet-v2 | fpn_inception.h5 | Paper, Repo, Weights, Demo |
| #12 | ZoeDepth | Metric depth estimation | NYU Depth V2 | 0.955 delta1 - delta1 / REL / RMSE / log10: 0.955 / 0.075 / 0.270 / 0.032 for ZoeD-M12-N | 42M-345M parameters depending on backbone | Paper, Repo, Weights, Demo |
| #13 | Depth Anything | Monocular depth estimation | NYUv2 zero-shot benchmark | 0.043 AbsRel - AbsRel / delta1: 0.043 / 0.981 for Depth Anything-L; 0.046 / 0.979 for Depth Anything-B | Large and Base encoders | Paper, Repo, Weights, Demo |
| #14 | CLAHE_Filter | Local contrast enhancement | CLAHE is an image-processing primitive rather than a learned model; OpenCV and the original paper do not provide a single canonical cross-dataset benchmark number for this wrapper. | No numeric benchmark - No universal official numeric benchmark is copied here because contrast improvement is image-dependent and usually evaluated as part of a downstream perception pipeline. | Interactive according to the submitted spreadsheet; exact runtime depends on resolution, tile grid size, color space conversion, and CPU/GPU backend. | Paper, Repo, Demo |
Cognition and State Modeling
State tools use map quality, reconstruction quality, trajectory accuracy, memory retrieval, and relation modeling as primary evidence.
| Rank | Tool | Task | Dataset | Metric | Speed / Runtime | Artifacts |
|---|---|---|---|---|---|---|
| #1 | OctoMap | 3D occupancy mapping | New College, 10 cm resolution | 98.79 Accuracy - Accuracy / cross-validation: 98.79% / 98.46% | Occupancy map evaluation | Paper, Repo, Demo, Benchmark |
| #2 | query_historical_action_timeline | Historical action timeline query | GTEA | 87.5 F1@10 - F1@10 / F1@25 / F1@50; Edit; Acc: 87.5 / 85.4 / 74.6; 81.4; 79.2 | MS-TCN with fine-tuning | Paper, Repo, Demo, Benchmark |
| #3 | STM | Space-time visual memory | YouTube-VOS validation | 79.4 - Overall / seen J / seen F / unseen J / unseen F: 79.4 / 79.7 / 84.2 / 72.8 / 80.9 | Official STM evaluation | Paper, Repo, Demo, Benchmark |
| #4 | sentence-transformers | Sentence embedding | SBERT model table, 14 sentence-embedding datasets | 68.06 - all-MiniLM-L6-v2 sentence performance: 68.06 | 14,200 sentences/sec on V100 | Paper, Repo, Weights, Demo |
| #5 | Action Genome | Spatio-temporal scene graph state modeling | Action Genome / Charades few-shot action recognition | 42.7 mAP - mAP with 10 examples: 42.7% | Few-shot action recognition experiment | Paper, Repo, Demo, Benchmark |
| #6 | Hydra | 3D scene graph construction | SidPac Floor 3-4 | 75.3 - Component timing: Objects 75.3+/-37.0 ms, places 4.2+/-2.1 ms, rooms 15.0+/-14.6 ms | Online graph construction | Paper, Repo, Weights, Demo |
| #7 | retrieve_past_visual_state_faiss | Visual memory retrieval | Billion-scale similarity search | 8.5 - Nearest-neighbor search implementation speedup: 8.5x faster than the previous reported state of the art | GPU nearest-neighbor search implementation | Paper, Repo, Demo, Benchmark |
| #8 | DUSt3R | Geometric 3D reconstruction | DTU zero-shot MVS | 2.677 Accuracy - Accuracy / completeness / overall: 2.677 mm / 0.805 mm / 1.741 mm | Multi-view global alignment | Paper, Repo, Weights, Demo |
| #9 | R3LIVE | RGB-colored LIV mapping | HKUST campus loops | 0.093 drift - Loop drift: 0.093 m, 0.154 m, 0.164 m, 0.102 m over 1.19-1.52 km trajectories | Real-time mapping pipeline | Paper, Repo, Weights, Demo |
| #10 | query_3d_scene_graph | Queryable 3D scene memory | 3RScan / 3DSSG, full scene with GT instances | 0.70 - Object / predicate recall: Object R@5 0.70, R@10 0.80; predicate R@3 0.97, R@5 0.99 | Upstream graph prediction, not local wrapper timing | Paper, Repo, Demo, Benchmark |
| #11 | FAST-LIVO2 | LiDAR-inertial-visual odometry | Airborne mapping public sequences | 0.64 APE - APE RMSE: 0.64 m / 0.27 m vs R3LIVE 2.76 m / 0.52 m | 17.13 ms LiDAR + 12.90 ms image average | Paper, Repo, Weights, Demo |
| #12 | reMap | Queryable semantic mapping | The bundled deployment README verifies service-level functionality on a demo scene, but it does not provide a source benchmark dataset or paper-reported numeric evaluation table. | No numeric benchmark - No official numeric benchmark was found in the bundled public reMap materials, so the page leaves this as a deployment-validated tool rather than inventing a score. | Interactive ROS service calls are shown in the deployment notes, but no source-reported latency number is given. | Repo, Weights, Demo |
Reasoning and Planning
Reasoning tools are compared by task success, plan quality, safety validation, and action-selection evidence.
| Rank | Tool | Task | Dataset | Metric | Speed / Runtime | Artifacts |
|---|---|---|---|---|---|---|
| #1 | Language2LTL | Natural language to temporal-logic validation | AP detection, Circuit / Navigation / Office email | 98.84 AP - AP-detect accuracy: 98.84+/-0.41% / 99.03+/-0.53% / 100.00+/-0.00% | Upstream AP detection benchmark | Paper, Repo, Demo, Benchmark |
| #2 | Scan, Materialize, Simulate | Physically grounded scene materialization | Quadrotor landing, 4 scenes x 10 starts | 100 success - Landing success rate, SMS vs visual prompting: 100% / 80% / 90% / 90% vs 50% / 50% / 60% / 50% | Genesis optimization averages 8.2 s/iteration on RTX 4090 | Paper, Repo, Weights, Demo |
| #3 | ActPerMoMa | Active perception for mobile manipulation | Simple scenes, 500 episodes | 95.4 Success - Success / abort / grasp-failure rate: 95.4% / 1.4% / 3.2% | dtotal 3.59+/-1.69 m; vtotal 12.67+/-5.39 | Paper, Repo, Demo, Benchmark |
| #4 | PhysVLM-AVR | Active visual reasoning | CLEVR-AVR | 84.2 Accuracy - Accuracy: 84.2% | Source paper result | Paper, Repo, Weights, Demo |
| #5 | VIRF | Safety-verified task reasoning | SafeAgentBench | 0.0 - HAR / GCR / Avg correction iterations: 0.0% / 77.3% / 1.1 | Source paper result | Paper, Repo, Demo, Benchmark |
| #6 | OMPL | Motion planning | OMPL ships benchmarking infrastructure and Planner Arena-style comparison workflows, but the library does not expose one canonical official benchmark number for all planners and problem classes. | No numeric benchmark - No single numeric score is copied here because OMPL performance depends on the selected planner, state space, validity checker, robot geometry, and timeout. | Deployment-specific; use OMPL benchmark logs for planning time, success rate, solution length, and path quality on the target planning problem. | Repo, Demo |
Execution and Control
Execution tools are compared by grasp success, trajectory feasibility, control stability, runtime, and monitoring quality.
| Rank | Tool | Task | Dataset | Metric | Speed / Runtime | Artifacts |
|---|---|---|---|---|---|---|
| #1 | AnyGrasp | 6-DoF grasp perception | Real bin-picking benchmark | 93.3 success - Attempt-centric success: 93.3% AnyGrasp vs 72.2% DexNet 4.0; object completion 99.8% | 100 ms prediction, <200 ms decision time | Paper, Repo, Weights, Demo |
| #2 | TAPIR | Point tracking for visual servoing | TAP-Vid benchmark | 60.2 AJ - Average Jaccard (AJ): 60.2 / 62.9 / 88.3 / 73.3 on Kinetics / DAVIS / Kubric / RGB-Stacking | TAPIR | Paper, Repo, Demo, Benchmark |
| #3 | monitor_dynamic_disturbance | Dynamic disturbance monitoring | TAP-Vid / RoboTAP official evaluation | 67.8 delta - delta_avg^vis: 67.8 / 76.9 / 78.0 / 85.0 on Kinetics / DAVIS / RoboTAP / RGB-S | CoTracker3 offline | Paper, Repo, Demo, Benchmark |
| #4 | R3M | Post-action success verification | Franka Kitchen / MetaWorld / Adroit | 53.1 success - R3M ablation success rate: 53.1+/-2.7% / 69.2+/-2.0% / 65.0+/-1.7%; all domains 62.4+/-1.3% | Downstream behavior cloning evaluation | Paper, Repo, Demo, Benchmark |
| #5 | Ruckig | Jerk-limited trajectory generation | 7-DoF online trajectory generation | 19.8 time - Mean / worst calculation time: 19.8+/-0.2 us / 123+/-13 us | Intel i7-8700K, single thread | Paper, Repo, Demo, Benchmark |
| #6 | Pinocchio | Rigid-body dynamics and kinematics | 7-DoF arm to 36-DoF humanoid rigid-body derivative benchmarks | 3 - Analytical derivative computation cost: 3 microseconds to 17 microseconds | Pinocchio C++ implementation | Repo, Demo, Benchmark |
| #7 | Nav2 | ROS 2 navigation | Nav2 is a ROS 2 navigation framework; the official docs do not publish one canonical benchmark dataset or score for the whole stack. | No numeric benchmark - No single official source-reported numeric benchmark is used here because results depend on robot platform, planner/controller plugin, map, costmap settings, localization, and behavior tree. | Deployment-specific; measure action success rate, path length, recovery count, and controller loop timing on the target robot. | Repo, Demo |