Embodied Tools

Leaderboard

Grouped ranking for real tools only. Draft templates are excluded.

Rankings are grouped by the four tool categories. The values shown are source-reported benchmark numbers, each in its tool's own metric. AP, J&F, REL, success rate, APE/RMSE, and runtime are not comparable with one another, so they are not combined into one universal score. Hotness is calculated as likes + saves from the current engagement data.
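The grouping and hotness rules above can be sketched in a few lines of Python. This is a minimal illustration, not the site's actual implementation: the `Tool` record, its field names, and the sample counts are hypothetical, and ordering each group by hotness is an assumption made only for the example.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Tool:
    # Hypothetical record; field names are illustrative only.
    name: str
    category: str
    likes: int
    saves: int
    is_template: bool = False


def hotness(tool: Tool) -> int:
    """Hotness as stated on the page: likes + saves."""
    return tool.likes + tool.saves


def grouped_ranking(tools):
    """Group real tools by category; draft templates are skipped.

    Sorting each group by hotness is an assumption for this sketch.
    Benchmark metrics are never compared across tools here.
    """
    groups = defaultdict(list)
    for tool in tools:
        if tool.is_template:
            continue
        groups[tool.category].append(tool)
    return {
        category: sorted(members, key=hotness, reverse=True)
        for category, members in groups.items()
    }


# Made-up engagement counts, for illustration only.
sample = [
    Tool("tool_a", "Perception and Grounding", likes=5, saves=2),
    Tool("tool_b", "Perception and Grounding", likes=9, saves=4),
    Tool("tool_c", "Execution and Control", likes=1, saves=1),
    Tool("draft_x", "Execution and Control", likes=50, saves=50, is_template=True),
]
ranking = grouped_ranking(sample)
```

Because each category is ranked separately, a tool's AP is never weighed against another tool's runtime or success rate.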

Perception and Grounding

Detection, depth, segmentation, video masks, and spatial grounding are ranked only within their comparable task scope.

Task-specific perception metric

Rank	Tool	Task	Dataset	Metric	Speed / Runtime	Artifacts
#1	YOLO-World	Open-vocabulary detection	COCO val2017 fine-tuning	44.9 AP (Box AP: 44.9 AP for YOLO-World-L 640)	YOLO-style real-time inference	Paper, Repo, Weights, Demo
#2	MiDaS	Monocular depth estimation	6-dataset zero-shot benchmark	0.1137 WHDR (DIW WHDR / ETH3D AbsRel / Sintel AbsRel / TUM / KITTI / NYUv2 zero-shot error: 0.1137 / 0.0659 / 0.2366 / 6.13 / 11.56 / 1.86 for MiDaS v3.1 BEiT-L-512)	345M params, 5.7 FPS on RTX 3090	Paper, Repo, Weights, Demo
#3	LINGO-Space	Language-conditioned spatial grounding	Single referring expression, 12 tabletop tasks	80.0 Success (Success score range: 80.0 to 100.0 across the 12 reported tasks)	Source paper result	Paper, Repo, Weights, Demo
#4	Feature_Squeezer	Adversarial example detection	Target model baselines	99.43% accuracy (Top-1 accuracy: MNIST 99.43%; CIFAR-10 94.84%; ImageNet MobileNet 68.36% top-1 / 88.25% top-5)	Classifier baselines used for detection evaluation	Paper, Repo, Demo, Benchmark
#5	Zero-DCE	Low-light image enhancement	Full-reference low-light enhancement comparison	16.57 PSNR (PSNR / SSIM / MAE: 16.57 / 0.59 / 98.78)	PyTorch GPU	Paper, Repo, Weights, Demo
#6	Dense Object Nets	Dense visual correspondence for manipulation	Robot-collected object correspondence, standard-SO	93% (Image-pair match precision: 93% of image pairs have normalized pixel error under 13% of the image diagonal)	Descriptor correspondence evaluation	Paper, Demo, Benchmark
#7	Cutie	Video object segmentation	DAVIS-2017 / YouTubeVOS-2019	88.8 J&F (J&F / G: DAVIS val 88.8, DAVIS test 85.3, YouTubeVOS G 86.5)	Cutie-small 45.5 FPS	Paper, Repo, Weights, Demo
#8	FastSAM	Promptable segmentation	LVIS v1	57.1 AR@1000 (BBox AR@1000 / AR_s / AR_m / AR_l: 57.1 / 44.3 / 77.1 / 85.3)	68M parameters	Paper, Repo, Weights, Demo
#9	Grounding DINO	Open-set object detection	COCO zero-shot evaluation	48.5 AP (box AP: 48.5 expected from the official evaluation script; 48.4 zero-shot / 57.2 fine-tune for GroundingDINO-T)	Swin-T checkpoint	Paper, Repo, Weights, Demo
#10	Restormer	Image restoration	Real image denoising, SIDD / DND	40.02 PSNR (PSNR / SSIM: 40.02 / 0.960 on SIDD; 40.03 / 0.956 on DND)	Task-specific denoising model	Paper, Repo, Weights, Demo
#11	DeblurGANv2	Image deblurring	GoPro test	29.55 PSNR (PSNR / SSIM: 29.55 / 0.934 for InceptionResNet-v2)	fpn_inception.h5	Paper, Repo, Weights, Demo
#12	ZoeDepth	Metric depth estimation	NYU Depth V2	0.955 delta1 (delta1 / REL / RMSE / log10: 0.955 / 0.075 / 0.270 / 0.032 for ZoeD-M12-N)	42M-345M parameters depending on backbone	Paper, Repo, Weights, Demo
#13	Depth Anything	Monocular depth estimation	NYUv2 zero-shot benchmark	0.043 AbsRel (AbsRel / delta1: 0.043 / 0.981 for Depth Anything-L; 0.046 / 0.979 for Depth Anything-B)	Large and Base encoders	Paper, Repo, Weights, Demo
#14	CLAHE_Filter	Local contrast enhancement	CLAHE is an image-processing primitive rather than a learned model; OpenCV and the original paper do not provide a single canonical cross-dataset benchmark number for this wrapper.	No numeric benchmark. No universal official numeric benchmark is copied here because contrast improvement is image-dependent and usually evaluated as part of a downstream perception pipeline.	Interactive according to the submitted spreadsheet; exact runtime depends on resolution, tile grid size, color space conversion, and CPU/GPU backend.	Paper, Repo, Demo
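The depth rows above report AbsRel and delta1. As a reminder of what those numbers mean, here is a minimal sketch using the commonly used definitions: AbsRel is the mean of |pred - gt| / gt, and delta1 is the fraction of pixels where max(pred/gt, gt/pred) is below 1.25. The function names and sample values are illustrative, and each source paper's exact evaluation protocol (masking, scale alignment, depth caps) may differ.

```python
def abs_rel(pred, gt):
    """Mean absolute relative error over paired depth values (gt > 0)."""
    return sum(abs(p - g) / g for p, g in zip(pred, gt)) / len(gt)


def delta1(pred, gt, threshold=1.25):
    """Fraction of pairs whose ratio max(p/g, g/p) is under the threshold."""
    hits = sum(1 for p, g in zip(pred, gt) if max(p / g, g / p) < threshold)
    return hits / len(gt)


# Made-up depths in metres, for illustration only.
predicted = [1.0, 2.2, 4.0]
ground_truth = [1.0, 2.0, 2.0]

rel_error = abs_rel(predicted, ground_truth)
inlier_fraction = delta1(predicted, ground_truth)
```

Lower AbsRel is better and higher delta1 is better, which is one reason the two columns cannot be folded into a single score.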

Cognition and State Modeling

State tools use map quality, reconstruction quality, trajectory accuracy, memory retrieval, and relation modeling as primary evidence.

State quality / reconstruction / memory metric

Rank	Tool	Task	Dataset	Metric	Speed / Runtime	Artifacts
#1	OctoMap	3D occupancy mapping	New College, 10 cm resolution	98.79% Accuracy (Accuracy / cross-validation: 98.79% / 98.46%)	Occupancy map evaluation	Paper, Repo, Demo, Benchmark
#2	query_historical_action_timeline	Historical action timeline query	GTEA	87.5 F1@10 (F1@10 / F1@25 / F1@50; Edit; Acc: 87.5 / 85.4 / 74.6; 81.4; 79.2)	MS-TCN with fine-tuning	Paper, Repo, Demo, Benchmark
#3	STM	Space-time visual memory	YouTube-VOS validation	79.4 Overall (Overall / seen J / seen F / unseen J / unseen F: 79.4 / 79.7 / 84.2 / 72.8 / 80.9)	Official STM evaluation	Paper, Repo, Demo, Benchmark
#4	sentence-transformers	Sentence embedding	SBERT model table, 14 sentence-embedding datasets	68.06 (all-MiniLM-L6-v2 sentence performance: 68.06)	14,200 sentences/sec on V100	Paper, Repo, Weights, Demo
#5	Action Genome	Spatio-temporal scene graph state modeling	Action Genome / Charades few-shot action recognition	42.7% mAP (mAP with 10 examples: 42.7%)	Few-shot action recognition experiment	Paper, Repo, Demo, Benchmark
#6	Hydra	3D scene graph construction	SidPac Floor 3-4	75.3 ms (Component timing: Objects 75.3±37.0 ms, places 4.2±2.1 ms, rooms 15.0±14.6 ms)	Online graph construction	Paper, Repo, Weights, Demo
#7	retrieve_past_visual_state_faiss	Visual memory retrieval	Billion-scale similarity search	8.5x (Nearest-neighbor search implementation speedup: 8.5x faster than the previous reported state of the art)	GPU nearest-neighbor search implementation	Paper, Repo, Demo, Benchmark
#8	DUSt3R	Geometric 3D reconstruction	DTU zero-shot MVS	2.677 mm Accuracy (Accuracy / completeness / overall: 2.677 mm / 0.805 mm / 1.741 mm)	Multi-view global alignment	Paper, Repo, Weights, Demo
#9	R3LIVE	RGB-colored LIV mapping	HKUST campus loops	0.093 m drift (Loop drift: 0.093 m, 0.154 m, 0.164 m, 0.102 m over 1.19-1.52 km trajectories)	Real-time mapping pipeline	Paper, Repo, Weights, Demo
#10	query_3d_scene_graph	Queryable 3D scene memory	3RScan / 3DSSG, full scene with GT instances	0.70 Object R@5 (Object / predicate recall: Object R@5 0.70, R@10 0.80; predicate R@3 0.97, R@5 0.99)	Upstream graph prediction, not local wrapper timing	Paper, Repo, Demo, Benchmark
#11	FAST-LIVO2	LiDAR-inertial-visual odometry	Airborne mapping public sequences	0.64 m APE (APE RMSE: 0.64 m / 0.27 m vs R3LIVE 2.76 m / 0.52 m)	17.13 ms LiDAR + 12.90 ms image average	Paper, Repo, Weights, Demo
#12	reMap	Queryable semantic mapping	The bundled deployment README verifies service-level functionality on a demo scene, but it does not provide a source benchmark dataset or paper-reported numeric evaluation table.	No numeric benchmark. No official numeric benchmark was found in the bundled public reMap materials, so the page leaves this as a deployment-validated tool rather than inventing a score.	Interactive ROS service calls are shown in the deployment notes, but no source-reported latency number is given.	Repo, Weights, Demo
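Trajectory accuracy in this table is reported as APE RMSE for FAST-LIVO2. A minimal sketch of a translation-only APE RMSE follows, assuming the estimated and reference trajectories are already time-associated and expressed in the same frame. The function name and sample positions are illustrative; the source papers may use a different alignment step or evaluation tool.

```python
import math


def ape_rmse(estimated, reference):
    """Root-mean-square of per-pose position errors, in the input unit.

    Assumes both lists hold (x, y, z) positions that are already
    time-associated and aligned to a common frame.
    """
    if not estimated or len(estimated) != len(reference):
        raise ValueError("trajectories must be non-empty and equal length")
    squared_errors = [
        sum((e - r) ** 2 for e, r in zip(est_pos, ref_pos))
        for est_pos, ref_pos in zip(estimated, reference)
    ]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Made-up positions in metres: errors of 0 m and 5 m.
estimated_positions = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
reference_positions = [(0.0, 0.0, 0.0), (1.0, 3.0, 4.0)]

trajectory_error = ape_rmse(estimated_positions, reference_positions)
```

Because RMSE squares each error before averaging, a few large deviations dominate the result, so it is not directly comparable to the end-of-loop drift reported for R3LIVE.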

Reasoning and Planning

Reasoning tools are compared by task success, plan quality, safety validation, and action-selection evidence.

Task success / plan quality

Rank	Tool	Task	Dataset	Metric	Speed / Runtime	Artifacts
#1	Language2LTL	Natural language to temporal-logic validation	AP detection, Circuit / Navigation / Office email	98.84% AP-detect accuracy (AP-detect accuracy: 98.84±0.41% / 99.03±0.53% / 100.00±0.00%)	Upstream AP detection benchmark	Paper, Repo, Demo, Benchmark
#2	Scan, Materialize, Simulate	Physically grounded scene materialization	Quadrotor landing, 4 scenes x 10 starts	100% success (Landing success rate, SMS vs visual prompting: 100% / 80% / 90% / 90% vs 50% / 50% / 60% / 50%)	Genesis optimization averages 8.2 s/iteration on RTX 4090	Paper, Repo, Weights, Demo
#3	ActPerMoMa	Active perception for mobile manipulation	Simple scenes, 500 episodes	95.4% Success (Success / abort / grasp-failure rate: 95.4% / 1.4% / 3.2%)	dtotal 3.59±1.69 m; vtotal 12.67±5.39	Paper, Repo, Demo, Benchmark
#4	PhysVLM-AVR	Active visual reasoning	CLEVR-AVR	84.2% Accuracy (Accuracy: 84.2%)	Source paper result	Paper, Repo, Weights, Demo
#5	VIRF	Safety-verified task reasoning	SafeAgentBench	0.0% HAR (HAR / GCR / Avg correction iterations: 0.0% / 77.3% / 1.1)	Source paper result	Paper, Repo, Demo, Benchmark
#6	OMPL	Motion planning	OMPL ships benchmarking infrastructure and Planner Arena-style comparison workflows, but the library does not expose one canonical official benchmark number for all planners and problem classes.	No numeric benchmark. No single numeric score is copied here because OMPL performance depends on the selected planner, state space, validity checker, robot geometry, and timeout.	Deployment-specific; use OMPL benchmark logs for planning time, success rate, solution length, and path quality on the target planning problem.	Repo, Demo

Execution and Control

Execution tools are compared by grasp success, trajectory feasibility, control stability, runtime, and monitoring quality.

Success rate / control quality

Rank	Tool	Task	Dataset	Metric	Speed / Runtime	Artifacts
#1	AnyGrasp	6-DoF grasp perception	Real bin-picking benchmark	93.3% success (Attempt-centric success: 93.3% AnyGrasp vs 72.2% DexNet 4.0; object completion 99.8%)	100 ms prediction, <200 ms decision time	Paper, Repo, Weights, Demo
#2	TAPIR	Point tracking for visual servoing	TAP-Vid benchmark	60.2 AJ (Average Jaccard (AJ): 60.2 / 62.9 / 88.3 / 73.3 on Kinetics / DAVIS / Kubric / RGB-Stacking)	TAPIR	Paper, Repo, Demo, Benchmark
#3	monitor_dynamic_disturbance	Dynamic disturbance monitoring	TAP-Vid / RoboTAP official evaluation	67.8 delta (delta_avg^vis: 67.8 / 76.9 / 78.0 / 85.0 on Kinetics / DAVIS / RoboTAP / RGB-Stacking)	CoTracker3 offline	Paper, Repo, Demo, Benchmark
#4	R3M	Post-action success verification	Franka Kitchen / MetaWorld / Adroit	53.1% success (R3M ablation success rate: 53.1±2.7% / 69.2±2.0% / 65.0±1.7%; all domains 62.4±1.3%)	Downstream behavior cloning evaluation	Paper, Repo, Demo, Benchmark
#5	Ruckig	Jerk-limited trajectory generation	7-DoF online trajectory generation	19.8 µs (Mean / worst calculation time: 19.8±0.2 µs / 123±13 µs)	Intel i7-8700K, single thread	Paper, Repo, Demo, Benchmark
#6	Pinocchio	Rigid-body dynamics and kinematics	7-DoF arm to 36-DoF humanoid rigid-body derivative benchmarks	3 µs (Analytical derivative computation cost: 3 µs to 17 µs)	Pinocchio C++ implementation	Repo, Demo, Benchmark
#7	Nav2	ROS 2 navigation	Nav2 is a ROS 2 navigation framework; the official docs do not publish one canonical benchmark dataset or score for the whole stack.	No numeric benchmark. No single official source-reported numeric benchmark is used here because results depend on robot platform, planner/controller plugin, map, costmap settings, localization, and behavior tree.	Deployment-specific; measure action success rate, path length, recovery count, and controller loop timing on the target robot.	Repo, Demo
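For deployment-specific rows such as Nav2, the suggested measurements can be aggregated from per-episode logs. The sketch below assumes a hypothetical `Episode` record filled in by your own logging code; none of these names come from the Nav2 API, and the sample values are made up.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Episode:
    # Hypothetical log record; populate it from your own robot logs.
    succeeded: bool
    path_length_m: float
    recovery_count: int
    loop_times_ms: list


def summarize(episodes):
    """Aggregate success rate, path length, recoveries, and loop timing."""
    if not episodes:
        raise ValueError("at least one episode is required")
    successes = [e for e in episodes if e.succeeded]
    loop_times = [t for e in episodes for t in e.loop_times_ms]
    return {
        "success_rate": len(successes) / len(episodes),
        # Path length is averaged over successful episodes only.
        "mean_path_length_m": (
            mean(e.path_length_m for e in successes) if successes else None
        ),
        "total_recoveries": sum(e.recovery_count for e in episodes),
        "mean_loop_ms": mean(loop_times) if loop_times else None,
        "max_loop_ms": max(loop_times) if loop_times else None,
    }


# Made-up episodes, for illustration only.
runs = [
    Episode(True, 10.0, 0, [50.0, 52.0]),
    Episode(False, 4.0, 2, [70.0]),
    Episode(True, 14.0, 1, [48.0]),
]
summary = summarize(runs)
```

Reporting these numbers alongside the robot platform, plugin selection, and map keeps the results interpretable, since the page notes that all of these affect the outcome.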