Short Explanation
Use PhysVLM-AVR when the agent must reason over an evolving scene instead of answering from a single static, fully observed image.
PhysVLM-AVR is a multimodal reasoning model for partially observable environments that plans actions, integrates observations over time, and answers questions about physical scenes.
Core parameters, trigger timing, and visual before/after demo references.
Prepare the inputs the original project expects: one or more scene images and a text question about the scene or interaction outcome.
Read the documented output: the answer, its reasoning text, the confidence score, and the execution mode.
A quick-run style example for the documentation page.
Readable controls and the meaning of each returned artifact.
| Parameter | Type | Default | Description |
|---|---|---|---|
| image_paths | path | | One or more scene images passed to the inference server. |
| query | text | | Question about the scene or interaction outcome. |
| device | select | cpu | Local inference device used by the server entrypoint. |
| checkpoint_dir | path | tools/physvlm_avr/repo/physvlm-avr/checkpoints/physvlm-qwen2-3B-avr-stage3-avr-core-v3 | Repository-relative checkpoint path required for real model inference. |
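
The four parameters above can be bundled into one request object before they are handed to the server. The following is a minimal sketch, not the project's client: the `InferenceRequest` class, its `to_payload` method, and the flat dictionary layout are hypothetical, and only the parameter names and defaults come from the table.

```python
from dataclasses import dataclass, asdict
from typing import Any, Dict, List

# Default checkpoint path as listed in the parameter table.
DEFAULT_CHECKPOINT_DIR = (
    "tools/physvlm_avr/repo/physvlm-avr/checkpoints/"
    "physvlm-qwen2-3B-avr-stage3-avr-core-v3"
)


@dataclass
class InferenceRequest:
    """Hypothetical container for the documented input parameters."""

    image_paths: List[str]
    query: str
    device: str = "cpu"
    checkpoint_dir: str = DEFAULT_CHECKPOINT_DIR

    def to_payload(self) -> Dict[str, Any]:
        """Validate the required inputs and return a plain dictionary.

        The flat key layout is an assumption; the real server may expect
        a different request schema.
        """
        if not self.image_paths:
            raise ValueError("image_paths needs at least one scene image")
        if not self.query.strip():
            raise ValueError("query must be a non-empty question")
        return asdict(self)
```

A call such as `InferenceRequest(["scene_0.png"], "Is the cube behind the box?").to_payload()` keeps the `cpu` default and the repository-relative checkpoint path unless they are overridden.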
| Field | Description |
|---|---|
| answer | Natural-language answer returned by the model or fallback path. |
| reasoning | Reasoning text describing how the answer was reached. |
| confidence | Confidence score returned by the server. |
| mode | Execution mode such as real-model inference or mock fallback. |
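
The returned fields can be read into a typed record so that a mock-fallback response is not mistaken for real model output. This is a sketch under stated assumptions: `InferenceResult`, `parse_result`, and `is_fallback` are hypothetical helpers, and the `"mock"` label is a guess at how the server names its fallback mode, so check the actual `mode` strings before relying on it.

```python
from dataclasses import dataclass
from typing import Any, Mapping

REQUIRED_FIELDS = ("answer", "reasoning", "confidence", "mode")


@dataclass(frozen=True)
class InferenceResult:
    """Hypothetical record holding the four documented output fields."""

    answer: str
    reasoning: str
    confidence: float
    mode: str


def parse_result(payload: Mapping[str, Any]) -> InferenceResult:
    """Build an InferenceResult, failing loudly if a field is absent."""
    missing = [name for name in REQUIRED_FIELDS if name not in payload]
    if missing:
        raise KeyError(f"response is missing fields: {missing}")
    return InferenceResult(
        answer=str(payload["answer"]),
        reasoning=str(payload["reasoning"]),
        confidence=float(payload["confidence"]),
        mode=str(payload["mode"]),
    )


def is_fallback(result: InferenceResult, mock_label: str = "mock") -> bool:
    """Return True when the mode string contains the assumed mock label."""
    return mock_label in result.mode.lower()
```

Checking `is_fallback` before using `confidence` is a deliberate choice here: a score produced by the fallback path says nothing about the real model.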
Official resources, deployment steps, academic context, citation, and source-reported benchmark numbers.
# Relative-path local entry for the PhysVLM-AVR deployment
cd tools/physvlm_avr
./run_example.sh

# Direct Python entry:
conda run -n physvlm-avr python run_physvlm_example.py

# Server-only entry:
cd tools/physvlm_avr/repo/physvlm-avr
conda run -n physvlm-avr env PHYSVLM_PORT=8000 PHYSVLM_DEVICE=cpu python start_physvlm_server.py
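
The server-only entry can also be launched from Python, which makes the port and device easy to vary in scripts. The sketch below only reassembles the documented command; `server_command` and `start_server` are hypothetical helper names, and the working directory is assumed to be resolved relative to the repository root.

```python
import subprocess
from typing import List

ENV_NAME = "physvlm-avr"
# Directory of the server-only entry, relative to the repository root.
SERVER_DIR = "tools/physvlm_avr/repo/physvlm-avr"


def server_command(port: int = 8000, device: str = "cpu") -> List[str]:
    """Return the documented server launch command as an argument list."""
    return [
        "conda", "run", "-n", ENV_NAME,
        "env", f"PHYSVLM_PORT={port}", f"PHYSVLM_DEVICE={device}",
        "python", "start_physvlm_server.py",
    ]


def start_server(port: int = 8000, device: str = "cpu") -> subprocess.Popen:
    """Start the server process; cwd replaces the documented `cd` step."""
    return subprocess.Popen(server_command(port, device), cwd=SERVER_DIR)
```

Passing the command as a list avoids shell quoting issues, and `start_server` returns the process handle so the caller can stop the server with `terminate()` when finished.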
{
  "tool": "physvlm_avr",
  "status": "ok",
  "results": [
    {
      "label": "Active visual reasoning",
      "score": 0.87,
      "output": "Answer, reasoning trace, confidence, server response JSON"
    }
  ],
  "timing": {
    "runtime": "The local deployment README does not provide a source-reported latency number; the bundled example runs through a FastAPI server and can fall back to mock mode on CPU.",
    "device": "documented in source benchmark when available"
  },
  "artifacts": {
    "visualization": "tools/physvlm_avr/runs/visualization.png",
    "raw_predictions": "tools/physvlm_avr/runs/predictions.json"
  }
}
Paper identity and contribution summary.
@misc{physvlm_avr2025,
  title={PhysVLM-AVR},
  author={Author},
  year={2025},
  note={OpenReview / arXiv preprint, 2025},
  url={https://openreview.net/forum?id=kUN2R6X4jS}
}
Only compact, source-reported numbers are shown here.
| Dataset | Metric | Value | Runtime | Source |
|---|---|---|---|---|
| CLEVR-AVR | Accuracy | 84.2% | Source paper result | OpenReview paper |
| RoboVQA | Accuracy | 78.0% | Source paper result | OpenReview paper |
Official repository README, OpenReview paper page, deployment scripts, example input/output JSON, generated scene images, and server logs.
Visual references from the original tool.