Reasoning and Planning

PhysVLM-AVR

PhysVLM-AVR is a multimodal reasoning model for partially observable environments that plans actions, integrates observations over time, and answers questions about physical scenes.

Tool Introduction

Core parameters, trigger timing, and visual before/after demo references.

Short Explanation

Use PhysVLM-AVR when the agent must reason over an evolving scene instead of answering from one static fully observed image.

Input: Scene image(s) + question or interaction state

Output: Answer, reasoning trace, confidence, server response JSON

Trigger Timing: Triggered on demand after the required input files and configuration are prepared.

Runtime: Python / FastAPI / multimodal transformer server

Before: Scene image(s) + question or interaction state

Prepare the scene image(s), the question or interaction state, and the configuration expected by the original project.

After: Answer, reasoning trace, confidence, server response JSON

Read the answer, reasoning trace, and confidence from the server response JSON, along with any exported image artifacts.

Preset Example

A quick-run style example for the documentation page.

Input: tools/physvlm_avr/example/initial_environment.png

Prompt: What is the color of the object in front of you?

Expected: A structured answer JSON with the answer text, reasoning, confidence, and health or fallback mode metadata.

Parameters And Output

Readable controls and the meaning of each returned artifact.

Parameter Explanation

image_paths (path)

One or more scene images passed to the inference server.

query (text)

Question about the scene or interaction outcome.

device (select; default: cpu)

Local inference device used by the server entrypoint.

checkpoint_dir (path; default: tools/physvlm_avr/repo/physvlm-avr/checkpoints/physvlm-qwen2-3B-avr-stage3-avr-core-v3)

Repository-relative checkpoint path required for real model inference.
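
The four parameters above can be collected into one request object before it is handed to the server. The following is a minimal sketch rather than the project's actual API: the `InferenceRequest` class and the layout returned by `to_payload` are hypothetical, and only the parameter names and default values are taken from this page.

```python
from dataclasses import dataclass

# Default checkpoint path as listed in the parameter explanation above.
DEFAULT_CHECKPOINT = (
    "tools/physvlm_avr/repo/physvlm-avr/checkpoints/"
    "physvlm-qwen2-3B-avr-stage3-avr-core-v3"
)


@dataclass
class InferenceRequest:
    """Hypothetical container for the documented parameters."""

    image_paths: list
    query: str
    device: str = "cpu"
    checkpoint_dir: str = DEFAULT_CHECKPOINT

    def validate(self) -> None:
        # At least one scene image and a non-empty question are required.
        if not self.image_paths:
            raise ValueError("image_paths must contain at least one image")
        if not self.query.strip():
            raise ValueError("query must be a non-empty question")

    def to_payload(self) -> dict:
        # Assumed JSON layout; the real server schema may differ.
        self.validate()
        return {
            "image_paths": list(self.image_paths),
            "query": self.query.strip(),
            "device": self.device,
            "checkpoint_dir": self.checkpoint_dir,
        }


request = InferenceRequest(
    image_paths=["tools/physvlm_avr/example/initial_environment.png"],
    query="What is the color of the object in front of you?",
)
payload = request.to_payload()
```

Validating before building the payload keeps malformed requests from reaching the server, which matters most when inference runs on CPU and each call is slow.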

Output Explanation

answer

Natural-language answer returned by the model or fallback path.

reasoning

Reasoning text describing how the answer was reached.

confidence

Confidence score returned by the server.

mode

Execution mode such as real-model inference or mock fallback.
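
A client can read the four output fields above into a typed record. This sketch assumes the fields arrive as top-level keys of a JSON object, which this page does not confirm; the `PhysVLMAnswer` class, the `parse_answer` helper, and the sample values (including the `"mock"` mode string) are illustrative only.

```python
import json
from dataclasses import dataclass

FIELDS = ("answer", "reasoning", "confidence", "mode")


@dataclass
class PhysVLMAnswer:
    """Hypothetical typed view of the documented output fields."""

    answer: str
    reasoning: str
    confidence: float
    mode: str


def parse_answer(raw: str) -> PhysVLMAnswer:
    """Parse a JSON string and fail loudly if a documented field is absent."""
    data = json.loads(raw)
    missing = [name for name in FIELDS if name not in data]
    if missing:
        raise KeyError(f"response is missing fields: {missing}")
    return PhysVLMAnswer(
        answer=str(data["answer"]),
        reasoning=str(data["reasoning"]),
        confidence=float(data["confidence"]),
        mode=str(data["mode"]),
    )


# Made-up sample response, not real model output.
sample = json.dumps(
    {
        "answer": "red",
        "reasoning": "placeholder reasoning text",
        "confidence": 0.5,
        "mode": "mock",
    }
)
parsed = parse_answer(sample)
```

Checking `parsed.mode` before trusting `parsed.answer` lets downstream code tell real-model output apart from fallback output.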

How To Use

Official resources, deployment steps, academic context, citation, and source-reported benchmark numbers.

Resources

GitHub: https://github.com/jetteezhou/PhysVLM-AVR
Code Download: https://github.com/jetteezhou/PhysVLM-AVR/archive/refs/heads/main.zip
Paper: https://openreview.net/forum?id=kUN2R6X4jS

Deployment Notes

Create or refresh the `physvlm-avr` Conda environment with the setup script, then install the runtime packages listed in the deployment README.
Place the expected PhysVLM checkpoint under the repository-relative `checkpoints/physvlm-qwen2-3B-avr-stage3-avr-core-v3` folder if real inference is required.
Run `run_example.sh` to generate a scene, start the server, send the example question, and export JSON plus image artifacts to `tools/physvlm_avr/example/`.
If the checkpoint is absent, the deployment falls back to deterministic mock inference while preserving the same API surface.
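
The fallback behavior described in these notes can be sketched as a check on the checkpoint folder followed by a dispatch. This is an illustration under stated assumptions, not the deployment's code: the function names, the `"real"` and `"mock"` labels, and the placeholder mock output are all hypothetical.

```python
from pathlib import Path


def select_mode(checkpoint_dir: str) -> str:
    """Return "real" if the checkpoint folder exists and is non-empty, else "mock"."""
    path = Path(checkpoint_dir)
    has_files = path.is_dir() and any(path.iterdir())
    return "real" if has_files else "mock"


def mock_infer(query: str) -> dict:
    """Deterministic placeholder: the same query always yields the same output."""
    return {
        "answer": f"[mock] no model loaded for: {query}",
        "reasoning": "mock fallback; no checkpoint found",
        "confidence": 0.0,
        "mode": "mock",
    }


def infer(query: str, checkpoint_dir: str) -> dict:
    """Dispatch on the selected mode while keeping one response shape."""
    if select_mode(checkpoint_dir) == "mock":
        return mock_infer(query)
    # Real inference is out of scope for this sketch.
    raise NotImplementedError("load the model from checkpoint_dir here")
```

Because both paths are meant to return the same keys, a client written against the mock path should not need changes once the checkpoint is in place.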

Relative Path Example

# Relative-path local entry for the PhysVLM-AVR deployment
cd tools/physvlm_avr
./run_example.sh

# Direct Python entry:
conda run -n physvlm-avr python run_physvlm_example.py

# Server-only entry (paths are relative to the repository root, so return there first):
cd tools/physvlm_avr/repo/physvlm-avr
conda run -n physvlm-avr env PHYSVLM_PORT=8000 PHYSVLM_DEVICE=cpu python start_physvlm_server.py
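
When the server has to be started from a Python script, the server-only command above can be expressed as an argument list, an environment, and a working directory. This is a hedged sketch: `server_launch_spec` is a hypothetical helper, it only assembles the values and does not start anything, and it assumes the script runs from the repository root.

```python
import os


def server_launch_spec(port: int = 8000, device: str = "cpu"):
    """Build (argv, env, cwd) mirroring the server-only shell entry."""
    env = dict(os.environ)
    # Environment variable names are taken from the shell command above.
    env["PHYSVLM_PORT"] = str(port)
    env["PHYSVLM_DEVICE"] = device
    argv = [
        "conda", "run", "-n", "physvlm-avr",
        "python", "start_physvlm_server.py",
    ]
    # Assumes the caller's working directory is the repository root.
    cwd = "tools/physvlm_avr/repo/physvlm-avr"
    return argv, env, cwd


argv, env, cwd = server_launch_spec(port=8000, device="cpu")
```

The three values can then be passed to `subprocess.Popen(argv, env=env, cwd=cwd)` once the Conda environment exists.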

Expected Result Shape

{
  "tool": "physvlm_avr",
  "status": "ok",
  "results": [
    {
      "label": "Active visual reasoning",
      "score": 0.87,
      "output": "Answer, reasoning trace, confidence, server response JSON"
    }
  ],
  "timing": {
    "runtime": "The local deployment README does not provide a source-reported latency number; the bundled example runs through a FastAPI server and can fall back to mock mode on CPU.",
    "device": "documented in source benchmark when available"
  },
  "artifacts": {
    "visualization": "tools/physvlm_avr/runs/visualization.png",
    "raw_predictions": "tools/physvlm_avr/runs/predictions.json"
  }
}
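
A consumer may want to confirm that a result matches this shape before reading it. The checker below is a minimal sketch: `check_result_shape` is a hypothetical helper, the required keys are copied from the example above, and the example is assumed to be representative of real output.

```python
REQUIRED_TOP_LEVEL = ("tool", "status", "results", "timing", "artifacts")
REQUIRED_RESULT_KEYS = ("label", "score", "output")


def check_result_shape(result: dict) -> list:
    """Return a list of problems; an empty list means the shape matches."""
    problems = []
    for key in REQUIRED_TOP_LEVEL:
        if key not in result:
            problems.append(f"missing top-level key: {key}")
    results = result.get("results", [])
    if not isinstance(results, list):
        problems.append("results must be a list")
        return problems
    for index, item in enumerate(results):
        if not isinstance(item, dict):
            problems.append(f"results[{index}] must be an object")
            continue
        for key in REQUIRED_RESULT_KEYS:
            if key not in item:
                problems.append(f"results[{index}] missing key: {key}")
    return problems


# Trimmed copy of the example above, with placeholder strings shortened.
example = {
    "tool": "physvlm_avr",
    "status": "ok",
    "results": [
        {"label": "Active visual reasoning", "score": 0.87, "output": "..."}
    ],
    "timing": {"runtime": "...", "device": "..."},
    "artifacts": {
        "visualization": "tools/physvlm_avr/runs/visualization.png",
        "raw_predictions": "tools/physvlm_avr/runs/predictions.json",
    },
}
problems = check_result_shape(example)
```

Returning a list of problems instead of raising on the first one makes it easier to log every mismatch from a single run.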

Paper figure

Academic Info

Paper identity and contribution summary.

Title: PhysVLM-AVR

Authors: Add authors

Venue: OpenReview / arXiv preprint, 2025

Contribution: Introduces an active visual reasoning MLLM that combines sequential observation, action-conditioned information gathering, and chain-of-thought reasoning for embodied tasks in partially observable worlds.

Citation

@misc{physvlm_avr2025,
  title={PhysVLM-AVR},
  author={Author},
  year={2025},
  note={OpenReview / arXiv preprint, 2025},
  url={https://openreview.net/forum?id=kUN2R6X4jS}
}

Benchmark

Only compact, source-reported numbers are shown here.

Dataset	Metric	Value	Runtime	Source
CLEVR-AVR	Accuracy	84.2%	Source paper result	OpenReview paper
RoboVQA	Accuracy	78.0%	Source paper result	OpenReview paper

Artifacts

Official repository README, OpenReview paper page, deployment scripts, example input/output JSON, generated scene images, and server logs.

Demo Images

Visual references from the original tool.