Reinforced Visual Perception with Tools

This repository contains the official implementation for the paper "Reinforced Visual Perception with Tools".

Our work introduces ReVPT, a novel framework designed to enhance the visual perception capabilities of multimodal large language models (MLLMs) through reinforcement learning (RL). ReVPT trains models to reason about and use external visual tools, such as object detection, zoom-in, edge detection, and depth estimation, to solve complex visual perception tasks.

Installation

conda create -n revpt python=3.10 -y
conda activate revpt
pip install torch==2.6.0 torchvision==0.21.0
pip install https://github.com/Dao-AILab/flash-attention/releases/download/v2.7.4.post1/flash_attn-2.7.4.post1+cu12torch2.6cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
pip install -e ".[vllm]"

conda create -n tools python=3.10 -y
conda activate tools
pip install torch==2.4.1 torchvision==0.19.1
pip install https://github.com/Dao-AILab/flash-attention/releases/download/v2.7.4.post1/flash_attn-2.7.4.post1+cu12torch2.4cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
pip install transformers==4.42.0 fastapi uvicorn matplotlib opencv-python python-multipart

Tool services

Change the configuration in tools/tools_config_2.json, then download the Depth Anything V2 checkpoint:

cd Depth-Anything-V2
mkdir checkpoints
wget -O checkpoints/depth_anything_v2_vitl.pth "https://huggingface.co/depth-anything/Depth-Anything-V2-Large/resolve/main/depth_anything_v2_vitl.pth?download=true"

python tools/lanuch_tools.py --config tools_config_2.json

Train

You can download the training data from here. Place it under data.

Generate data using the following command:

python data/sat_jsonl.py --local-dir [LOCAL_DIR]

Change the configuration in scripts/run.sh, then start training:

bash scripts/run.sh

Eval

You can download the evaluation data from here. Place it under data.

First, deploy the vLLM servers. You can use our script:

bash scripts/deploy.sh [MODEL_PATH] [MODEL_NAME] [CUDA_DEVICES] [STARTING_PORT]
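
Before starting evaluation, it can help to confirm that each server is reachable. The following is a minimal sketch, assuming that scripts/deploy.sh starts vLLM's OpenAI-compatible server (which exposes a /health endpoint) on the local machine. The function name server_is_up and its defaults are hypothetical and not part of this repository.

```python
import http.client
import urllib.request


def server_is_up(port: int, host: str = "127.0.0.1", timeout: float = 1.0) -> bool:
    """Return True if GET http://host:port/health answers with HTTP 200.

    Assumption: the server behind the port is vLLM's OpenAI-compatible
    server, which provides a /health endpoint.
    """
    url = f"http://{host}:{port}/health"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (OSError, http.client.HTTPException):
        # URLError, HTTPError, refused connections, and timeouts are OSError
        # subclasses; malformed HTTP replies raise HTTPException.
        return False
```

For example, calling server_is_up(p) for every port you intend to pass as [PORT_POOL] shows which servers are ready before you launch agent_eval.py.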

The available datasets and prompts are defined in eval/agent_eval.py. Run the evaluation like this:

cd eval
python agent_eval.py --model-name [MODEL_NAME] --port-pool [PORT_POOL] --workers [WORKERS] --dataset [DATASET] --prompt [PROMPT] --evaluate

The --evaluate flag uses a regex to extract the answer inside \boxed{} and compares it with the ground-truth answer.
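
The kind of extraction described here can be sketched as follows. This is an illustrative example, not the code in eval/agent_eval.py: the names extract_boxed_answer and is_correct are hypothetical, the pattern assumes no nested braces inside \boxed{}, and the case-insensitive comparison is an assumption about how answers are matched.

```python
import re

# Matches \boxed{...} whose content has no nested braces (simplifying assumption).
BOXED_PATTERN = re.compile(r"\\boxed\{([^{}]*)\}")


def extract_boxed_answer(response):
    """Return the stripped content of the last \\boxed{} in the response, or None."""
    matches = BOXED_PATTERN.findall(response)
    return matches[-1].strip() if matches else None


def is_correct(response, ground_truth):
    """Compare the extracted answer with the ground truth, ignoring case and outer spaces."""
    predicted = extract_boxed_answer(response)
    if predicted is None:
        return False  # no boxed answer counts as wrong in this sketch
    return predicted.lower() == ground_truth.strip().lower()
```

Taking the last match is a design choice for responses that box intermediate results before the final answer; a response without any \boxed{} is treated as incorrect.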

You can use benchmark.sh to run all datasets at once.

Citation

@article{zhou2025reinforced,
  title={Reinforced Visual Perception with Tools},
  author={Zhou, Zetong and Chen, Dongping and Ma, Zixian and Hu, Zhihan and Fu, Mingyang and Wang, Sinan and Wan, Yao and Zhao, Zhou and Krishna, Ranjay},
  journal={arXiv preprint arXiv:2509.01656},
  year={2025}
}
