Reinforced Visual Perception with Tools

This repository contains the official implementation for the paper "Reinforced Visual Perception with Tools".

Our work introduces ReVPT, a framework that enhances the visual perception capabilities of multimodal large language models (MLLMs) through reinforcement learning (RL). ReVPT trains models to reason about and use external visual tools, such as object detection, zoom-in, edge detection, and depth estimation, to solve complex visual perception tasks.

Installation

conda create -n revpt python=3.10 -y
conda activate revpt
pip install torch==2.6.0 torchvision==0.21.0
pip install https://github.com/Dao-AILab/flash-attention/releases/download/v2.7.4.post1/flash_attn-2.7.4.post1+cu12torch2.6cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
pip install -e ".[vllm]"

conda create -n tools python=3.10 -y
conda activate tools
pip install torch==2.4.1 torchvision==0.19.1
pip install https://github.com/Dao-AILab/flash-attention/releases/download/v2.7.4.post1/flash_attn-2.7.4.post1+cu12torch2.4cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
pip install transformers==4.42.0 fastapi uvicorn matplotlib opencv-python python-multipart
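
Because the two environments pin different torch versions, it can be worth confirming that each one picked up the intended packages. The following is a minimal sketch, not part of the repository: check_pins, REVPT_PINS, and TOOLS_PINS are hypothetical names, and the version numbers are copied from the install commands above. It uses only the standard library's importlib.metadata.

```python
from importlib.metadata import PackageNotFoundError, version

# Pins copied from the install commands above (names are hypothetical).
REVPT_PINS = {"torch": "2.6.0", "torchvision": "0.21.0"}
TOOLS_PINS = {"torch": "2.4.1", "torchvision": "0.19.1", "transformers": "4.42.0"}


def check_pins(pins):
    """Map each package to 'ok', 'missing', or 'mismatch: <installed>'."""
    report = {}
    for name, wanted in pins.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            report[name] = "missing"
            continue
        # Ignore local build tags such as "+cu124" when comparing.
        base = installed.split("+", 1)[0]
        report[name] = "ok" if base == wanted else f"mismatch: {installed}"
    return report
```

Run check_pins(REVPT_PINS) inside the revpt environment and check_pins(TOOLS_PINS) inside the tools environment; any entry other than "ok" points at a package to reinstall.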

Tool services

Edit the configuration in tools/tools_config_2.json.

cd Depth-Anything-V2
mkdir checkpoints
wget -O checkpoints/depth_anything_v2_vitl.pth "https://huggingface.co/depth-anything/Depth-Anything-V2-Large/resolve/main/depth_anything_v2_vitl.pth?download=true"

python tools/lanuch_tools.py --config tools_config_2.json

Train

Download the training data from here and place it under data.

Generate data using the following command:

python data/sat_jsonl.py --local-dir [LOCAL_DIR]

Edit the configuration in scripts/run.sh, then launch training:

bash scripts/run.sh

Eval

Download the evaluation data from here and place it under data.

First, deploy the vLLM servers. You can use our script:

bash scripts/deploy.sh [MODEL_PATH] [MODEL_NAME] [CUDA_DEVICES] [STARTING_PORT]
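
Before starting an evaluation run, it can help to confirm that every server is reachable. The sketch below is an illustration only. It assumes that scripts/deploy.sh starts servers on consecutive ports beginning at STARTING_PORT, which this README does not confirm, and that each server exposes the /v1/models route of vLLM's OpenAI-compatible API. server_urls and is_ready are hypothetical helper names.

```python
import urllib.request


def server_urls(starting_port, count, host="localhost"):
    """Build base URLs for `count` servers on consecutive ports (an assumption)."""
    return [f"http://{host}:{starting_port + i}" for i in range(count)]


def is_ready(base_url, timeout=2.0):
    """Return True if GET <base_url>/v1/models answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/v1/models", timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # Connection refused, timeout, or an HTTP error status: not ready.
        return False
```

For example, [u for u in server_urls(8000, 4) if not is_ready(u)] lists the servers that are not answering yet; the port 8000 and the count 4 are placeholders for your own deployment.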

The available datasets and prompts are defined in eval/agent_eval.py. Run the evaluation as follows:

cd eval
python agent_eval.py --model-name [MODEL_NAME] --port-pool [PORT_POOL] --workers [WORKERS] --dataset [DATASET] --prompt [PROMPT] --evaluate

The --evaluate flag uses a regular expression to extract the answer inside \boxed{} and compares it with the ground-truth answer.
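
To illustrate the kind of matching described above, here is a minimal sketch of \boxed{} extraction. It is not the code in eval/agent_eval.py. The pattern, the choice of the last match, the case-insensitive comparison, and the names extract_boxed and is_correct are all assumptions made for this example.

```python
import re

# Assumed pattern: captures \boxed{...} content that has no nested braces.
BOXED_RE = re.compile(r"\\boxed\{([^{}]*)\}")


def extract_boxed(text):
    """Return the stripped content of the last boxed answer, or None."""
    matches = BOXED_RE.findall(text)
    return matches[-1].strip() if matches else None


def is_correct(response, ground_truth):
    """Compare the extracted answer with the ground truth, ignoring case."""
    pred = extract_boxed(response)
    return pred is not None and pred.lower() == ground_truth.strip().lower()
```

Taking the last match favors the model's final answer when intermediate reasoning also contains boxed expressions. Answers with nested braces, such as LaTeX fractions, would need a brace-aware parser instead of this pattern.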

You can use benchmark.sh to run all datasets at once.

Citation

@article{zhou2025reinforced,
  title={Reinforced Visual Perception with Tools},
  author={Zhou, Zetong and Chen, Dongping and Ma, Zixian and Hu, Zhihan and Fu, Mingyang and Wang, Sinan and Wan, Yao and Zhao, Zhou and Krishna, Ranjay},
  journal={arXiv preprint arXiv:2509.01656},
  year={2025}
}
