📃Paper | 🤗Models & Datasets Repo
This repository contains the official implementation for the paper "Reinforced Visual Perception with Tools".
Our work introduces ReVPT, a framework that enhances the visual perception capabilities of multimodal large language models (MLLMs) through reinforcement learning (RL). ReVPT trains models to reason about and use external visual tools, such as object detection, zoom-in, edge detection, and depth estimation, to solve complex visual perception tasks.
conda create -n revpt python=3.10 -y
conda activate revpt
pip install torch==2.6.0 torchvision==0.21.0
pip install https://github.com/Dao-AILab/flash-attention/releases/download/v2.7.4.post1/flash_attn-2.7.4.post1+cu12torch2.6cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
pip install -e ".[vllm]"
conda create -n tools python=3.10 -y
conda activate tools
pip install torch==2.4.1 torchvision==0.19.1
pip install https://github.com/Dao-AILab/flash-attention/releases/download/v2.7.4.post1/flash_attn-2.7.4.post1+cu12torch2.4cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
pip install transformers==4.42.0 fastapi uvicorn matplotlib opencv-python python-multipart
Change the config in tools/tools_config_2.json.
cd Depth-Anything-V2
mkdir checkpoints
wget -O checkpoints/depth_anything_v2_vitl.pth "https://huggingface.co/depth-anything/Depth-Anything-V2-Large/resolve/main/depth_anything_v2_vitl.pth?download=true"
python tools/lanuch_tools.py --config tools_config_2.json
You can download the data from here. Put the files under data.
Generate data using the following command:
python data/sat_jsonl.py --local-dir [LOCAL_DIR]
Change the config in scripts/run.sh.
bash scripts/run.sh
You can download the data from here. Put the files under data.
First, deploy the vLLM servers. You can use our script:
bash scripts/deploy.sh [MODEL_PATH] [MODEL_NAME] [CUDA_DEVICES] [STARTING_PORT]
Datasets and prompts can be found in eval/agent_eval.py. You can run evaluation like this:
cd eval
python agent_eval.py --model-name [MODEL_NAME] --port-pool [PORT_POOL] --workers [WORKERS] --dataset [DATASET] --prompt [PROMPT] --evaluate
The --evaluate flag uses a regex to extract the answer inside \boxed{} and compares it with the ground-truth answer.
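As a rough illustration of the \boxed{} extraction described above, the following minimal sketch shows one way such a check could work. It is not the code in eval/agent_eval.py: the function names `extract_boxed` and `is_correct`, the choice of taking the last boxed answer, and the case-insensitive comparison are all assumptions made for this example, and the pattern does not handle nested braces.

```python
import re

# Matches \boxed{...} where the content contains no nested braces.
# Assumption: the actual pattern used by the evaluation script may differ.
BOXED_RE = re.compile(r"\\boxed\{([^{}]*)\}")


def extract_boxed(response):
    """Return the content of the last \\boxed{...} in the response, or None."""
    matches = BOXED_RE.findall(response)
    return matches[-1].strip() if matches else None


def is_correct(response, ground_truth):
    """Compare the extracted answer with the ground truth, ignoring case."""
    predicted = extract_boxed(response)
    if predicted is None:
        return False  # no boxed answer counts as wrong in this sketch
    return predicted.lower() == ground_truth.strip().lower()


if __name__ == "__main__":
    reply = "The chair is closer to the camera, so the answer is \\boxed{B}."
    print(extract_boxed(reply))
    print(is_correct(reply, "b"))
```

Taking the last match is a deliberate choice here: a response that reasons through intermediate candidates usually states its final answer last.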
You can use benchmark.sh to run all datasets at once.
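Before starting an evaluation run, it can help to confirm that the deployed vLLM servers are accepting connections. The sketch below is not part of this repository: `reachable_ports` is a hypothetical helper, and the host and port values in the usage example are placeholders that should be replaced with the ports you passed to scripts/deploy.sh. It only checks that a TCP connection can be opened, not that the model has finished loading.

```python
import socket


def reachable_ports(host, ports, timeout=1.0):
    """Return the subset of ports on host that accept a TCP connection."""
    ready = []
    for port in ports:
        try:
            # create_connection raises OSError (including timeouts) on failure.
            with socket.create_connection((host, port), timeout=timeout):
                ready.append(port)
        except OSError:
            pass  # nothing is listening on this port, or it is unreachable
    return ready


if __name__ == "__main__":
    # Placeholder values: adjust to match your own deployment.
    candidate_ports = [8000, 8001]
    print(reachable_ports("127.0.0.1", candidate_ports, timeout=0.5))
```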
@article{zhou2025reinforced,
title={Reinforced Visual Perception with Tools},
author={Zhou, Zetong and Chen, Dongping and Ma, Zixian and Hu, Zhihan and Fu, Mingyang and Wang, Sinan and Wan, Yao and Zhao, Zhou and Krishna, Ranjay},
journal={arXiv preprint arXiv:2509.01656},
year={2025}
}