Reinforced Visual Perception with Tools

📃Paper | 🤗Models & Datasets Repo

This repository contains the official implementation for the paper "Reinforced Visual Perception with Tools".

Our work introduces ReVPT, a framework that uses reinforcement learning (RL) to enhance the visual perception capabilities of multimodal large language models (MLLMs). ReVPT trains models to reason about and use external visual tools, such as object detection, zoom-in, edge detection, and depth estimation, to solve complex visual perception tasks.

[Figure: ReVPT framework]

Installation

conda create -n revpt python=3.10 -y
conda activate revpt
pip install torch==2.6.0 torchvision==0.21.0
pip install https://github.com/Dao-AILab/flash-attention/releases/download/v2.7.4.post1/flash_attn-2.7.4.post1+cu12torch2.6cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
pip install -e ".[vllm]"

conda create -n tools python=3.10 -y
conda activate tools
pip install torch==2.4.1 torchvision==0.19.1
pip install https://github.com/Dao-AILab/flash-attention/releases/download/v2.7.4.post1/flash_attn-2.7.4.post1+cu12torch2.4cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
pip install transformers==4.42.0 fastapi uvicorn matplotlib opencv-python python-multipart

Tool services

Change the config in tools/tools_config_2.json, then download the Depth Anything V2 checkpoint:

cd Depth-Anything-V2
mkdir -p checkpoints
wget -O checkpoints/depth_anything_v2_vitl.pth "https://huggingface.co/depth-anything/Depth-Anything-V2-Large/resolve/main/depth_anything_v2_vitl.pth?download=true"

python tools/lanuch_tools.py --config tools_config_2.json

Train

You can download the training data from here. Put it under data.

Generate data using the following command:

python data/sat_jsonl.py --local-dir [LOCAL_DIR]

Change the config in scripts/run.sh, then start training:

bash scripts/run.sh

Eval

You can download the evaluation data from here. Put it under data.

First, deploy the vLLM servers. You can use our script:

bash scripts/deploy.sh [MODEL_PATH] [MODEL_NAME] [CUDA_DEVICES] [STARTING_PORT]
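
Before running the evaluation, it can help to confirm that each deployed server responds. The sketch below is not part of this repository. It assumes that the servers listen on consecutive ports starting at STARTING_PORT and that they expose vLLM's OpenAI-compatible /v1/models endpoint without an API key. The helper names build_port_pool, models_url, and server_is_up are hypothetical.

```python
import urllib.error
import urllib.request


def build_port_pool(starting_port, num_servers):
    """List the ports of num_servers servers (assumption: consecutive ports)."""
    return [starting_port + i for i in range(num_servers)]


def models_url(port, host="localhost"):
    """URL of the OpenAI-compatible model listing endpoint on one server."""
    return f"http://{host}:{port}/v1/models"


def server_is_up(port, host="localhost", timeout=2.0):
    """Return True if the server answers the model listing request with HTTP 200."""
    try:
        with urllib.request.urlopen(models_url(port, host), timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or an HTTP error all count as "not ready".
        return False


if __name__ == "__main__":
    # Hypothetical values: 4 servers starting at port 8000.
    for port in build_port_pool(8000, 4):
        print(port, "up" if server_is_up(port) else "down")
```

If a port is reported as down, check the arguments passed to scripts/deploy.sh before starting the evaluation.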

Datasets and prompts can be found in eval/agent_eval.py. You can run evaluation like this:

cd eval
python agent_eval.py --model-name [MODEL_NAME] --port-pool [PORT_POOL] --workers [WORKERS] --dataset [DATASET] --prompt [PROMPT] --evaluate

The --evaluate flag uses a regex to extract the answer inside \boxed{} and compares it with the ground-truth answer.
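
As an illustration of this kind of answer extraction, here is a minimal sketch. It is not the code in eval/agent_eval.py: the pattern, the choice of the last boxed answer, the case-insensitive comparison, and the helper names extract_boxed and is_correct are all assumptions.

```python
import re

# Matches \boxed{...} whose content has no nested braces.
# Assumption: the pattern used in the repository may differ.
BOXED_RE = re.compile(r"\\boxed\{([^{}]*)\}")


def extract_boxed(text):
    """Return the content of the last boxed answer in text, or None if absent."""
    matches = BOXED_RE.findall(text)
    return matches[-1].strip() if matches else None


def is_correct(response, ground_truth):
    """Case-insensitive exact match between the boxed answer and the ground truth."""
    answer = extract_boxed(response)
    return answer is not None and answer.lower() == ground_truth.strip().lower()


if __name__ == "__main__":
    response = r"The object on the left is closer, so the answer is \boxed{B}."
    print(extract_boxed(response))    # B
    print(is_correct(response, "b"))  # True
```

A pattern like this does not handle nested braces such as \boxed{\frac{1}{2}}, which would need a brace-matching parser instead of a single regex.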

You can use benchmark.sh to run all datasets at once.

Citation

@article{zhou2025reinforced,
  title={Reinforced Visual Perception with Tools},
  author={Zhou, Zetong and Chen, Dongping and Ma, Zixian and Hu, Zhihan and Fu, Mingyang and Wang, Sinan and Wan, Yao and Zhao, Zhou and Krishna, Ranjay},
  journal={arXiv preprint arXiv:2509.01656},
  year={2025}
}