CapRL: Stimulating Dense Image Caption Capabilities via Reinforcement Learning

yhzan 7 hours ago

Introducing CapRL, the first study to apply GRPO to the open-ended, subjective task of image captioning. The trained CapRL-3B model achieves image captioning performance comparable to Qwen2.5-VL-72B. CapRL introduces a novel training framework that redefines caption quality in terms of utility: a high-quality caption should enable a non-visual language model to accurately answer questions about the corresponding image. CapRL is open source, and total downloads of the models and datasets have surpassed 7,000. The research team continues to iterate with stronger base models and improved training recipes.
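
To make the utility-based idea concrete, here is a minimal sketch of how such a reward could be computed and turned into group-relative advantages in the GRPO style. It is not the CapRL implementation: the multiple-choice question format, the function names (`caption_reward`, `group_relative_advantages`), and the toy `keyword_answerer` that stands in for a text-only language model are all assumptions made for illustration.

```python
from statistics import mean, pstdev
from typing import Callable, Sequence

# Hypothetical interface: (caption, question, options) -> index of chosen option.
# In a real setup this would be a text-only LLM that never sees the image.
Answerer = Callable[[str, str, Sequence[str]], int]


def caption_reward(caption: str, questions: Sequence[dict], answerer: Answerer) -> float:
    """Fraction of questions answered correctly using only the caption."""
    if not questions:
        return 0.0
    correct = sum(
        1
        for q in questions
        if answerer(caption, q["question"], q["options"]) == q["answer"]
    )
    return correct / len(questions)


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-6) -> list:
    """Normalize rewards within one group of sampled captions (GRPO-style)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def keyword_answerer(caption: str, question: str, options: Sequence[str]) -> int:
    """Toy stand-in for a text-only LLM: pick the first option named in the caption."""
    text = caption.lower()
    for i, option in enumerate(options):
        if option.lower() in text:
            return i
    return 0  # fall back to the first option when the caption gives no clue


# Assumed example data: questions about one image, several sampled captions.
questions = [
    {"question": "What color is the car?", "options": ["blue", "red", "green"], "answer": 1},
    {"question": "What animal is on the roof?", "options": ["dog", "cat", "bird"], "answer": 1},
]
captions = [
    "A red car with a cat on the roof.",
    "A red car parked on a street.",
    "A vehicle outside.",
]

rewards = [caption_reward(c, questions, keyword_answerer) for c in captions]
advantages = group_relative_advantages(rewards)
```

In this toy run the most detailed caption earns the highest reward and a positive advantage, while the vague caption is penalized, which is the behavior a utility-based reward is meant to encourage.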