Introducing CapRL, the first study to apply GRPO (Group Relative Policy Optimization) to the open-ended, subjective task of image captioning. The trained CapRL-3B model achieves image captioning performance comparable to Qwen2.5-VL-72B. CapRL introduces a novel training framework that redefines caption quality in terms of utility: a high-quality caption should enable a non-visual language model to accurately answer questions about the corresponding image. CapRL is open source, and total downloads of its models and datasets have surpassed 7,000. The research team continues to iterate with stronger base models and improved training recipes.
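The utility-based notion of caption quality described above can be illustrated with a minimal Python sketch. It assumes the reward is the fraction of multiple-choice questions that a text-only answerer gets right when given only the caption; the text here does not specify the exact reward formula, so that choice is an assumption. The names `Question`, `caption_utility_reward`, and `keyword_answerer` are hypothetical, and the keyword matcher is a toy stand-in for a real language model.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Question:
    text: str
    options: Sequence[str]
    answer: str  # the correct option


# Hypothetical interface: a text-only answerer that sees the caption, never the image.
Answerer = Callable[[str, Question], str]


def caption_utility_reward(
    caption: str, questions: Sequence[Question], answerer: Answerer
) -> float:
    """Assumed reward: fraction of questions answered correctly from the caption alone."""
    if not questions:
        return 0.0
    correct = sum(answerer(caption, q) == q.answer for q in questions)
    return correct / len(questions)


def keyword_answerer(caption: str, question: Question) -> str:
    """Toy stand-in for a language model: pick the first option the caption mentions."""
    lowered = caption.lower()
    for option in question.options:
        if option.lower() in lowered:
            return option
    return question.options[0]  # no option mentioned: fall back to a guess


questions = [
    Question("What animal is shown?", ("cat", "dog"), "dog"),
    Question("What color is the ball?", ("red", "blue"), "blue"),
]
detailed_reward = caption_utility_reward(
    "A dog chases a blue ball on grass.", questions, keyword_answerer
)
vague_reward = caption_utility_reward(
    "An animal plays outside.", questions, keyword_answerer
)
```

In this toy setup the detailed caption scores higher than the vague one, which is the signal a policy-gradient method such as GRPO could then optimize across a group of sampled captions.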