Corvid

Improving Multimodal Large Language Models Towards
Chain-of-Thought Reasoning
1Shanghai Jiao Tong University, 2Nanyang Technological University
ICCV 2025
Overview

◈ Corvid is an MLLM with enhanced CoT reasoning capabilities. Architecturally, it incorporates a hybrid vision encoder for informative visual representation and a meticulously designed connector (GateMixer) to facilitate cross-modal alignment. During inference, Corvid implements a self-verification strategy to mitigate over-reasoning on easy samples and under-reasoning on hard ones.
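The inference-time self-verification strategy can be sketched as a simple routing loop: answer directly first, score that answer, and fall back to chain-of-thought reasoning only when confidence is low. This is a minimal illustration under stated assumptions, not the paper's actual API; `direct_answer`, `cot_answer`, and `verify` are hypothetical stand-ins, and `DummyModel` exists only to exercise the routing logic.

```python
class DummyModel:
    """Toy stand-in for an MLLM; real models would run the LLM here."""

    def direct_answer(self, question, image=None):
        # Fast path: answer without an explicit reasoning chain.
        return "direct:" + question

    def cot_answer(self, question, image=None):
        # Slow path: answer with step-by-step chain-of-thought reasoning.
        return "cot:" + question

    def verify(self, question, image, answer):
        # Hypothetical self-assessed confidence in the direct answer;
        # here we pretend short questions are "easy".
        return 1.0 if len(question) < 10 else 0.0


def answer_with_self_verification(model, question, image=None, threshold=0.5):
    """Route easy samples to the direct answer (avoiding over-reasoning)
    and hard samples to CoT reasoning (avoiding under-reasoning)."""
    direct = model.direct_answer(question, image)
    if model.verify(question, image, direct) >= threshold:
        return direct
    return model.cot_answer(question, image)
```

The threshold trades latency against accuracy: a higher value sends more samples down the CoT path.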

Training Data Curation and Procedure

Training Data Curation. We first construct a multimodal CoT-formatted instruction dataset, MCoT-Instruct, and then introduce MGA-1M, Corvid-1M, and o1-320K to support the three-stage training of Corvid.


◈ Source data summary. Numbers in parentheses represent the number of conversation instances used for each dataset.

Training Procedure. Corvid undergoes the following three-stage training:

  • Stage 1: Multi-Grained Alignment Pretraining. Training GateMixer on MGA-1M to align image and text semantics within the LLM's textual embedding space.
  • Stage 2: CoT-Enhanced Supervised Fine-tuning. Jointly training GateMixer and LLM on Corvid-1M to enable Corvid to follow instructions and perform CoT reasoning.
  • Stage 3: Pure-CoT Instruction Tuning. Training Corvid on o1-320K to further strengthen its reasoning capability.
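The three-stage schedule above can be summarized as a simple configuration mapping each stage to its dataset and trainable modules. This is a hypothetical sketch: stages 1 and 2 follow the module assignments stated above, while treating stage 3 as continued GateMixer + LLM tuning is an assumption not confirmed by the text.

```python
# Hypothetical summary of the three-stage schedule described above.
# Which modules are unfrozen in stage 3 is an assumption.
STAGES = [
    {"name": "alignment_pretraining", "data": "MGA-1M",    "trainable": {"gatemixer"}},
    {"name": "cot_sft",               "data": "Corvid-1M", "trainable": {"gatemixer", "llm"}},
    {"name": "pure_cot_tuning",       "data": "o1-320K",   "trainable": {"gatemixer", "llm"}},
]


def trainable_modules(stage_name):
    """Return the set of modules unfrozen in the given training stage."""
    for stage in STAGES:
        if stage["name"] == stage_name:
            return stage["trainable"]
    raise ValueError(f"unknown stage: {stage_name}")
```

In a training loop, this mapping would drive `requires_grad` flags: the vision encoder stays frozen throughout, and the LLM is only unfrozen from stage 2 onward.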

Overall Performance

Evaluation on Multimodal Reasoning Benchmarks. Corvid is evaluated on nine benchmarks to assess its problem-solving and mathematical reasoning capabilities.


◈ Comparison with state-of-the-art MLLMs on multimodal reasoning benchmarks.

Corvid's Response Demonstration

Mathematical Reasoning


Vision Cognition


Problem Solving


Failure Case


BibTeX

  @inproceedings{jiang2025corvid,
    title={Corvid: Improving Multimodal Large Language Models Towards Chain-of-Thought Reasoning},
    author={Jiang, Jingjing and Ma, Chao and Song, Xurui and Zhang, Hanwang and Luo, Jun},
    booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
    year={2025}
  }