Corvid

Improving Multimodal Large Language Models Towards
Chain-of-Thought Reasoning
1Shanghai Jiao Tong University, 2Nanyang Technological University
ICCV 2025
Overview

◈ Corvid is an MLLM with enhanced CoT reasoning capabilities. Architecturally, it incorporates a hybrid vision encoder for informative visual representation and a meticulously designed connector (GateMixer) to facilitate cross-modal alignment. During inference, Corvid implements a self-verification strategy to mitigate over-reasoning on easy samples and under-reasoning on hard ones.
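The inference-time self-verification strategy can be sketched as a simple routing loop: answer directly first, score that answer, and fall back to chain-of-thought reasoning only when confidence is low. This is a minimal illustration under stated assumptions, not the paper's actual API; `direct_answer`, `cot_answer`, and `verify` are hypothetical stand-ins, and `DummyModel` exists only to exercise the routing logic.

```python
class DummyModel:
    """Toy stand-in for an MLLM; real models would run the LLM here."""

    def direct_answer(self, question, image=None):
        # Fast path: answer without an explicit reasoning chain.
        return "direct:" + question

    def cot_answer(self, question, image=None):
        # Slow path: answer with step-by-step chain-of-thought reasoning.
        return "cot:" + question

    def verify(self, question, image, answer):
        # Hypothetical self-assessed confidence in the direct answer;
        # here we pretend short questions are "easy".
        return 1.0 if len(question) < 10 else 0.0


def answer_with_self_verification(model, question, image=None, threshold=0.5):
    """Route easy samples to the direct answer (avoiding over-reasoning)
    and hard samples to CoT reasoning (avoiding under-reasoning)."""
    direct = model.direct_answer(question, image)
    if model.verify(question, image, direct) >= threshold:
        return direct
    return model.cot_answer(question, image)
```

The threshold trades latency against accuracy: a higher value sends more samples down the CoT path.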

Training Data Curation and Procedure

Training Data Curation. We first construct a multimodal CoT-formatted instruction dataset, MCoT-Instruct, and then introduce MGA-1M, Corvid-1M, and o1-320K to support the three-stage training of Corvid.


◈ Source data summary. Numbers in parentheses represent the number of conversation instances used for each dataset.

Training Procedure. Corvid undergoes the following three-stage training:

  • Stage 1: Multi-Grained Alignment Pretraining. Training GateMixer on MGA-1M to align image and text semantics within the LLM's textual embedding space.
  • Stage 2: CoT-Enhanced Supervised Fine-tuning. Jointly training GateMixer and LLM on Corvid-1M to enable Corvid to follow instructions and perform CoT reasoning.
  • Stage 3: Pure-CoT Instruction Tuning. Training Corvid on o1-320K to further strengthen its reasoning capability.
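The three-stage schedule above can be summarized as a simple configuration mapping each stage to its dataset and trainable modules. This is a hypothetical sketch: stages 1 and 2 follow the module assignments stated above, while treating stage 3 as continued GateMixer + LLM tuning is an assumption not confirmed by the text.

```python
# Hypothetical summary of the three-stage schedule described above.
# Which modules are unfrozen in stage 3 is an assumption.
STAGES = [
    {"name": "alignment_pretraining", "data": "MGA-1M",    "trainable": {"gatemixer"}},
    {"name": "cot_sft",               "data": "Corvid-1M", "trainable": {"gatemixer", "llm"}},
    {"name": "pure_cot_tuning",       "data": "o1-320K",   "trainable": {"gatemixer", "llm"}},
]


def trainable_modules(stage_name):
    """Return the set of modules unfrozen in the given training stage."""
    for stage in STAGES:
        if stage["name"] == stage_name:
            return stage["trainable"]
    raise ValueError(f"unknown stage: {stage_name}")
```

In a training loop, this mapping would drive `requires_grad` flags: the vision encoder stays frozen throughout, and the LLM is only unfrozen from stage 2 onward.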

Overall Performance

Evaluation on Multimodal Reasoning Benchmarks. Corvid is evaluated on nine benchmarks to assess its problem-solving and mathematical reasoning capabilities.


◈ Comparison with state-of-the-art MLLMs on multimodal reasoning benchmarks.

Corvid's Response Demonstration

Mathematical Reasoning


Vision Cognition


Problem Solving


Failure Case


BibTeX

  @inproceedings{jiang2025corvid,
    title={Corvid: Improving Multimodal Large Language Models Towards Chain-of-Thought Reasoning},
    author={Jiang, Jingjing and Ma, Chao and Song, Xurui and Zhang, Hanwang and Luo, Jun},
    booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
    year={2025}
  }