BMMR: A Large-Scale Bilingual Multimodal Multi-Discipline Reasoning Dataset

Zhiheng Xi1*†, Guanyu Li1*, Yutao Fan2,3*, Honglin Guo1*,
Yufang Liu4, Xiaoran Fan1, Jiaqi Liu1, Jinchao Ding7,
Wangmeng Zuo3, Zhenfei Yin5,6†, Lei Bai2, Tao Ji1, Tao Gui1†, Qi Zhang1, Philip Torr5, Xuanjing Huang1
1Fudan University
2Shanghai AI Laboratory
3Harbin Institute of Technology
4East China Normal University
5University of Oxford
6University of Sydney
7Yimudata

*Indicates Equal Contribution

†Indicates Corresponding Author

🔔News

🔥[2025-07-08]: We have released the dataset and the evaluation scripts on GitHub! We would greatly appreciate it if you could check them out and consider giving us a star!🌟

Introduction

We introduce BMMR, a large-scale bilingual, multimodal, multi-disciplinary reasoning dataset for the community to develop and evaluate large multimodal models (LMMs). BMMR comprises 110k college-level questions covering 300 UNESCO-defined subjects in diverse formats (multiple-choice, fill-in-the-blank, and open-ended QA), sourced from both print and digital media such as books, exams, and quizzes. All data are curated and filtered via a scalable, human-in-the-loop framework, and each instance is paired with a high-quality reasoning path. The dataset is organized into two parts: BMMR-Eval, which comprises 20,458 high-quality instances to comprehensively assess LMMs' knowledge and reasoning across multiple disciplines in both Chinese and English; and BMMR-Train, which contains 88,991 instances to support further research and development, extending the current focus on mathematical reasoning to diverse disciplines and domains. In addition, we propose a process-based multi-discipline verifier (i.e., BMMR-Verifier) for accurate and fine-grained evaluation of reasoning paths. Extensive experiments on 24 models reveal that (i) even SOTA models (e.g., o3 and Gemini-2.5-Pro) leave substantial headroom on BMMR-Eval; (ii) reasoning models exhibit discipline bias and outperform LMMs only on specific subjects; (iii) open-source models still trail their proprietary counterparts; and (iv) fine-tuning on BMMR-Train narrows this gap. Additionally, we conduct reasoning-chain analyses using BMMR-Verifier and other in-depth studies, uncovering the challenges LMMs currently face in multidisciplinary reasoning. We have released the data, and we hope our work offers insights and contributions to the community.
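For intuition, the sketch below shows one way a process-based verifier can be applied to a reasoning path: the path is split into steps and each step is scored given the question and the preceding steps. The step granularity, the verifier interface (the score_step callable), and the aggregation shown here are illustrative assumptions, not the actual BMMR-Verifier implementation.

from typing import Callable, List

def score_reasoning_path(
    question: str,
    steps: List[str],
    score_step: Callable[[str, List[str], str], float],
) -> List[float]:
    # score_step(question, previous_steps, current_step) is a placeholder for a
    # verifier model that returns a correctness score in [0, 1] for one step.
    return [score_step(question, steps[:i], step) for i, step in enumerate(steps)]

def path_score(step_scores: List[float], threshold: float = 0.5) -> float:
    # Illustrative aggregation: the fraction of steps judged correct.
    return sum(s >= threshold for s in step_scores) / max(len(step_scores), 1)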

BMMR Dataset

Overview

The BMMR dataset is proposed to support the evaluation and development of multimodal foundation models in college-level, multidisciplinary knowledge, understanding, and reasoning. It comprises 110k items spanning 300 UNESCO-defined subfields across 8 high-level disciplines.


BMMR is bilingual (English and Chinese) and sourced from both print and digital media, including books, exams, and quizzes. This variety of sources inevitably introduces uncertainty in data quality, so we design specific procedures to ensure question diversity, complexity, and answer verifiability. We also re-organize the original questions, through rewriting and augmentation, into multiple-choice, fill-in-the-blank, and open-ended QA formats to minimize the impact of model memorization and guessing. Each retained instance requires cross-modal understanding, domain-specific expertise, and advanced reasoning skills to solve. To support the research community, each instance is paired with a high-quality reasoning path.

BMMR is split into two subsets: BMMR-Eval, containing 20,458 examples, and BMMR-Train, containing 88,991 examples. Specifically, BMMR-Eval is designed to comprehensively assess LMMs' perception, knowledge, and reasoning across a broad range of disciplines and difficulty levels, while BMMR-Train supports the community's research and development of next-generation multimodal foundation models, extending the community's current focus on mathematical reasoning to diverse disciplines and domains.
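As a usage sketch, the two subsets could be loaded roughly as follows with the Hugging Face datasets library; the repository ID and split names below are placeholders, so please refer to the official GitHub release for the authoritative loading instructions.

from datasets import load_dataset

# Placeholder repository ID and split names -- the real identifiers are given
# in the official release and may differ from these assumptions.
REPO_ID = "BMMR/BMMR"

bmmr_eval = load_dataset(REPO_ID, split="eval")    # 20,458 evaluation instances
bmmr_train = load_dataset(REPO_ID, split="train")  # 88,991 training instances

print(len(bmmr_eval), len(bmmr_train))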

Comparisons with Existing Benchmarks

We compare BMMR-Eval with other benchmarks in terms of size and diversity (top-left). A comparison of model performance on BMMR-Eval versus MMMU is also included (top-right), highlighting the challenging nature of our test set.

(Figures: comparison with existing benchmarks in size and diversity, and model performance on BMMR-Eval versus MMMU.)

Overall comparison between BMMR-Eval and other existing benchmarks. In the Source column, D means digital data sources, such as websites and existing datasets; P means print data sources, such as college textbooks and exams; and R means repurposed data sources. The Multiple Images column indicates the presence of questions that contain multiple images. In the Question Type column, MC means multiple-choice, FIB means fill-in-the-blank, OE means open-ended, and TF means true-or-false questions. (t) in the Language column means "translated". In the Difficulty column, C means college level, K means K-12 level, and H means high-school level. Information for R-Bench covers only its multimodal subset. For all datasets, we report statistics on the test split only.

Statistics

Experiment Results

Error Analysis

We conduct a fine-grained error analysis on 19k responses sampled from different models. We provide the incorrect reasoning responses to GPT-4o for error classification. We observe that the largest portion of errors stems from a lack of domain knowledge, which highlights the broad multidisciplinary knowledge coverage of BMMR-Eval. The second and third most frequent error types originate from computation, derivation, and reasoning, which also validates our dataset's demand for System-2 reasoning capabilities. We point out that developing next-generation LMMs and LRMs requires simultaneously considering different aspects, including visual understanding capabilities, reasoning skills, and multidisciplinary knowledge.
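As an illustration of this protocol, the sketch below classifies a single incorrect response with a judge model via the OpenAI Python client. The error taxonomy and prompt wording here are assumptions for illustration; the actual prompt and categories used in our analysis may differ.

from openai import OpenAI

client = OpenAI()  # requires OPENAI_API_KEY in the environment

# Hypothetical error taxonomy, for illustration only.
ERROR_TYPES = [
    "lack of domain knowledge",
    "computation or derivation error",
    "reasoning error",
    "visual misunderstanding",
    "question misinterpretation",
    "other",
]

def classify_error(question: str, reference: str, response: str) -> str:
    """Ask a judge model to assign one error type to an incorrect response."""
    prompt = (
        "You are given a question, the reference answer, and an incorrect model response.\n"
        f"Question: {question}\nReference answer: {reference}\nModel response: {response}\n"
        f"Classify the main cause of the error as exactly one of: {', '.join(ERROR_TYPES)}.\n"
        "Reply with the category name only."
    )
    completion = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return completion.choices[0].message.content.strip()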

Error Examples

BibTeX

@misc{xi2025bmmrlargescalebilingualmultimodal,
  title={BMMR: A Large-Scale Bilingual Multimodal Multi-Discipline Reasoning Dataset},
  author={Zhiheng Xi and Guanyu Li and Yutao Fan and Honglin Guo and Yufang Liu and Xiaoran Fan and Jiaqi Liu and Jingchao Ding and Wangmeng Zuo and Zhenfei Yin and Lei Bai and Tao Ji and Tao Gui and Qi Zhang and Xuanjing Huang},
  year={2025},
  eprint={2507.03483},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2507.03483},
}