MMLongBench-Doc Leaderboard
NeurIPS 2024 Datasets and Benchmarks Track (Spotlight)
MMLongBench-Doc is a long-context, multimodal benchmark designed to evaluate large multimodal models on complex document understanding tasks.
This leaderboard tracks the performance of various models on the MMLongBench-Doc benchmark, focusing on their ability to understand and process long documents with both text and visual elements.
You can evaluate your model on MMLongBench-Doc using either the official GitHub repo or VLMEvalKit. We provide the official evaluation results for GPT-4.1 and GPT-4o.
To add your own model to the leaderboard, please send an email to yubo001@e.ntu.edu.sg or zangyuhang@pjlab.org.cn, and we will help with the evaluation and update the leaderboard.
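For reference, a VLMEvalKit evaluation run looks roughly like the sketch below. The dataset identifier `MMLongBench_DOC` and the model name `GPT4o` are assumptions based on VLMEvalKit's naming conventions; check the VLMEvalKit documentation for the exact supported names.

```shell
# Sketch: evaluate a model on MMLongBench-Doc via VLMEvalKit.
# The dataset/model identifiers below are assumptions -- verify them
# against VLMEvalKit's supported lists before running.
git clone https://github.com/open-compass/VLMEvalKit.git
cd VLMEvalKit
pip install -e .

# API-based models (e.g. GPT-4o) need an OpenAI key.
export OPENAI_API_KEY=sk-...   # placeholder, use your own key
python run.py --data MMLongBench_DOC --model GPT4o --verbose
```

Results are written under the working directory per model and dataset, and the accuracy reported there corresponds to the ACC Score column in the table below.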
Leaderboard Statistics
- Total Models: 9
- Best Score: 49.7
- Lowest Score: 25.1
| Model | Release Date | HF Model | MoE | Parameters | Open Source | ACC Score |
|---|---|---|---|---|---|---|
| - | 2025-04 | - | ✅ | 45.9B activated (456B total) | ✅ | 49.7 |