
Shanghai, China – A team of researchers from the Shanghai Artificial Intelligence Laboratory, Shanghai Jiao Tong University, Zhejiang University, and Fudan University has released DocGenome, the first large-scale, multi-modal structured scientific document benchmark dataset.

The dataset is intended for training and evaluating multi-modal large language models (LLMs) and for fully leveraging the value of scientific literature for AI systems. As a vast repository of structured scientific literature, DocGenome records research findings and human knowledge, providing crucial support for research and applications in automated multi-modal scientific document understanding and the AI-driven discovery of scientific questions.

The dataset was automatically annotated from 500,000 open-access scientific documents on the preprint website arXiv using a custom automated pipeline, and it has four key characteristics: completeness, logic, diversity, and correctness.

The paper detailing the dataset, "DocGenome: An Open Large-scale Scientific Document Benchmark for Training and Testing Multi-modal Large Language Models," was published on arXiv [1].

Inspired by the Visual Genome dataset, introduced by Professor Fei-Fei Li's team at Stanford University in 2016, DocGenome goes beyond unary region-level annotations for each scientific document: it also annotates binary relationships between regions.

"For example, the reading order between different paragraphs and the citation relationships between different regions are very helpful in alleviating hallucinations in large models and improving their writing logic," explained Dr. Bo Zhang, a researcher at the Shanghai Artificial Intelligence Laboratory and the corresponding author of the paper.
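To make this structure concrete, here is a minimal sketch of what a region-plus-relation annotation could look like. The class and field names are hypothetical illustrations, not DocGenome's actual schema; they only show the idea of unary region attributes combined with binary edges such as reading order and citation links.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class Region:
    """One annotated block on a page (a unary annotation).

    Field names are illustrative, not the dataset's real schema.
    """
    region_id: int
    page: int
    bbox: Tuple[float, float, float, float]  # x0, y0, x1, y1 in page coordinates
    category: str                            # e.g. "paragraph", "table", "equation"
    text: str = ""                           # extracted or source text, if any

@dataclass
class Relation:
    """A directed binary relation between two regions."""
    source_id: int
    target_id: int
    kind: str                                # e.g. "reading_order", "citation"

@dataclass
class DocumentAnnotation:
    doc_id: str
    regions: List[Region] = field(default_factory=list)
    relations: List[Relation] = field(default_factory=list)

# A paragraph that cites a table would be stored as two Region entries plus
# one Relation(kind="citation") linking them, while the order in which
# paragraphs should be read forms a chain of Relation(kind="reading_order") edges.
```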

Previous research has shown that LLMs often only grasp the intuitive writing logic of papers, not the crucial experimental logic, due to insufficient data and limited logical reasoning capabilities when dealing with scientific documents.

To address the challenges of data scarcity and high annotation costs in scientific document understanding, the research team developed DocParser, an automated tool for the structured annotation of scientific documents.

Structured annotation is a significant challenge because each paper depends on its own compilation libraries and environment packages, which calls for a unified, automated approach to handling papers written by different authors in different styles.

DocParser consists of four key modules: context and data preprocessing, unit segmentation, attribute assignment and relationship retrieval, and unit rendering. Together, these modules enable the automatic extraction and structured annotation of scientific documents from raw arXiv source data.
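A rough sketch of how such a four-stage pipeline can be chained is shown below. The function names and placeholder logic are assumptions for illustration only, not DocParser's actual implementation or API; they simply mirror the module order described above.

```python
from typing import Any, Dict, List

def preprocess(raw_source: str) -> Dict[str, Any]:
    """Context and data preprocessing: clean the raw source and build a
    single working representation of the document."""
    # Placeholder: real preprocessing would resolve included files, strip comments, etc.
    return {"source": raw_source}

def segment_units(doc: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Unit segmentation: split the document into blocks such as paragraphs,
    tables, equations, figures, and captions."""
    return [{"text": chunk} for chunk in doc["source"].split("\n\n") if chunk.strip()]

def assign_attributes_and_relations(units: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Attribute assignment and relationship retrieval: label each unit's
    category and link related units (reading order, citations, ...)."""
    for i, unit in enumerate(units):
        unit["category"] = "paragraph"            # placeholder classification
        unit["reads_after"] = i - 1 if i > 0 else None
    return units

def render_units(units: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Unit rendering: produce the visual form (e.g. a cropped page image)
    that pairs with each structured unit."""
    for unit in units:
        unit["rendered"] = f"<image of: {unit['text'][:30]}...>"
    return units

def docparser_like_pipeline(raw_source: str) -> List[Dict[str, Any]]:
    """Chain the four stages in the module order described above."""
    return render_units(assign_attributes_and_relations(segment_units(preprocess(raw_source))))
```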

Dr. Zhang highlighted that DocParser, as the core tool in the dataset annotation process, automatically annotated the 500,000 arXiv scientific documents (with both unary and binary relationship annotations), saving an estimated 4-5 million yuan in manual annotation costs.

In terms of unary relationships, DocGenome supports switching between different complex modalities, for example converting visual tables and formulas into their textual counterparts. This opens up richer application scenarios for document-format conversion.
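As an illustration of how such conversion tasks fall out of the annotations, the hypothetical helper below pairs a region's rendered image with its source text (for example, a table image with its LaTeX source). The dictionary keys are assumed for this sketch and are not the dataset's real field names.

```python
from typing import Dict, Iterator, List, Tuple

def make_conversion_samples(regions: List[Dict[str, str]]) -> Iterator[Tuple[str, str]]:
    """Yield (rendered_image_path, source_text) pairs for regions that carry
    both a visual rendering and source text, e.g. table image -> LaTeX."""
    for region in regions:
        if region.get("category") in {"table", "equation"} and region.get("source"):
            yield region["image"], region["source"]

# Example: a single table region annotated with both its rendering and its source.
samples = list(make_conversion_samples([
    {"category": "table",
     "image": "doc1_region7.png",
     "source": r"\begin{tabular}{cc} a & b \\ \end{tabular}"},
]))
```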

Furthermore, DocGenome encompasses various complex modality categories, including charts, equations, tables, algorithms, code, and footnotes.

In terms of binary relationships, DocGenome establishes six types of binary logical relationships between different regions:

  • Equivalence relationships (e.g., handling cross-page issues, where one text paragraph is split across different pages); see the sketch after this list.
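For instance, an equivalence edge makes it straightforward to stitch a cross-page paragraph back together. The sketch below assumes a simple (source_id, target_id, kind) tuple layout for relations, which is an illustration rather than the dataset's actual format.

```python
from typing import Dict, List, Tuple

def merge_cross_page_fragments(
    texts: Dict[int, str],
    relations: List[Tuple[int, int, str]],
) -> Dict[int, str]:
    """Rejoin paragraphs split across pages by following "equivalence" edges.

    `texts` maps region_id -> text fragment; each relation is
    (source_id, target_id, kind). Only the relation name mentioned above is
    used here; the data layout itself is an assumption for this sketch.
    """
    merged = dict(texts)
    for source_id, target_id, kind in relations:
        if kind == "equivalence" and source_id in merged and target_id in merged:
            merged[source_id] = merged[source_id] + merged.pop(target_id)
    return merged

# A paragraph whose first half sits on page 3 (region 12) and second half on
# page 4 (region 13) collapses back into a single region.
print(merge_cross_page_fragments(
    {12: "The proposed method ", 13: "outperforms the baseline."},
    [(12, 13, "equivalence")],
))
```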

