### The Dawn of Scientific Question Answering: SciQAG Framework Pioneers the Evolution of Large Model Evaluation
In the domain of Natural Language Processing (NLP), high-quality question answering (QA) datasets are indispensable for driving technological progress. They not only support model fine-tuning but also provide an effective means of evaluating capabilities on specific tasks, particularly scientific knowledge comprehension and reasoning. However, existing scientific QA datasets remain limited in format, content, and evaluation methodology, which restricts their applicability in real-world academic research and production settings. To address these challenges, a collaboration involving Argonne National Laboratory in the US, Professor Ian Foster of the University of Chicago (recipient of the 2002 ACM Gordon Bell Prize), Professor Bram Hoex of the University of New South Wales (UNSW), the UNSW AI4Science team, the AI4Science company GreenDynamics, and Professor Jie Chunyu's team at the City University of Hong Kong has introduced the SciQAG framework, which aims to provide a comprehensive and efficient benchmark and evaluation system for complex scientific question answering.
**SciQAG Framework** automatically generates high-quality open-ended scientific question-answer pairs from a large corpus of scientific literature, addressing the shortcomings of current scientific QA datasets. The framework produces diverse QA pairs that probe a model's capabilities in scientific knowledge understanding, reasoning, and open-ended answering. Compared with the traditional multiple-choice format, this open-ended focus allows a more complete assessment of model performance and avoids constraining the model to a fixed set of answer choices.
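The generation step described above can be pictured as prompting an LLM over each paper's full text and parsing structured QA pairs out of its reply. The sketch below is a minimal illustration of that idea; the prompt wording and the `ask_llm` callable are assumptions for demonstration, not the authors' actual implementation.

```python
# Hypothetical sketch of SciQAG-style QA generation from a paper's full text.
# The prompt wording and the `ask_llm` callable are assumed, not taken from
# the SciQAG codebase.
import json
from typing import Callable


def build_generation_prompt(paper_text: str, n_pairs: int = 10) -> str:
    """Ask a generator LLM for open-ended QA pairs grounded in the paper."""
    return (
        f"Read the scientific paper below and write {n_pairs} open-ended "
        "question-answer pairs that test understanding of its key findings. "
        "Questions must be answerable from the text alone and must not refer "
        "to 'the paper' or 'the authors'. Return a JSON list of objects with "
        "'question' and 'answer' keys.\n\n"
        f"PAPER:\n{paper_text}"
    )


def generate_qa_pairs(paper_text: str, ask_llm: Callable[[str], str],
                      n_pairs: int = 10) -> list:
    """Run the generator and parse its JSON reply into QA dicts."""
    reply = ask_llm(build_generation_prompt(paper_text, n_pairs))
    pairs = json.loads(reply)
    # Keep only well-formed pairs with both fields non-empty.
    return [p for p in pairs if p.get("question") and p.get("answer")]
```

Keeping the LLM call behind a plain callable makes the parsing and filtering logic testable without a live model endpoint.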
**SciQAG Dataset** — SciQAG-24D is a dataset generated with this framework, comprising 188,042 high-quality question-answer pairs drawn from 22,743 scientific papers across 24 scientific domains. It is intended both for fine-tuning large language models (LLMs) and for assessing their scientific question answering ability. The researchers found that fine-tuning LLMs on SciQAG-24D significantly improves performance on open-ended question answering and scientific tasks. More importantly, the open-source nature of the SciQAG framework and dataset fosters collaboration and innovation in the AI for Science community, accelerating research and development in scientific question answering.
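Fine-tuning on such a dataset requires packing the QA pairs into a supervised-training layout. The snippet below shows one common single-turn prompt/completion JSONL arrangement as an assumed example; it is not SciQAG-24D's official schema.

```python
# Illustrative sketch: packing QA pairs into an instruction-tuning JSONL file.
# The prompt/completion field names are an assumed convention, not an official
# SciQAG-24D format.
import json


def to_sft_records(qa_pairs: list) -> list:
    """Map each QA pair to a single-turn supervised fine-tuning record."""
    return [{"prompt": p["question"], "completion": p["answer"]}
            for p in qa_pairs]


def write_jsonl(records: list, path: str) -> None:
    """Write one JSON object per line, preserving non-ASCII text."""
    with open(path, "w", encoding="utf-8") as f:
        for rec in records:
            f.write(json.dumps(rec, ensure_ascii=False) + "\n")
```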
**Evaluation System** — The evaluator in the SciQAG framework uses the composite RACAR metric (Relevance, Agnosticism, Completeness, Accuracy, Reasonableness) to gauge the quality of generated question-answer pairs. Using GPT-4 as the evaluator, the research team scored the generated QA pairs and compared the results against human evaluations to ensure consistency and accuracy of the assessment.
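An LLM-as-judge scorer along these lines might look like the following sketch. The prompt text, the 1-5 scale, and the `ask_judge` callable are assumptions introduced for illustration; only the five RACAR dimensions come from the article.

```python
# Hedged sketch of GPT-4-as-judge RACAR scoring. The prompt, the 1-5 scale,
# and the `ask_judge` callable are illustrative assumptions.
import json
from typing import Callable

RACAR_DIMENSIONS = ("relevance", "agnosticism", "completeness",
                    "accuracy", "reasonableness")


def build_racar_prompt(source_text: str, question: str, answer: str) -> str:
    """Ask the judge model to rate a QA pair against its source paper."""
    return (
        "Score the question-answer pair against the source text on five "
        "dimensions: " + ", ".join(RACAR_DIMENSIONS) + ". Reply with a JSON "
        "object mapping each dimension to an integer from 1 to 5.\n\n"
        f"SOURCE:\n{source_text}\n\nQ: {question}\nA: {answer}"
    )


def racar_score(source_text: str, question: str, answer: str,
                ask_judge: Callable[[str], str]) -> dict:
    """Parse the judge's JSON reply, rejecting missing or out-of-range scores."""
    reply = ask_judge(build_racar_prompt(source_text, question, answer))
    raw = json.loads(reply)
    for dim in RACAR_DIMENSIONS:
        if not 1 <= int(raw.get(dim, 0)) <= 5:
            raise ValueError(f"missing or out-of-range score for {dim}")
    return {dim: int(raw[dim]) for dim in RACAR_DIMENSIONS}
```

Validating the judge's output before use matters in practice, since LLM replies occasionally drop fields or drift off the requested scale.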
### Conclusion
The introduction of the SciQAG framework and dataset not only gives large models a new benchmark and evaluation system for complex scientific question answering but also propels the application and study of NLP technology in the scientific domain. This innovation enhances model performance in scientific knowledge understanding and reasoning while offering a valuable resource to scientists, educators, and AI researchers, promoting interdisciplinary collaboration and knowledge dissemination. The open-source nature of the SciQAG framework and dataset further accelerates the development of the AI for Science field, signaling a promising future for scientific question answering and NLP technology.
【来源】https://www.jiqizhixin.com/articles/2024-07-24-7