**Zhejiang University and Tencent Jointly Release a Large-Scale Scientific LLM Evaluation Benchmark: Domestic Large Models Show Strong Performance**
Keywords: Zhejiang University and Tencent joint research, SciKnowEval benchmark, domestic large models
Researchers from the NLP Lab at Zhejiang University and Tencent AI Lab have recently built SciKnowEval, a new benchmark for evaluating scientific intelligence. It is intended to fill a gap in the comprehensive evaluation of large language models (LLMs) in scientific domains. The benchmark defines five levels of scientific intelligence, L1 through L5, and contains 50,000 evaluation questions covering chemistry and biology.
The benchmark was used to test 20 large language models, both open-source and closed-source. According to the results, very large models described as having hundreds of billions to trillions of parameters, such as GPT-4o, Gemini1.5-Pro, and Claude3-Sonnet, performed particularly well. Their overall performance was significantly better than that of small and medium-sized open-source models such as Qwen1.5 and Llama3. The article describes the work as a new milestone in evaluating LLM performance for scientific research.
The work, titled "SciKnowEval: Evaluating Multi-level Scientific Knowledge of Large Language Models", has been posted on the preprint platform arXiv. The paper link is: [Link Address]. The benchmark gives researchers a unified evaluation tool and points to directions for improving and optimizing future models, which should help advance the application of LLMs in scientific research. As the field progresses, more high-quality LLMs can be expected to play a larger role in research.
Source: https://www.jiqizhixin.com/articles/2024-07-02-2