News Title: "University of Washington Innovation: Translating Mass Spectra into Peptide Sequences"
Keywords: Mass Spectrometry, Peptide Sequences, Transformer
News Content: A research team at the University of Washington has developed a machine learning model named Casanovo, which uses the Transformer neural network architecture to translate the sequence of peaks in a tandem mass spectrum into the amino acid sequence of the peptide that generated it. The approach is a significant advance for proteomics, particularly in areas such as antibody sequencing, immunopeptidomics, and metaproteomics.
Trained on 30 million labeled spectra, Casanovo outperforms existing state-of-the-art methods on a cross-species benchmark dataset. The result indicates that Casanovo can detect and identify unexpected peptides and can assign peptide sequences to tandem mass spectra without prior information, which matters for in-depth proteomics research.
The research team also fine-tuned Casanovo for non-enzymatic peptides, which further improves its analytical capability in immunopeptidomics and metaproteomics experiments and lets scientists probe deeper into the dark proteome, that is, proteins that are difficult to identify with traditional methods.
The research was published in Nature Communications and offers a new perspective on the use of mass spectrometry in proteomics analysis. Mass spectrometry is currently the dominant analytical technique for characterizing proteomes, that is, for identifying and quantifying proteins in complex biological systems. The data produced by tandem mass spectrometry (MS/MS) are highly complex, and converting these spectra into the amino acid sequences of proteins has long been a challenge. Casanovo reframes de novo peptide sequencing as a machine translation problem. Using the Transformer architecture, it takes the pairs of m/z and intensity values that make up an MS/MS spectrum as direct input, without discretizing the m/z axis, and directly outputs the predicted peptide sequence, which simplifies the process.
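To make the contrast between a discretized m/z axis and raw (m/z, intensity) pairs concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not Casanovo's actual code: the toy peak values, the function names `bin_spectrum` and `encode_peak`, the bin width, and the sinusoidal feature scheme with its wavelength range are all hypothetical choices made for this example.

```python
import math

# Hypothetical toy spectrum: (m/z, intensity) pairs. Values are illustrative only.
spectrum = [(147.1128, 0.35), (147.1141, 0.10), (276.1554, 1.00), (389.2395, 0.62)]

def bin_spectrum(peaks, bin_width=0.1, max_mz=500.0):
    """Discretized view: sum intensities into fixed-width m/z bins."""
    n_bins = int(math.ceil(max_mz / bin_width))
    binned = [0.0] * n_bins
    for mz, intensity in peaks:
        idx = min(int(mz / bin_width), n_bins - 1)
        binned[idx] += intensity
    return binned

def encode_peak(mz, intensity, dim=8, min_wl=0.001, max_wl=10000.0):
    """Continuous view: sinusoidal features of the exact m/z, plus the intensity.

    Wavelengths are spaced geometrically between min_wl and max_wl
    (an assumed range chosen for this sketch).
    """
    half = dim // 2
    wavelengths = [
        min_wl * (max_wl / min_wl) ** (i / max(half - 1, 1)) for i in range(half)
    ]
    feats = [math.sin(2 * math.pi * mz / wl) for wl in wavelengths]
    feats += [math.cos(2 * math.pi * mz / wl) for wl in wavelengths]
    feats.append(intensity)
    return feats

binned = bin_spectrum(spectrum)
nonzero_bins = [i for i, value in enumerate(binned) if value > 0]
encoded = [encode_peak(mz, intensity) for mz, intensity in spectrum]

print("occupied bins:", nonzero_bins)        # the first two peaks share one bin
print("encoded peaks:", len(encoded), "vectors of length", len(encoded[0]))
```

In this sketch the two peaks near m/z 147.11 collapse into a single bin in the discretized view, while the per-peak encoding keeps them as two distinct inputs, one per peak, which is the kind of sequence a Transformer encoder can consume.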
Casanovo's success rests both on its large, high-quality training dataset and on the strengths of the Transformer architecture, which can learn long-range dependencies between sequence elements and can be parallelized for efficient training. The result suggests that deep learning models will play a growing role in proteomics and other areas of biology.
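The two Transformer properties mentioned here, long-range dependencies and parallel computation, both come from self-attention. The following Python sketch shows the core operation in its simplest form. It is a generic illustration, not Casanovo's implementation: the queries, keys, and values are all the raw input vectors (a real Transformer applies learned projections), and the four toy embeddings are hypothetical.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(vectors):
    """Scaled dot-product self-attention with Q = K = V = vectors.

    Simplification for this sketch: no learned projection weights.
    """
    d = len(vectors[0])
    outputs, all_weights = [], []
    # Each query row depends only on the shared inputs, so all rows
    # could be computed at the same time on parallel hardware.
    for q in vectors:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in vectors]
        weights = softmax(scores)
        out = [sum(w * v[j] for w, v in zip(weights, vectors)) for j in range(d)]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

# Hypothetical toy embeddings: positions 0 and 3 are similar but far apart.
peaks = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [1.0, 0.0]]
outputs, weights = self_attention(peaks)

print("attention from position 0:", [round(w, 3) for w in weights[0]])
```

Position 0 attends to position 3 exactly as strongly as to itself, even though they sit at opposite ends of the sequence, because attention weights depend on content similarity rather than distance.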
Source: https://www.jiqizhixin.com/articles/2024-08-11-14