Base pairs on a DNA molecule. Illustration: VCG
BGI-Research and Zhejiang Lab on Thursday jointly released Genos, the world's first deployable general-purpose foundation model for genomes with 10 billion parameters. The model supports ultra-long context analysis of up to 1 million base pairs and achieves accurate identification at single-base resolution.
The human genome consists of about 3 billion base pairs. The Human Genome Project decoded the sequence, but the specific functions of the individual bases in that sequence still need to be precisely identified and analyzed.
Most existing models are trained on only one or two reference genomes, which makes it difficult for them to reflect the diversity of human genetic resources. Genos, by contrast, integrates multiple authoritative public resources, including data from the Human Pangenome Reference Consortium and the Human Genome Structural Variation Consortium, the Global Times learned from BGI, a Chinese company specializing in DNA and gene sequencing.
For the first time, a model of this kind uses 636 high-quality "telomere-to-telomere" human genomes from around the world as training data. These data cover different populations worldwide, giving the model a more comprehensive picture of human genetic diversity.
In terms of architecture, Genos adopts a "mixture of experts" framework, which dispatches each input to the most relevant "expert" modules for collaborative processing. Although the model has 10 billion parameters in total, this design reduces inference costs and resource consumption, making the model both powerful and easy to use.
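The idea behind a "mixture of experts" can be illustrated with a minimal sketch of top-k routing, a common way such frameworks are built. The article does not describe how Genos itself selects experts, so everything here is an assumption for illustration: the function names, the toy "experts" that merely scale their input, and the hand-picked gate weights are all hypothetical. The point is only that a gate scores every expert, but just the k best-scoring ones are actually run, which is how total parameter count can grow without a matching rise in inference cost.

```python
import math


def softmax(scores):
    """Convert raw gate scores into probabilities that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_route(gate_scores, k=2):
    """Pick the k highest-scoring experts and renormalize their weights."""
    probs = softmax(gate_scores)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in chosen)
    return [(i, probs[i] / norm) for i in chosen]


def moe_forward(x, gate_weights, experts, k=2):
    """Run only the selected experts on x and blend their outputs.

    x            -- input vector (list of floats)
    gate_weights -- one weight vector per expert (hypothetical linear gate)
    experts      -- list of callables, each mapping a vector to a vector
    """
    gate_scores = [sum(w * xi for w, xi in zip(row, x)) for row in gate_weights]
    routes = top_k_route(gate_scores, k)
    output = [0.0] * len(x)
    for index, weight in routes:
        expert_out = experts[index](x)  # unselected experts are never called
        output = [o + weight * y for o, y in zip(output, expert_out)]
    return output, [index for index, _ in routes]


def make_scaling_expert(factor):
    """Toy stand-in for an expert network: multiplies the input by a constant."""
    return lambda x: [factor * xi for xi in x]


# Four toy experts, of which only two are evaluated per input.
experts = [make_scaling_expert(f) for f in (1.0, 2.0, 3.0, 4.0)]
gate_weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
output, used = moe_forward([0.5, 2.0], gate_weights, experts, k=2)
print("experts used:", used)
print("blended output:", output)
```

In this toy setup the gate favors the second and first experts for the given input, so the other two are skipped entirely; a production model would replace the scaling functions with neural sub-networks and learn the gate weights during training.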
Test results show that Genos achieves 92 percent accuracy in interpreting pathogenic mutations, a task directly relevant to clinical applications; when combined with scientific foundation models, the accuracy reaches 98.3 percent. Multiple comprehensive evaluations also indicate that Genos outperforms the current best-performing models.
A person in charge at BGI-Research said that the Genos model has been fully open-sourced on machine learning and AI platforms such as Hugging Face and ModelScope, in two versions with 1.2 billion and 10 billion parameters to meet different needs.
In scientific research, the Genos model works alongside BGI's DCS Cloud to predict RNA expression profiles from DNA sequences in mere seconds, drastically speeding up bioinformatics workflows that previously took weeks or months.
Its integration into the China National GeneBank DataBase (CNGBdb) allows users to accurately forecast cell expression levels, pinpoint and validate critical candidate genes, and accelerate scientific discovery.
For clinical use, the Genos model, paired with BGI's GeneT deep reasoning model, provides advanced multimodal interpretations to aid in diagnosing genetic diseases.
In personal health applications, the seamless integration of the Genos model into BGI's BGE platform facilitates the analysis of personal genomic data, transforming complex genetic information into easy-to-understand, personalized health reports.
The launch of Genos marks a crucial turning point in genomic research, moving from "reading" the base sequences of DNA to "understanding" the underlying logic of life. It is expected to bring breakthroughs to clinical disease diagnosis, personal genome interpretation and cutting-edge scientific research, according to BGI.