The P10K database: a data portal for the protist 10 000 genomes project

Protists are unicellular eukaryotes, including unicellular eukaryotic algae and protozoa. They are highly diverse and widely distributed across all kinds of aquatic environments, and they play an important role in ecological balance, material and energy cycling, environmental health, and the occurrence of plant and animal diseases. They are important components of aquatic ecosystems: major primary producers and oxygen producers, key participants in the carbon cycle, excellent feed for aquatic animals, sources of human nutrition and bioenergy, “sentinels” of the aquatic environment, the culprits behind water blooms and red tides, important pathogens of humans, livestock, poultry and fish, and “good partners” in mutualistic symbiosis.

The Institute of Hydrobiology of the Chinese Academy of Sciences (CAS) leads the Protist 10,000 Genomes Project (P10K), which aims to build a large-scale database of protist genetic resources and to remedy the extreme scarcity of such data.

Recently, the Institute of Hydrobiology and the Beijing Institute of Genomics of the Chinese Academy of Sciences (China National Center for Bioinformation) jointly released the first batch of P10K data. The data were released and shared through the P10K database, the database of the Protist 10,000 Genomes Project. The results were published in Nucleic Acids Research under the title “The P10K database: a data portal for the protist 10,000 genomes project”.

The first batch of P10K data contains 2,959 protist datasets, including 1,601 genomic and 1,358 transcriptomic datasets, covering 75% of the phyla and 45% of the orders of protists. Of these, 1,858 datasets were integrated by the P10K team from public databases, and 1,101 datasets were newly sequenced, most of them from ciliated protozoa (ciliates). The newly sequenced data increased the size of the protist dataset by 37% overall. The newly sequenced samples were collected and isolated by the P10K team from a variety of habitats across the country. Because the majority of protists cannot be cultured in the laboratory, the team relied on single-cell sequencing, which accounts for about 98% of the newly sequenced data.

To handle the analysis of large-scale single-cell genomics data, the P10K team also developed a standardized analysis pipeline for the assembly, decontamination, species identification, gene annotation and evaluation of single-cell sequencing data from protists. Quality assessment shows that the genomes annotated with this pipeline contain a proportion of medium- to high-quality data similar to that of genomes released through public databases.
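The dataset counts quoted above can be cross-checked with a few lines of arithmetic. The sketch below is only an illustration, not part of the P10K tooling: the variable names and the `share` helper are hypothetical, and reading the 37% figure as the newly sequenced share of the total is an assumption rather than something stated in the release.

```python
# Dataset counts quoted in the text above.
TOTAL = 2959
GENOMIC = 1601
TRANSCRIPTOMIC = 1358
PUBLIC = 1858
NEWLY_SEQUENCED = 1101


def share(part: int, whole: int) -> float:
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


# Both breakdowns should add up to the same total.
assert GENOMIC + TRANSCRIPTOMIC == TOTAL
assert PUBLIC + NEWLY_SEQUENCED == TOTAL

# Assumption: the 37% in the text is the newly sequenced share of all datasets.
new_share_of_total = share(NEWLY_SEQUENCED, TOTAL)
# For comparison: growth measured against the public datasets alone.
growth_over_public = share(NEWLY_SEQUENCED, PUBLIC)

print(f"newly sequenced share of total: {new_share_of_total}%")
print(f"growth relative to public data: {growth_over_public}%")
```

Under these assumptions, both breakdowns sum to 2,959, and the newly sequenced datasets make up roughly 37% of the combined total, which appears to be the figure the release refers to.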

As an important part of the Protist 10,000 Genomes Project, the establishment and sharing of the P10K database will help advance research on fundamental scientific questions such as the origin of eukaryotes and multicellular organisms, eukaryotic biodiversity, the adaptation of protists to extreme environments, and microbial interactions. It will also promote the mining and potential application of protist genetic resources related to environmental protection, pollutant degradation and transformation, nutrition and health, and disease control. In addition, because protists are key components of plankton, the P10K database will support environmental DNA-based plankton identification and aid the assessment of aquatic ecosystem health.

Notably, the P10K database establishes a link between the National Aquatic Biological Germplasm Resource Bank/National Parasite Resource Bank (living germplasm resources) and the National Genomics Data Center (genetic resources), which is of great significance for promoting information interconnection and data sharing across the national platforms for sharing scientific and technological resources.

The research was funded by the National Key R&D Program, the Strategic Priority Research Program of the Chinese Academy of Sciences (CAS), the International Cooperation Program of CAS, the CAS Youth Innovation Promotion Association, the National Natural Science Foundation of China (NSFC), and the International Union of Biosciences (IUBMB) Open Biodiversity and Big Data for Health Program (OBHBDP), and was supported by the National Aquatic Germplasm Repository (NARRB) and the Wuhan Branch of the Chinese Academy of Sciences (WBAC).

