Notes
-----

Sequences and annotations from the 2024_04 release of the MGnify protein database.

Protein sequences are derived from the analysis of publicly available metagenomics assemblies within MGnify using our combined gene caller (which uses both Prodigal and FragGeneScan).

Sequences are assigned an MGYP accession. MGYPs are non-redundant, so proteins with the same sequence are assigned the same MGYP identifier. Sequences are mapped to the assembly (ERZ) and contig (MGYC) location where they are found. The biome assignments come from the biome assigned to the assembly in MGnify.

The sequences are then clustered using MMseqs/linclust, with coverage and sequence identity thresholds both set at 90%.

Pfam annotations are provided for the proteins by running HMMER using the Pfam significance thresholds (i.e. using the --cut_ga parameter in HMMER).

In the fasta files the header line includes the fields FL (1=full-length sequence, 0=partial sequence) and CR (1=cluster representative).

Statistics
----------

                Sequence        Cluster
Sequences:
  Total         2455939992      717738164
  Full          474611761       165275202
  Partial       1981328231      552462962
Residues        465976885243    121126699551

Cite us
-------

To cite MGnify, please refer to the following publication:

Richardson LJ, Allen B, Baldi G, Beracochea M, Bileschi M, Burdett T, Burgin J, Caballero-Pérez J, Cochrane G, Colwell L, Curtis T, Escobar-Zepeda A, Gurbich T, Kale V, Korobeynikov A, Raj S, Rogers AB, Sakharova E, Sanchez S, Wilkinson D and Finn RD (2023). MGnify: the microbiome sequence data analysis resource in 2023. Nucleic Acids Research (2023) doi:10.1093/nar/gkac1080

Copyright notice
----------------

The MGnify protein database is provided under the terms of the CC0 1.0 Universal (CC0 1.0) licence (https://creativecommons.org/publicdomain/zero/1.0/).

Files
-----

LICENCE.txt
  - licence under which the protein database is made available

md5sum.txt
  - md5 checksum for all files included in the release
  - format = checksum | filename

mgy_assemblies.tsv.gz
  - accessions of assemblies (ERZ) in which each MGYP is found
  - format = MGYP | ERZ (multiples separated by ";")

mgy_biome_counts.tsv.gz
  - observation counts of each biome
  - format = count | biome path

mgy_biomes.tsv.gz
  - biomes associated with each sequence. Sequences identified in multiple biomes will have multiple rows (one per biome) in the table
  - format = MGYP | count | biome

mgy_proteins_pfam.tsv.gz
  - Pfam annotations generated by HMMER for all MGYPs that result in an annotation
  - format = MGYP | Pfam accession | i-Evalue | bit score for the domain | from (hmm coord) | to (hmm coord) | from (env coord) | to (env coord)

mgy_cluster_seqs.tsv.gz
  - list of cluster representatives (column 1) with cluster member sequences (column 2)
  - format = MGYP | MGYP (multiples separated by ";")

mgy_clusters.fa.gz
  - fasta file of cluster representative sequences. Header line includes MGYP and FL.

mgy_clusters.tsv.gz
  - statistics and biome information for clusters
  - format = MGYP of CR | # sequences (cluster size) | # identical sequences | # assemblies contributing to cluster | redundant list of biomes for CR | non-redundant list of biomes in whole cluster

mgy_counts.tsv.gz
  - observation counts of each MGYP
  - format = MGYP | count

mgy_proteins_N.fa.gz
  - fasta sequences for all MGYPs. File split to reduce file size. Header information includes MGYP, FL and CR.

mgy_seq_metadata_N.tsv.gz
  - mapping of MGYP to assembly metadata. File split to reduce file size. Assembly metadata is encoded in the format "ERZ.MGYC:start-end:strand:full/partial". For strand, 1=fwd, -1=rev.
  - format = MGYP | assembly metadata (multiples separated by ";")

mgy_contig_map_N.tsv.gz
  - mapping of MGYC to contig name in assembly. File split to reduce file size.
  - format = MGYC | contig name

reassigned_mgyps.tsv.gz
  - mapping of suppressed MGYPs to their corresponding reassigned MGYPs
  - format = suppressed MGYP | reassigned MGYP
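
The assembly metadata encoding used in mgy_seq_metadata_N.tsv.gz ("ERZ.MGYC:start-end:strand:full/partial", multiples separated by ";") can be unpacked with a few string splits. The Python sketch below is illustrative only and is not part of the release. It assumes that columns are tab-separated (the "|" above is read as column notation), that the files have no header row, and that coordinates are integers. The names Location, parse_location, parse_metadata_line and read_metadata are hypothetical, and the fourth field is kept verbatim because its exact spelling is not specified here.

```python
import gzip
from typing import Iterator, List, NamedTuple, Tuple


class Location(NamedTuple):
    assembly: str      # ERZ accession
    contig: str        # MGYC accession
    start: int
    end: int
    strand: int        # 1 = forward, -1 = reverse
    completeness: str  # fourth field (full/partial), kept as written in the file


def parse_location(text: str) -> Location:
    """Parse one 'ERZ.MGYC:start-end:strand:full/partial' entry."""
    where, span, strand, completeness = text.strip().split(":")
    assembly, contig = where.split(".", 1)
    start, end = span.split("-", 1)
    return Location(assembly, contig, int(start), int(end), int(strand), completeness)


def parse_metadata_line(line: str, sep: str = "\t") -> Tuple[str, List[Location]]:
    """Split one row into the MGYP and its ';'-separated locations.

    Assumption: columns are separated by a tab.
    """
    mgyp, locations = line.rstrip("\n").split(sep, 1)
    return mgyp, [parse_location(item) for item in locations.split(";") if item]


def read_metadata(path: str) -> Iterator[Tuple[str, List[Location]]]:
    """Stream (MGYP, locations) pairs from one gzipped metadata file.

    Assumption: the file has no header row.
    """
    with gzip.open(path, "rt") as handle:
        for line in handle:
            if line.strip():
                yield parse_metadata_line(line)
```

For example, with made-up accessions, parse_metadata_line("MGYP000000000001\tERZ0000001.MGYC000000000001:10-250:-1:partial") returns the MGYP together with a single Location on the reverse strand. Streaming one row at a time keeps memory use flat, which matters given the size of the split files.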