Notes
-----

Sequences and annotations from the 2024_04 release of the MGnify protein database. 

Protein sequences are derived from the analysis of publicly available metagenomics assemblies within MGnify using our combined gene caller (which uses both Prodigal and FragGeneScan). Sequences are assigned an MGYP accession. MGYPs are non-redundant, so proteins with the same sequence are assigned the same MGYP identifier. Sequences are mapped to the assembly (ERZ) and contig (MGYC) location where they are found. The biome assignments come from the biome assigned to the assembly in MGnify. The sequences are then clustered using MMseqs/linclust, with coverage and sequence identity thresholds set at 90%.

Pfam annotations are provided for the proteins by running HMMER using the Pfam gathering (significance) thresholds (i.e. using the --cut_ga option in HMMER).

In the fasta files the header line includes fields FL (1=full-length sequence, 0=partial sequence) and CR (1=cluster representative).
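
As an illustration of how the FL and CR flags might be read, here is a minimal Python sketch. The exact header layout is an assumption: it presumes the accession comes first, followed by whitespace-separated KEY=VALUE tokens (for example ">MGYP000000000001 FL=1 CR=0"). The accession shown is made up; check a real header line before relying on this.

```python
def parse_header(line):
    """Split a FASTA header line into (accession, fields).

    Assumed layout (not confirmed by this README):
        >MGYP000000000001 FL=1 CR=0
    """
    tokens = line.lstrip(">").split()
    accession = tokens[0]
    fields = {}
    for token in tokens[1:]:
        key, sep, value = token.partition("=")
        if sep:  # keep only KEY=VALUE tokens
            fields[key] = value
    return accession, fields


# Hypothetical header, used only to show the call.
acc, fields = parse_header(">MGYP000000000001 FL=1 CR=0")
is_full_length = fields.get("FL") == "1"   # FL: 1=full-length, 0=partial
is_cluster_rep = fields.get("CR") == "1"   # CR: 1=cluster representative
```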


Statistics
----------

                Sequence         Cluster
Sequences:
Total         2455939992       717738164
Full           474611761       165275202
Partial       1981328231       552462962

Residues    465976885243    121126699551
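
The figures above can be cross-checked with a few lines of Python. The sketch below only restates the table; it assumes the "Cluster" column counts cluster representatives and their residues.

```python
# Figures copied from the Statistics table.
seq_total, seq_full, seq_partial = 2455939992, 474611761, 1981328231
clu_total, clu_full, clu_partial = 717738164, 165275202, 552462962
seq_residues, clu_residues = 465976885243, 121126699551

# Full + partial should account for every sequence and every cluster.
assert seq_full + seq_partial == seq_total
assert clu_full + clu_partial == clu_total

full_fraction = seq_full / seq_total          # share of full-length sequences
clusters_per_seq = clu_total / seq_total      # clusters per sequence
mean_seq_length = seq_residues / seq_total    # mean residues per sequence
mean_rep_length = clu_residues / clu_total    # mean residues per cluster entry

print(f"full-length sequences: {full_fraction:.1%}")
print(f"clusters per sequence: {clusters_per_seq:.3f}")
print(f"mean sequence length:  {mean_seq_length:.1f}")
print(f"mean cluster length:   {mean_rep_length:.1f}")
```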

Cite us
-------

To cite MGnify, please refer to the following publication:

Richardson LJ, Allen B, Baldi G, Beracochea M, Bileschi M, Burdett T, Burgin J, Caballero-Pérez J, Cochrane G, Colwell L, Curtis T, Escobar-Zepeda A, Gurbich T, Kale V, Korobeynikov A, Raj S, Rogers AB, Sakharova E, Sanchez S, Wilkinson D and Finn RD (2023). MGnify: the microbiome sequence data analysis resource in 2023. Nucleic Acids Research (2023) doi:10.1093/nar/gkac1080

Copyright notice
----------------

The MGnify protein database is provided under the terms of the CC0 1.0 Universal (CC0 1.0) licence (https://creativecommons.org/publicdomain/zero/1.0/).


Files
-----

LICENCE.txt
- licence under which the protein database is made available

md5sum.txt
- md5 checksum for all files included in release
- format = checksum | filename
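
Downloaded files can be checked against md5sum.txt with a short script. This is a sketch, not an official tool: it assumes each line holds the checksum and the filename separated by whitespace (as written by the md5sum utility) and that the files sit in the same directory as md5sum.txt.

```python
import hashlib
from pathlib import Path


def md5_of(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(manifest="md5sum.txt"):
    """Return the names of files that are missing or fail the checksum."""
    failures = []
    base = Path(manifest).parent
    with open(manifest) as handle:
        for line in handle:
            parts = line.split(None, 1)
            if len(parts) != 2:
                continue  # skip blank or malformed lines
            expected = parts[0].lower()
            name = parts[1].strip().lstrip("*")  # "*" marks binary mode
            target = base / name
            if not target.is_file() or md5_of(target) != expected:
                failures.append(name)
    return failures


# Usage: failures = verify("md5sum.txt"); an empty list means all files match.
```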

mgy_assemblies.tsv.gz
- accessions of assemblies (ERZ) in which each MGYP is found
- format = MGYP | ERZ (multiples separated by ";")

mgy_biome_counts.tsv.gz
- observation counts of each biome
- format = count | biome path

mgy_biomes.tsv.gz
- biomes associated with each sequence. Sequences identified in multiple biomes will have multiple rows (one per biome) in the table
- format = MGYP | count | biome

mgy_proteins_pfam.tsv.gz
- Pfam annotations generated by HMMER for all MGYPs that result in an annotation
- format = MGYP | Pfam accession | i-Evalue | bit score for the domain | from (hmm coord) | to (hmm coord) | from (env coord) | to (env coord)
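
One way to read these rows in Python is sketched below. It assumes the file is tab-separated with no header row and that the eight columns appear in the order listed above. The names PfamHit, parse_hit and read_hits are illustrative, and the example row is made up.

```python
import csv
import gzip
from typing import Iterator, NamedTuple


class PfamHit(NamedTuple):
    mgyp: str
    pfam: str
    i_evalue: float
    bit_score: float
    hmm_from: int
    hmm_to: int
    env_from: int
    env_to: int


def parse_hit(row):
    """Convert one 8-column row (list of strings) into a PfamHit."""
    return PfamHit(
        row[0], row[1], float(row[2]), float(row[3]),
        int(row[4]), int(row[5]), int(row[6]), int(row[7]),
    )


def read_hits(path) -> Iterator[PfamHit]:
    """Stream hits from the gzipped TSV without loading it into memory."""
    with gzip.open(path, "rt", newline="") as handle:
        for row in csv.reader(handle, delimiter="\t"):
            if row:  # skip empty lines
                yield parse_hit(row)


# Hypothetical row, used only to show the field types.
example = parse_hit(
    ["MGYP000000000001", "PF00001", "1.2e-10", "45.3", "1", "100", "5", "110"]
)
```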

mgy_cluster_seqs.tsv.gz
- list of cluster representatives (column 1) with cluster member sequences (column 2)
- format = MGYP | MGYP (multiples separated by ";")
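
The representative-to-members mapping can be streamed as sketched below. The sketch assumes tab-separated columns. Whether the member list also contains the representative itself is not stated here, so the code makes no assumption about it. The function name iter_clusters is illustrative.

```python
import gzip


def iter_clusters(path):
    """Yield (representative, [members]) pairs, one cluster per line."""
    with gzip.open(path, "rt") as handle:
        for line in handle:
            line = line.rstrip("\n")
            if not line:
                continue
            rep, _, members = line.partition("\t")
            # Members are separated by ";"; drop empty strings.
            yield rep, [m for m in members.split(";") if m]


# Usage (file name from this release):
# for rep, members in iter_clusters("mgy_cluster_seqs.tsv.gz"):
#     print(rep, len(members))
```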

mgy_clusters.fa.gz
- fasta file of cluster representative sequences. Header line includes MGYP and FL

mgy_clusters.tsv.gz
- statistics and biome information for clusters
- format = MGYP of CR | # sequences (cluster size) | # identical sequences | # assemblies contributing to cluster | redundant list of biomes for CR | non-redundant list of biomes in whole cluster

mgy_counts.tsv.gz
- observation counts of each MGYP
- format = MGYP | count

mgy_proteins_N.fa.gz
- fasta sequences for all MGYPs. File split to reduce file size. Header information includes MGYP, FL and CR

mgy_seq_metadata_N.tsv.gz
- mapping of MGYP to assembly metadata. File split to reduce file size. Assembly metadata is encoded in the format "ERZ.MGYC:start-end:strand:full/partial". For strand, 1=fwd, -1=rev.
- format = MGYP | assembly metadata (multiples separated by ";")
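
The assembly metadata string can be unpacked as sketched below, assuming the fields appear exactly as in "ERZ.MGYC:start-end:strand:full/partial". The last field is kept as a raw string because its exact spelling in the files is not confirmed here. The names Location, parse_location and parse_metadata are illustrative, and the accessions in the example are made up.

```python
from typing import NamedTuple


class Location(NamedTuple):
    erz: str           # assembly accession
    mgyc: str          # contig accession
    start: int
    end: int
    strand: int        # 1=fwd, -1=rev
    completeness: str  # full/partial field, kept as found


def parse_location(entry):
    """Parse one 'ERZ.MGYC:start-end:strand:full/partial' entry."""
    contig, span, strand, completeness = entry.split(":")
    erz, mgyc = contig.split(".", 1)
    start, end = span.split("-", 1)
    return Location(erz, mgyc, int(start), int(end), int(strand), completeness)


def parse_metadata(field):
    """Parse the second column: entries separated by ';'."""
    return [parse_location(entry) for entry in field.split(";") if entry]


# Hypothetical entries, used only to show the result shape.
locations = parse_metadata(
    "ERZ0000001.MGYC000000000001:10-250:-1:partial;"
    "ERZ0000002.MGYC000000000002:1-300:1:full"
)
```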

mgy_contig_map_N.tsv.gz
- mapping of MGYC to contig name in assembly. File split to reduce file size.
- format = MGYC | contig name

reassigned_mgyps.tsv.gz
- mapping of suppressed MGYPs to their corresponding reassigned MGYPs
- format = suppressed MGYP | reassigned MGYP