Tag: Reference genome

An Updated Bacterial and Archaeal Reference Genome Collection is Available!

Download the updated bacterial and archaeal reference genome collection! We built this collection of 22,420 genomes by selecting the “best” genome assembly for each species among the 450,000+ prokaryotic genomes in RefSeq. 

What’s new? 
  • One species is represented in this collection for the first time 
  • 323 species are represented by a better assembly
  • Six species were removed because of changes in NCBI Taxonomy or uncertainty in their species assignment 

An Updated Bacterial and Archaeal Reference Genome Collection is Available!

Download the updated bacterial and archaeal reference genome collection! We built this collection of 22,082 genomes by selecting the “best” genome assembly for each species among the 440,000+ prokaryotic genomes in RefSeq. 

What’s new? 
  • 28 species are represented in this collection for the first time 
  • 228 species are represented by a better assembly 
  • Six species were removed because of changes in NCBI Taxonomy or uncertainty in their species assignment 

Now Available: Updated Bacterial and Archaeal Reference Genome Collection

Download the updated bacterial and archaeal reference genome collection! We built this collection of 21,794 genomes, 536 more than were included in the January release, by selecting the “best” genome assembly for each species among the 400,000+ prokaryotic genomes in RefSeq.

Updated Genomes Terminology! “Representative Genome” is Replaced with “Reference Genome”

NCBI is streamlining the terminology around our reference genomes. We currently have a small set of genomes collectively called representatives and an even smaller set called references. We have slowly converged on the term reference to refer to both sets.  

A genome is labeled reference if it is deemed to be the best available genome for the species based on assembly, annotation metrics (when available), and, in a small number of cases, curatorial review. The set of eukaryotic reference assemblies is updated continuously as new assemblies are submitted to GenBank. The set of prokaryotic references is recalculated three times a year.

Important Note: Classification as “reference genome” is separate from inclusion in RefSeq. While genomes in RefSeq are preferentially used to pick the reference genome, a reference genome can also be chosen for species not included in RefSeq.

Announcing Human Annotation Release 110

The annotation of human assemblies GRCh38.p14 and T2T-CHM13v2.0

We are happy to announce the first de novo annotation of human T2T-CHM13v2.0, the gapless assembly generated by the T2T Consortium, and the full re-annotation of the human reference assembly, GRCh38.p14. We hope the results will serve both those eager to explore newly sequenced regions of the genome, including telomeres and centromeres, and those interested in refreshing their interpretation of the human reference in light of recently curated transcripts and the new transcriptomic and other data incorporated in the annotation.

New RefSeq annotations!

In December and January, the NCBI Eukaryotic Genome Annotation Pipeline released twenty-four new annotations in RefSeq for the following organisms:

    • Aegilops tauschii (monocot)
    • Camelus bactrianus (Bactrian camel)
    • Colias croceus (clouded yellow)
    • Echinops telfairi (small Madagascar hedgehog)
    • Harmonia axyridis (beetle)
    • Lemur catta (ring-tailed lemur)
    • Leopardus geoffroyi (Geoffroy’s cat)
    • Macaca fascicularis (crab-eating macaque)
    • Maniola jurtina (meadow brown)
    • Meles meles (Eurasian badger)
    • Melitaea cinxia (Glanville fritillary)

Announcing the re-annotation of RefSeq genome assemblies for E. coli and four other species!

We have re-annotated all RefSeq genomes for Escherichia coli, Mycobacterium tuberculosis, Bacillus subtilis, Acinetobacter pittii, and Campylobacter jejuni using the most recent release of PGAP. You will find that more genes now have gene symbols (e.g. recA). Your feedback indicated that the lack of symbols was an impediment to comparative analysis, so we hope that this improvement will help.

The number of re-annotated genomes is 25,619 for E. coli, 470 for B. subtilis, 6,828 for M. tuberculosis, 316 for A. pittii, and 1,829 for C. jejuni. On average, the increase in gene symbols is 30% in E. coli, 110% in B. subtilis, 57% in M. tuberculosis, 94% in A. pittii, and 62% in C. jejuni (see Figure 1). After re-annotation, on average, 73% of PGAP-annotated E. coli genes and 79% of B. subtilis genes have symbols (35% for M. tuberculosis, 40% for A. pittii, and 46% for C. jejuni). We assigned symbols to the annotated genes by calculating the orthologs between the genome of interest and the reference assembly for the species, and transferring the symbols from the reference genes to their orthologs in the annotated genomes.
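The symbol transfer described above can be pictured with a small Python sketch. It is only an illustration under stated assumptions: the ortholog table is taken as given (PGAP's ortholog calculation is not reproduced here), and the function names, gene IDs, and data layout are all hypothetical.

```python
def transfer_symbols(reference_symbols, orthologs):
    """Copy gene symbols from reference genes to their orthologs.

    reference_symbols: dict of reference gene ID -> gene symbol
    orthologs: dict of annotated gene ID -> reference gene ID,
               or None when no ortholog was found (hypothetical layout)
    """
    transferred = {}
    for gene_id, ref_id in orthologs.items():
        symbol = reference_symbols.get(ref_id)
        if symbol is not None:
            # Only genes whose ortholog carries a symbol receive one.
            transferred[gene_id] = symbol
    return transferred


def symbol_coverage(transferred, total_genes):
    """Percentage of annotated genes that ended up with a symbol."""
    return 100.0 * len(transferred) / total_genes


# Made-up example data, not taken from any real genome.
reference_symbols = {"ref_0001": "recA", "ref_0002": "dnaA"}
orthologs = {"g1": "ref_0001", "g2": "ref_0002", "g3": None, "g4": "ref_9999"}

symbols = transfer_symbols(reference_symbols, orthologs)
coverage = symbol_coverage(symbols, total_genes=len(orthologs))
```

Here g3 has no ortholog and g4 maps to a reference gene without a symbol, so neither receives one. The coverage figure corresponds to the "percentage of genes with symbols" quoted in the post.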

Figure 1: Average and standard deviation of the number of genes annotated with symbols per genome, in the previous (blue) and the current annotation (orange). 

Announcing the RefSeq annotation of rat mRatBN7.2!

NCBI RefSeq has finished its initial annotation of the new rat reference assembly, mRatBN7.2, recently released by the Darwin Tree of Life Project at the Wellcome Sanger Institute. This is the first coordinate-changing update to the rat reference since the 2014 release of Rnor_6.0 from the Rat Genome Sequencing Consortium and brings the rat assembly into the modern age with a nearly 300x increase in contig N50 and 9x increase in scaffold N50 lengths. It’s a major improvement!

Updated and improved collection of RefSeq representative genome assemblies now available

We have updated the collection of representative genome assemblies for Bacteria and Archaea. As announced in April, this set is now recalculated three times a year. We selected a total of 11,727 prokaryotic assemblies to represent their respective species among the 192,000 assemblies in RefSeq. Six hundred and thirty-five species were included in the collection for the first time, while 395 organisms from undefined species (such as Bacillus bacterium) were removed. For 18% of bacterial and archaeal species, we were able to choose a higher-quality representative than in the previous set, thanks to improvements in the selection logic. The selection is now based on the assembly length, the number of pseudo CDSs called in the PGAP annotation, the number of scaffolds, whether Gene IDs are available in the Gene database for the assembly that is currently representative, and type strain status. You can see the exact criteria in order of importance on the Prokaryotic RefSeq Genomes page. Now that the new selection process is in place, we expect future updates to have fewer changes. We will replace a representative only if the assembly has changed RefSeq status or if a substantially better assembly becomes available.
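To make the idea of ranked selection criteria concrete, here is a minimal Python sketch. It is not NCBI's selection logic: the order of the criteria and the direction of each comparison are assumptions made for illustration (the real order is on the Prokaryotic RefSeq Genomes page), and assembly length and Gene ID availability are left out because the post does not say how they are scored. The class, field names, and accessions are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Assembly:
    accession: str        # hypothetical placeholder, not a real accession
    is_type_strain: bool
    pseudo_cds: int       # pseudo CDSs called in the annotation
    scaffolds: int


def pick_representative(candidates):
    """Pick one assembly per species using a ranked sort key.

    Assumed ranking (illustrative only): type strain first, then
    fewer pseudo CDSs, then fewer scaffolds.
    """
    return min(
        candidates,
        key=lambda a: (not a.is_type_strain, a.pseudo_cds, a.scaffolds),
    )


candidates = [
    Assembly("asm_A", is_type_strain=False, pseudo_cds=10, scaffolds=1),
    Assembly("asm_B", is_type_strain=True, pseudo_cds=40, scaffolds=3),
    Assembly("asm_C", is_type_strain=True, pseudo_cds=40, scaffolds=5),
]
best = pick_representative(candidates)
```

Comparing tuples makes the "order of importance" explicit: a later criterion only matters when all earlier ones tie, which is why asm_B is preferred over asm_C in this made-up example.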

We have updated the database on the Microbial Nucleotide BLAST page, as well as the basic nucleotide BLAST RefSeq Representative Genome Database, to reflect these changes.

You can download the reference and representative set from the Assembly resource. If you are interested in the annotation on these genomes, you can limit searches to proteins annotated on representative genomes by adding “refseq_select[filter]” to any query in the Protein database. For example, you can find all proteins annotated on representative genomes in the genus Klebsiella by using the query “Klebsiella[organism] AND refseq_select[filter]”. A BLAST database of proteins annotated on representative genomes will be coming soon. Stay tuned!
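The same query can also be sent programmatically. The sketch below builds an E-utilities ESearch URL for the Protein database; it only constructs the URL and makes no network request. The helper name and the retmax default are assumptions for illustration, not part of the original announcement.

```python
from urllib.parse import urlencode

# Public E-utilities ESearch endpoint.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def protein_query_url(organism, retmax=20):
    """Build an ESearch URL for proteins annotated on representative genomes.

    protein_query_url is a hypothetical helper; the query string itself
    is the one given in the post.
    """
    term = f"{organism}[organism] AND refseq_select[filter]"
    params = {"db": "protein", "term": term, "retmax": retmax, "retmode": "json"}
    # urlencode percent-encodes the square brackets and encodes spaces as '+'.
    return f"{ESEARCH}?{urlencode(params)}"


url = protein_query_url("Klebsiella")
```

Fetching the resulting URL (for example with urllib.request.urlopen) should return a JSON list of matching protein UIDs. Whether the filter still behaves as described in this older post is worth checking against the current Entrez documentation.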

Major update for the NCBI RefSeq mouse GRCm38.p6 annotation

We have updated our annotation for the mouse reference genome, GRCm38.p6. It includes:

  • Markup for RefSeq Select, which identifies one representative transcript and protein for every protein-coding gene. Find features with the ‘tag=RefSeq Select’ attribute in GFF3 for those analyses where you need just a single transcript or protein for each coding gene. You can also find these RefSeqs in Entrez using the query ‘refseq_select[filter]’.
  • Annotation updates made in the last year for over 2000 genes, including over 4000 new or revised curated transcripts. This includes targeted curation to ensure we are representing well-expressed and conserved transcripts for inclusion in RefSeq Select.
  • Annotation of over 2300 regulatory and other functional element features from over 900 biological regions. These are now identified with the source “RefSeqFE” in GFF3 column 2 for easy parsing.
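
Both markers mentioned above are easy to pick out when parsing the GFF3 file. The following Python sketch shows one way to do it, assuming standard nine-column, tab-separated GFF3 with semicolon-separated attributes. The function names and the sample lines are made up for illustration and are not taken from the actual annotation files.

```python
def parse_attributes(column9):
    """Split GFF3 column 9 into a dict of key -> list of values."""
    attrs = {}
    for field in column9.strip().split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            # GFF3 allows several comma-separated values per attribute.
            attrs[key] = value.split(",")
    return attrs


def classify(line):
    """Label a GFF3 line by the two markers described in the post."""
    if line.startswith("#") or not line.strip():
        return None  # directive, comment, or blank line
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        return None  # not a well-formed feature line
    if cols[1] == "RefSeqFE":  # source is GFF3 column 2
        return "functional_element"
    if "RefSeq Select" in parse_attributes(cols[8]).get("tag", []):
        return "refseq_select"
    return "other"


# Made-up sample lines for illustration only.
lines = [
    "##gff-version 3",
    "chr_demo\tBestRefSeq\tmRNA\t100\t900\t.\t+\t.\tID=rna-demo1;tag=RefSeq Select",
    "chr_demo\tRefSeqFE\tenhancer\t1200\t1500\t.\t.\t.\tID=id-demo2",
    "chr_demo\tGnomon\tmRNA\t2000\t2900\t.\t-\t.\tID=rna-demo3",
]
labels = [classify(line) for line in lines]
```

Splitting the tag value on commas means a feature carrying more than one tag is still recognized as RefSeq Select.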

When citing, please refer to this annotation as NCBI Mus musculus Annotation Release 108.20200622. You can find the data in:

This is our last update before upgrading to the new major assembly version just released by the Genome Reference Consortium, GRCm39. We expect to be cranking up our compute farm in the next few weeks to produce a full annotation based on our latest curation and extensive short (Illumina) and long (PacBio IsoSeq and nanopore) RNA-seq data, which should be released later this summer. Stay tuned!