How We Built the First AI-Generated Genomes

Going from designing individual genes to designing complete genomes is an exceptionally challenging problem. We have previously shown that genomic foundation models like the Evo series can generate single proteins and multi-component systems such as CRISPR-Cas complexes or interacting protein pairs. However, since before we developed Evo, one of our longstanding goals has been to design a complete, functional genome with a biological language model.

Genome design requires orchestrating multiple interacting genes and regulatory elements while maintaining a balance that enables replication, host specificity, and evolutionary fitness. This increase in complexity introduces new constraints and failure modes that do not arise when only designing a single protein or a two-component system.

Here we detail some of the technical innovations that enabled us to generate viable bacteriophage genomes with substantial evolutionary novelty. Our approach required developing a comprehensive computational and experimental design framework, including a custom gene annotation pipeline for overlapping reading frames, systematic fine-tuning and prompt engineering strategies for sampling from genome language models, and new screening protocols for synthetic phage genomes.

For further details, see the manuscript posted on bioRxiv.

Reading, Writing, and Designing ΦX174

Because generating synthetic genomes requires clear design criteria, we chose bacteriophage ΦX174 as our design template for practical and historical reasons. At 5,386 nucleotides encoding 11 genes, ΦX174 sits at the upper limit of what current DNA synthesis costs make practical while remaining complex enough to demonstrate genome-scale design capabilities. Its overlapping gene architecture creates a stringent test case: a mutation in an overlapping region must satisfy the constraints of every protein encoded there. Additionally, ΦX174 encodes regulatory elements and recognition sequences that must work together to ensure correct packaging and replication in host cells.
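To make the overlapping-gene constraint concrete, here is a minimal sketch of how a single nucleotide change is read in two reading frames at once. The toy sequence and the helper names (`translate`, `mutate`) are hypothetical and chosen for illustration; nothing here is taken from the ΦX174 genome or from our pipeline.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(seq, frame=0):
    """Translate seq starting at offset `frame`, ignoring a trailing partial codon."""
    return "".join(CODON_TABLE[seq[i:i + 3]] for i in range(frame, len(seq) - 2, 3))


def mutate(seq, pos, base):
    """Return a copy of seq with the base at 0-based position `pos` replaced."""
    return seq[:pos] + base + seq[pos + 1:]


toy = "ATGGCACGTCTGGAAT"  # hypothetical toy sequence, not from ΦX174
variants = [
    ("original", toy),
    ("C4T", mutate(toy, 4, "T")),  # changes the protein in both frames
    ("A5G", mutate(toy, 5, "G")),  # silent in frame 0, but not in frame 1
]
for label, seq in variants:
    print(label, translate(seq, 0), translate(seq, 1))
```

In this toy case the C4T change alters both translations (MARLE to MVRLE in frame 0, WHVWN to WYVWN in frame 1), while A5G is synonymous in frame 0 yet still changes the frame 1 protein to WRVWN. A mutation that looks harmless for one gene can therefore still break its overlapping neighbor.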

The genome of ΦX174 is also historically significant. In 1977, Fred Sanger and colleagues made it the first DNA genome to be completely sequenced. In 2003, Craig Venter and colleagues made it the first whole genome to be chemically synthesized, proving that genomes could be assembled from scratch. Now, in 2025, we have used ΦX174 as a template to produce the first AI-generated genomes. This progression traces the fundamental capabilities that define modern genomics: we learned to read DNA, then to write it, and now to design it.

Building Custom Gene Annotation

ΦX174's overlapping genes presented our first major challenge: standard gene prediction methods could identify at most 7 of ΦX174's 11 genes due to overlapping reading frames that confound tools designed for non-overlapping genes.

We developed our own annotation pipeline combining ORF-finding strategies with homology searches against a phage protein database. The method identified all 11 genes in ΦX174, with partial prediction of gene A*.
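As a rough illustration of the ORF-finding half of such a pipeline, the sketch below scans the three forward reading frames independently, so ORFs in different frames are allowed to overlap. It is an assumption-laden simplification and not our actual tool: `find_orfs` and `min_codons` are hypothetical names, the genome is treated as linear even though ΦX174 is circular, only ATG starts are considered, and the homology search step is not modeled.

```python
STOPS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=10):
    """Return sorted (start, end, frame) tuples for forward-strand ORFs.

    `start` is the 0-based index of the first ATG after the previous stop in
    that frame, `end` is exclusive and includes the stop codon. Each frame is
    scanned on its own, so ORFs from different frames may overlap.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOPS:
                # Count coding codons (start codon included, stop excluded).
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None
    return sorted(orfs)


# Hypothetical toy sequence with two ORFs that overlap in different frames.
toy = "ATGAATGCCTGAATAG"
print(find_orfs(toy, min_codons=3))
```

For the toy sequence this reports one ORF in frame 0 spanning positions 0 to 12 and a second in frame 1 spanning positions 4 to 16, overlapping by eight nucleotides. A tool that forbids overlaps would have to discard one of them, which is the failure mode described above.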

The custom annotation tool became essential for evaluating thousands of generated sequences. We required at least 7 predicted protein hits to natural ΦX174 proteins as a basic quality filter, ensuring that generated genomes retained the genetic toolkit for phage function.

Fine-Tuning Evo for Phage Genome Generation

While the base Evo models, which were trained on over two million phage genomes, can already generate phage genomic sequences to some degree, they lack the controllability needed for ΦX174-like genome generation. We addressed this through supervised fine-tuning, in which we continued training the Evo models on a curated dataset of 14,466 Microviridae sequences, clustered at 99% identity to reduce redundancy. Fine-tuning specialized the Evo models on sequence variation closely related to ΦX174 and allowed us to directly instruct Evo to generate ΦX174-like sequences when given the right sequence prompt. We found that careful tuning of this prompt, as well as other sampling parameters, was required to generate sequences that phylogenetically resembled ΦX174 without being too close to the wild-type sequence.
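This post does not list which sampling parameters were tuned. Temperature is one common example in language-model sampling, and the sketch below uses it purely as an illustration of the trade-off between staying close to what the model considers most likely and exploring more divergent sequences. The logits are made up, and `apply_temperature` and `sample_base` are hypothetical helpers, not part of the Evo API.

```python
import math
import random


def apply_temperature(logits, temperature):
    """Convert logits to probabilities with a temperature-scaled softmax."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    top = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - top) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_base(logits, temperature, rng):
    """Sample one nucleotide from temperature-adjusted probabilities."""
    probs = apply_temperature(logits, temperature)
    return rng.choices("ACGT", weights=probs, k=1)[0]


# Made-up next-nucleotide logits for A, C, G, T (illustration only).
logits = [2.0, 1.0, 0.5, 0.1]
rng = random.Random(0)
for t in (0.5, 1.0, 2.0):
    probs = apply_temperature(logits, t)
    draws = "".join(sample_base(logits, t, rng) for _ in range(20))
    print(t, [round(p, 3) for p in probs], draws)
```

Lower temperatures concentrate probability on the most likely nucleotide, which in this setting would pull samples toward well-represented natural sequence, while higher temperatures flatten the distribution and produce more divergent output.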

Assessing Quality, Host-Specificity, and Novelty

Evaluating thousands of generated sequences also required a way to filter the genomes by sequence quality, host specificity, and evolutionary diversity. We needed to ensure that the generated genomes preserved a reasonable gene arrangement while still allowing evolutionary novelty. We also needed to ensure that the AI-designed phages would infect E. coli C, the non-pathogenic bacterial strain used in our experiments. Because the spike protein determines ΦX174’s host range, we required that sequences contain a spike protein similar to that of ΦX174. In our experiments, all 16 functional phages showed restricted tropism to E. coli C and the related strain E. coli W, with no growth on six other tested strains, demonstrating that host specificity could be maintained while allowing substantial evolutionary divergence elsewhere.
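The filtering logic can be pictured as a chain of simple checks. In the sketch below, only the minimum of 7 protein hits comes from this post. The `Candidate` fields, the spike identity cutoff, and the genome identity ceiling are hypothetical placeholders used to show the shape of such a filter, not the thresholds we used.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    protein_hits: int       # predicted proteins with a hit to a natural ΦX174 protein
    spike_identity: float   # identity of the predicted spike to the ΦX174 spike (0 to 1)
    genome_identity: float  # identity to the nearest natural genome (0 to 1)


def passes_filters(c, min_hits=7, min_spike_identity=0.9, max_genome_identity=0.99):
    """Apply quality, host-specificity, and novelty checks in turn.

    min_hits=7 follows the post; the other two defaults are assumptions.
    """
    has_toolkit = c.protein_hits >= min_hits            # sequence quality
    keeps_host = c.spike_identity >= min_spike_identity  # host specificity
    is_novel = c.genome_identity <= max_genome_identity  # evolutionary novelty
    return has_toolkit and keeps_host and is_novel


# Made-up candidates for illustration.
candidates = [
    Candidate("design_a", 9, 0.97, 0.95),
    Candidate("design_b", 6, 0.97, 0.95),   # too few protein hits
    Candidate("design_c", 9, 0.50, 0.95),   # spike too divergent
    Candidate("design_d", 9, 0.97, 1.00),   # identical to a natural genome
]
kept = [c.name for c in candidates if passes_filters(c)]
print(kept)
```

Keeping each criterion as a separate named check makes it easy to report why a given design was rejected, which matters when thousands of sequences are screened.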

Experimental Validation

Testing hundreds of synthetic genomes required rethinking traditional phage workflows. We developed a growth inhibition assay based on ΦX174’s lytic lifecycle: synthetic genomes are assembled via Gibson assembly, transformed into competent E. coli C, and monitored for growth inhibition in 96-well format. Successful infections show a rapid OD₆₀₀ decline within 2–3 hours. This protocol enabled rapid testing of 285 designs. The 16 phage candidates that led to growth inhibition were sequence-verified, propagated to working stocks, and characterized for fitness and host range.
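One simple way to call growth inhibition from plate-reader data is to look for a large drop from the running peak of each well's OD₆₀₀ curve. The sketch below illustrates that idea only: the readings are made up, and the function names and the 30% drop threshold are hypothetical choices, not the criteria used in our screen.

```python
def max_drop_from_peak(od):
    """Largest fractional decline from the running maximum of an OD series."""
    peak = od[0]
    worst = 0.0
    for value in od:
        peak = max(peak, value)
        if peak > 0:
            worst = max(worst, (peak - value) / peak)
    return worst


def is_lysis_hit(od, min_drop=0.3):
    """Flag a well whose OD falls by at least `min_drop` from its peak."""
    return max_drop_from_peak(od) >= min_drop


# Made-up OD600 readings taken at regular intervals (illustration only).
wells = {
    "no_phage_control": [0.05, 0.10, 0.20, 0.40, 0.70, 0.90],
    "candidate_well": [0.05, 0.10, 0.20, 0.35, 0.15, 0.08],
}
for name, series in wells.items():
    print(name, round(max_drop_from_peak(series), 2), is_lysis_hit(series))
```

Measuring the drop relative to each well's own peak avoids depending on absolute OD values, which can vary between wells and plates. A real screen would likely also compare against no-phage controls on the same plate.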

Each functional genome harbored 67–392 novel mutations compared to its nearest natural genome. Evo-Φ2147, with 392 mutations and 93.0% average nucleotide identity to phage NC51, would qualify as a new species under some taxonomic thresholds. Thirteen genomes contained mutations that could not be found in any known natural sequences, demonstrating that Evo could explore sequences beyond those sampled by natural evolution.
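For intuition about how mutation counts and percent identity relate, the following sketch computes both from a pair of already-aligned sequences. This is a simplified stand-in and not the average nucleotide identity calculation used in the study: the aligned strings are toy data, every gap column is counted as a difference, and the function names are hypothetical.

```python
def count_differences(aln_a, aln_b):
    """Count alignment columns where the two sequences differ (gaps included)."""
    if len(aln_a) != len(aln_b):
        raise ValueError("aligned sequences must have equal length")
    return sum(1 for x, y in zip(aln_a, aln_b) if x != y)


def percent_identity(aln_a, aln_b):
    """Percent of alignment columns that match, skipping gap-gap columns."""
    if len(aln_a) != len(aln_b):
        raise ValueError("aligned sequences must have equal length")
    columns = [(x, y) for x, y in zip(aln_a, aln_b) if not (x == "-" and y == "-")]
    matches = sum(1 for x, y in columns if x == y)
    return 100.0 * matches / len(columns)


# Toy pairwise alignment (illustration only): one substitution, one deletion.
natural = "ACGTACGTAC"
designed = "ACGTTCGT-C"
print(count_differences(natural, designed), percent_identity(natural, designed))
```

The toy pair differs in 2 of 10 columns, giving 80.0% identity. Genome-scale identity tools typically work on alignable fragments and average across them, so their values need not equal a simple ratio of mutations to genome length.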

Notably, one of the synthetic phages, Evo-Φ36, incorporated the DNA packaging J protein from the distantly related phage G4, a swap that had defeated previous rational engineering attempts. Cryo-EM revealed that the shorter G4 J protein (25 vs 38 amino acids) adopts a distinct orientation within the capsid structure. This shows that the model can coordinate complex compensatory mutations that enable novel protein combinations to function.

Overcoming Bacterial Resistance

Bacterial resistance to antibiotics is one of the most pressing challenges in modern medicine, with resistant infections killing hundreds of thousands of people or more each year. Bacteria can quickly evolve resistance to traditional antibiotics, limiting therapeutic effectiveness.

We wanted to test whether phage therapies could one day be designed to remain resilient against bacterial evolution. In our study, we evolved three ΦX174-resistant E. coli strains harboring mutations in the waa operon, which modifies bacterial surface receptors. AI-generated phage cocktails overcame resistance in all three strains within 1–5 passages, while ΦX174 alone failed completely.

The breakthrough phages were mosaic genomes derived from multiple AI designs through recombination, with mutations concentrated in surface-exposed regions that interact with bacterial receptors. Sequence analysis revealed that the successful phages combined genetic elements from 2–3 different AI-generated designs, suggesting that the diversity created by our approach provided multiple evolutionary pathways for overcoming resistance.

This demonstrates a key advantage of AI-generated phage diversity. Rather than hoping nature has evolved the right phage for a specific resistance mechanism, we can generate diverse populations that collectively present multiple targets, making it much harder for bacteria to develop comprehensive resistance. The AI’s exploration of sequence space provides raw material for rapid adaptation to resistance mechanisms, potentially transforming phage therapy from a trial-and-error process into a systematic approach for staying ahead of bacterial evolution.

Biosafety and Biocontainment

As this technology advances, we remain committed to maintaining the highest standards of biosafety and biocontainment. This work was conducted with comprehensive safety protocols that exceeded standard requirements. All experiments took place in dedicated biosafety cabinets with specialized disposal procedures, and equipment never left the containment area. We used only non-pathogenic bacterial hosts (E. coli C and other common laboratory strains of E. coli), ensuring that both the original and AI-designed bacteriophages posed no risk to human health.

Our computational approach incorporates built-in safety constraints. Evo cannot generate human viral sequences due to deliberate training data exclusions, preventing both accidental and intentional misuse for pathogen design. The template-based method provides additional safeguards: by starting with well-characterized, non-pathogenic systems and maintaining host specificity through spike protein conservation, we could explore new genomic possibilities while remaining within established risk boundaries.

Future Directions

Phage therapy is becoming an increasingly effective way of fighting multi-drug resistant bacteria. Near-term therapeutic targets include plant pathogen-specific phages and larger DNA phages whose gene architectures are similarly difficult or easier to design than that of ΦX174. Relevant bacterial pathogens in these settings include Pseudomonas aeruginosa, which causes respiratory infections, and Xanthomonas campestris, which causes black rot disease in crops.

This work shows that current genome language models, when properly trained and guided, can capture evolutionary constraints well enough to enable functional genome design. Specialized training, systematic quality control, and high-throughput experimental validation can bridge the gap between AI-generated sequences and biological reality.

As genome language models improve and synthesis costs decrease, whole-genome design could explore evolutionary possibilities that natural selection has never sampled, opening new avenues for biotechnology and basic research. The transition from reading and writing genomes to designing them represents a new chapter in our ability to engineer biology at its foundational level.