
API and usage of main functions in Python programs

Rauf Salamzade edited this page Nov 11, 2025 · 4 revisions

codoff exposes two main functions, codoff_main_coords and codoff_main_gbk, which can be imported and called directly from Python programs.

Note: As of v1.2.3, codoff uses sequential contiguous-window sampling, which compares the focal region against randomly positioned genomic windows of the same size. This provides a more biologically realistic null distribution than random gene sampling.
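The idea of a contiguous-window null distribution can be illustrated with a small toy model. This is only a sketch under stated assumptions, not codoff's actual implementation: genes are represented as per-gene codon count vectors in genomic order, window size is measured in number of genes, and the function names (cosine_distance, sum_counts, window_null_frequency) are hypothetical.

```python
import math
import random

def cosine_distance(a, b):
    """Cosine distance (1 - cosine similarity) between two count vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)

def sum_counts(genes):
    """Element-wise sum of per-gene codon count vectors."""
    return [sum(col) for col in zip(*genes)]

def window_null_frequency(gene_counts, focal_start, focal_len,
                          num_sims=1000, seed=None):
    """Toy contiguous-window null model (assumption: size = number of genes).

    gene_counts: per-gene codon count vectors, listed in genomic order.
    focal_start, focal_len: index of the first focal gene and gene count.
    Returns (observed distance, fraction of random windows whose distance
    to the rest of the genome is >= the observed distance).
    """
    rng = random.Random(seed)
    n = len(gene_counts)

    def distance_for(start):
        window = gene_counts[start:start + focal_len]
        rest = gene_counts[:start] + gene_counts[start + focal_len:]
        return cosine_distance(sum_counts(window), sum_counts(rest))

    observed = distance_for(focal_start)
    hits = 0
    for _ in range(num_sims):
        # Pick a random window start so the window stays inside the genome.
        start = rng.randrange(0, n - focal_len + 1)
        if distance_for(start) >= observed:
            hits += 1
    return observed, hits / num_sims
```

Because neighbouring genes are sampled together, each simulated window keeps the local structure that a real genomic region has, which is what random gene sampling loses.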

API

codoff_main_coords(full_genome_file, focal_scaffold, focal_start_coord, focal_end_coord, outfile=None, plot_outfile=None, verbose=True, num_sims=10000, seed=None)

A full genome file can be provided in either GenBank or FASTA format. If FASTA is provided, pyrodigal is used for gene calling, so FASTA input only works for bacterial genomes. The user-supplied coordinates of the focal region are then used to determine which CDS locus tags belong to the focal region and which belong to the background genome. As of v1.2.3, the function uses sequential contiguous-window sampling, which randomly selects genomic windows of the same size as the focal region to build a more biologically realistic null distribution. It calls the private function _stat_calc_and_simulation() to perform the main statistical calculations and simulations and compute a discordance percentile.

| Argument | Type | Description |
| --- | --- | --- |
| full_genome_file | str | The path to the full genome file in GenBank or FASTA format. [Required]. |
| focal_scaffold | str | The scaffold identifier for the focal region. [Required]. |
| focal_start_coord | int | The start coordinate for the focal region (1-based). [Required]. |
| focal_end_coord | int | The end coordinate for the focal region (inclusive). [Required]. |
| outfile | str | The path to the output file [Default is None]. |
| plot_outfile | str | The path to the plot output file in SVG format. If not provided, no plot will be made [Default is None]. |
| verbose | bool | Whether to print progress messages to stderr [Default is True]. |
| num_sims | int | Number of simulations to run for the null distribution [Default is 10000]. |
| seed | int | Random seed for reproducible results [Default is None]. |
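To make the coordinate convention concrete, the following sketch partitions CDS features into focal and background sets using 1-based, inclusive coordinates. It is an illustration only: the CDS tuple and partition_cds are hypothetical names, and the rule that a gene must lie entirely within the focal interval is an assumption of this sketch, not a statement about how codoff treats genes that straddle the boundary.

```python
from typing import NamedTuple

class CDS(NamedTuple):
    """Hypothetical minimal CDS record for this sketch."""
    locus_tag: str
    scaffold: str
    start: int  # 1-based
    end: int    # inclusive

def partition_cds(features, focal_scaffold, focal_start, focal_end):
    """Split locus tags into (focal, background) lists.

    Assumption: a CDS is focal only if it is on the focal scaffold and
    fully contained in [focal_start, focal_end].
    """
    focal, background = [], []
    for cds in features:
        inside = (cds.scaffold == focal_scaffold
                  and cds.start >= focal_start
                  and cds.end <= focal_end)
        (focal if inside else background).append(cds.locus_tag)
    return focal, background
```

Note that a gene with the same coordinates on a different scaffold lands in the background, which is why focal_scaffold must match the sequence identifier exactly.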

codoff_main_gbk(full_genome_file, focal_genbank_files, outfile=None, plot_outfile=None, verbose=True, genome_data=None, num_sims=10000, seed=None)

The full genome and the focal region must each be provided in GenBank format, and the two must share locus_tags. Any locus_tags in the focal region GenBank that are absent from the full genome GenBank are ignored. As of v1.2.3, the function uses sequential contiguous-window sampling, which randomly selects genomic windows of the same size as the focal region to build a more biologically realistic null distribution. Genomic coordinates are automatically extracted for efficient sampling. It calls the private function _stat_calc_and_simulation() to perform the main statistical calculations and simulations and compute a discordance percentile.

| Argument | Type | Description |
| --- | --- | --- |
| full_genome_file | str | The path to the full genome file in GenBank format. [Required unless genome_data is provided]. |
| focal_genbank_files | list | A list of paths to GenBank files corresponding to the focal region. These should not be multiple independent BGCs; multiple files are accepted so that a single BGC that is fragmented due to assembly incompleteness can be provided in pieces. [Required]. |
| outfile | str | The path to the output file [Default is None]. |
| plot_outfile | str | The path to the plot output file in SVG format. If not provided, no plot will be made [Default is None]. |
| verbose | bool | Whether to print progress messages to stderr [Default is True]. |
| genome_data | dict | Pre-computed genome data from extract_genome_codon_data() to avoid redundant computation [Default is None]. |
| num_sims | int | Number of simulations to run for the null distribution [Default is 10000]. |
| seed | int | Random seed for reproducible results [Default is None]. |

Reproducibility

Results can be made reproducible by setting the seed parameter to any integer value. When using the same seed, you will get identical results across multiple runs. In the Python API, seed defaults to None (non-reproducible). In the CLI, seed defaults to 42 for convenience.
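One way to check this behaviour is a small helper that runs the same call twice with a fixed seed and compares the outputs. The sketch below is self-contained: same_result_twice is a hypothetical helper, and toy_run is a stand-in that only mimics a seeded simulation, not a codoff function.

```python
import random

def same_result_twice(run, seed):
    """Call run(seed=seed) twice and report whether both results are equal."""
    return run(seed=seed) == run(seed=seed)

def toy_run(seed=None):
    """Hypothetical stand-in for a codoff call.

    Returns a dict with a seeded pseudo-random 'empirical_freq'.
    """
    rng = random.Random(seed)
    return {"empirical_freq": rng.random()}

if __name__ == "__main__":
    print(same_result_twice(toy_run, seed=42))
```

With codoff installed, run could instead be a small wrapper such as lambda seed: codoff.codoff_main_coords(genome_fna, scaffold, start_coord, end_coord, seed=seed, verbose=False), using the argument names documented above.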

Results

Both functions return a dictionary with the following keys:

| Key | Type | Value |
| --- | --- | --- |
| empirical_freq | float | The discordance frequency (proportion from 0-1) indicating how often simulated regions show codon usage as or more discordant than the observed focal region. Multiply by 100 for discordance percentile. |
| cosine_distance | float | The cosine distance between the focal region and genome-wide codon usage profiles. |
| rho | float | Spearman's rho between the focal region and genome-wide codon usage profiles. |
| codon_order | list of strs | A list of codons, in the same order as the following two lists. |
| focal_region_codons | list of ints | A list of codon counts for focal region. |
| background_genome_codons | list of ints | A list of codon counts for background genome. |
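Since codon_order, focal_region_codons and background_genome_codons are parallel lists, they can be zipped to see which codons differ most between the focal region and the background. The sketch below assumes only the keys documented above; summarize_result is a hypothetical helper, and the example dictionary in the usage note holds made-up values purely for illustration.

```python
def summarize_result(result, top_n=3):
    """Summarize a codoff-style result dict.

    Converts codon counts to relative frequencies and ranks codons by the
    absolute difference between focal and background frequency.
    """
    codons = result["codon_order"]
    focal = result["focal_region_codons"]
    background = result["background_genome_codons"]
    focal_total = sum(focal)
    background_total = sum(background)

    rows = []
    for codon, f, b in zip(codons, focal, background):
        f_freq = f / focal_total if focal_total else 0.0
        b_freq = b / background_total if background_total else 0.0
        rows.append((codon, f_freq, b_freq, f_freq - b_freq))

    # Largest absolute frequency shift first.
    rows.sort(key=lambda row: abs(row[3]), reverse=True)
    return {
        "discordance_percentile": result["empirical_freq"] * 100.0,
        "top_shifted_codons": rows[:top_n],
    }
```

For example, calling summarize_result on a dictionary with codon_order ['AAA', 'AAG', 'GAA', 'GAG'], focal counts [10, 30, 20, 40] and background counts [400, 100, 300, 200] ranks AAA first, because its relative frequency drops from 0.4 in the background to 0.1 in the focal region.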

Usage examples

codoff_main_coords():

from codoff import codoff

genome_fna = 'Some_Genome.fna' # nucleotide FASTA file (can be multi-FASTA) 

# provide coordinate information for region of interest (e.g. BGC, etc.) 
scaffold = 'ABC0001.1' # should match the header of some sequence in the FASTA until the first space.
start_coord = 10051
end_coord = 95060

# Run with default parameters (v1.2.3: uses sequential contiguous-window sampling)
result = codoff.codoff_main_coords(genome_fna, scaffold, start_coord, end_coord)

# Access discordance percentile
discordance_percentile = result['empirical_freq'] * 100.0
print(f"Discordance Percentile: {discordance_percentile:.2f}%")

# Run with custom parameters
result = codoff.codoff_main_coords(genome_fna, scaffold, start_coord, end_coord,
                                   num_sims=20000,
                                   seed=42)

codoff_main_gbk():

from codoff import codoff

antismash_bgc_gbk = 'Some_BGC_of_Interest.gbk' # annotated BGC GenBank - e.g. one produced by antiSMASH
full_genome_gbk = 'Matching_Full_Genome.gbk' # annotated full-genome GenBank file - also produced by antiSMASH

output_file = 'codoff_results.tsv' # optional
output_plot = 'codoff_simulation_histogram.svg' # optional

# Run with default parameters
result = codoff.codoff_main_gbk(full_genome_gbk, 
                                [antismash_bgc_gbk], 
                                outfile=output_file, 
                                plot_outfile=output_plot, 
                                verbose=True)

# Access discordance percentile
discordance_percentile = result['empirical_freq'] * 100.0
print(f"Discordance Percentile: {discordance_percentile:.2f}%")

# Run with custom parameters and reproducible seed
result = codoff.codoff_main_gbk(full_genome_gbk, 
                                [antismash_bgc_gbk], 
                                num_sims=15000,
                                seed=42,
                                verbose=True)

# Example with cached genome data for multiple BGCs (efficient)
genome_data = codoff.extract_genome_codon_data(full_genome_gbk)
for bgc_gbk in ['BGC1.gbk', 'BGC2.gbk', 'BGC3.gbk']:
    result = codoff.codoff_main_gbk(full_genome_gbk, 
                                    [bgc_gbk],
                                    genome_data=genome_data,  # reuse cached data
                                    num_sims=10000,
                                    seed=42)
    print(f"{bgc_gbk}: {result['empirical_freq'] * 100:.2f}%")