K-mer counting programming challenge

Instructions

Create a private GitHub repository based on this template, either on github.com or using the GitHub CLI:

gh repo create software-assignment --clone --private --template Basecamp-Research/software-assignment

Write a solution to the challenge described below in any reasonably common language of your choice.
Once you are finished, invite Alexandros Papadopoulos (GitHub: @alexpBCR) as a collaborator to your private repository.

Alternatively, you may compress your solution and send it by email to alex.papadopoulos@basecamp-research.com.

Problem description

Biological background

DNA sequences are long strings built from just four letters: A, C, G, T.

A k-mer is a substring of length k taken from a DNA sequence.

For example, with k = 4, the sequence ACGTAC contains three k-mers:

ACGT, CGTA, GTAC

K-mers are fundamental in genomics and are widely used in tasks such as genome assembly, mutation detection, sequence comparison, and indexing very large datasets. For this exercise, you can treat them simply as sliding windows over a string.
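
The sliding-window view can be sketched in a few lines of Python. This is only an illustration; the function name iter_kmers is hypothetical and not part of the challenge. A sequence of length n yields n - k + 1 windows, and none if it is shorter than k.

```python
def iter_kmers(sequence: str, k: int):
    """Yield every length-k window of `sequence`, left to right."""
    # range() is empty when the sequence is shorter than k.
    for start in range(len(sequence) - k + 1):
        yield sequence[start:start + k]


print(list(iter_kmers("ACGTAC", 4)))  # ['ACGT', 'CGTA', 'GTAC']
```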

Your task

You are to build a command-line tool that:

Reads DNA sequences from a FASTA file
Counts how often each k-mer occurs across all sequences
Writes the results to a CSV file

Input format

The input is a FASTA file, for example:

>sequence1
ACGTACGTACGT
>sequence2
TTTACGAC

You should expect the following characteristics:

Many sequences
Very long sequences (potentially millions of characters)
Large file sizes (up to ~1GB)
Sequence lines may be wrapped or unwrapped
You should not assume the entire file fits comfortably in memory
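
One possible way to cope with wrapped lines and long sequences is to read the file line by line and carry the last k - 1 bases over to the next line, so that k-mers spanning a line break are still counted and no full sequence is ever joined in memory. The sketch below assumes plain A/C/G/T input and uses a hypothetical function name, count_kmers_streaming. It keeps all counts in an in-memory Counter, which may itself grow large for big k; that part is deliberately left simple here.

```python
import io
from collections import Counter


def count_kmers_streaming(lines, k):
    """Count k-mers from an iterable of FASTA lines, one line at a time."""
    counts = Counter()
    tail = ""  # last k-1 bases of the current record seen so far
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            tail = ""  # new record: k-mers must not span two sequences
            continue
        window = tail + line
        for i in range(len(window) - k + 1):
            counts[window[i:i + k]] += 1
        # Keep only what the next line needs to complete a k-mer.
        tail = window[-(k - 1):] if k > 1 else ""
    return counts


wrapped = io.StringIO(">sequence1\nACGT\nACGT\nACGT\n>sequence2\nTTTA\nCGAC\n")
unwrapped = io.StringIO(">sequence1\nACGTACGTACGT\n>sequence2\nTTTACGAC\n")
print(count_kmers_streaming(wrapped, 4) == count_kmers_streaming(unwrapped, 4))  # True
```

With a real file, the same function could be fed an open file object, since iterating over a text file yields one line at a time.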

Output format (CSV)

Your program should output a CSV file with the following structure:

kmer,count
ACGTA,1023
CGTAC,998
...

Requirements:

Always include a header
Exactly two columns: kmer and count
Each distinct k-mer appears exactly once
Ordering of rows is up to you (document your choice)
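
As an illustration of these requirements, the sketch below writes a header followed by one row per distinct k-mer using Python's csv module. Lexicographic ordering is just one possible choice, picked here because it makes the output deterministic; the function name write_counts is hypothetical.

```python
import csv
import io


def write_counts(counts, out):
    """Write a kmer,count CSV with a header, rows sorted by k-mer."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["kmer", "count"])
    # A dict has unique keys, so each distinct k-mer appears exactly once.
    for kmer in sorted(counts):
        writer.writerow([kmer, counts[kmer]])


buffer = io.StringIO()
write_counts({"CGTA": 1, "ACGT": 1, "GTAC": 1}, buffer)
print(buffer.getvalue())
```

When writing to a real file with the csv module, the file should be opened with newline="" so that line endings are not translated twice.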

Command-line interface

Your tool should be runnable as:

kmer-counter --input <path> --k <integer> --output <path>

Required flags:

--input — path to the FASTA file
--k — k-mer length
--output — path to the CSV output file

You may add optional flags if you believe they improve usability.
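
A minimal sketch of this interface using Python's argparse is shown below. The function name parse_args and the help texts are illustrative assumptions; the 1 to 20 bound on --k is taken from the functional requirements in this document.

```python
import argparse


def parse_args(argv=None):
    """Parse the three required flags of the kmer-counter interface."""
    parser = argparse.ArgumentParser(
        prog="kmer-counter",
        description="Count k-mers in a FASTA file and write them to CSV.",
    )
    parser.add_argument("--input", required=True, help="path to the FASTA file")
    parser.add_argument(
        "--k",
        required=True,
        type=int,
        choices=range(1, 21),  # accepts 1..20 inclusive
        metavar="K",
        help="k-mer length (1 to 20)",
    )
    parser.add_argument("--output", required=True, help="path to the CSV output file")
    return parser.parse_args(argv)


args = parse_args(["--input", "genome.fasta", "--k", "5", "--output", "counts.csv"])
print(args.input, args.k, args.output)  # genome.fasta 5 counts.csv
```

Passing a value outside the allowed range, or omitting a required flag, makes argparse print a usage message and exit with a non-zero status.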

Functional requirements

Count all exact substring matches of length k
Combine counts across all sequences
Handle large input files efficiently
Support k in the range 1 to 20
Treat DNA as a simple string over A / C / G / T
Define and document your approach to:
- mixed case characters (A vs a)
- ambiguous characters (e.g. N)
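
The challenge leaves both policies to you. One possible choice, sketched below purely as an example, is to uppercase everything and to split each sequence at runs of non-ACGT characters, so that no counted k-mer contains an ambiguous base. The function name valid_runs is hypothetical.

```python
import re

# Any run of characters other than A, C, G, T acts as a separator.
_NON_ACGT = re.compile(r"[^ACGT]+")


def valid_runs(sequence):
    """Uppercase `sequence` and return its maximal A/C/G/T-only stretches."""
    return [run for run in _NON_ACGT.split(sequence.upper()) if run]


print(valid_runs("acgTNNacg"))  # ['ACGT', 'ACG']
```

Under this policy, k-mers are then counted within each run separately, which means a window is never formed across an N. Other policies (for example rejecting the file, or counting N as a fifth letter) are equally defensible as long as they are documented.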

Performance requirements

Your solution should be able to process inputs such as:

A ~1GB FASTA file
Hundreds of millions of k-mers

With:

Reasonable memory usage
Reasonable runtime on a typical laptop

You are not expected to provide exact benchmarks, but your design should clearly justify why it scales to inputs of this size.
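
One argument that can support such a justification: with only four letters, each base fits in 2 bits, so a k-mer with k up to 20 packs into at most 40 bits and fits in a single 64-bit integer. The sketch below shows a rolling 2-bit encoding in Python, where each new base updates the previous value in constant time instead of re-slicing the string. The names encode_rolling and decode are hypothetical, and the code assumes the input contains only uppercase A, C, G, T.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"


def encode_rolling(sequence, k):
    """Yield the 2-bit packed integer of each k-mer via a rolling update."""
    mask = (1 << (2 * k)) - 1  # keeps only the lowest 2k bits
    value = 0
    for i, base in enumerate(sequence):
        # Shift in the new base and drop the base that left the window.
        value = ((value << 2) | CODE[base]) & mask
        if i >= k - 1:
            yield value


def decode(value, k):
    """Turn a packed integer back into its k-letter string."""
    return "".join(BASES[(value >> (2 * (k - 1 - j))) & 3] for j in range(k))


print([decode(v, 4) for v in encode_rolling("ACGTAC", 4)])  # ['ACGT', 'CGTA', 'GTAC']
```

Note that the number of possible k-mers is 4^k, which is about 16.8 million for k = 12 but roughly 1.1 trillion for k = 20, so a dense table indexed by the packed value is only practical for smaller k; for larger k some sparse or disk-backed structure would be needed.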

Examples

Given the input file:

>sequence1
ACGTAC

With k = 4, the output should contain the following rows (in any order after the header):

kmer,count
ACGT,1
CGTA,1
GTAC,1

Additional example inputs and outputs can be found in the examples/ directory.

Deliverables

Please submit:

Source code with a clear, maintainable project structure
A README describing:
- How to build and run the program
- Any assumptions made
- Input and output details

Evaluation criteria

The solution will be evaluated on the following criteria:

Correctness: whether the program produces correct k-mer counts
Clarity: how easy it is for others to understand and work with the code
Algorithmic complexity: how performance scales with increasing input size
Efficiency: ability to handle very large FASTA files without excessive memory usage
Engineering quality: structure, documentation, and robustness of the solution
