- Create a private GitHub repository based on this template, either on github.com or using the GitHub CLI:
  gh repo create software-assignment --clone --private --template Basecamp-Research/software-assignment
- Write a solution to the challenge described below in a language of your choice, as long as it is reasonably common.
- Once you are finished, invite Alexandros Papadopoulos (GitHub: @alexpBCR) as a collaborator to your private repository. Alternatively, you may compress your solution and send it by email to alex.papadopoulos@basecamp-research.com.
DNA sequences are long strings built from just four letters: A, C, G, T.
A k-mer is a substring of length k taken from a DNA sequence.
For example, for k = 4 and the sequence ACGTAC:
ACGT, CGTA, GTAC
K-mers are fundamental in genomics and are widely used in tasks such as genome assembly, mutation detection, sequence comparison, and indexing very large datasets. For this exercise, you can treat them simply as sliding windows over a string.
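As a rough illustration of the sliding-window view, the Python sketch below enumerates the k-mers of a short string. The function name `kmers` is hypothetical, and this naive version is only meant to show the definition, not to serve as a scalable solution.

```python
from typing import Iterator


def kmers(sequence: str, k: int) -> Iterator[str]:
    """Yield every length-k window of `sequence`, from left to right."""
    # A sequence of length n has n - k + 1 windows (none if n < k).
    for start in range(len(sequence) - k + 1):
        yield sequence[start:start + k]


print(list(kmers("ACGTAC", 4)))  # ['ACGT', 'CGTA', 'GTAC']
```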
You are to build a command-line tool that:
- Reads DNA sequences from a FASTA file
- Counts how often each k-mer occurs across all sequences
- Writes the results to a CSV file
The input is a FASTA file, for example:
>sequence1
ACGTACGTACGT
>sequence2
TTTACGAC
You should expect the following characteristics:
- Many sequences
- Very long sequences (potentially millions of characters)
- Large file sizes (up to ~1GB)
- Sequence lines may be wrapped or unwrapped
- You should not assume the entire file fits comfortably in memory
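One possible way to respect these characteristics is to stream the file line by line instead of loading it whole. The sketch below is an assumption about how that could look in Python; the name `fasta_chunks` and the `(starts_new_record, chunk)` pair it yields are hypothetical choices, not part of the assignment.

```python
from typing import Iterable, Iterator, Tuple


def fasta_chunks(lines: Iterable[str]) -> Iterator[Tuple[bool, str]]:
    """Yield (starts_new_record, chunk) pairs, one per sequence line.

    starts_new_record is True for the first chunk after a '>' header, so a
    caller can reset its window state and never count a k-mer that spans
    two different records. Wrapped sequences simply arrive as several chunks.
    """
    new_record = True
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            new_record = True  # header line: the next chunk opens a record
            continue
        yield new_record, line
        new_record = False


example = [">sequence1\n", "ACGTAC\n", "GTACGT\n", ">sequence2\n", "TTTACGAC\n"]
for starts_new_record, chunk in fasta_chunks(example):
    print(starts_new_record, chunk)
```

Because an open text file is itself an iterable of lines, the same function can be fed a file handle directly. An unwrapped sequence still arrives as one long line, so memory use is bounded by the longest line rather than by the file size.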
Your program should output a CSV file with the following structure:
kmer,count
ACGTA,1023
CGTAC,998
...
Requirements:
- Always include a header
- Exactly two columns: `kmer` and `count`
- Each distinct k-mer appears exactly once
- Ordering of rows is up to you (document your choice)
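A minimal sketch of an output step that meets these requirements, using Python's standard `csv` module. The function name `write_counts` is hypothetical, and sorting rows lexicographically by k-mer is just one ordering you might choose and document.

```python
import csv
import io
from typing import Mapping, TextIO


def write_counts(counts: Mapping[str, int], out: TextIO) -> None:
    """Write a header row, then one row per distinct k-mer."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["kmer", "count"])
    # Assumed ordering: lexicographic by k-mer, which makes output deterministic.
    for kmer in sorted(counts):
        writer.writerow([kmer, counts[kmer]])


buffer = io.StringIO()
write_counts({"CGTA": 1, "ACGT": 1, "GTAC": 1}, buffer)
print(buffer.getvalue())
```

Since the counts are keyed by k-mer, each distinct k-mer is written exactly once.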
Your tool should be runnable as:
kmer-counter --input <path> --k <integer> --output <path>
Required flags:
- `--input`: path to the FASTA file
- `--k`: k-mer length
- `--output`: path to the CSV output file
You may add optional flags if you believe they improve usability.
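If you work in Python, the interface above could be sketched with the standard `argparse` module as follows. The function name `parse_args` and the help texts are illustrative assumptions; restricting `--k` to 1 through 20 reflects the supported range stated in this document.

```python
import argparse


def parse_args(argv=None) -> argparse.Namespace:
    """Parse the three required flags of the kmer-counter interface."""
    parser = argparse.ArgumentParser(
        prog="kmer-counter",
        description="Count k-mers in a FASTA file and write them to a CSV file.",
    )
    parser.add_argument("--input", required=True, help="path to the FASTA file")
    parser.add_argument(
        "--k",
        required=True,
        type=int,
        choices=range(1, 21),  # supported range is 1 to 20 inclusive
        metavar="K",
        help="k-mer length (1 to 20)",
    )
    parser.add_argument("--output", required=True, help="path to the CSV output file")
    return parser.parse_args(argv)


args = parse_args(["--input", "genome.fasta", "--k", "4", "--output", "counts.csv"])
print(args.input, args.k, args.output)
```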
- Count all exact substring matches of length k
- Combine counts across all sequences
- Handle large input files efficiently
- Support k in the range 1 to 20
- Treat DNA as a simple string over A / C / G / T
- Define and document your approach to:
  - mixed case characters (`A` vs `a`)
  - ambiguous characters (e.g. `N`)
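The sketch below shows one possible policy for the two open points, not a prescribed one: it folds lowercase to uppercase and skips every window that contains a character other than A, C, G or T. It also assumes k-mers should not span two records. Each k-mer is packed into an integer with 2 bits per base, which fits in 40 bits for k up to 20. The names `count_kmers` and `decode`, and the `(starts_new_record, chunk)` input shape, are hypothetical.

```python
from collections import Counter
from typing import Iterable, Tuple

BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def count_kmers(chunks: Iterable[Tuple[bool, str]], k: int) -> Counter:
    """Count 2-bit encoded k-mers over (starts_new_record, chunk) pairs."""
    mask = (1 << (2 * k)) - 1  # keeps only the lowest 2*k bits
    counts: Counter = Counter()
    window = 0  # packed encoding of the most recent bases
    valid = 0   # number of consecutive valid bases seen so far
    for starts_new_record, chunk in chunks:
        if starts_new_record:
            window, valid = 0, 0  # assumed policy: no k-mer spans two records
        for char in chunk.upper():  # assumed policy: 'a' counts as 'A'
            code = BASE_CODE.get(char)
            if code is None:
                window, valid = 0, 0  # assumed policy: skip windows containing N etc.
                continue
            window = ((window << 2) | code) & mask
            valid += 1
            if valid >= k:
                counts[window] += 1
    return counts


def decode(value: int, k: int) -> str:
    """Turn a packed k-mer back into its A/C/G/T string."""
    return "".join("ACGT"[(value >> (2 * (k - 1 - i))) & 3] for i in range(k))


counts = count_kmers([(True, "acgNACG"), (False, "TAC")], 4)
print({decode(value, 4): count for value, count in counts.items()})
# {'ACGT': 1, 'CGTA': 1, 'GTAC': 1}
```

The window state carries over between chunks of the same record, so wrapped and unwrapped sequence lines give identical counts. A per-character Python loop like this shows the logic only and would likely be slow on a ~1GB input.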
Your solution should be able to process inputs such as:
- A ~1GB FASTA file
- Hundreds of millions of k-mers
With:
- Reasonable memory usage
- Reasonable runtime on a typical laptop
You are not expected to provide exact benchmarks, but your design should clearly justify why it scales to inputs of this size.
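As an example of the kind of back-of-the-envelope reasoning that could support such a justification, the sketch below bounds the number of distinct k-mers a counter may need to hold. The function name `max_distinct_kmers` is hypothetical, and the bound ignores record boundaries and ambiguous characters, both of which only lower the true figure.

```python
def max_distinct_kmers(total_bases: int, k: int) -> int:
    """Upper bound on distinct k-mers: limited by the alphabet and by the input."""
    possible = 4 ** k                         # every string of length k over A/C/G/T
    windows = max(total_bases - k + 1, 0)     # windows in one sequence of that length
    return min(possible, windows)


for k in (4, 12, 20):
    print(k, max_distinct_kmers(10**9, k))
```

For small k the table size is capped by 4^k (256 entries for k = 4, 16,777,216 for k = 12), while for k = 20 the input itself is the limit: about 10^9 windows against roughly 1.1 × 10^12 possible k-mers.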
Given the input file:
>sequence1
ACGTAC
And k = 4, the output should include:
kmer,count
ACGT,1
CGTA,1
GTAC,1
Additional example inputs and outputs can be found in the examples/ directory.
Please submit:
- Source code with a clear, maintainable project structure
- A README describing:
  - How to build and run the program
  - Any assumptions made
  - Input and output details
The solution will be evaluated on the following criteria:
- Correctness: does the program produce correct k-mer counts
- Clarity: how easy it is for others to understand and work with the code
- Algorithmic complexity: how performance scales with increasing input size
- Efficiency: ability to handle very large FASTA files without excessive memory usage
- Engineering quality: structure, documentation, and robustness of the solution