Basecamp-Research/software-assignment

K-mer counting programming challenge

Instructions

  1. Create a private GitHub repository based on this template, either on github.com or using the GitHub CLI:

    gh repo create software-assignment --clone --private --template Basecamp-Research/software-assignment
  2. Write a solution to the challenge described below in a language of your choice, as long as it is reasonably common.

  3. Once you are finished, invite Alexandros Papadopoulos (GitHub: @alexpBCR) as a collaborator to your private repository.

    Alternatively, you may compress your solution and send it by email to alex.papadopoulos@basecamp-research.com.


Problem description

Biological background

DNA sequences are long strings built from just four letters: A, C, G, T.

A k-mer is a substring of length k taken from a DNA sequence.

For example, with k = 4 the sequence ACGTAC contains three k-mers:

ACGT, CGTA, GTAC

K-mers are fundamental in genomics and are widely used in tasks such as genome assembly, mutation detection, sequence comparison, and indexing very large datasets. For this exercise, you can treat them simply as sliding windows over a string.
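The sliding-window view can be sketched in a few lines of Python. This is only an illustration of the definition, not a suggested solution; the function name `kmers` is hypothetical.

```python
def kmers(sequence: str, k: int):
    """Yield every length-k window of `sequence`, left to right."""
    # A sequence of length n has n - k + 1 windows (none if n < k).
    for i in range(len(sequence) - k + 1):
        yield sequence[i:i + k]


print(list(kmers("ACGTAC", 4)))  # ['ACGT', 'CGTA', 'GTAC']
```

A sequence shorter than k yields no k-mers at all.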


Your task

You are to build a command-line tool that:

  • Reads DNA sequences from a FASTA file
  • Counts how often each k-mer occurs across all sequences
  • Writes the results to a CSV file

Input format

The input is a FASTA file, for example:

>sequence1
ACGTACGTACGT
>sequence2
TTTACGAC

You should expect the following characteristics:

  • Many sequences
  • Very long sequences (potentially millions of characters)
  • Large file sizes (up to ~1GB)
  • Sequence lines may be wrapped or unwrapped
  • You should not assume the entire file fits comfortably in memory
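One way to meet these constraints is to read the file line by line and carry the last k - 1 characters of each record forward, so that k-mers spanning a line break are still counted. The sketch below assumes Python and a hypothetical function `count_kmers`; it keeps counts in a `collections.Counter`, which is simple but may use too much memory when there are very many distinct k-mers.

```python
from collections import Counter
from typing import Iterable


def count_kmers(lines: Iterable[str], k: int) -> Counter:
    """Count k-mers across all FASTA records, one line at a time."""
    counts: Counter = Counter()
    tail = ""  # last k-1 characters seen in the current record
    for line in lines:
        line = line.strip()
        if not line:
            continue  # assumption: blank lines are ignored
        if line.startswith(">"):
            tail = ""  # new record: k-mers must not span two sequences
            continue
        window = tail + line
        for i in range(len(window) - k + 1):
            counts[window[i:i + k]] += 1
        # Keep only what is needed to complete k-mers on the next line.
        tail = window[-(k - 1):] if k > 1 else ""
    return counts


wrapped = [">sequence1", "ACGTAC", "GTACGT", ">sequence2", "TTTACGAC"]
unwrapped = [">sequence1", "ACGTACGTACGT", ">sequence2", "TTTACGAC"]
print(count_kmers(wrapped, 3) == count_kmers(unwrapped, 3))  # True
```

Passing an open file object as `lines` reads it lazily, so only one line plus the short tail is held in memory at a time. An unwrapped sequence of several million characters still arrives as a single line, which is usually acceptable but worth noting in the design.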

Output format (CSV)

Your program should output a CSV file with the following structure:

kmer,count
ACGTA,1023
CGTAC,998
...

Requirements:

  • Always include a header
  • Exactly two columns: kmer and count
  • Each distinct k-mer appears exactly once
  • Ordering of rows is up to you (document your choice)
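As an illustration of these requirements, the following Python sketch writes a header followed by one row per distinct k-mer. The function name `write_counts` and the choice of lexicographic ordering are assumptions made for the example; any documented ordering satisfies the requirement.

```python
import csv
import io


def write_counts(counts, out) -> None:
    """Write `counts` (a mapping of k-mer to count) as CSV to the file object `out`."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["kmer", "count"])  # header is always present
    for kmer in sorted(counts):  # assumption: rows sorted lexicographically by k-mer
        writer.writerow([kmer, counts[kmer]])


buf = io.StringIO()
write_counts({"CGTA": 1, "ACGT": 1, "GTAC": 1}, buf)
print(buf.getvalue())
```

Because the keys of a mapping are unique, each distinct k-mer appears exactly once. When writing to a real file, the `csv` module documentation recommends opening it with `newline=""`.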

Command-line interface

Your tool should be runnable as:

kmer-counter --input <path> --k <integer> --output <path>

Required flags:

  • --input — path to the FASTA file
  • --k — k-mer length
  • --output — path to the CSV output file

You may add optional flags if you believe they improve usability.
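A minimal sketch of this interface using Python's standard `argparse` module is shown below. The function name `build_parser` and the help texts are hypothetical; restricting `--k` to 1 through 20 follows the supported range stated under the functional requirements.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for the three required flags."""
    parser = argparse.ArgumentParser(
        prog="kmer-counter",
        description="Count k-mers in a FASTA file and write them to CSV.",
    )
    parser.add_argument("--input", required=True, help="path to the FASTA file")
    parser.add_argument(
        "--k",
        required=True,
        type=int,
        choices=range(1, 21),  # supported range: 1 to 20 inclusive
        metavar="K",
        help="k-mer length (1 to 20)",
    )
    parser.add_argument("--output", required=True, help="path to the CSV output file")
    return parser


args = build_parser().parse_args(
    ["--input", "genome.fa", "--k", "5", "--output", "counts.csv"]
)
print(args.input, args.k, args.output)
```

With this setup, a missing flag or an out-of-range value for `--k` makes `argparse` print a usage message and exit with a non-zero status.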


Functional requirements

  • Count all exact substring matches of length k

  • Combine counts across all sequences

  • Handle large input files efficiently

  • Support k in the range 1 to 20

  • Treat DNA as a simple string over A / C / G / T

  • Define and document your approach to:

    • mixed case characters (A vs a)
    • ambiguous characters (e.g. N)
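One possible policy, shown here only as an example, is to uppercase the input and to skip any k-mer that contains a character other than A, C, G or T. The sketch below combines that policy with a 2-bit encoding, so that each k-mer becomes a small integer (at most 40 bits for k = 20) that is updated in constant time per character. The names `ENCODE`, `encoded_kmers` and `decode` are hypothetical.

```python
ENCODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def encoded_kmers(sequence: str, k: int):
    """Yield each valid k-mer of `sequence` as an integer using 2 bits per base."""
    mask = (1 << (2 * k)) - 1  # keeps only the lowest 2*k bits
    code = 0
    valid = 0  # number of consecutive unambiguous bases seen so far
    for ch in sequence.upper():  # assumption: lowercase is treated as uppercase
        bits = ENCODE.get(ch)
        if bits is None:  # assumption: N and other symbols break the window
            code, valid = 0, 0
            continue
        code = ((code << 2) | bits) & mask
        valid += 1
        if valid >= k:
            yield code


def decode(code: int, k: int) -> str:
    """Turn an integer produced by `encoded_kmers` back into its k-mer string."""
    return "".join("ACGT"[(code >> (2 * (k - 1 - i))) & 3] for i in range(k))


print([decode(c, 3) for c in encoded_kmers("acgNTACGT", 3)])
# ['ACG', 'TAC', 'ACG', 'CGT']
```

Other policies are equally defensible, such as counting lowercase k-mers separately or rejecting files with ambiguous characters. Whichever one is chosen should be stated in the README.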

Performance requirements

Your solution should be able to process inputs such as:

  • A ~1GB FASTA file
  • Hundreds of millions of k-mers

With:

  • Reasonable memory usage
  • Reasonable runtime on a typical laptop

You are not expected to provide exact benchmarks, but your design should clearly justify why it scales to inputs of this size.
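A back-of-the-envelope calculation can support such a justification. The number of distinct k-mers is bounded both by the alphabet (4^k) and by the number of windows in the input. The Python sketch below prints that bound for a few values of k; the figure of 10^9 bases is an assumption of roughly one base per byte in a ~1GB file, and `max_distinct_kmers` is a hypothetical helper.

```python
def max_distinct_kmers(k: int, total_bases: int) -> int:
    """Upper bound on distinct k-mers: limited by the alphabet and by the input size."""
    return min(4 ** k, max(total_bases - k + 1, 0))


TOTAL_BASES = 10 ** 9  # assumption: about one base per byte in a ~1GB file

for k in (4, 8, 12, 16, 20):
    bound = max_distinct_kmers(k, TOTAL_BASES)
    print(f"k={k:2d}  bits per k-mer={2 * k:2d}  distinct k-mers <= {bound:,}")
```

For small k the 4^k term dominates, so a dense array of counters may be enough. For k near 20 the bound is set by the input size, which is where the choice of counting structure and its memory footprint needs the most careful justification.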


Examples

Given the input file:

>sequence1
ACGTAC

With k = 4, the output should include:

kmer,count
ACGT,1
CGTA,1
GTAC,1

Additional example inputs and outputs can be found in the examples/ directory.


Deliverables

Please submit:

  • Source code with a clear, maintainable project structure

  • A README describing:

    • How to build and run the program
    • Any assumptions made
    • Input and output details

Evaluation criteria

The solution will be evaluated on the following criteria:

  • Correctness: whether the program produces correct k-mer counts
  • Clarity: how easy it is for others to understand and work with the code
  • Algorithmic complexity: how performance scales with increasing input size
  • Efficiency: ability to handle very large FASTA files without excessive memory usage
  • Engineering quality: structure, documentation, and robustness of the solution
