Accelerated motif search algorithms on dna sequences by quim6 · Pull Request #5016 · biopython/biopython

quim6 · 2025-06-19T22:56:56Z

I hereby agree to dual licence this and any previous contributions under both
the Biopython License Agreement AND the BSD 3-Clause License.
I have read the CONTRIBUTING.rst file, have run pre-commit
locally, and understand that continuous integration checks will be used to
confirm the Biopython unit tests and style checks pass with these changes.
I have added my name to the alphabetical contributors listings in the files
NEWS.rst and CONTRIB.rst as part of this pull request, am listed
already, or do not wish to be listed. (This acknowledgement is optional.)

We have developed a set of functions for fast search of motif instances in DNA sequences. Specifically, we have implemented versions of the lookahead, permuted lookahead and super-alphabet algorithms, as described in:

Cinzia Pizzi, Esko Ukkonen - Fast profile matching algorithms — A survey, Theoretical Computer Science, Volume 395, Issues 2–3, 2008, 137-157. ISSN 0304-3975.

These methods have been implemented within the Motifs package as C functions for fast search with a Python wrapper, following the template of the existing _pwm_calculate function. These functions implement threshold-based search (returning only hits above a pre-specified threshold, like the existing motifs.pssm.search method). In contrast to that method, however, they do not call the C calculate function. Rather, the speedup (around 2x to 6x) comes from not computing every score in the sequence: any subsequence that can no longer reach the threshold is dropped early.
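To make the pruning idea concrete, here is a minimal pure-Python sketch of plain lookahead filtering. It is not the C code from this PR: the function name lookahead_search, the list-of-dicts matrix layout, and the restriction to sequences containing only A, C, G and T are assumptions made for illustration.

```python
def lookahead_search(sequence, pssm, threshold):
    """Return (position, score) for every window scoring >= threshold.

    pssm is a list with one dict per motif column, mapping a base to its
    score. Assumes the sequence contains only bases present in the matrix.
    """
    m = len(pssm)
    # best_suffix[j]: highest score still obtainable from columns j..m-1.
    best_suffix = [0.0] * (m + 1)
    for j in range(m - 1, -1, -1):
        best_suffix[j] = best_suffix[j + 1] + max(pssm[j].values())
    # min_scores[j]: lowest partial score after column j that can still
    # reach the threshold if every remaining column scores its maximum.
    min_scores = [threshold - best_suffix[j + 1] for j in range(m)]

    hits = []
    for i in range(len(sequence) - m + 1):
        score = 0.0
        for j in range(m):
            score += pssm[j][sequence[i + j]]
            if score < min_scores[j]:
                break  # prune: this window cannot reach the threshold
        else:
            # No break occurred, so the full score is >= threshold.
            hits.append((i, score))
    return hits


# Toy matrix that rewards the word "ACG" (+1 per match, -1 per mismatch).
toy_pssm = [
    {"A": 1.0, "C": -1.0, "G": -1.0, "T": -1.0},
    {"A": -1.0, "C": 1.0, "G": -1.0, "T": -1.0},
    {"A": -1.0, "C": -1.0, "G": 1.0, "T": -1.0},
]
print(lookahead_search("TTACGACGTACG", toy_pssm, 3.0))
# [(2, 3.0), (5, 3.0), (9, 3.0)]
```

The gain comes from the inner loop ending early on most windows; how much is saved depends on the matrix and on how strict the threshold is.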

The Python fast_search wrapper function has been designed so that new methods for fast search can be efficiently incorporated into the extension.

mdehoon · 2025-06-19T23:14:54Z

@quim6
Thank you. Can you remove the style changes in the unrelated files (e.g. in Bio/Align/bigbed.py) from this PR?

mdehoon · 2025-06-19T23:15:37Z

setup.py

+            "Bio/motifs/search_algorithms.c",
+            "Bio/motifs/darray.c",
+        ],
+        include_dirs=[numpy.get_include()],


Do you need to include the numpy headers?

It's currently used to create and return a NumPy array from the C function 'py_search'.
Is including NumPy headers problematic? If it is, I can move the array creation to pure Python.

Is including NumPy headers problematic?

Yes, as the user may not have installed them. It doesn't matter with a binary distribution, but if they are installing from source, it tends to cause confusion.

If it is I can move the array creation to pure Python.

Yes, this is the approach we take in other parts of Biopython. Figure out what the dimensions of the output array will be, create it in Python, and pass it to your C function. Then use the buffer interface to fill the array with the values calculated by the C function.
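The pattern described here can be sketched from the Python side as follows. The function fill_scores is a hypothetical stand-in for the C routine, written in Python so the example runs on its own; a real C extension would obtain the same writable buffer through the C-level buffer API instead of memoryview. The standard-library array type is used to keep the sketch free of dependencies, but a NumPy array created in Python exposes the same buffer interface.

```python
from array import array


def fill_scores(values, out):
    """Hypothetical stand-in for a C routine that fills a caller's buffer.

    Writes the values into the caller-supplied buffer and returns how many
    entries were written, so the caller can trim an oversized array.
    """
    view = memoryview(out)
    if view.readonly or view.format != "d":
        raise ValueError("expected a writable buffer of C doubles")
    if len(view) < len(values):
        raise ValueError("output buffer is too small")
    for k, value in enumerate(values):
        view[k] = value
    return len(values)


# The caller decides the size and allocates the array in Python.
capacity = 4
scores = array("d", [0.0]) * capacity
written = fill_scores([1.5, 2.5], scores)
print(written, scores[:written].tolist())
# 2 [1.5, 2.5]
```

For a threshold search the number of hits is not known in advance, so one option (an assumption here, not something stated in the thread) is to allocate for the worst case of one hit per window and trim to the returned count.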

Another solution is to define a new class corresponding to one Hit. That may be better anyway, because then you can store the position as an integer instead of a floating point value.
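As a rough illustration of this suggestion, a hit record could look like the sketch below. The class name Hit and its field names are hypothetical; the thread does not specify them, and a class defined in the C extension would look different.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    """One match: an integer position and a floating point score."""

    position: int
    score: float


hit = Hit(position=42, score=7.25)
position, score = hit  # still unpacks like a plain (position, score) tuple
print(type(hit.position).__name__, type(hit.score).__name__)
# int float
```

Packing positions and scores into a single floating point array, by contrast, forces every position to be stored as a float.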

Or you can return two arrays, one for the positions and one for the scores.
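A minimal sketch of the two-array layout, again using the standard-library array type so it runs without NumPy. The helper name split_hits and the choice of 64-bit signed integers for positions are assumptions for illustration.

```python
from array import array


def split_hits(hits):
    """Split (position, score) pairs into an integer and a float array."""
    positions = array("q", (pos for pos, _ in hits))   # signed 64-bit ints
    scores = array("d", (score for _, score in hits))  # C doubles
    return positions, scores


positions, scores = split_hits([(2, 3.0), (5, 2.5)])
print(positions.tolist(), scores.tolist())
# [2, 5] [3.0, 2.5]
```

Keeping the two arrays the same length and in the same order lets index k in one array refer to the same hit in the other.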

mdehoon · 2025-06-19T23:16:40Z

Bio/motifs/search_algorithms.c

+*           - If the partial score is less than min_scores[j], break early (prune)
+*       - If the full subsequence score ≥ threshold, store the match.
+*
+* Argsuments:


Argsuments -> arguments.
(I noticed this in a few more places)

Thanks for catching this! I'll fix the typo immediately and remove the unrelated style changes (caused by my Black formatting issues).
Sorry about the noise; I will push a clean version shortly.

quim6 · 2025-06-19T23:56:30Z

I sincerely apologize: my Black formatting attempt accidentally reformatted many unrelated files. I'm cleaning this up now to isolate only the intended changes.

quim6 requested review from JoaoRodrigues, mdehoon and peterjc as code owners June 19, 2025 22:56

mdehoon reviewed Jun 19, 2025


accelerated motif search (clean version)

d35fe3c

quim6 force-pushed the accelerated-pattern-search-on-DNA-sequences branch from 2d9a3a2 to d35fe3c on June 20, 2025 00:33

Argsument -> Argument typo fixed on search_algorithms.c

44dab50

quim6 wants to merge 2 commits into biopython:master from quim6:accelerated-pattern-search-on-DNA-sequences
