Accelerated motif search algorithms on dna sequences #5016

Open
quim6 wants to merge 2 commits into biopython:master from quim6:accelerated-pattern-search-on-DNA-sequences

quim6 commented Jun 19, 2025

  • I hereby agree to dual licence this and any previous contributions under both
    the Biopython License Agreement AND the BSD 3-Clause License.

  • I have read the CONTRIBUTING.rst file, have run pre-commit
    locally, and understand that continuous integration checks will be used to
    confirm the Biopython unit tests and style checks pass with these changes.

  • I have added my name to the alphabetical contributors listings in the files
    NEWS.rst and CONTRIB.rst as part of this pull request, am listed
    already, or do not wish to be listed. (This acknowledgement is optional.)

We have developed a set of functions for fast search of motif instances in DNA sequences. Specifically, we have implemented versions of the lookahead, permuted lookahead and super-alphabet algorithms, as described in:

Cinzia Pizzi, Esko Ukkonen - Fast profile matching algorithms — A survey, Theoretical Computer Science, Volume 395, Issues 2–3, 2008, 137-157. ISSN 0304-3975.

These methods have been implemented within the Motifs package as C functions for fast search with a Python wrapper, following the template of the existing _pwm_calculate function. These functions implement threshold-based search, returning only hits above a pre-specified threshold, like the existing motifs.pssm.search method. In contrast to that method, however, they do not call the C calculate function. Rather, the speedup (around 2x to 6x) comes from not computing every score in the sequence: scoring of a candidate position is abandoned as soon as it can no longer reach the threshold.
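To make the pruning idea concrete, here is a minimal pure-Python sketch of lookahead scoring. It is not the C code from this PR: the function name `lookahead_search`, the list-of-dicts matrix layout, and the `min_scores` table (named after the comment in the C source) are assumptions made for illustration, and the sequence is assumed to contain only letters present in the matrix.

```python
def lookahead_search(sequence, pssm, threshold):
    """Return (position, score) pairs with score >= threshold.

    pssm is a list of dicts, one per motif column, mapping letter -> score
    (a hypothetical layout chosen for this sketch).
    """
    m = len(pssm)
    # max_rest[j]: best score still obtainable from columns j+1 .. m-1.
    max_rest = [0.0] * m
    for j in range(m - 2, -1, -1):
        max_rest[j] = max_rest[j + 1] + max(pssm[j + 1].values())
    # min_scores[j]: lowest partial score after column j that can still
    # reach the threshold if every remaining column scores its maximum.
    min_scores = [threshold - rest for rest in max_rest]

    hits = []
    for i in range(len(sequence) - m + 1):
        score = 0.0
        for j in range(m):
            score += pssm[j][sequence[i + j]]
            if score < min_scores[j]:
                break  # prune: this window cannot reach the threshold
        else:
            # Loop finished without pruning, so score >= threshold.
            hits.append((i, score))
    return hits
```

Because `min_scores[m - 1]` equals the threshold itself, the final column doubles as the acceptance test. The permuted variant described in the survey scores the columns in a different order so that pruning tends to trigger earlier; that reordering is not shown here.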

The Python fast_search wrapper function has been designed so that new methods for fast search can be efficiently incorporated into the extension.

Contributor

mdehoon commented Jun 19, 2025

@quim6
Thank you. Can you remove the style changes in the unrelated files (e.g. in Bio/Align/bigbed.py) from this PR?

"Bio/motifs/search_algorithms.c",
"Bio/motifs/darray.c",
],
include_dirs=[numpy.get_include()],

Contributor

Do you need to include the numpy headers?

Author

It's currently used to create and return a NumPy array from the C function 'py_search'.
Is including NumPy headers problematic? If it is, I can move the array creation to pure Python.

Contributor

> Is including NumPy headers problematic?

Yes, as the user may not have installed them. It doesn't matter with a binary distribution, but if they are installing from source, it tends to cause confusion.

> If it is I can move the array creation to pure Python.

Yes, this is the approach we take in other parts of Biopython. Figure out what the dimensions of the output array will be, create it in Python, and pass it to your C function. Then use the buffer interface to fill the array with the values calculated by the C function.
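As a rough illustration of that pattern, the sketch below allocates the output on the Python side and hands it to a filler routine that only writes through the buffer interface. Everything here is an assumption for the sake of the example: `fill_scores` is a pure-Python stand-in for the C function, `calculate` is a hypothetical wrapper, and the standard-library `array` type is used in place of a NumPy array (which exposes the same buffer interface). It scores every position, so the output length is known in advance.

```python
from array import array


def fill_scores(sequence, weights, out):
    """Stand-in for the C function: fill a caller-supplied buffer in place."""
    view = memoryview(out)
    if view.readonly or view.format != "d":
        raise ValueError("expected a writable buffer of C doubles")
    m = len(weights)
    n = len(sequence) - m + 1
    if len(view) != n:
        raise ValueError("output buffer has the wrong length")
    for i in range(n):
        view[i] = sum(weights[j][sequence[i + j]] for j in range(m))


def calculate(sequence, weights):
    """Hypothetical wrapper: size and allocate the output, then fill it."""
    n = max(len(sequence) - len(weights) + 1, 0)
    out = array("d", [0.0]) * n  # allocated in Python, zero-filled
    fill_scores(sequence, weights, out)
    return out
```

With this split, the C side never has to create an array object itself, so it does not need the NumPy headers at build time.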

Contributor

Another solution is to define a new class corresponding to one Hit. That may be better anyway, because then you can store the position as an integer instead of a floating point value.

Or you can return two arrays, one for the positions and one for the scores.
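Both alternatives can be sketched briefly. The `Hit` class and the `split_hits` helper below are hypothetical names used only for illustration, not part of Biopython; the point is simply that positions stay integers while scores stay floating point.

```python
from array import array
from typing import NamedTuple


class Hit(NamedTuple):
    """One match: an integer position and a floating-point score."""
    position: int
    score: float


def split_hits(hits):
    """Return two parallel typed arrays: positions and scores."""
    positions = array("q", (hit.position for hit in hits))  # signed 64-bit ints
    scores = array("d", (hit.score for hit in hits))        # C doubles
    return positions, scores
```

Either form avoids packing positions and scores into a single floating-point array.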

* - If the partial score is less than min_scores[j], break early (prune)
* - If the full subsequence score ≥ threshold, store the match.
*
* Argsuments:

Contributor

Argsuments -> arguments.
(I noticed this in a few more places)

Author

Thanks for catching this! I'll fix the typo immediately and remove the unrelated style changes (caused by my Black formatting issues).
Sorry about the noise; I will push a clean version shortly.

Author

quim6 commented Jun 19, 2025

I sincerely apologize: my Black formatting attempt accidentally reformatted many unrelated files. I'm cleaning this up now to isolate only the intended changes.

quim6 force-pushed the accelerated-pattern-search-on-DNA-sequences branch from 2d9a3a2 to d35fe3c on June 20, 2025 00:33