Accelerated motif search algorithms on dna sequences #5016
quim6 wants to merge 2 commits into biopython:master
    "Bio/motifs/search_algorithms.c",
    "Bio/motifs/darray.c",
    ],
    include_dirs=[numpy.get_include()],
Do you need to include the numpy headers?
It's currently used to create and return a NumPy array from the C function 'py_search'.
Is including NumPy headers problematic? If it is, I can move the array creation to pure Python.
> Is including NumPy headers problematic?
Yes, as the user may not have them installed. It doesn't matter with a binary distribution, but if they are installing from source, it tends to cause confusion.
> If it is I can move the array creation to pure Python.
Yes, this is the approach we take in other parts of Biopython. Figure out what the dimensions of the output array will be, create it in Python, and pass it to your C function. Then use the buffer interface to fill the array with the values calculated by the C function.
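The pattern described above (allocate in Python, fill through the buffer interface) can be sketched roughly as follows. This is only an illustration under stated assumptions: `fill_scores` is a hypothetical pure-Python stand-in for the C function, the match-count score is a placeholder, and the standard-library `array` type stands in for a NumPy array, since both export a writable buffer.

```python
from array import array


def fill_scores(sequence, motif, out):
    """Hypothetical stand-in for the C function.

    Writes one score per window into the caller-allocated buffer `out`.
    The score here is just the number of matching letters (placeholder).
    """
    view = memoryview(out)  # accepts any writable buffer exporter
    m = len(motif)
    for i in range(len(sequence) - m + 1):
        window = sequence[i : i + m]
        view[i] = float(sum(a == b for a, b in zip(window, motif)))


def score_windows(sequence, motif):
    """Python wrapper: work out the output size, allocate, then fill."""
    n = max(len(sequence) - len(motif) + 1, 0)
    out = array("d", [0.0]) * n  # allocated on the Python side
    fill_scores(sequence, motif, out)
    return out


scores = score_windows("ACGTACGT", "ACG")
```

With this split, the extension itself would not need to create any array object; it only needs to accept a buffer of the right size and type.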
Another solution is to define a new class corresponding to one Hit. That may be better anyway, because then you can store the position as an integer instead of a floating point value.
Or you can return two arrays, one for the positions and one for the scores.
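As a rough sketch of these two alternatives: the `Hit` class and the `split_hits` helper below are hypothetical names chosen for illustration, not part of the pull request or of Biopython.

```python
from array import array
from typing import NamedTuple


class Hit(NamedTuple):
    """One motif hit; the position stays an integer (hypothetical class)."""

    position: int
    score: float


def split_hits(hits):
    """Alternative layout: one integer array of positions, one float array of scores."""
    positions = array("q", (h.position for h in hits))
    scores = array("d", (h.score for h in hits))
    return positions, scores


hits = [Hit(3, 7.5), Hit(10, 6.25)]
positions, scores = split_hits(hits)
```

Either layout avoids storing positions as floating point values in a single mixed array.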
Bio/motifs/search_algorithms.c
Outdated
 * - If the partial score is less than min_scores[j], break early (prune)
 * - If the full subsequence score ≥ threshold, store the match.
 *
 * Argsuments:
Argsuments -> arguments.
(I noticed this in a few more places)
Thanks for catching this! I'll fix the typo immediately and remove the unrelated style changes (caused by my Black formatting issues).
Sorry about the noise; I will push a clean version shortly.
I sincerely apologize: my Black formatting attempt accidentally reformatted many unrelated files. I'm cleaning this up now to isolate only the intended changes.
2d9a3a2 to d35fe3c
I hereby agree to dual licence this and any previous contributions under both
the Biopython License Agreement AND the BSD 3-Clause License.
I have read the CONTRIBUTING.rst file, have run pre-commit locally, and understand that continuous integration checks will be used to
confirm the Biopython unit tests and style checks pass with these changes.
I have added my name to the alphabetical contributors listings in the files
NEWS.rst and CONTRIB.rst as part of this pull request, am listed already, or do not wish to be listed. (This acknowledgement is optional.)
We have developed a set of functions for fast search of motif instances in DNA sequences. Specifically, we have implemented versions of the lookahead, permuted lookahead and super-alphabet algorithms, as described in:
Cinzia Pizzi, Esko Ukkonen - Fast profile matching algorithms — A survey, Theoretical Computer Science, Volume 395, Issues 2–3, 2008, 137-157. ISSN 0304-3975.
These methods have been implemented within the Motifs package, as C functions for fast search with a Python wrapper, following the template of the existing _pwm_calculate function. These functions implement threshold-based search (returning only hits above a pre-specified threshold, like the existing motifs.pssm.search method). In contrast to that method, however, they do not call the C calculate function. Rather, the speedup (around 2x to 6x) provided by these methods comes from not computing every score in the sequence: any subsequence that cannot reach the threshold is dropped early.
The Python fast_search wrapper function has been designed so that new methods for fast search can be efficiently incorporated into the extension.
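To illustrate the pruning idea behind the lookahead method, here is a minimal pure-Python sketch. It is not the code from this pull request: the function name, the list-of-dicts PSSM layout, and the way `min_scores` is derived (the threshold minus the best score still attainable from the remaining columns) are assumptions made for the example, and sequences are assumed to contain only letters present in the matrix.

```python
def lookahead_search(sequence, pssm, threshold):
    """Return (position, score) for every window scoring >= threshold.

    pssm: list of dicts mapping letter -> score, one dict per motif column
    (assumed layout, for illustration only).
    """
    m = len(pssm)
    # max_rest[j]: best score still attainable from columns j+1 .. m-1
    max_rest = [0.0] * m
    for j in range(m - 2, -1, -1):
        max_rest[j] = max_rest[j + 1] + max(pssm[j + 1].values())
    # min_scores[j]: lowest partial score after column j that can still
    # reach the threshold
    min_scores = [threshold - rest for rest in max_rest]

    hits = []
    for i in range(len(sequence) - m + 1):
        score = 0.0
        for j in range(m):
            score += pssm[j][sequence[i + j]]
            if score < min_scores[j]:
                break  # prune: this window cannot reach the threshold
        else:
            # no break, so the full window score is >= threshold
            hits.append((i, score))
    return hits


# Toy matrix favouring the motif "ACG" (made-up values)
pssm = [
    {"A": 2, "C": -1, "G": -1, "T": -1},
    {"A": -1, "C": 2, "G": -1, "T": -1},
    {"A": -1, "C": -1, "G": 2, "T": -1},
]
hits = lookahead_search("ACGTACGA", pssm, 5)
```

In this toy case, most windows are abandoned after the first column, which is where a speedup over scoring every position in full would come from.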