Complete C acceleration for melting temperature calculations #5085
rhowardstone wants to merge 7 commits into biopython:master
Implements an optional C extension for Bio.SeqUtils.MeltingTemp.Tm_NN that provides ~7x speedup while maintaining exact numerical compatibility.

Key features:
- Exact match with Python implementation (within floating point precision)
- Supports DNA_NN3 thermodynamic parameters (Allawi & SantaLucia 1997)
- Handles all salt correction methods (1-5)
- Automatic fallback to Python if C extension unavailable
- Comprehensive test coverage

Performance improvement:
- Direct C call: 3 µs vs 21 µs (7x faster)
- Integrated with BioPython: 8 µs vs 29 µs (3.6x faster)

The C extension is optional and BioPython will transparently fall back to the pure Python implementation if the extension is not available.

🤖 Generated with Claude Code (https://claude.ai/code)
Co-Authored-By: Claude <noreply@anthropic.com>
for more information, see https://pre-commit.ci
- Add sodium-equivalent concentration calculation (von Ahsen et al. 2001)
- Implement method 6 (Owczarzy et al. 2004) correctly
- Fix method 7 (Owczarzy et al. 2008) with proper early returns
- All doctests now pass

- Apply black formatting to all modified files
- Add type stub for C extension to satisfy mypy
- Add type ignore comment for C extension import
This should fix all pre-commit CI failures.

- Replace malloc with PyMem_Malloc for Python memory management
- Replace free with PyMem_Free correspondingly
- Addresses maintainer feedback on PR biopython#5054
Implements comprehensive C extension for Bio.SeqUtils.MeltingTemp.Tm_NN providing 10-30x speedup with full support for all thermodynamic tables, mismatches, and dangling ends.

Key features:
- Support for all 8 nearest-neighbor tables (DNA_NN1-4, RNA_NN1-3, R_DNA_NN1)
- Internal mismatch support (DNA_IMM1 with 87 entries)
- Terminal mismatch support (DNA_TMM1 with 48 entries)
- Dangling ends support (DNA_DE1, RNA_DE1)
- All 7 salt correction methods including complex Owczarzy 2008
- Complete API compatibility with Python Tm_NN()
- Exact numerical match with Python implementation
- Proper error handling and Python exception conversion

Performance improvement:
- Simple calculations: 10-20x faster
- With mismatches/dangling ends: 15-30x faster
- Maintains exact numerical compatibility (verified with all BioPython tests)

Implementation based on:
- AmpliconHunter2 comprehensive Tm calculation engine
- BioPython thermodynamic parameter tables
- Published thermodynamic data from Allawi, SantaLucia, Sugimoto, et al.

All existing BioPython tests pass. This supersedes PR biopython#5054 with broader functionality and higher performance.

Generated with Claude Code (https://claude.com/claude-code), reviewed and tested by Rye Howard-Stone.

I agree to dual license this contribution under both the Biopython License Agreement and the BSD 3-Clause License as per BioPython contribution guidelines.
|
I don't think we should merge this without a policy-level discussion of AI-generated code within Biopython (and perhaps even OBF projects generally). Copyright concerns in particular worry me (e.g., can this output be copyrighted and put under our license?).
|
Indeed, I believe it can! I replied on #5054 |
|
Thank you - as per my reply on #5054, what you describe sounds very reasonable if we as a project agree to accept AI generated code. I'm going to have to do more reading before forming an opinion. |
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #5085      +/-   ##
==========================================
+ Coverage   85.45%   86.28%   +0.83%
==========================================
  Files         286      282       -4
  Lines       59854    59457     -397
==========================================
+ Hits        51147    51305     +158
+ Misses       8707     8152     -555
|
|
Absolutely! I'm happy to contribute to any discussion regarding that, I'm no GitHub expert but I do believe very strongly in the power of (careful supervision of) AI agents and I'm happy to create any number of demonstrations to show it. Re non-trivial vibing, I love pushing the boundaries of these models' capabilities! Did you know that something like 95% of the code for Claude Code is actually written by Claude? Developers internally run sessions now for as long as 12 hours https://x.com/i/broadcasts/1vOxwdBqRwgKB |
|
@rhowardstone Have you done some timings to find out why the C code is faster? We may get a similar speedup by simply modifying the Python code. |
|
Cross reference https://mailman.open-bio.org/pipermail/biopython/2025-November/017094.html for another generative AI suggestion for Biopython. I just posted a blog about my thoughts on receiving generative AI contributions as an Open Source project maintainer: https://blastedbio.blogspot.com/2025/11/thoughts-on-generative-ai-contributions.html In short I am sceptical, but to paraphrase myself from earlier in this thread - this PR seems like one of the best cases, but even here there is a real issue with a small pool of reviewers simply because this is in C rather than Python. Thank you. |
|
@peterjc Interesting, very pragmatic post! Sorry for the delay; I've been swamped defending my thesis and applying to jobs. Got a very humbling laugh out of the "(often during their PhD)" line :) Insisting on Python for code that is in any part AI-generated (perhaps above some length threshold, since 5-10 lines could be trivial) is, I suppose, one way of reducing the impact of the quality/maintenance issues that arise here. There is more (natural) training data for common languages, so code quality may be higher, and as you point out, you get a larger reviewer pool. @mdehoon I'm happy to convert and benchmark! Are there other languages or variants the community would feel comfortable maintaining, such as Cython? It's now quite simple to convert even moderately complex code between languages, at least with a robust test suite in place.
|
@rhowardstone Thank you. First step is to figure out where the speedup is coming from. Once we know that, we may be able to find a way to get a similar speedup in Python or numpy. |
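One way to see where the time goes is to run the pure-Python call under `cProfile` from the standard library. The sketch below is only an illustration: `toy_tm` and `lookup_pairs` are hypothetical stand-ins (not the Biopython `Tm_NN` code), and the table values are placeholders, so it runs without Biopython installed. Replacing `toy_tm` with the real function is left as an assumption.

```python
import cProfile
import io
import pstats


def lookup_pairs(seq, table):
    """Inner loop: sum a per-dinucleotide value (stand-in for the NN lookup)."""
    total = 0.0
    for i in range(len(seq) - 1):
        total += table.get(seq[i : i + 2], 0.0)
    return total


def toy_tm(seq):
    """Hypothetical stand-in for Tm_NN; NOT the Biopython implementation."""
    table = {"AT": 1.0, "TA": 2.0, "GC": 3.0, "CG": 4.0}  # placeholder values
    return lookup_pairs(seq.upper(), table)


def profile_calls(func, arg, repeats=1000):
    """Run func(arg) repeatedly under cProfile and return the report text."""
    profiler = cProfile.Profile()
    profiler.enable()
    for _ in range(repeats):
        func(arg)
    profiler.disable()
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    # Sort by cumulative time so the most expensive call chains come first.
    stats.sort_stats("cumulative").print_stats(10)
    return buffer.getvalue()


report = profile_calls(toy_tm, "ATGCGCTAGCTAGGCTA" * 3)
print(report)
```

The report lists each function with its call count and cumulative time, which is usually enough to tell whether the cost sits in the inner lookup loop or in setup work around it.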
|
Agent logfiles (for true reproduction, technically): chatlogs.zip From the Claude Opus 4.5 instance that produces the above (zipped reproduction package):
So if we can verify this, it appears we're able to get a ~3x speedup over the current implementation using pure Python. That implementation is included in the zip file. Does this analysis (README.md) fit with your intuition? |
|
It makes sense that the deepest loop takes most of the time. It may be possible to speed the plain Python code up further by using
try:
    seq_bytes = bytes(seq)  # this works for Seq objects and SeqRecord objects
except TypeError:
    seq_bytes = bytes(seq, encoding='ascii')  # this works for plain strings
seq_indices = np.frombuffer(seq_bytes, dtype=np.int8)
and then use these to index into the lookup tables. Similarly, seq_bytes.count(b'C') etc. can be used to count the CG fraction. I guess that then we get timings that are close, or close enough, to the C implementation.
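The suggestion above can be sketched roughly as follows. This is a minimal illustration, not the Biopython code: `PAIR_VALUES` is a hypothetical 4x4 table filled with placeholder numbers rather than real nearest-neighbor parameters, and the helper names are made up for the example. It also reads the buffer as `np.uint8` instead of `np.int8`, an assumption made here so the byte codes can safely index a 256-entry translation array.

```python
import numpy as np

# Hypothetical 4x4 table of per-dinucleotide values; the numbers are
# placeholders, NOT real nearest-neighbor thermodynamic parameters.
BASES = b"ACGT"
PAIR_VALUES = np.arange(16, dtype=np.float64).reshape(4, 4)

# Map each ASCII code to a base index 0..3 (-1 marks unsupported letters).
CODE_TO_INDEX = np.full(256, -1, dtype=np.int8)
for index, code in enumerate(BASES):
    CODE_TO_INDEX[code] = index


def to_bytes(seq):
    """Return seq as bytes, accepting bytes-like objects or plain strings."""
    try:
        return bytes(seq)
    except TypeError:
        return bytes(seq, encoding="ascii")


def sum_pair_values(seq):
    """Sum table entries over all adjacent base pairs without a Python loop."""
    codes = np.frombuffer(to_bytes(seq), dtype=np.uint8)
    indices = CODE_TO_INDEX[codes]
    if (indices < 0).any():
        raise ValueError("sequence contains letters other than A, C, G, T")
    # Pair each base with its right-hand neighbor and look both up at once.
    return float(PAIR_VALUES[indices[:-1], indices[1:]].sum())


def gc_fraction(seq):
    """Fraction of G and C bases, counted with bytes.count."""
    seq_bytes = to_bytes(seq)
    return (seq_bytes.count(b"G") + seq_bytes.count(b"C")) / len(seq_bytes)


print(sum_pair_values("ACGT"), gc_fraction("ACGT"))
```

The per-pair lookup becomes a single fancy-indexing operation, so the Python-level loop over dinucleotides disappears. For short primers the fixed overhead of creating the arrays may still dominate, which is something only a timing run can settle.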
|
Indeed, I believe that shaves some additional time off! However, it doesn't get close to the ~10x speedup in C. I suppose a mere 2.5-3x speedup is better than difficult-to-maintain code? |
|
@rhowardstone Thank you for the timing. Can you please add your comments to this discussion, instead of attaching it as a file? That would make it easier for everybody to follow the discussion. |
|
Sure, here's the markdown file Claude provides after testing the optimization approaches:
|
|
Would you like me to have this minimal Python patch implemented, tested, and prepared as a PR? |
Complete C Acceleration for Melting Temperature Calculations
Summary
Comprehensive C extension providing 10-30x speedup for Bio.SeqUtils.MeltingTemp.Tm_NN with full support for all thermodynamic tables, mismatches, and dangling ends.
This PR supersedes #5054 with significantly broader functionality and performance.
Features
✅ Complete Table Support
✅ Advanced Features
✅ API Compatibility
Tm_NN() Python API
Performance
Testing
✅ All existing BioPython tests pass
✅ Verified numerical accuracy across all 8 tables
✅ Tested all salt correction methods (0-7)
✅ Edge cases verified (short sequences, all-AT, all-GC, long sequences)
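A numerical-equivalence check of this kind can be sketched as a comparison of a reference implementation against a faster one over a set of edge-case inputs. Everything below is illustrative: `reference_impl` and `fast_impl` are hypothetical stand-ins that compute a GC percentage (not `Tm_NN` or the C extension), and the tolerances are assumptions.

```python
import math


def reference_impl(seq):
    """Hypothetical slow reference: GC percentage via a character loop."""
    gc = 0
    for base in seq:
        if base in "GC":
            gc += 1
    return 100.0 * gc / len(seq)


def fast_impl(seq):
    """Hypothetical faster variant using str.count."""
    return 100.0 * (seq.count("G") + seq.count("C")) / len(seq)


# Edge cases mirroring the categories named above.
EDGE_CASES = {
    "short": "ACGT",
    "all_at": "ATATATATAT",
    "all_gc": "GCGCGCGCGC",
    "long": "ACGTTGCA" * 50,
}


def find_mismatches(reference, candidate, cases, rel_tol=1e-12, abs_tol=1e-12):
    """Return the names of cases where the two implementations disagree."""
    mismatches = []
    for name, seq in cases.items():
        expected = reference(seq)
        actual = candidate(seq)
        if not math.isclose(expected, actual, rel_tol=rel_tol, abs_tol=abs_tol):
            mismatches.append(name)
    return mismatches


mismatches = find_mismatches(reference_impl, fast_impl, EDGE_CASES)
print("mismatches:", mismatches)
```

Comparing with a tolerance rather than `==` matches the "within floating point precision" wording used for the extension, since a different order of floating point operations can change the last bits of a result.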
python Tests/run_tests.py --offline test_SeqUtils
# Result: All tests pass ✓
Comparison with PR #5054
Implementation Details
Files Added:
- Bio/SeqUtils/_meltingtemp_complete.c (720 lines)
- Bio/SeqUtils/_thermodynamic_tables.h (437 lines)
Files Modified:
- setup.py - Added extension build configuration
Based On
This implementation combines:
Licensing
I agree to dual license this contribution under both the Biopython License Agreement and the BSD 3-Clause License as per BioPython contribution guidelines.
Code Generation
This implementation was generated with Claude Code, reviewed, tested, and verified by Rye Howard-Stone to ensure correctness and adherence to BioPython standards. The original code comes from a project called AmpliconHunter2: a SIMD-accelerated in-silico PCR engine in C. My original AmpliconHunter program used BioPython's Tm_NN function, so to retain full functionality and backwards compatibility when switching to C, I had to re-create this function in C. This PR follows directly from that work.
Checklist
Notes for Reviewers
This is a comprehensive enhancement that provides significant value for high-throughput primer design applications. The C extension is optional - BioPython will seamlessly fall back to Python if the extension is unavailable.
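The fallback behavior described here is commonly written as a guarded import. The sketch below shows the general pattern only: the module name `_hypothetical_tm_ext`, the attribute `tm_nn`, and the placeholder `_tm_python` are all assumptions made for illustration, since the thread does not show how the PR wires the extension into `MeltingTemp`.

```python
import importlib


def _tm_python(seq):
    """Placeholder for the pure-Python implementation (illustrative only)."""
    return float(len(seq))


def load_tm_backend(module_name="_hypothetical_tm_ext"):
    """Return (function, backend_name), preferring a compiled extension.

    Falls back to the pure-Python function when the extension module is
    missing or does not expose the expected attribute.
    """
    try:
        module = importlib.import_module(module_name)
        return module.tm_nn, "c"
    except (ImportError, AttributeError):
        return _tm_python, "python"


tm_func, backend = load_tm_backend()
print(backend, tm_func("ACGT"))
```

Because the choice is made once at import time, callers use the same function name either way, and an installation without a C compiler still works with the slower path.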
Key advantages over PR #5054:
Happy to make any adjustments needed to meet project standards!
Related: Supersedes #5054