comparison doc/tracker_config.txt @ 8491:520075b29474

feat: support justhtml parsing library to convert email to plain text

justhtml is a fast, pure-Python, HTML5-compliant parser. It is now an option for converting HTML-only emails to plain text. Its output format differs slightly from that of dehtml or beautifulsoup, mostly by removing extra blank lines.

dehtml.py: Uses the stream parser of justhtml. I was unable to get the full document parser to strip script and style blocks. If I can fix this and use the standard parser, I can in theory generate markdown from the DOM tree that justhtml builds.

Updated the test case to include inline elements that should not cause a line break when they are encountered.

Running dehtml as `python roundup/dehtml.py foo.html` will load foo.html and parse it using all available parsers.

configuration.py: justhtml is available as an option.

docs: updated CHANGES.txt and doc/tracker_config.txt; added beautifulsoup and justhtml to the optional software section of doc/installation.txt.

test_mailgw.py, .github/workflows/ci-test: Updated tests and install justhtml as part of CI.
author John Rouillard <rouilj@ieee.org>
date Sun, 14 Dec 2025 22:40:46 -0500
parents 6e44b3b20df2
children d0dfb4085e94
comparison
8490:918792e35e0c                                                | 8491:520075b29474
1110 # Default: no                                               | 1110 # Default: no
1111 ignore_alternatives = no                                    | 1111 ignore_alternatives = no
1112                                                             | 1112
1113 # If an email has only text/html parts, use this module     | 1113 # If an email has only text/html parts, use this module
1114 # to convert the html to text. Choose from beautifulsoup 4, | 1114 # to convert the html to text. Choose from beautifulsoup 4,
1115 # dehtml - (internal code), or none to disable conversion.  | 1115 # justhtml, dehtml - (internal code), or none to disable
1116 # If 'none' is selected, email without a text/plain part    | 1116 # conversion. If 'none' is selected, email without a text/plain
1117 # will be returned to the user with a message. If           | 1117 # part will be returned to the user with a message. If
1118 # beautifulsoup is selected but not installed dehtml will   | 1118 # beautifulsoup is selected but not installed dehtml will
1119 # be used instead.                                          | 1119 # be used instead.
1120 # Allowed values: beautifulsoup, dehtml, none               | 1120 # Allowed values: beautifulsoup, justhtml, dehtml, none
1121 # Default: none                                             | 1121 # Default: none
1122 convert_htmltotext = none                                   | 1122 convert_htmltotext = none
1123                                                             | 1123
1124 # When handling emails ignore the Resent-From:-header       | 1124 # When handling emails ignore the Resent-From:-header
1125 # and use the original senders From:-header instead.        | 1125 # and use the original senders From:-header instead.
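
The option comment in the new revision lists four allowed values and one fallback rule (beautifulsoup falls back to dehtml when it is not installed). A minimal sketch of that rule, not Roundup's own code: `resolve_converter` and `SAMPLE` are hypothetical names, the `[mailgw]` section name is an assumption (the comparison does not show the section header), and `bs4` is the import name of Beautiful Soup 4. What happens when justhtml is selected but missing is not stated in the comment, so the sketch leaves that value untouched.

```python
import configparser
import importlib.util

# Values from the "Allowed values" line of the new revision.
ALLOWED = ("beautifulsoup", "justhtml", "dehtml", "none")


def resolve_converter(requested, is_installed=None):
    """Return the converter name that would be used for `requested`.

    Hypothetical helper. Only the documented fallback
    (beautifulsoup -> dehtml) is modelled here.
    """
    if is_installed is None:
        def is_installed(module):
            # find_spec returns None when a top-level module is absent.
            return importlib.util.find_spec(module) is not None

    if requested not in ALLOWED:
        raise ValueError("convert_htmltotext must be one of %s, got %r"
                         % (", ".join(ALLOWED), requested))

    if requested == "beautifulsoup" and not is_installed("bs4"):
        return "dehtml"
    return requested


# Assumed config fragment; the section name is a guess, see above.
SAMPLE = """
[mailgw]
convert_htmltotext = justhtml
"""

parser = configparser.ConfigParser()
parser.read_string(SAMPLE)
setting = parser.get("mailgw", "convert_htmltotext")
print(resolve_converter(setting))  # prints "justhtml"
```

Passing `is_installed` as an argument lets the fallback be checked without installing or uninstalling any package.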

Roundup Issue Tracker: http://roundup-tracker.org/