comparison roundup/configuration.py @ 8491:520075b29474

feat: support justhtml parsing library to convert email to plain text

justhtml is a pure Python, fast, HTML5 compliant parser. It is now an option for converting html only emails to plain text. Its output format differs slightly from dehtml or beautifulsoup, mostly by removing extra blank lines.

dehtml.py: Using the stream parser of justhtml. Unable to get the full document parser to successfully strip script and style blocks. If I can fix this and use the standard parser, I can in theory generate markdown from the DOM tree generated by justhtml.

Updated test case to include inline elements that should not cause a line break when they are encountered.

Running dehtml as: `python roundup/dehtml.py foo.html` will load foo.html and parse it using all available parsers.

configuration.py: justhtml is available as an option.

docs: updated CHANGES.txt and doc/tracker_config.txt; added beautifulsoup and justhtml to the optional software section of doc/installation.txt.

test_mailgw.py, .github/workflows/ci-test: Updated tests and install justhtml as part of CI.
author John Rouillard <rouilj@ieee.org>
date Sun, 14 Dec 2025 22:40:46 -0500
parents 7142740e6547
children
8490:918792e35e0c 8491:520075b29474
382 382
383 class HtmlToTextOption(Option): 383 class HtmlToTextOption(Option):
384 384
385 """What module should be used to convert emails with only text/html 385 """What module should be used to convert emails with only text/html
386 parts into text for display in roundup. Choose from beautifulsoup 386 parts into text for display in roundup. Choose from beautifulsoup
387 4, dehtml - the internal code or none to disable html to text 387 4, justhtml, dehtml - the internal code or none to disable html to
388 conversion. If beautifulsoup chosen but not available, dehtml will 388 text conversion. If beautifulsoup or justhtml is chosen but not
389 be used. 389 available, dehtml will be used.
390 390
391 """ 391 """
392 392
393 class_description = "Allowed values: beautifulsoup, dehtml, none" 393 class_description = "Allowed values: beautifulsoup, justhtml, dehtml, none"
394 394
395 def str2value(self, value): 395 def str2value(self, value):
396 _val = value.lower() 396 _val = value.lower()
397 if _val in ("beautifulsoup", "dehtml", "none"): 397 if _val in ("beautifulsoup", "justhtml", "dehtml", "none"):
398 return _val 398 return _val
399 else: 399 else:
400 raise OptionValueError(self, value, self.class_description) 400 raise OptionValueError(self, value, self.class_description)
401 401
402 402
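The validation in the hunk above can be tried outside of Roundup with a minimal, self-contained sketch. It assumes nothing about Roundup's real Option base class: the OptionValueError here is a simplified stand-in, and the module-level str2value function and ALLOWED_VALUES tuple are hypothetical names used only for illustration.

```python
# Values accepted after this changeset (taken from the right-hand side
# of the diff above).
ALLOWED_VALUES = ("beautifulsoup", "justhtml", "dehtml", "none")


class OptionValueError(ValueError):
    """Simplified stand-in for Roundup's OptionValueError (assumption)."""


def str2value(value):
    """Normalize the configured value to lower case and validate it."""
    _val = value.lower()
    if _val in ALLOWED_VALUES:
        return _val
    raise OptionValueError(
        "Invalid value %r. Allowed values: %s"
        % (value, ", ".join(ALLOWED_VALUES)))
```

Because the value is lower-cased before the membership test, a setting such as JustHTML is accepted and stored as justhtml, while an unknown name such as lxml is rejected.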
1809 "parts of the multipart/alternative are ignored. The default\n" 1809 "parts of the multipart/alternative are ignored. The default\n"
1810 "is to keep all parts and attach them to the issue."), 1810 "is to keep all parts and attach them to the issue."),
1811 (HtmlToTextOption, "convert_htmltotext", "none", 1811 (HtmlToTextOption, "convert_htmltotext", "none",
1812 "If an email has only text/html parts, use this module\n" 1812 "If an email has only text/html parts, use this module\n"
1813 "to convert the html to text. Choose from beautifulsoup 4,\n" 1813 "to convert the html to text. Choose from beautifulsoup 4,\n"
1814 "dehtml - (internal code), or none to disable conversion.\n" 1814 "justhtml, dehtml - (internal code), or none to disable\n"
1815 "If 'none' is selected, email without a text/plain part\n" 1815 "conversion. If 'none' is selected, email without a text/plain\n"
1816 "will be returned to the user with a message. If\n" 1816 "part will be returned to the user with a message. If\n"
1817 "beautifulsoup is selected but not installed dehtml will\n" 1817 "beautifulsoup or justhtml is selected but not installed\n"
1818 "be used instead."), 1818 "dehtml will be used instead."),
1819 (BooleanOption, "keep_real_from", "no", 1819 (BooleanOption, "keep_real_from", "no",
1820 "When handling emails ignore the Resent-From:-header\n" 1820 "When handling emails ignore the Resent-From:-header\n"
1821 "and use the original senders From:-header instead.\n" 1821 "and use the original senders From:-header instead.\n"
1822 "(This might be desirable in some situations where a moderator\n" 1822 "(This might be desirable in some situations where a moderator\n"
1823 "reads incoming messages first before bouncing them to Roundup)", 1823 "reads incoming messages first before bouncing them to Roundup)",
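
The fallback rule described in the convert_htmltotext help text (use dehtml when beautifulsoup or justhtml is selected but not installed) might be sketched as follows. This is not the code in dehtml.py; pick_converter, _installed and _MODULE_FOR are hypothetical names, and the import names bs4 and justhtml are assumptions about how the two packages are imported.

```python
import importlib.util

# Assumed import name for each optional converter (not taken from the diff).
_MODULE_FOR = {"beautifulsoup": "bs4", "justhtml": "justhtml"}


def _installed(module_name):
    """Return True if a top-level module of that name can be found."""
    return importlib.util.find_spec(module_name) is not None


def pick_converter(configured, is_installed=_installed):
    """Return the converter name to use, or None when conversion is off.

    `configured` is an already validated convert_htmltotext value.
    """
    if configured == "none":
        return None  # html-only email is bounced back to the sender
    module = _MODULE_FOR.get(configured)
    if module is not None and not is_installed(module):
        return "dehtml"  # optional library missing: use internal code
    return configured
```

Passing the availability check as a parameter keeps the sketch testable without installing or uninstalling either library.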

Roundup Issue Tracker: http://roundup-tracker.org/