diff roundup/configuration.py @ 8491:520075b29474

feat: support justhtml parsing library to convert email to plain text

justhtml is a pure Python, fast, HTML5-compliant parser. It is now an option for converting HTML-only emails to plain text. Its output format differs slightly from that of dehtml or beautifulsoup, mostly by removing extra blank lines.

dehtml.py: Uses the stream parser of justhtml. I was unable to get the full document parser to strip script and style blocks. If I can fix this and use the standard parser, I can in theory generate markdown from the DOM tree generated by justhtml.

Updated the test case to include inline elements that should not cause a line break when they are encountered.

Running dehtml as `python roundup/dehtml.py foo.html` will load foo.html and parse it using all available parsers.

configuration.py: justhtml is available as an option.

docs: updated CHANGES.txt and doc/tracker_config.txt; added beautifulsoup and justhtml to the optional software section of doc/installation.txt.

test_mailgw.py, .github/workflows/ci-test: Updated tests and install justhtml as part of CI.
author John Rouillard <rouilj@ieee.org>
date Sun, 14 Dec 2025 22:40:46 -0500
parents 7142740e6547
children
--- a/roundup/configuration.py	Sat Dec 13 23:02:53 2025 -0500
+++ b/roundup/configuration.py	Sun Dec 14 22:40:46 2025 -0500
@@ -384,17 +384,17 @@
 
     """What module should be used to convert emails with only text/html
     parts into text for display in roundup. Choose from beautifulsoup
-    4, dehtml - the internal code or none to disable html to text
-    conversion. If beautifulsoup chosen but not available, dehtml will
-    be used.
+    4, justhtml, dehtml - the internal code or none to disable html to
+    text conversion. If beautifulsoup or justhtml is chosen but not
+    available, dehtml will be used.
 
     """
 
-    class_description = "Allowed values: beautifulsoup, dehtml, none"
+    class_description = "Allowed values: beautifulsoup, justhtml, dehtml, none"
 
     def str2value(self, value):
         _val = value.lower()
-        if _val in ("beautifulsoup", "dehtml", "none"):
+        if _val in ("beautifulsoup", "justhtml", "dehtml", "none"):
             return _val
         else:
             raise OptionValueError(self, value, self.class_description)
@@ -1811,11 +1811,11 @@
         (HtmlToTextOption, "convert_htmltotext", "none",
             "If an email has only text/html parts, use this module\n"
             "to convert the html to text. Choose from beautifulsoup 4,\n"
-            "dehtml - (internal code), or none to disable conversion.\n"
-            "If 'none' is selected, email without a text/plain part\n"
-            "will be returned to the user with a message. If\n"
-            "beautifulsoup is selected but not installed dehtml will\n"
-            "be used instead."),
+            "justhtml, dehtml - (internal code), or none to disable\n"
+            "conversion. If 'none' is selected, email without a text/plain\n"
+            "part will be returned to the user with a message. If\n"
+            "beautifulsoup or justhtml is selected but not installed\n"
+            "dehtml will be used instead."),
         (BooleanOption, "keep_real_from", "no",
             "When handling emails ignore the Resent-From:-header\n"
             "and use the original senders From:-header instead.\n"

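The option text in this change describes two behaviors: the value is validated case-insensitively against a fixed set, and beautifulsoup or justhtml falls back to dehtml when the library is not installed. The following is a minimal sketch of that selection logic, not the code Roundup actually uses. The names `ALLOWED`, `MODULE_FOR`, `normalize_option` and `effective_converter` are hypothetical, and the import name `justhtml` is assumed to match the package name.

```python
import importlib.util

# Values accepted by the option after this change.
ALLOWED = ("beautifulsoup", "justhtml", "dehtml", "none")

# Hypothetical mapping from option value to importable module name.
# "bs4" is the import name of beautifulsoup 4; "justhtml" is an assumption.
MODULE_FOR = {"beautifulsoup": "bs4", "justhtml": "justhtml"}


def normalize_option(value):
    """Lower-case the value and reject anything outside ALLOWED."""
    val = value.lower()
    if val in ALLOWED:
        return val
    raise ValueError("Allowed values: " + ", ".join(ALLOWED))


def _installed(module_name):
    """Return True if a top-level module can be found without importing it."""
    return importlib.util.find_spec(module_name) is not None


def effective_converter(value, is_available=_installed):
    """Return the converter that would be used for a configured value.

    Optional libraries that are not installed fall back to "dehtml";
    "dehtml" and "none" need no third-party module and pass through.
    """
    val = normalize_option(value)
    module_name = MODULE_FOR.get(val)
    if module_name is not None and not is_available(module_name):
        return "dehtml"
    return val
```

Passing `is_available` explicitly makes the fallback easy to exercise without installing or removing either library, for example `effective_converter("justhtml", is_available=lambda name: False)`.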