annotate roundup/dehtml.py @ 8491:520075b29474

feat: support justhtml parsing library to convert email to plain text

justhtml is a pure Python, fast, HTML5-compliant parser. It is now an option for converting HTML-only emails to plain text. Its output format differs slightly from dehtml or beautifulsoup, mostly by removing extra blank lines.

dehtml.py: Uses the stream parser of justhtml, because I was unable to get the full document parser to strip script and style blocks. If I can fix this and use the standard parser, I can in theory generate markdown from the DOM tree generated by justhtml. Updated the test case to include inline elements that should not cause a line break when they are encountered. Running dehtml as `python roundup/dehtml.py foo.html` will load foo.html and parse it using all available parsers.

configuration.py: justhtml is available as an option.

docs: updated CHANGES.txt and doc/tracker_config.txt; added beautifulsoup and justhtml to the optional software section of doc/installation.txt.

test_mailgw.py, .github/workflows/ci-test: Updated tests and install justhtml as part of CI.
author John Rouillard <rouilj@ieee.org>
date Sun, 14 Dec 2025 22:40:46 -0500
parents b68a1d8fd5d9
children 9c3ec0a5c7fc
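
The stream-based approach described in the commit message can be sketched independently of justhtml. The following is a simplified illustration, not the code in this changeset: it assumes events arrive as `(event, data)` pairs where a `"start"` event carries the tag name in `data[0]`, an `"end"` event carries the tag name as `data`, and a `"text"` event carries a string. That shape is inferred from how the annotated source uses `stream(html)`. The function name `events_to_text`, the hand-built `events` list, and the empty attribute dicts are hypothetical. The blank line between paragraphs is omitted for brevity.

```python
# Inline elements: text inside them stays on the current line.
INLINE_ELEMENTS = frozenset((
    "a", "address", "b", "cite", "code", "em", "i", "img", "mark", "q",
    "s", "small", "span", "strong", "sub", "sup", "time"))


def events_to_text(events):
    """Collapse (event, data) pairs into newline-separated text.

    Assumed event shape (inferred from the annotated source):
      ("start", (tagname, ...)), ("end", tagname), ("text", string)
    """
    lines = []          # finished lines, joined at the end
    accumulator = ""    # text gathered across inline elements
    skip_until = None   # tag whose end event stops skipping

    for event, data in events:
        if event == "end" and data == skip_until:
            skip_until = None
            continue
        if skip_until:
            continue    # inside a script/style block
        if event == "start" and data[0] in ("script", "style"):
            skip_until = data[0]
            continue
        if (event == "start" and accumulator
                and data[0] not in INLINE_ELEMENTS):
            # a non-inline element ends the current line
            lines.append(accumulator)
            accumulator = ""
        elif event == "text" and not data.isspace():
            accumulator += data

    if accumulator:
        # trailing newline, as the annotated code does at end of document
        lines.append(accumulator + "\n")
    return "\n".join(lines)


# Hand-built events standing in for a parser's output (hypothetical data).
events = [
    ("start", ("p", {})), ("text", "Hello "),
    ("start", ("b", {})), ("text", "bold"), ("end", "b"),
    ("text", " world"), ("end", "p"),
    ("start", ("style", {})), ("text", "p { color: red }"), ("end", "style"),
    ("start", ("p", {})), ("text", "Second"), ("end", "p"),
]
print(events_to_text(events))
```

Keeping the accumulator separate from the list of finished lines is what lets text inside `b`, `span` and similar elements stay on one line, while any non-inline start tag flushes the line.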
rev   line source
5305
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
1
5376
64b05e24dbd8 Python 3 preparation: convert print to a function.
Joseph Myers <jsm@polyomino.org.uk>
parents: 5305
diff changeset
2 from __future__ import print_function
7756
6079440ac023 chore(lint): doublequote strings, no yoda conitionals, sort imports...
John Rouillard <rouilj@ieee.org>
parents: 7228
diff changeset
3
6079440ac023 chore(lint): doublequote strings, no yoda conitionals, sort imports...
John Rouillard <rouilj@ieee.org>
parents: 7228
diff changeset
4 import sys
6079440ac023 chore(lint): doublequote strings, no yoda conitionals, sort imports...
John Rouillard <rouilj@ieee.org>
parents: 7228
diff changeset
5
5417
c749d6795bc2 Python 3 preparation: unichr.
Joseph Myers <jsm@polyomino.org.uk>
parents: 5416
diff changeset
6 from roundup.anypy.strings import u2s, uchr
5997
1700542408f3 flake8 cleanups dehtml.py
John Rouillard <rouilj@ieee.org>
parents: 5838
diff changeset
7
8491
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
8 # ruff PLC0415 ignore imports not at top of file
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
9 # ruff RET505 ignore else after return
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
10 # ruff: noqa: PLC0415 RET505
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
11
6110
af81e7a4302f don't get confused by python-future making Python 3 package names available under Python 2 (but only with Python 2 functionality)
Christof Meerwald <cmeerw@cmeerw.org>
parents: 5997
diff changeset
12 _pyver = sys.version_info[0]
5997
1700542408f3 flake8 cleanups dehtml.py
John Rouillard <rouilj@ieee.org>
parents: 5838
diff changeset
13
7228
07ce4e4110f5 flake8 fixes: whitespace, remove unused imports
John Rouillard <rouilj@ieee.org>
parents: 6669
diff changeset
14
5305
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
15 class dehtml:
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
16 def __init__(self, converter):
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
17 if converter == "none":
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
18 self.html2text = None
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
19 return
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
20
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
21 try:
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
22 if converter == "beautifulsoup":
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
23 # Not as well tested as dehtml.
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
24 from bs4 import BeautifulSoup
5997
1700542408f3 flake8 cleanups dehtml.py
John Rouillard <rouilj@ieee.org>
parents: 5838
diff changeset
25
5305
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
26 def html2text(html):
6669
ef0975b4291b Explicitly set parser when calling beautiful soup.
John Rouillard <rouilj@ieee.org>
parents: 6110
diff changeset
27 soup = BeautifulSoup(html, "html.parser")
5305
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
28
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
29 # kill all script and style elements
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
30 for script in soup(["script", "style"]):
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
31 script.extract()
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
32
7756
6079440ac023 chore(lint): doublequote strings, no yoda conitionals, sort imports...
John Rouillard <rouilj@ieee.org>
parents: 7228
diff changeset
33 return u2s(soup.get_text("\n", strip=True))
5305
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
34
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
35 self.html2text = html2text
8491
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
36 elif converter == "justhtml":
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
37 from justhtml import stream
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
38
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
39 def html2text(html):
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
40 # The below does not work.
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
41 # Using stream parser since I couldn't seem to strip
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
42 # 'script' and 'style' blocks. But stream doesn't
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
43 # have error reporting or stripping of text nodes
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
44 # and dropping empty nodes. Also I would like to try
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
45 # its GFM markdown output too even though it keeps
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
46 # tables as html and doesn't completely convert as
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
47 # this would work well for those supporting markdown.
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
48 #
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
49 # ctx used for testing since I have a truncated
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
50 # test doc. It eliminates error from missing DOCTYPE
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
51 # and head.
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
52 #
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
53 #from justhtml import JustHTML
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
54 # from justhtml.context import FragmentContext
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
55 #
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
56 #ctx = FragmentContext('html')
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
57 #justhtml = JustHTML(html,collect_errors=True,
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
58 # fragment_context=ctx)
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
59 # I still have the text output inside style/script tags.
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
60 # with :not(style, script). I do get text contents
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
61 # with query("style, script").
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
62 #
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
63 #return u2s("\n".join(
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
64 # [elem.to_text(separator="\n", strip=True)
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
65 # for elem in justhtml.query(":not(style, script)")])
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
66 # )
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
67
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
68 # define inline elements so I can accumulate all unbroken
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
69 # text in a single line with embedded inline elements.
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
70 # 'br' is inline but should be treated as a line break
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
71 # and the elements before/after it should not be accumulated
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
72 # together.
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
73 inline_elements = (
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
74 "a",
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
75 "address",
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
76 "b",
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
77 "cite",
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
78 "code",
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
79 "em",
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
80 "i",
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
81 "img",
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
82 "mark",
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
83 "q",
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
84 "s",
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
85 "small",
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
86 "span",
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
87 "strong",
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
88 "sub",
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
89 "sup",
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
90 "time")
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
91
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
92 # each line is appended and joined at the end
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
93 text = []
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
94 # the accumulator for all text in inline elements
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
95 text_accumulator = ""
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
96 # if set skip all lines till matching end tag found
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
97 # used to skip script/style blocks
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
98 skip_till_endtag = None
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
99 # used to force text_accumulator into text with added
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
100 # newline so we have a blank line between paragraphs.
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
101 _need_parabreak = False
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
102
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
103 for event, data in stream(html):
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
104 if event == "end" and skip_till_endtag == data:
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
105 skip_till_endtag = None
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
106 continue
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
107 if skip_till_endtag:
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
108 continue
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
109 if (event == "start" and
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
110 data[0] in ('script', 'style')):
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
111 skip_till_endtag = data[0]
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
112 continue
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
113 if (event == "start" and
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
114 text_accumulator and
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
115 data[0] not in inline_elements):
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
116 # add accumulator to "text"
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
117 text.append(text_accumulator)
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
118 text_accumulator = ""
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
119 _need_parabreak = False
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
120 elif event == "text":
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
121 if not data.isspace():
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
122 text_accumulator = text_accumulator + data
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
123 _need_parabreak = True
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
124 elif (_need_parabreak and
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
125 event == "start" and
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
126 data[0] == "p"):
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
127 text.append(text_accumulator + "\n")
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
128 text_accumulator = ""
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
129 _need_parabreak = False
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
130
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
131 # save anything left in the accumulator at end of document
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
132 if text_accumulator:
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
133 # add newline to match dehtml and beautifulsoup
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
134 text.append(text_accumulator + "\n")
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
135 return u2s("\n".join(text))
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
136
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
137 self.html2text = html2text
5305
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
138 else:
5997
1700542408f3 flake8 cleanups dehtml.py
John Rouillard <rouilj@ieee.org>
parents: 5838
diff changeset
139 raise ImportError
5305
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
140 except ImportError:
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
141 # use the fallback below if beautifulsoup or justhtml is not installed, or if another converter was requested.
5411
9c6d98bf79db Python 3 preparation: update HTMLParser / htmlentitydefs imports.
Joseph Myers <jsm@polyomino.org.uk>
parents: 5376
diff changeset
142 try:
9c6d98bf79db Python 3 preparation: update HTMLParser / htmlentitydefs imports.
Joseph Myers <jsm@polyomino.org.uk>
parents: 5376
diff changeset
143 # Python 3+.
7756
6079440ac023 chore(lint): doublequote strings, no yoda conitionals, sort imports...
John Rouillard <rouilj@ieee.org>
parents: 7228
diff changeset
144 from html.entities import name2codepoint
5411
9c6d98bf79db Python 3 preparation: update HTMLParser / htmlentitydefs imports.
Joseph Myers <jsm@polyomino.org.uk>
parents: 5376
diff changeset
145 from html.parser import HTMLParser
9c6d98bf79db Python 3 preparation: update HTMLParser / htmlentitydefs imports.
Joseph Myers <jsm@polyomino.org.uk>
parents: 5376
diff changeset
146 except ImportError:
9c6d98bf79db Python 3 preparation: update HTMLParser / htmlentitydefs imports.
Joseph Myers <jsm@polyomino.org.uk>
parents: 5376
diff changeset
147 # Python 2.
7756
6079440ac023 chore(lint): doublequote strings, no yoda conitionals, sort imports...
John Rouillard <rouilj@ieee.org>
parents: 7228
diff changeset
148 from htmlentitydefs import name2codepoint
5411
9c6d98bf79db Python 3 preparation: update HTMLParser / htmlentitydefs imports.
Joseph Myers <jsm@polyomino.org.uk>
parents: 5376
diff changeset
149 from HTMLParser import HTMLParser
5305
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
150
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
151 class DumbHTMLParser(HTMLParser):
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
152 # class attribute
5997
1700542408f3 flake8 cleanups dehtml.py
John Rouillard <rouilj@ieee.org>
parents: 5838
diff changeset
153 text = ""
5305
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
154
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
155 # internal state variable
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
156 _skip_data = False
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
157 _last_empty = False
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
158
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
159 def handle_data(self, data):
5997
1700542408f3 flake8 cleanups dehtml.py
John Rouillard <rouilj@ieee.org>
parents: 5838
diff changeset
160 if self._skip_data: # skip data in script or style block
5305
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
161 return
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
162
5997
1700542408f3 flake8 cleanups dehtml.py
John Rouillard <rouilj@ieee.org>
parents: 5838
diff changeset
163 if (data.strip() == ""):
5305
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
164 # reduce multiple blank lines to 1
5997
1700542408f3 flake8 cleanups dehtml.py
John Rouillard <rouilj@ieee.org>
parents: 5838
diff changeset
165 if (self._last_empty):
5305
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
166 return
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
167 else:
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
168 self._last_empty = True
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
169 else:
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
170 self._last_empty = False
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
171
5997
1700542408f3 flake8 cleanups dehtml.py
John Rouillard <rouilj@ieee.org>
parents: 5838
diff changeset
172 self.text = self.text + data
5305
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
173
7833
b68a1d8fd5d9 chore(lint): use ternary, ignore unused param
John Rouillard <rouilj@ieee.org>
parents: 7756
diff changeset
174 def handle_starttag(self, tag, attrs): # noqa: ARG002
5997
1700542408f3 flake8 cleanups dehtml.py
John Rouillard <rouilj@ieee.org>
parents: 5838
diff changeset
175 if (tag == "p"):
1700542408f3 flake8 cleanups dehtml.py
John Rouillard <rouilj@ieee.org>
parents: 5838
diff changeset
176 self.text = self.text + "\n"
1700542408f3 flake8 cleanups dehtml.py
John Rouillard <rouilj@ieee.org>
parents: 5838
diff changeset
177 if (tag in ("style", "script")):
5305
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
178 self._skip_data = True
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
179
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
180 def handle_endtag(self, tag):
5997
1700542408f3 flake8 cleanups dehtml.py
John Rouillard <rouilj@ieee.org>
parents: 5838
diff changeset
181 if (tag in ("style", "script")):
5305
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
182 self._skip_data = False
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
183
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
184 def handle_entityref(self, name):
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
185 if self._skip_data:
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
186 return
5417
c749d6795bc2 Python 3 preparation: unichr.
Joseph Myers <jsm@polyomino.org.uk>
parents: 5416
diff changeset
187 c = uchr(name2codepoint[name])
5305
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
188 try:
5997
1700542408f3 flake8 cleanups dehtml.py
John Rouillard <rouilj@ieee.org>
parents: 5838
diff changeset
189 self.text = self.text + c
5305
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
190 except UnicodeEncodeError:
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
191 # append a space as a placeholder
7756
6079440ac023 chore(lint): doublequote strings, no yoda conitionals, sort imports...
John Rouillard <rouilj@ieee.org>
parents: 7228
diff changeset
192 self.text = self.text + " "
5305
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
193
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
194 def html2text(html):
7833
b68a1d8fd5d9 chore(lint): use ternary, ignore unused param
John Rouillard <rouilj@ieee.org>
parents: 7756
diff changeset
195 parser = DumbHTMLParser(
b68a1d8fd5d9 chore(lint): use ternary, ignore unused param
John Rouillard <rouilj@ieee.org>
parents: 7756
diff changeset
196 convert_charrefs=True) if _pyver == 3 else DumbHTMLParser()
5305
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
197 parser.feed(html)
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
198 parser.close()
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
199 return parser.text
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
200
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
201 self.html2text = html2text
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
202
5997
1700542408f3 flake8 cleanups dehtml.py
John Rouillard <rouilj@ieee.org>
parents: 5838
diff changeset
203
7756
6079440ac023 chore(lint): doublequote strings, no yoda conitionals, sort imports...
John Rouillard <rouilj@ieee.org>
parents: 7228
diff changeset
204 if __name__ == "__main__":
8491
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
205 # ruff: noqa: B011 S101
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
206
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
207 try:
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
208 assert False
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
209 except AssertionError:
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
210 pass
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
211 else:
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
212 print("Error, assertions turned off. Test fails")
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
213 sys.exit(1)
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
214
7756
6079440ac023 chore(lint): doublequote strings, no yoda conitionals, sort imports...
John Rouillard <rouilj@ieee.org>
parents: 7228
diff changeset
215 html = """
5305
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
216 <body>
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
217 <script>
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
218 this must not be in output
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
219 </script>
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
220 <style>
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
221 p {display:block}
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
222 </style>
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
223 <div class="header"><h1>Roundup</h1>
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
224 <div id="searchbox" style="display: none">
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
225 <form class="search" action="../search.html" method="get">
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
226 <input type="text" name="q" size="18" />
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
227 <input type="submit" value="Search" />
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
228 <input type="hidden" name="check_keywords" value="yes" />
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
229 <input type="hidden" name="area" value="default" />
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
230 </form>
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
231 </div>
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
232 <script type="text/javascript">$('#searchbox').show(0);</script>
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
233 </div>
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
234 <ul class="current">
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
235 <li class="toctree-l1"><a class="reference internal" href="../index.html">Home</a></li>
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
236 <li class="toctree-l1"><a class="reference external" href="http://pypi.python.org/pypi/roundup">Download</a></li>
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
237 <li class="toctree-l1 current"><a class="reference internal" href="../docs.html">Docs</a><ul class="current">
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
238 <li class="toctree-l2"><a class="reference internal" href="features.html">Roundup Features</a></li>
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
239 <li class="toctree-l2 current"><a class="current reference internal" href="">Installing Roundup</a></li>
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
240 <li class="toctree-l2"><a class="reference internal" href="upgrading.html">Upgrading to newer versions of Roundup</a></li>
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
241 <li class="toctree-l2"><a class="reference internal" href="FAQ.html">Roundup FAQ</a></li>
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
242 <li class="toctree-l2"><a class="reference internal" href="user_guide.html">User Guide</a></li>
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
243 <li class="toctree-l2"><a class="reference internal" href="customizing.html">Customising Roundup</a></li>
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
244 <li class="toctree-l2"><a class="reference internal" href="admin_guide.html">Administration Guide</a></li>
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
245 </ul>
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
246 <div class="section" id="prerequisites">
8491
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
247 <H2><a class="toc-backref" href="#id5">Prerequisites</a></H2>
5305
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
248 <p>Roundup requires Python 2.5 or newer (but not Python 3) with a functioning
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
249 anydbm module. Download the latest version from <a class="reference external" href="http://www.python.org/">http://www.python.org/</a>.
8491
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
250 It is highly recommended that users install the <span>latest patch version</span>
5305
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
251 of python as these contain many fixes to serious bugs.</p>
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
252 <p>Some variants of Linux will need an additional &#8220;python dev&#8221; package
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
253 installed for Roundup installation to work. Debian and derivatives, are
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
254 known to require this.</p>
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
255 <p>If you&#8217;re on windows, you will either need to be using the ActiveState python
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
256 distribution (at <a class="reference external" href="http://www.activestate.com/Products/ActivePython/">http://www.activestate.com/Products/ActivePython/</a>), or you&#8217;ll
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
257 have to install the win32all package separately (get it from
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
258 <a class="reference external" href="http://starship.python.net/crew/mhammond/win32/">http://starship.python.net/crew/mhammond/win32/</a>).</p>
5838
b74f0b50bef1 Fix CI deprication warning for HTMLParser convert_charrefs under py3.
John Rouillard <rouilj@ieee.org>
parents: 5417
diff changeset
259 <script>
b74f0b50bef1 Fix CI deprication warning for HTMLParser convert_charrefs under py3.
John Rouillard <rouilj@ieee.org>
parents: 5417
diff changeset
260 &lt; HELP &GT;
b74f0b50bef1 Fix CI deprication warning for HTMLParser convert_charrefs under py3.
John Rouillard <rouilj@ieee.org>
parents: 5417
diff changeset
261 </script>
5305
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
262 </div>
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
263 </body>
7756
6079440ac023 chore(lint): doublequote strings, no yoda conitionals, sort imports...
John Rouillard <rouilj@ieee.org>
parents: 7228
diff changeset
264 """
5305
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
265
8491
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
266 if len(sys.argv) > 1:
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
267 with open(sys.argv[1]) as h:
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
268 html = h.read()
5305
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
269
8491
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
270 print("==== beautifulsoup")
5305
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
271 try:
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
272 # trap error seen if N_TOKENS not defined when run.
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
273 html2text = dehtml("beautifulsoup").html2text
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
274 if html2text:
8491
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
275 text = html2text(html)
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
276 assert ('HELP' not in text)
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
277 assert ('display:block' not in text)
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
278 print(text)
5305
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
279 except NameError as e:
5997
1700542408f3 flake8 cleanups dehtml.py
John Rouillard <rouilj@ieee.org>
parents: 5838
diff changeset
280 print("captured error %s" % e)
5305
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
281
8491
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
282 print("==== justhtml")
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
283 try:
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
284 html2text = dehtml("justhtml").html2text
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
285 if html2text:
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
286 text = html2text(html)
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
287 assert ('HELP' not in text)
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
288 assert ('display:block' not in text)
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
289 print(text)
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
290 except NameError as e:
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
291 print("captured error %s" % e)
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
292
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
293 print("==== dehtml")
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
294 html2text = dehtml("dehtml").html2text
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
295 if html2text:
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
296 text = html2text(html)
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
297 assert ('HELP' not in text)
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
298 assert ('display:block' not in text)
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
299 print(text)
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
300
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
301 print("==== disabled html -> text conversion")
5305
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
302 html2text = dehtml("none").html2text
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
303 if html2text:
5376
64b05e24dbd8 Python 3 preparation: convert print to a function.
Joseph Myers <jsm@polyomino.org.uk>
parents: 5305
diff changeset
304 print("FAIL: Error, dehtml(none) is returning a function")
5305
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
305 else:
5376
64b05e24dbd8 Python 3 preparation: convert print to a function.
Joseph Myers <jsm@polyomino.org.uk>
parents: 5305
diff changeset
306 print("PASS: dehtml(none) is returning None")
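
The tag handlers in the listing above (source lines 172 to 199) can be tried on their own. What follows is a minimal Python 3 sketch under stated assumptions, not the code in dehtml.py: the names `SketchHTMLToText` and `sketch_html2text` are hypothetical, `handle_data` is reduced to an assumed skip check plus the append on source line 172, and the Python 2 branch and the `handle_entityref` fallback are left out because `convert_charrefs=True` lets the standard-library parser decode entities itself.

```python
from html.parser import HTMLParser


class SketchHTMLToText(HTMLParser):
    """Hypothetical stand-in for the DumbHTMLParser in the listing."""

    def __init__(self):
        # convert_charrefs=True: entities such as &amp; and &#8217; are
        # decoded before handle_data() is called (outside script/style).
        super().__init__(convert_charrefs=True)
        self.text = ""
        self._skip_data = False

    def handle_starttag(self, tag, attrs):
        # A paragraph starts on a new line.
        if tag == "p":
            self.text += "\n"
        # Everything inside style/script is dropped until the end tag.
        if tag in ("style", "script"):
            self._skip_data = True

    def handle_endtag(self, tag):
        if tag in ("style", "script"):
            self._skip_data = False

    def handle_data(self, data):
        # Assumption: the real handle_data may do more than this
        # (for example whitespace handling); only the append is mirrored.
        if self._skip_data:
            return
        self.text += data


def sketch_html2text(html):
    """Feed the whole document, flush buffered text, return the result."""
    parser = SketchHTMLToText()
    parser.feed(html)
    parser.close()
    return parser.text


if __name__ == "__main__":
    sample = (
        "<body><script>&lt; HELP &GT;</script>"
        "<style>p {display:block}</style>"
        "<p>You&#8217;re here &amp; ready</p></body>"
    )
    print(sketch_html2text(sample))
```

The sample mirrors the two checks made by the self-test in the listing: text inside `<script>` and `<style>` must not reach the output, while character references in ordinary paragraphs are decoded.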

Roundup Issue Tracker: http://roundup-tracker.org/