annotate roundup/dehtml.py @ 8527:d4a43d9da8ef

chore(build): build(deps): bump anchore/scan-action from 7.3.1 to 7.3.2 pull #82
author John Rouillard <rouilj@ieee.org>
date Mon, 23 Feb 2026 20:16:55 -0500
parents 520075b29474
children 9c3ec0a5c7fc
rev   line source
5305
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
1
5376
64b05e24dbd8 Python 3 preparation: convert print to a function.
Joseph Myers <jsm@polyomino.org.uk>
parents: 5305
diff changeset
2 from __future__ import print_function
7756
6079440ac023 chore(lint): doublequote strings, no yoda conitionals, sort imports...
John Rouillard <rouilj@ieee.org>
parents: 7228
diff changeset
3
6079440ac023 chore(lint): doublequote strings, no yoda conitionals, sort imports...
John Rouillard <rouilj@ieee.org>
parents: 7228
diff changeset
4 import sys
6079440ac023 chore(lint): doublequote strings, no yoda conitionals, sort imports...
John Rouillard <rouilj@ieee.org>
parents: 7228
diff changeset
5
5417
c749d6795bc2 Python 3 preparation: unichr.
Joseph Myers <jsm@polyomino.org.uk>
parents: 5416
diff changeset
6 from roundup.anypy.strings import u2s, uchr
5997
1700542408f3 flake8 cleanups dehtml.py
John Rouillard <rouilj@ieee.org>
parents: 5838
diff changeset
7
8491
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
8 # ruff PLC0415 ignore imports not at top of file
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
9 # ruff RET505 ignore else after return
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
10 # ruff: noqa: PLC0415 RET505
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
11
6110
af81e7a4302f don't get confused by python-future making Python 3 package names available under Python 2 (but only with Python 2 functionality)
Christof Meerwald <cmeerw@cmeerw.org>
parents: 5997
diff changeset
12 _pyver = sys.version_info[0]
5997
1700542408f3 flake8 cleanups dehtml.py
John Rouillard <rouilj@ieee.org>
parents: 5838
diff changeset
13
7228
07ce4e4110f5 flake8 fixes: whitespace, remove unused imports
John Rouillard <rouilj@ieee.org>
parents: 6669
diff changeset
14
5305
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
15 class dehtml:
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
16 def __init__(self, converter):
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
17 if converter == "none":
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
18 self.html2text = None
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
19 return
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
20
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
21 try:
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
22 if converter == "beautifulsoup":
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
23 # Not as well tested as dehtml.
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
24 from bs4 import BeautifulSoup
5997
1700542408f3 flake8 cleanups dehtml.py
John Rouillard <rouilj@ieee.org>
parents: 5838
diff changeset
25
5305
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
26 def html2text(html):
6669
ef0975b4291b Explicitly set parser when calling beautiful soup.
John Rouillard <rouilj@ieee.org>
parents: 6110
diff changeset
27 soup = BeautifulSoup(html, "html.parser")
5305
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
28
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
29 # kill all script and style elements
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
30 for script in soup(["script", "style"]):
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
31 script.extract()
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
32
7756
6079440ac023 chore(lint): doublequote strings, no yoda conitionals, sort imports...
John Rouillard <rouilj@ieee.org>
parents: 7228
diff changeset
33 return u2s(soup.get_text("\n", strip=True))
5305
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
34
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
35 self.html2text = html2text
8491
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
36 elif converter == "justhtml":
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
37 from justhtml import stream
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
38
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
39 def html2text(html):
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
40 # The below does not work.
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
41 # Using stream parser since I couldn't seem to strip
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
42 # 'script' and 'style' blocks. But stream doesn't
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
43 # have error reporting or stripping of text nodes
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
44 # and dropping empty nodes. Also I would like to try
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
45 # its GFM markdown output too even though it keeps
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
46 # tables as html and doesn't completely convert as
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
47 # this would work well for those supporting markdown.
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
48 #
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
49 # ctx used for testing since I have a truncated
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
50 # test doc. It eliminates error from missing DOCTYPE
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
51 # and head.
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
52 #
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
53 #from justhtml import JustHTML
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
54 # from justhtml.context import FragmentContext
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
55 #
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
56 #ctx = FragmentContext('html')
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
57 #justhtml = JustHTML(html,collect_errors=True,
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
58 # fragment_context=ctx)
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
59 # I still have the text output inside style/script tags
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
60 # with :not(style, script). I do get text contents
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
61 # with query("style, script").
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
62 #
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
63 #return u2s("\n".join(
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
64 # [elem.to_text(separator="\n", strip=True)
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
65 # for elem in justhtml.query(":not(style, script)")])
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
66 # )
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
67
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
68 # define inline elements so I can accumulate all unbroken
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
69 # text in a single line with embedded inline elements.
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
70 # 'br' is inline but should be treated as a line break
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
71 # and the elements before/after it should not be accumulated
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
72 # together.
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
73 inline_elements = (
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
74 "a",
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
75 "address",
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
76 "b",
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
77 "cite",
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
78 "code",
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
79 "em",
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
80 "i",
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
81 "img",
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
82 "mark",
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
83 "q",
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
84 "s",
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
85 "small",
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
86 "span",
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
87 "strong",
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
88 "sub",
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
89 "sup",
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
90 "time")
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
91
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
92 # each line is appended and joined at the end
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
93 text = []
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
94 # the accumulator for all text in inline elements
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
95 text_accumulator = ""
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
96 # if set skip all lines till matching end tag found
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
97 # used to skip script/style blocks
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
98 skip_till_endtag = None
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
99 # used to force text_accumulator into text with added
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
100 # newline so we have a blank line between paragraphs.
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
101 _need_parabreak = False
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
102
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
103 for event, data in stream(html):
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
104 if event == "end" and skip_till_endtag == data:
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
105 skip_till_endtag = None
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
106 continue
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
107 if skip_till_endtag:
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
108 continue
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
109 if (event == "start" and
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
110 data[0] in ('script', 'style')):
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
111 skip_till_endtag = data[0]
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
112 continue
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
113 if (event == "start" and
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
114 text_accumulator and
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
115 data[0] not in inline_elements):
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
116 # add accumulator to "text"
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
117 text.append(text_accumulator)
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
118 text_accumulator = ""
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
119 _need_parabreak = False
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
120 elif event == "text":
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
121 if not data.isspace():
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
122 text_accumulator = text_accumulator + data
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
123 _need_parabreak = True
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
124 elif (_need_parabreak and
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
125 event == "start" and
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
126 data[0] == "p"):
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
127 text.append(text_accumulator + "\n")
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
128 text_accumulator = ""
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
129 _need_parabreak = False
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
130
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
131 # save anything left in the accumulator at end of document
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
132 if text_accumulator:
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
133 # add newline to match dehtml and beautifulsoup
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
134 text.append(text_accumulator + "\n")
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
135 return u2s("\n".join(text))
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
136
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
137 self.html2text = html2text
5305
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
138 else:
5997
1700542408f3 flake8 cleanups dehtml.py
John Rouillard <rouilj@ieee.org>
parents: 5838
diff changeset
139 raise ImportError
5305
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
140 except ImportError:
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
141 # use the fallback below if the requested converter library is not installed.
5411
9c6d98bf79db Python 3 preparation: update HTMLParser / htmlentitydefs imports.
Joseph Myers <jsm@polyomino.org.uk>
parents: 5376
diff changeset
142 try:
9c6d98bf79db Python 3 preparation: update HTMLParser / htmlentitydefs imports.
Joseph Myers <jsm@polyomino.org.uk>
parents: 5376
diff changeset
143 # Python 3+.
7756
6079440ac023 chore(lint): doublequote strings, no yoda conitionals, sort imports...
John Rouillard <rouilj@ieee.org>
parents: 7228
diff changeset
144 from html.entities import name2codepoint
5411
9c6d98bf79db Python 3 preparation: update HTMLParser / htmlentitydefs imports.
Joseph Myers <jsm@polyomino.org.uk>
parents: 5376
diff changeset
145 from html.parser import HTMLParser
9c6d98bf79db Python 3 preparation: update HTMLParser / htmlentitydefs imports.
Joseph Myers <jsm@polyomino.org.uk>
parents: 5376
diff changeset
146 except ImportError:
9c6d98bf79db Python 3 preparation: update HTMLParser / htmlentitydefs imports.
Joseph Myers <jsm@polyomino.org.uk>
parents: 5376
diff changeset
147 # Python 2.
7756
6079440ac023 chore(lint): doublequote strings, no yoda conitionals, sort imports...
John Rouillard <rouilj@ieee.org>
parents: 7228
diff changeset
148 from htmlentitydefs import name2codepoint
5411
9c6d98bf79db Python 3 preparation: update HTMLParser / htmlentitydefs imports.
Joseph Myers <jsm@polyomino.org.uk>
parents: 5376
diff changeset
149 from HTMLParser import HTMLParser
5305
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
150
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
151 class DumbHTMLParser(HTMLParser):
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
152 # class attribute
5997
1700542408f3 flake8 cleanups dehtml.py
John Rouillard <rouilj@ieee.org>
parents: 5838
diff changeset
153 text = ""
5305
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
154
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
155 # internal state variable
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
156 _skip_data = False
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
157 _last_empty = False
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
158
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
159 def handle_data(self, data):
5997
1700542408f3 flake8 cleanups dehtml.py
John Rouillard <rouilj@ieee.org>
parents: 5838
diff changeset
160 if self._skip_data: # skip data in script or style block
5305
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
161 return
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
162
5997
1700542408f3 flake8 cleanups dehtml.py
John Rouillard <rouilj@ieee.org>
parents: 5838
diff changeset
163 if (data.strip() == ""):
5305
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
164 # reduce multiple blank lines to 1
5997
1700542408f3 flake8 cleanups dehtml.py
John Rouillard <rouilj@ieee.org>
parents: 5838
diff changeset
165 if (self._last_empty):
5305
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
166 return
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
167 else:
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
168 self._last_empty = True
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
169 else:
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
170 self._last_empty = False
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
171
5997
1700542408f3 flake8 cleanups dehtml.py
John Rouillard <rouilj@ieee.org>
parents: 5838
diff changeset
172 self.text = self.text + data
5305
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
173
7833
b68a1d8fd5d9 chore(lint): use ternary, ignore unused param
John Rouillard <rouilj@ieee.org>
parents: 7756
diff changeset
174 def handle_starttag(self, tag, attrs): # noqa: ARG002
5997
1700542408f3 flake8 cleanups dehtml.py
John Rouillard <rouilj@ieee.org>
parents: 5838
diff changeset
175 if (tag == "p"):
1700542408f3 flake8 cleanups dehtml.py
John Rouillard <rouilj@ieee.org>
parents: 5838
diff changeset
176 self.text = self.text + "\n"
1700542408f3 flake8 cleanups dehtml.py
John Rouillard <rouilj@ieee.org>
parents: 5838
diff changeset
177 if (tag in ("style", "script")):
5305
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
178 self._skip_data = True
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
179
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
180 def handle_endtag(self, tag):
5997
1700542408f3 flake8 cleanups dehtml.py
John Rouillard <rouilj@ieee.org>
parents: 5838
diff changeset
181 if (tag in ("style", "script")):
5305
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
182 self._skip_data = False
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
183
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
184 def handle_entityref(self, name):
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
185 if self._skip_data:
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
186 return
5417
c749d6795bc2 Python 3 preparation: unichr.
Joseph Myers <jsm@polyomino.org.uk>
parents: 5416
diff changeset
187 c = uchr(name2codepoint[name])
5305
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
188 try:
5997
1700542408f3 flake8 cleanups dehtml.py
John Rouillard <rouilj@ieee.org>
parents: 5838
diff changeset
189 self.text = self.text + c
5305
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
190 except UnicodeEncodeError:
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
191 # print a space as a placeholder
7756
6079440ac023 chore(lint): doublequote strings, no yoda conitionals, sort imports...
John Rouillard <rouilj@ieee.org>
parents: 7228
diff changeset
192 self.text = self.text + " "
5305
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
193
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
194 def html2text(html):
7833
b68a1d8fd5d9 chore(lint): use ternary, ignore unused param
John Rouillard <rouilj@ieee.org>
parents: 7756
diff changeset
195 parser = DumbHTMLParser(
b68a1d8fd5d9 chore(lint): use ternary, ignore unused param
John Rouillard <rouilj@ieee.org>
parents: 7756
diff changeset
196 convert_charrefs=True) if _pyver == 3 else DumbHTMLParser()
5305
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
197 parser.feed(html)
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
198 parser.close()
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
199 return parser.text
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
200
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
201 self.html2text = html2text
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
202
5997
1700542408f3 flake8 cleanups dehtml.py
John Rouillard <rouilj@ieee.org>
parents: 5838
diff changeset
203
7756
6079440ac023 chore(lint): doublequote strings, no yoda conitionals, sort imports...
John Rouillard <rouilj@ieee.org>
parents: 7228
diff changeset
204 if __name__ == "__main__":
8491
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
205 # ruff: noqa: B011 S101
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
206
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
207 try:
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
208 assert False
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
209 except AssertionError:
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
210 pass
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
211 else:
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
212 print("Error, assertions turned off. Test fails")
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
213 sys.exit(1)
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
214
7756
6079440ac023 chore(lint): doublequote strings, no yoda conitionals, sort imports...
John Rouillard <rouilj@ieee.org>
parents: 7228
diff changeset
215 html = """
5305
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
216 <body>
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
217 <script>
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
218 this must not be in output
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
219 </script>
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
220 <style>
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
221 p {display:block}
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
222 </style>
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
223 <div class="header"><h1>Roundup</h1>
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
224 <div id="searchbox" style="display: none">
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
225 <form class="search" action="../search.html" method="get">
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
226 <input type="text" name="q" size="18" />
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
227 <input type="submit" value="Search" />
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
228 <input type="hidden" name="check_keywords" value="yes" />
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
229 <input type="hidden" name="area" value="default" />
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
230 </form>
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
231 </div>
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
232 <script type="text/javascript">$('#searchbox').show(0);</script>
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
233 </div>
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
234 <ul class="current">
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
235 <li class="toctree-l1"><a class="reference internal" href="../index.html">Home</a></li>
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
236 <li class="toctree-l1"><a class="reference external" href="http://pypi.python.org/pypi/roundup">Download</a></li>
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
237 <li class="toctree-l1 current"><a class="reference internal" href="../docs.html">Docs</a><ul class="current">
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
238 <li class="toctree-l2"><a class="reference internal" href="features.html">Roundup Features</a></li>
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
239 <li class="toctree-l2 current"><a class="current reference internal" href="">Installing Roundup</a></li>
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
240 <li class="toctree-l2"><a class="reference internal" href="upgrading.html">Upgrading to newer versions of Roundup</a></li>
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
241 <li class="toctree-l2"><a class="reference internal" href="FAQ.html">Roundup FAQ</a></li>
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
242 <li class="toctree-l2"><a class="reference internal" href="user_guide.html">User Guide</a></li>
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
243 <li class="toctree-l2"><a class="reference internal" href="customizing.html">Customising Roundup</a></li>
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
244 <li class="toctree-l2"><a class="reference internal" href="admin_guide.html">Administration Guide</a></li>
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
245 </ul>
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
246 <div class="section" id="prerequisites">
8491
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
247 <H2><a class="toc-backref" href="#id5">Prerequisites</a></H2>
5305
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
248 <p>Roundup requires Python 2.5 or newer (but not Python 3) with a functioning
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
249 anydbm module. Download the latest version from <a class="reference external" href="http://www.python.org/">http://www.python.org/</a>.
8491
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
250 It is highly recommended that users install the <span>latest patch version</span>
5305
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
251 of python as these contain many fixes to serious bugs.</p>
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
252 <p>Some variants of Linux will need an additional &#8220;python dev&#8221; package
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
253 installed for Roundup installation to work. Debian and derivatives, are
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
254 known to require this.</p>
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
255 <p>If you&#8217;re on windows, you will either need to be using the ActiveState python
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
256 distribution (at <a class="reference external" href="http://www.activestate.com/Products/ActivePython/">http://www.activestate.com/Products/ActivePython/</a>), or you&#8217;ll
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
257 have to install the win32all package separately (get it from
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
258 <a class="reference external" href="http://starship.python.net/crew/mhammond/win32/">http://starship.python.net/crew/mhammond/win32/</a>).</p>
5838
b74f0b50bef1 Fix CI deprication warning for HTMLParser convert_charrefs under py3.
John Rouillard <rouilj@ieee.org>
parents: 5417
diff changeset
259 <script>
b74f0b50bef1 Fix CI deprication warning for HTMLParser convert_charrefs under py3.
John Rouillard <rouilj@ieee.org>
parents: 5417
diff changeset
260 &lt; HELP &GT;
b74f0b50bef1 Fix CI deprication warning for HTMLParser convert_charrefs under py3.
John Rouillard <rouilj@ieee.org>
parents: 5417
diff changeset
261 </script>
5305
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
262 </div>
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
263 </body>
7756
6079440ac023 chore(lint): doublequote strings, no yoda conitionals, sort imports...
John Rouillard <rouilj@ieee.org>
parents: 7228
diff changeset
264 """
5305
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
265
8491
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
266 if len(sys.argv) > 1:
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
267 with open(sys.argv[1]) as h:
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
268 html = h.read()
5305
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
269
8491
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
270 print("==== beautifulsoup")
5305
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
271 try:
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
272 # trap error seen if N_TOKENS not defined when run.
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
273 html2text = dehtml("beautifulsoup").html2text
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
274 if html2text:
8491
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
275 text = html2text(html)
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
276 assert ('HELP' not in text)
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
277 assert ('display:block' not in text)
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
278 print(text)
5305
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
279 except NameError as e:
5997
1700542408f3 flake8 cleanups dehtml.py
John Rouillard <rouilj@ieee.org>
parents: 5838
diff changeset
280 print("captured error %s" % e)
5305
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
281
8491
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
282 print("==== justhtml")
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
283 try:
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
284 html2text = dehtml("justhtml").html2text
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
285 if html2text:
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
286 text = html2text(html)
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
287 assert ('HELP' not in text)
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
288 assert ('display:block' not in text)
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
289 print(text)
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
290 except NameError as e:
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
291 print("captured error %s" % e)
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
292
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
293 print("==== dehtml")
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
294 html2text = dehtml("dehtml").html2text
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
295 if html2text:
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
296 text = html2text(html)
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
297 assert ('HELP' not in text)
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
298 assert ('display:block' not in text)
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
299 print(text)
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
300
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
301 print("==== disabled html -> text conversion")
5305
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
302 html2text = dehtml("none").html2text
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
303 if html2text:
5376
64b05e24dbd8 Python 3 preparation: convert print to a function.
Joseph Myers <jsm@polyomino.org.uk>
parents: 5305
diff changeset
304 print("FAIL: Error, dehtml(none) is returning a function")
5305
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
305 else:
5376
64b05e24dbd8 Python 3 preparation: convert print to a function.
Joseph Myers <jsm@polyomino.org.uk>
parents: 5305
diff changeset
306 print("PASS: dehtml(none) is returning None")
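The fallback parser annotated above drops everything between script and style tags by toggling a _skip_data flag in its start/end tag handlers. A minimal standalone sketch of that technique follows, assuming Python 3 and the standard library html.parser module only. The names SkipScriptStyleParser and sketch_html2text are hypothetical and used for illustration; this is not the code in dehtml.py and it does not reproduce its whitespace handling.

```python
from html.parser import HTMLParser


class SkipScriptStyleParser(HTMLParser):
    """Hypothetical stand-in: collect text that lies outside script/style."""

    SKIPPED = ("script", "style")

    def __init__(self):
        # convert_charrefs=True lets the parser decode references such as
        # &lt; or &#8220; before handle_data is called (Python 3 only).
        HTMLParser.__init__(self, convert_charrefs=True)
        self.text = ""
        self._skip_data = False

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lowercased, so <SCRIPT> matches as well.
        if tag in self.SKIPPED:
            self._skip_data = True

    def handle_endtag(self, tag):
        if tag in self.SKIPPED:
            self._skip_data = False

    def handle_data(self, data):
        # Append data only while we are outside a skipped element.
        if not self._skip_data:
            self.text += data


def sketch_html2text(html):
    """Feed html through the sketch parser and return the collected text."""
    parser = SkipScriptStyleParser()
    parser.feed(html)
    parser.close()
    return parser.text


sample = (
    "<p>Hello &lt;world&gt;</p>"
    "<script>&lt; HELP &GT;</script>"
    "<style>p {display:block}</style>"
)
result = sketch_html2text(sample)
print(result)
```

A single boolean is enough here because script and style elements cannot nest. The same two assertions used in the self-test above ('HELP' and 'display:block' absent from the output) can be applied to result.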
