annotate roundup/dehtml.py @ 8580:5cba36e42b8f

chore: refactor replace urlparse with urlsplit and use urllib_

Python docs recommend use of urlsplit() rather than urlparse(). urlsplit() is a little faster and doesn't try to split the path into path and params using the rules from an obsolete RFC.

actions.py, demo.py, rest.py, client.py: Replace urlparse() with urlsplit().

actions.py: urlsplit() produces a named tuple with one fewer element (no .params), so fix up calls to urlunparse() so they have the proper number of elements in the tuple. Also merge URL filtering for param and path.

demo.py, rest.py: Replace imports from urlparse/urllib.parse with roundup.anypy.urllib_ so we use the same interface throughout the code base.

test/test_cgi.py: Since actions.py filtering for invalid URLs is not split by path/param, fix tests for improperly quoted URLs.
author John Rouillard <rouilj@ieee.org>
date Sun, 19 Apr 2026 22:58:59 -0400
parents 9c3ec0a5c7fc
children
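The tuple-size difference described in the commit message can be seen with the standard library alone. The sketch below is not taken from the changeset: it assumes Python 3 and calls `urllib.parse` directly rather than `roundup.anypy.urllib_`, and the URL is a made-up example.

```python
from urllib.parse import urlparse, urlsplit, urlunsplit

# Hypothetical URL with a ";params" segment on the last path component.
url = "http://example.com/issue1;type=a?@template=item#top"

# urlparse(): 6 fields (scheme, netloc, path, params, query, fragment).
parsed = urlparse(url)
# urlsplit(): 5 fields (scheme, netloc, path, query, fragment).
split = urlsplit(url)

print(len(parsed), parsed.path, parsed.params)
print(len(split), split.path)

# The 5-field result of urlsplit() is reassembled with urlunsplit().
rebuilt = urlunsplit(split)
print(rebuilt == url)
```

With urlsplit() the `;type=a` part stays inside `path`, which is why any filtering that used to look at path and params separately has to look at a single field.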
rev   line source
7756
6079440ac023 chore(lint): doublequote strings, no yoda conditionals, sort imports...
John Rouillard <rouilj@ieee.org>
parents: 7228
  1
  2 import sys
  3
5417
c749d6795bc2 Python 3 preparation: unichr.
Joseph Myers <jsm@polyomino.org.uk>
parents: 5416
  4 from roundup.anypy.strings import u2s, uchr
5997
1700542408f3 flake8 cleanups dehtml.py
John Rouillard <rouilj@ieee.org>
parents: 5838
  5
8491
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
  6 # ruff PLC0415 ignore imports not at top of file
  7 # ruff RET505 ignore else after return
  8 # ruff: noqa: PLC0415 RET505
  9
6110
af81e7a4302f don't get confused by python-future making Python 3 package names available under Python 2 (but only with Python 2 functionality)
Christof Meerwald <cmeerw@cmeerw.org>
parents: 5997
 10 _pyver = sys.version_info[0]
5997
1700542408f3 flake8 cleanups dehtml.py
John Rouillard <rouilj@ieee.org>
parents: 5838
 11
7228
07ce4e4110f5 flake8 fixes: whitespace, remove unused imports
John Rouillard <rouilj@ieee.org>
parents: 6669
 12
5305
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
 13 class dehtml:
 14     def __init__(self, converter):
 15         if converter == "none":
 16             self.html2text = None
 17             return
 18
 19         try:
 20             if converter == "beautifulsoup":
 21                 # Not as well tested as dehtml.
 22                 from bs4 import BeautifulSoup
5997
1700542408f3 flake8 cleanups dehtml.py
John Rouillard <rouilj@ieee.org>
parents: 5838
 23
5305
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
 24                 def html2text(html):
6669
ef0975b4291b Explicitly set parser when calling beautiful soup.
John Rouillard <rouilj@ieee.org>
parents: 6110
 25                     soup = BeautifulSoup(html, "html.parser")
5305
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
 26
 27                     # kill all script and style elements
 28                     for script in soup(["script", "style"]):
 29                         script.extract()
 30
7756
6079440ac023 chore(lint): doublequote strings, no yoda conditionals, sort imports...
John Rouillard <rouilj@ieee.org>
parents: 7228
 31                     return u2s(soup.get_text("\n", strip=True))
5305
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
 32
 33                 self.html2text = html2text
8491
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
 34             elif converter == "justhtml":
 35                 from justhtml import stream
 36
 37                 def html2text(html):
 38                     # The below does not work.
 39                     # Using stream parser since I couldn't seem to strip
 40                     # 'script' and 'style' blocks. But stream doesn't
 41                     # have error reporting or stripping of text nodes
 42                     # and dropping empty nodes. Also I would like to try
 43                     # its GFM markdown output too even though it keeps
 44                     # tables as html and doesn't completely convert as
 45                     # this would work well for those supporting markdown.
 46                     #
 47                     # ctx used for testing since I have a truncated
 48                     # test doc. It eliminates error from missing DOCTYPE
 49                     # and head.
 50                     #
 51                     #from justhtml import JustHTML
 52                     # from justhtml.context import FragmentContext
 53                     #
 54                     #ctx = FragmentContext('html')
 55                     #justhtml = JustHTML(html,collect_errors=True,
 56                     #                    fragment_context=ctx)
 57                     # I still have the text output inside style/script tags
 58                     # with :not(style, script). I do get text contents
 59                     # with query("style, script").
 60                     #
 61                     #return u2s("\n".join(
 62                     #  [elem.to_text(separator="\n", strip=True)
 63                     #   for elem in justhtml.query(":not(style, script)")])
 64                     #  )
 65
 66                     # define inline elements so I can accumulate all unbroken
 67                     # text in a single line with embedded inline elements.
 68                     # 'br' is inline but should be treated as a line break
 69                     # and the elements before/after should not be accumulated
 70                     # together.
 71                     inline_elements = (
 72                         "a",
 73                         "address",
 74                         "b",
 75                         "cite",
 76                         "code",
 77                         "em",
 78                         "i",
 79                         "img",
 80                         "mark",
 81                         "q",
 82                         "s",
 83                         "small",
 84                         "span",
 85                         "strong",
 86                         "sub",
 87                         "sup",
 88                         "time")
 89
 90                     # each line is appended and joined at the end
 91                     text = []
 92                     # the accumulator for all text in inline elements
 93                     text_accumulator = ""
 94                     # if set skip all lines till matching end tag found
 95                     # used to skip script/style blocks
 96                     skip_till_endtag = None
 97                     # used to force text_accumulator into text with added
 98                     # newline so we have a blank line between paragraphs.
 99                     _need_parabreak = False
100
101                     for event, data in stream(html):
102                         if event == "end" and skip_till_endtag == data:
103                             skip_till_endtag = None
104                             continue
105                         if skip_till_endtag:
106                             continue
107                         if (event == "start" and
108                                 data[0] in ('script', 'style')):
109                             skip_till_endtag = data[0]
110                             continue
111                         if (event == "start" and
112                                 text_accumulator and
113                                 data[0] not in inline_elements):
114                             # add accumulator to "text"
115                             text.append(text_accumulator)
116                             text_accumulator = ""
117                             _need_parabreak = False
118                         elif event == "text":
119                             if not data.isspace():
120                                 text_accumulator = text_accumulator + data
121                                 _need_parabreak = True
122                         elif (_need_parabreak and
123                               event == "start" and
124                               data[0] == "p"):
125                             text.append(text_accumulator + "\n")
126                             text_accumulator = ""
127                             _need_parabreak = False
128
129                     # save anything left in the accumulator at end of document
130                     if text_accumulator:
131                         # add newline to match dehtml and beautifulsoup
132                         text.append(text_accumulator + "\n")
133                     return u2s("\n".join(text))
134
135                 self.html2text = html2text
5305
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
136             else:
5997
1700542408f3 flake8 cleanups dehtml.py
John Rouillard <rouilj@ieee.org>
parents: 5838
137                 raise ImportError
5305
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
138         except ImportError:
139             # use the fallback below if beautiful soup is not installed.
5411
9c6d98bf79db Python 3 preparation: update HTMLParser / htmlentitydefs imports.
Joseph Myers <jsm@polyomino.org.uk>
parents: 5376
140             try:
141                 # Python 3+.
7756
6079440ac023 chore(lint): doublequote strings, no yoda conditionals, sort imports...
John Rouillard <rouilj@ieee.org>
parents: 7228
142                 from html.entities import name2codepoint
5411
9c6d98bf79db Python 3 preparation: update HTMLParser / htmlentitydefs imports.
Joseph Myers <jsm@polyomino.org.uk>
parents: 5376
143                 from html.parser import HTMLParser
144             except ImportError:
145                 # Python 2.
7756
6079440ac023 chore(lint): doublequote strings, no yoda conditionals, sort imports...
John Rouillard <rouilj@ieee.org>
parents: 7228
146                 from htmlentitydefs import name2codepoint
5411
9c6d98bf79db Python 3 preparation: update HTMLParser / htmlentitydefs imports.
Joseph Myers <jsm@polyomino.org.uk>
parents: 5376
147                 from HTMLParser import HTMLParser
5305
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
148
149             class DumbHTMLParser(HTMLParser):
150                 # class attribute
5997
1700542408f3 flake8 cleanups dehtml.py
John Rouillard <rouilj@ieee.org>
parents: 5838
151                 text = ""
5305
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
152
153                 # internal state variable
154                 _skip_data = False
155                 _last_empty = False
156
157                 def handle_data(self, data):
5997
1700542408f3 flake8 cleanups dehtml.py
John Rouillard <rouilj@ieee.org>
parents: 5838
158                     if self._skip_data:  # skip data in script or style block
5305
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
159                         return
160
5997
1700542408f3 flake8 cleanups dehtml.py
John Rouillard <rouilj@ieee.org>
parents: 5838
161                     if (data.strip() == ""):
5305
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
162                         # reduce multiple blank lines to 1
5997
1700542408f3 flake8 cleanups dehtml.py
John Rouillard <rouilj@ieee.org>
parents: 5838
163                         if (self._last_empty):
5305
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
164                             return
165                         else:
166                             self._last_empty = True
167                     else:
168                         self._last_empty = False
169
5997
1700542408f3 flake8 cleanups dehtml.py
John Rouillard <rouilj@ieee.org>
parents: 5838
170                     self.text = self.text + data
5305
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
171
7833
b68a1d8fd5d9 chore(lint): use ternary, ignore unused param
John Rouillard <rouilj@ieee.org>
parents: 7756
172                 def handle_starttag(self, tag, attrs):  # noqa: ARG002
5997
1700542408f3 flake8 cleanups dehtml.py
John Rouillard <rouilj@ieee.org>
parents: 5838
173                     if (tag == "p"):
174                         self.text = self.text + "\n"
175                     if (tag in ("style", "script")):
5305
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
176                         self._skip_data = True
177
178                 def handle_endtag(self, tag):
5997
1700542408f3 flake8 cleanups dehtml.py
John Rouillard <rouilj@ieee.org>
parents: 5838
179                     if (tag in ("style", "script")):
5305
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
180                         self._skip_data = False
181
182                 def handle_entityref(self, name):
183                     if self._skip_data:
184                         return
5417
c749d6795bc2 Python 3 preparation: unichr.
Joseph Myers <jsm@polyomino.org.uk>
parents: 5416
185                     c = uchr(name2codepoint[name])
5305
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
186 try:
5997
1700542408f3 flake8 cleanups dehtml.py
John Rouillard <rouilj@ieee.org>
parents: 5838
diff changeset
187 self.text = self.text + c
5305
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
188 except UnicodeEncodeError:
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
189 # print a space as a placeholder
7756
6079440ac023 chore(lint): doublequote strings, no yoda conitionals, sort imports...
John Rouillard <rouilj@ieee.org>
parents: 7228
diff changeset
190 self.text = self.text + " "
5305
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
191
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
192 def html2text(html):
7833
b68a1d8fd5d9 chore(lint): use ternary, ignore unused param
John Rouillard <rouilj@ieee.org>
parents: 7756
diff changeset
193 parser = DumbHTMLParser(
b68a1d8fd5d9 chore(lint): use ternary, ignore unused param
John Rouillard <rouilj@ieee.org>
parents: 7756
diff changeset
194 convert_charrefs=True) if _pyver == 3 else DumbHTMLParser()
5305
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
195 parser.feed(html)
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
196 parser.close()
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
197 return parser.text
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
198
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
199 self.html2text = html2text
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
200
5997
1700542408f3 flake8 cleanups dehtml.py
John Rouillard <rouilj@ieee.org>
parents: 5838
diff changeset
201
7756
6079440ac023 chore(lint): doublequote strings, no yoda conitionals, sort imports...
John Rouillard <rouilj@ieee.org>
parents: 7228
diff changeset
202 if __name__ == "__main__":
8491
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
203 # ruff: noqa: B011 S101
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
204
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
205 try:
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
206 assert False
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
207 except AssertionError:
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
208 pass
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
209 else:
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
210 print("Error, assertions turned off. Test fails")
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
211 sys.exit(1)
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
212
7756
6079440ac023 chore(lint): doublequote strings, no yoda conitionals, sort imports...
John Rouillard <rouilj@ieee.org>
parents: 7228
diff changeset
213 html = """
5305
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
214 <body>
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
215 <script>
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
216 this must not be in output
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
217 </script>
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
218 <style>
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
219 p {display:block}
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
220 </style>
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
221 <div class="header"><h1>Roundup</h1>
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
222 <div id="searchbox" style="display: none">
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
223 <form class="search" action="../search.html" method="get">
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
224 <input type="text" name="q" size="18" />
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
225 <input type="submit" value="Search" />
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
226 <input type="hidden" name="check_keywords" value="yes" />
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
227 <input type="hidden" name="area" value="default" />
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
228 </form>
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
229 </div>
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
230 <script type="text/javascript">$('#searchbox').show(0);</script>
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
231 </div>
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
232 <ul class="current">
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
233 <li class="toctree-l1"><a class="reference internal" href="../index.html">Home</a></li>
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
234 <li class="toctree-l1"><a class="reference external" href="http://pypi.python.org/pypi/roundup">Download</a></li>
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
235 <li class="toctree-l1 current"><a class="reference internal" href="../docs.html">Docs</a><ul class="current">
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
236 <li class="toctree-l2"><a class="reference internal" href="features.html">Roundup Features</a></li>
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
237 <li class="toctree-l2 current"><a class="current reference internal" href="">Installing Roundup</a></li>
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
238 <li class="toctree-l2"><a class="reference internal" href="upgrading.html">Upgrading to newer versions of Roundup</a></li>
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
239 <li class="toctree-l2"><a class="reference internal" href="FAQ.html">Roundup FAQ</a></li>
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
240 <li class="toctree-l2"><a class="reference internal" href="user_guide.html">User Guide</a></li>
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
241 <li class="toctree-l2"><a class="reference internal" href="customizing.html">Customising Roundup</a></li>
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
242 <li class="toctree-l2"><a class="reference internal" href="admin_guide.html">Administration Guide</a></li>
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
243 </ul>
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
244 <div class="section" id="prerequisites">
8491
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
245 <H2><a class="toc-backref" href="#id5">Prerequisites</a></H2>
5305
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
246 <p>Roundup requires Python 2.5 or newer (but not Python 3) with a functioning
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
247 anydbm module. Download the latest version from <a class="reference external" href="http://www.python.org/">http://www.python.org/</a>.
8491
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
248 It is highly recommended that users install the <span>latest patch version</span>
5305
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
249 of python as these contain many fixes to serious bugs.</p>
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
250 <p>Some variants of Linux will need an additional &#8220;python dev&#8221; package
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
251 installed for Roundup installation to work. Debian and derivatives, are
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
252 known to require this.</p>
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
253 <p>If you&#8217;re on windows, you will either need to be using the ActiveState python
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
254 distribution (at <a class="reference external" href="http://www.activestate.com/Products/ActivePython/">http://www.activestate.com/Products/ActivePython/</a>), or you&#8217;ll
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
255 have to install the win32all package separately (get it from
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
256 <a class="reference external" href="http://starship.python.net/crew/mhammond/win32/">http://starship.python.net/crew/mhammond/win32/</a>).</p>
5838
b74f0b50bef1 Fix CI deprication warning for HTMLParser convert_charrefs under py3.
John Rouillard <rouilj@ieee.org>
parents: 5417
diff changeset
257 <script>
b74f0b50bef1 Fix CI deprication warning for HTMLParser convert_charrefs under py3.
John Rouillard <rouilj@ieee.org>
parents: 5417
diff changeset
258 &lt; HELP &GT;
b74f0b50bef1 Fix CI deprication warning for HTMLParser convert_charrefs under py3.
John Rouillard <rouilj@ieee.org>
parents: 5417
diff changeset
259 </script>
5305
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
260 </div>
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
261 </body>
7756
6079440ac023 chore(lint): doublequote strings, no yoda conitionals, sort imports...
John Rouillard <rouilj@ieee.org>
parents: 7228
diff changeset
262 """
5305
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
263
8491
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
264 if len(sys.argv) > 1:
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
265 with open(sys.argv[1]) as h:
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
266 html = h.read()
5305
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
267
8491
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
268 print("==== beautifulsoup")
5305
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
269 try:
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
270 # trap error seen if N_TOKENS not defined when run.
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
271 html2text = dehtml("beautifulsoup").html2text
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
272 if html2text:
8491
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
273 text = html2text(html)
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
274 assert ('HELP' not in text)
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
275 assert ('display:block' not in text)
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
276 print(text)
5305
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
277 except NameError as e:
5997
1700542408f3 flake8 cleanups dehtml.py
John Rouillard <rouilj@ieee.org>
parents: 5838
diff changeset
278 print("captured error %s" % e)
5305
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
279
8491
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
280 print("==== justhtml")
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
281 try:
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
282 html2text = dehtml("justhtml").html2text
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
283 if html2text:
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
284 text = html2text(html)
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
285 assert ('HELP' not in text)
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
286 assert ('display:block' not in text)
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
287 print(text)
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
288 except NameError as e:
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
289 print("captured error %s" % e)
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
290
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
291 print("==== dehtml")
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
292 html2text = dehtml("dehtml").html2text
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
293 if html2text:
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
294 text = html2text(html)
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
295 assert ('HELP' not in text)
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
296 assert ('display:block' not in text)
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
297 print(text)
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
298
520075b29474 feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents: 7833
diff changeset
299 print("==== disabled html -> text conversion")
5305
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
300 html2text = dehtml("none").html2text
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
301 if html2text:
5376
64b05e24dbd8 Python 3 preparation: convert print to a function.
Joseph Myers <jsm@polyomino.org.uk>
parents: 5305
diff changeset
302 print("FAIL: Error, dehtml(none) is returning a function")
5305
e20f472fde7d issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
303 else:
5376
64b05e24dbd8 Python 3 preparation: convert print to a function.
Joseph Myers <jsm@polyomino.org.uk>
parents: 5305
diff changeset
304 print("PASS: dehtml(none) is returning None")
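
The fallback parser annotated above skips everything inside script and style elements and starts a new line at each paragraph. A minimal standalone sketch of that technique follows, for Python 3 only. The names SkipScriptParser and html2text_sketch are hypothetical, and the handle_data method is an assumption, since the real one falls outside the lines shown here.

```python
from html.parser import HTMLParser


class SkipScriptParser(HTMLParser):
    """Hypothetical stand-in for the DumbHTMLParser shown above."""

    def __init__(self):
        # convert_charrefs=True lets the parser decode references such as
        # &amp; itself, so no handle_entityref() override is needed here.
        super().__init__(convert_charrefs=True)
        self.text = ""
        self._skip_data = False

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lowercased from HTMLParser.
        if tag == "p":
            self.text += "\n"
        if tag in ("style", "script"):
            self._skip_data = True

    def handle_endtag(self, tag):
        if tag in ("style", "script"):
            self._skip_data = False

    def handle_data(self, data):
        # Assumed behaviour: keep text unless inside <script> or <style>.
        if not self._skip_data:
            self.text += data


def html2text_sketch(html):
    """Feed the whole document and return the accumulated plain text."""
    parser = SkipScriptParser()
    parser.feed(html)
    parser.close()
    return parser.text
```

Calling html2text_sketch() on the test document embedded in the listing should satisfy the same two checks the self-test makes: neither 'HELP' nor 'display:block' appears in the result.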
