annotate roundup/dehtml.py @ 8580:5cba36e42b8f
chore: refactor replace urlparse with urlsplit and use urllib_
Python docs recommend use of urlsplit() rather than
urlparse(). urlsplit() is a little faster and doesn't try to split the
path into path and params using the rules from an obsolete RFC.
actions.py, demo.py, rest.py, client.py
Replace urlparse() with urlsplit()
actions.py
urlsplit() produces a named tuple with one fewer element (no
.params). So fix up calls to urlunparse() so they have the proper
number of elements in the tuple.
Also merge URL filtering for param and path.
demo.py, rest.py:
Replace imports from urlparse/urllib.parse with
roundup.anypy.urllib_ so we use the same interface throughout the
code base.
test/test_cgi.py:
Since actions.py no longer splits its filtering of invalid URLs by path/param,
fix tests for improperly quoted URLs.
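A minimal sketch of the tuple-size difference described above. It is not code from this changeset: it calls the standard library's urllib.parse directly on Python 3 (rather than roundup.anypy.urllib_), and the example URL is made up for illustration.

```python
from urllib.parse import urlparse, urlsplit, urlunparse, urlunsplit

# Hypothetical URL with a ";params" section on the last path segment.
url = "http://example.com/issue;type=a?x=1#frag"

# urlparse() returns 6 fields:
#   scheme, netloc, path, params, query, fragment
parsed = urlparse(url)

# urlsplit() returns 5 fields and leaves ";type=a" inside the path:
#   scheme, netloc, path, query, fragment
split = urlsplit(url)

print(len(parsed), parsed.path, parsed.params)
print(len(split), split.path)

# The natural inverse of urlsplit() is urlunsplit(), which takes 5 fields.
print(urlunsplit(split))

# urlunparse() still needs 6 fields, so a result from urlsplit() has to be
# padded with an empty params entry before it can be passed in.
rebuilt = urlunparse(
    (split.scheme, split.netloc, split.path, "", split.query, split.fragment)
)
print(rebuilt)
```

Because the path is no longer divided into path and params, any check that used to look at the two pieces separately has to inspect the single path string instead, which is presumably why the filtering was merged.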
| author | John Rouillard <rouilj@ieee.org> |
|---|---|
| date | Sun, 19 Apr 2026 22:58:59 -0400 |
| parents | 9c3ec0a5c7fc |
| children |
| rev | line source |
|---|---|
|
7756
6079440ac023
chore(lint): doublequote strings, no yoda conditionals, sort imports...
John Rouillard <rouilj@ieee.org>
parents:
7228
diff
changeset
|
1 |
|
6079440ac023
chore(lint): doublequote strings, no yoda conditionals, sort imports...
John Rouillard <rouilj@ieee.org>
parents:
7228
diff
changeset
|
2 import sys |
|
6079440ac023
chore(lint): doublequote strings, no yoda conditionals, sort imports...
John Rouillard <rouilj@ieee.org>
parents:
7228
diff
changeset
|
3 |
|
5417
c749d6795bc2
Python 3 preparation: unichr.
Joseph Myers <jsm@polyomino.org.uk>
parents:
5416
diff
changeset
|
4 from roundup.anypy.strings import u2s, uchr |
|
5997
1700542408f3
flake8 cleanups dehtml.py
John Rouillard <rouilj@ieee.org>
parents:
5838
diff
changeset
|
5 |
|
8491
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
6 # ruff PLC0415 ignore imports not at top of file |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
7 # ruff RET505 ignore else after return |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
8 # ruff: noqa: PLC0415 RET505 |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
9 |
|
6110
af81e7a4302f
don't get confused by python-future making Python 3 package names available under Python 2 (but only with Python 2 functionality)
Christof Meerwald <cmeerw@cmeerw.org>
parents:
5997
diff
changeset
|
10 _pyver = sys.version_info[0] |
|
5997
1700542408f3
flake8 cleanups dehtml.py
John Rouillard <rouilj@ieee.org>
parents:
5838
diff
changeset
|
11 |
|
7228
07ce4e4110f5
flake8 fixes: whitespace, remove unused imports
John Rouillard <rouilj@ieee.org>
parents:
6669
diff
changeset
|
12 |
|
5305
e20f472fde7d
issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
13 class dehtml: |
|
e20f472fde7d
issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
14 def __init__(self, converter): |
|
e20f472fde7d
issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
15 if converter == "none": |
|
e20f472fde7d
issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
16 self.html2text = None |
|
e20f472fde7d
issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
17 return |
|
e20f472fde7d
issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
18 |
|
e20f472fde7d
issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
19 try: |
|
e20f472fde7d
issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
20 if converter == "beautifulsoup": |
|
e20f472fde7d
issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
21 # Not as well tested as dehtml. |
|
e20f472fde7d
issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
22 from bs4 import BeautifulSoup |
|
5997
1700542408f3
flake8 cleanups dehtml.py
John Rouillard <rouilj@ieee.org>
parents:
5838
diff
changeset
|
23 |
|
5305
e20f472fde7d
issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
24 def html2text(html): |
|
6669
ef0975b4291b
Explicitly set parser when calling beautiful soup.
John Rouillard <rouilj@ieee.org>
parents:
6110
diff
changeset
|
25 soup = BeautifulSoup(html, "html.parser") |
|
5305
e20f472fde7d
issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
26 |
|
e20f472fde7d
issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
27 # kill all script and style elements |
|
e20f472fde7d
issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
28 for script in soup(["script", "style"]): |
|
e20f472fde7d
issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
29 script.extract() |
|
e20f472fde7d
issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
30 |
|
7756
6079440ac023
chore(lint): doublequote strings, no yoda conditionals, sort imports...
John Rouillard <rouilj@ieee.org>
parents:
7228
diff
changeset
|
31 return u2s(soup.get_text("\n", strip=True)) |
|
5305
e20f472fde7d
issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
32 |
|
e20f472fde7d
issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
33 self.html2text = html2text |
|
8491
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
34 elif converter == "justhtml": |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
35 from justhtml import stream |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
36 |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
37 def html2text(html): |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
38 # The below does not work. |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
39 # Using stream parser since I couldn't seem to strip |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
40 # 'script' and 'style' blocks. But stream doesn't |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
41 # have error reporting or stripping of text nodes |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
42 # and dropping empty nodes. Also I would like to try |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
43 # its GFM markdown output too even though it keeps |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
44 # tables as html and doesn't completely convert as |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
45 # this would work well for those supporting markdown. |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
46 # |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
47 # ctx used for testing since I have a truncated |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
48 # test doc. It eliminates error from missing DOCTYPE |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
49 # and head. |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
50 # |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
51 #from justhtml import JustHTML |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
52 # from justhtml.context import FragmentContext |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
53 # |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
54 #ctx = FragmentContext('html') |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
55 #justhtml = JustHTML(html,collect_errors=True, |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
56 # fragment_context=ctx) |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
57 # I still have the text output inside style/script tags. |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
58 # with :not(style, script). I do get text contents |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
59 # with query("style, script"). |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
60 # |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
61 #return u2s("\n".join( |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
62 # [elem.to_text(separator="\n", strip=True) |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
63 # for elem in justhtml.query(":not(style, script)")]) |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
64 # ) |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
65 |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
66 # define inline elements so I can accumulate all unbroken |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
67 # text in a single line with embedded inline elements. |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
68 # 'br' is inline but should be treated as a line break |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
69 # and element before/after should not be accumulated |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
70 # together. |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
71 inline_elements = ( |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
72 "a", |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
73 "address", |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
74 "b", |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
75 "cite", |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
76 "code", |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
77 "em", |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
78 "i", |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
79 "img", |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
80 "mark", |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
81 "q", |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
82 "s", |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
83 "small", |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
84 "span", |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
85 "strong", |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
86 "sub", |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
87 "sup", |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
88 "time") |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
89 |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
90 # each line is appended and joined at the end |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
91 text = [] |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
92 # the accumulator for all text in inline elements |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
93 text_accumulator = "" |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
94 # if set skip all lines till matching end tag found |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
95 # used to skip script/style blocks |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
96 skip_till_endtag = None |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
97 # used to force text_accumulator into text with added |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
98 # newline so we have a blank line between paragraphs. |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
99 _need_parabreak = False |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
100 |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
101 for event, data in stream(html): |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
102 if event == "end" and skip_till_endtag == data: |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
103 skip_till_endtag = None |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
104 continue |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
105 if skip_till_endtag: |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
106 continue |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
107 if (event == "start" and |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
108 data[0] in ('script', 'style')): |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
109 skip_till_endtag = data[0] |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
110 continue |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
111 if (event == "start" and |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
112 text_accumulator and |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
113 data[0] not in inline_elements): |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
114 # add accumulator to "text" |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
115 text.append(text_accumulator) |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
116 text_accumulator = "" |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
117 _need_parabreak = False |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
118 elif event == "text": |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
119 if not data.isspace(): |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
120 text_accumulator = text_accumulator + data |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
121 _need_parabreak = True |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
122 elif (_need_parabreak and |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
123 event == "start" and |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
124 data[0] == "p"): |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
125 text.append(text_accumulator + "\n") |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
126 text_accumulator = "" |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
127 _need_parabreak = False |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
128 |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
129 # save anything left in the accumulator at end of document |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
130 if text_accumulator: |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
131 # add newline to match dehtml and beautifulsoup |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
132 text.append(text_accumulator + "\n") |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
133 return u2s("\n".join(text)) |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
134 |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
135 self.html2text = html2text |
|
5305
e20f472fde7d
issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
136 else: |
|
5997
1700542408f3
flake8 cleanups dehtml.py
John Rouillard <rouilj@ieee.org>
parents:
5838
diff
changeset
|
137 raise ImportError |
|
5305
e20f472fde7d
issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
138 except ImportError: |
|
e20f472fde7d
issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
139 # use the fallback below if beautiful soup is not installed. |
|
5411
9c6d98bf79db
Python 3 preparation: update HTMLParser / htmlentitydefs imports.
Joseph Myers <jsm@polyomino.org.uk>
parents:
5376
diff
changeset
|
140 try: |
|
9c6d98bf79db
Python 3 preparation: update HTMLParser / htmlentitydefs imports.
Joseph Myers <jsm@polyomino.org.uk>
parents:
5376
diff
changeset
|
141 # Python 3+. |
|
7756
6079440ac023
chore(lint): doublequote strings, no yoda conditionals, sort imports...
John Rouillard <rouilj@ieee.org>
parents:
7228
diff
changeset
|
142 from html.entities import name2codepoint |
|
5411
9c6d98bf79db
Python 3 preparation: update HTMLParser / htmlentitydefs imports.
Joseph Myers <jsm@polyomino.org.uk>
parents:
5376
diff
changeset
|
143 from html.parser import HTMLParser |
|
9c6d98bf79db
Python 3 preparation: update HTMLParser / htmlentitydefs imports.
Joseph Myers <jsm@polyomino.org.uk>
parents:
5376
diff
changeset
|
144 except ImportError: |
|
9c6d98bf79db
Python 3 preparation: update HTMLParser / htmlentitydefs imports.
Joseph Myers <jsm@polyomino.org.uk>
parents:
5376
diff
changeset
|
145 # Python 2. |
|
7756
6079440ac023
chore(lint): doublequote strings, no yoda conditionals, sort imports...
John Rouillard <rouilj@ieee.org>
parents:
7228
diff
changeset
|
146 from htmlentitydefs import name2codepoint |
|
5411
9c6d98bf79db
Python 3 preparation: update HTMLParser / htmlentitydefs imports.
Joseph Myers <jsm@polyomino.org.uk>
parents:
5376
diff
changeset
|
147 from HTMLParser import HTMLParser |
|
5305
e20f472fde7d
issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
148 |
|
e20f472fde7d
issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
149 class DumbHTMLParser(HTMLParser): |
|
e20f472fde7d
issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
150 # class attribute |
|
5997
1700542408f3
flake8 cleanups dehtml.py
John Rouillard <rouilj@ieee.org>
parents:
5838
diff
changeset
|
151 text = "" |
|
5305
e20f472fde7d
issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
152 |
|
e20f472fde7d
issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
153 # internal state variable |
|
e20f472fde7d
issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
154 _skip_data = False |
|
e20f472fde7d
issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
155 _last_empty = False |
|
e20f472fde7d
issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
156 |
|
e20f472fde7d
issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
157 def handle_data(self, data): |
|
5997
1700542408f3
flake8 cleanups dehtml.py
John Rouillard <rouilj@ieee.org>
parents:
5838
diff
changeset
|
158 if self._skip_data: # skip data in script or style block |
|
5305
e20f472fde7d
issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
159 return |
|
e20f472fde7d
issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
160 |
|
5997
1700542408f3
flake8 cleanups dehtml.py
John Rouillard <rouilj@ieee.org>
parents:
5838
diff
changeset
|
161 if (data.strip() == ""): |
|
5305
e20f472fde7d
issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
162 # reduce multiple blank lines to 1 |
|
5997
1700542408f3
flake8 cleanups dehtml.py
John Rouillard <rouilj@ieee.org>
parents:
5838
diff
changeset
|
163 if (self._last_empty): |
|
5305
e20f472fde7d
issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
164 return |
|
e20f472fde7d
issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
165 else: |
|
e20f472fde7d
issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
166 self._last_empty = True |
|
e20f472fde7d
issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
167 else: |
|
e20f472fde7d
issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
168 self._last_empty = False |
|
e20f472fde7d
issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
169 |
|
5997
1700542408f3
flake8 cleanups dehtml.py
John Rouillard <rouilj@ieee.org>
parents:
5838
diff
changeset
|
170 self.text = self.text + data |
|
5305
e20f472fde7d
issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
171 |
|
7833
b68a1d8fd5d9
chore(lint): use ternary, ignore unused param
John Rouillard <rouilj@ieee.org>
parents:
7756
diff
changeset
|
172 def handle_starttag(self, tag, attrs): # noqa: ARG002 |
|
5997
1700542408f3
flake8 cleanups dehtml.py
John Rouillard <rouilj@ieee.org>
parents:
5838
diff
changeset
|
173 if (tag == "p"): |
|
1700542408f3
flake8 cleanups dehtml.py
John Rouillard <rouilj@ieee.org>
parents:
5838
diff
changeset
|
174 self.text = self.text + "\n" |
|
1700542408f3
flake8 cleanups dehtml.py
John Rouillard <rouilj@ieee.org>
parents:
5838
diff
changeset
|
175 if (tag in ("style", "script")): |
|
5305
e20f472fde7d
issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
176 self._skip_data = True |
|
e20f472fde7d
issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
177 |
|
e20f472fde7d
issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
178 def handle_endtag(self, tag): |
|
5997
1700542408f3
flake8 cleanups dehtml.py
John Rouillard <rouilj@ieee.org>
parents:
5838
diff
changeset
|
179 if (tag in ("style", "script")): |
|
5305
e20f472fde7d
issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
180 self._skip_data = False |
|
e20f472fde7d
issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
181 |
|
e20f472fde7d
issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
182 def handle_entityref(self, name): |
|
e20f472fde7d
issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
183 if self._skip_data: |
|
e20f472fde7d
issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
184 return |
|
5417
c749d6795bc2
Python 3 preparation: unichr.
Joseph Myers <jsm@polyomino.org.uk>
parents:
5416
diff
changeset
|
185 c = uchr(name2codepoint[name]) |
|
5305
e20f472fde7d
issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
186 try: |
|
5997
1700542408f3
flake8 cleanups dehtml.py
John Rouillard <rouilj@ieee.org>
parents:
5838
diff
changeset
|
187 self.text = self.text + c |
|
5305
e20f472fde7d
issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
188 except UnicodeEncodeError: |
|
e20f472fde7d
issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
189 # print a space as a placeholder |
|
7756
6079440ac023
chore(lint): doublequote strings, no yoda conitionals, sort imports...
John Rouillard <rouilj@ieee.org>
parents:
7228
diff
changeset
|
190 self.text = self.text + " " |
|
5305
e20f472fde7d
issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
191 |
|
e20f472fde7d
issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
192 def html2text(html): |
|
7833
b68a1d8fd5d9
chore(lint): use ternary, ignore unused param
John Rouillard <rouilj@ieee.org>
parents:
7756
diff
changeset
|
193 parser = DumbHTMLParser( |
|
b68a1d8fd5d9
chore(lint): use ternary, ignore unused param
John Rouillard <rouilj@ieee.org>
parents:
7756
diff
changeset
|
194 convert_charrefs=True) if _pyver == 3 else DumbHTMLParser() |
|
5305
e20f472fde7d
issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
195 parser.feed(html) |
|
e20f472fde7d
issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
196 parser.close() |
|
e20f472fde7d
issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
197 return parser.text |
|
e20f472fde7d
issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
198 |
|
e20f472fde7d
issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
199 self.html2text = html2text |
|
e20f472fde7d
issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
200 |
|
5997
1700542408f3
flake8 cleanups dehtml.py
John Rouillard <rouilj@ieee.org>
parents:
5838
diff
changeset
|
201 |
|
7756
6079440ac023
chore(lint): doublequote strings, no yoda conitionals, sort imports...
John Rouillard <rouilj@ieee.org>
parents:
7228
diff
changeset
|
202 if __name__ == "__main__": |
|
8491
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
203 # ruff: noqa: B011 S101 |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
204 |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
205 try: |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
206 assert False |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
207 except AssertionError: |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
208 pass |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
209 else: |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
210 print("Error, assertions turned off. Test fails") |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
211 sys.exit(1) |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
212 |
|
7756
6079440ac023
chore(lint): doublequote strings, no yoda conitionals, sort imports...
John Rouillard <rouilj@ieee.org>
parents:
7228
diff
changeset
|
213 html = """ |
|
5305
e20f472fde7d
issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
214 <body> |
|
e20f472fde7d
issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
215 <script> |
|
e20f472fde7d
issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
216 this must not be in output |
|
e20f472fde7d
issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
217 </script> |
|
e20f472fde7d
issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
218 <style> |
|
e20f472fde7d
issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
219 p {display:block} |
|
e20f472fde7d
issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
220 </style> |
|
e20f472fde7d
issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
221 <div class="header"><h1>Roundup</h1> |
|
e20f472fde7d
issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
222 <div id="searchbox" style="display: none"> |
|
e20f472fde7d
issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
223 <form class="search" action="../search.html" method="get"> |
|
e20f472fde7d
issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
224 <input type="text" name="q" size="18" /> |
|
e20f472fde7d
issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
225 <input type="submit" value="Search" /> |
|
e20f472fde7d
issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
226 <input type="hidden" name="check_keywords" value="yes" /> |
|
e20f472fde7d
issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
227 <input type="hidden" name="area" value="default" /> |
|
e20f472fde7d
issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
228 </form> |
|
e20f472fde7d
issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
229 </div> |
|
e20f472fde7d
issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
230 <script type="text/javascript">$('#searchbox').show(0);</script> |
|
e20f472fde7d
issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
231 </div> |
|
e20f472fde7d
issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
232 <ul class="current"> |
|
e20f472fde7d
issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
233 <li class="toctree-l1"><a class="reference internal" href="../index.html">Home</a></li> |
|
e20f472fde7d
issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
234 <li class="toctree-l1"><a class="reference external" href="http://pypi.python.org/pypi/roundup">Download</a></li> |
|
e20f472fde7d
issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
235 <li class="toctree-l1 current"><a class="reference internal" href="../docs.html">Docs</a><ul class="current"> |
|
e20f472fde7d
issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
236 <li class="toctree-l2"><a class="reference internal" href="features.html">Roundup Features</a></li> |
|
e20f472fde7d
issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
237 <li class="toctree-l2 current"><a class="current reference internal" href="">Installing Roundup</a></li> |
|
e20f472fde7d
issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
238 <li class="toctree-l2"><a class="reference internal" href="upgrading.html">Upgrading to newer versions of Roundup</a></li> |
|
e20f472fde7d
issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
239 <li class="toctree-l2"><a class="reference internal" href="FAQ.html">Roundup FAQ</a></li> |
|
e20f472fde7d
issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
240 <li class="toctree-l2"><a class="reference internal" href="user_guide.html">User Guide</a></li> |
|
e20f472fde7d
issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
241 <li class="toctree-l2"><a class="reference internal" href="customizing.html">Customising Roundup</a></li> |
|
e20f472fde7d
issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
242 <li class="toctree-l2"><a class="reference internal" href="admin_guide.html">Administration Guide</a></li> |
|
e20f472fde7d
issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
243 </ul> |
|
e20f472fde7d
issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
244 <div class="section" id="prerequisites"> |
|
8491
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
245 <H2><a class="toc-backref" href="#id5">Prerequisites</a></H2> |
|
5305
e20f472fde7d
issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
246 <p>Roundup requires Python 2.5 or newer (but not Python 3) with a functioning |
|
e20f472fde7d
issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
247 anydbm module. Download the latest version from <a class="reference external" href="http://www.python.org/">http://www.python.org/</a>. |
|
8491
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
248 It is highly recommended that users install the <span>latest patch version</span> |
|
5305
e20f472fde7d
issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
249 of python as these contain many fixes to serious bugs.</p> |
|
e20f472fde7d
issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
250 <p>Some variants of Linux will need an additional “python dev” package |
|
e20f472fde7d
issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
251 installed for Roundup installation to work. Debian and derivatives, are |
|
e20f472fde7d
issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
252 known to require this.</p> |
|
e20f472fde7d
issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
253 <p>If you’re on windows, you will either need to be using the ActiveState python |
|
e20f472fde7d
issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
254 distribution (at <a class="reference external" href="http://www.activestate.com/Products/ActivePython/">http://www.activestate.com/Products/ActivePython/</a>), or you’ll |
|
e20f472fde7d
issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
255 have to install the win32all package separately (get it from |
|
e20f472fde7d
issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
256 <a class="reference external" href="http://starship.python.net/crew/mhammond/win32/">http://starship.python.net/crew/mhammond/win32/</a>).</p> |
|
5838
b74f0b50bef1
Fix CI deprication warning for HTMLParser convert_charrefs under py3.
John Rouillard <rouilj@ieee.org>
parents:
5417
diff
changeset
|
257 <script> |
|
b74f0b50bef1
Fix CI deprication warning for HTMLParser convert_charrefs under py3.
John Rouillard <rouilj@ieee.org>
parents:
5417
diff
changeset
|
258 < HELP > |
|
b74f0b50bef1
Fix CI deprication warning for HTMLParser convert_charrefs under py3.
John Rouillard <rouilj@ieee.org>
parents:
5417
diff
changeset
|
259 </script> |
|
5305
e20f472fde7d
issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
260 </div> |
|
e20f472fde7d
issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
261 </body> |
|
7756
6079440ac023
chore(lint): doublequote strings, no yoda conitionals, sort imports...
John Rouillard <rouilj@ieee.org>
parents:
7228
diff
changeset
|
262 """ |
|
5305
e20f472fde7d
issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
263 |
|
8491
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
264 if len(sys.argv) > 1: |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
265 with open(sys.argv[1]) as h: |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
266 html = h.read() |
|
5305
e20f472fde7d
issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
267 |
|
8491
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
268 print("==== beautifulsoup") |
|
5305
e20f472fde7d
issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
269 try: |
|
e20f472fde7d
issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
270 # trap error seen if N_TOKENS not defined when run. |
|
e20f472fde7d
issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
271 html2text = dehtml("beautifulsoup").html2text |
|
e20f472fde7d
issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
272 if html2text: |
|
8491
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
273 text = html2text(html) |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
274 assert ('HELP' not in text) |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
275 assert ('display:block' not in text) |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
276 print(text) |
|
5305
e20f472fde7d
issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
277 except NameError as e: |
|
5997
1700542408f3
flake8 cleanups dehtml.py
John Rouillard <rouilj@ieee.org>
parents:
5838
diff
changeset
|
278 print("captured error %s" % e) |
|
5305
e20f472fde7d
issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
279 |
|
8491
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
280 print("==== justhtml") |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
281 try: |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
282 html2text = dehtml("justhtml").html2text |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
283 if html2text: |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
284 text = html2text(html) |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
285 assert ('HELP' not in text) |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
286 assert ('display:block' not in text) |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
287 print(text) |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
288 except NameError as e: |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
289 print("captured error %s" % e) |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
290 |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
291 print("==== dehtml") |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
292 html2text = dehtml("dehtml").html2text |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
293 if html2text: |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
294 text = html2text(html) |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
295 assert ('HELP' not in text) |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
296 assert ('display:block' not in text) |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
297 print(text) |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
298 |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
299 print("==== disabled html -> text conversion") |
|
5305
e20f472fde7d
issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
300 html2text = dehtml("none").html2text |
|
e20f472fde7d
issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
301 if html2text: |
|
5376
64b05e24dbd8
Python 3 preparation: convert print to a function.
Joseph Myers <jsm@polyomino.org.uk>
parents:
5305
diff
changeset
|
302 print("FAIL: Error, dehtml(none) is returning a function") |
|
5305
e20f472fde7d
issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
303 else: |
|
5376
64b05e24dbd8
Python 3 preparation: convert print to a function.
Joseph Myers <jsm@polyomino.org.uk>
parents:
5305
diff
changeset
|
304 print("PASS: dehtml(none) is returning None") |
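The fallback parser in the listing above drops the contents of script and style elements by toggling a `_skip_data` flag in its start-tag and end-tag handlers. The following is a minimal standalone sketch of that technique, assuming Python 3 only and the standard library `html.parser.HTMLParser`. The names `SkipScriptStyleParser` and `strip_html` are hypothetical and chosen for illustration; this is not the module's own class and it omits the blank-line collapsing done by the real `handle_data`.

```python
from html.parser import HTMLParser


class SkipScriptStyleParser(HTMLParser):
    """Collect text content, ignoring anything inside <script> or <style>.

    Hypothetical illustration class, not part of roundup.
    """

    def __init__(self):
        # convert_charrefs=True makes the parser decode references such as
        # &amp; before handle_data() is called, so no handle_entityref()
        # override is needed on Python 3.
        super().__init__(convert_charrefs=True)
        self.text = ""
        self._skip_data = False

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lowercased from HTMLParser.
        if tag == "p":
            self.text += "\n"
        if tag in ("style", "script"):
            self._skip_data = True

    def handle_endtag(self, tag):
        if tag in ("style", "script"):
            self._skip_data = False

    def handle_data(self, data):
        # Drop data while inside a script or style element.
        if not self._skip_data:
            self.text += data


def strip_html(html):
    """Return the text of an HTML string without script/style content."""
    parser = SkipScriptStyleParser()
    parser.feed(html)
    parser.close()
    return parser.text


if __name__ == "__main__":
    sample = "<p>Hi &amp; bye</p><script>HELP</script>"
    print(repr(strip_html(sample)))
```

The flag approach is simple but assumes script and style elements are not nested and are always closed; a counter would be a sturdier choice if that assumption does not hold.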
