annotate roundup/dehtml.py @ 8527:d4a43d9da8ef
chore(build): build(deps): bump anchore/scan-action from 7.3.1 to 7.3.2 pull #82
| author | John Rouillard <rouilj@ieee.org> |
|---|---|
| date | Mon, 23 Feb 2026 20:16:55 -0500 |
| parents | 520075b29474 |
| children | 9c3ec0a5c7fc |
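
The comments in the `justhtml` branch of this file describe gathering the text of inline elements into one line, starting a new line at block-level elements, and skipping everything inside `script` and `style`. The sketch below illustrates that idea only. It swaps in the standard library's `html.parser` instead of `justhtml`'s `stream`, and the names `InlineAccumulator`, `INLINE`, and `to_text` are hypothetical and do not come from roundup.

```python
from html.parser import HTMLParser

# Hypothetical subset of the inline elements named in the annotated source.
INLINE = {"a", "b", "code", "em", "i", "span", "strong"}


class InlineAccumulator(HTMLParser):
    """Sketch: join inline text into one line, skip script/style."""

    def __init__(self):
        super().__init__()
        self.lines = []       # finished output lines
        self.buffer = ""      # text gathered across inline elements
        self.skip_tag = None  # set while inside a script/style block

    def flush(self):
        # Move any accumulated text into the list of finished lines.
        if self.buffer.strip():
            self.lines.append(self.buffer.strip())
        self.buffer = ""

    def handle_starttag(self, tag, attrs):
        if self.skip_tag:
            return
        if tag in ("script", "style"):
            self.skip_tag = tag
        elif tag not in INLINE:
            # A non-inline element (including br) ends the current line.
            self.flush()

    def handle_endtag(self, tag):
        if tag == self.skip_tag:
            self.skip_tag = None

    def handle_data(self, data):
        # Ignore skipped blocks and whitespace-only text nodes.
        if self.skip_tag is None and not data.isspace():
            self.buffer += data


def to_text(html):
    parser = InlineAccumulator()
    parser.feed(html)
    parser.close()
    parser.flush()  # save anything left at end of document
    return "\n".join(parser.lines)
```

Under these assumptions, `to_text("<p>Hello <b>bold</b> world</p><script>alert(1)</script><p>Next</p>")` yields two lines, `Hello bold world` and `Next`. Unlike the annotated code, this sketch does not add a blank line between paragraphs or a trailing newline.
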
| rev | line source |
|---|---|
|
5305
e20f472fde7d
issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
1 |
|
5376
64b05e24dbd8
Python 3 preparation: convert print to a function.
Joseph Myers <jsm@polyomino.org.uk>
parents:
5305
diff
changeset
|
2 from __future__ import print_function |
|
7756
6079440ac023
chore(lint): doublequote strings, no yoda conitionals, sort imports...
John Rouillard <rouilj@ieee.org>
parents:
7228
diff
changeset
|
3 |
|
6079440ac023
chore(lint): doublequote strings, no yoda conitionals, sort imports...
John Rouillard <rouilj@ieee.org>
parents:
7228
diff
changeset
|
4 import sys |
|
6079440ac023
chore(lint): doublequote strings, no yoda conitionals, sort imports...
John Rouillard <rouilj@ieee.org>
parents:
7228
diff
changeset
|
5 |
|
5417
c749d6795bc2
Python 3 preparation: unichr.
Joseph Myers <jsm@polyomino.org.uk>
parents:
5416
diff
changeset
|
6 from roundup.anypy.strings import u2s, uchr |
|
5997
1700542408f3
flake8 cleanups dehtml.py
John Rouillard <rouilj@ieee.org>
parents:
5838
diff
changeset
|
7 |
|
8491
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
8 # ruff PLC0415 ignore imports not at top of file |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
9 # ruff RET505 ignore else after return |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
10 # ruff: noqa: PLC0415 RET505 |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
11 |
|
6110
af81e7a4302f
don't get confused by python-future making Python 3 package names available under Python 2 (but only with Python 2 functionality)
Christof Meerwald <cmeerw@cmeerw.org>
parents:
5997
diff
changeset
|
12 _pyver = sys.version_info[0] |
|
5997
1700542408f3
flake8 cleanups dehtml.py
John Rouillard <rouilj@ieee.org>
parents:
5838
diff
changeset
|
13 |
|
7228
07ce4e4110f5
flake8 fixes: whitespace, remove unused imports
John Rouillard <rouilj@ieee.org>
parents:
6669
diff
changeset
|
14 |
|
5305
e20f472fde7d
issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
15 class dehtml: |
|
e20f472fde7d
issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
16 def __init__(self, converter): |
|
e20f472fde7d
issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
17 if converter == "none": |
|
e20f472fde7d
issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
18 self.html2text = None |
|
e20f472fde7d
issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
19 return |
|
e20f472fde7d
issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
20 |
|
e20f472fde7d
issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
21 try: |
|
e20f472fde7d
issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
22 if converter == "beautifulsoup": |
|
e20f472fde7d
issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
23 # Not as well tested as dehtml. |
|
e20f472fde7d
issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
24 from bs4 import BeautifulSoup |
|
5997
1700542408f3
flake8 cleanups dehtml.py
John Rouillard <rouilj@ieee.org>
parents:
5838
diff
changeset
|
25 |
|
5305
e20f472fde7d
issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
26 def html2text(html): |
|
6669
ef0975b4291b
Explicitly set parser when calling beautiful soup.
John Rouillard <rouilj@ieee.org>
parents:
6110
diff
changeset
|
27 soup = BeautifulSoup(html, "html.parser") |
|
5305
e20f472fde7d
issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
28 |
|
e20f472fde7d
issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
29 # kill all script and style elements |
|
e20f472fde7d
issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
30 for script in soup(["script", "style"]): |
|
e20f472fde7d
issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
31 script.extract() |
|
e20f472fde7d
issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
32 |
|
7756
6079440ac023
chore(lint): doublequote strings, no yoda conitionals, sort imports...
John Rouillard <rouilj@ieee.org>
parents:
7228
diff
changeset
|
33 return u2s(soup.get_text("\n", strip=True)) |
|
5305
e20f472fde7d
issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
34 |
|
e20f472fde7d
issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
35 self.html2text = html2text |
|
8491
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
36 elif converter == "justhtml": |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
37 from justhtml import stream |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
38 |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
39 def html2text(html): |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
40 # The below does not work. |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
41 # Using stream parser since I couldn't seem to strip |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
42 # 'script' and 'style' blocks. But stream doesn't |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
43 # have error reporting or stripping of text nodes |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
44 # and dropping empty nodes. Also I would like to try |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
45 # its GFM markdown output too even though it keeps |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
46 # tables as html and doesn't completely convert as |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
47 # this would work well for those supporting markdown. |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
48 # |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
49 # ctx used for testing since I have a truncated |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
50 # test doc. It eliminates error from missing DOCTYPE |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
51 # and head. |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
52 # |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
53 #from justhtml import JustHTML |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
54 # from justhtml.context import FragmentContext |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
55 # |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
56 #ctx = FragmentContext('html') |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
57 #justhtml = JustHTML(html,collect_errors=True, |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
58 # fragment_context=ctx) |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
59 # I still have the text output inside style/script tags. |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
60 # with :not(style, script). I do get text contents |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
61 # with query("style, script"). |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
62 # |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
63 #return u2s("\n".join( |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
64 # [elem.to_text(separator="\n", strip=True) |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
65 # for elem in justhtml.query(":not(style, script)")]) |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
66 # ) |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
67 |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
68 # define inline elements so I can accumulate all unbroken |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
69 # text in a single line with embedded inline elements. |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
70 # 'br' is inline but should be treated as a line break |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
71 # and element before/after should not be accumulated |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
72 # together. |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
73 inline_elements = ( |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
74 "a", |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
75 "address", |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
76 "b", |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
77 "cite", |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
78 "code", |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
79 "em", |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
80 "i", |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
81 "img", |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
82 "mark", |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
83 "q", |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
84 "s", |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
85 "small", |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
86 "span", |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
87 "strong", |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
88 "sub", |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
89 "sup", |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
90 "time") |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
91 |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
92 # each line is appended and joined at the end |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
93 text = [] |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
94 # the accumulator for all text in inline elements |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
95 text_accumulator = "" |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
96 # if set skip all lines till matching end tag found |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
97 # used to skip script/style blocks |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
98 skip_till_endtag = None |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
99 # used to force text_accumulator into text with added |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
100 # newline so we have a blank line between paragraphs. |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
101 _need_parabreak = False |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
102 |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
103 for event, data in stream(html): |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
104 if event == "end" and skip_till_endtag == data: |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
105 skip_till_endtag = None |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
106 continue |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
107 if skip_till_endtag: |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
108 continue |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
109 if (event == "start" and |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
110 data[0] in ('script', 'style')): |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
111 skip_till_endtag = data[0] |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
112 continue |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
113 if (event == "start" and |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
114 text_accumulator and |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
115 data[0] not in inline_elements): |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
116 # add accumulator to "text" |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
117 text.append(text_accumulator) |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
118 text_accumulator = "" |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
119 _need_parabreak = False |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
120 elif event == "text": |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
121 if not data.isspace(): |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
122 text_accumulator = text_accumulator + data |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
123 _need_parabreak = True |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
124 elif (_need_parabreak and |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
125 event == "start" and |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
126 data[0] == "p"): |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
127 text.append(text_accumulator + "\n") |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
128 text_accumulator = "" |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
129 _need_parabreak = False |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
130 |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
131 # save anything left in the accumulator at end of document |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
132 if text_accumulator: |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
133 # add newline to match dehtml and beautifulsoup |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
134 text.append(text_accumulator + "\n") |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
135 return u2s("\n".join(text)) |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
136 |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
137 self.html2text = html2text |
|
5305
e20f472fde7d
issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
138 else: |
|
5997
1700542408f3
flake8 cleanups dehtml.py
John Rouillard <rouilj@ieee.org>
parents:
5838
diff
changeset
|
139 raise ImportError |
|
5305
e20f472fde7d
issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
140 except ImportError: |
|
e20f472fde7d
issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
141 # use the fallback below if beautiful soup is not installed. |
|
5411
9c6d98bf79db
Python 3 preparation: update HTMLParser / htmlentitydefs imports.
Joseph Myers <jsm@polyomino.org.uk>
parents:
5376
diff
changeset
|
142 try: |
|
9c6d98bf79db
Python 3 preparation: update HTMLParser / htmlentitydefs imports.
Joseph Myers <jsm@polyomino.org.uk>
parents:
5376
diff
changeset
|
143 # Python 3+. |
|
7756
6079440ac023
chore(lint): doublequote strings, no yoda conitionals, sort imports...
John Rouillard <rouilj@ieee.org>
parents:
7228
diff
changeset
|
144 from html.entities import name2codepoint |
|
5411
9c6d98bf79db
Python 3 preparation: update HTMLParser / htmlentitydefs imports.
Joseph Myers <jsm@polyomino.org.uk>
parents:
5376
diff
changeset
|
145 from html.parser import HTMLParser |
|
9c6d98bf79db
Python 3 preparation: update HTMLParser / htmlentitydefs imports.
Joseph Myers <jsm@polyomino.org.uk>
parents:
5376
diff
changeset
|
146 except ImportError: |
|
9c6d98bf79db
Python 3 preparation: update HTMLParser / htmlentitydefs imports.
Joseph Myers <jsm@polyomino.org.uk>
parents:
5376
diff
changeset
|
147 # Python 2. |
|
7756
6079440ac023
chore(lint): doublequote strings, no yoda conitionals, sort imports...
John Rouillard <rouilj@ieee.org>
parents:
7228
diff
changeset
|
148 from htmlentitydefs import name2codepoint |
|
5411
9c6d98bf79db
Python 3 preparation: update HTMLParser / htmlentitydefs imports.
Joseph Myers <jsm@polyomino.org.uk>
parents:
5376
diff
changeset
|
149 from HTMLParser import HTMLParser |
|
5305
e20f472fde7d
issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
150 |
|
e20f472fde7d
issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
151 class DumbHTMLParser(HTMLParser): |
|
e20f472fde7d
issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
152 # class attribute |
|
5997
1700542408f3
flake8 cleanups dehtml.py
John Rouillard <rouilj@ieee.org>
parents:
5838
diff
changeset
|
153 text = "" |
|
5305
e20f472fde7d
issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
154 |
|
e20f472fde7d
issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
155 # internal state variable |
|
e20f472fde7d
issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
156 _skip_data = False |
|
e20f472fde7d
issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
157 _last_empty = False |
|
e20f472fde7d
issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
158 |
|
e20f472fde7d
issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
159 def handle_data(self, data): |
|
5997
1700542408f3
flake8 cleanups dehtml.py
John Rouillard <rouilj@ieee.org>
parents:
5838
diff
changeset
|
160 if self._skip_data: # skip data in script or style block |
|
5305
e20f472fde7d
issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
161 return |
|
e20f472fde7d
issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
162 |
|
5997
1700542408f3
flake8 cleanups dehtml.py
John Rouillard <rouilj@ieee.org>
parents:
5838
diff
changeset
|
163 if (data.strip() == ""): |
|
5305
e20f472fde7d
issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
164 # reduce multiple blank lines to 1 |
|
5997
1700542408f3
flake8 cleanups dehtml.py
John Rouillard <rouilj@ieee.org>
parents:
5838
diff
changeset
|
165 if (self._last_empty): |
|
5305
e20f472fde7d
issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
166 return |
|
e20f472fde7d
issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
167 else: |
|
e20f472fde7d
issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
168 self._last_empty = True |
|
e20f472fde7d
issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
169 else: |
|
e20f472fde7d
issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
170 self._last_empty = False |
|
e20f472fde7d
issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
171 |
|
5997
1700542408f3
flake8 cleanups dehtml.py
John Rouillard <rouilj@ieee.org>
parents:
5838
diff
changeset
|
172 self.text = self.text + data |
|
5305
e20f472fde7d
issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
173 |
|
7833
b68a1d8fd5d9
chore(lint): use ternary, ignore unused param
John Rouillard <rouilj@ieee.org>
parents:
7756
diff
changeset
|
174 def handle_starttag(self, tag, attrs): # noqa: ARG002 |
|
5997
1700542408f3
flake8 cleanups dehtml.py
John Rouillard <rouilj@ieee.org>
parents:
5838
diff
changeset
|
175 if (tag == "p"): |
|
1700542408f3
flake8 cleanups dehtml.py
John Rouillard <rouilj@ieee.org>
parents:
5838
diff
changeset
|
176 self.text = self.text + "\n" |
|
1700542408f3
flake8 cleanups dehtml.py
John Rouillard <rouilj@ieee.org>
parents:
5838
diff
changeset
|
177 if (tag in ("style", "script")): |
|
5305
e20f472fde7d
issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
178 self._skip_data = True |
|
e20f472fde7d
issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
179 |
|
e20f472fde7d
issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
180 def handle_endtag(self, tag): |
|
5997
1700542408f3
flake8 cleanups dehtml.py
John Rouillard <rouilj@ieee.org>
parents:
5838
diff
changeset
|
181 if (tag in ("style", "script")): |
|
5305
e20f472fde7d
issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
182 self._skip_data = False |
|
e20f472fde7d
issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
183 |
|
e20f472fde7d
issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
184 def handle_entityref(self, name): |
|
e20f472fde7d
issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
185 if self._skip_data: |
|
e20f472fde7d
issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
186 return |
|
5417
c749d6795bc2
Python 3 preparation: unichr.
Joseph Myers <jsm@polyomino.org.uk>
parents:
5416
diff
changeset
|
187 c = uchr(name2codepoint[name]) |
|
5305
e20f472fde7d
issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
188 try: |
|
5997
1700542408f3
flake8 cleanups dehtml.py
John Rouillard <rouilj@ieee.org>
parents:
5838
diff
changeset
|
189 self.text = self.text + c |
|
5305
e20f472fde7d
issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
190 except UnicodeEncodeError: |
|
e20f472fde7d
issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
191 # append a space as a placeholder |
|
7756
6079440ac023
chore(lint): doublequote strings, no yoda conitionals, sort imports...
John Rouillard <rouilj@ieee.org>
parents:
7228
diff
changeset
|
192 self.text = self.text + " " |
|
5305
e20f472fde7d
issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
193 |
|
e20f472fde7d
issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
194 def html2text(html): |
|
7833
b68a1d8fd5d9
chore(lint): use ternary, ignore unused param
John Rouillard <rouilj@ieee.org>
parents:
7756
diff
changeset
|
195 parser = DumbHTMLParser( |
|
b68a1d8fd5d9
chore(lint): use ternary, ignore unused param
John Rouillard <rouilj@ieee.org>
parents:
7756
diff
changeset
|
196 convert_charrefs=True) if _pyver == 3 else DumbHTMLParser() |
|
5305
e20f472fde7d
issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
197 parser.feed(html) |
|
e20f472fde7d
issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
198 parser.close() |
|
e20f472fde7d
issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
199 return parser.text |
|
e20f472fde7d
issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
200 |
|
e20f472fde7d
issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
201 self.html2text = html2text |
|
e20f472fde7d
issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
202 |
|
5997
1700542408f3
flake8 cleanups dehtml.py
John Rouillard <rouilj@ieee.org>
parents:
5838
diff
changeset
|
203 |
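The fallback parser in lines 174-199 drops everything between style and script tags, starts a new line at each p tag, and maps named entities to characters. Below is a minimal standalone sketch of that technique, assuming Python 3 only. The names SkippingTextParser and sketch_html2text are hypothetical and are not part of roundup. The sketch leaves out the whitespace handling done in the listing's handle_data, and it passes convert_charrefs=False so that handle_entityref is actually called (the listing passes convert_charrefs=True under Python 3). Falling back to a space for an unknown entity name is an assumption of this sketch, not something the listing does.

```python
from html.entities import name2codepoint
from html.parser import HTMLParser


class SkippingTextParser(HTMLParser):
    """Hypothetical stand-in for the listing's DumbHTMLParser."""

    def __init__(self):
        # convert_charrefs=False keeps handle_entityref in play.
        HTMLParser.__init__(self, convert_charrefs=False)
        self.text = ""
        self._skip_data = False

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.text += "\n"
        if tag in ("style", "script"):
            self._skip_data = True

    def handle_endtag(self, tag):
        if tag in ("style", "script"):
            self._skip_data = False

    def handle_data(self, data):
        # Text inside style/script is discarded.
        if not self._skip_data:
            self.text += data

    def handle_entityref(self, name):
        if self._skip_data:
            return
        codepoint = name2codepoint.get(name)
        # Assumption: an unknown entity name becomes a space placeholder.
        self.text += chr(codepoint) if codepoint is not None else " "


def sketch_html2text(html):
    """Feed html through the sketch parser and return the collected text."""
    parser = SkippingTextParser()
    parser.feed(html)
    parser.close()
    return parser.text
```

Calling sketch_html2text on a fragment that contains a script block returns only the text outside that block, which is the same property the self-test at the bottom of the file asserts for each converter.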
|
7756
6079440ac023
chore(lint): doublequote strings, no yoda conitionals, sort imports...
John Rouillard <rouilj@ieee.org>
parents:
7228
diff
changeset
|
204 if __name__ == "__main__": |
|
8491
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
205 # ruff: noqa: B011 S101 |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
206 |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
207 try: |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
208 assert False |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
209 except AssertionError: |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
210 pass |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
211 else: |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
212 print("Error, assertions turned off. Test fails") |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
213 sys.exit(1) |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
214 |
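Lines 207-213 guard the self-test against being run with assertions disabled (for example under python -O), because every later check relies on assert. A small sketch of the same probe wrapped in a function follows; the name assertions_enabled is hypothetical and does not appear in the listing.

```python
def assertions_enabled():
    """Return True when assert statements are executed by this interpreter."""
    try:
        assert False
    except AssertionError:
        # The assert fired, so assertions are active.
        return True
    # The assert was compiled away (optimised mode).
    return False
```

The result should agree with the built-in constant __debug__, which is False when Python is started with -O.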
|
7756
6079440ac023
chore(lint): doublequote strings, no yoda conitionals, sort imports...
John Rouillard <rouilj@ieee.org>
parents:
7228
diff
changeset
|
215 html = """ |
|
5305
e20f472fde7d
issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
216 <body> |
|
e20f472fde7d
issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
217 <script> |
|
e20f472fde7d
issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
218 this must not be in output |
|
e20f472fde7d
issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
219 </script> |
|
e20f472fde7d
issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
220 <style> |
|
e20f472fde7d
issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
221 p {display:block} |
|
e20f472fde7d
issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
222 </style> |
|
e20f472fde7d
issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
223 <div class="header"><h1>Roundup</h1> |
|
e20f472fde7d
issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
224 <div id="searchbox" style="display: none"> |
|
e20f472fde7d
issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
225 <form class="search" action="../search.html" method="get"> |
|
e20f472fde7d
issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
226 <input type="text" name="q" size="18" /> |
|
e20f472fde7d
issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
227 <input type="submit" value="Search" /> |
|
e20f472fde7d
issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
228 <input type="hidden" name="check_keywords" value="yes" /> |
|
e20f472fde7d
issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
229 <input type="hidden" name="area" value="default" /> |
|
e20f472fde7d
issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
230 </form> |
|
e20f472fde7d
issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
231 </div> |
|
e20f472fde7d
issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
232 <script type="text/javascript">$('#searchbox').show(0);</script> |
|
e20f472fde7d
issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
233 </div> |
|
e20f472fde7d
issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
234 <ul class="current"> |
|
e20f472fde7d
issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
235 <li class="toctree-l1"><a class="reference internal" href="../index.html">Home</a></li> |
|
e20f472fde7d
issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
236 <li class="toctree-l1"><a class="reference external" href="http://pypi.python.org/pypi/roundup">Download</a></li> |
|
e20f472fde7d
issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
237 <li class="toctree-l1 current"><a class="reference internal" href="../docs.html">Docs</a><ul class="current"> |
|
e20f472fde7d
issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
238 <li class="toctree-l2"><a class="reference internal" href="features.html">Roundup Features</a></li> |
|
e20f472fde7d
issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
239 <li class="toctree-l2 current"><a class="current reference internal" href="">Installing Roundup</a></li> |
|
e20f472fde7d
issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
240 <li class="toctree-l2"><a class="reference internal" href="upgrading.html">Upgrading to newer versions of Roundup</a></li> |
|
e20f472fde7d
issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
241 <li class="toctree-l2"><a class="reference internal" href="FAQ.html">Roundup FAQ</a></li> |
|
e20f472fde7d
issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
242 <li class="toctree-l2"><a class="reference internal" href="user_guide.html">User Guide</a></li> |
|
e20f472fde7d
issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
243 <li class="toctree-l2"><a class="reference internal" href="customizing.html">Customising Roundup</a></li> |
|
e20f472fde7d
issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
244 <li class="toctree-l2"><a class="reference internal" href="admin_guide.html">Administration Guide</a></li> |
|
e20f472fde7d
issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
245 </ul> |
|
e20f472fde7d
issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
246 <div class="section" id="prerequisites"> |
|
8491
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
247 <H2><a class="toc-backref" href="#id5">Prerequisites</a></H2> |
|
5305
e20f472fde7d
issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
248 <p>Roundup requires Python 2.5 or newer (but not Python 3) with a functioning |
|
e20f472fde7d
issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
249 anydbm module. Download the latest version from <a class="reference external" href="http://www.python.org/">http://www.python.org/</a>. |
|
8491
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
250 It is highly recommended that users install the <span>latest patch version</span> |
|
5305
e20f472fde7d
issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
251 of python as these contain many fixes to serious bugs.</p> |
|
e20f472fde7d
issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
252 <p>Some variants of Linux will need an additional “python dev” package |
|
e20f472fde7d
issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
253 installed for Roundup installation to work. Debian and derivatives, are |
|
e20f472fde7d
issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
254 known to require this.</p> |
|
e20f472fde7d
issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
255 <p>If you’re on windows, you will either need to be using the ActiveState python |
|
e20f472fde7d
issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
256 distribution (at <a class="reference external" href="http://www.activestate.com/Products/ActivePython/">http://www.activestate.com/Products/ActivePython/</a>), or you’ll |
|
e20f472fde7d
issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
257 have to install the win32all package separately (get it from |
|
e20f472fde7d
issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
258 <a class="reference external" href="http://starship.python.net/crew/mhammond/win32/">http://starship.python.net/crew/mhammond/win32/</a>).</p> |
|
5838
b74f0b50bef1
Fix CI deprication warning for HTMLParser convert_charrefs under py3.
John Rouillard <rouilj@ieee.org>
parents:
5417
diff
changeset
|
259 <script> |
|
b74f0b50bef1
Fix CI deprication warning for HTMLParser convert_charrefs under py3.
John Rouillard <rouilj@ieee.org>
parents:
5417
diff
changeset
|
260 < HELP > |
|
b74f0b50bef1
Fix CI deprication warning for HTMLParser convert_charrefs under py3.
John Rouillard <rouilj@ieee.org>
parents:
5417
diff
changeset
|
261 </script> |
|
5305
e20f472fde7d
issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
262 </div> |
|
e20f472fde7d
issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
263 </body> |
|
7756
6079440ac023
chore(lint): doublequote strings, no yoda conitionals, sort imports...
John Rouillard <rouilj@ieee.org>
parents:
7228
diff
changeset
|
264 """ |
|
5305
e20f472fde7d
issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
265 |
|
8491
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
266 if len(sys.argv) > 1: |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
267 with open(sys.argv[1]) as h: |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
268 html = h.read() |
|
5305
e20f472fde7d
issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
269 |
|
8491
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
270 print("==== beautifulsoup") |
|
5305
e20f472fde7d
issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
271 try: |
|
e20f472fde7d
issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
272 # trap error seen if N_TOKENS not defined when run. |
|
e20f472fde7d
issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
273 html2text = dehtml("beautifulsoup").html2text |
|
e20f472fde7d
issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
274 if html2text: |
|
8491
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
275 text = html2text(html) |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
276 assert ('HELP' not in text) |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
277 assert ('display:block' not in text) |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
278 print(text) |
|
5305
e20f472fde7d
issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
279 except NameError as e: |
|
5997
1700542408f3
flake8 cleanups dehtml.py
John Rouillard <rouilj@ieee.org>
parents:
5838
diff
changeset
|
280 print("captured error %s" % e) |
|
5305
e20f472fde7d
issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
281 |
|
8491
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
282 print("==== justhtml") |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
283 try: |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
284 html2text = dehtml("justhtml").html2text |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
285 if html2text: |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
286 text = html2text(html) |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
287 assert ('HELP' not in text) |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
288 assert ('display:block' not in text) |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
289 print(text) |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
290 except NameError as e: |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
291 print("captured error %s" % e) |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
292 |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
293 print("==== dehtml") |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
294 html2text = dehtml("dehtml").html2text |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
295 if html2text: |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
296 text = html2text(html) |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
297 assert ('HELP' not in text) |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
298 assert ('display:block' not in text) |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
299 print(text) |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
300 |
|
520075b29474
feat: support justhtml parsing library to convert email to plain text
John Rouillard <rouilj@ieee.org>
parents:
7833
diff
changeset
|
301 print("==== disabled html -> text conversion") |
|
5305
e20f472fde7d
issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
302 html2text = dehtml("none").html2text |
|
e20f472fde7d
issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
303 if html2text: |
|
5376
64b05e24dbd8
Python 3 preparation: convert print to a function.
Joseph Myers <jsm@polyomino.org.uk>
parents:
5305
diff
changeset
|
304 print("FAIL: Error, dehtml(none) is returning a function") |
|
5305
e20f472fde7d
issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
305 else: |
|
5376
64b05e24dbd8
Python 3 preparation: convert print to a function.
Joseph Myers <jsm@polyomino.org.uk>
parents:
5305
diff
changeset
|
306 print("PASS: dehtml(none) is returning None") |
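The self-test above repeats the same three steps for each backend: fetch a converter, run it on the sample HTML, and assert that script and style content did not leak into the output. One way to factor that out is sketched below. The helper check_converter and its return shape are hypothetical and not part of roundup; the helper takes any callable, so the sketch does not import roundup or a parsing library.

```python
def check_converter(name, convert, html, forbidden=("HELP", "display:block")):
    """Run one converter and report whether forbidden text leaked through.

    Returns a (name, status, detail) tuple where status is one of
    "skipped", "pass" or "fail".
    """
    if convert is None:
        # Mirrors the listing's "if html2text:" guard.
        return (name, "skipped", "no converter available")
    text = convert(html)
    leaked = [marker for marker in forbidden if marker in text]
    return (name, "fail" if leaked else "pass", leaked)
```

A caller could loop over backend names and pass each converter to the helper, printing the tuple instead of relying on bare assert statements.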
