annotate roundup/dehtml.py @ 8491:520075b29474
feat: support justhtml parsing library to convert email to plain text
justhtml is a fast, pure-Python, HTML5-compliant parser. It is now an
option for converting HTML-only emails to plain text. Its output
format differs slightly from dehtml or beautifulsoup, mostly by
removing extra blank lines.
dehtml.py:
Using the stream parser of justhtml. Unable to get the full
document parser to successfully strip script and style blocks.
If I can fix this and use the standard parser, I can in theory
generate markdown from the DOM tree generated by justhtml.
Updated test case to include inline elements that should not cause a
line break when they are encountered. Running dehtml as: `python
roundup/dehtml.py foo.html` will load foo.html and parse it using
all available parsers.
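
The event-driven approach described for dehtml.py can be sketched with the standard library alone. The example below is an illustrative sketch, not the code in this changeset: it uses `html.parser.HTMLParser` rather than justhtml's stream parser, it lists only a subset of the inline elements, and the names `LineAccumulator`, `html_to_text`, `INLINE_ELEMENTS` and `SKIPPED_ELEMENTS` are hypothetical. It shows the two ideas from the description: text inside script and style blocks is skipped, and inline elements do not cause a line break.

```python
from html.parser import HTMLParser

# Assumed subset of inline elements; text inside these stays on the
# current line instead of starting a new one.
INLINE_ELEMENTS = {"a", "b", "code", "em", "i", "span", "strong"}
# Elements whose text content is dropped entirely.
SKIPPED_ELEMENTS = {"script", "style"}


class LineAccumulator(HTMLParser):
    """Hypothetical helper: build output lines from parser events."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.lines = []       # finished lines
        self.current = ""     # text gathered since the last line break
        self.skipping = None  # name of the script/style tag being skipped

    def flush(self):
        # Move the accumulated text into the list of finished lines.
        if self.current.strip():
            self.lines.append(self.current.strip())
        self.current = ""

    def handle_starttag(self, tag, attrs):
        if self.skipping:
            return
        if tag in SKIPPED_ELEMENTS:
            self.skipping = tag
        elif tag not in INLINE_ELEMENTS:
            # Any non-inline tag (including <br>) ends the current line.
            self.flush()

    def handle_endtag(self, tag):
        if tag == self.skipping:
            self.skipping = None

    def handle_data(self, data):
        # Whitespace-only text nodes are ignored, as in dehtml.py.
        if not self.skipping and not data.isspace():
            self.current += data


def html_to_text(html):
    parser = LineAccumulator()
    parser.feed(html)
    parser.close()
    parser.flush()  # keep whatever is left at the end of the document
    return "\n".join(parser.lines)
```

With this sketch, `<b>` inside a paragraph stays on the same line as the surrounding text, while `<br>` and `<p>` start a new line.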
configuration.py: justhtml is available as an option.
docs: updated CHANGES.txt and doc/tracker_config.txt; added beautifulsoup
and justhtml to the optional software section of doc/installation.txt.
test_mailgw.py, .github/workflows/ci-test: updated tests and install
justhtml as part of CI.
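
Because beautifulsoup and justhtml are optional, a script that exercises "all available parsers" first has to find out which ones can be imported. The following is a minimal sketch of one way to do that, assuming the converter names `dehtml`, `beautifulsoup` and `justhtml` and the importable module names `bs4` and `justhtml` seen in the source; the function `available_converters` and the mapping `OPTIONAL_MODULES` are hypothetical and are not part of dehtml.py.

```python
import importlib.util

# Converter name -> module that must be importable for it to work.
# The built-in "dehtml" fallback needs only the standard library.
OPTIONAL_MODULES = {"beautifulsoup": "bs4", "justhtml": "justhtml"}


def available_converters():
    """Return the converter names usable in this environment."""
    names = ["dehtml"]
    for name, module in OPTIONAL_MODULES.items():
        # find_spec returns None when a top-level module is not installed.
        if importlib.util.find_spec(module) is not None:
            names.append(name)
    return names
```

Checking with `find_spec` avoids importing a library just to test for its presence; dehtml.py itself takes the simpler route of attempting the import and falling back on `ImportError`.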
| author | John Rouillard <rouilj@ieee.org> |
|---|---|
| date | Sun, 14 Dec 2025 22:40:46 -0500 |
| parents | b68a1d8fd5d9 |
| children | 9c3ec0a5c7fc |
| rev | line source |
|---|---|
| 5305:e20f472fde7d "issue2550799: provide basic support for handling html only emails", John Rouillard <rouilj@ieee.org> | |
| 5376:64b05e24dbd8 "Python 3 preparation: convert print to a function.", Joseph Myers <jsm@polyomino.org.uk>, parent 5305 | `from __future__ import print_function` |
| 7756:6079440ac023 "chore(lint): doublequote strings, no yoda conditionals, sort imports...", John Rouillard <rouilj@ieee.org>, parent 7228 | |
| 7756 | `import sys` |
| 7756 | |
| 5417:c749d6795bc2 "Python 3 preparation: unichr.", Joseph Myers <jsm@polyomino.org.uk>, parent 5416 | `from roundup.anypy.strings import u2s, uchr` |
| 5997:1700542408f3 "flake8 cleanups dehtml.py", John Rouillard <rouilj@ieee.org>, parent 5838 | |
| 8491 | `# ruff PLC0415 ignore imports not at top of file` |
| 8491 | `# ruff RET505 ignore else after return` |
| 8491 | `# ruff: noqa: PLC0415 RET505` |
| 8491 | |
| 6110:af81e7a4302f "don't get confused by python-future making Python 3 package names available under Python 2 (but only with Python 2 functionality)", Christof Meerwald <cmeerw@cmeerw.org>, parent 5997 | `_pyver = sys.version_info[0]` |
| 5997 | |
| 7228:07ce4e4110f5 "flake8 fixes: whitespace, remove unused imports", John Rouillard <rouilj@ieee.org>, parent 6669 | |
| 5305 | `class dehtml:` |
| 5305 | `    def __init__(self, converter):` |
| 5305 | `        if converter == "none":` |
| 5305 | `            self.html2text = None` |
| 5305 | `            return` |
| 5305 | |
| 5305 | `        try:` |
| 5305 | `            if converter == "beautifulsoup":` |
| 5305 | `                # Not as well tested as dehtml.` |
| 5305 | `                from bs4 import BeautifulSoup` |
| 5997 | |
| 5305 | `                def html2text(html):` |
| 6669:ef0975b4291b "Explicitly set parser when calling beautiful soup.", John Rouillard <rouilj@ieee.org>, parent 6110 | `                    soup = BeautifulSoup(html, "html.parser")` |
| 5305 | |
| 5305 | `                    # kill all script and style elements` |
| 5305 | `                    for script in soup(["script", "style"]):` |
| 5305 | `                        script.extract()` |
| 5305 | |
| 7756 | `                    return u2s(soup.get_text("\n", strip=True))` |
| 5305 | |
| 5305 | `                self.html2text = html2text` |
| 8491 | `            elif converter == "justhtml":` |
| 8491 | `                from justhtml import stream` |
| 8491 | |
| 8491 | `                def html2text(html):` |
| 8491 | `                    # The below does not work.` |
| 8491 | `                    # Using stream parser since I couldn't seem to strip` |
| 8491 | `                    # 'script' and 'style' blocks. But stream doesn't` |
| 8491 | `                    # have error reporting or stripping of text nodes` |
| 8491 | `                    # and dropping empty nodes. Also I would like to try` |
| 8491 | `                    # its GFM markdown output too even though it keeps` |
| 8491 | `                    # tables as html and doesn't completely convert, as` |
| 8491 | `                    # this would work well for those supporting markdown.` |
| 8491 | `                    #` |
| 8491 | `                    # ctx used for testing since I have a truncated` |
| 8491 | `                    # test doc. It eliminates error from missing DOCTYPE` |
| 8491 | `                    # and head.` |
| 8491 | `                    #` |
| 8491 | `                    #from justhtml import JustHTML` |
| 8491 | `                    # from justhtml.context import FragmentContext` |
| 8491 | `                    #` |
| 8491 | `                    #ctx = FragmentContext('html')` |
| 8491 | `                    #justhtml = JustHTML(html,collect_errors=True,` |
| 8491 | `                    # fragment_context=ctx)` |
| 8491 | `                    # I still have the text output inside style/script tags` |
| 8491 | `                    # with :not(style, script). I do get text contents` |
| 8491 | `                    # with query("style, script").` |
| 8491 | `                    #` |
| 8491 | `                    #return u2s("\n".join(` |
| 8491 | `                    # [elem.to_text(separator="\n", strip=True)` |
| 8491 | `                    # for elem in justhtml.query(":not(style, script)")])` |
| 8491 | `                    # )` |
| 8491 | |
| 8491 | `                    # define inline elements so I can accumulate all unbroken` |
| 8491 | `                    # text in a single line with embedded inline elements.` |
| 8491 | `                    # 'br' is inline but should be treated as a line break` |
| 8491 | `                    # and the elements before/after should not be accumulated` |
| 8491 | `                    # together.` |
| 8491 | `                    inline_elements = (` |
| 8491 | `                        "a",` |
| 8491 | `                        "address",` |
| 8491 | `                        "b",` |
| 8491 | `                        "cite",` |
| 8491 | `                        "code",` |
| 8491 | `                        "em",` |
| 8491 | `                        "i",` |
| 8491 | `                        "img",` |
| 8491 | `                        "mark",` |
| 8491 | `                        "q",` |
| 8491 | `                        "s",` |
| 8491 | `                        "small",` |
| 8491 | `                        "span",` |
| 8491 | `                        "strong",` |
| 8491 | `                        "sub",` |
| 8491 | `                        "sup",` |
| 8491 | `                        "time")` |
| 8491 | |
| 8491 | `                    # each line is appended and joined at the end` |
| 8491 | `                    text = []` |
| 8491 | `                    # the accumulator for all text in inline elements` |
| 8491 | `                    text_accumulator = ""` |
| 8491 | `                    # if set skip all lines till matching end tag found` |
| 8491 | `                    # used to skip script/style blocks` |
| 8491 | `                    skip_till_endtag = None` |
| 8491 | `                    # used to force text_accumulator into text with added` |
| 8491 | `                    # newline so we have a blank line between paragraphs.` |
| 8491 | `                    _need_parabreak = False` |
| 8491 | |
| 8491 | `                    for event, data in stream(html):` |
| 8491 | `                        if event == "end" and skip_till_endtag == data:` |
| 8491 | `                            skip_till_endtag = None` |
| 8491 | `                            continue` |
| 8491 | `                        if skip_till_endtag:` |
| 8491 | `                            continue` |
| 8491 | `                        if (event == "start" and` |
| 8491 | `                            data[0] in ('script', 'style')):` |
| 8491 | `                            skip_till_endtag = data[0]` |
| 8491 | `                            continue` |
| 8491 | `                        if (event == "start" and` |
| 8491 | `                            text_accumulator and` |
| 8491 | `                            data[0] not in inline_elements):` |
| 8491 | `                            # add accumulator to "text"` |
| 8491 | `                            text.append(text_accumulator)` |
| 8491 | `                            text_accumulator = ""` |
| 8491 | `                            _need_parabreak = False` |
| 8491 | `                        elif event == "text":` |
| 8491 | `                            if not data.isspace():` |
| 8491 | `                                text_accumulator = text_accumulator + data` |
| 8491 | `                                _need_parabreak = True` |
| 8491 | `                        elif (_need_parabreak and` |
| 8491 | `                              event == "start" and` |
| 8491 | `                              data[0] == "p"):` |
| 8491 | `                            text.append(text_accumulator + "\n")` |
| 8491 | `                            text_accumulator = ""` |
| 8491 | `                            _need_parabreak = False` |
| 8491 | |
| 8491 | `                    # save anything left in the accumulator at end of document` |
| 8491 | `                    if text_accumulator:` |
| 8491 | `                        # add newline to match dehtml and beautifulsoup` |
| 8491 | `                        text.append(text_accumulator + "\n")` |
| 8491 | `                    return u2s("\n".join(text))` |
| 8491 | |
| 8491 | `                self.html2text = html2text` |
| 5305 | `            else:` |
| 5997 | `                raise ImportError` |
| 5305 | `        except ImportError:` |
| 5305 | `            # use the fallback below if beautiful soup is not installed.` |
| 5411:9c6d98bf79db "Python 3 preparation: update HTMLParser / htmlentitydefs imports.", Joseph Myers <jsm@polyomino.org.uk>, parent 5376 | `            try:` |
| 5411 | `                # Python 3+.` |
| 7756 | `                from html.entities import name2codepoint` |
| 5411 | `                from html.parser import HTMLParser` |
| 5411 | `            except ImportError:` |
| 5411 | `                # Python 2.` |
| 7756 | `                from htmlentitydefs import name2codepoint` |
| 5411 | `                from HTMLParser import HTMLParser` |
| 5305 | |
|
e20f472fde7d
issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
151 class DumbHTMLParser(HTMLParser): |
|
e20f472fde7d
issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
152 # class attribute |
|
5997
1700542408f3
flake8 cleanups dehtml.py
John Rouillard <rouilj@ieee.org>
parents:
5838
diff
changeset
|
153 text = "" |
|
5305
e20f472fde7d
issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
154 |
|
e20f472fde7d
issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
155 # internal state variable |
|
e20f472fde7d
issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
156 _skip_data = False |
|
e20f472fde7d
issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
157 _last_empty = False |
|
e20f472fde7d
issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
158 |
|
e20f472fde7d
issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
159 def handle_data(self, data): |
|
5997
1700542408f3
flake8 cleanups dehtml.py
John Rouillard <rouilj@ieee.org>
parents:
5838
diff
changeset
|
160 if self._skip_data: # skip data in script or style block |
|
5305
e20f472fde7d
issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
161 return |
|
e20f472fde7d
issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
162 |
|
5997
1700542408f3
flake8 cleanups dehtml.py
John Rouillard <rouilj@ieee.org>
parents:
5838
diff
changeset
|
163 if (data.strip() == ""): |
|
5305
e20f472fde7d
issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
164 # reduce multiple blank lines to 1 |
|
5997
1700542408f3
flake8 cleanups dehtml.py
John Rouillard <rouilj@ieee.org>
parents:
5838
diff
changeset
|
165 if (self._last_empty): |
|
5305
e20f472fde7d
issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
166 return |
|
e20f472fde7d
issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
167 else: |
|
e20f472fde7d
issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
168 self._last_empty = True |
|
e20f472fde7d
issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
169 else: |
|
e20f472fde7d
issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
170 self._last_empty = False |
|
e20f472fde7d
issue2550799: provide basic support for handling html only emails
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
171 |
|
5997
1700542408f3
flake8 cleanups dehtml.py
John Rouillard <rouilj@ieee.org>
parents:
5838
diff
changeset
|
172 self.text = self.text + data |
| 5305 | 173 |
| 7833 | 174 def handle_starttag(self, tag, attrs): # noqa: ARG002 |
| 5997 | 175 if (tag == "p"): |
| 5997 | 176 self.text = self.text + "\n" |
| 5997 | 177 if (tag in ("style", "script")): |
| 5305 | 178 self._skip_data = True |
| 5305 | 179 |
| 5305 | 180 def handle_endtag(self, tag): |
| 5997 | 181 if (tag in ("style", "script")): |
| 5305 | 182 self._skip_data = False |
| 5305 | 183 |
| 5305 | 184 def handle_entityref(self, name): |
| 5305 | 185 if self._skip_data: |
| 5305 | 186 return |
| 5417 | 187 c = uchr(name2codepoint[name]) |
| 5305 | 188 try: |
| 5997 | 189 self.text = self.text + c |
| 5305 | 190 except UnicodeEncodeError: |
| 5305 | 191 # print a space as a placeholder |
| 7756 | 192 self.text = self.text + " " |
| 5305 | 193 |
| 5305 | 194 def html2text(html): |
| 7833 | 195 parser = DumbHTMLParser( |
| 7833 | 196 convert_charrefs=True) if _pyver == 3 else DumbHTMLParser() |
| 5305 | 197 parser.feed(html) |
| 5305 | 198 parser.close() |
| 5305 | 199 return parser.text |
| 5305 | 200 |
| 5305 | 201 self.html2text = html2text |
| 5305 | 202 |
| 5997 | 203 |
| 7756 | 204 if __name__ == "__main__": |
| 8491 | 205 # ruff: noqa: B011 S101 |
| 8491 | 206 |
| 8491 | 207 try: |
| 8491 | 208 assert False |
| 8491 | 209 except AssertionError: |
| 8491 | 210 pass |
| 8491 | 211 else: |
| 8491 | 212 print("Error, assertions turned off. Test fails") |
| 8491 | 213 sys.exit(1) |
| 8491 | 214 |
| 7756 | 215 html = """ |
| 5305 | 216 <body> |
| 5305 | 217 <script> |
| 5305 | 218 this must not be in output |
| 5305 | 219 </script> |
| 5305 | 220 <style> |
| 5305 | 221 p {display:block} |
| 5305 | 222 </style> |
| 5305 | 223 <div class="header"><h1>Roundup</h1> |
| 5305 | 224 <div id="searchbox" style="display: none"> |
| 5305 | 225 <form class="search" action="../search.html" method="get"> |
| 5305 | 226 <input type="text" name="q" size="18" /> |
| 5305 | 227 <input type="submit" value="Search" /> |
| 5305 | 228 <input type="hidden" name="check_keywords" value="yes" /> |
| 5305 | 229 <input type="hidden" name="area" value="default" /> |
| 5305 | 230 </form> |
| 5305 | 231 </div> |
| 5305 | 232 <script type="text/javascript">$('#searchbox').show(0);</script> |
| 5305 | 233 </div> |
| 5305 | 234 <ul class="current"> |
| 5305 | 235 <li class="toctree-l1"><a class="reference internal" href="../index.html">Home</a></li> |
| 5305 | 236 <li class="toctree-l1"><a class="reference external" href="http://pypi.python.org/pypi/roundup">Download</a></li> |
| 5305 | 237 <li class="toctree-l1 current"><a class="reference internal" href="../docs.html">Docs</a><ul class="current"> |
| 5305 | 238 <li class="toctree-l2"><a class="reference internal" href="features.html">Roundup Features</a></li> |
| 5305 | 239 <li class="toctree-l2 current"><a class="current reference internal" href="">Installing Roundup</a></li> |
| 5305 | 240 <li class="toctree-l2"><a class="reference internal" href="upgrading.html">Upgrading to newer versions of Roundup</a></li> |
| 5305 | 241 <li class="toctree-l2"><a class="reference internal" href="FAQ.html">Roundup FAQ</a></li> |
| 5305 | 242 <li class="toctree-l2"><a class="reference internal" href="user_guide.html">User Guide</a></li> |
| 5305 | 243 <li class="toctree-l2"><a class="reference internal" href="customizing.html">Customising Roundup</a></li> |
| 5305 | 244 <li class="toctree-l2"><a class="reference internal" href="admin_guide.html">Administration Guide</a></li> |
| 5305 | 245 </ul> |
| 5305 | 246 <div class="section" id="prerequisites"> |
| 8491 | 247 <H2><a class="toc-backref" href="#id5">Prerequisites</a></H2> |
| 5305 | 248 <p>Roundup requires Python 2.5 or newer (but not Python 3) with a functioning |
| 5305 | 249 anydbm module. Download the latest version from <a class="reference external" href="http://www.python.org/">http://www.python.org/</a>. |
| 8491 | 250 It is highly recommended that users install the <span>latest patch version</span> |
| 5305 | 251 of python as these contain many fixes to serious bugs.</p> |
| 5305 | 252 <p>Some variants of Linux will need an additional “python dev” package |
| 5305 | 253 installed for Roundup installation to work. Debian and derivatives, are |
| 5305 | 254 known to require this.</p> |
| 5305 | 255 <p>If you’re on windows, you will either need to be using the ActiveState python |
| 5305 | 256 distribution (at <a class="reference external" href="http://www.activestate.com/Products/ActivePython/">http://www.activestate.com/Products/ActivePython/</a>), or you’ll |
| 5305 | 257 have to install the win32all package separately (get it from |
| 5305 | 258 <a class="reference external" href="http://starship.python.net/crew/mhammond/win32/">http://starship.python.net/crew/mhammond/win32/</a>).</p> |
| 5838 | 259 <script> |
| 5838 | 260 < HELP > |
| 5838 | 261 </script> |
| 5305 | 262 </div> |
| 5305 | 263 </body> |
| 7756 | 264 """ |
| 5305 | 265 |
| 8491 | 266 if len(sys.argv) > 1: |
| 8491 | 267 with open(sys.argv[1]) as h: |
| 8491 | 268 html = h.read() |
| 5305 | 269 |
| 8491 | 270 print("==== beautifulsoup") |
| 5305 | 271 try: |
| 5305 | 272 # trap error seen if N_TOKENS not defined when run. |
| 5305 | 273 html2text = dehtml("beautifulsoup").html2text |
| 5305 | 274 if html2text: |
| 8491 | 275 text = html2text(html) |
| 8491 | 276 assert ('HELP' not in text) |
| 8491 | 277 assert ('display:block' not in text) |
| 8491 | 278 print(text) |
| 5305 | 279 except NameError as e: |
| 5997 | 280 print("captured error %s" % e) |
| 5305 | 281 |
| 8491 | 282 print("==== justhtml") |
| 8491 | 283 try: |
| 8491 | 284 html2text = dehtml("justhtml").html2text |
| 8491 | 285 if html2text: |
| 8491 | 286 text = html2text(html) |
| 8491 | 287 assert ('HELP' not in text) |
| 8491 | 288 assert ('display:block' not in text) |
| 8491 | 289 print(text) |
| 8491 | 290 except NameError as e: |
| 8491 | 291 print("captured error %s" % e) |
| 8491 | 292 |
| 8491 | 293 print("==== dehtml") |
| 8491 | 294 html2text = dehtml("dehtml").html2text |
| 8491 | 295 if html2text: |
| 8491 | 296 text = html2text(html) |
| 8491 | 297 assert ('HELP' not in text) |
| 8491 | 298 assert ('display:block' not in text) |
| 8491 | 299 print(text) |
| 8491 | 300 |
| 8491 | 301 print("==== disabled html -> text conversion") |
| 5305 | 302 html2text = dehtml("none").html2text |
| 5305 | 303 if html2text: |
| 5376 | 304 print("FAIL: Error, dehtml(none) is returning a function") |
| 5305 | 305 else: |
| 5376 | 306 print("PASS: dehtml(none) is returning None") |
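
The annotated listing loses its indentation in this view, so the fallback parser's behavior can be hard to follow. The following is a minimal Python 3 sketch of the same idea, not the Roundup code itself: it drops text inside `script` and `style` blocks, starts a new line at each `p` tag, and keeps at most one whitespace-only chunk in a row. The names `SketchParser` and `sketch_html2text` are hypothetical and exist only for this illustration. It assumes `convert_charrefs=True`, in which case `html.parser` passes decoded entities to `handle_data` and never calls `handle_entityref`.

```python
from html.parser import HTMLParser


class SketchParser(HTMLParser):
    """Hypothetical stand-in for the fallback parser; not the Roundup class."""

    def __init__(self):
        # With convert_charrefs=True, entities such as &amp; arrive
        # already decoded in handle_data.
        super().__init__(convert_charrefs=True)
        self.text = ""
        self._skip_data = False   # True while inside <script> or <style>
        self._last_empty = False  # True if the last chunk kept was blank

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.text += "\n"
        if tag in ("style", "script"):
            self._skip_data = True

    def handle_endtag(self, tag):
        if tag in ("style", "script"):
            self._skip_data = False

    def handle_data(self, data):
        if self._skip_data:
            return
        if data.strip() == "":
            # Keep at most one whitespace-only chunk in a row.
            if self._last_empty:
                return
            self._last_empty = True
        else:
            self._last_empty = False
        self.text += data


def sketch_html2text(html):
    """Feed the whole document and return the accumulated plain text."""
    parser = SketchParser()
    parser.feed(html)
    parser.close()
    return parser.text
```

Calling `sketch_html2text` on a document that contains `<script>` or `<style>` blocks should return text without their contents, which is the property the self-test in the listing checks with its `'HELP' not in text` and `'display:block' not in text` assertions.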
