http://hg.code.sf.net:8000/p/roundup/code/atom-log/tip/roundup/dehtml.py Mercurial Repository: p/roundup/code: roundup/dehtml.py history 2026-04-08T21:39:40-04:00 chore: remove __future__ print_function from code. http://hg.code.sf.net:8000/p/roundup/code/#changeset-9c3ec0a5c7fc88acb8a65632ecc13b2d52380314 John Rouillard rouilj@ieee.org 2026-04-08T21:39:40-04:00 2026-04-08T21:39:40-04:00
changeset 9c3ec0a5c7fc
branch
bookmark
tag
user John Rouillard <rouilj@ieee.org>
description chore: remove __future__ print_function from code.

Not needed as of Python 3.
files
feat: support justhtml parsing library to convert email to plain text http://hg.code.sf.net:8000/p/roundup/code/#changeset-520075b29474aa5dd2586b2e9393ae95b47b0911 John Rouillard rouilj@ieee.org 2025-12-14T22:40:46-05:00 2025-12-14T22:40:46-05:00
changeset 520075b29474
branch
bookmark
tag
user John Rouillard <rouilj@ieee.org>
description feat: support justhtml parsing library to convert email to plain text

justhtml is a fast, pure-Python, HTML5-compliant parser. It is now an
option for converting HTML-only emails to plain text. Its output
format differs slightly from that of dehtml or beautifulsoup, mostly
by removing extra blank lines.

dehtml.py:
Uses the stream parser of justhtml, because the full document
parser could not be made to strip script and style blocks.

If I can fix this and use the standard parser, I can in theory
generate markdown from the DOM tree generated by justhtml.

Updated test case to include inline elements that should not cause a
line break when they are encountered. Running dehtml as: `python
roundup/dehtml.py foo.html` will load foo.html and parse it using
all available parsers.
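
As a rough illustration of the behaviour described above (script and style blocks dropped, inline elements not causing a line break), here is a minimal sketch built on the standard library's `html.parser`. It is not the code in dehtml.py and does not use justhtml; the names `TextExtractor`, `html_to_text`, `SKIP_TAGS` and `BLOCK_TAGS` are hypothetical and exist only for this example.

```python
from html.parser import HTMLParser

# Hypothetical tag sets chosen for this sketch, not taken from dehtml.py.
SKIP_TAGS = {"script", "style"}
BLOCK_TAGS = {"p", "div", "br", "li", "tr", "h1", "h2", "h3"}


class TextExtractor(HTMLParser):
    """Collect visible text, dropping script/style content."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0  # > 0 while inside a script or style block

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif tag in BLOCK_TAGS:
            self.parts.append("\n")  # only block-level tags break the line

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth:
            self.skip_depth -= 1
        elif tag in BLOCK_TAGS:
            self.parts.append("\n")

    def handle_data(self, data):
        if not self.skip_depth:
            self.parts.append(data)

    def text(self):
        # Normalise whitespace inside each line and drop empty lines.
        lines = (" ".join(line.split())
                 for line in "".join(self.parts).splitlines())
        return "\n".join(line for line in lines if line)


def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

Inline tags such as `<b>` are not in `BLOCK_TAGS`, so the text around them stays on one line, while the contents of `<script>` and `<style>` never reach the output.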

configuration.py: justhtml is available as an option.

docs: updated CHANGES.txt and doc/tracker_config.txt; added
beautifulsoup and justhtml to the optional software section of
doc/installation.txt.

test_mailgw.py, .github/workflows/ci-test: updated tests and install
justhtml as part of CI.
files
chore(lint): use ternary, ignore unused param http://hg.code.sf.net:8000/p/roundup/code/#changeset-b68a1d8fd5d95cae8f5a624c750641813445c5d9 John Rouillard rouilj@ieee.org 2024-03-24T15:25:53-04:00 2024-03-24T15:25:53-04:00
changeset b68a1d8fd5d9
branch
bookmark
tag
user John Rouillard <rouilj@ieee.org>
description chore(lint): use ternary, ignore unused param
files
chore(lint): doublequote strings, no yoda conditionals, sort imports... http://hg.code.sf.net:8000/p/roundup/code/#changeset-6079440ac02318b44271abcf3579bf466c4c69b0 John Rouillard rouilj@ieee.org 2024-03-01T16:12:21-05:00 2024-03-01T16:12:21-05:00
changeset 6079440ac023
branch
bookmark
tag
user John Rouillard <rouilj@ieee.org>
description chore(lint): doublequote strings, no yoda conditionals, sort imports...
files
flake8 fixes: whitespace, remove unused imports http://hg.code.sf.net:8000/p/roundup/code/#changeset-07ce4e4110f587bf6a878f9c15a9cb22537969ed John Rouillard rouilj@ieee.org 2023-03-18T14:16:31-04:00 2023-03-18T14:16:31-04:00
changeset 07ce4e4110f5
branch
bookmark
tag
user John Rouillard <rouilj@ieee.org>
description flake8 fixes: whitespace, remove unused imports
files
Explicitly set parser when calling beautiful soup. http://hg.code.sf.net:8000/p/roundup/code/#changeset-ef0975b4291b02bd80268856126936dac55b8337 John Rouillard rouilj@ieee.org 2022-05-09T23:15:34-04:00 2022-05-09T23:15:34-04:00
changeset ef0975b4291b
branch
bookmark
tag
user John Rouillard <rouilj@ieee.org>
description Explicitly set parser when calling beautiful soup.

Quiets a warning in the to-be-committed tests.
files
don't get confused by python-future making Python 3 package names available under Python 2 (but only with Python 2 functionality) http://hg.code.sf.net:8000/p/roundup/code/#changeset-af81e7a4302fce69b6f50ea7e8ca7bdcc6e2dd26 Christof Meerwald cmeerw@cmeerw.org 2020-02-28T08:48:51+00:00 2020-02-28T08:48:51+00:00
changeset af81e7a4302f
branch
bookmark
tag
user Christof Meerwald <cmeerw@cmeerw.org>
description don't get confused by python-future making Python 3 package names available under Python 2 (but only with Python 2 functionality)
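
One way to avoid the confusion described here is to choose imports by interpreter version instead of by whether a Python 3 module name happens to be importable. The following is a sketch of that idea only, assuming the affected imports are the HTML parser and entity tables; it is not necessarily what this changeset does.

```python
import sys

# With python-future installed, "import html.parser" can succeed on
# Python 2 while still providing Python 2 behaviour, so a try/except
# ImportError probe is not a reliable test. Check the version instead.
if sys.version_info[0] > 2:
    from html.parser import HTMLParser
    from html.entities import name2codepoint
else:
    from HTMLParser import HTMLParser
    from htmlentitydefs import name2codepoint
```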
files
flake8 cleanups dehtml.py http://hg.code.sf.net:8000/p/roundup/code/#changeset-1700542408f3df5b595bdf8638a2e393489e9e9e John Rouillard rouilj@ieee.org 2019-12-25T20:18:39-05:00 2019-12-25T20:18:39-05:00
changeset 1700542408f3
branch
bookmark
tag
user John Rouillard <rouilj@ieee.org>
description flake8 cleanups dehtml.py

Note you need to disable long lines as there is a test example that
requires really long lines of htmlized output.
files
Fix CI deprecation warning for HTMLParser convert_charrefs under py3. http://hg.code.sf.net:8000/p/roundup/code/#changeset-b74f0b50bef178778ab8e6315cf8c7cea810a71d John Rouillard rouilj@ieee.org 2019-07-06T17:36:25-04:00 2019-07-06T17:36:25-04:00
changeset b74f0b50bef1
branch
bookmark
tag
user John Rouillard <rouilj@ieee.org>
description Fix CI deprecation warning for HTMLParser convert_charrefs under py3.

/home/travis/build/roundup-tracker/roundup/roundup/dehtml.py:81:
DeprecationWarning: The value of convert_charrefs will become True in
3.5. You are encouraged to set the value explicitly.
parser = DumbHTMLParser()
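
The warning goes away once `convert_charrefs` is passed explicitly to the `HTMLParser` constructor. A minimal Python 3 sketch follows; the class name is borrowed from the warning text, but the body and the choice of `True` are assumptions made for illustration, not the actual dehtml.py code.

```python
from html.parser import HTMLParser


class DumbHTMLParser(HTMLParser):
    """Illustrative stand-in; only the class name comes from the warning."""

    def __init__(self):
        # Passing convert_charrefs explicitly silences the warning.
        # True means character references such as &amp; arrive already
        # decoded in handle_data().
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


parser = DumbHTMLParser()
parser.feed("<p>Fish &amp; chips &#169; 2019</p>")
parser.close()
text = "".join(parser.chunks)
```

The Python 2 `HTMLParser` constructor does not accept this keyword, so code that must run on both versions would need to guard the argument.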
files
Python 3 preparation: unichr. http://hg.code.sf.net:8000/p/roundup/code/#changeset-c749d6795bc2a47bde01f8b3b7eb506d7d5c94ed Joseph Myers jsm@polyomino.org.uk 2018-07-25T09:07:03+00:00 2018-07-25T09:07:03+00:00
changeset c749d6795bc2
branch
bookmark
tag
user Joseph Myers <jsm@polyomino.org.uk>
description Python 3 preparation: unichr.
files
Python 3 preparation: unicode. http://hg.code.sf.net:8000/p/roundup/code/#changeset-56c9bcdea47f22412e4f0768775d1abea52d19c2 Joseph Myers jsm@polyomino.org.uk 2018-07-25T09:05:58+00:00 2018-07-25T09:05:58+00:00
changeset 56c9bcdea47f
branch
bookmark
tag
user Joseph Myers <jsm@polyomino.org.uk>
description Python 3 preparation: unicode.

This patch introduces roundup/anypy/strings.py, which has a comment
explaining the string representations generally used and common
functions to handle the required conversions. Places in the code that
explicitly reference the "unicode" type / built-in function are
generally changed to use the new functions (or, in a few places where
those new functions don't seem to fit well, other approaches such as
references to type(u'') or use of the codecs module). This patch does
not generally attempt to address text conversions in any places not
currently referencing the "unicode" type (although
scripts/import_sf.py is made to use binary I/O in places as fixing the
"unicode" reference didn't seem coherent otherwise).
files
Python 3 preparation: update HTMLParser / htmlentitydefs imports. http://hg.code.sf.net:8000/p/roundup/code/#changeset-9c6d98bf79dbb7a22a4a95c5bfb9d0acc7416b1e Joseph Myers jsm@polyomino.org.uk 2018-07-25T00:35:49+00:00 2018-07-25T00:35:49+00:00
changeset 9c6d98bf79db
branch
bookmark
tag
user Joseph Myers <jsm@polyomino.org.uk>
description Python 3 preparation: update HTMLParser / htmlentitydefs imports.

Manual patch.
files
Python 3 preparation: convert print to a function. http://hg.code.sf.net:8000/p/roundup/code/#changeset-64b05e24dbd889f52bf8f773d3456bd0135baa27 Joseph Myers jsm@polyomino.org.uk 2018-07-24T09:54:52+00:00 2018-07-24T09:54:52+00:00
changeset 64b05e24dbd8
branch
bookmark
tag
user Joseph Myers <jsm@polyomino.org.uk>
description Python 3 preparation: convert print to a function.

Tool-assisted patch. It is possible that some "from __future__ import
print_function" are not in fact needed, if a file only uses print()
with a single string as an argument and so would work fine in Python 2
without that import.
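
The distinction drawn above can be seen by comparing a single-argument call with a multi-argument one. This sketch runs on Python 3, where `print` is always a function; the comments describe what Python 2 would do without the `__future__` import. The helper `render` is hypothetical and exists only to capture the output.

```python
import io


def render(*args, **kwargs):
    """Capture what print() would write, for inspection."""
    buf = io.StringIO()
    print(*args, file=buf, **kwargs)
    return buf.getvalue()


# One string argument: Python 2's print statement treats ("hello") as a
# parenthesised expression, so the output is the same without the import.
one = render("hello")

# Two arguments: Python 2 without the import would print the tuple
# ('a', 'b'); the print function joins the arguments with a space.
two = render("a", "b")

# Keyword arguments such as sep and end exist only on the function form.
custom = render("a", "b", sep="-", end="")
```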
files
issue2550799: provide basic support for handling html only emails http://hg.code.sf.net:8000/p/roundup/code/#changeset-e20f472fde7d9ed7e1e5120de2941181efdaeba7 John Rouillard rouilj@ieee.org 2017-10-13T21:46:59-04:00 2017-10-13T21:46:59-04:00
changeset e20f472fde7d
branch
bookmark
tag
user John Rouillard <rouilj@ieee.org>
description issue2550799: provide basic support for handling html only emails

Initial implementation and testing with the dehtml html converter
done.

The use of beautifulsoup 4 is not tested. My test system breaks when
running dehtml.py using beautiful soup. I don't get the failures when
running under the test harness, but the text output is significantly
different (different line breaks, number of newlines etc.)

The tests for dehtml need to be generated for beautiful soup and the
expected output changed. Since I have a wonky install of beautiful
soup, I don't trust my output as the standard to test against. Also
since beautiful soup is optional, the test harness needs to skip the
beautifulsoup tests if import bs4 fails. Again something outside of my
expertise. I deleted the work I had done to implement that. I could
not get it working and wanted to get this feature in in some form.
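
For the skipping problem mentioned above, the standard `unittest` module can skip a test when `bs4` is not importable. This is a sketch of that technique under assumed names (`HAVE_BS4`, `DehtmlBeautifulSoupTest`, `run_suite`); it is not the test code used in the roundup test suite.

```python
import importlib.util
import unittest

# True only when the optional bs4 package can be found.
HAVE_BS4 = importlib.util.find_spec("bs4") is not None


class DehtmlBeautifulSoupTest(unittest.TestCase):  # hypothetical test case
    @unittest.skipUnless(HAVE_BS4, "beautifulsoup4 (bs4) is not installed")
    def test_bs4_conversion(self):
        from bs4 import BeautifulSoup

        # Naming the parser explicitly avoids bs4's "no parser was
        # explicitly specified" warning.
        soup = BeautifulSoup("<p>hi</p>", "html.parser")
        self.assertEqual(soup.get_text(), "hi")


def run_suite():
    """Run the test case and return the unittest result object."""
    loader = unittest.defaultTestLoader
    suite = loader.loadTestsFromTestCase(DehtmlBeautifulSoupTest)
    result = unittest.TestResult()
    suite.run(result)
    return result
```

When `bs4` is missing, the test is reported as skipped instead of failing, so the rest of the suite still passes on systems without the optional dependency.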
files