Mercurial Repository: p/roundup/code: roundup/dehtml.py history

http://hg.code.sf.net:8000/p/roundup/code/atom-log/tip/roundup/dehtml.py 2026-04-08T21:39:40-04:00

chore: remove __future__ print_function from code. http://hg.code.sf.net:8000/p/roundup/code/#changeset-9c3ec0a5c7fc88acb8a65632ecc13b2d52380314 John Rouillard rouilj@ieee.org 2026-04-08T21:39:40-04:00 2026-04-08T21:39:40-04:00

changeset	9c3ec0a5c7fc
branch
bookmark
tag
user	John Rouillard <rouilj@ieee.org>
description	chore: remove __future__ print_function from code. Not needed as of Python 3.
files

feat: support justhtml parsing library to convert email to plain text http://hg.code.sf.net:8000/p/roundup/code/#changeset-520075b29474aa5dd2586b2e9393ae95b47b0911 John Rouillard rouilj@ieee.org 2025-12-14T22:40:46-05:00 2025-12-14T22:40:46-05:00

changeset	520075b29474
branch
bookmark
tag
user	John Rouillard <rouilj@ieee.org>
description	feat: support justhtml parsing library to convert email to plain text. justhtml is a pure Python, fast, HTML5-compliant parser. It is now an option for converting HTML-only emails to plain text. Its output format differs slightly from dehtml or beautifulsoup, mostly by removing extra blank lines. dehtml.py: Uses the stream parser of justhtml, because I was unable to get the full document parser to strip script and style blocks. If I can fix this and use the standard parser, I can in theory generate markdown from the DOM tree generated by justhtml. Updated test case to include inline elements that should not cause a line break when they are encountered. Running dehtml as `python roundup/dehtml.py foo.html` will load foo.html and parse it using all available parsers. configuration.py: justhtml is available as an option. docs: updated CHANGES.txt and doc/tracker_config.txt; added beautifulsoup and justhtml to the optional software section of doc/installation.txt. test_mailgw.py, .github/workflows/ci-test: Updated tests and install justhtml as part of CI.
files
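
The conversion this changeset describes (drop script and style blocks, break lines on block elements but not on inline ones, collapse extra blank lines) can be sketched with the standard library alone. This is a minimal illustration and not the code in dehtml.py: it does not use justhtml, and the names `SKIP_TAGS`, `BLOCK_TAGS`, `TextExtractor` and `html_to_text` are hypothetical.

```python
from html.parser import HTMLParser

# Hypothetical tag sets chosen for this sketch; not taken from dehtml.py.
SKIP_TAGS = {"script", "style"}
BLOCK_TAGS = {"p", "div", "br", "li", "tr", "h1", "h2", "h3"}


class TextExtractor(HTMLParser):
    """Collect visible text and ignore script/style content."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0  # greater than 0 while inside <script>/<style>

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif tag in BLOCK_TAGS:
            self.parts.append("\n")  # block elements start a new line

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth:
            self.skip_depth -= 1
        elif tag in BLOCK_TAGS:
            self.parts.append("\n")

    def handle_data(self, data):
        # Inline elements such as <b> add no line break: their text is
        # simply appended to the current line.
        if not self.skip_depth:
            self.parts.append(data)


def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    lines = (line.strip() for line in "".join(parser.parts).splitlines())
    # Dropping empty lines mirrors the "removing extra blank lines" note.
    return "\n".join(line for line in lines if line)
```

For example, `html_to_text("<p>Hello <b>bold</b> world</p><script>var x = 1;</script><p>Bye</p>")` keeps the bold text on the same line as its paragraph and discards the script body.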

chore(lint): use ternary, ignore unused param http://hg.code.sf.net:8000/p/roundup/code/#changeset-b68a1d8fd5d95cae8f5a624c750641813445c5d9 John Rouillard rouilj@ieee.org 2024-03-24T15:25:53-04:00 2024-03-24T15:25:53-04:00

changeset	b68a1d8fd5d9
branch
bookmark
tag
user	John Rouillard <rouilj@ieee.org>
description	chore(lint): use ternary, ignore unused param
files

chore(lint): doublequote strings, no yoda conditionals, sort imports... http://hg.code.sf.net:8000/p/roundup/code/#changeset-6079440ac02318b44271abcf3579bf466c4c69b0 John Rouillard rouilj@ieee.org 2024-03-01T16:12:21-05:00 2024-03-01T16:12:21-05:00

changeset	6079440ac023
branch
bookmark
tag
user	John Rouillard <rouilj@ieee.org>
description	chore(lint): doublequote strings, no yoda conditionals, sort imports...
files

flake8 fixes: whitespace, remove unused imports http://hg.code.sf.net:8000/p/roundup/code/#changeset-07ce4e4110f587bf6a878f9c15a9cb22537969ed John Rouillard rouilj@ieee.org 2023-03-18T14:16:31-04:00 2023-03-18T14:16:31-04:00

changeset	07ce4e4110f5
branch
bookmark
tag
user	John Rouillard <rouilj@ieee.org>
description	flake8 fixes: whitespace, remove unused imports
files

Explicitly set parser when calling beautiful soup. http://hg.code.sf.net:8000/p/roundup/code/#changeset-ef0975b4291b02bd80268856126936dac55b8337 John Rouillard rouilj@ieee.org 2022-05-09T23:15:34-04:00 2022-05-09T23:15:34-04:00

changeset	ef0975b4291b
branch
bookmark
tag
user	John Rouillard <rouilj@ieee.org>
description	Explicitly set parser when calling beautiful soup. Quiets a warning in to-be-committed tests.
files
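
As a rough illustration of naming the parser explicitly: Beautiful Soup warns when it has to guess a parser, and passing one as the second argument avoids that. The commit message does not say which parser was chosen, so `"html.parser"` and the helper name `soup_to_text` are assumptions made for this sketch.

```python
try:
    from bs4 import BeautifulSoup  # optional third-party dependency
except ImportError:
    BeautifulSoup = None


def soup_to_text(html):
    """Return the text of `html`, or None when bs4 is not installed."""
    if BeautifulSoup is None:
        return None
    # Naming the parser avoids the "no parser was explicitly specified"
    # warning and keeps the result independent of which optional
    # parsers happen to be installed.
    soup = BeautifulSoup(html, "html.parser")
    return soup.get_text()
```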

don't get confused by python-future making Python 3 package names available under Python 2 (but only with Python 2 functionality) http://hg.code.sf.net:8000/p/roundup/code/#changeset-af81e7a4302fce69b6f50ea7e8ca7bdcc6e2dd26 Christof Meerwald cmeerw@cmeerw.org 2020-02-28T08:48:51+00:00 2020-02-28T08:48:51+00:00

changeset	af81e7a4302f
branch
bookmark
tag
user	Christof Meerwald <cmeerw@cmeerw.org>
description	don't get confused by python-future making Python 3 package names available under Python 2 (but only with Python 2 functionality)
files
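
One way to avoid this kind of confusion is to branch on the interpreter version instead of probing for module names, since a successful import of a Python 3 name under Python 2 would send a try/except ImportError check down the wrong path. The sketch below shows that pattern under this assumption; it is not necessarily the exact change made in this changeset.

```python
import sys

# Decide by interpreter version, not by whether the import succeeds.
if sys.version_info[0] > 2:
    from html.parser import HTMLParser
    from html.entities import name2codepoint
else:
    # Python 2 module names.
    from HTMLParser import HTMLParser
    from htmlentitydefs import name2codepoint
```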

flake8 cleanups dehtml.py http://hg.code.sf.net:8000/p/roundup/code/#changeset-1700542408f3df5b595bdf8638a2e393489e9e9e John Rouillard rouilj@ieee.org 2019-12-25T20:18:39-05:00 2019-12-25T20:18:39-05:00

changeset	1700542408f3
branch
bookmark
tag
user	John Rouillard <rouilj@ieee.org>
description	flake8 cleanups dehtml.py. Note: you need to disable the long-line check, as there is a test example that requires really long lines of htmlized output.
files

Fix CI deprecation warning for HTMLParser convert_charrefs under py3. http://hg.code.sf.net:8000/p/roundup/code/#changeset-b74f0b50bef178778ab8e6315cf8c7cea810a71d John Rouillard rouilj@ieee.org 2019-07-06T17:36:25-04:00 2019-07-06T17:36:25-04:00

changeset	b74f0b50bef1
branch
bookmark
tag
user	John Rouillard <rouilj@ieee.org>
description	Fix CI deprecation warning for HTMLParser convert_charrefs under py3. /home/travis/build/roundup-tracker/roundup/roundup/dehtml.py:81: DeprecationWarning: The value of convert_charrefs will become True in 3.5. You are encouraged to set the value explicitly. parser = DumbHTMLParser()
files
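
The warning asks callers to pass `convert_charrefs` explicitly. A minimal sketch of doing so with a Python 3 `HTMLParser` subclass follows; the class name `CollectText` is hypothetical, and the commit message does not say which value dehtml.py passes, so `True` here is only an example.

```python
from html.parser import HTMLParser


class CollectText(HTMLParser):
    def __init__(self):
        # Stating the value explicitly removes any dependence on the
        # interpreter's default. With True, character references are
        # decoded before handle_data() is called.
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


parser = CollectText()
parser.feed("fish &amp; chips &#169;")
parser.close()
text = "".join(parser.chunks)
```

With `convert_charrefs=False`, the references would instead be reported through `handle_entityref()` and `handle_charref()`, which the subclass would need to implement.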

Python 3 preparation: unichr. http://hg.code.sf.net:8000/p/roundup/code/#changeset-c749d6795bc2a47bde01f8b3b7eb506d7d5c94ed Joseph Myers jsm@polyomino.org.uk 2018-07-25T09:07:03+00:00 2018-07-25T09:07:03+00:00

changeset	c749d6795bc2
branch
bookmark
tag
user	Joseph Myers <jsm@polyomino.org.uk>
description	Python 3 preparation: unichr.
files

Python 3 preparation: unicode. http://hg.code.sf.net:8000/p/roundup/code/#changeset-56c9bcdea47f22412e4f0768775d1abea52d19c2 Joseph Myers jsm@polyomino.org.uk 2018-07-25T09:05:58+00:00 2018-07-25T09:05:58+00:00

changeset	56c9bcdea47f
branch
bookmark
tag
user	Joseph Myers <jsm@polyomino.org.uk>
description	Python 3 preparation: unicode. This patch introduces roundup/anypy/strings.py, which has a comment explaining the string representations generally used and common functions to handle the required conversions. Places in the code that explicitly reference the "unicode" type / built-in function are generally changed to use the new functions (or, in a few places where those new functions don't seem to fit well, other approaches such as references to type(u'') or use of the codecs module). This patch does not generally attempt to address text conversions in any places not currently referencing the "unicode" type (although scripts/import_sf.py is made to use binary I/O in places as fixing the "unicode" reference didn't seem coherent otherwise).
files
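
The kind of conversion helper described here might look like the sketch below. The name `to_native_str` is hypothetical and is not claimed to match anything in roundup/anypy/strings.py; the `type(u'')` idiom is the one the commit message mentions.

```python
import sys

# `unicode` on Python 2, `str` on Python 3, without naming `unicode`.
text_type = type(u"")


def to_native_str(value, encoding="utf-8"):
    """Hypothetical helper: return `value` as the native `str` type."""
    if sys.version_info[0] > 2:
        # Python 3: native str is text, so decode byte strings.
        if isinstance(value, bytes):
            return value.decode(encoding)
        return value
    # Python 2: native str is bytes, so encode unicode objects.
    if isinstance(value, text_type):
        return value.encode(encoding)
    return value
```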

Python 3 preparation: update HTMLParser / htmlentitydefs imports. http://hg.code.sf.net:8000/p/roundup/code/#changeset-9c6d98bf79dbb7a22a4a95c5bfb9d0acc7416b1e Joseph Myers jsm@polyomino.org.uk 2018-07-25T00:35:49+00:00 2018-07-25T00:35:49+00:00

changeset	9c6d98bf79db
branch
bookmark
tag
user	Joseph Myers <jsm@polyomino.org.uk>
description	Python 3 preparation: update HTMLParser / htmlentitydefs imports. Manual patch.
files

Python 3 preparation: convert print to a function. http://hg.code.sf.net:8000/p/roundup/code/#changeset-64b05e24dbd889f52bf8f773d3456bd0135baa27 Joseph Myers jsm@polyomino.org.uk 2018-07-24T09:54:52+00:00 2018-07-24T09:54:52+00:00

changeset	64b05e24dbd8
branch
bookmark
tag
user	Joseph Myers <jsm@polyomino.org.uk>
description	Python 3 preparation: convert print to a function. Tool-assisted patch. It is possible that some "from __future__ import print_function" are not in fact needed, if a file only uses print() with a single string as an argument and so would work fine in Python 2 without that import.
files
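
The caveat about unneeded imports can be seen in a small example. Under Python 2 without `from __future__ import print_function`, `print("text")` is a print statement applied to a parenthesized string, so the output is the same, while `print("a", "b")` would print a tuple. The sketch below runs on Python 3 and captures the output for comparison.

```python
import contextlib
import io

buf = io.StringIO()
with contextlib.redirect_stdout(buf):
    # Same output on Python 2, even without the __future__ import.
    print("single string")
    # Python 2 without the import would print ('two', 'arguments').
    print("two", "arguments")

output = buf.getvalue()
```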

issue2550799: provide basic support for handling html only emails http://hg.code.sf.net:8000/p/roundup/code/#changeset-e20f472fde7d9ed7e1e5120de2941181efdaeba7 John Rouillard rouilj@ieee.org 2017-10-13T21:46:59-04:00 2017-10-13T21:46:59-04:00

changeset	e20f472fde7d
branch
bookmark
tag
user	John Rouillard <rouilj@ieee.org>
description	issue2550799: provide basic support for handling html only emails. Initial implementation and testing with the dehtml html converter done. The use of beautifulsoup 4 is not tested. My test system breaks when running dehtml.py using beautiful soup. I don't get the failures when running under the test harness, but the text output is significantly different (different line breaks, number of newlines, etc.). The tests for dehtml need to be generated for beautiful soup and the expected output changed. Since I have a wonky install of beautiful soup, I don't trust my output as the standard to test against. Also, since beautiful soup is optional, the test harness needs to skip the beautifulsoup tests if import bs4 fails. Again, something outside of my expertise. I deleted the work I had done to implement that. I could not get it working and wanted to get this feature in, in some form.
files
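
Skipping tests when an optional dependency is missing, as this message calls for, can be done with the standard `unittest` skip decorators. The sketch below is one possible shape and not the project's test code; the class name, test name and `run_suite` helper are hypothetical.

```python
import unittest

try:
    import bs4  # optional dependency
    HAVE_BS4 = True
except ImportError:
    HAVE_BS4 = False


class BeautifulSoupConversionTest(unittest.TestCase):  # hypothetical name
    @unittest.skipUnless(HAVE_BS4, "beautifulsoup4 (bs4) is not installed")
    def test_paragraph_text(self):
        soup = bs4.BeautifulSoup("<p>hello</p>", "html.parser")
        self.assertEqual(soup.get_text(), "hello")


def run_suite():
    """Run the test case and return the unittest result object."""
    loader = unittest.defaultTestLoader
    suite = loader.loadTestsFromTestCase(BeautifulSoupConversionTest)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

When bs4 is absent the test is reported as skipped, not failed, so the run still counts as successful.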