annotate test/html_norm.py @ 8531:4fe0d14cf915

chore(build): bump actions/upload-artifact from 6.0.0 to 7.0.0. #84
author John Rouillard <rouilj@ieee.org>
date Tue, 10 Mar 2026 22:52:54 -0400
parents 5cadcaa13bed
children
rev   line source
6995
dc83ebff4c90 change test to use html normalizer when comparing html output.
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
1 """Minimal html parser/normalizer for use in test_templating.
2
7077
5bc36b65d06b fix typos in docstring.
John Rouillard <rouilj@ieee.org>
parents: 6996
diff changeset
3 When testing markdown -> html conversion libraries, there are
6995
4 gratuitous whitespace changes in generated output that break the
5 tests. Use this to try to normalize the generated HTML into something
6 that tries to preserve the semantic meaning allowing tests to stop
7 breaking.
8
9 This is not a complete parsing engine. It supports the Roundup issue
10 tracker unit tests so that no third party libraries are needed to run
11 the tests. If you find it useful, enjoy.
12
13 Ideally this would be done by hijacking in some way
14 lxml.html.usedoctest to get a liberal parser that will ignore
15 whitespace. But that means the user has to install lxml to run the
7077
16 tests. Similarly BeautifulSoup could be used to pretty print the html
17 but again, BeautifulSoup would need to be installed to run the
6995
18 tests.
19
20 """
6996
3546f23ea493 Add support for python2.
John Rouillard <rouilj@ieee.org>
parents: 6995
diff changeset
21 try:
22     from html.parser import HTMLParser
23 except ImportError:
24     from HTMLParser import HTMLParser  # python2
6995
25
26 try:
27     from htmlentitydefs import name2codepoint
28 except ImportError:
29     pass  # python3: name2codepoint lives in html.entities and is not used below
30
31
32 class NormalizingHtmlParser(HTMLParser):
33     """Handle start/end tags and normalize whitespace in data.
34     Strip doctype, comments when passed in.
35
36     Implements normalize method that takes input html and returns a
37     normalized string leaving the instance ready for another call to
38     normalize for another string.
39
40
41     Note that using this rewrites all attributes parsed by HTMLParser
42     into attr="value" form even though HTMLParser accepts other
7560
5cadcaa13bed prevent <newline tag mangling
John Rouillard <rouilj@ieee.org>
parents: 7077
diff changeset
43     attribute specification forms.
6995
44     """
45
46     debug = False  # set to true to enable more verbose output
47
48     current_normalized_string = ""  # accumulate result string
49     preserve_data = False  # if inside pre preserve whitespace
50
51     def handle_starttag(self, tag, attrs):
52         """put tag on new line with attributes.
53         Note valid attributes according to HTMLParser:
54             attrs='single_quote'
55             attrs=noquote
56             attrs="double_quote"
57         """
58         if self.debug: print("Start tag:", tag)
59
60         self.current_normalized_string += "\n<%s" % tag
61
62         for attr in attrs:
63             if self.debug: print(" attr:", attr)
64             self.current_normalized_string += ' %s="%s"' % attr
65
7560
66         self.current_normalized_string += ">\n"
6995
67
68         if tag == 'pre':
69             self.preserve_data = True
70
71     def handle_endtag(self, tag):
72         if self.debug: print("End tag :", tag)
73
74         self.current_normalized_string += "\n</%s>" % tag
75
76         if tag == 'pre':
77             self.preserve_data = False
78
79     def handle_data(self, data):
80         if self.debug: print("Data :", data)
81         if not self.preserve_data:
82             # collapse runs of whitespace and strip leading/trailing whitespace
83             data = " ".join(data.strip().split())
84
85         if data:
7560
86             self.current_normalized_string += "%s" % data
6995
87
88     def handle_comment(self, data):
89         if self.debug: print("Comment :", data)
90
91     def handle_decl(self, data):
92         if self.debug: print("Decl :", data)
93
94     def reset(self):
95         """wrapper around reset with clearing of self.current_normalized_string
96         and reset of self.preserve_data
97         """
98         HTMLParser.reset(self)
99         self.current_normalized_string = ""
100         self.preserve_data = False
101
102     def normalize(self, html):
103         self.feed(html)
104         result = self.current_normalized_string
105         self.reset()
106         return result
107
108
109 if __name__ == "__main__":
110     parser = NormalizingHtmlParser()
111
112     parser.feed('<div class="markup"><p> paragraph text with whitespace\n and more space <pre><span class="f" data-attr="f">text more text</span></pre></div>')
113     print("\n\ntest1", parser.current_normalized_string)
114
115     parser.reset()
116
117     parser.feed('''<div class="markup">
118 <p> paragraph text with whitespace\n and more space
119 <pre><span class="f" data-attr="f">text \n more text</span></pre>
120 </div>''')
121     print("\n\ntest2", parser.current_normalized_string)
122     parser.reset()
123     print("\n\nnormalize", parser.normalize('''<div class="markup">
124 <p> paragraph text with whitespace\n and more space
125 <pre><span class="f" data-attr="f">text \n more text &lt;</span></pre>
126 </div>'''))
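
The listing above is meant to be used by normalizing both the expected and the generated HTML before comparing them. A minimal sketch of that pattern follows, assuming Python 3. TinyNormalizer and same_html are hypothetical names introduced only for this illustration; the class is a trimmed stand-in written so the example runs on its own, not the code from html_norm.py.

```python
# Sketch only: compare two HTML strings after whitespace/attribute-quote
# normalization. TinyNormalizer is a hypothetical, trimmed stand-in for
# NormalizingHtmlParser so that this example is self-contained.
from html.parser import HTMLParser


class TinyNormalizer(HTMLParser):
    def reset(self):
        # HTMLParser.__init__ calls reset(), so state is set up here.
        HTMLParser.reset(self)
        self.out = ""
        self.in_pre = False

    def handle_starttag(self, tag, attrs):
        # Every tag goes on its own line; attributes become name="value".
        self.out += "\n<%s" % tag
        for name, value in attrs:
            self.out += ' %s="%s"' % (name, value)
        self.out += ">\n"
        if tag == "pre":
            self.in_pre = True

    def handle_endtag(self, tag):
        self.out += "\n</%s>" % tag
        if tag == "pre":
            self.in_pre = False

    def handle_data(self, data):
        if not self.in_pre:
            # Collapse whitespace runs outside <pre>.
            data = " ".join(data.split())
        if data:
            self.out += data

    def normalize(self, html):
        self.feed(html)
        result = self.out
        self.reset()  # leave the instance ready for the next string
        return result


def same_html(expected, actual):
    """Return True if both strings normalize to the same text."""
    parser = TinyNormalizer()
    return parser.normalize(expected) == parser.normalize(actual)


compact = '<div class="markup"><p>some  text</p></div>'
spread = '''<div class='markup'>
  <p>
     some text
  </p>
</div>'''
print(same_html(compact, spread))  # prints True
```

In a unit test the same idea would typically be an assertEqual on the two normalized strings, so that a failure shows a readable line-by-line diff. Whitespace inside pre is kept as-is, so differences there still make the comparison fail.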

Roundup Issue Tracker: http://roundup-tracker.org/