annotate test/html_norm.py @ 6995:dc83ebff4c90

change test to use html normalizer when comparing html output. An update to the Markdown2 parser changed the text output while keeping the same html semantics. This broke the test_string_markdown_code_block_attribute test. I hand patched it to get the tests working, but it needed a better solution. Write a simple html normalizer using HTMLParser so I don't need a third party (lxml, beautifulsoup) library to clean up the test. Use the normalizer to parse the expected result and the result returned by the various markdown libraries. Hopefully this will make the test less fragile. This can have multiple uses in template testing where html is compared. I expect to have to change html_norm.py to make test writing easier in the future.
author John Rouillard <rouilj@ieee.org>
date Sun, 02 Oct 2022 23:18:43 -0400
parents
children 3546f23ea493
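
The comparison described in the changeset message (normalize both the expected html and the html returned by a markdown library, then compare) might look roughly like the sketch below. TinyNormalizer and normalize are hypothetical stand-ins written only for this illustration; they are not the code in test/html_norm.py, and the two html strings are made-up sample data.

```python
from html.parser import HTMLParser


class TinyNormalizer(HTMLParser):
    """Hypothetical, reduced stand-in for a whitespace normalizer."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.parts = []       # normalized tokens, one per output line
        self.in_pre = False   # True while inside <pre>

    def handle_starttag(self, tag, attrs):
        # re-emit every attribute in attr="value" form
        rendered = "".join(' %s="%s"' % attr for attr in attrs)
        self.parts.append("<%s%s>" % (tag, rendered))
        if tag == "pre":
            self.in_pre = True

    def handle_endtag(self, tag):
        self.parts.append("</%s>" % tag)
        if tag == "pre":
            self.in_pre = False

    def handle_data(self, data):
        if not self.in_pre:
            # collapse runs of whitespace, drop leading/trailing space
            data = " ".join(data.split())
        if data:
            self.parts.append(data)


def normalize(html):
    """Return html with insignificant whitespace differences removed."""
    parser = TinyNormalizer()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)


# made-up sample data: same markup, different whitespace and quoting
expected = '<p>Some <em>text</em></p>\n<pre><code class="x">a  b\n</code></pre>'
actual = "<p>Some\n  <em>text</em>\n</p>\n\n<pre><code class='x'>a  b\n</code></pre>\n"

print(expected == actual)
print(normalize(expected) == normalize(actual))
```

In a unit test the last comparison would typically become an assertEqual on the two normalized strings, so a library update that only moves whitespace around no longer fails the test, while whitespace inside pre is still compared exactly.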
rev   line source
6995
dc83ebff4c90 change test to use html normalizer when comparing html output.
John Rouillard <rouilj@ieee.org>
parents:
diff changeset
"""Minimal html parser/normalizer for use in test_templating.

When testing markdown -> html conversion libraries, there are
gratuitous whitespace changes in generated output that break the
tests. Use this to normalize the generated HTML into a form that
preserves the semantic meaning, so the tests stop breaking.

This is not a complete parsing engine. It supports the Roundup issue
tracker unit tests so that no third party libraries are needed to run
the tests. If you find it useful, enjoy.

Ideally this would be done by hijacking in some way
lxml.html.usedoctest to get a liberal parser that will ignore
whitespace. But that means the user has to install lxml to run the
tests. Similarly, BeautifulSoup could be used to pretty print the html,
but then BeautifulSoup would need to be installed to run the
tests.

"""
from html.parser import HTMLParser

try:
    from html.entities import name2codepoint  # python3
except ImportError:
    from htmlentitydefs import name2codepoint  # python2 fallback


class NormalizingHtmlParser(HTMLParser):
    """Handle start/end tags and normalize whitespace in data.
    Strip doctype, comments when passed in.

    Implements normalize method that takes input html and returns a
    normalized string leaving the instance ready for another call to
    normalize for another string.


    Note that using this rewrites all attributes parsed by HTMLParser
    into attr="value" form even though HTMLParser accepts other
    attribute specification forms.
    """

    debug = False  # set to true to enable more verbose output

    current_normalized_string = ""  # accumulate result string
    preserve_data = False  # if inside pre preserve whitespace

    def handle_starttag(self, tag, attrs):
        """put tag on new line with attributes.
        Note valid attributes according to HTMLParser:
          attrs='single_quote'
          attrs=noquote
          attrs="double_quote"
        """
        if self.debug:
            print("Start tag:", tag)

        self.current_normalized_string += "\n<%s" % tag

        for attr in attrs:
            if self.debug:
                print(" attr:", attr)
            self.current_normalized_string += ' %s="%s"' % attr

        self.current_normalized_string += ">"

        if tag == 'pre':
            self.preserve_data = True

    def handle_endtag(self, tag):
        if self.debug:
            print("End tag :", tag)

        self.current_normalized_string += "\n</%s>" % tag

        if tag == 'pre':
            self.preserve_data = False

    def handle_data(self, data):
        if self.debug:
            print("Data :", data)
        if not self.preserve_data:
            # normalize whitespace remove leading/trailing
            data = " ".join(data.strip().split())

        if data:
            self.current_normalized_string += "\n%s" % data

    def handle_comment(self, data):
        # comments are stripped from the normalized output
        if self.debug:
            print("Comment :", data)

    def handle_entityref(self, name):
        # only called when the parser runs with convert_charrefs=False
        c = chr(name2codepoint[name])
        if self.debug:
            print("Named ent:", c)

        self.current_normalized_string += "%s" % c

    def handle_charref(self, name):
        # only called when the parser runs with convert_charrefs=False
        if name.startswith('x'):
            c = chr(int(name[1:], 16))
        else:
            c = chr(int(name))
        if self.debug:
            print("Num ent :", c)

        self.current_normalized_string += "%s" % c

    def handle_decl(self, data):
        # doctype declarations are stripped from the normalized output
        if self.debug:
            print("Decl :", data)

    def reset(self):
        """wrapper around reset with clearing of self.current_normalized_string
        and reset of self.preserve_data
        """
        HTMLParser.reset(self)
        self.current_normalized_string = ""
        self.preserve_data = False

    def normalize(self, html):
        self.feed(html)
        result = self.current_normalized_string
        self.reset()
        return result


if __name__ == "__main__":
    parser = NormalizingHtmlParser()

    parser.feed('<div class="markup"><p> paragraph text with whitespace\n and more space <pre><span class="f" data-attr="f">text more text</span></pre></div>')
    print("\n\ntest1", parser.current_normalized_string)

    parser.reset()

    parser.feed('''<div class="markup">
<p> paragraph text with whitespace\n and more space
<pre><span class="f" data-attr="f">text \n more text</span></pre>
</div>''')
    print("\n\ntest2", parser.current_normalized_string)
    parser.reset()
    print("\n\nnormalize", parser.normalize('''<div class="markup">
<p> paragraph text with whitespace\n and more space
<pre><span class="f" data-attr="f">text \n more text &lt;</span></pre>
</div>'''))
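
A note on the entity handlers in the source above: on Python 3.5 and later, HTMLParser defaults to convert_charrefs=True, in which case character references such as &lt; are unescaped and delivered through handle_data, and handle_entityref / handle_charref are not called. The sketch below is only an illustration of that behaviour; Recorder and events_for are hypothetical names that are not part of html_norm.py.

```python
from html.parser import HTMLParser


class Recorder(HTMLParser):
    """Hypothetical helper that records which handler methods fire."""

    def __init__(self, **kwargs):
        HTMLParser.__init__(self, **kwargs)
        self.events = []

    def handle_data(self, data):
        self.events.append(("data", data))

    def handle_entityref(self, name):
        self.events.append(("entityref", name))

    def handle_charref(self, name):
        self.events.append(("charref", name))


def events_for(html, **kwargs):
    """Feed html to a fresh Recorder and return the recorded events."""
    parser = Recorder(**kwargs)
    parser.feed(html)
    parser.close()
    return parser.events


sample = "a &lt; b &#62; c"

# default settings: references arrive already unescaped via handle_data
converted = events_for(sample)

# opt out: named and numeric references reach their own handlers
raw = events_for(sample, convert_charrefs=False)

print(converted)
print(raw)
```

With the default settings the &lt; in the last demo string is therefore expected to reach the normalizer through handle_data; the two reference handlers only matter if the parser is constructed with convert_charrefs=False.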

Roundup Issue Tracker: http://roundup-tracker.org/