annotate test/html_norm.py @ 8005:817e8875556b 2.4.0b0
Removed tag 2.4.0b0
| author | John Rouillard <rouilj@ieee.org> |
|---|---|
| date | Sun, 26 May 2024 21:48:11 -0400 |
| parents | 5cadcaa13bed |
| children |
| rev | line source |
|---|---|
|
6995
dc83ebff4c90
change test to use html normalizer when comparing html output.
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
"""Minimal html parser/normalizer for use in test_templating.

When testing markdown -> html conversion libraries, there are
gratuitous whitespace changes in generated output that break the
tests. Use this to try to normalize the generated HTML into something
that tries to preserve the semantic meaning, allowing tests to stop
breaking.

This is not a complete parsing engine. It supports the Roundup issue
tracker unit tests so that no third party libraries are needed to run
the tests. If you find it useful, enjoy.

Ideally this would be done by hijacking in some way
lxml.html.usedoctest to get a liberal parser that will ignore
whitespace. But that means the user has to install lxml to run the
tests. Similarly, BeautifulSoup could be used to pretty print the html
but again, BeautifulSoup would need to be installed to run the
tests.

"""
try:
    from html.parser import HTMLParser
except ImportError:
    from HTMLParser import HTMLParser  # python2

try:
    from htmlentitydefs import name2codepoint
except ImportError:
    # python3: the module is html.entities; name2codepoint is not
    # used below, so nothing needs importing here
    pass


class NormalizingHtmlParser(HTMLParser):
    """Handle start/end tags and normalize whitespace in data.
    Strip doctype, comments when passed in.

    Implements normalize method that takes input html and returns a
    normalized string leaving the instance ready for another call to
    normalize for another string.

    Note that using this rewrites all attributes parsed by HTMLParser
    into attr="value" form even though HTMLParser accepts other
    attribute specification forms.
    """

    debug = False  # set to true to enable more verbose output

    current_normalized_string = ""  # accumulate result string
    preserve_data = False  # if inside pre preserve whitespace

    def handle_starttag(self, tag, attrs):
        """put tag on new line with attributes.

        Note valid attributes according to HTMLParser:
          attrs='single_quote'
          attrs=noquote
          attrs="double_quote"
        """
        if self.debug:
            print("Start tag:", tag)

        self.current_normalized_string += "\n<%s" % tag

        # each attr is a (name, value) tuple; a valueless attribute
        # has value None and is therefore rendered as name="None"
        for attr in attrs:
            if self.debug:
                print("     attr:", attr)
            self.current_normalized_string += ' %s="%s"' % attr

        self.current_normalized_string += ">\n"

        if tag == 'pre':
            self.preserve_data = True

    def handle_endtag(self, tag):
        if self.debug:
            print("End tag  :", tag)

        self.current_normalized_string += "\n</%s>" % tag

        if tag == 'pre':
            self.preserve_data = False

    def handle_data(self, data):
        if self.debug:
            print("Data     :", data)
        if not self.preserve_data:
            # normalize whitespace remove leading/trailing
            data = " ".join(data.strip().split())

        if data:
            self.current_normalized_string += "%s" % data

    def handle_comment(self, data):
        # comments are dropped from the normalized output
        if self.debug:
            print("Comment  :", data)

    def handle_decl(self, data):
        # doctype declarations are dropped from the normalized output
        if self.debug:
            print("Decl     :", data)

    def reset(self):
        """wrapper around reset with clearing of self.current_normalized_string
           and reset of self.preserve_data
        """
        HTMLParser.reset(self)
        self.current_normalized_string = ""
        self.preserve_data = False

    def normalize(self, html):
        self.feed(html)
        result = self.current_normalized_string
        self.reset()
        return result


if __name__ == "__main__":
    parser = NormalizingHtmlParser()

    parser.feed('<div class="markup"><p> paragraph text with whitespace\n and more space <pre><span class="f" data-attr="f">text more text</span></pre></div>')
    print("\n\ntest1", parser.current_normalized_string)

    parser.reset()

    parser.feed('''<div class="markup">
<p> paragraph text with whitespace\n and more space
<pre><span class="f" data-attr="f">text \n more text</span></pre>
</div>''')
    print("\n\ntest2", parser.current_normalized_string)
    parser.reset()
    print("\n\nnormalize", parser.normalize('''<div class="markup">
<p> paragraph text with whitespace\n and more space
<pre><span class="f" data-attr="f">text \n more text <</span></pre>
</div>'''))
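
A test would typically normalize both the expected and the generated HTML and compare the results. The block below is a minimal sketch of that idea, not the Roundup test code: `SketchNormalizer` is a hypothetical, condensed stand-in that follows the same technique as the class above (tags on their own lines, attributes rewritten to `attr="value"`, whitespace collapsed outside `pre`) so that the example runs on its own under Python 3. The sample strings are made up for illustration.

```python
from html.parser import HTMLParser


class SketchNormalizer(HTMLParser):
    """Condensed, hypothetical stand-in for the normalizer above."""

    def __init__(self):
        super().__init__()
        self.out = []        # normalized fragments, joined in normalize()
        self.in_pre = False  # True while inside <pre>

    def handle_starttag(self, tag, attrs):
        # rewrite every attribute into attr="value" form
        rendered = "".join(' %s="%s"' % attr for attr in attrs)
        self.out.append("\n<%s%s>\n" % (tag, rendered))
        if tag == "pre":
            self.in_pre = True

    def handle_endtag(self, tag):
        self.out.append("\n</%s>" % tag)
        if tag == "pre":
            self.in_pre = False

    def handle_data(self, data):
        if not self.in_pre:
            # collapse runs of whitespace, drop leading/trailing space
            data = " ".join(data.split())
        if data:
            self.out.append(data)

    def normalize(self, html):
        self.feed(html)
        self.close()  # flush any text still buffered by the parser
        result = "".join(self.out)
        # leave the instance ready for the next string
        self.reset()
        self.out = []
        self.in_pre = False
        return result


normalizer = SketchNormalizer()

# Same markup, different quoting and whitespace.
normalized_a = normalizer.normalize(
    '<div class="markup"><p>some   text</p></div>')
normalized_b = normalizer.normalize(
    "<div class='markup'>\n  <p>some text</p>\n</div>")

# Whitespace inside <pre> is significant, so these stay different.
pre_a = normalizer.normalize("<pre>a   b</pre>")
pre_b = normalizer.normalize("<pre>a b</pre>")
```

Here `normalized_a == normalized_b` holds while `pre_a` and `pre_b` differ, which is the behaviour a whitespace-tolerant HTML comparison needs.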
