Mercurial > p > roundup > code
annotate test/html_norm.py @ 7864:b080cdb8b199
fix: document/fix wrapped HtmlProperty method.
The wrapped method was not documented in reference.txt.
It is now documented in reference.txt. The docstring documented that
it would not break up long words. Fixed by adding
break_long_words=False to prevent breaking string longer than the wrap
length. Wrapping was breaking the hyperlinking of long urls.
Added columns argument to set the wrap length (default 80 columns).
| author | John Rouillard <rouilj@ieee.org> |
|---|---|
| date | Sun, 07 Apr 2024 15:27:18 -0400 |
| parents | 5cadcaa13bed |
| children |
| rev | line source |
|---|---|
|
6995
dc83ebff4c90
change test to use html normalizer when comparing html output.
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
1 """Minimal html parser/normalizer for use in test_templating. |
|
dc83ebff4c90
change test to use html normalizer when comparing html output.
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
2 |
| 7077 | 3 When testing markdown -> html conversion libraries, there are |
|
6995
dc83ebff4c90
change test to use html normalizer when comparing html output.
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
4 gratuitous whitespace changes in generated output that break the |
|
dc83ebff4c90
change test to use html normalizer when comparing html output.
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
5 tests. Use this to try to normalize the generated HTML into something |
|
dc83ebff4c90
change test to use html normalizer when comparing html output.
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
6 that tries to preserve the semantic meaning allowing tests to stop |
|
dc83ebff4c90
change test to use html normalizer when comparing html output.
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
7 breaking. |
|
dc83ebff4c90
change test to use html normalizer when comparing html output.
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
8 |
|
dc83ebff4c90
change test to use html normalizer when comparing html output.
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
9 This is not a complete parsing engine. It supports the Roundup issue |
|
dc83ebff4c90
change test to use html normalizer when comparing html output.
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
10 tracker unit tests so that no third party libraries are needed to run |
|
dc83ebff4c90
change test to use html normalizer when comparing html output.
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
11 the tests. If you find it useful enjoy. |
|
dc83ebff4c90
change test to use html normalizer when comparing html output.
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
12 |
|
dc83ebff4c90
change test to use html normalizer when comparing html output.
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
13 Ideally this would be done by hijacking in some way |
|
dc83ebff4c90
change test to use html normalizer when comparing html output.
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
14 lxml.html.usedoctest to get a liberal parser that will ignore |
|
dc83ebff4c90
change test to use html normalizer when comparing html output.
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
15 whitespace. But that means the user has to install lxml to run the |
| 7077 | 16 tests. Similarly BeautifulSoup could be used to pretty print the html |
| 17 but again, BeautifulSoup would need to be installed to run the | |
|
6995
dc83ebff4c90
change test to use html normalizer when comparing html output.
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
18 tests. |
|
dc83ebff4c90
change test to use html normalizer when comparing html output.
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
19 |
|
dc83ebff4c90
change test to use html normalizer when comparing html output.
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
20 """ |
| 6996 | 21 try: |
| 22 from html.parser import HTMLParser | |
| 23 except ImportError: | |
| 24 from HTMLParser import HTMLParser # python2 | |
|
6995
dc83ebff4c90
change test to use html normalizer when comparing html output.
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
25 |
|
dc83ebff4c90
change test to use html normalizer when comparing html output.
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
26 try: |
|
dc83ebff4c90
change test to use html normalizer when comparing html output.
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
27 from htmlentitydefs import name2codepoint |
|
dc83ebff4c90
change test to use html normalizer when comparing html output.
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
28 except ImportError: |
|
dc83ebff4c90
change test to use html normalizer when comparing html output.
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
29 pass # assume running under python3, name2codepoint predefined |
|
dc83ebff4c90
change test to use html normalizer when comparing html output.
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
30 |
|
dc83ebff4c90
change test to use html normalizer when comparing html output.
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
31 |
|
dc83ebff4c90
change test to use html normalizer when comparing html output.
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
32 class NormalizingHtmlParser(HTMLParser): |
|
dc83ebff4c90
change test to use html normalizer when comparing html output.
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
33 """Handle start/end tags and normalize whitespace in data. |
|
dc83ebff4c90
change test to use html normalizer when comparing html output.
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
34 Strip doctype, comments when passed in. |
|
dc83ebff4c90
change test to use html normalizer when comparing html output.
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
35 |
|
dc83ebff4c90
change test to use html normalizer when comparing html output.
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
36 Implements normalize method that takes input html and returns a |
|
dc83ebff4c90
change test to use html normalizer when comparing html output.
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
37 normalized string leaving the instance ready for another call to |
|
dc83ebff4c90
change test to use html normalizer when comparing html output.
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
38 normalize for another string. |
|
dc83ebff4c90
change test to use html normalizer when comparing html output.
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
39 |
|
dc83ebff4c90
change test to use html normalizer when comparing html output.
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
40 |
|
dc83ebff4c90
change test to use html normalizer when comparing html output.
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
41 Note that using this rewrites all attributes parsed by HTMLParser |
|
dc83ebff4c90
change test to use html normalizer when comparing html output.
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
42 into attr="value" form even though HTMLParser accepts other |
|
7560
5cadcaa13bed
prevent <newline tag mangling
John Rouillard <rouilj@ieee.org>
parents:
7077
diff
changeset
|
43 attribute specification forms. |
|
6995
dc83ebff4c90
change test to use html normalizer when comparing html output.
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
44 """ |
|
dc83ebff4c90
change test to use html normalizer when comparing html output.
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
45 |
|
dc83ebff4c90
change test to use html normalizer when comparing html output.
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
46 debug = False # set to true to enable more verbose output |
|
dc83ebff4c90
change test to use html normalizer when comparing html output.
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
47 |
|
dc83ebff4c90
change test to use html normalizer when comparing html output.
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
48 current_normalized_string = "" # accumulate result string |
|
dc83ebff4c90
change test to use html normalizer when comparing html output.
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
49 preserve_data = False # if inside pre preserve whitespace |
|
dc83ebff4c90
change test to use html normalizer when comparing html output.
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
50 |
|
dc83ebff4c90
change test to use html normalizer when comparing html output.
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
51 def handle_starttag(self, tag, attrs): |
|
dc83ebff4c90
change test to use html normalizer when comparing html output.
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
52 """put tag on new line with attributes. |
|
dc83ebff4c90
change test to use html normalizer when comparing html output.
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
53 Note valid attributes according to HTMLParser: |
|
dc83ebff4c90
change test to use html normalizer when comparing html output.
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
54 attrs='single_quote' |
|
dc83ebff4c90
change test to use html normalizer when comparing html output.
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
55 attrs=noquote |
|
dc83ebff4c90
change test to use html normalizer when comparing html output.
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
56 attrs="double_quote" |
|
dc83ebff4c90
change test to use html normalizer when comparing html output.
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
57 """ |
|
dc83ebff4c90
change test to use html normalizer when comparing html output.
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
58 if self.debug: print("Start tag:", tag) |
|
dc83ebff4c90
change test to use html normalizer when comparing html output.
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
59 |
|
dc83ebff4c90
change test to use html normalizer when comparing html output.
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
60 self.current_normalized_string += "\n<%s" % tag |
|
dc83ebff4c90
change test to use html normalizer when comparing html output.
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
61 |
|
dc83ebff4c90
change test to use html normalizer when comparing html output.
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
62 for attr in attrs: |
|
dc83ebff4c90
change test to use html normalizer when comparing html output.
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
63 if self.debug: print(" attr:", attr) |
|
dc83ebff4c90
change test to use html normalizer when comparing html output.
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
64 self.current_normalized_string += ' %s="%s"' % attr |
|
dc83ebff4c90
change test to use html normalizer when comparing html output.
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
65 |
|
7560
5cadcaa13bed
prevent <newline tag mangling
John Rouillard <rouilj@ieee.org>
parents:
7077
diff
changeset
|
66 self.current_normalized_string += ">\n" |
|
6995
dc83ebff4c90
change test to use html normalizer when comparing html output.
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
67 |
|
dc83ebff4c90
change test to use html normalizer when comparing html output.
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
68 if tag == 'pre': |
|
dc83ebff4c90
change test to use html normalizer when comparing html output.
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
69 self.preserve_data = True |
|
dc83ebff4c90
change test to use html normalizer when comparing html output.
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
70 |
|
dc83ebff4c90
change test to use html normalizer when comparing html output.
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
71 def handle_endtag(self, tag): |
|
dc83ebff4c90
change test to use html normalizer when comparing html output.
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
72 if self.debug: print("End tag :", tag) |
|
dc83ebff4c90
change test to use html normalizer when comparing html output.
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
73 |
|
dc83ebff4c90
change test to use html normalizer when comparing html output.
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
74 self.current_normalized_string += "\n</%s>" % tag |
|
dc83ebff4c90
change test to use html normalizer when comparing html output.
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
75 |
|
dc83ebff4c90
change test to use html normalizer when comparing html output.
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
76 if tag == 'pre': |
|
dc83ebff4c90
change test to use html normalizer when comparing html output.
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
77 self.preserve_data = False |
|
dc83ebff4c90
change test to use html normalizer when comparing html output.
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
78 |
|
dc83ebff4c90
change test to use html normalizer when comparing html output.
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
79 def handle_data(self, data): |
|
dc83ebff4c90
change test to use html normalizer when comparing html output.
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
80 if self.debug: print("Data :", data) |
|
dc83ebff4c90
change test to use html normalizer when comparing html output.
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
81 if not self.preserve_data: |
|
dc83ebff4c90
change test to use html normalizer when comparing html output.
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
82 # normalize whitespace remove leading/trailing |
|
dc83ebff4c90
change test to use html normalizer when comparing html output.
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
83 data = " ".join(data.strip().split()) |
|
dc83ebff4c90
change test to use html normalizer when comparing html output.
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
84 |
|
dc83ebff4c90
change test to use html normalizer when comparing html output.
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
85 if data: |
|
7560
5cadcaa13bed
prevent <newline tag mangling
John Rouillard <rouilj@ieee.org>
parents:
7077
diff
changeset
|
86 self.current_normalized_string += "%s" % data |
|
6995
dc83ebff4c90
change test to use html normalizer when comparing html output.
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
87 |
|
dc83ebff4c90
change test to use html normalizer when comparing html output.
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
88 def handle_comment(self, data): |
|
dc83ebff4c90
change test to use html normalizer when comparing html output.
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
89 print("Comment :", data) |
|
dc83ebff4c90
change test to use html normalizer when comparing html output.
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
90 |
|
dc83ebff4c90
change test to use html normalizer when comparing html output.
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
91 def handle_decl(self, data): |
|
dc83ebff4c90
change test to use html normalizer when comparing html output.
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
92 print("Decl :", data) |
|
dc83ebff4c90
change test to use html normalizer when comparing html output.
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
93 |
|
dc83ebff4c90
change test to use html normalizer when comparing html output.
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
94 def reset(self): |
|
dc83ebff4c90
change test to use html normalizer when comparing html output.
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
95 """wrapper around reset with clearing of csef.current_normalized_string |
|
dc83ebff4c90
change test to use html normalizer when comparing html output.
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
96 and reset of self.preserve_data |
|
dc83ebff4c90
change test to use html normalizer when comparing html output.
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
97 """ |
|
dc83ebff4c90
change test to use html normalizer when comparing html output.
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
98 HTMLParser.reset(self) |
|
dc83ebff4c90
change test to use html normalizer when comparing html output.
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
99 self.current_normalized_string = "" |
|
dc83ebff4c90
change test to use html normalizer when comparing html output.
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
100 self.preserve_data = False |
|
dc83ebff4c90
change test to use html normalizer when comparing html output.
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
101 |
|
dc83ebff4c90
change test to use html normalizer when comparing html output.
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
102 def normalize(self, html): |
|
dc83ebff4c90
change test to use html normalizer when comparing html output.
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
103 self.feed(html) |
|
dc83ebff4c90
change test to use html normalizer when comparing html output.
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
104 result = self.current_normalized_string |
|
dc83ebff4c90
change test to use html normalizer when comparing html output.
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
105 self.reset() |
|
dc83ebff4c90
change test to use html normalizer when comparing html output.
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
106 return result |
|
dc83ebff4c90
change test to use html normalizer when comparing html output.
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
107 |
|
dc83ebff4c90
change test to use html normalizer when comparing html output.
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
108 |
|
dc83ebff4c90
change test to use html normalizer when comparing html output.
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
109 if __name__ == "__main__": |
|
dc83ebff4c90
change test to use html normalizer when comparing html output.
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
110 parser = NormalizingHtmlParser() |
|
dc83ebff4c90
change test to use html normalizer when comparing html output.
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
111 |
|
dc83ebff4c90
change test to use html normalizer when comparing html output.
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
112 parser.feed('<div class="markup"><p> paragraph text with whitespace\n and more space <pre><span class="f" data-attr="f">text more text</span></pre></div>') |
|
dc83ebff4c90
change test to use html normalizer when comparing html output.
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
113 print("\n\ntest1", parser.current_normalized_string) |
|
dc83ebff4c90
change test to use html normalizer when comparing html output.
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
114 |
|
dc83ebff4c90
change test to use html normalizer when comparing html output.
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
115 parser.reset() |
|
dc83ebff4c90
change test to use html normalizer when comparing html output.
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
116 |
|
dc83ebff4c90
change test to use html normalizer when comparing html output.
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
117 parser.feed('''<div class="markup"> |
|
dc83ebff4c90
change test to use html normalizer when comparing html output.
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
118 <p> paragraph text with whitespace\n and more space |
|
dc83ebff4c90
change test to use html normalizer when comparing html output.
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
119 <pre><span class="f" data-attr="f">text \n more text</span></pre> |
|
dc83ebff4c90
change test to use html normalizer when comparing html output.
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
120 </div>''') |
|
dc83ebff4c90
change test to use html normalizer when comparing html output.
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
121 print("\n\ntest2", parser.current_normalized_string) |
|
dc83ebff4c90
change test to use html normalizer when comparing html output.
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
122 parser.reset() |
|
dc83ebff4c90
change test to use html normalizer when comparing html output.
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
123 print("\n\nnormalize", parser.normalize('''<div class="markup"> |
|
dc83ebff4c90
change test to use html normalizer when comparing html output.
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
124 <p> paragraph text with whitespace\n and more space |
|
dc83ebff4c90
change test to use html normalizer when comparing html output.
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
125 <pre><span class="f" data-attr="f">text \n more text <</span></pre> |
|
dc83ebff4c90
change test to use html normalizer when comparing html output.
John Rouillard <rouilj@ieee.org>
parents:
diff
changeset
|
126 </div>''')) |
