annotate roundup/cgi/TAL/HTMLParser.py @ 8264:09e8d1a4c796

docs: clarify wording, fix index, add superseder link
Make superseder, messages etc. properties index entries point to the right place. Link to description of using Superseder in the original overview. Fix bad wording on boolean properties.
author John Rouillard <rouilj@ieee.org>
date Wed, 08 Jan 2025 11:39:54 -0500
parents 936275dfe1fa
children
rev   line source
1049
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
1 """A parser for HTML and XHTML."""
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
2
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
3 # This file is based on sgmllib.py, but the API is slightly different.
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
4
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
5 # XXX There should be a way to distinguish between PCDATA (parsed
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
6 # character data -- the normal case), RCDATA (replaceable character
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
7 # data -- only char and entity references and end tags are special)
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
8 # and CDATA (character data -- only end tags are special).
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
9
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
10
5388
d26921b851c3 Python 3 preparation: make relative imports explicit.
Joseph Myers <jsm@polyomino.org.uk>
parents: 5377
diff changeset
11 from . import markupbase
1049
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
12 import re
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
13
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
14 # Regular expressions used for parsing
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
15
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
16 interesting_normal = re.compile('[&<]')
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
17 interesting_cdata = re.compile(r'<(/|\Z)')
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
18 incomplete = re.compile('&[a-zA-Z#]')
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
19
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
20 entityref = re.compile('&([a-zA-Z][-.a-zA-Z0-9]*)[^a-zA-Z0-9]')
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
21 charref = re.compile('&#(?:[0-9]+|[xX][0-9a-fA-F]+)[^0-9a-fA-F]')
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
22
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
23 starttagopen = re.compile('<[a-zA-Z]')
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
24 piclose = re.compile('>')
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
25 endtagopen = re.compile('</')
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
26 commentclose = re.compile(r'--\s*>')
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
27 tagfind = re.compile('[a-zA-Z][-.a-zA-Z0-9:_]*')
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
28 attrfind = re.compile(
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
29 r'\s*([a-zA-Z_][-.:a-zA-Z_0-9]*)(\s*=\s*'
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
30 r'(\'[^\']*\'|"[^"]*"|[-a-zA-Z0-9./:;+*%?!&$\(\)_#=~]*))?')
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
31
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
32 locatestarttagend = re.compile(r"""
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
33 <[a-zA-Z][-.a-zA-Z0-9:_]* # tag name
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
34 (?:\s+ # whitespace before attribute name
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
35 (?:[a-zA-Z_][-.:a-zA-Z0-9_]* # attribute name
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
36 (?:\s*=\s* # value indicator
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
37 (?:'[^']*' # LITA-enclosed value
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
38 |\"[^\"]*\" # LIT-enclosed value
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
39 |[^'\">\s]+ # bare value
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
40 )
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
41 )?
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
42 )
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
43 )*
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
44 \s* # trailing whitespace
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
45 """, re.VERBOSE)
5809
936275dfe1fa Try to fix:
John Rouillard <rouilj@ieee.org>
parents: 5388
diff changeset
46 endendtag = re.compile(r'>')
936275dfe1fa Try to fix:
John Rouillard <rouilj@ieee.org>
parents: 5388
diff changeset
47 endtagfind = re.compile(r'</\s*([a-zA-Z][-.a-zA-Z0-9:_]*)\s*>')
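The patterns above can be tried in isolation. The following is a minimal sketch, assuming only the `tagfind` and `attrfind` expressions copied from the listing; `split_start_tag` is a hypothetical helper that loosely follows the attribute loop in `parse_starttag()` and skips the entity unescaping that the real method performs.

```python
import re

# Patterns copied from the listing above.
tagfind = re.compile('[a-zA-Z][-.a-zA-Z0-9:_]*')
attrfind = re.compile(
    r'\s*([a-zA-Z_][-.:a-zA-Z_0-9]*)(\s*=\s*'
    r'(\'[^\']*\'|"[^"]*"|[-a-zA-Z0-9./:;+*%?!&$\(\)_#=~]*))?')


def split_start_tag(text):
    """Hypothetical helper: split '<tag a=b ...>' into (tag, attrs)."""
    match = tagfind.match(text, 1)  # position 1 skips the leading '<'
    if not match:
        raise ValueError("not a start tag: %r" % text)
    tag = match.group().lower()
    k = match.end()
    attrs = []
    while True:
        m = attrfind.match(text, k)
        if not m:
            break  # no further attribute, e.g. we reached '>'
        name, rest, value = m.group(1, 2, 3)
        if not rest:
            value = None  # attribute given without '=value'
        elif value[:1] == value[-1:] and value[:1] in ('"', "'"):
            value = value[1:-1]  # drop the surrounding quotes
        attrs.append((name.lower(), value))
        k = m.end()
    return tag, attrs


result = split_start_tag('<A HREF="x.html" class=big disabled>')
print(result)
```

Tag and attribute names are lowercased, as in the listing, while attribute values keep their case.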
1049
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
48
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
49
5265
63868084b8bb Python 2 and 3 support. Convert Exception to BaseException. TAL and
John Rouillard <rouilj@ieee.org>
parents: 2348
diff changeset
50 class HTMLParseError(BaseException):
1049
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
51 """Exception raised for all parse errors."""
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
52
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
53 def __init__(self, msg, position=(None, None)):
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
54 assert msg
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
55 self.msg = msg
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
56 self.lineno = position[0]
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
57 self.offset = position[1]
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
58
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
59 def __str__(self):
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
60 result = self.msg
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
61 if self.lineno is not None:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
62 result = result + ", at line %d" % self.lineno
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
63 if self.offset is not None:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
64 result = result + ", column %d" % (self.offset + 1)
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
65 return result
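Because `HTMLParseError` derives from `BaseException`, a plain `except Exception:` clause would not be expected to catch it. The sketch below uses a stand-in class that mirrors the listing. The annotate view has lost indentation, so it assumes the two `if` tests in `__str__` are siblings. `caught_by` is a hypothetical helper.

```python
class HTMLParseError(BaseException):
    """Stand-in mirroring the class in the listing."""

    def __init__(self, msg, position=(None, None)):
        assert msg
        self.msg = msg
        self.lineno = position[0]
        self.offset = position[1]

    def __str__(self):
        result = self.msg
        if self.lineno is not None:
            result = result + ", at line %d" % self.lineno
        if self.offset is not None:
            # offset is zero-based, the reported column is one-based
            result = result + ", column %d" % (self.offset + 1)
        return result


def caught_by(exc):
    """Hypothetical helper: report which except clause receives exc."""
    try:
        raise exc
    except Exception:
        return "Exception"
    except BaseException:
        return "BaseException"


err = HTMLParseError("malformed start tag", (3, 4))
message = str(err)
handler = caught_by(err)
print(message, "/ caught by", handler)
```

Callers that want to report parse errors therefore need to name `HTMLParseError` (or `BaseException`) explicitly.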
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
66
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
67
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
68 def _contains_at(s, sub, pos):
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
69 return s[pos:pos+len(sub)] == sub
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
70
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
71
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
72 class HTMLParser(markupbase.ParserBase):
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
73 """Find tags and other markup and call handler functions.
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
74
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
75 Usage:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
76 p = HTMLParser()
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
77 p.feed(data)
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
78 ...
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
79 p.close()
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
80
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
81 Start tags are handled by calling self.handle_starttag() or
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
82 self.handle_startendtag(); end tags by self.handle_endtag(). The
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
83 data between tags is passed from the parser to the derived class
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
84 by calling self.handle_data() with the data as argument (the data
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
85 may be split up in arbitrary chunks). Entity references are
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
86 passed by calling self.handle_entityref() with the entity
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
87 reference as the argument. Numeric character references are
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
88 passed to self.handle_charref() with the string containing the
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
89 reference as the argument.
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
90 """
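The handler protocol described in the docstring can be illustrated with a small subclass. This sketch assumes the standard library's `html.parser.HTMLParser` is an acceptable stand-in, since it exposes handlers with the same names. It is not the roundup class itself, and `convert_charrefs=False` is a standard-library option used here so that references reach their own handlers. `EventCollector` is a hypothetical name.

```python
from html.parser import HTMLParser  # standard-library stand-in


class EventCollector(HTMLParser):
    """Record every handler call as a tuple."""

    def __init__(self):
        super().__init__(convert_charrefs=False)
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag, attrs))

    def handle_startendtag(self, tag, attrs):
        self.events.append(("startend", tag, attrs))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        self.events.append(("data", data))

    def handle_entityref(self, name):
        self.events.append(("entityref", name))

    def handle_charref(self, name):
        self.events.append(("charref", name))


p = EventCollector()
p.feed('<p class="x">a &amp; b<br/>&#65;</p>')
p.close()
events = p.events
for event in events:
    print(event)
```

Text may arrive in several `handle_data()` calls, so a subclass that needs the full text should concatenate the chunks.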
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
91
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
92 CDATA_CONTENT_ELEMENTS = ("script", "style")
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
93
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
94
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
95 def __init__(self):
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
96 """Initialize and reset this instance."""
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
97 self.reset()
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
98
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
99 def reset(self):
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
100 """Reset this instance. Loses all unprocessed data."""
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
101 self.rawdata = ''
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
102 self.stack = []
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
103 self.lasttag = '???'
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
104 self.interesting = interesting_normal
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
105 markupbase.ParserBase.reset(self)
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
106
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
107 def feed(self, data):
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
108 """Feed data to the parser.
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
109
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
110 Call this as often as you want, with as little or as much text
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
111 as you want (may include '\n').
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
112 """
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
113 self.rawdata = self.rawdata + data
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
114 self.goahead(0)
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
115
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
116 def close(self):
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
117 """Handle any buffered data."""
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
118 self.goahead(1)
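`feed()` may be called with fragments that cut a construct in half; the unfinished part stays in `rawdata` until more text arrives. The sketch below shows this with the standard library's `html.parser.HTMLParser` as an assumed stand-in with the same `feed()`/`close()` interface, not the roundup class. `StartTagLogger` is a hypothetical name.

```python
from html.parser import HTMLParser  # standard-library stand-in


class StartTagLogger(HTMLParser):
    """Remember each start tag that has been fully parsed."""

    def __init__(self):
        super().__init__(convert_charrefs=False)
        self.seen = []

    def handle_starttag(self, tag, attrs):
        self.seen.append((tag, attrs))


p = StartTagLogger()
p.feed('<p cla')              # start tag is cut off mid-attribute
after_first = list(p.seen)    # nothing reported yet
p.feed('ss="x">hello')        # remainder of the tag arrives
after_second = list(p.seen)
p.close()
print(after_first, after_second)
```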
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
119
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
120 def error(self, message):
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
121 raise HTMLParseError(message, self.getpos())
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
122
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
123 __starttag_text = None
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
124
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
125 def get_starttag_text(self):
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
126 """Return full source of start tag: '<...>'."""
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
127 return self.__starttag_text
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
128
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
129 cdata_endtag = None
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
130
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
131 def set_cdata_mode(self, endtag=None):
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
132 self.cdata_endtag = endtag
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
133 self.interesting = interesting_cdata
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
134
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
135 def clear_cdata_mode(self):
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
136 self.cdata_endtag = None
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
137 self.interesting = interesting_normal
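Switching `self.interesting` changes where the main loop stops scanning. The sketch below uses the two patterns copied from the top of the listing; the sample string is made up. In CDATA mode a bare `<` or `&` inside script text is skipped, and only `</` (or a `<` at the very end of the buffer) counts.

```python
import re

# Patterns copied from the top of the listing.
interesting_normal = re.compile('[&<]')
interesting_cdata = re.compile(r'<(/|\Z)')

script_body = 'if (a < b && c) { run(); }</script>'

# Normal mode stops at the first '<' or '&'.
normal_stop = interesting_normal.search(script_body).start()
# CDATA mode ignores them and stops only at the end tag opener.
cdata_stop = interesting_cdata.search(script_body).start()

print(script_body[normal_stop:normal_stop + 3])
print(script_body[cdata_stop:])
```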
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
138
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
139 # Internal -- handle data as far as reasonable. May leave state
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
140 # and data to be processed by a subsequent call. If 'end' is
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
141 # true, force handling all data as if followed by EOF marker.
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
142 def goahead(self, end):
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
143 rawdata = self.rawdata
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
144 i = 0
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
145 n = len(rawdata)
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
146 while i < n:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
147 match = self.interesting.search(rawdata, i) # < or &
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
148 if match:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
149 j = match.start()
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
150 else:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
151 j = n
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
152 if i < j: self.handle_data(rawdata[i:j])
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
153 i = self.updatepos(i, j)
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
154 if i == n: break
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
155 if rawdata[i] == '<':
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
156 if starttagopen.match(rawdata, i): # < + letter
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
157 k = self.parse_starttag(i)
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
158 elif endtagopen.match(rawdata, i): # </
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
159 k = self.parse_endtag(i)
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
160 elif _contains_at(rawdata, "<!--", i): # <!--
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
161 k = self.parse_comment(i)
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
162 elif _contains_at(rawdata, "<!", i): # <!
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
163 k = self.parse_declaration(i)
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
164 elif _contains_at(rawdata, "<?", i): # <?
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
165 k = self.parse_pi(i)
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
166 elif _contains_at(rawdata, "<!", i): # <!
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
167 k = self.parse_declaration(i)
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
168 elif (i + 1) < n:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
169 self.handle_data("<")
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
170 k = i + 1
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
171 else:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
172 break
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
173 if k < 0:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
174 if end:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
175 self.error("EOF in middle of construct")
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
176 break
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
177 i = self.updatepos(i, k)
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
178 elif rawdata[i:i+2] == "&#":
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
179 match = charref.match(rawdata, i)
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
180 if match:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
181 name = match.group()[2:-1]
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
182 self.handle_charref(name)
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
183 k = match.end()
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
184 if rawdata[k-1] != ';':
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
185 k = k - 1
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
186 i = self.updatepos(i, k)
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
187 continue
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
188 else:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
189 break
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
190 elif rawdata[i] == '&':
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
191 match = entityref.match(rawdata, i)
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
192 if match:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
193 name = match.group(1)
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
194 self.handle_entityref(name)
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
195 k = match.end()
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
196 if rawdata[k-1] != ';':
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
197 k = k - 1
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
198 i = self.updatepos(i, k)
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
199 continue
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
200 match = incomplete.match(rawdata, i)
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
201 if match:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
202 # match.group() will contain at least 2 chars
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
203 rest = rawdata[i:]
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
204 if end and match.group() == rest:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
205 self.error("EOF in middle of entity or char ref")
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
206 # incomplete
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
207 break
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
208 elif (i + 1) < n:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
209 # not the end of the buffer, and can't be confused
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
210 # with some other construct
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
211 self.handle_data("&")
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
212 i = self.updatepos(i, i + 1)
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
213 else:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
214 break
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
215 else:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
216 assert 0, "interesting.search() lied"
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
217 # end while
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
218 if end and i < n:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
219 self.handle_data(rawdata[i:n])
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
220 i = self.updatepos(i, n)
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
221 self.rawdata = rawdata[i:]
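The character-reference branch of `goahead()` consumes the terminating character only when it is a semicolon. The sketch below isolates that step with the `charref` pattern copied from the listing. `read_charref` is a hypothetical helper, and returning `None` stands in for the `break` that leaves an incomplete reference in the buffer.

```python
import re

# Pattern copied from the listing.
charref = re.compile('&#(?:[0-9]+|[xX][0-9a-fA-F]+)[^0-9a-fA-F]')


def read_charref(rawdata, i):
    """Hypothetical helper: return (name, next_index), or None if incomplete."""
    match = charref.match(rawdata, i)
    if not match:
        return None  # the parser would stop and wait for more data
    name = match.group()[2:-1]  # strip '&#' and the terminating character
    k = match.end()
    if rawdata[k - 1] != ';':
        k = k - 1  # the terminator was not ';', so leave it as data
    return name, k


print(read_charref('&#65;x', 0))
print(read_charref('&#x41 y', 0))
print(read_charref('&#65', 0))
```

The pattern requires one character after the digits, which is why a reference at the very end of the buffer is treated as incomplete.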
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
222
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
223 # Internal -- parse comment, return end or -1 if not terminated
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
224 def parse_comment(self, i, report=1):
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
225 rawdata = self.rawdata
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
226 assert rawdata[i:i+4] == '<!--', 'unexpected call to parse_comment()'
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
227 match = commentclose.search(rawdata, i+4)
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
228 if not match:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
229 return -1
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
230 if report:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
231 j = match.start()
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
232 self.handle_comment(rawdata[i+4: j])
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
233 j = match.end()
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
234 return j
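Comment scanning relies on `commentclose`, which tolerates whitespace between `--` and `>`. The sketch below is a standalone approximation of `parse_comment()` that returns the body instead of calling `handle_comment()`. `read_comment` is a hypothetical name.

```python
import re

# Pattern copied from the listing.
commentclose = re.compile(r'--\s*>')


def read_comment(rawdata, i):
    """Hypothetical helper: return (end_index, body), or (-1, None) if unterminated."""
    if rawdata[i:i + 4] != '<!--':
        raise ValueError("no comment at index %d" % i)
    match = commentclose.search(rawdata, i + 4)
    if not match:
        return -1, None
    return match.end(), rawdata[i + 4:match.start()]


print(read_comment('<!-- note -- >rest', 0))
print(read_comment('<!-- open', 0))
```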
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
235
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
236 # Internal -- parse processing instr, return end or -1 if not terminated
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
237 def parse_pi(self, i):
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
238 rawdata = self.rawdata
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
239 assert rawdata[i:i+2] == '<?', 'unexpected call to parse_pi()'
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
240 match = piclose.search(rawdata, i+2) # >
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
241 if not match:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
242 return -1
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
243 j = match.start()
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
244 self.handle_pi(rawdata[i+2: j])
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
245 j = match.end()
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
246 return j
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
247
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
248 # Internal -- handle starttag, return end or -1 if not terminated
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
249 def parse_starttag(self, i):
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
250 self.__starttag_text = None
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
251 endpos = self.check_for_whole_start_tag(i)
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
252 if endpos < 0:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
253 return endpos
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
254 rawdata = self.rawdata
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
255 self.__starttag_text = rawdata[i:endpos]
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
256
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
257 # Now parse the data between i+1 and endpos into a tag and attrs
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
258 attrs = []
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
259 match = tagfind.match(rawdata, i+1)
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
260 assert match, 'unexpected call to parse_starttag()'
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
261 k = match.end()
2348
8c2402a78bb0 beginning getting ZPT up to date: TAL first
Richard Jones <richard@users.sourceforge.net>
parents: 2005
diff changeset
262 self.lasttag = tag = rawdata[i+1:k].lower()
1049
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
263
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
264 while k < endpos:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
265 m = attrfind.match(rawdata, k)
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
266 if not m:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
267 break
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
268 attrname, rest, attrvalue = m.group(1, 2, 3)
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
269 if not rest:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
270 attrvalue = None
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
271 elif attrvalue[:1] == '\'' == attrvalue[-1:] or \
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
272 attrvalue[:1] == '"' == attrvalue[-1:]:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
273 attrvalue = attrvalue[1:-1]
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
274 attrvalue = self.unescape(attrvalue)
2348
8c2402a78bb0 beginning getting ZPT up to date: TAL first
Richard Jones <richard@users.sourceforge.net>
parents: 2005
diff changeset
275 attrs.append((attrname.lower(), attrvalue))
1049
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
276 k = m.end()
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
277
2348
8c2402a78bb0 beginning getting ZPT up to date: TAL first
Richard Jones <richard@users.sourceforge.net>
parents: 2005
diff changeset
278 end = rawdata[k:endpos].strip()
1049
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
279 if end not in (">", "/>"):
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
280 lineno, offset = self.getpos()
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
281 if "\n" in self.__starttag_text:
2348
8c2402a78bb0 beginning getting ZPT up to date: TAL first
Richard Jones <richard@users.sourceforge.net>
parents: 2005
diff changeset
282 lineno = lineno + self.__starttag_text.count("\n")
1049
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
283 offset = len(self.__starttag_text) \
2348
8c2402a78bb0 beginning getting ZPT up to date: TAL first
Richard Jones <richard@users.sourceforge.net>
parents: 2005
diff changeset
284 - self.__starttag_text.rfind("\n")
1049
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
285 else:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
286 offset = offset + len(self.__starttag_text)
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
287 self.error("junk characters in start tag: %s"
5377
12fe83f90f0d Python 3 preparation: use repr() instead of ``.
Joseph Myers <jsm@polyomino.org.uk>
parents: 5265
diff changeset
288 % repr(rawdata[k:endpos][:20]))
1049
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
289 if end[-2:] == '/>':
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
290 # XHTML-style empty tag: <span attr="value" />
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
291 self.handle_startendtag(tag, attrs)
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
292 else:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
293 self.handle_starttag(tag, attrs)
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
294 if tag in self.CDATA_CONTENT_ELEMENTS:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
295 self.set_cdata_mode(tag)
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
296 return endpos
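The junk-character branch of `parse_starttag()` adjusts `lineno` and `offset` past the start tag text. As far as this listing shows, the adjusted values are not passed to `self.error()`, which reads `self.getpos()` itself. The arithmetic can be checked in isolation; `advance_position` is a hypothetical helper that copies it.

```python
def advance_position(lineno, offset, text):
    """Hypothetical helper mirroring the arithmetic in parse_starttag()."""
    if "\n" in text:
        # Move down one line per newline; the offset restarts after the last one.
        lineno = lineno + text.count("\n")
        offset = len(text) - text.rfind("\n")
    else:
        offset = offset + len(text)
    return lineno, offset


same_line = advance_position(3, 4, '<a href=x')
multi_line = advance_position(3, 4, '<a\n  href=x')
print(same_line, multi_line)
```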
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
297
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
298 # Internal -- check to see if we have a complete starttag; return end
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
299 # or -1 if incomplete.
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
300 def check_for_whole_start_tag(self, i):
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
301 rawdata = self.rawdata
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
302 m = locatestarttagend.match(rawdata, i)
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
303 if m:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
304 j = m.end()
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
305 next = rawdata[j:j+1]
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
306 if next == ">":
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
307 return j + 1
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
308 if next == "/":
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
309 s = rawdata[j:j+2]
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
310 if s == "/>":
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
311 return j + 2
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
312 if s == "/":
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
313 # buffer boundary
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
314 return -1
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
315 # else bogus input
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
316 self.updatepos(i, j + 1)
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
317 self.error("malformed empty start tag")
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
318 if next == "":
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
319 # end of input
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
320 return -1
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
321 if next in ("abcdefghijklmnopqrstuvwxyz=/"
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
322 "ABCDEFGHIJKLMNOPQRSTUVWXYZ"):
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
323 # end of input in or before attribute value, or we have the
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
324 # '/' from a '/>' ending
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
325 return -1
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
326 self.updatepos(i, j)
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
327 self.error("malformed start tag")
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
328 raise AssertionError("we should not get here!")
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
329
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
330 # Internal -- parse endtag, return end or -1 if incomplete
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
331 def parse_endtag(self, i):
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
332 rawdata = self.rawdata
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
333 assert rawdata[i:i+2] == "</", "unexpected call to parse_endtag"
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
334 match = endendtag.search(rawdata, i+1) # >
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
335 if not match:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
336 return -1
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
337 j = match.end()
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
338 match = endtagfind.match(rawdata, i) # </ + tag + >
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
339 if not match:
5377
12fe83f90f0d Python 3 preparation: use repr() instead of ``.
Joseph Myers <jsm@polyomino.org.uk>
parents: 5265
diff changeset
340 self.error("bad end tag: %s" % repr(rawdata[i:j]))
2348
8c2402a78bb0 beginning getting ZPT up to date: TAL first
Richard Jones <richard@users.sourceforge.net>
parents: 2005
diff changeset
341 tag = match.group(1).lower()
1049
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
342 if (self.cdata_endtag is not None
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
343 and tag != self.cdata_endtag):
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
344 # Should be a mismatched end tag, but we'll treat it
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
345 # as text anyway, since most HTML authors aren't
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
346 # interested in the finer points of syntax.
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
347 self.handle_data(match.group(0))
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
348 else:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
349 self.handle_endtag(tag)
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
350 self.clear_cdata_mode()
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
351 return j
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
352
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
353 # Overridable -- finish processing of start+end tag: <tag.../>
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
354 def handle_startendtag(self, tag, attrs):
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
355 self.handle_starttag(tag, attrs)
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
356 self.handle_endtag(tag)
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
357
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
358 # Overridable -- handle start tag
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
359 def handle_starttag(self, tag, attrs):
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
360 pass
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
361
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
362 # Overridable -- handle end tag
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
363 def handle_endtag(self, tag):
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
364 pass
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
365
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
366 # Overridable -- handle character reference
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
367 def handle_charref(self, name):
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
368 pass
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
369
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
370 # Overridable -- handle entity reference
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
371 def handle_entityref(self, name):
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
372 pass
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
373
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
374 # Overridable -- handle data
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
375 def handle_data(self, data):
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
376 pass
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
377
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
378 # Overridable -- handle comment
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
379 def handle_comment(self, data):
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
380 pass
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
381
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
382 # Overridable -- handle declaration
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
383 def handle_decl(self, decl):
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
384 pass
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
385
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
386 # Overridable -- handle processing instruction
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
387 def handle_pi(self, data):
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
388 pass
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
389
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
390 def unknown_decl(self, data):
5377
12fe83f90f0d Python 3 preparation: use repr() instead of ``.
Joseph Myers <jsm@polyomino.org.uk>
parents: 5265
diff changeset
391 self.error("unknown declaration: " + repr(data))
1049
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
392
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
393 # Internal -- helper to remove special character quoting
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
394 def unescape(self, s):
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
395 if '&' not in s:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
396 return s
2348
8c2402a78bb0 beginning getting ZPT up to date: TAL first
Richard Jones <richard@users.sourceforge.net>
parents: 2005
diff changeset
397 s = s.replace("&lt;", "<")
8c2402a78bb0 beginning getting ZPT up to date: TAL first
Richard Jones <richard@users.sourceforge.net>
parents: 2005
diff changeset
398 s = s.replace("&gt;", ">")
8c2402a78bb0 beginning getting ZPT up to date: TAL first
Richard Jones <richard@users.sourceforge.net>
parents: 2005
diff changeset
399 s = s.replace("&apos;", "'")
8c2402a78bb0 beginning getting ZPT up to date: TAL first
Richard Jones <richard@users.sourceforge.net>
parents: 2005
diff changeset
400 s = s.replace("&quot;", '"')
8c2402a78bb0 beginning getting ZPT up to date: TAL first
Richard Jones <richard@users.sourceforge.net>
parents: 2005
diff changeset
401 s = s.replace("&amp;", "&") # Must be last, so "&amp;lt;" decodes to "&lt;" and not "<"
1049
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
402 return s
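
The "Must be last" note on the "&amp;" replacement in unescape() can be checked in isolation. The following is a minimal sketch, not part of this file: unescape() is copied out as a free function, and unescape_amp_first() is a hypothetical variant with the wrong ordering, included only for comparison.

```python
def unescape(s):
    """Standalone copy of the replace chain used by unescape() in this file."""
    if '&' not in s:
        return s
    s = s.replace("&lt;", "<")
    s = s.replace("&gt;", ">")
    s = s.replace("&apos;", "'")
    s = s.replace("&quot;", '"')
    s = s.replace("&amp;", "&")  # last, so each input is decoded exactly once
    return s


def unescape_amp_first(s):
    """Hypothetical wrong ordering: '&amp;' is decoded before the others."""
    s = s.replace("&amp;", "&")
    s = s.replace("&lt;", "<")
    s = s.replace("&gt;", ">")
    s = s.replace("&apos;", "'")
    s = s.replace("&quot;", '"')
    return s


# The literal text "&lt;b&gt;" after one round of escaping.
escaped = "&amp;lt;b&amp;gt;"

once = unescape(escaped)             # one level of quoting removed
twice = unescape_amp_first(escaped)  # two levels removed by accident
```

With "&amp;" handled last, the input loses one level of quoting and the original literal text comes back. With "&amp;" handled first, the freshly produced "&lt;" and "&gt;" are decoded again in the same pass, so the text turns into markup.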

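
The "Overridable" handlers above are meant to be replaced in a subclass. As a rough sketch of that pattern, the example below uses the Python standard library's html.parser.HTMLParser, which exposes handlers with the same names; treating it as a stand-in for this module is an assumption made so the example runs on its own. EventRecorder is a hypothetical name.

```python
from html.parser import HTMLParser


class EventRecorder(HTMLParser):
    """Collects parser callbacks as tuples instead of acting on them."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag, attrs))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        self.events.append(("data", data))

    # handle_startendtag is deliberately not overridden: the default
    # implementation calls handle_starttag and then handle_endtag,
    # as in the listing above.


recorder = EventRecorder()
recorder.feed('<p class="x">hi</p><br/>')
recorder.close()
```

Because handle_startendtag is left at its default, the self-closing `<br/>` is reported as a start event followed by an end event, which is usually enough unless a subclass needs to tell `<br/>` apart from `<br></br>`.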
Roundup Issue Tracker: http://roundup-tracker.org/