annotate roundup/cgi/TAL/HTMLParser.py @ 2077:3e0961d6d44d

Added the "actor" property. Metakit backend not done (still not confident I know how it's supposed to work ;). Currently it will come up as NULL in the RDBMS backends for older items. The *dbm backends will look up the journal. I hope to remedy the former before 0.7's release. Fixed a bunch of migration issues in the RDBMS backends while I was at it (index changes for key prop changes) and simplified the class table update code for RDBMSes that have "alter table" in their command set (i.e. not sqlite). Migration from "version 1" to "version 2" still hasn't actually been tested yet, though.
author Richard Jones <richard@users.sourceforge.net>
date Mon, 15 Mar 2004 05:50:20 +0000
parents fc52d57c6c3e
children 8c2402a78bb0
rev   line source
1049
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
"""A parser for HTML and XHTML."""
2005
fc52d57c6c3e documentation cleanup
Richard Jones <richard@users.sourceforge.net>
parents: 1049
diff changeset
__docformat__ = 'restructuredtext'
1049
Richard Jones <richard@users.sourceforge.net>

# This file is based on sgmllib.py, but the API is slightly different.

# XXX There should be a way to distinguish between PCDATA (parsed
# character data -- the normal case), RCDATA (replaceable character
# data -- only char and entity references and end tags are special)
# and CDATA (character data -- only end tags are special).


import markupbase
import re
import string

# Regular expressions used for parsing

interesting_normal = re.compile('[&<]')
interesting_cdata = re.compile(r'<(/|\Z)')
incomplete = re.compile('&[a-zA-Z#]')

entityref = re.compile('&([a-zA-Z][-.a-zA-Z0-9]*)[^a-zA-Z0-9]')
charref = re.compile('&#(?:[0-9]+|[xX][0-9a-fA-F]+)[^0-9a-fA-F]')

starttagopen = re.compile('<[a-zA-Z]')
piclose = re.compile('>')
endtagopen = re.compile('</')
commentclose = re.compile(r'--\s*>')
tagfind = re.compile('[a-zA-Z][-.a-zA-Z0-9:_]*')
attrfind = re.compile(
    r'\s*([a-zA-Z_][-.:a-zA-Z_0-9]*)(\s*=\s*'
    r'(\'[^\']*\'|"[^"]*"|[-a-zA-Z0-9./:;+*%?!&$\(\)_#=~]*))?')

locatestarttagend = re.compile(r"""
  <[a-zA-Z][-.a-zA-Z0-9:_]*          # tag name
  (?:\s+                             # whitespace before attribute name
    (?:[a-zA-Z_][-.:a-zA-Z0-9_]*     # attribute name
      (?:\s*=\s*                     # value indicator
        (?:'[^']*'                   # LITA-enclosed value
          |\"[^\"]*\"                # LIT-enclosed value
          |[^'\">\s]+                # bare value
         )
       )?
     )
   )*
  \s*                                # trailing whitespace
""", re.VERBOSE)
endendtag = re.compile('>')
endtagfind = re.compile('</\s*([a-zA-Z][-.a-zA-Z0-9:_]*)\s*>')
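The entity, character-reference and attribute patterns are easier to read with a few inputs run through them. The following is a minimal sketch and not part of HTMLParser.py: it copies three of the patterns (written as raw strings) into a standalone Python 3 snippet, and the helper names `entity_name`, `charref_body` and `first_attr` are hypothetical.

```python
import re

# Patterns copied from the listing; only the string prefixes differ.
entityref = re.compile(r'&([a-zA-Z][-.a-zA-Z0-9]*)[^a-zA-Z0-9]')
charref = re.compile(r'&#(?:[0-9]+|[xX][0-9a-fA-F]+)[^0-9a-fA-F]')
attrfind = re.compile(
    r'\s*([a-zA-Z_][-.:a-zA-Z_0-9]*)(\s*=\s*'
    r'(\'[^\']*\'|"[^"]*"|[-a-zA-Z0-9./:;+*%?!&$\(\)_#=~]*))?')


def entity_name(text):
    """Return the entity name at the start of text, or None (hypothetical helper)."""
    m = entityref.match(text)
    return m.group(1) if m else None


def charref_body(text):
    """Return the digits of a character reference, without '&#' and the terminator."""
    m = charref.match(text)
    return m.group()[2:-1] if m else None


def first_attr(text):
    """Return (name, raw_value) for the first attribute in text, or None."""
    m = attrfind.match(text)
    return m.group(1, 3) if m else None


print(entity_name("&amp; more"))        # name without '&' and ';'
print(entity_name("&amp"))              # no terminator yet, so no match
print(charref_body("&#x41;"))
print(first_attr(' href="a.html" x'))   # value still carries its quotes
print(first_attr(' checked>'))          # bare attribute: value is None
```

Both reference patterns require one terminating character after the name or digits, so an unterminated reference such as `&amp` at the end of the buffer does not match. That appears to be why the parser keeps such input buffered until more data arrives or `close()` is called.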


class HTMLParseError(Exception):
    """Exception raised for all parse errors."""

    def __init__(self, msg, position=(None, None)):
        assert msg
        self.msg = msg
        self.lineno = position[0]
        self.offset = position[1]

    def __str__(self):
        result = self.msg
        if self.lineno is not None:
            result = result + ", at line %d" % self.lineno
        if self.offset is not None:
            result = result + ", column %d" % (self.offset + 1)
        return result
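To illustrate how the error message is assembled, here is a small sketch that mirrors the exception class in Python 3. The class name `ParseErrorSketch` and the helper `describe` are hypothetical and exist only for this illustration; the formatting logic is copied from the listing.

```python
class ParseErrorSketch(Exception):
    """Standalone copy of the message formatting, for illustration only."""

    def __init__(self, msg, position=(None, None)):
        assert msg
        self.msg = msg
        self.lineno = position[0]
        self.offset = position[1]

    def __str__(self):
        result = self.msg
        if self.lineno is not None:
            result = result + ", at line %d" % self.lineno
        if self.offset is not None:
            # The stored offset is zero-based; the message shows it one-based.
            result = result + ", column %d" % (self.offset + 1)
        return result


def describe(msg, position=(None, None)):
    """Return the string a caller would see for the given message and position."""
    return str(ParseErrorSketch(msg, position))


print(describe("malformed start tag", (3, 0)))
print(describe("EOF in middle of construct"))
```

The position is a `(lineno, offset)` pair such as the one returned by `getpos()`; either part may be `None`, in which case that part is left out of the message.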


def _contains_at(s, sub, pos):
    return s[pos:pos+len(sub)] == sub


class HTMLParser(markupbase.ParserBase):
    """Find tags and other markup and call handler functions.

    Usage:
        p = HTMLParser()
        p.feed(data)
        ...
        p.close()

    Start tags are handled by calling self.handle_starttag() or
    self.handle_startendtag(); end tags by self.handle_endtag(). The
    data between tags is passed from the parser to the derived class
    by calling self.handle_data() with the data as argument (the data
    may be split up in arbitrary chunks). Entity references are
    passed by calling self.handle_entityref() with the entity
    reference as the argument. Numeric character references are
    passed to self.handle_charref() with the string containing the
    reference as the argument.
    """

    CDATA_CONTENT_ELEMENTS = ("script", "style")


    def __init__(self):
        """Initialize and reset this instance."""
        self.reset()

    def reset(self):
        """Reset this instance. Loses all unprocessed data."""
        self.rawdata = ''
        self.stack = []
        self.lasttag = '???'
        self.interesting = interesting_normal
        markupbase.ParserBase.reset(self)

    def feed(self, data):
        """Feed data to the parser.

        Call this as often as you want, with as little or as much text
        as you want (may include '\n').
        """
        self.rawdata = self.rawdata + data
        self.goahead(0)

    def close(self):
        """Handle any buffered data."""
        self.goahead(1)

    def error(self, message):
        raise HTMLParseError(message, self.getpos())

    __starttag_text = None

    def get_starttag_text(self):
        """Return full source of start tag: '<...>'."""
        return self.__starttag_text

    cdata_endtag = None

    def set_cdata_mode(self, endtag=None):
        self.cdata_endtag = endtag
        self.interesting = interesting_cdata

    def clear_cdata_mode(self):
        self.cdata_endtag = None
        self.interesting = interesting_normal

    # Internal -- handle data as far as reasonable. May leave state
    # and data to be processed by a subsequent call. If 'end' is
    # true, force handling all data as if followed by EOF marker.
    def goahead(self, end):
        rawdata = self.rawdata
        i = 0
        n = len(rawdata)
        while i < n:
            match = self.interesting.search(rawdata, i) # < or &
            if match:
                j = match.start()
            else:
                j = n
            if i < j: self.handle_data(rawdata[i:j])
            i = self.updatepos(i, j)
            if i == n: break
            if rawdata[i] == '<':
                if starttagopen.match(rawdata, i): # < + letter
                    k = self.parse_starttag(i)
                elif endtagopen.match(rawdata, i): # </
                    k = self.parse_endtag(i)
                elif _contains_at(rawdata, "<!--", i): # <!--
                    k = self.parse_comment(i)
                elif _contains_at(rawdata, "<!", i): # <!
                    k = self.parse_declaration(i)
                elif _contains_at(rawdata, "<?", i): # <?
                    k = self.parse_pi(i)
                elif _contains_at(rawdata, "<?", i): # <!
                    k = self.parse_declaration(i)
                elif (i + 1) < n:
                    self.handle_data("<")
                    k = i + 1
                else:
                    break
                if k < 0:
                    if end:
                        self.error("EOF in middle of construct")
                    break
                i = self.updatepos(i, k)
            elif rawdata[i:i+2] == "&#":
                match = charref.match(rawdata, i)
                if match:
                    name = match.group()[2:-1]
                    self.handle_charref(name)
                    k = match.end()
                    if rawdata[k-1] != ';':
                        k = k - 1
                    i = self.updatepos(i, k)
                    continue
                else:
                    break
            elif rawdata[i] == '&':
                match = entityref.match(rawdata, i)
                if match:
                    name = match.group(1)
                    self.handle_entityref(name)
                    k = match.end()
                    if rawdata[k-1] != ';':
                        k = k - 1
                    i = self.updatepos(i, k)
                    continue
                match = incomplete.match(rawdata, i)
                if match:
                    # match.group() will contain at least 2 chars
                    rest = rawdata[i:]
                    if end and match.group() == rest:
                        self.error("EOF in middle of entity or char ref")
                    # incomplete
                    break
                elif (i + 1) < n:
                    # not the end of the buffer, and can't be confused
                    # with some other construct
                    self.handle_data("&")
                    i = self.updatepos(i, i + 1)
                else:
                    break
            else:
                assert 0, "interesting.search() lied"
        # end while
        if end and i < n:
            self.handle_data(rawdata[i:n])
            i = self.updatepos(i, n)
        self.rawdata = rawdata[i:]
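The feed/close cycle and the handler callbacks can be tried out without the Python 2 module. The sketch below is an assumption-laden stand-in: it uses `html.parser.HTMLParser` from the Python 3 standard library, which appears to be a close relative of this file and uses the same handler names, rather than the class in this listing. `EventRecorder` and `record` are hypothetical names, and `convert_charrefs=False` is passed so that entity references are reported through `handle_entityref()` as the class docstring describes.

```python
from html.parser import HTMLParser


class EventRecorder(HTMLParser):
    """Record every callback as a tuple so the call order can be inspected."""

    def __init__(self):
        # With convert_charrefs=True (the Python 3 default) references are
        # folded into the data and the two reference handlers are not called.
        super().__init__(convert_charrefs=False)
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag, attrs))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        self.events.append(("data", data))

    def handle_entityref(self, name):
        self.events.append(("entity", name))

    def handle_charref(self, name):
        self.events.append(("charref", name))


def record(chunks):
    """Feed the chunks one by one, close the parser, and return the events."""
    parser = EventRecorder()
    for chunk in chunks:
        parser.feed(chunk)
    parser.close()
    return parser.events


for event in record(['<p class="x">a &amp; b</p>']):
    print(event)
```

Splitting the input in the middle of a start tag, for example `record(['<p cla', 'ss="x">hi</p>'])`, should give the same events as feeding it in one piece, because an incomplete construct stays in the buffer until a later `feed()` completes it.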

    # Internal -- parse comment, return end or -1 if not terminated
    def parse_comment(self, i, report=1):
        rawdata = self.rawdata
        assert rawdata[i:i+4] == '<!--', 'unexpected call to parse_comment()'
        match = commentclose.search(rawdata, i+4)
        if not match:
            return -1
        if report:
            j = match.start()
            self.handle_comment(rawdata[i+4: j])
        j = match.end()
        return j

    # Internal -- parse processing instr, return end or -1 if not terminated
    def parse_pi(self, i):
        rawdata = self.rawdata
        assert rawdata[i:i+2] == '<?', 'unexpected call to parse_pi()'
        match = piclose.search(rawdata, i+2) # >
        if not match:
            return -1
        j = match.start()
        self.handle_pi(rawdata[i+2: j])
        j = match.end()
        return j

    # Internal -- handle starttag, return end or -1 if not terminated
    def parse_starttag(self, i):
        self.__starttag_text = None
        endpos = self.check_for_whole_start_tag(i)
        if endpos < 0:
            return endpos
        rawdata = self.rawdata
        self.__starttag_text = rawdata[i:endpos]

        # Now parse the data between i+1 and j into a tag and attrs
        attrs = []
        match = tagfind.match(rawdata, i+1)
        assert match, 'unexpected call to parse_starttag()'
        k = match.end()
        self.lasttag = tag = string.lower(rawdata[i+1:k])

        while k < endpos:
            m = attrfind.match(rawdata, k)
            if not m:
                break
            attrname, rest, attrvalue = m.group(1, 2, 3)
            if not rest:
                attrvalue = None
            elif attrvalue[:1] == '\'' == attrvalue[-1:] or \
                 attrvalue[:1] == '"' == attrvalue[-1:]:
                attrvalue = attrvalue[1:-1]
                attrvalue = self.unescape(attrvalue)
            attrs.append((string.lower(attrname), attrvalue))
            k = m.end()

        end = string.strip(rawdata[k:endpos])
        if end not in (">", "/>"):
            lineno, offset = self.getpos()
            if "\n" in self.__starttag_text:
                lineno = lineno + string.count(self.__starttag_text, "\n")
                offset = len(self.__starttag_text) \
                         - string.rfind(self.__starttag_text, "\n")
            else:
                offset = offset + len(self.__starttag_text)
            self.error("junk characters in start tag: %s"
                       % `rawdata[k:endpos][:20]`)
        if end[-2:] == '/>':
            # XHTML-style empty tag: <span attr="value" />
            self.handle_startendtag(tag, attrs)
        else:
            self.handle_starttag(tag, attrs)
            if tag in self.CDATA_CONTENT_ELEMENTS:
                self.set_cdata_mode(tag)
        return endpos
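The attribute loop in `parse_starttag()` can be sketched in isolation. This is a simplified Python 3 rendering under stated assumptions, not the method itself: the hypothetical function `split_start_tag` takes a string holding exactly one complete start tag, raises `ValueError` where the method would call `self.error()`, and uses `html.unescape` as a stand-in for the parser's own `unescape()` method, which is not shown in this section.

```python
import re
from html import unescape  # stand-in for self.unescape(); an assumption

# Patterns copied from the listing, written as raw strings.
tagfind = re.compile(r'[a-zA-Z][-.a-zA-Z0-9:_]*')
attrfind = re.compile(
    r'\s*([a-zA-Z_][-.:a-zA-Z_0-9]*)(\s*=\s*'
    r'(\'[^\']*\'|"[^"]*"|[-a-zA-Z0-9./:;+*%?!&$\(\)_#=~]*))?')


def split_start_tag(text):
    """Split one complete start tag into (tag, attrs, is_empty)."""
    match = tagfind.match(text, 1)           # position 1 skips the leading '<'
    if not match:
        raise ValueError("not a start tag: %r" % text)
    k = match.end()
    tag = text[1:k].lower()
    attrs = []
    while k < len(text):
        m = attrfind.match(text, k)
        if not m:
            break
        name, rest, value = m.group(1, 2, 3)
        if not rest:
            value = None                      # bare attribute such as "checked"
        elif value[:1] == value[-1:] and value[:1] in ("'", '"'):
            # Only quoted values are stripped and unescaped, as in the listing.
            value = unescape(value[1:-1])
        attrs.append((name.lower(), value))
        k = m.end()
    end = text[k:].strip()
    if end not in (">", "/>"):
        raise ValueError("junk characters in start tag: %r" % text[k:][:20])
    return tag, attrs, end == "/>"


print(split_start_tag('<A HREF="x.html" checked title=\'a &amp; b\'>'))
print(split_start_tag('<br />'))
```

Tag and attribute names are lower-cased, while attribute values keep their case. The third element of the result corresponds to the choice between `handle_startendtag()` and `handle_starttag()` in the method.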

    # Internal -- check to see if we have a complete starttag; return end
    # or -1 if incomplete.
    def check_for_whole_start_tag(self, i):
        rawdata = self.rawdata
        m = locatestarttagend.match(rawdata, i)
        if m:
            j = m.end()
            next = rawdata[j:j+1]
            if next == ">":
                return j + 1
            if next == "/":
                s = rawdata[j:j+2]
                if s == "/>":
                    return j + 2
                if s == "/":
                    # buffer boundary
                    return -1
                # else bogus input
                self.updatepos(i, j + 1)
                self.error("malformed empty start tag")
            if next == "":
                # end of input
                return -1
            if next in ("abcdefghijklmnopqrstuvwxyz=/"
                        "ABCDEFGHIJKLMNOPQRSTUVWXYZ"):
                # end of input in or before attribute value, or we have the
                # '/' from a '/>' ending
                return -1
            self.updatepos(i, j)
            self.error("malformed start tag")
        raise AssertionError("we should not get here!")
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
331
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
    # Internal -- parse endtag, return end or -1 if incomplete
    def parse_endtag(self, i):
        rawdata = self.rawdata
        assert rawdata[i:i+2] == "</", "unexpected call to parse_endtag"
        match = endendtag.search(rawdata, i+1) # >
        if not match:
            return -1
        j = match.end()
        match = endtagfind.match(rawdata, i) # </ + tag + >
        if not match:
            self.error("bad end tag: %s" % `rawdata[i:j]`)
        tag = string.lower(match.group(1))
        if (self.cdata_endtag is not None
                and tag != self.cdata_endtag):
            # Should be a mismatched end tag, but we'll treat it
            # as text anyway, since most HTML authors aren't
            # interested in the finer points of syntax.
            self.handle_data(match.group(0))
        else:
            self.handle_endtag(tag)
            self.clear_cdata_mode()
        return j

    # Overridable -- finish processing of start+end tag: <tag.../>
    def handle_startendtag(self, tag, attrs):
        self.handle_starttag(tag, attrs)
        self.handle_endtag(tag)

    # Overridable -- handle start tag
    def handle_starttag(self, tag, attrs):
        pass

    # Overridable -- handle end tag
    def handle_endtag(self, tag):
        pass

    # Overridable -- handle character reference
    def handle_charref(self, name):
        pass

    # Overridable -- handle entity reference
    def handle_entityref(self, name):
        pass

    # Overridable -- handle data
    def handle_data(self, data):
        pass

    # Overridable -- handle comment
    def handle_comment(self, data):
        pass

    # Overridable -- handle declaration
    def handle_decl(self, decl):
        pass

    # Overridable -- handle processing instruction
    def handle_pi(self, data):
        pass

    def unknown_decl(self, data):
        self.error("unknown declaration: " + `data`)

    # Internal -- helper to remove special character quoting
    def unescape(self, s):
        if '&' not in s:
            return s
        s = string.replace(s, "&lt;", "<")
        s = string.replace(s, "&gt;", ">")
        s = string.replace(s, "&apos;", "'")
        s = string.replace(s, "&quot;", '"')
        s = string.replace(s, "&amp;", "&") # Must be last
        return s
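
The "Must be last" note on the "&amp;" replacement in unescape is easier to see with a doubly escaped input. The sketch below is not part of HTMLParser.py. It assumes Python 3 and uses str.replace instead of the string module. The standalone function names unescape and unescape_amp_first are hypothetical and exist only for this comparison.

```python
def unescape(s):
    """Same replacement order as the method above: '&amp;' is handled last."""
    if "&" not in s:
        return s
    s = s.replace("&lt;", "<")
    s = s.replace("&gt;", ">")
    s = s.replace("&apos;", "'")
    s = s.replace("&quot;", '"')
    s = s.replace("&amp;", "&")  # must be last
    return s


def unescape_amp_first(s):
    """Deliberately wrong order (hypothetical), shown only for contrast."""
    s = s.replace("&amp;", "&")  # too early: creates new '&lt;' / '&gt;' sequences
    s = s.replace("&lt;", "<")
    s = s.replace("&gt;", ">")
    s = s.replace("&apos;", "'")
    s = s.replace("&quot;", '"')
    return s


# The literal text "&lt;b&gt;" after one round of escaping.
escaped = "&amp;lt;b&amp;gt;"

print(unescape(escaped))            # &lt;b&gt;  (one level of quoting removed)
print(unescape_amp_first(escaped))  # <b>        (two levels removed by mistake)
```

Replacing "&amp;" first turns "&amp;lt;" into "&lt;", which the next replacement then decodes a second time. Handling "&amp;" last means each entity is decoded exactly once.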
