annotate roundup/cgi/TAL/markupbase.py @ 6593:e70e2789bc2c

issue2551189 - increase text search maxlength

This removes I think all the magic references to 25 and 30 (varchar size) and replaces them with references to maxlength or maxlength+5.

I am not sure why the db column is 5 characters larger than the size of what should be the max size of a word, but I'll keep the buffer of 5 as making it 1/5 the size of maxlength makes less sense.

Also added tests for fts search in templating which were missing.

Added postgres, mysql and sqlite native indexing backends in which to test fts.

Added fts test to native-fts as well to make sure it's working.

I want to commit this now for CI.

Todo:

  add test cases for the use of FTS in the csv output in actions.py. There is no test coverage of the match case there.

  change maxlength to a higher value (50) as requested in the ticket.

  Modify existing extremewords test cases to allow words > 25 and < 51.

  write code to migrate column sizes for mysql and postgresql to match maxlength. I will roll this into the version 7 schema update that supports use of database fts support.
author John Rouillard <rouilj@ieee.org>
date Tue, 25 Jan 2022 13:22:00 -0500
parents 12fe83f90f0d
children
rev   line source
2348
8c2402a78bb0 beginning getting ZPT up to date: TAL first
Richard Jones <richard@users.sourceforge.net>
parents: 2005
diff changeset
1 """Shared support for scanning document type declarations in HTML and XHTML."""
1049
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
2
2348
8c2402a78bb0 beginning getting ZPT up to date: TAL first
Richard Jones <richard@users.sourceforge.net>
parents: 2005
diff changeset
3 import re, string
1049
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
4
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
5 _declname_match = re.compile(r'[a-zA-Z][-_.a-zA-Z0-9]*\s*').match
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
6 _declstringlit_match = re.compile(r'(\'[^\']*\'|"[^"]*")\s*').match
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
7
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
8 del re
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
9
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
10
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
11 class ParserBase:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
12 """Parser base class which provides some common support methods used
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
13 by the SGML/HTML and XHTML parsers."""
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
14
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
15 def reset(self):
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
16 self.lineno = 1
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
17 self.offset = 0
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
18
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
19 def getpos(self):
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
20 """Return current line number and offset."""
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
21 return self.lineno, self.offset
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
22
2348
8c2402a78bb0 beginning getting ZPT up to date: TAL first
Richard Jones <richard@users.sourceforge.net>
parents: 2005
diff changeset
23 def error(self, message):
8c2402a78bb0 beginning getting ZPT up to date: TAL first
Richard Jones <richard@users.sourceforge.net>
parents: 2005
diff changeset
24 """Return an error, showing current line number and offset.
8c2402a78bb0 beginning getting ZPT up to date: TAL first
Richard Jones <richard@users.sourceforge.net>
parents: 2005
diff changeset
25
8c2402a78bb0 beginning getting ZPT up to date: TAL first
Richard Jones <richard@users.sourceforge.net>
parents: 2005
diff changeset
26 Concrete subclasses *must* override this method.
8c2402a78bb0 beginning getting ZPT up to date: TAL first
Richard Jones <richard@users.sourceforge.net>
parents: 2005
diff changeset
27 """
8c2402a78bb0 beginning getting ZPT up to date: TAL first
Richard Jones <richard@users.sourceforge.net>
parents: 2005
diff changeset
28 raise NotImplementedError
8c2402a78bb0 beginning getting ZPT up to date: TAL first
Richard Jones <richard@users.sourceforge.net>
parents: 2005
diff changeset
29
1049
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
30 # Internal -- update line number and offset. This should be
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
31 # called for each piece of data exactly once, in order -- in other
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
32 # words the concatenation of all the input strings to this
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
33 # function should be exactly the entire input.
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
34 def updatepos(self, i, j):
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
35 if i >= j:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
36 return j
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
37 rawdata = self.rawdata
2348
8c2402a78bb0 beginning getting ZPT up to date: TAL first
Richard Jones <richard@users.sourceforge.net>
parents: 2005
diff changeset
38 nlines = rawdata.count("\n", i, j)
1049
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
39 if nlines:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
40 self.lineno = self.lineno + nlines
2348
8c2402a78bb0 beginning getting ZPT up to date: TAL first
Richard Jones <richard@users.sourceforge.net>
parents: 2005
diff changeset
41 pos = rawdata.rindex("\n", i, j) # Should not fail
1049
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
42 self.offset = j-(pos+1)
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
43 else:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
44 self.offset = self.offset + j-i
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
45 return j
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
46
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
47 _decl_otherchars = ''
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
48
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
49 # Internal -- parse declaration (for use by subclasses).
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
50 def parse_declaration(self, i):
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
51 # This is some sort of declaration; in "HTML as
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
52 # deployed," this should only be the document type
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
53 # declaration ("<!DOCTYPE html...>").
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
54 rawdata = self.rawdata
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
55 import sys
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
56 j = i + 2
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
57 assert rawdata[i:j] == "<!", "unexpected call to parse_declaration"
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
58 if rawdata[j:j+1] in ("-", ""):
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
59 # Start of comment followed by buffer boundary,
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
60 # or just a buffer boundary.
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
61 return -1
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
62 # in practice, this should look like: ((name|stringlit) S*)+ '>'
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
63 n = len(rawdata)
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
64 decltype, j = self._scan_name(j, i)
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
65 if j < 0:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
66 return j
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
67 if decltype == "doctype":
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
68 self._decl_otherchars = ''
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
69 while j < n:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
70 c = rawdata[j]
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
71 if c == ">":
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
72 # end of declaration syntax
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
73 data = rawdata[i+2:j]
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
74 if decltype == "doctype":
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
75 self.handle_decl(data)
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
76 else:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
77 self.unknown_decl(data)
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
78 return j + 1
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
79 if c in "\"'":
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
80 m = _declstringlit_match(rawdata, j)
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
81 if not m:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
82 return -1 # incomplete
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
83 j = m.end()
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
84 elif c in "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ":
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
85 name, j = self._scan_name(j, i)
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
86 elif c in self._decl_otherchars:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
87 j = j + 1
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
88 elif c == "[":
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
89 if decltype == "doctype":
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
90 j = self._parse_doctype_subset(j + 1, i)
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
91 else:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
92 self.error("unexpected '[' char in declaration")
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
93 else:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
94 self.error(
5377
12fe83f90f0d Python 3 preparation: use repr() instead of ``.
Joseph Myers <jsm@polyomino.org.uk>
parents: 2348
diff changeset
95 "unexpected %s char in declaration" % repr(rawdata[j]))
1049
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
96 if j < 0:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
97 return j
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
98 return -1 # incomplete
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
99
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
100 # Internal -- scan past the internal subset in a <!DOCTYPE declaration,
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
101 # returning the index just past any whitespace following the trailing ']'.
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
102 def _parse_doctype_subset(self, i, declstartpos):
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
103 rawdata = self.rawdata
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
104 n = len(rawdata)
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
105 j = i
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
106 while j < n:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
107 c = rawdata[j]
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
108 if c == "<":
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
109 s = rawdata[j:j+2]
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
110 if s == "<":
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
111 # end of buffer; incomplete
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
112 return -1
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
113 if s != "<!":
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
114 self.updatepos(declstartpos, j + 1)
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
115 self.error("unexpected char in internal subset (in %s)"
5377
12fe83f90f0d Python 3 preparation: use repr() instead of ``.
Joseph Myers <jsm@polyomino.org.uk>
parents: 2348
diff changeset
116 % repr(s))
1049
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
117 if (j + 2) == n:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
118 # end of buffer; incomplete
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
119 return -1
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
120 if (j + 4) > n:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
121 # end of buffer; incomplete
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
122 return -1
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
123 if rawdata[j:j+4] == "<!--":
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
124 j = self.parse_comment(j, report=0)
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
125 if j < 0:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
126 return j
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
127 continue
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
128 name, j = self._scan_name(j + 2, declstartpos)
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
129 if j == -1:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
130 return -1
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
131 if name not in ("attlist", "element", "entity", "notation"):
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
132 self.updatepos(declstartpos, j + 2)
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
133 self.error(
5377
12fe83f90f0d Python 3 preparation: use repr() instead of ``.
Joseph Myers <jsm@polyomino.org.uk>
parents: 2348
diff changeset
134 "unknown declaration %s in internal subset" % repr(name))
1049
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
135 # handle the individual names
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
136 meth = getattr(self, "_parse_doctype_" + name)
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
137 j = meth(j, declstartpos)
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
138 if j < 0:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
139 return j
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
140 elif c == "%":
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
141 # parameter entity reference
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
142 if (j + 1) == n:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
143 # end of buffer; incomplete
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
144 return -1
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
145 s, j = self._scan_name(j + 1, declstartpos)
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
146 if j < 0:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
147 return j
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
148 if rawdata[j] == ";":
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
149 j = j + 1
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
150 elif c == "]":
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
151 j = j + 1
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
152 while j < n and rawdata[j] in string.whitespace:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
153 j = j + 1
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
154 if j < n:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
155 if rawdata[j] == ">":
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
156 return j
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
157 self.updatepos(declstartpos, j)
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
158 self.error("unexpected char after internal subset")
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
159 else:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
160 return -1
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
161 elif c in string.whitespace:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
162 j = j + 1
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
163 else:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
164 self.updatepos(declstartpos, j)
5377
12fe83f90f0d Python 3 preparation: use repr() instead of ``.
Joseph Myers <jsm@polyomino.org.uk>
parents: 2348
diff changeset
165 self.error("unexpected char %s in internal subset" % repr(c))
1049
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
166 # end of buffer reached
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
167 return -1
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
168
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
169 # Internal -- scan past <!ELEMENT declarations
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
170 def _parse_doctype_element(self, i, declstartpos):
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
171 rawdata = self.rawdata
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
172 n = len(rawdata)
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
173 name, j = self._scan_name(i, declstartpos)
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
174 if j == -1:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
175 return -1
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
176 # style content model; just skip until '>'
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
177 if '>' in rawdata[j:]:
2348
8c2402a78bb0 beginning getting ZPT up to date: TAL first
Richard Jones <richard@users.sourceforge.net>
parents: 2005
diff changeset
178 return rawdata.find(">", j) + 1
1049
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
179 return -1
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
180
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
181 # Internal -- scan past <!ATTLIST declarations
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
182 def _parse_doctype_attlist(self, i, declstartpos):
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
183 rawdata = self.rawdata
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
184 name, j = self._scan_name(i, declstartpos)
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
185 c = rawdata[j:j+1]
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
186 if c == "":
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
187 return -1
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
188 if c == ">":
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
189 return j + 1
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
190 while 1:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
191 # scan a series of attribute descriptions; simplified:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
192 # name type [value] [#constraint]
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
193 name, j = self._scan_name(j, declstartpos)
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
194 if j < 0:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
195 return j
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
196 c = rawdata[j:j+1]
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
197 if c == "":
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
198 return -1
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
199 if c == "(":
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
200 # an enumerated type; look for ')'
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
201 if ")" in rawdata[j:]:
2348
8c2402a78bb0 beginning getting ZPT up to date: TAL first
Richard Jones <richard@users.sourceforge.net>
parents: 2005
diff changeset
202 j = rawdata.find(")", j) + 1
1049
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
203 else:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
204 return -1
2348
8c2402a78bb0 beginning getting ZPT up to date: TAL first
Richard Jones <richard@users.sourceforge.net>
parents: 2005
diff changeset
205 while rawdata[j:j+1].isspace():
1049
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
206 j = j + 1
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
207 if not rawdata[j:]:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
208 # end of buffer, incomplete
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
209 return -1
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
210 else:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
211 name, j = self._scan_name(j, declstartpos)
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
212 c = rawdata[j:j+1]
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
213 if not c:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
214 return -1
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
215 if c in "'\"":
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
216 m = _declstringlit_match(rawdata, j)
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
217 if m:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
218 j = m.end()
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
219 else:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
220 return -1
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
221 c = rawdata[j:j+1]
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
222 if not c:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
223 return -1
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
224 if c == "#":
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
225 if rawdata[j:] == "#":
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
226 # end of buffer
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
227 return -1
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
228 name, j = self._scan_name(j + 1, declstartpos)
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
229 if j < 0:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
230 return j
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
231 c = rawdata[j:j+1]
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
232 if not c:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
233 return -1
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
234 if c == '>':
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
235 # all done
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
236 return j + 1
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
237
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
238 # Internal -- scan past <!NOTATION declarations
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
239 def _parse_doctype_notation(self, i, declstartpos):
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
240 name, j = self._scan_name(i, declstartpos)
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
241 if j < 0:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
242 return j
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
243 rawdata = self.rawdata
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
244 while 1:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
245 c = rawdata[j:j+1]
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
246 if not c:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
247 # end of buffer; incomplete
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
248 return -1
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
249 if c == '>':
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
250 return j + 1
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
251 if c in "'\"":
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
252 m = _declstringlit_match(rawdata, j)
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
253 if not m:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
254 return -1
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
255 j = m.end()
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
256 else:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
257 name, j = self._scan_name(j, declstartpos)
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
258 if j < 0:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
259 return j
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
260
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
261 # Internal -- scan past <!ENTITY declarations
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
262 def _parse_doctype_entity(self, i, declstartpos):
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
263 rawdata = self.rawdata
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
264 if rawdata[i:i+1] == "%":
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
265 j = i + 1
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
266 while 1:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
267 c = rawdata[j:j+1]
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
268 if not c:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
269 return -1
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
270 if c in string.whitespace:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
271 j = j + 1
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
272 else:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
273 break
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
274 else:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
275 j = i
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
276 name, j = self._scan_name(j, declstartpos)
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
277 if j < 0:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
278 return j
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
279 while 1:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
280 c = self.rawdata[j:j+1]
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
281 if not c:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
282 return -1
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
283 if c in "'\"":
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
284 m = _declstringlit_match(rawdata, j)
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
285 if m:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
286 j = m.end()
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
287 else:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
288 return -1 # incomplete
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
289 elif c == ">":
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
290 return j + 1
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
291 else:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
292 name, j = self._scan_name(j, declstartpos)
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
293 if j < 0:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
294 return j
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
295
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
296 # Internal -- scan a name token; return the token and the new position,
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
297 # or (None, -1) if we've reached the end of the buffer.
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
298 def _scan_name(self, i, declstartpos):
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
299 rawdata = self.rawdata
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
300 n = len(rawdata)
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
301 if i == n:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
302 return None, -1
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
303 m = _declname_match(rawdata, i)
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
304 if m:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
305 s = m.group()
2348
8c2402a78bb0 beginning getting ZPT up to date: TAL first
Richard Jones <richard@users.sourceforge.net>
parents: 2005
diff changeset
306 name = s.strip()
1049
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
307 if (i + len(s)) == n:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
308 return None, -1 # end of buffer
2348
8c2402a78bb0 beginning getting ZPT up to date: TAL first
Richard Jones <richard@users.sourceforge.net>
parents: 2005
diff changeset
309 return name.lower(), m.end()
1049
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
310 else:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
311 self.updatepos(declstartpos, i)
2348
8c2402a78bb0 beginning getting ZPT up to date: TAL first
Richard Jones <richard@users.sourceforge.net>
parents: 2005
diff changeset
312 self.error("expected name token")
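
The position bookkeeping in updatepos and the name scanning in _scan_name can be tried in isolation. The following is a minimal sketch, not the module's API: update_position and scan_name are hypothetical standalone functions that mirror the logic of the two methods above, taking the buffer and position state as arguments instead of reading them from self, and a plain ValueError stands in for the subclass-provided self.error().

```python
import re

# Same pattern as _declname_match in the listing: a name plus trailing whitespace.
_declname_match = re.compile(r'[a-zA-Z][-_.a-zA-Z0-9]*\s*').match


def update_position(rawdata, lineno, offset, i, j):
    """Hypothetical standalone mirror of ParserBase.updatepos.

    Advances (lineno, offset) over rawdata[i:j] and returns the new pair.
    """
    if i >= j:
        return lineno, offset
    nlines = rawdata.count("\n", i, j)
    if nlines:
        lineno += nlines
        pos = rawdata.rindex("\n", i, j)  # cannot fail: nlines > 0
        offset = j - (pos + 1)            # columns since the last newline
    else:
        offset += j - i
    return lineno, offset


def scan_name(rawdata, i):
    """Hypothetical standalone mirror of ParserBase._scan_name.

    Returns (lowercased name, new index), or (None, -1) when the buffer
    ends before the name can be known to be complete.
    """
    n = len(rawdata)
    if i == n:
        return None, -1
    m = _declname_match(rawdata, i)
    if not m:
        # Assumption: ValueError replaces self.error("expected name token").
        raise ValueError("expected name token")
    s = m.group()
    if i + len(s) == n:
        return None, -1  # match runs to the end of the buffer; need more data
    return s.strip().lower(), m.end()


data = "<!DOCTYPE html>\n<p>hi</p>"
print(update_position(data, 1, 0, 0, 16))   # (2, 0): consumed through the newline
print(update_position(data, 2, 0, 16, 19))  # (2, 3): "<p>" on the second line
print(scan_name(data, 2))                   # ('doctype', 10)
print(scan_name("<!DOCTY", 2))              # (None, -1): incomplete buffer
```

Returning -1 for an incomplete buffer, rather than raising, lets a caller wait for more input and retry from the same start index, which is the convention the methods in this file follow.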

Roundup Issue Tracker: http://roundup-tracker.org/