annotate roundup/cgi/TAL/XMLParser.py @ 5096:e74c3611b138

- issue2550636, issue2550909: Added support for Whoosh indexer. Also adds a new config.ini setting called indexer to select the indexer. See ``doc/upgrading.txt`` for details. Initial patch done by David Wolever. Patch modified (see ticket or below for changes), docs updated and committed.

  I have an outstanding issue with test/test_indexer.py. I have to comment out all imports and tests for indexers I don't have (i.e. mysql, postgres), otherwise no tests run. With that change made, the dbm, sqlite (rdbms), xapian and whoosh indexers all pass the indexer tests.

  Changes summary:

  1) Support native back ends dbm and rdbms (original patch only fell through to dbm).
  2) Developed whoosh stopfilter to not index stopwords or words outside the maxlength and minlength limits defined in index_common.py. Required to pass the extremewords test_indexer test. Also removed a call to .lower on the input text, as the tokenizer I chose lowercases automatically.
  3) Added support for max/min length to find. This was needed to pass the extremewords test.
  4) Added back a call to save_index in add_text. This allowed all but two tests to pass.
  5) Fixed a call to: results = searcher.search(query.Term("identifier", identifier)) which had an extra parameter that is an error under current whoosh.
  6) Set limit=None in the search call for find(), otherwise it only returns 10 items. This allowed it to pass the manyresults test.

  Also, due to changes in the roundup code, removed the call in indexer_whoosh to "from roundup.anypy.sets_ import set", since we use the python builtin set.
author John Rouillard <rouilj@ieee.org>
date Sat, 25 Jun 2016 20:10:03 -0400
parents 8c2402a78bb0
children 88dbacd11cd1
changesets (all by Richard Jones <richard@users.sourceforge.net>):
1049  (no changeset hash or parent listed)
1071  c08b3820edd1 Adhering to ZPL (parents: 1049)
2348  8c2402a78bb0 beginning getting ZPT up to date: TAL first (parents: 2005)

 rev  line  source
1049     1  ##############################################################################
1049     2  #
1049     3  # Copyright (c) 2001, 2002 Zope Corporation and Contributors.
1049     4  # All Rights Reserved.
2348     5  #
1049     6  # This software is subject to the provisions of the Zope Public License,
1049     7  # Version 2.0 (ZPL). A copy of the ZPL should accompany this distribution.
1049     8  # THIS SOFTWARE IS PROVIDED "AS IS" AND ANY AND ALL EXPRESS OR IMPLIED
1049     9  # WARRANTIES ARE DISCLAIMED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
1049    10  # WARRANTIES OF TITLE, MERCHANTABILITY, AGAINST INFRINGEMENT, AND FITNESS
1049    11  # FOR A PARTICULAR PURPOSE
2348    12  #
1049    13  ##############################################################################
2348    14  # Modifications for Roundup:
2348    15  # 1. commented out zLOG references
2348    16  """
2348    17  Generic expat-based XML parser base class.
2348    18  """
1071    19
2348    20  #import zLOG
1049    21
1049    22  class XMLParser:
1049    23
1049    24      ordered_attributes = 0
1049    25
1049    26      handler_names = [
1049    27          "StartElementHandler",
1049    28          "EndElementHandler",
1049    29          "ProcessingInstructionHandler",
1049    30          "CharacterDataHandler",
1049    31          "UnparsedEntityDeclHandler",
1049    32          "NotationDeclHandler",
1049    33          "StartNamespaceDeclHandler",
1049    34          "EndNamespaceDeclHandler",
1049    35          "CommentHandler",
1049    36          "StartCdataSectionHandler",
1049    37          "EndCdataSectionHandler",
1049    38          "DefaultHandler",
1049    39          "DefaultHandlerExpand",
1049    40          "NotStandaloneHandler",
1049    41          "ExternalEntityRefHandler",
1049    42          "XmlDeclHandler",
1049    43          "StartDoctypeDeclHandler",
1049    44          "EndDoctypeDeclHandler",
1049    45          "ElementDeclHandler",
1049    46          "AttlistDeclHandler"
1049    47          ]
1049    48
1049    49      def __init__(self, encoding=None):
1049    50          self.parser = p = self.createParser()
1049    51          if self.ordered_attributes:
1049    52              try:
1049    53                  self.parser.ordered_attributes = self.ordered_attributes
1049    54              except AttributeError:
1049    55                  #zLOG.LOG("TAL.XMLParser", zLOG.INFO,
1049    56                  #         "Can't set ordered_attributes")
1049    57                  self.ordered_attributes = 0
1049    58          for name in self.handler_names:
1049    59              method = getattr(self, name, None)
1049    60              if method is not None:
1049    61                  try:
1049    62                      setattr(p, name, method)
1049    63                  except AttributeError:
1049    64                      #zLOG.LOG("TAL.XMLParser", zLOG.PROBLEM,
1049    65                      #         "Can't set expat handler %s" % name)
1049    66                      pass
1049    67
1049    68      def createParser(self, encoding=None):
1049    69          global XMLParseError
1049    70          try:
1049    71              from Products.ParsedXML.Expat import pyexpat
1049    72              XMLParseError = pyexpat.ExpatError
1049    73              return pyexpat.ParserCreate(encoding, ' ')
1049    74          except ImportError:
1049    75              from xml.parsers import expat
1049    76              XMLParseError = expat.ExpatError
1049    77              return expat.ParserCreate(encoding, ' ')
1049    78
1049    79      def parseFile(self, filename):
1049    80          self.parseStream(open(filename))
1049    81
1049    82      def parseString(self, s):
1049    83          self.parser.Parse(s, 1)
1049    84
1049    85      def parseURL(self, url):
1049    86          import urllib
1049    87          self.parseStream(urllib.urlopen(url))
1049    88
1049    89      def parseStream(self, stream):
1049    90          self.parser.ParseFile(stream)
1049    91
1049    92      def parseFragment(self, s, end=0):
1049    93          self.parser.Parse(s, end)
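
The handler registration done in `__init__` (source lines 58-66) can be tried outside Roundup with a condensed stand-in. The sketch below is not the Roundup class: `MiniParser` and `TagCollector` are hypothetical names, the handler list is shortened to three entries, and it assumes Python 3 with the standard-library `xml.parsers.expat` module. It only illustrates how a subclass method named after an expat handler gets attached to the parser, while names with no matching method are skipped.

```python
from xml.parsers import expat


class MiniParser:
    """Condensed, hypothetical stand-in for XMLParser (illustration only)."""

    # Shortened list; the real class enumerates twenty handler names.
    handler_names = [
        "StartElementHandler",
        "EndElementHandler",
        "CharacterDataHandler",
    ]

    def __init__(self, encoding=None):
        # ' ' is the namespace separator, as in the ParserCreate calls above.
        self.parser = expat.ParserCreate(encoding, ' ')
        for name in self.handler_names:
            # Only methods that the subclass actually defines are attached.
            method = getattr(self, name, None)
            if method is not None:
                setattr(self.parser, name, method)

    def parseString(self, s):
        # The second argument marks this as the final chunk of input.
        self.parser.Parse(s, True)


class TagCollector(MiniParser):
    """Records start and end tags; defines no CharacterDataHandler."""

    def __init__(self, encoding=None):
        self.events = []
        MiniParser.__init__(self, encoding)

    def StartElementHandler(self, name, attrs):
        self.events.append(("start", name, dict(attrs)))

    def EndElementHandler(self, name):
        self.events.append(("end", name))


collector = TagCollector()
collector.parseString('<doc><item id="1"/></doc>')
print(collector.events)
```

Because `TagCollector` defines no `CharacterDataHandler`, `getattr` returns `None` for that name and the loop leaves the parser's default behaviour in place.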

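`parseFragment(s, end=0)` passes its `end` flag straight to expat's `Parse`, so a document can be fed in pieces and closed with a final call. A minimal sketch of that pattern, assuming Python 3 and using the standard-library expat parser directly rather than the class above (the `seen` list and the chunk boundaries are made up for illustration):

```python
from xml.parsers import expat

# Same ParserCreate arguments as the listing: no encoding override, ' ' separator.
parser = expat.ParserCreate(None, ' ')

seen = []
parser.StartElementHandler = lambda name, attrs: seen.append(name)

# Chunks may split a tag in the middle; expat buffers the incomplete token.
for chunk in ('<root><a', '/><b/>', '</root>'):
    parser.Parse(chunk, False)   # more input will follow

parser.Parse('', True)           # final call: signals end of input
print(seen)
```

Calling `parseString` instead is equivalent to a single `Parse` call with the final flag set, which is why it cannot be used for partial input.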
Roundup Issue Tracker: http://roundup-tracker.org/