annotate roundup/backends/indexer_dbm.py @ 5973:fe334430ca07

issue2550919 - Anti-bot signup using 4 second delay. Took the code by Erik Forsberg and massaged it into the core, so it is no longer needed in the tracker. Updated the devel and responsive trackers to remove timestamp.py and update the input field name. Docs, changes and tests complete. Hopefully these tracker changes won't cause an issue for other tests.
author John Rouillard <rouilj@ieee.org>
date Sat, 09 Nov 2019 00:30:37 -0500
parents 8e4c5db44fde
children 3175bb92ca28
rev   line source
2089
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
1 #
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
2 # This module is derived from the module described at:
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
3 # http://gnosis.cx/publish/programming/charming_python_15.txt
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
4 #
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
5 # Author: David Mertz (mertz@gnosis.cx)
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
6 # Thanks to: Pat Knight (p.knight@ktgroup.co.uk)
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
7 # Gregory Popovitch (greg@gpy.com)
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
8 #
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
9 # The original module was released under this license, and remains under
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
10 # it:
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
11 #
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
12 # This file is released to the public domain. I (dqm) would
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
13 # appreciate it if you choose to keep derived works under terms
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
14 # that promote freedom, but obviously am giving up any rights
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
15 # to compel such.
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
16 #
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
17 '''This module provides an indexer class, Indexer, that stores text
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
18 indices in a roundup instance. This class makes searching the content of
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
19 messages, string properties and text files possible.
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
20 '''
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
21 __docformat__ = 'restructuredtext'
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
22
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
23 import os, shutil, re, mimetypes, marshal, zlib, errno
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
24 from roundup.hyperdb import Link, Multilink
3544
5cd1c83dea50 Features and fixes.
Richard Jones <richard@users.sourceforge.net>
parents: 3295
diff changeset
25 from roundup.backends.indexer_common import Indexer as IndexerBase
2872
d530b68e4b42 don't index common words [SF#1046612]
Richard Jones <richard@users.sourceforge.net>
parents: 2089
diff changeset
26
3544
5cd1c83dea50 Features and fixes.
Richard Jones <richard@users.sourceforge.net>
parents: 3295
diff changeset
27 class Indexer(IndexerBase):
2089
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
28 '''Indexes information from roundup's hyperdb to allow efficient
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
29 searching.
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
30
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
31 Three structures are created by the indexer::
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
32
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
33 files {identifier: (fileid, wordcount)}
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
34 words {word: {fileid: count}}
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
35 fileids {fileid: identifier}
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
36
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
37 where identifier is (classname, nodeid, propertyname)
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
38 '''
3295
a615cc230160 added Xapian indexer; replaces standard indexers if Xapian is available
Richard Jones <richard@users.sourceforge.net>
parents: 3092
diff changeset
39 def __init__(self, db):
3544
5cd1c83dea50 Features and fixes.
Richard Jones <richard@users.sourceforge.net>
parents: 3295
diff changeset
40 IndexerBase.__init__(self, db)
3295
a615cc230160 added Xapian indexer; replaces standard indexers if Xapian is available
Richard Jones <richard@users.sourceforge.net>
parents: 3092
diff changeset
41 self.indexdb_path = os.path.join(db.config.DATABASE, 'indexes')
2089
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
42 self.indexdb = os.path.join(self.indexdb_path, 'index.db')
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
43 self.reindex = 0
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
44 self.quiet = 9
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
45 self.changed = 0
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
46
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
47 # see if we need to reindex because of a change in code
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
48 version = os.path.join(self.indexdb_path, 'version')
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
49 if (not os.path.exists(self.indexdb_path) or
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
50 not os.path.exists(version)):
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
51 # for now the file itself is a flag
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
52 self.force_reindex()
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
53 elif os.path.exists(version):
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
54 version = open(version).read()
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
55 # check the value and reindex if it's not the latest
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
56 if version.strip() != '1':
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
57 self.force_reindex()
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
58
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
59 def force_reindex(self):
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
60 '''Force a reindex condition
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
61 '''
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
62 if os.path.exists(self.indexdb_path):
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
63 shutil.rmtree(self.indexdb_path)
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
64 os.makedirs(self.indexdb_path)
5380
64c4e43fbb84 Python 3 preparation: numeric literal syntax.
Joseph Myers <jsm@polyomino.org.uk>
parents: 5248
diff changeset
65 os.chmod(self.indexdb_path, 0o775)
2089
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
66 open(os.path.join(self.indexdb_path, 'version'), 'w').write('1\n')
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
67 self.reindex = 1
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
68 self.changed = 1
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
69
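Note: the version-flag check in __init__ and the reset in force_reindex can be sketched on their own as below. This is a minimal illustration under stated assumptions, not the Roundup code: the names needs_reindex, reset_index and INDEX_VERSION are hypothetical, a temporary directory stands in for db.config.DATABASE, and the chmod call is left out. Unlike the listing, the sketch closes its file handles explicitly with "with" blocks.

```python
import os
import shutil
import tempfile

INDEX_VERSION = '1'  # hypothetical constant; the listing compares against the literal '1'


def needs_reindex(index_dir):
    """Return True when the index directory or its version flag is missing or stale."""
    version_file = os.path.join(index_dir, 'version')
    if not os.path.exists(version_file):
        # covers both a missing directory and a missing flag file
        return True
    with open(version_file) as f:
        return f.read().strip() != INDEX_VERSION


def reset_index(index_dir):
    """Throw away any existing index and write a fresh version flag."""
    if os.path.exists(index_dir):
        shutil.rmtree(index_dir)
    os.makedirs(index_dir)
    with open(os.path.join(index_dir, 'version'), 'w') as f:
        f.write(INDEX_VERSION + '\n')


# usage: a scratch directory stands in for the tracker database directory
base = tempfile.mkdtemp()
index_dir = os.path.join(base, 'indexes')
before = needs_reindex(index_dir)   # nothing exists yet
reset_index(index_dir)
after = needs_reindex(index_dir)    # flag file now present and current
shutil.rmtree(base)
```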
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
70 def should_reindex(self):
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
71 '''Should we reindex?
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
72 '''
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
73 return self.reindex
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
74
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
75 def add_text(self, identifier, text, mime_type='text/plain'):
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
76 '''Add some text associated with the (classname, nodeid, property)
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
77 identifier.
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
78 '''
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
79 # make sure the index is loaded
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
80 self.load_index()
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
81
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
82 # remove old entries for this identifier
4357
13b3155869e0 Beginnings of a big code cleanup / modernisation to make 2to3 happy
Richard Jones <richard@users.sourceforge.net>
parents: 4252
diff changeset
83 if identifier in self.files:
2089
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
84 self.purge_entry(identifier)
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
85
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
86 # split into words
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
87 words = self.splitter(text, mime_type)
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
88
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
89 # Find new file index, and assign it to identifier
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
90 # (_TOP uses trick of negative to avoid conflict with file index)
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
91 self.files['_TOP'] = (self.files['_TOP'][0]-1, None)
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
92 file_index = abs(self.files['_TOP'][0])
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
93 self.files[identifier] = (file_index, len(words))
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
94 self.fileids[file_index] = identifier
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
95
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
96 # find the unique words
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
97 filedict = {}
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
98 for word in words:
3544
5cd1c83dea50 Features and fixes.
Richard Jones <richard@users.sourceforge.net>
parents: 3295
diff changeset
99 if self.is_stopword(word):
2872
d530b68e4b42 don't index common words [SF#1046612]
Richard Jones <richard@users.sourceforge.net>
parents: 2089
diff changeset
100 continue
4357
13b3155869e0 Beginnings of a big code cleanup / modernisation to make 2to3 happy
Richard Jones <richard@users.sourceforge.net>
parents: 4252
diff changeset
101 if word in filedict:
2089
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
102 filedict[word] = filedict[word]+1
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
103 else:
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
104 filedict[word] = 1
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
105
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
106 # now add to the totals
4357
13b3155869e0 Beginnings of a big code cleanup / modernisation to make 2to3 happy
Richard Jones <richard@users.sourceforge.net>
parents: 4252
diff changeset
107 for word in filedict:
2089
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
108 # each word has a dict of {fileid: count}
4357
13b3155869e0 Beginnings of a big code cleanup / modernisation to make 2to3 happy
Richard Jones <richard@users.sourceforge.net>
parents: 4252
diff changeset
109 if word in self.words:
2089
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
110 entry = self.words[word]
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
111 else:
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
112 # new word
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
113 entry = {}
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
114 self.words[word] = entry
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
115
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
116 # make a reference to the file for this word
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
117 entry[file_index] = filedict[word]
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
118
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
119 # save needed
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
120 self.changed = 1
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
121
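Note: the bookkeeping done by add_text on the three structures (files, fileids, words) can be sketched in isolation as below. This is a minimal standalone illustration, not the Roundup implementation: the function name add_text_sketch and the plain dict arguments are assumptions, and the purge of old entries and the stop-word check are omitted.

```python
def add_text_sketch(files, fileids, words, identifier, tokens):
    """Hypothetical standalone version of the add_text bookkeeping."""
    # '_TOP' counts downwards, so its stored value is never a valid
    # (positive) file index; abs() yields the next free index
    files['_TOP'] = (files['_TOP'][0] - 1, None)
    file_index = abs(files['_TOP'][0])
    files[identifier] = (file_index, len(tokens))
    fileids[file_index] = identifier

    # count how often each word occurs in this text
    counts = {}
    for token in tokens:
        counts[token] = counts.get(token, 0) + 1

    # merge into the global mapping {word: {fileid: count}}
    for word, count in counts.items():
        words.setdefault(word, {})[file_index] = count
    return file_index


# usage: start from the same empty layout that load_index builds
files = {'_TOP': (0, None)}
fileids = {}
words = {}
ident = ('issue', '1', 'title')
add_text_sketch(files, fileids, words, ident, ['HELLO', 'WORLD', 'HELLO'])
```

After the call, files maps the identifier to (1, 3), fileids maps 1 back to the identifier, and words holds per-file counts for HELLO and WORLD.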
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
122 def splitter(self, text, ftype):
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
123 '''Split the contents of a text string into a list of 'words'
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
124 '''
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
125 if ftype == 'text/plain':
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
126 words = self.text_splitter(text)
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
127 else:
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
128 return []
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
129 return words
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
130
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
131 def text_splitter(self, text):
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
132 """Split text/plain string into a list of words
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
133 """
5966
8e4c5db44fde Handle memory db indexer test
John Rouillard <rouilj@ieee.org>
parents: 5963
diff changeset
134 if not text:
8e4c5db44fde Handle memory db indexer test
John Rouillard <rouilj@ieee.org>
parents: 5963
diff changeset
135 return []
8e4c5db44fde Handle memory db indexer test
John Rouillard <rouilj@ieee.org>
parents: 5963
diff changeset
136
2089
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
137 # case insensitive
5963
4c7662c86a36 fixed the dbm indexer test for unicode under python2.
John Rouillard <rouilj@ieee.org>
parents: 5470
diff changeset
138 text = text.upper()
2089
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
139
4252
2ff6f39aa391 Indexers behaviour made more consistent regarding length of indexed words...
Bernhard Reiter <Bernhard.Reiter@intevation.de>
parents: 3613
diff changeset
140 # Split the raw text
2ff6f39aa391 Indexers behaviour made more consistent regarding length of indexed words...
Bernhard Reiter <Bernhard.Reiter@intevation.de>
parents: 3613
diff changeset
141 return re.findall(r'\b\w{%d,%d}\b' % (self.minlength, self.maxlength),
5963
4c7662c86a36 fixed the dbm indexer test for unicode under python2.
John Rouillard <rouilj@ieee.org>
parents: 5470
diff changeset
142 text, re.UNICODE)
2089
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
143
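Note: the effect of the length-bounded pattern in text_splitter can be seen with a small standalone sketch. The function name text_splitter_sketch is hypothetical, and the default bounds of 2 and 25 are assumptions chosen for illustration; in the listing they come from self.minlength and self.maxlength.

```python
import re


def text_splitter_sketch(text, minlength=2, maxlength=25):
    """Hypothetical standalone version of the text/plain splitter."""
    if not text:
        return []
    # case insensitive: everything is indexed in upper case
    text = text.upper()
    # \b on both sides means a run of word characters only matches as a
    # whole, so words shorter or longer than the bounds are dropped
    # entirely rather than truncated
    pattern = r'\b\w{%d,%d}\b' % (minlength, maxlength)
    return re.findall(pattern, text, re.UNICODE)


tokens = text_splitter_sketch("Hi, a flat tyre on the road")
short_only = text_splitter_sketch("abcdefgh ab", maxlength=5)
```

The single-letter word "a" and the eight-letter word "abcdefgh" (with maxlength=5) produce no tokens at all, which is why find later skips query words outside the same bounds.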
4252
2ff6f39aa391 Indexers behaviour made more consistent regarding length of indexed words...
Bernhard Reiter <Bernhard.Reiter@intevation.de>
parents: 3613
diff changeset
144 # we override this to ignore too short and too long words
2ff6f39aa391 Indexers behaviour made more consistent regarding length of indexed words...
Bernhard Reiter <Bernhard.Reiter@intevation.de>
parents: 3613
diff changeset
145 # and also to fix a bug - the (fail) case.
2089
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
146 def find(self, wordlist):
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
147 '''Locate files that match ALL the words in wordlist
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
148 '''
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
149 if not hasattr(self, 'words'):
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
150 self.load_index()
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
151 self.load_index(wordlist=wordlist)
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
152 entries = {}
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
153 hits = None
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
154 for word in wordlist:
4252
2ff6f39aa391 Indexers behaviour made more consistent regarding length of indexed words...
Bernhard Reiter <Bernhard.Reiter@intevation.de>
parents: 3613
diff changeset
155 if not self.minlength <= len(word) <= self.maxlength:
2089
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
156 # word outside the bounds of what we index - ignore
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
157 continue
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
158 word = word.upper()
4252
2ff6f39aa391 Indexers behaviour made more consistent regarding length of indexed words...
Bernhard Reiter <Bernhard.Reiter@intevation.de>
parents: 3613
diff changeset
159 if self.is_stopword(word):
2ff6f39aa391 Indexers behaviour made more consistent regarding length of indexed words...
Bernhard Reiter <Bernhard.Reiter@intevation.de>
parents: 3613
diff changeset
160 continue
2089
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
161 entry = self.words.get(word) # For each word, get index
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
162 entries[word] = entry # of matching files
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
163 if not entry: # Nothing for this one word (fail)
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
164 return {}
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
165 if hits is None:
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
166 hits = {}
4357
13b3155869e0 Beginnings of a big code cleanup / modernisation to make 2to3 happy
Richard Jones <richard@users.sourceforge.net>
parents: 4252
diff changeset
167 for k in entry:
13b3155869e0 Beginnings of a big code cleanup / modernisation to make 2to3 happy
Richard Jones <richard@users.sourceforge.net>
parents: 4252
diff changeset
168 if k not in self.fileids:
13b3155869e0 Beginnings of a big code cleanup / modernisation to make 2to3 happy
Richard Jones <richard@users.sourceforge.net>
parents: 4252
diff changeset
169 raise ValueError('Index is corrupted: re-generate it')
2089
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
170 hits[k] = self.fileids[k]
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
171 else:
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
172 # Eliminate hits for every non-match
4362
74476eaac38a more modernisation
Richard Jones <richard@users.sourceforge.net>
parents: 4357
diff changeset
173 for fileid in list(hits):
4357
13b3155869e0 Beginnings of a big code cleanup / modernisation to make 2to3 happy
Richard Jones <richard@users.sourceforge.net>
parents: 4252
diff changeset
174 if fileid not in entry:
2089
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
175 del hits[fileid]
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
176 if hits is None:
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
177 return {}
4357
13b3155869e0 Beginnings of a big code cleanup / modernisation to make 2to3 happy
Richard Jones <richard@users.sourceforge.net>
parents: 4252
diff changeset
178 return list(hits.values())
2089
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
179
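Note: find keeps only the files that contain every query word, which amounts to intersecting the per-word fileid sets. The sketch below shows that idea under stated assumptions and is not the code in the listing: find_sketch is a hypothetical name, it uses Python sets instead of deleting keys from a dict, it returns an empty list on every failure path (the listing returns {} there), and it leaves out the length and stop-word checks.

```python
def find_sketch(words, fileids, wordlist):
    """Hypothetical AND-search over a {word: {fileid: count}} mapping."""
    hits = None
    for word in wordlist:
        entry = words.get(word.upper())
        if not entry:
            # one word with no matches means no file can match ALL words
            return []
        if hits is None:
            hits = set(entry)       # first word: every file it occurs in
        else:
            hits &= set(entry)      # later words: keep only common files
    if hits is None:
        return []                   # empty query
    # translate file ids back to (classname, nodeid, propertyname)
    return [fileids[fileid] for fileid in sorted(hits)]


# usage with a tiny hand-built index
words = {'HELLO': {1: 2, 2: 1}, 'WORLD': {1: 1}}
fileids = {1: ('issue', '1', 'title'), 2: ('msg', '7', 'content')}
one_word = find_sketch(words, fileids, ['hello'])
two_words = find_sketch(words, fileids, ['hello', 'world'])
no_match = find_sketch(words, fileids, ['hello', 'missing'])
```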
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
180 segments = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ#_-!"
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
181 def load_index(self, reload=0, wordlist=None):
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
182 # Unless reload is indicated, do not load twice
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
183 if self.index_loaded() and not reload:
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
184 return 0
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
185
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
186 # Ok, now let's actually load it
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
187 db = {'WORDS': {}, 'FILES': {'_TOP':(0,None)}, 'FILEIDS': {}}
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
188
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
189 # Identify the relevant word-dictionary segments
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
190 if not wordlist:
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
191 segments = self.segments
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
192 else:
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
193 segments = ['-','#']
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
194 for word in wordlist:
5470
e2baa4e6ed6d handle words starting with unicode characters
Christof Meerwald <cmeerw@cmeerw.org>
parents: 5395
diff changeset
195 initchar = word[0].upper()
e2baa4e6ed6d handle words starting with unicode characters
Christof Meerwald <cmeerw@cmeerw.org>
parents: 5395
diff changeset
196 if initchar not in self.segments:
e2baa4e6ed6d handle words starting with unicode characters
Christof Meerwald <cmeerw@cmeerw.org>
parents: 5395
diff changeset
197 initchar = '_'
e2baa4e6ed6d handle words starting with unicode characters
Christof Meerwald <cmeerw@cmeerw.org>
parents: 5395
diff changeset
198 segments.append(initchar)
2089
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
199
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
200 # Load the segments
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
201 for segment in segments:
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
202 try:
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
203 f = open(self.indexdb + segment, 'rb')
5248
198b6e810c67 Use Python-3-compatible 'as' syntax for except statements
Eric S. Raymond <esr@thyrsus.com>
parents: 4570
diff changeset
204 except IOError as error:
2089
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
205 # probably just nonexistent segment index file
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
206 if error.errno != errno.ENOENT: raise
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
207 else:
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
208 pickle_str = zlib.decompress(f.read())
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
209 f.close()
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
210 dbslice = marshal.loads(pickle_str)
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
211 if dbslice.get('WORDS'):
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
212 # if it has some words, add them
5395
23b8e6067f7c Python 3 preparation: update calls to dict methods.
Joseph Myers <jsm@polyomino.org.uk>
parents: 5380
diff changeset
213 for word, entry in dbslice['WORDS'].items():
2089
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
214 db['WORDS'][word] = entry
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
215 if dbslice.get('FILES'):
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
216 # if it has some files, add them
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
217 db['FILES'] = dbslice['FILES']
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
218 if dbslice.get('FILEIDS'):
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
219 # if it has fileids, add them
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
220 db['FILEIDS'] = dbslice['FILEIDS']
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
221
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
222 self.words = db['WORDS']
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
223 self.files = db['FILES']
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
224 self.fileids = db['FILEIDS']
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
225 self.changed = 0
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
226
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
227 def save_index(self):
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
228 # only save if the index is loaded and changed
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
229 if not self.index_loaded() or not self.changed:
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
230 return
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
231
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
232 # brutal space saver... delete all the small segments
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
233 for segment in self.segments:
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
234 try:
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
235 os.remove(self.indexdb + segment)
5248
198b6e810c67 Use Python-3-compatible 'as' syntax for except statements
Eric S. Raymond <esr@thyrsus.com>
parents: 4570
diff changeset
236 except OSError as error:
2089
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
237 # probably just nonexistent segment index file
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
238 if error.errno != errno.ENOENT: raise
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
239
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
240 # First write the much simpler filename/fileid dictionaries
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
241 dbfil = {'WORDS':None, 'FILES':self.files, 'FILEIDS':self.fileids}
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
242 open(self.indexdb+'-','wb').write(zlib.compress(marshal.dumps(dbfil)))
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
243
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
244 # The hard part is splitting the word dictionary up, of course
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
245 letters = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ#_"
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
246 segdicts = {} # Need batch of empty dicts
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
247 for segment in letters:
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
248 segdicts[segment] = {}
5395
23b8e6067f7c Python 3 preparation: update calls to dict methods.
Joseph Myers <jsm@polyomino.org.uk>
parents: 5380
diff changeset
249 for word, entry in self.words.items(): # Split into segment dicts
2089
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
250 initchar = word[0].upper()
5470
e2baa4e6ed6d handle words starting with unicode characters
Christof Meerwald <cmeerw@cmeerw.org>
parents: 5395
diff changeset
251 if initchar not in letters:
e2baa4e6ed6d handle words starting with unicode characters
Christof Meerwald <cmeerw@cmeerw.org>
parents: 5395
diff changeset
252 # if it's a unicode character, add it to the '_' segment
e2baa4e6ed6d handle words starting with unicode characters
Christof Meerwald <cmeerw@cmeerw.org>
parents: 5395
diff changeset
253 initchar = '_'
2089
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
254 segdicts[initchar][word] = entry
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
255
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
256 # save
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
257 for initchar in letters:
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
258 db = {'WORDS':segdicts[initchar], 'FILES':None, 'FILEIDS':None}
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
259 pickle_str = marshal.dumps(db)
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
260 filename = self.indexdb + initchar
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
261 pickle_fh = open(filename, 'wb')
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
262 pickle_fh.write(zlib.compress(pickle_str))
5380
64c4e43fbb84 Python 3 preparation: numeric literal syntax.
Joseph Myers <jsm@polyomino.org.uk>
parents: 5248
diff changeset
263 os.chmod(filename, 0o664)
2089
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
264
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
265 # save done
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
266 self.changed = 0
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
267
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
268 def purge_entry(self, identifier):
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
269 '''Remove a file from file index and word index
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
270 '''
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
271 self.load_index()
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
272
4357
13b3155869e0 Beginnings of a big code cleanup / modernisation to make 2to3 happy
Richard Jones <richard@users.sourceforge.net>
parents: 4252
diff changeset
273 if identifier not in self.files:
2089
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
274 return
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
275
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
276 file_index = self.files[identifier][0]
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
277 del self.files[identifier]
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
278 del self.fileids[file_index]
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
279
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
280 # The much harder part, cleanup the word index
5395
23b8e6067f7c Python 3 preparation: update calls to dict methods.
Joseph Myers <jsm@polyomino.org.uk>
parents: 5380
diff changeset
281 for key, occurs in self.words.items():
4357
13b3155869e0 Beginnings of a big code cleanup / modernisation to make 2to3 happy
Richard Jones <richard@users.sourceforge.net>
parents: 4252
diff changeset
282 if file_index in occurs:
2089
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
283 del occurs[file_index]
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
284
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
285 # save needed
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
286 self.changed = 1
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
287
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
288 def index_loaded(self):
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
289 return (hasattr(self,'fileids') and hasattr(self,'files') and
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
290 hasattr(self,'words'))
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
291
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
292 def rollback(self):
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
293 ''' load last saved index info. '''
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
294 self.load_index(reload=1)
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
295
3613
5f4db2650da3 implement close() on all indexers [SF#1242477]
Richard Jones <richard@users.sourceforge.net>
parents: 3555
diff changeset
296 def close(self):
5f4db2650da3 implement close() on all indexers [SF#1242477]
Richard Jones <richard@users.sourceforge.net>
parents: 3555
diff changeset
297 pass
5f4db2650da3 implement close() on all indexers [SF#1242477]
Richard Jones <richard@users.sourceforge.net>
parents: 3555
diff changeset
298
5f4db2650da3 implement close() on all indexers [SF#1242477]
Richard Jones <richard@users.sourceforge.net>
parents: 3555
diff changeset
299
2089
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
300 # vim: set filetype=python ts=4 sw=4 et si
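
Note on the on-disk layout used by load_index() and save_index() above: the word dictionary is split by the upper-cased first character of each word, anything outside the segment alphabet goes to the '_' segment, and each segment is written as marshal data compressed with zlib. The following is a minimal standalone sketch of that round trip, not the Roundup code itself. The helper names (segment_for, save_segments, load_segments, demo) and the sample index are made up for illustration, and the sketch omits the '-' file that holds the FILES and FILEIDS dictionaries.

```python
import marshal
import os
import tempfile
import zlib

# Same segment alphabet as the listing; '_' collects every other initial.
LETTERS = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ#_"


def segment_for(word):
    """Return the segment key for a word (hypothetical helper)."""
    initchar = word[0].upper()
    return initchar if initchar in LETTERS else '_'


def save_segments(prefix, words):
    """Write one zlib-compressed, marshalled dict per segment."""
    segdicts = {letter: {} for letter in LETTERS}
    for word, entry in words.items():
        segdicts[segment_for(word)][word] = entry
    for letter, segdict in segdicts.items():
        payload = zlib.compress(marshal.dumps({'WORDS': segdict}))
        # The with-block closes the file, so the data is flushed to disk.
        with open(prefix + letter, 'wb') as fh:
            fh.write(payload)


def load_segments(prefix, wordlist):
    """Read back only the segments that the given words can live in."""
    words = {}
    for letter in sorted({segment_for(w) for w in wordlist}):
        try:
            with open(prefix + letter, 'rb') as fh:
                dbslice = marshal.loads(zlib.decompress(fh.read()))
        except FileNotFoundError:
            continue  # a missing segment simply holds no words
        words.update(dbslice.get('WORDS') or {})
    return words


def demo():
    """Save a tiny made-up index, then load two single-segment views."""
    index = {'APPLE': {1: 2}, 'ANT': {3: 1}, 'BANANA': {2: 1},
             '\u00c9COLE': {1: 1}}
    with tempfile.TemporaryDirectory() as tmp:
        prefix = os.path.join(tmp, 'index.db')
        save_segments(prefix, index)
        ascii_hit = load_segments(prefix, ['apple'])
        other_hit = load_segments(prefix, ['\u00e9lan'])
    return ascii_hit, other_hit
```

Loading by word list touches only the segment files those words can fall into, which is the point of splitting the dictionary: a search for 'apple' reads the 'A' file and nothing else.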

Roundup Issue Tracker: http://roundup-tracker.org/