annotate roundup/indexer.py @ 2077:3e0961d6d44d
Added the "actor" property.
Metakit backend not done (still not confident I know how it's supposed
to work ;)
Currently it will come up as NULL in the RDBMS backends for older items.
The *dbm backends will look up the journal. I hope to remedy the former
before 0.7's release.
Fixed a bunch of migration issues in the rdbms backends while I was at it
(index changes for key prop changes) and simplified the class table update
code for RDBMSes that have "alter table" in their command set (ie. not
sqlite) ... migration from "version 1" to "version 2" still hasn't
actually been tested yet though.
| author | Richard Jones <richard@users.sourceforge.net> |
|---|---|
| date | Mon, 15 Mar 2004 05:50:20 +0000 |
| parents | fc52d57c6c3e |
| children |

#
# This module is derived from the module described at:
#   http://gnosis.cx/publish/programming/charming_python_15.txt
#
# Author: David Mertz (mertz@gnosis.cx)
# Thanks to: Pat Knight (p.knight@ktgroup.co.uk)
#            Gregory Popovitch (greg@gpy.com)
#
# The original module was released under this license, and remains under
# it:
#
#       This file is released to the public domain.  I (dqm) would
#       appreciate it if you choose to keep derived works under terms
#       that promote freedom, but obviously am giving up any rights
#       to compel such.
#
#$Id: indexer.py,v 1.18 2004-02-11 23:55:08 richard Exp $
'''This module provides an indexer class, Indexer, that stores text
indices in a roundup instance.  This class makes searching the content of
messages, string properties and text files possible.
'''
__docformat__ = 'restructuredtext'

import os, shutil, re, mimetypes, marshal, zlib, errno
from hyperdb import Link, Multilink

class Indexer:
    '''Indexes information from roundup's hyperdb to allow efficient
    searching.

    Three structures are created by the indexer::

        files   {identifier: (fileid, wordcount)}
        words   {word: {fileid: count}}
        fileids {fileid: identifier}

    where identifier is (classname, nodeid, propertyname)
    '''
    def __init__(self, db_path):
        self.indexdb_path = os.path.join(db_path, 'indexes')
        self.indexdb = os.path.join(self.indexdb_path, 'index.db')
        self.reindex = 0
        self.quiet = 9
        self.changed = 0

        # see if we need to reindex because of a change in code
        version = os.path.join(self.indexdb_path, 'version')
        if (not os.path.exists(self.indexdb_path) or
                not os.path.exists(version)):
            # for now the file itself is a flag
            self.force_reindex()
        elif os.path.exists(version):
            version = open(version).read()
            # check the value and reindex if it's not the latest
            if version.strip() != '1':
                self.force_reindex()

    def force_reindex(self):
        '''Force a reindex condition
        '''
        if os.path.exists(self.indexdb_path):
            shutil.rmtree(self.indexdb_path)
        os.makedirs(self.indexdb_path)
        os.chmod(self.indexdb_path, 0775)
        open(os.path.join(self.indexdb_path, 'version'), 'w').write('1\n')
        self.reindex = 1
        self.changed = 1

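The version-flag logic in `__init__` and `force_reindex` can be tried in isolation. What follows is a minimal Python 3 sketch under stated assumptions, not the module's own code: `INDEX_VERSION`, `needs_reindex` and `reset_index_dir` are hypothetical names introduced here, and only the `version` flag file and the value `'1'` are taken from the listing.

```python
import os
import shutil
import tempfile

INDEX_VERSION = '1'   # hypothetical constant; the listing hard-codes '1'


def needs_reindex(indexdb_path):
    """Return True if the version flag file is missing or holds a stale value."""
    version_file = os.path.join(indexdb_path, 'version')
    if not os.path.exists(version_file):
        # covers both a missing index directory and a missing flag file
        return True
    with open(version_file) as f:
        return f.read().strip() != INDEX_VERSION


def reset_index_dir(indexdb_path):
    """Wipe the index directory and write a fresh version flag."""
    if os.path.exists(indexdb_path):
        shutil.rmtree(indexdb_path)
    os.makedirs(indexdb_path)
    with open(os.path.join(indexdb_path, 'version'), 'w') as f:
        f.write(INDEX_VERSION + '\n')
```

Calling `needs_reindex` on a path that does not exist yet returns `True`; after `reset_index_dir` on the same path it returns `False` until the flag file is changed or removed.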
    def should_reindex(self):
        '''Should we reindex?
        '''
        return self.reindex

    def add_text(self, identifier, text, mime_type='text/plain'):
        '''Add some text associated with the (classname, nodeid, property)
        identifier.
        '''
        # make sure the index is loaded
        self.load_index()

        # remove old entries for this identifier
        if self.files.has_key(identifier):
            self.purge_entry(identifier)

        # split into words
        words = self.splitter(text, mime_type)

        # Find new file index, and assign it to identifier
        # (_TOP uses trick of negative to avoid conflict with file index)
        self.files['_TOP'] = (self.files['_TOP'][0]-1, None)
        file_index = abs(self.files['_TOP'][0])
        self.files[identifier] = (file_index, len(words))
        self.fileids[file_index] = identifier

        # find the unique words
        filedict = {}
        for word in words:
            if filedict.has_key(word):
                filedict[word] = filedict[word]+1
            else:
                filedict[word] = 1

        # now add to the totals
        for word in filedict.keys():
            # each word has a dict of {fileid: count}
            if self.words.has_key(word):
                entry = self.words[word]
            else:
                # new word
                entry = {}
                self.words[word] = entry

            # make a reference to the file for this word
            entry[file_index] = filedict[word]

        # save needed
        self.changed = 1

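To see how `add_text` fills the three structures named in the class docstring, here is a small standalone Python 3 sketch. It is an illustration under assumptions, not the module's code: `index_text` is a hypothetical helper, the dicts are plain in-memory dicts, `'_TOP'` is assumed to start at `(0, None)`, and the purge of an already-indexed identifier is left out.

```python
def index_text(files, words, fileids, identifier, tokens):
    """Record tokens for identifier in the three index dicts (in place)."""
    # '_TOP' holds a negative counter, so it can never equal a real file id
    top = files.get('_TOP', (0, None))[0] - 1
    files['_TOP'] = (top, None)
    file_index = abs(top)

    files[identifier] = (file_index, len(tokens))
    fileids[file_index] = identifier

    # count occurrences of each word in this text
    counts = {}
    for token in tokens:
        counts[token] = counts.get(token, 0) + 1

    # merge the per-text counts into the global word index
    for word, count in counts.items():
        words.setdefault(word, {})[file_index] = count
    return file_index


files, words, fileids = {}, {}, {}
index_text(files, words, fileids, ('issue', '1', 'title'), ['SEARCH', 'BUG'])
index_text(files, words, fileids, ('msg', '7', 'content'), ['BUG', 'BUG', 'FIX'])
# words['BUG'] is now {1: 1, 2: 2}: one hit in file id 1, two in file id 2
```

Keying `words` by the small integer file id instead of the full `(classname, nodeid, propertyname)` tuple keeps the per-word dicts compact; `fileids` is the reverse map used to get back to the identifier.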
    def splitter(self, text, ftype):
        '''Split the contents of a text string into a list of 'words'
        '''
        if ftype == 'text/plain':
            words = self.text_splitter(text)
        else:
            return []
        return words

    def text_splitter(self, text):
        """Split text/plain string into a list of words
        """
        # case insensitive
        text = text.upper()

        # Split the raw text, losing anything longer than 25 characters
        # since that'll be gibberish (encoded text or somesuch) or shorter
        # than 2 characters since those short words appear all over the
        # place
        return re.findall(r'\b\w{2,25}\b', text)

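The behaviour of the word pattern is easiest to see on a sample string. The sketch below is Python 3 and only reuses the regular expression from `text_splitter`; the names `WORD_RE` and `split_words` are hypothetical. As written, the pattern keeps runs of 2 to 25 word characters and drops longer runs entirely, because no word boundary falls inside them.

```python
import re

# same pattern as in text_splitter: 2 to 25 word characters between boundaries
WORD_RE = re.compile(r'\b\w{2,25}\b')


def split_words(text):
    """Upper-case the text and return the words the pattern accepts."""
    return WORD_RE.findall(text.upper())


print(split_words("Fix a bug in the indexer"))
# → ['FIX', 'BUG', 'IN', 'THE', 'INDEXER']
print(split_words("a " + "x" * 30 + " ok"))
# → ['OK']
```

Upper-casing before matching is what makes the index case insensitive, so search terms have to be upper-cased in the same way before lookup.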
    def search(self, search_terms, klass, ignore={},
            dre=re.compile(r'([^\d]+)(\d+)')):
        '''Return search results for [search, terms] associated with the
        hyperdb Class "klass", as a dict keyed by node id. Ignore hits
        whose (classname, property) pair is a key of "ignore".

        "dre" is a helper, not an argument.
        '''
        # do the index lookup
        hits = self.find(search_terms)
        if not hits:
            return {}

        designator_propname = {}
        for nm, propclass in klass.getprops().items():
            if isinstance(propclass, Link) or isinstance(propclass, Multilink):
                designator_propname[propclass.classname] = nm

        # build a dictionary of nodes and their associated messages
        # and files
        nodeids = {}      # this is the answer
        propspec = {}     # used to do the klass.find
        for propname in designator_propname.values():
            propspec[propname] = {}   # used as a set (value doesn't matter)
        for classname, nodeid, property in hits.values():
            # skip this result if we don't care about this class/property
            if ignore.has_key((classname, property)):
                continue

            # if it's a property on klass, it's easy
            if classname == klass.classname:
                if not nodeids.has_key(nodeid):
                    nodeids[nodeid] = {}
                continue

            # make sure the class is a linked one, otherwise ignore
            if not designator_propname.has_key(classname):
                continue

            # it's a linked class - set up to do the klass.find
            linkprop = designator_propname[classname]   # eg, msg -> messages
            propspec[linkprop][nodeid] = 1

        # retain only the meaningful entries
        for propname, idset in propspec.items():
            if not idset:
                del propspec[propname]

        # klass.find tells me the klass nodeids the linked nodes relate to
        for resid in klass.find(**propspec):
            resid = str(resid)
            # check resid, not the builtin "id", so existing entries are kept
            if not nodeids.has_key(resid):
                nodeids[resid] = {}
            node_dict = nodeids[resid]
            # now figure out where it came from
            for linkprop in propspec.keys():
                for nodeid in klass.get(resid, linkprop):
                    if propspec[linkprop].has_key(nodeid):
                        # OK, this node[propname] has a winner
                        if not node_dict.has_key(linkprop):
                            node_dict[linkprop] = [nodeid]
                        else:
                            node_dict[linkprop].append(nodeid)
        return nodeids

|
254b8d112eec
cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents:
749
diff
changeset
|
204 # we override this to ignore words where not 2 < len(word) < 25 and also to fix a bug - |
|
254b8d112eec
cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents:
749
diff
changeset
|
205 # the (fail) case. |
|
254b8d112eec
cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents:
749
diff
changeset
|
206 def find(self, wordlist): |
|
2005
fc52d57c6c3e
documentation cleanup
Richard Jones <richard@users.sourceforge.net>
parents:
1986
diff
changeset
|
207 '''Locate files that match ALL the words in wordlist |
|
818
254b8d112eec
cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents:
749
diff
changeset
|
208 ''' |
|
254b8d112eec
cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents:
749
diff
changeset
|
209 if not hasattr(self, 'words'): |
|
254b8d112eec
cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents:
749
diff
changeset
|
210 self.load_index() |
|
749
51c425129b35
Merged search_indexing-branch with HEAD
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff
changeset
|
211 self.load_index(wordlist=wordlist) |
|
51c425129b35
Merged search_indexing-branch with HEAD
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff
changeset
|
212 entries = {} |
|
818
254b8d112eec
cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents:
749
diff
changeset
|
213 hits = None |
|
749
51c425129b35
Merged search_indexing-branch with HEAD
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff
changeset
|
214 for word in wordlist: |
|
818
254b8d112eec
cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents:
749
diff
changeset
|
215 if not 2 < len(word) < 25: |
|
254b8d112eec
cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents:
749
diff
changeset
|
216 # word outside the bounds of what we index - ignore |
|
254b8d112eec
cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents:
749
diff
changeset
|
217 continue |
|
833
b80aaedba3db
Only save the index if the thing is loaded and changed.
Richard Jones <richard@users.sourceforge.net>
parents:
827
diff
changeset
|
218 word = word.upper() |
|
1376
0c736e2f1dd5
.get() was intentional after all
Richard Jones <richard@users.sourceforge.net>
parents:
1365
diff
changeset
|
219 entry = self.words.get(word) # For each word, get index |
|
0c736e2f1dd5
.get() was intentional after all
Richard Jones <richard@users.sourceforge.net>
parents:
1365
diff
changeset
|
220 entries[word] = entry # of matching files |
|
0c736e2f1dd5
.get() was intentional after all
Richard Jones <richard@users.sourceforge.net>
parents:
1365
diff
changeset
|
221 if not entry: # Nothing for this one word (fail) |
|
818
254b8d112eec
cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents:
749
diff
changeset
|
222 return {} |
|
254b8d112eec
cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents:
749
diff
changeset
|
223 if hits is None: |
|
254b8d112eec
cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents:
749
diff
changeset
|
224 hits = {} |
|
254b8d112eec
cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents:
749
diff
changeset
|
225 for k in entry.keys(): |
|
1365
4884fb0860f9
fixed rdbms searching by ID [SF#666615]
Richard Jones <richard@users.sourceforge.net>
parents:
1206
diff
changeset
|
226 if not self.fileids.has_key(k): |
|
4884fb0860f9
fixed rdbms searching by ID [SF#666615]
Richard Jones <richard@users.sourceforge.net>
parents:
1206
diff
changeset
|
227 raise ValueError, 'Index is corrupted: re-generate it' |
|
818
254b8d112eec
cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents:
749
diff
changeset
|
228 hits[k] = self.fileids[k] |
|
254b8d112eec
cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents:
749
diff
changeset
|
229 else: |
|
254b8d112eec
cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents:
749
diff
changeset
|
230 # Eliminate hits for every non-match |
|
254b8d112eec
cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents:
749
diff
changeset
|
231 for fileid in hits.keys(): |
|
254b8d112eec
cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents:
749
diff
changeset
|
232 if not entry.has_key(fileid): |
|
254b8d112eec
cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents:
749
diff
changeset
|
233 del hits[fileid] |
|
254b8d112eec
cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents:
749
diff
changeset
|
234 if hits is None: |
|
254b8d112eec
cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents:
749
diff
changeset
|
235 return {} |
|
749
51c425129b35
Merged search_indexing-branch with HEAD
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff
changeset
|
236 return hits |
|
51c425129b35
Merged search_indexing-branch with HEAD
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff
changeset
|
237 |
|
827
0a2c1f5e0e5a
We're indexing numbers now, and _underscore words
Richard Jones <richard@users.sourceforge.net>
parents:
826
diff
changeset
|
238 segments = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ#_-!" |
|
818
254b8d112eec
cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents:
749
diff
changeset
|
239 def load_index(self, reload=0, wordlist=None): |
|
254b8d112eec
cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents:
749
diff
changeset
|
240 # Unless reload is indicated, do not load twice |
|
254b8d112eec
cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents:
749
diff
changeset
|
241 if self.index_loaded() and not reload: |
|
254b8d112eec
cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents:
749
diff
changeset
|
242 return 0 |
|
254b8d112eec
cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents:
749
diff
changeset
|
243 |
|
254b8d112eec
cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents:
749
diff
changeset
|
244 # Ok, now let's actually load it |
|
254b8d112eec
cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents:
749
diff
changeset
|
245 db = {'WORDS': {}, 'FILES': {'_TOP':(0,None)}, 'FILEIDS': {}} |
|
254b8d112eec
cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents:
749
diff
changeset
|
246 |
|
254b8d112eec
cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents:
749
diff
changeset
|
247 # Identify the relevant word-dictionary segments |
|
254b8d112eec
cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents:
749
diff
changeset
|
248 if not wordlist: |
|
254b8d112eec
cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents:
749
diff
changeset
|
249 segments = self.segments |
|
254b8d112eec
cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents:
749
diff
changeset
|
250 else: |
|
254b8d112eec
cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents:
749
diff
changeset
|
251 segments = ['-','#'] |
|
254b8d112eec
cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents:
749
diff
changeset
|
252 for word in wordlist: |
|
254b8d112eec
cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents:
749
diff
changeset
|
253 segments.append(word[0].upper()) |
|
254b8d112eec
cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents:
749
diff
changeset
|
254 |
|
254b8d112eec
cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents:
749
diff
changeset
|
255 # Load the segments |
|
254b8d112eec
cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents:
749
diff
changeset
|
256 for segment in segments: |
|
254b8d112eec
cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents:
749
diff
changeset
|
257 try: |
|
254b8d112eec
cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents:
749
diff
changeset
|
258 f = open(self.indexdb + segment, 'rb') |
|
254b8d112eec
cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents:
749
diff
changeset
|
259 except IOError, error: |
| 863 | 260 # probably just nonexistent segment index file |
| | 261 if error.errno != errno.ENOENT: raise |
|
749
51c425129b35
Merged search_indexing-branch with HEAD
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff
changeset
|
262 else: |
|
818
254b8d112eec
cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents:
749
diff
changeset
|
263 pickle_str = zlib.decompress(f.read()) |
|
254b8d112eec
cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents:
749
diff
changeset
|
264 f.close() |
|
254b8d112eec
cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents:
749
diff
changeset
|
265 dbslice = marshal.loads(pickle_str) |
|
254b8d112eec
cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents:
749
diff
changeset
|
266 if dbslice.get('WORDS'): |
|
254b8d112eec
cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents:
749
diff
changeset
|
267 # if it has some words, add them |
|
254b8d112eec
cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents:
749
diff
changeset
|
268 for word, entry in dbslice['WORDS'].items(): |
|
254b8d112eec
cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents:
749
diff
changeset
|
269 db['WORDS'][word] = entry |
|
254b8d112eec
cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents:
749
diff
changeset
|
270 if dbslice.get('FILES'): |
|
254b8d112eec
cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents:
749
diff
changeset
|
271 # if it has some files, add them |
|
254b8d112eec
cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents:
749
diff
changeset
|
272 db['FILES'] = dbslice['FILES'] |
|
254b8d112eec
cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents:
749
diff
changeset
|
273 if dbslice.get('FILEIDS'): |
|
254b8d112eec
cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents:
749
diff
changeset
|
274 # if it has fileids, add them |
|
254b8d112eec
cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents:
749
diff
changeset
|
275 db['FILEIDS'] = dbslice['FILEIDS'] |
|
254b8d112eec
cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents:
749
diff
changeset
|
276 |
|
254b8d112eec
cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents:
749
diff
changeset
|
277 self.words = db['WORDS'] |
|
254b8d112eec
cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents:
749
diff
changeset
|
278 self.files = db['FILES'] |
|
254b8d112eec
cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents:
749
diff
changeset
|
279 self.fileids = db['FILEIDS'] |
|
833
b80aaedba3db
Only save the index if the thing is loaded and changed.
Richard Jones <richard@users.sourceforge.net>
parents:
827
diff
changeset
|
280 self.changed = 0 |
|
749
51c425129b35
Merged search_indexing-branch with HEAD
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff
changeset
|
281 |
|
818
254b8d112eec
cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents:
749
diff
changeset
|
282 def save_index(self): |
|
833
b80aaedba3db
Only save the index if the thing is loaded and changed.
Richard Jones <richard@users.sourceforge.net>
parents:
827
diff
changeset
|
283 # only save if the index is loaded and changed |
|
b80aaedba3db
Only save the index if the thing is loaded and changed.
Richard Jones <richard@users.sourceforge.net>
parents:
827
diff
changeset
|
284 if not self.index_loaded() or not self.changed: |
|
b80aaedba3db
Only save the index if the thing is loaded and changed.
Richard Jones <richard@users.sourceforge.net>
parents:
827
diff
changeset
|
285 return |
|
825
0779ea9f1f18
More indexer work:
Richard Jones <richard@users.sourceforge.net>
parents:
818
diff
changeset
|
286 |
|
818
254b8d112eec
cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents:
749
diff
changeset
|
287 # brutal space saver... delete all the small segments |
|
254b8d112eec
cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents:
749
diff
changeset
|
288 for segment in self.segments: |
|
254b8d112eec
cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents:
749
diff
changeset
|
289 try: |
|
254b8d112eec
cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents:
749
diff
changeset
|
290 os.remove(self.indexdb + segment) |
| 863 | 291 except OSError, error: |
|
818
254b8d112eec
cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents:
749
diff
changeset
|
292 # probably just nonexistent segment index file |
| 867 | 293 if error.errno != errno.ENOENT: raise |
|
818
254b8d112eec
cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents:
749
diff
changeset
|
294 |
|
254b8d112eec
cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents:
749
diff
changeset
|
295 # First write the much simpler filename/fileid dictionaries |
|
254b8d112eec
cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents:
749
diff
changeset
|
296 dbfil = {'WORDS':None, 'FILES':self.files, 'FILEIDS':self.fileids} |
|
254b8d112eec
cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents:
749
diff
changeset
|
297 open(self.indexdb+'-','wb').write(zlib.compress(marshal.dumps(dbfil))) |
|
254b8d112eec
cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents:
749
diff
changeset
|
298 |
|
254b8d112eec
cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents:
749
diff
changeset
|
299 # The hard part is splitting the word dictionary up, of course |
|
827
0a2c1f5e0e5a
We're indexing numbers now, and _underscore words
Richard Jones <richard@users.sourceforge.net>
parents:
826
diff
changeset
|
300 letters = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ#_" |
|
818
254b8d112eec
cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents:
749
diff
changeset
|
301 segdicts = {} # Need batch of empty dicts |
|
254b8d112eec
cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents:
749
diff
changeset
|
302 for segment in letters: |
|
254b8d112eec
cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents:
749
diff
changeset
|
303 segdicts[segment] = {} |
|
254b8d112eec
cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents:
749
diff
changeset
|
304 for word, entry in self.words.items(): # Split into segment dicts |
|
254b8d112eec
cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents:
749
diff
changeset
|
305 initchar = word[0].upper() |
|
254b8d112eec
cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents:
749
diff
changeset
|
306 segdicts[initchar][word] = entry |
|
254b8d112eec
cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents:
749
diff
changeset
|
307 |
|
254b8d112eec
cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents:
749
diff
changeset
|
308 # save |
|
254b8d112eec
cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents:
749
diff
changeset
|
309 for initchar in letters: |
|
254b8d112eec
cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents:
749
diff
changeset
|
310 db = {'WORDS':segdicts[initchar], 'FILES':None, 'FILEIDS':None} |
|
254b8d112eec
cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents:
749
diff
changeset
|
311 pickle_str = marshal.dumps(db) |
|
254b8d112eec
cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents:
749
diff
changeset
|
312 filename = self.indexdb + initchar |
|
254b8d112eec
cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents:
749
diff
changeset
|
313 pickle_fh = open(filename, 'wb') |
|
254b8d112eec
cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents:
749
diff
changeset
|
314 pickle_fh.write(zlib.compress(pickle_str)) |
|
254b8d112eec
cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents:
749
diff
changeset
|
315 os.chmod(filename, 0664) |
|
749
51c425129b35
Merged search_indexing-branch with HEAD
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff
changeset
|
316 |
|
833
b80aaedba3db
Only save the index if the thing is loaded and changed.
Richard Jones <richard@users.sourceforge.net>
parents:
827
diff
changeset
|
317 # save done |
|
b80aaedba3db
Only save the index if the thing is loaded and changed.
Richard Jones <richard@users.sourceforge.net>
parents:
827
diff
changeset
|
318 self.changed = 0 |
|
b80aaedba3db
Only save the index if the thing is loaded and changed.
Richard Jones <richard@users.sourceforge.net>
parents:
827
diff
changeset
|
319 |
|
826
6d7a45c8464a
Added reindex command to roundup-admin.
Richard Jones <richard@users.sourceforge.net>
parents:
825
diff
changeset
|
320 def purge_entry(self, identifier): |
|
2005
fc52d57c6c3e
documentation cleanup
Richard Jones <richard@users.sourceforge.net>
parents:
1986
diff
changeset
|
321 '''Remove a file from file index and word index |
|
818
254b8d112eec
cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents:
749
diff
changeset
|
322 ''' |
|
891
974a4b94c5e3
Implemented the destroy() method needed by the session database...
Richard Jones <richard@users.sourceforge.net>
parents:
880
diff
changeset
|
323 self.load_index() |
|
974a4b94c5e3
Implemented the destroy() method needed by the session database...
Richard Jones <richard@users.sourceforge.net>
parents:
880
diff
changeset
|
324 |
|
826
6d7a45c8464a
Added reindex command to roundup-admin.
Richard Jones <richard@users.sourceforge.net>
parents:
825
diff
changeset
|
325 if not self.files.has_key(identifier): |
|
6d7a45c8464a
Added reindex command to roundup-admin.
Richard Jones <richard@users.sourceforge.net>
parents:
825
diff
changeset
|
326 return |
|
6d7a45c8464a
Added reindex command to roundup-admin.
Richard Jones <richard@users.sourceforge.net>
parents:
825
diff
changeset
|
327 |
|
6d7a45c8464a
Added reindex command to roundup-admin.
Richard Jones <richard@users.sourceforge.net>
parents:
825
diff
changeset
|
328 file_index = self.files[identifier][0] |
|
6d7a45c8464a
Added reindex command to roundup-admin.
Richard Jones <richard@users.sourceforge.net>
parents:
825
diff
changeset
|
329 del self.files[identifier] |
|
6d7a45c8464a
Added reindex command to roundup-admin.
Richard Jones <richard@users.sourceforge.net>
parents:
825
diff
changeset
|
330 del self.fileids[file_index] |
|
6d7a45c8464a
Added reindex command to roundup-admin.
Richard Jones <richard@users.sourceforge.net>
parents:
825
diff
changeset
|
331 |
|
749
51c425129b35
Merged search_indexing-branch with HEAD
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff
changeset
|
332 # The much harder part: clean up the word index |
|
826
6d7a45c8464a
Added reindex command to roundup-admin.
Richard Jones <richard@users.sourceforge.net>
parents:
825
diff
changeset
|
333 for key, occurs in self.words.items(): |
|
749
51c425129b35
Merged search_indexing-branch with HEAD
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff
changeset
|
334 if occurs.has_key(file_index): |
|
51c425129b35
Merged search_indexing-branch with HEAD
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff
changeset
|
335 del occurs[file_index] |
|
51c425129b35
Merged search_indexing-branch with HEAD
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff
changeset
|
336 |
|
833
b80aaedba3db
Only save the index if the thing is loaded and changed.
Richard Jones <richard@users.sourceforge.net>
parents:
827
diff
changeset
|
337 # save needed |
|
b80aaedba3db
Only save the index if the thing is loaded and changed.
Richard Jones <richard@users.sourceforge.net>
parents:
827
diff
changeset
|
338 self.changed = 1 |
|
b80aaedba3db
Only save the index if the thing is loaded and changed.
Richard Jones <richard@users.sourceforge.net>
parents:
827
diff
changeset
|
339 |
|
749
51c425129b35
Merged search_indexing-branch with HEAD
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff
changeset
|
340 def index_loaded(self): |
|
818
254b8d112eec
cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents:
749
diff
changeset
|
341 return (hasattr(self,'fileids') and hasattr(self,'files') and |
|
254b8d112eec
cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents:
749
diff
changeset
|
342 hasattr(self,'words')) |
|
749
51c425129b35
Merged search_indexing-branch with HEAD
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff
changeset
|
343 |
| 1090 | 344 # vim: set filetype=python ts=4 sw=4 et si |
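
The segmented storage and the ALL-words lookup annotated above can be sketched in a few lines of modern Python. This is only an illustration, not the module's code: it is written for Python 3 (the annotated source is Python 2), and the function names `save_segments`, `load_segments`, `find` and `demo`, the `prefix` argument and the sample data are all assumptions made for the example. It keeps the same ideas: one zlib-compressed, marshalled file per initial character, a `-` segment for the file-id table, and a hit set that is narrowed word by word.

```python
import marshal
import os
import tempfile
import zlib

# Hypothetical, simplified stand-ins for the indexer's state:
#   words:   {WORD: {fileid: count}}
#   fileids: {fileid: identifier}


def save_segments(prefix, words, fileids):
    """Write one compressed, marshalled file per initial character."""
    # The '-' segment carries the file-id table, mirroring save_index.
    with open(prefix + '-', 'wb') as fh:
        fh.write(zlib.compress(marshal.dumps({'WORDS': None, 'FILEIDS': fileids})))
    # Split the word dictionary by the upper-cased first character.
    segdicts = {}
    for word, entry in words.items():
        segdicts.setdefault(word[0].upper(), {})[word] = entry
    for initchar, segwords in segdicts.items():
        with open(prefix + initchar, 'wb') as fh:
            fh.write(zlib.compress(marshal.dumps({'WORDS': segwords, 'FILEIDS': None})))


def load_segments(prefix, wordlist):
    """Load the '-' segment plus the segments for the query words' initials."""
    words, fileids = {}, {}
    segments = {'-'} | {w[0].upper() for w in wordlist if w}
    for segment in segments:
        try:
            with open(prefix + segment, 'rb') as fh:
                dbslice = marshal.loads(zlib.decompress(fh.read()))
        except FileNotFoundError:
            continue  # no segment file: nothing indexed under this character
        words.update(dbslice.get('WORDS') or {})
        if dbslice.get('FILEIDS'):
            fileids = dbslice['FILEIDS']
    return words, fileids


def find(words, fileids, wordlist):
    """Return {fileid: identifier} for files containing ALL indexable words."""
    hits = None
    for word in wordlist:
        if not 2 < len(word) < 25:
            continue  # outside the bounds of what is indexed: ignore
        entry = words.get(word.upper())
        if not entry:
            return {}  # one word has no matches, so the AND query fails
        if hits is None:
            hits = {k: fileids[k] for k in entry}
        else:
            # Eliminate hits that do not also contain this word.
            hits = {k: v for k, v in hits.items() if k in entry}
    return hits or {}


def demo():
    """Round-trip a tiny made-up index through a temporary directory."""
    words = {'HELLO': {1: 1, 2: 1}, 'WORLD': {2: 1}}
    fileids = {1: ('msg', '1', 'content'), 2: ('msg', '2', 'content')}
    with tempfile.TemporaryDirectory() as tmp:
        prefix = os.path.join(tmp, 'index.')
        save_segments(prefix, words, fileids)
        loaded_words, loaded_ids = load_segments(prefix, ['hello', 'world'])
    return find(loaded_words, loaded_ids, ['hello', 'world'])
```

Unlike the annotated `find`, this sketch lets a missing file id surface as a plain `KeyError` rather than raising a `ValueError` about a corrupted index, and it closes every file handle with a `with` block. Loading only the segments named by the query is the design choice that keeps lookups cheap when the word dictionary is large.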
