annotate roundup/indexer.py @ 681:1b2d0e702ca8 search_indexing-0-4-2-branch
Added feature [SF#526730] - search for messages capability
| author | Roche Compaan <rochecompaan@users.sourceforge.net> |
|---|---|
| date | Wed, 03 Apr 2002 11:55:57 +0000 |
| parents | |
| children | b4d13f7cc6c4 |
#!/usr/bin/env python

"""Create full-text indexes and search them

Notes:

  See http://gnosis.cx/publish/programming/charming_python_15.txt
  for a detailed discussion of this module.

  This version requires Python 1.6+. It turns out that the use
  of string methods rather than [string] module functions is
  enough faster in a tight loop so as to provide a quite
  remarkable 25% speedup in overall indexing. However, only FOUR
  lines in TextSplitter.text_splitter() were changed away from
  Python 1.5 compatibility. Those lines are followed by comments
  beginning with "# 1.52: " that show the old forms. Python
  1.5 users can restore these lines, and comment out those just
  above them.

Classes:

  GenericIndexer      -- Abstract class
  TextSplitter        -- Mixin class
  Index
  ShelveIndexer
  FlatIndexer
  XMLPickleIndexer
  PickleIndexer
  ZPickleIndexer
  SlicedZPickleIndexer

Functions:

  echo_fname(fname)
  recurse_files(...)

Index Formats:

  *Indexer.files:     filename --> (fileid, wordcount)
  *Indexer.fileids:   fileid --> filename
  *Indexer.words:     word --> {fileid1:occurs, fileid2:occurs, ...}
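
The three mappings listed under Index Formats can be pictured as plain dictionaries. The following is only an illustrative sketch in modern Python, not code from this module: the sample filenames, the counts, and the helper `files_containing` are hypothetical, and it assumes a file counts as a hit only when it contains every query word.

```python
# Hypothetical sample data laid out in the three documented formats.
files = {"a.txt": (1, 120), "b.txt": (2, 80)}   # filename -> (fileid, wordcount)
fileids = {1: "a.txt", 2: "b.txt"}              # fileid -> filename
words = {                                       # word -> {fileid: occurrences}
    "spam": {1: 3, 2: 1},
    "eggs": {1: 2},
}

def files_containing(wordlist):
    # Assumption: a file is a hit only if it contains every word in wordlist.
    # Returns the matching filenames in sorted order.
    hits = None
    for word in wordlist:
        ids = set(words.get(word, {}))
        hits = ids if hits is None else hits & ids
    return sorted(fileids[i] for i in (hits or ()))

print(files_containing(["spam", "eggs"]))
```

Keeping a separate fileid-to-filename mapping lets each per-word dictionary store small integer ids rather than repeating full paths, which is presumably why the index is split three ways.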

Module Usage:

  There are a few ways to use this module. Just to utilize existing
  functionality, something like the following is a likely
  pattern:

      import gnosis.indexer as indexer
      index = indexer.MyFavoriteIndexer()     # For some concrete Indexer
      index.load_index('myIndex.db')
      index.add_files(dir='/this/that/otherdir', pattern='*.txt')
      hits = index.find(['spam','eggs','bacon'])
      index.print_report(hits)

  To customize the basic classes, something like the following is likely:

      class MySplitter:
          def splitter(self, text, ftype):
              "Perform much better splitting than default (for filetypes)"
              # ...
              return words

      class MyIndexer(indexer.GenericIndexer, MySplitter):
          def load_index(self, INDEXDB=None):
              "Retrieve three dictionaries from clever storage method"
              # ...
              self.words, self.files, self.fileids = WORDS, FILES, FILEIDS
          def save_index(self, INDEXDB=None):
              "Save three dictionaries to clever storage method"

      index = MyIndexer()
      # ...etc...
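
As a concrete illustration of the customization pattern in Module Usage, here is a minimal stand-alone sketch of a splitter mixin in modern Python. It is not part of this module: the class name `RegexSplitter`, the stop-word set, and the decision to ignore `ftype` are all assumptions; only the `splitter(self, text, ftype)` signature is taken from the usage example.

```python
import re

class RegexSplitter:
    # Hypothetical mixin; only the method signature mirrors the usage example.
    word_re = re.compile("[A-Za-z0-9_]+")
    stopwords = frozenset(["a", "an", "and", "the"])  # assumed, illustrative only

    def splitter(self, text, ftype):
        # Lowercase the text, pull out alphanumeric runs, and drop stop words.
        # ftype is accepted for signature compatibility but unused in this sketch.
        found = self.word_re.findall(text.lower())
        return [w for w in found if w not in self.stopwords]

print(RegexSplitter().splitter("Spam and eggs, the Bacon!", "text/plain"))
```

A real splitter would more likely branch on `ftype` (for example, stripping markup from HTML before tokenizing); this sketch treats every input as plain text.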

Benchmarks:

  As we know, there are lies, damn lies, and benchmarks. Take
  the below with an adequate dose of salt. In version 0.10 of
  the concrete indexers, some performance was tested. The
  test case was a set of mail/news archives, that were about
  43 MB, and 225 files. In each case, an index was generated
  (if possible), and a search for the words "xml python" was
  performed.

    - Index w/ PickleIndexer:     482s, 2.4 MB
    - Search w/ PickleIndexer:    1.74s
    - Index w/ ZPickleIndexer:    484s, 1.2 MB
    - Search w/ ZPickleIndexer:   1.77s
    - Index w/ FlatIndexer:       492s, 2.6 MB
    - Search w/ FlatIndexer:      53s
    - Index w/ ShelveIndexer:     (dumbdbm) Many minutes, tens of MBs
    - Search w/ ShelveIndexer:    Aborted before completely indexed
    - Index w/ ShelveIndexer:     (dbhash) Long time (partial crash), 10 MB
    - Search w/ ShelveIndexer:    N/A. Too many glitches
    - Index w/ XMLPickleIndexer:  Memory error (xml_pickle uses bad string
                                  composition for large output)
    - Search w/ XMLPickleIndexer: N/A
    - grep search (xml|python):   20s (cached: <5s)
    - 'srch' utility (python):    12s
"""
__shell_usage__ = """
Shell Usage: [python] indexer.py [options] [search_words]

    -h, /h, -?, /?, ?, --help:    Show this help screen
    -index:                       Add files to index
    -reindex:                     Refresh files already in the index
                                  (can take much more time)
    -casesensitive:               Maintain the case of indexed words
                                  (can lead to MUCH larger indices)
    -norecurse, -local:           Only index starting dir, not subdirs
    -dir=<directory>:             Starting directory for indexing
                                  (default is current directory)
    -indexdb=<database>:          Use specified index database
                                  (environ variable INDEXER_DB is preferred)
    -regex=<pattern>:             Index files matching regular expression
    -glob=<pattern>:              Index files matching glob pattern
    -filter=<pattern>:            Only display results matching pattern
    -output=<opt>, -format=<opt>: How much detail on matches?
    -<digit>:                     Quiet level (0=verbose ... 9=quiet)

Output/format options are ALL/EVERYTHING/VERBOSE, RATINGS/SCORES,
FILENAMES/NAMES/FILES, SUMMARY/REPORT"""

__version__ = "$Revision: 1.1.2.1 $"
__author__=["David Mertz (mertz@gnosis.cx)",]
__thanks_to__=["Pat Knight (p.knight@ktgroup.co.uk)",
               "Gregory Popovitch (greg@gpy.com)", ]
__copyright__="""
    This file is released to the public domain. I (dqm) would
    appreciate it if you choose to keep derived works under terms
    that promote freedom, but obviously am giving up any rights
    to compel such.
"""

__history__="""
    0.1    Initial version.

    0.11   Tweaked TextSplitter after some random experimentation.

    0.12   Added SlicedZPickleIndexer (best choice, so far).

    0.13   Pat Knight pointed out need for binary open()'s of
           certain files under Windows.

    0.14   Added '-filter' switch to search results.

    0.15   Added direct read of gzip files

    0.20   Gregory Popovitch did some profiling on TextSplitter,
           and provided both huge speedups to the Python version
           and hooks to a C extension class (ZopeTextSplitter).
           A little refactoring by him and me (dqm) has nearly
           doubled the speed of indexing

    0.30   Module refactored into gnosis package. This is a
           first pass, and various documentation and test cases
           should be added later.
"""
|
1b2d0e702ca8
Added feature [SF#526730] - search for messages capability
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff
changeset
|
159 import string, re, os, fnmatch, sys, copy, gzip |
|
1b2d0e702ca8
Added feature [SF#526730] - search for messages capability
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff
changeset
|
160 from types import * |
|
1b2d0e702ca8
Added feature [SF#526730] - search for messages capability
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff
changeset
|
161 |
|
1b2d0e702ca8
Added feature [SF#526730] - search for messages capability
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff
changeset
|
162 #-- Silly "do nothing" default recursive file processor |
|
1b2d0e702ca8
Added feature [SF#526730] - search for messages capability
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff
changeset
|
163 def echo_fname(fname): print fname |
|
1b2d0e702ca8
Added feature [SF#526730] - search for messages capability
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff
changeset
|
164 |
|
#-- "Recurse and process files" utility function
def recurse_files(curdir, pattern, exclusions, func=echo_fname, *args, **kw):
    "Recursively process file pattern"
    subdirs, files = [],[]
    level = kw.get('level',0)

    for name in os.listdir(curdir):
        fname = os.path.join(curdir, name)
        if name[-4:] in exclusions:
            pass  # do not include binary file type
        elif os.path.isdir(fname) and not os.path.islink(fname):
            subdirs.append(fname)
        # kludge to detect a regular expression across python versions
        elif sys.version[0]=='1' and isinstance(pattern, re.RegexObject):
            if pattern.match(name):
                files.append(fname)
        elif sys.version[0]=='2' and type(pattern)==type(re.compile('')):
            if pattern.match(name):
                files.append(fname)
        elif type(pattern) is StringType:
            if fnmatch.fnmatch(name, pattern):
                files.append(fname)

    for fname in files:
        apply(func, (fname,)+args)
    for subdir in subdirs:
        recurse_files(subdir, pattern, exclusions, func, level=level+1)
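For readers on Python 3, the directory walk in `recurse_files` can be sketched as follows. This is an illustrative port, not code from this changeset. It assumes `print` as the default callback in place of `echo_fname`, detects a compiled regular expression by checking for a `match` attribute rather than by interpreter version, sorts directory entries for a repeatable order, and forwards the extra positional arguments into subdirectories, which the revision shown here does not do.

```python
import fnmatch
import os
import re


def recurse_files(curdir, pattern, exclusions, func=print, *args):
    """Call func(path, *args) for each file under curdir whose name matches.

    pattern is a compiled regular expression or an fnmatch-style glob string.
    Names whose last four characters appear in exclusions are skipped.
    """
    subdirs, files = [], []
    for name in sorted(os.listdir(curdir)):
        fname = os.path.join(curdir, name)
        if name[-4:] in exclusions:
            continue  # skip excluded (for example binary) file types
        if os.path.isdir(fname) and not os.path.islink(fname):
            subdirs.append(fname)
        elif hasattr(pattern, "match"):
            # Compiled regular expression: match against the bare file name.
            if pattern.match(name):
                files.append(fname)
        elif isinstance(pattern, str):
            # Plain string: treat it as a shell-style glob.
            if fnmatch.fnmatch(name, pattern):
                files.append(fname)
    # Process this directory's files first, then descend.
    for fname in files:
        func(fname, *args)
    for subdir in subdirs:
        recurse_files(subdir, pattern, exclusions, func, *args)
```

Passing a list's `append` method as `func` collects the matching paths instead of printing them, with the files of each directory visited before its subdirectories.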
|
#-- Data bundle for index dictionaries
class Index:
    def __init__(self, words, files, fileids):
        if words is not None: self.WORDS = words
        if files is not None: self.FILES = files
        if fileids is not None: self.FILEIDS = fileids
|
#-- "Split plain text into words" utility function
class TextSplitter:
    def initSplitter(self):
        prenum = string.join(map(chr, range(0,48)), '')
        num2cap = string.join(map(chr, range(58,65)), '')
        cap2low = string.join(map(chr, range(91,97)), '')
        postlow = string.join(map(chr, range(123,256)), '')
        nonword = prenum + num2cap + cap2low + postlow
        self.word_only = string.maketrans(nonword, " "*len(nonword))
        self.nondigits = string.join(map(chr, range(0,48)) + map(chr, range(58,255)), '')
        self.alpha = string.join(map(chr, range(65,91)) + map(chr, range(97,123)), '')
        self.ident = string.join(map(chr, range(256)), '')
        self.init = 1
|
    def splitter(self, text, ftype):
        "Split the contents of a text string into a list of 'words'"
        if ftype == 'text/plain':
            words = self.text_splitter(text, self.casesensitive)
        else:
            raise NotImplementedError
        return words
|
    def text_splitter(self, text, casesensitive=0):
        """Split text/plain string into a list of words

        In version 0.20 this function is still fairly weak at
        identifying "real" words, and excluding gibberish
        strings. As long as the indexer looks at "real" text
        files, it does pretty well; but if indexing of binary
        data is attempted, a lot of gibberish gets indexed.
        Suggestions on improving this are GREATLY APPRECIATED.
        """
        # Initialize some constants
        if not hasattr(self,'init'): self.initSplitter()

        # Speedup trick: attributes into local scope
        word_only = self.word_only
        ident = self.ident
        alpha = self.alpha
        nondigits = self.nondigits
        translate = string.translate

        # Let's adjust case if not case-sensitive
        if not casesensitive: text = string.upper(text)

        # Split the raw text
        allwords = string.split(text)

        # Finally, let's skip some words not worth indexing
        words = []
        for word in allwords:
            if len(word) > 25: continue  # too long (probably gibberish)

            # Identify common patterns in non-word data (binary, UU/MIME, etc)
            num_nonalpha = len(word.translate(ident, alpha))
            numdigits = len(word.translate(ident, nondigits))
            # 1.52: num_nonalpha = len(translate(word, ident, alpha))
            # 1.52: numdigits = len(translate(word, ident, nondigits))
            if numdigits > len(word)-2:  # almost all digits
                if numdigits > 5:  # too many digits is gibberish
                    continue  # a moderate number is year/zipcode/etc
            elif num_nonalpha*3 > len(word):  # too much scattered nonalpha = gibberish
                continue

            word = word.translate(word_only)  # Let's strip funny byte values
            # 1.52: word = translate(word, word_only)
            subwords = word.split()  # maybe embedded non-alphanumeric
            # 1.52: subwords = string.split(word)
            for subword in subwords:  # ...so we might have subwords
                if len(subword) <= 2: continue  # too short a subword
                words.append(subword)
        return words
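The filtering rules in `text_splitter` are easier to follow in isolation. The sketch below restates them as a standalone Python 3 function. It is not the module's code: it swaps the 256-byte translation tables for a regular expression and plain character counts, so it assumes ASCII-oriented `str` input and treats every non-ASCII character as a separator.

```python
import re

# Anything that is not an ASCII letter or digit separates subwords.
NON_WORD = re.compile(r"[^0-9A-Za-z]")


def text_splitter(text, casesensitive=False):
    """Split plain text into index-worthy words, skipping likely gibberish."""
    if not casesensitive:
        text = text.upper()
    words = []
    for word in text.split():
        if len(word) > 25:
            continue  # too long, probably gibberish
        numdigits = sum(ch in "0123456789" for ch in word)
        num_nonalpha = sum(
            not ("A" <= ch <= "Z" or "a" <= ch <= "z") for ch in word
        )
        if numdigits > len(word) - 2:  # almost all digits
            if numdigits > 5:
                continue  # long digit runs are treated as gibberish
            # otherwise keep it: plausibly a year, zip code, and so on
        elif num_nonalpha * 3 > len(word):
            continue  # too much scattered punctuation
        # Punctuation inside a token splits it into subwords.
        for subword in NON_WORD.sub(" ", word).split():
            if len(subword) > 2:  # drop one- and two-character fragments
                words.append(subword)
    return words
```

With the default case folding, a token such as `foo_bar.baz` yields the three subwords `FOO`, `BAR` and `BAZ`, while a seven-digit number or a token like `x#y@z!` is dropped entirely.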
|
class ZopeTextSplitter:
    def initSplitter(self):
        import Splitter
        stop_words=(
            'am', 'ii', 'iii', 'per', 'po', 're', 'a', 'about', 'above', 'across',
            'after', 'afterwards', 'again', 'against', 'all', 'almost', 'alone',
            'along', 'already', 'also', 'although', 'always', 'am', 'among',
            'amongst', 'amoungst', 'amount', 'an', 'and', 'another', 'any',
            'anyhow', 'anyone', 'anything', 'anyway', 'anywhere', 'are', 'around',
            'as', 'at', 'back', 'be', 'became', 'because', 'become', 'becomes',
            'becoming', 'been', 'before', 'beforehand', 'behind', 'being',
            'below', 'beside', 'besides', 'between', 'beyond', 'bill', 'both',
            'bottom', 'but', 'by', 'can', 'cannot', 'cant', 'con', 'could',
            'couldnt', 'cry', 'describe', 'detail', 'do', 'done', 'down', 'due',
            'during', 'each', 'eg', 'eight', 'either', 'eleven', 'else',
            'elsewhere', 'empty', 'enough', 'even', 'ever', 'every', 'everyone',
            'everything', 'everywhere', 'except', 'few', 'fifteen', 'fifty',
            'fill', 'find', 'fire', 'first', 'five', 'for', 'former', 'formerly',
            'forty', 'found', 'four', 'from', 'front', 'full', 'further', 'get',
            'give', 'go', 'had', 'has', 'hasnt', 'have', 'he', 'hence', 'her',
            'here', 'hereafter', 'hereby', 'herein', 'hereupon', 'hers',
            'herself', 'him', 'himself', 'his', 'how', 'however', 'hundred', 'i',
            'ie', 'if', 'in', 'inc', 'indeed', 'interest', 'into', 'is', 'it',
            'its', 'itself', 'keep', 'last', 'latter', 'latterly', 'least',
            'less', 'made', 'many', 'may', 'me', 'meanwhile', 'might', 'mill',
            'mine', 'more', 'moreover', 'most', 'mostly', 'move', 'much', 'must',
            'my', 'myself', 'name', 'namely', 'neither', 'never', 'nevertheless',
            'next', 'nine', 'no', 'nobody', 'none', 'noone', 'nor', 'not',
            'nothing', 'now', 'nowhere', 'of', 'off', 'often', 'on', 'once',
            'one', 'only', 'onto', 'or', 'other', 'others', 'otherwise', 'our',
            'ours', 'ourselves', 'out', 'over', 'own', 'per', 'perhaps',
            'please', 'pre', 'put', 'rather', 're', 'same', 'see', 'seem',
            'seemed', 'seeming', 'seems', 'serious', 'several', 'she', 'should',
            'show', 'side', 'since', 'sincere', 'six', 'sixty', 'so', 'some',
            'somehow', 'someone', 'something', 'sometime', 'sometimes',
            'somewhere', 'still', 'such', 'take', 'ten', 'than', 'that', 'the',
            'their', 'them', 'themselves', 'then', 'thence', 'there',
            'thereafter', 'thereby', 'therefore', 'therein', 'thereupon', 'these',
            'they', 'thick', 'thin', 'third', 'this', 'those', 'though', 'three',
            'through', 'throughout', 'thru', 'thus', 'to', 'together', 'too',
            'toward', 'towards', 'twelve', 'twenty', 'two', 'un', 'under',
            'until', 'up', 'upon', 'us', 'very', 'via', 'was', 'we', 'well',
            'were', 'what', 'whatever', 'when', 'whence', 'whenever', 'where',
            'whereafter', 'whereas', 'whereby', 'wherein', 'whereupon',
            'wherever', 'whether', 'which', 'while', 'whither', 'who', 'whoever',
            'whole', 'whom', 'whose', 'why', 'will', 'with', 'within', 'without',
            'would', 'yet', 'you', 'your', 'yours', 'yourself', 'yourselves',
            )
        self.stop_word_dict={}
        for word in stop_words: self.stop_word_dict[word]=None
        self.splitterobj = Splitter.getSplitter()
        self.init = 1
    def goodword(self, word):
        return len(word) < 25

    def splitter(self, text, ftype):
        """never case-sensitive"""
        if not hasattr(self,'init'): self.initSplitter()
        return filter(self.goodword, self.splitterobj(text, self.stop_word_dict))


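The splitter above combines a stop-word dictionary with a length filter. A minimal sketch of the same idea in modern Python follows. It makes several assumptions: the `Splitter` object from the listing is replaced by a regular expression, and the names `STOP_WORDS`, `MAX_WORD_LEN`, and `split_words` are hypothetical, as is the choice to lowercase tokens.

```python
import re

# Hypothetical stand-ins: a tiny stop-word set and a regex tokenizer
# replace the Splitter object used by the listing.
STOP_WORDS = {"would", "yet", "you", "your", "with", "why"}
MAX_WORD_LEN = 25  # mirrors goodword(): keep words shorter than 25 characters


def split_words(text):
    """Return lowercased tokens, minus stop words and overlong tokens."""
    tokens = re.findall(r"[A-Za-z0-9]+", text.lower())
    return [w for w in tokens
            if w not in STOP_WORDS and len(w) < MAX_WORD_LEN]
```

For example, `split_words("Why would you index THIS file?")` keeps only the tokens that are not stop words.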
#-- "Abstract" parent class for inherited indexers
#   (does not handle storage in parent, other methods are primitive)

class GenericIndexer:
    def __init__(self, **kw):
        apply(self.configure, (), kw)

    def whoami(self):
        return self.__class__.__name__

    def configure(self, REINDEX=0, CASESENSITIVE=0,
                        INDEXDB=os.environ.get('INDEXER_DB', 'TEMP_NDX.DB'),
                        ADD_PATTERN='*', QUIET=5):
        "Configure settings used by indexing and storage/retrieval"
        self.indexdb = INDEXDB
        self.reindex = REINDEX
        self.casesensitive = CASESENSITIVE
        self.add_pattern = ADD_PATTERN
        self.quiet = QUIET
        self.filter = None

    def add_files(self, dir=os.getcwd(), pattern=None, descend=1):
        self.load_index()
        exclusions = ('.zip','.pyc','.gif','.jpg','.dat','.dir')
        if not pattern:
            pattern = self.add_pattern
        recurse_files(dir, pattern, exclusions, self.add_file)
        # Rebuild the fileid index
        self.fileids = {}
        for fname in self.files.keys():
            fileid = self.files[fname][0]
            self.fileids[fileid] = fname

    def add_file(self, fname, ftype='text/plain'):
        "Index the contents of a regular file"
        if self.files.has_key(fname):   # Is file eligible for (re)indexing?
            if self.reindex:            # Reindexing enabled, cleanup dicts
                self.purge_entry(fname, self.files, self.words)
            else:                       # DO NOT reindex this file
                if self.quiet < 5: print "Skipping", fname
                return 0

        # Read in the file (if possible)
        try:
            if fname[-3:] == '.gz':
                text = gzip.open(fname).read()
            else:
                text = open(fname).read()
            if self.quiet < 5: print "Indexing", fname
        except IOError:
            return 0
        words = self.splitter(text, ftype)

        # Find new file index, and assign it to filename
        # (_TOP uses trick of negative to avoid conflict with file index)
        self.files['_TOP'] = (self.files['_TOP'][0]-1, None)
        file_index = abs(self.files['_TOP'][0])
        self.files[fname] = (file_index, len(words))

        filedict = {}
        for word in words:
            if filedict.has_key(word):
                filedict[word] = filedict[word]+1
            else:
                filedict[word] = 1

        for word in filedict.keys():
            if self.words.has_key(word):
                entry = self.words[word]
            else:
                entry = {}
            entry[file_index] = filedict[word]
            self.words[word] = entry

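The bookkeeping in `add_file` can be sketched on plain dictionaries. This is an illustration under stated assumptions, not the module's API: `add_document` is a hypothetical helper, it takes an already-split word list instead of a file name, and it uses `collections.Counter` in place of the hand-written counting loop. The `_TOP` counter and the `(file_index, wordcount)` tuples follow the listing.

```python
from collections import Counter


def add_document(words_index, files_index, name, words):
    """Record `words` (a list of tokens) for `name` in both indexes."""
    # Allocate the next file id. As in the listing, the counter lives
    # under '_TOP' as a negative number so it cannot clash with a file id.
    top = files_index.get("_TOP", (0, None))[0] - 1
    files_index["_TOP"] = (top, None)
    file_id = abs(top)
    files_index[name] = (file_id, len(words))
    # Count occurrences per word, then merge them into the word index,
    # which maps word -> {file_id: occurrences}.
    for word, count in Counter(words).items():
        words_index.setdefault(word, {})[file_id] = count
    return file_id
```

Each word ends up mapped to a small dictionary of per-file counts, which is the shape that `find` and `print_report` rely on.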
    def add_othertext(self, identifier):
        """Index a textual source other than a plain file

        A child class might want to implement this method (or a similar one)
        in order to index textual sources such as SQL tables, URLs, clay
        tablets, or whatever else.  The identifier should uniquely pick out
        the source of the text (whatever it is)
        """
        raise NotImplementedError

    def save_index(self, INDEXDB=None):
        raise NotImplementedError

    def load_index(self, INDEXDB=None, reload=0, wordlist=None):
        raise NotImplementedError

    def find(self, wordlist, print_report=0):
        "Locate files that match ALL the words in wordlist"
        self.load_index(wordlist=wordlist)
        entries = {}
        hits = copy.copy(self.fileids)      # Copy of fileids index
        for word in wordlist:
            if not self.casesensitive:
                word = string.upper(word)
            entry = self.words.get(word)    # For each word, get index
            entries[word] = entry           #   of matching files
            if not entry:                   # Nothing for this one word (fail)
                return 0
            for fileid in hits.keys():      # Eliminate hits for every non-match
                if not entry.has_key(fileid):
                    del hits[fileid]
        if print_report:
            self.print_report(hits, wordlist, entries)
        return hits

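`find` starts with every file as a candidate and drops any file that lacks one of the search words. The sketch below shows that elimination on plain dictionaries. `find_all` is a hypothetical name, case folding is left out, and the sketch returns an empty dictionary where the listing returns `0` for an unknown word.

```python
def find_all(words_index, fileids, wordlist):
    """Return {file_id: name} for files containing every word in wordlist."""
    hits = dict(fileids)  # start with every file as a candidate
    for word in wordlist:
        entry = words_index.get(word)
        if not entry:  # unknown word: no file can match all words
            return {}
        # Keep only the candidates that also contain this word.
        hits = {fid: name for fid, name in hits.items() if fid in entry}
    return hits
```

Rebuilding `hits` on each pass avoids deleting from a dictionary while iterating over it, which newer Python versions reject.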
    def print_report(self, hits={}, wordlist=[], entries={}):
        # Figure out what to actually print (based on QUIET level)
        output = []
        for fileid,fname in hits.items():
            message = fname
            if self.quiet <= 3:
                wordcount = self.files[fname][1]
                matches = 0
                countmess = '\n'+' '*13+`wordcount`+' words; '
                for word in wordlist:
                    if not self.casesensitive:
                        word = string.upper(word)
                    occurs = entries[word][fileid]
                    matches = matches+occurs
                    countmess = countmess +`occurs`+' '+word+'; '
                message = string.ljust('[RATING: '
                                       +`1000*matches/wordcount`+']',13)+message
                if self.quiet <= 2: message = message +countmess +'\n'
            if self.filter:     # Using an output filter
                if fnmatch.fnmatch(message, self.filter):
                    output.append(message)
            else:
                output.append(message)

        if self.quiet <= 5:
            print string.join(output,'\n')
        sys.stderr.write('\n'+`len(output)`+' files matched wordlist: '+
                         `wordlist`+'\n')
        return output

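The rating printed by `print_report` is the number of matched occurrences per thousand indexed words, truncated to an integer. In the Python 2 listing the truncation comes from `/` on two integers. The sketch below assumes modern Python and therefore uses `//`; the helper name `rating` is hypothetical.

```python
def rating(occurrences, wordcount):
    """Matches per thousand indexed words, truncated to an integer."""
    matches = sum(occurrences)  # total hits across all search words
    return 1000 * matches // wordcount
```

A file of 120 indexed words in which the search words occur 2 and 1 times rates `1000 * 3 // 120`, that is 25.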
    def purge_entry(self, fname, file_dct, word_dct):
        "Remove a file from file index and word index"
        try:        # The easy part, cleanup the file index
            # Each entry is (file_index, wordcount); the word index is
            # keyed by file_index alone
            file_index = file_dct[fname][0]
            del file_dct[fname]
        except KeyError:
            return  # Unknown file, nothing to purge from the word index
        # The much harder part, cleanup the word index
        for word, occurs in word_dct.items():
            if occurs.has_key(file_index):
                del occurs[file_index]
                word_dct[word] = occurs

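Purging has two parts: drop the file's record, then remove its id from every word entry. A minimal sketch on plain dictionaries follows, assuming the `(file_index, wordcount)` records that `add_file` stores. `purge` is a hypothetical helper and its boolean return value is an addition for illustration.

```python
def purge(name, files_index, words_index):
    """Remove `name` from the file index and from every word entry."""
    record = files_index.pop(name, None)
    if record is None:  # unknown file: nothing to clean up
        return False
    file_id = record[0]  # files_index maps name -> (file_id, wordcount)
    for occurs in words_index.values():
        occurs.pop(file_id, None)  # no-op when the word is not in this file
    return True
```

The word entries are matched on the integer file id, not on the whole record tuple, since the per-word dictionaries are keyed by file id only.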
    def index_loaded(self):
        return ( hasattr(self,'fileids') and
                 hasattr(self,'files')   and
                 hasattr(self,'words')      )

|
1b2d0e702ca8
Added feature [SF#526730] - search for messages capability
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff
changeset
|
492 #-- Provide an actual storage facility for the indexes (i.e. shelve) |
|
1b2d0e702ca8
Added feature [SF#526730] - search for messages capability
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff
changeset
|
493 class ShelveIndexer(GenericIndexer, TextSplitter): |
|
1b2d0e702ca8
Added feature [SF#526730] - search for messages capability
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff
changeset
|
494 """Concrete Indexer utilizing [shelve] for storage |
|
1b2d0e702ca8
Added feature [SF#526730] - search for messages capability
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff
changeset
|
495 |
|
1b2d0e702ca8
Added feature [SF#526730] - search for messages capability
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff
changeset
|
496 Unfortunately, [shelve] proves far too slow in indexing, while |
|
1b2d0e702ca8
Added feature [SF#526730] - search for messages capability
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff
changeset
|
497 creating monstrously large indexes. Not recommend, at least under |
|
1b2d0e702ca8
Added feature [SF#526730] - search for messages capability
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff
changeset
|
498 the default dbm's tested. Also, class may be broken because |
|
1b2d0e702ca8
Added feature [SF#526730] - search for messages capability
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff
changeset
|
499 shelves do not, apparently, support the .values() and .items() |
|
1b2d0e702ca8
Added feature [SF#526730] - search for messages capability
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff
changeset
|
500 methods. Fixing this is a low priority, but the sample code is |
|
1b2d0e702ca8
Added feature [SF#526730] - search for messages capability
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff
changeset
|
501 left here. |
|
1b2d0e702ca8
Added feature [SF#526730] - search for messages capability
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff
changeset
|
502 """ |
    def load_index(self, INDEXDB=None, reload=0, wordlist=None):
        INDEXDB = INDEXDB or self.indexdb
        import shelve
        # One shelf per dictionary, each stored beside the index path
        self.words = shelve.open(INDEXDB+".WORDS")
        self.files = shelve.open(INDEXDB+".FILES")
        self.fileids = shelve.open(INDEXDB+".FILEIDS")
        if not self.files:            # New index
            self.files['_TOP'] = (0,None)

    def save_index(self, INDEXDB=None):
        INDEXDB = INDEXDB or self.indexdb
        pass

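For readers on a current interpreter, the docstring's concern about `.values()` and `.items()` can be checked directly. The sketch below assumes Python 3, where `shelve.Shelf` implements the full mapping interface. The helper name `open_shelf_index` and the temporary path are hypothetical and not part of this module.

```python
import os
import shelve
import tempfile

def open_shelf_index(indexdb):
    """Open the three shelves and seed the '_TOP' sentinel in a new index."""
    words = shelve.open(indexdb + ".WORDS")
    files = shelve.open(indexdb + ".FILES")
    fileids = shelve.open(indexdb + ".FILEIDS")
    if '_TOP' not in files:          # new index: no sentinel stored yet
        files['_TOP'] = (0, None)
    return words, files, fileids

# Hypothetical usage in a throwaway directory
with tempfile.TemporaryDirectory() as tmp:
    words, files, fileids = open_shelf_index(os.path.join(tmp, "idx"))
    top_items = list(files.items())  # .items() works on a Python 3 shelf
    for shelf in (words, files, fileids):
        shelf.close()
```

Testing for the sentinel key rather than for emptiness is a design choice of this sketch; either check identifies a freshly created shelf.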
class FlatIndexer(GenericIndexer, TextSplitter):
    """Concrete Indexer utilizing flat-file for storage

    See the comments in the referenced article for details; in
    brief, this indexer has about the same timing as the best in
    -creating- indexes and the storage requirements are
    reasonable. However, actually -using- a flat-file index is
    more than an order of magnitude worse than the best indexer
    (ZPickleIndexer wins overall).

    On the other hand, FlatIndexer creates a wonderfully easy to
    parse database format if you have a reason to transport the
    index to a different platform or programming language. And
    should you perform indexing as part of a long-running
    process, the overhead of initial file parsing becomes
    irrelevant.
    """
    def load_index(self, INDEXDB=None, reload=0, wordlist=None):
        # Unless reload is indicated, do not load twice
        if self.index_loaded() and not reload: return 0
        # Ok, now let's actually load it
        INDEXDB = INDEXDB or self.indexdb
        self.words = {}
        self.files = {'_TOP':(0,None)}
        self.fileids = {}
        try:                            # Read index contents
            for line in open(INDEXDB).readlines():
                fields = string.split(line)
                if fields[0] == '-':    # Read a file/fileid line
                    fileid = eval(fields[2])
                    wordcount = eval(fields[3])
                    fname = fields[1]
                    self.files[fname] = (fileid, wordcount)
                    self.fileids[fileid] = fname
                else:                   # Read a word entry (dict of hits)
                    entries = {}
                    word = fields[0]
                    for n in range(1,len(fields),2):
                        fileid = eval(fields[n])
                        occurs = eval(fields[n+1])
                        entries[fileid] = occurs
                    self.words[word] = entries
        except:
            pass                    # New index

    def save_index(self, INDEXDB=None):
        INDEXDB = INDEXDB or self.indexdb
        tab, lf, sp = '\t','\n',' '
        indexdb = open(INDEXDB,'w')
        # File lines: "- <fname> TAB <fileid> TAB <wordcount>"
        for fname,entry in self.files.items():
            indexdb.write('- '+fname +tab +`entry[0]` +tab +`entry[1]` +lf)
        # Word lines: "<word> TAB TAB <fileid> <occurs> <fileid> <occurs> ..."
        for word,entry in self.words.items():
            indexdb.write(word +tab+tab)
            for fileid,occurs in entry.items():
                indexdb.write(`fileid` +sp +`occurs` +sp)
            indexdb.write(lf)
        indexdb.close()

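The flat-file layout written by `save_index` and read back by `load_index` can be illustrated with a small round trip. This is a minimal sketch in Python 3, not the module's own code: the function names `dump_flat` and `parse_flat` are hypothetical, the data is made up for illustration, and `ast.literal_eval` is swapped in for `eval` so that only plain literals are accepted.

```python
import ast

def dump_flat(files, words):
    """Serialize both dicts to text in a layout compatible with the one above."""
    lines = []
    for fname, (fileid, wordcount) in files.items():
        # File line: "- <fname> TAB <fileid> TAB <wordcount>"
        lines.append("- %s\t%r\t%r" % (fname, fileid, wordcount))
    for word, entry in words.items():
        # Word line: "<word> TAB TAB <fileid> <occurs> ..."
        hits = " ".join("%r %r" % pair for pair in entry.items())
        lines.append("%s\t\t%s" % (word, hits))
    return "\n".join(lines) + "\n"

def parse_flat(text):
    """Rebuild the files, fileids and words dicts from the flat text."""
    files, fileids, words = {}, {}, {}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == '-':             # file/fileid line
            fname = fields[1]
            fileid = ast.literal_eval(fields[2])
            wordcount = ast.literal_eval(fields[3])
            files[fname] = (fileid, wordcount)
            fileids[fileid] = fname
        else:                            # word entry: pairs of fileid, occurs
            words[fields[0]] = {
                ast.literal_eval(fields[n]): ast.literal_eval(fields[n + 1])
                for n in range(1, len(fields), 2)
            }
    return files, fileids, words

# Illustrative data only
files = {'_TOP': (0, None), 'msg1.txt': (1, 12)}
words = {'SEARCH': {1: 3}}
files2, fileids2, words2 = parse_flat(dump_flat(files, words))
```

Because fields are split on whitespace, file names containing spaces would not survive this layout, in the sketch or in the original.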
class PickleIndexer(GenericIndexer, TextSplitter):
    def load_index(self, INDEXDB=None, reload=0, wordlist=None):
        # Unless reload is indicated, do not load twice
        if self.index_loaded() and not reload: return 0
        # Ok, now let's actually load it
        import cPickle
        INDEXDB = INDEXDB or self.indexdb
        try:
            pickle_str = open(INDEXDB,'rb').read()
            db = cPickle.loads(pickle_str)
        except:                     # New index
            db = Index({}, {'_TOP':(0,None)}, {})
        self.words, self.files, self.fileids = db.WORDS, db.FILES, db.FILEIDS

    def save_index(self, INDEXDB=None):
        import cPickle
        INDEXDB = INDEXDB or self.indexdb
        db = Index(self.words, self.files, self.fileids)
        open(INDEXDB,'wb').write(cPickle.dumps(db, 1))

class XMLPickleIndexer(PickleIndexer):
    """Concrete Indexer utilizing XML for storage

    While this is, as expected, a verbose format, the possibility
    of using XML as a transport format for indexes might be
    useful. However, [xml_pickle] is in need of some redesign to
    avoid gross inefficiency when creating very large
    (multi-megabyte) output files (fixed in [xml_pickle] version
    0.48 or above).
    """
    def load_index(self, INDEXDB=None, reload=0, wordlist=None):
        # Unless reload is indicated, do not load twice
        if self.index_loaded() and not reload: return 0
        # Ok, now let's actually load it
        from gnosis.xml.pickle import XML_Pickler
        INDEXDB = INDEXDB or self.indexdb
        try:                        # XML file exists
            xml_str = open(INDEXDB).read()
            db = XML_Pickler().loads(xml_str)
        except:                     # New index
            db = Index({}, {'_TOP':(0,None)}, {})
        self.words, self.files, self.fileids = db.WORDS, db.FILES, db.FILEIDS

    def save_index(self, INDEXDB=None):
        from gnosis.xml.pickle import XML_Pickler
        INDEXDB = INDEXDB or self.indexdb
        db = Index(self.words, self.files, self.fileids)
        open(INDEXDB,'w').write(XML_Pickler(db).dumps())

class ZPickleIndexer(PickleIndexer):
    def load_index(self, INDEXDB=None, reload=0, wordlist=None):
        # Unless reload is indicated, do not load twice
        if self.index_loaded() and not reload: return 0
        # Ok, now let's actually load it
        import cPickle, zlib
        INDEXDB = INDEXDB or self.indexdb
        try:
            # The compressed pickle is stored under the index name plus '!'
            pickle_str = zlib.decompress(open(INDEXDB+'!','rb').read())
            db = cPickle.loads(pickle_str)
        except:                     # New index
            db = Index({}, {'_TOP':(0,None)}, {})
        self.words, self.files, self.fileids = db.WORDS, db.FILES, db.FILEIDS

    def save_index(self, INDEXDB=None):
        import cPickle, zlib
        INDEXDB = INDEXDB or self.indexdb
        db = Index(self.words, self.files, self.fileids)
        pickle_fh = open(INDEXDB+'!','wb')
        pickle_fh.write(zlib.compress(cPickle.dumps(db, 1)))
        pickle_fh.close()


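The pickle-then-compress scheme used by `ZPickleIndexer` can be sketched in memory as follows. This is an illustration under stated assumptions rather than the module's code: it targets Python 3 (`pickle` in place of `cPickle`), bundles the three dictionaries in a plain tuple instead of the module's `Index` object, and the names `pack_index` and `unpack_index` are hypothetical.

```python
import pickle
import zlib

def pack_index(words, files, fileids):
    """Pickle the three dictionaries together, then zlib-compress the bytes."""
    payload = pickle.dumps((words, files, fileids),
                           protocol=pickle.HIGHEST_PROTOCOL)
    return zlib.compress(payload)

def unpack_index(blob):
    """Inverse of pack_index: decompress, then unpickle."""
    return pickle.loads(zlib.decompress(blob))

# Illustrative data only
words = {'SEARCH': {1: 3}}
files = {'_TOP': (0, None), 'msg1.txt': (1, 12)}
fileids = {1: 'msg1.txt'}

blob = pack_index(words, files, fileids)
restored = unpack_index(blob)
```

Unpickling executes whatever the byte stream describes, so a loader like this should only be pointed at index files the application wrote itself.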
class SlicedZPickleIndexer(ZPickleIndexer):
    segments = "ABCDEFGHIJKLMNOPQRSTUVWXYZ#-!"
    def load_index(self, INDEXDB=None, reload=0, wordlist=None):
        # Unless reload is indicated, do not load twice
        if self.index_loaded() and not reload: return 0
        # Ok, now let's actually load it
        import cPickle, zlib
        INDEXDB = INDEXDB or self.indexdb
        db = Index({}, {'_TOP':(0,None)}, {})
        # Identify the relevant word-dictionary segments
        if not wordlist:
            segments = self.segments
        else:
            segments = ['-','#']
            for word in wordlist:
                segments.append(string.upper(word[0]))
        # Load the segments
        for segment in segments:
            try:
                pickle_str = zlib.decompress(open(INDEXDB+segment,'rb').read())
                dbslice = cPickle.loads(pickle_str)
                if dbslice.__dict__.get('WORDS'):   # If it has some words, add them
                    for word,entry in dbslice.WORDS.items():
                        db.WORDS[word] = entry
                if dbslice.__dict__.get('FILES'):   # If it has some files, add them
                    db.FILES = dbslice.FILES
                if dbslice.__dict__.get('FILEIDS'): # If it has fileids, add them
                    db.FILEIDS = dbslice.FILEIDS
            except:
                pass    # No biggie, couldn't find this segment
        self.words, self.files, self.fileids = db.WORDS, db.FILES, db.FILEIDS

|
1b2d0e702ca8
Added feature [SF#526730] - search for messages capability
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff
changeset
|
    def julienne(self, INDEXDB=None):
        import cPickle, zlib
        INDEXDB = INDEXDB or self.indexdb
        segments = self.segments  # all the (little) indexes
        for segment in segments:
            try:  # brutal space saver... delete all the small segments
                os.remove(INDEXDB+segment)
            except OSError:
                pass  # probably just nonexistent segment index file
        # First write the much simpler filename/fileid dictionaries
        dbfil = Index(None, self.files, self.fileids)
        open(INDEXDB+'-','wb').write(zlib.compress(cPickle.dumps(dbfil,1)))
        # The hard part is splitting the word dictionary up, of course
        letters = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
        segdicts = {}  # Need batch of empty dicts
        for segment in letters+'#':
            segdicts[segment] = {}
        for word, entry in self.words.items():  # Split into segment dicts
            initchar = string.upper(word[0])
            if initchar in letters:
                segdicts[initchar][word] = entry
            else:
                segdicts['#'][word] = entry
        for initchar in letters+'#':
            db = Index(segdicts[initchar], None, None)
            pickle_str = cPickle.dumps(db, 1)
            filename = INDEXDB+initchar
            pickle_fh = open(filename,'wb')
            pickle_fh.write(zlib.compress(pickle_str))
            os.chmod(filename,0664)

    save_index = julienne

PreferredIndexer = SlicedZPickleIndexer

#-- If called from command-line, parse arguments and take actions
if __name__ == '__main__':
    import time
    start = time.time()
    search_words = []  # Word search list (if specified)
    opts = 0  # Any options specified?
    if len(sys.argv) < 2:
        pass  # No options given
    else:
        upper = string.upper
        dir = os.getcwd()  # Default to indexing from current directory
        descend = 1  # Default to recursive indexing
        ndx = PreferredIndexer()
        for opt in sys.argv[1:]:
            if opt in ('-h','/h','-?','/?','?','--help'):  # help screen
                print __shell_usage__
                opts = -1
                break
            elif opt[0] in '/-':  # a switch!
                opts = opts+1
                if upper(opt[1:]) == 'INDEX':  # Index files
                    ndx.quiet = 0
                    pass  # Use defaults if no other options
                elif upper(opt[1:]) == 'REINDEX':  # Reindex
                    ndx.reindex = 1
                elif upper(opt[1:]) == 'CASESENSITIVE':  # Case sensitive
                    ndx.casesensitive = 1
                elif upper(opt[1:]) in ('NORECURSE','LOCAL'):  # No recursion
                    descend = 0
                elif upper(opt[1:4]) == 'DIR':  # Dir to index
                    dir = opt[5:]
                elif upper(opt[1:8]) == 'INDEXDB':  # Index specified
                    ndx.indexdb = opt[9:]
                    sys.stderr.write(
                        "Use of INDEXER_DB environment variable is STRONGLY recommended.\n")
                elif upper(opt[1:6]) == 'REGEX':  # RegEx files to index
                    ndx.add_pattern = re.compile(opt[7:])
                elif upper(opt[1:5]) == 'GLOB':  # Glob files to index
                    ndx.add_pattern = opt[6:]
                elif upper(opt[1:7]) in ('OUTPUT','FORMAT'):  # How should results look?
                    opts = opts-1  # this is not an option for indexing purposes
                    level = upper(opt[8:])
                    if level in ('ALL','EVERYTHING','VERBOSE', 'MAX'):
                        ndx.quiet = 0
                    elif level in ('RATINGS','SCORES','HIGH'):
                        ndx.quiet = 3
                    elif level in ('FILENAMES','NAMES','FILES','MID'):
                        ndx.quiet = 5
                    elif level in ('SUMMARY','MIN'):
                        ndx.quiet = 9
                elif upper(opt[1:7]) == 'FILTER':  # Regex filter output
                    opts = opts-1  # this is not an option for indexing purposes
                    ndx.filter = opt[8:]
                elif opt[1:] in string.digits:
                    opts = opts-1
                    ndx.quiet = eval(opt[1])
            else:
                search_words.append(opt)  # Search words

        if opts > 0:
            ndx.add_files(dir=dir)
            ndx.save_index()
        if search_words:
            ndx.find(search_words, print_report=1)
    if not opts and not search_words:
        sys.stderr.write("Perhaps you would like to use the --help option?\n")
    else:
        sys.stderr.write('Processed in %.3f seconds (%s)'
                         % (time.time()-start, ndx.whoami()))
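
The segment scheme used by `julienne` and the segment-loading loop in the listing can be illustrated outside the indexer class. The following is only a minimal sketch in modern Python 3, not the module's code: `segment_for`, `save_segments` and `load_segments` are hypothetical helper names, plain dictionaries stand in for the `Index` objects, and non-letter initials are mapped to `'#'` on lookup as well as on save, which is an assumption of this sketch.

```python
import os
import pickle
import tempfile
import zlib

LETTERS = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"


def segment_for(word):
    """Return the segment key for a non-empty word: uppercased initial, or '#'."""
    initial = word[0].upper()
    return initial if initial in LETTERS else "#"


def save_segments(prefix, words):
    """Write one zlib-compressed pickle per segment (hypothetical helper)."""
    segdicts = {segment: {} for segment in LETTERS + "#"}
    for word, entry in words.items():  # split into per-initial dictionaries
        segdicts[segment_for(word)][word] = entry
    for segment, segdict in segdicts.items():
        with open(prefix + segment, "wb") as fh:
            fh.write(zlib.compress(pickle.dumps(segdict, protocol=2)))


def load_segments(prefix, wordlist):
    """Load only the segments that could contain the words in wordlist."""
    words = {}
    for segment in sorted({segment_for(word) for word in wordlist}):
        try:
            with open(prefix + segment, "rb") as fh:
                words.update(pickle.loads(zlib.decompress(fh.read())))
        except FileNotFoundError:
            pass  # a missing segment file is treated as an empty segment
    return words


if __name__ == "__main__":
    index = {"apple": {1: 2}, "avocado": {2: 1}, "banana": {1: 1}, "42": {3: 1}}
    with tempfile.TemporaryDirectory() as tmp:
        prefix = os.path.join(tmp, "index.")
        save_segments(prefix, index)
        # Only the 'A' segment is read from disk for this query.
        print(sorted(load_segments(prefix, ["apple"])))  # ['apple', 'avocado']
```

The point of the split is that a search only has to decompress and unpickle the segments matching the initial letters of the query words, rather than the whole word dictionary.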
