annotate roundup/indexer.py @ 682:b4d13f7cc6c4 search_indexing-0-4-2-branch

Oops. Forgot to include cvs keywords in file.
author Roche Compaan <rochecompaan@users.sourceforge.net>
date Wed, 03 Apr 2002 12:01:55 +0000
parents 1b2d0e702ca8
children 7f5b51ffe92d
rev   line source
681
1b2d0e702ca8 Added feature [SF#526730] - search for messages capability
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff changeset
1 #!/usr/bin/env python
2
3 """Create full-text indexes and search them
4
5 Notes:
6
7   See http://gnosis.cx/publish/programming/charming_python_15.txt
8   for a detailed discussion of this module.
9
10   This version requires Python 1.6+. It turns out that the use
11   of string methods rather than [string] module functions is
12   enough faster in a tight loop so as to provide a quite
13   remarkable 25% speedup in overall indexing. However, only FOUR
14   lines in TextSplitter.text_splitter() were changed away from
15   Python 1.5 compatibility. Those lines are followed by comments
16   beginning with "# 1.52: " that show the old forms. Python
17   1.5 users can restore these lines, and comment out those just
18   above them.
19
20 Classes:
21
22     GenericIndexer      -- Abstract class
23     TextSplitter        -- Mixin class
24     Index
25     ShelveIndexer
26     FlatIndexer
27     XMLPickleIndexer
28     PickleIndexer
29     ZPickleIndexer
30     SlicedZPickleIndexer
31
32 Functions:
33
34     echo_fname(fname)
35     recurse_files(...)
36
37 Index Formats:
38
39     *Indexer.files:     filename --> (fileid, wordcount)
40     *Indexer.fileids:   fileid --> filename
41     *Indexer.words:     word --> {fileid1:occurs, fileid2:occurs, ...}
42
43 Module Usage:
44
45   There are a few ways to use this module. Just to utilize existing
46   functionality, something like the following is a likely
47   pattern:
48
49       import gnosis.indexer as indexer
50       index = indexer.MyFavoriteIndexer()     # For some concrete Indexer
51       index.load_index('myIndex.db')
52       index.add_files(dir='/this/that/otherdir', pattern='*.txt')
53       hits = index.find(['spam','eggs','bacon'])
54       index.print_report(hits)
55
56   To customize the basic classes, something like the following is likely:
57
58       class MySplitter:
59           def splitter(self, text, ftype):
60               "Perform much better splitting than default (for filetypes)"
61               # ...
62               return words
63
64       class MyIndexer(indexer.GenericIndexer, MySplitter):
65           def load_index(self, INDEXDB=None):
66               "Retrieve three dictionaries from clever storage method"
67               # ...
68               self.words, self.files, self.fileids = WORDS, FILES, FILEIDS
69           def save_index(self, INDEXDB=None):
70               "Save three dictionaries to clever storage method"
71
72       index = MyIndexer()
73       # ...etc...
74
75 Benchmarks:
76
77   As we know, there are lies, damn lies, and benchmarks. Take
78   the below with an adequate dose of salt. In version 0.10 of
79   the concrete indexers, some performance was tested. The
80   test case was a set of mail/news archives totaling about
81   43 MB in 225 files. In each case, an index was generated
82   (if possible), and a search for the words "xml python" was
83   performed.
84
85     - Index w/ PickleIndexer:     482s, 2.4 MB
86     - Search w/ PickleIndexer:    1.74s
87     - Index w/ ZPickleIndexer:    484s, 1.2 MB
88     - Search w/ ZPickleIndexer:   1.77s
89     - Index w/ FlatIndexer:       492s, 2.6 MB
90     - Search w/ FlatIndexer:      53s
91     - Index w/ ShelveIndexer:     (dumbdbm) Many minutes, tens of MB
92     - Search w/ ShelveIndexer:    Aborted before completely indexed
93     - Index w/ ShelveIndexer:     (dbhash) Long time (partial crash), 10 MB
94     - Search w/ ShelveIndexer:    N/A. Too many glitches
95     - Index w/ XMLPickleIndexer:  Memory error (xml_pickle uses bad string
96                                   composition for large output)
97     - Search w/ XMLPickleIndexer: N/A
98     - grep search (xml|python):   20s (cached: <5s)
99     - 'srch' utility (python):    12s
100 """
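The "Index Formats" part of the docstring above describes three dictionaries. The following is a minimal sketch of how they fit together, written in modern Python with made-up data. The helper `find_all` is hypothetical and is not the module's `find()` method; it assumes a hit must contain every search word and that occurrence counts are simply summed.

```python
# Toy data shaped like the three index dictionaries from the docstring.
files = {"a.txt": (1, 5), "b.txt": (2, 4)}   # filename -> (fileid, wordcount)
fileids = {1: "a.txt", 2: "b.txt"}           # fileid -> filename
words = {                                    # word -> {fileid: occurrences}
    "spam": {1: 2, 2: 1},
    "eggs": {1: 1},
    "bacon": {2: 3},
}


def find_all(wordlist):
    """Hypothetical helper: filename -> summed occurrences, for files
    that contain every word in wordlist (an assumed matching rule)."""
    hits = None
    for word in wordlist:
        entry = words.get(word, {})
        if hits is None:
            # First word: start from its posting dictionary.
            hits = dict(entry)
        else:
            # Later words: keep only fileids present in both.
            hits = {fid: hits[fid] + entry[fid] for fid in hits if fid in entry}
    # Translate fileids back to filenames via the reverse mapping.
    return {fileids[fid]: count for fid, count in (hits or {}).items()}
```

Keeping `fileids` as a separate reverse mapping lets `words` store small integer ids instead of repeating full filenames in every posting dictionary.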
682
b4d13f7cc6c4 Oops. Forgot to include cvs keywords in file.
Roche Compaan <rochecompaan@users.sourceforge.net>
parents: 681
diff changeset
101 #$Id: indexer.py,v 1.1.2.2 2002-04-03 12:01:55 rochecompaan Exp $
102
681
103 __shell_usage__ = """
104 Shell Usage: [python] indexer.py [options] [search_words]
105
106     -h, /h, -?, /?, ?, --help:    Show this help screen
107     -index:                       Add files to index
108     -reindex:                     Refresh files already in the index
109                                   (can take much more time)
110     -casesensitive:               Maintain the case of indexed words
111                                   (can lead to MUCH larger indices)
112     -norecurse, -local:           Only index starting dir, not subdirs
113     -dir=<directory>:             Starting directory for indexing
114                                   (default is current directory)
115     -indexdb=<database>:          Use specified index database
116                                   (environ variable INDEXER_DB is preferred)
117     -regex=<pattern>:             Index files matching regular expression
118     -glob=<pattern>:              Index files matching glob pattern
119     -filter=<pattern>:            Only display results matching pattern
120     -output=<opt>, -format=<opt>: How much detail on matches?
121     -<digit>:                     Quiet level (0=verbose ... 9=quiet)
122
123 Output/format options are ALL/EVERYTHING/VERBOSE, RATINGS/SCORES,
124 FILENAMES/NAMES/FILES, SUMMARY/REPORT"""
125
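The usage text above uses single-dash switches, some carrying a value after `=`, plus a `-<digit>` quiet level. A rough sketch of how such arguments could be separated from the search words follows. `parse_args` and the shape of its result are assumptions made for illustration, not the parser this module actually uses.

```python
HELP_SWITCHES = ("-h", "/h", "-?", "/?", "?", "--help")


def parse_args(argv):
    """Hypothetical parser for the switch style shown in the usage text."""
    opts = {"quiet": None, "flags": set(), "values": {}}
    search_words = []
    for arg in argv:
        if arg in HELP_SWITCHES:
            opts["flags"].add("help")
        elif len(arg) == 2 and arg[0] == "-" and arg[1] in "0123456789":
            # -<digit>: quiet level, 0=verbose ... 9=quiet
            opts["quiet"] = int(arg[1])
        elif arg.startswith("-") and "=" in arg:
            # -name=value switches such as -dir=... or -glob=...
            key, value = arg[1:].split("=", 1)
            opts["values"][key.lower()] = value
        elif arg.startswith("-"):
            # Bare switches such as -index or -norecurse
            opts["flags"].add(arg[1:].lower())
        else:
            # Anything else is treated as a search word.
            search_words.append(arg)
    return opts, search_words
```

For example, `parse_args(["-index", "-dir=/tmp/mail", "-5", "xml", "python"])` would yield the `index` flag, a `dir` value, quiet level 5, and the two search words.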
682
126 __version__ = "$Revision: 1.1.2.2 $"
681
127 __author__=["David Mertz (mertz@gnosis.cx)",]
128 __thanks_to__=["Pat Knight (p.knight@ktgroup.co.uk)",
129                "Gregory Popovitch (greg@gpy.com)", ]
130 __copyright__="""
131     This file is released to the public domain. I (dqm) would
132     appreciate it if you choose to keep derived works under terms
133     that promote freedom, but obviously am giving up any rights
134     to compel such.
135 """
136
137 __history__="""
138     0.1    Initial version.
139
140     0.11   Tweaked TextSplitter after some random experimentation.
141
142     0.12   Added SlicedZPickleIndexer (best choice, so far).
143
144     0.13   Pat Knight pointed out need for binary open()'s of
145            certain files under Windows.
146
147     0.14   Added '-filter' switch to search results.
148
149     0.15   Added direct read of gzip files.
150
151     0.20   Gregory Popovitch did some profiling on TextSplitter,
152            and provided both huge speedups to the Python version
153            and hooks to a C extension class (ZopeTextSplitter).
154            A little refactoring by him and me (dqm) has nearly
155            doubled the speed of indexing.
156
157     0.30   Module refactored into gnosis package. This is a
158            first pass, and various documentation and test cases
159            should be added later.
160 """
161 import string, re, os, fnmatch, sys, copy, gzip
162 from types import *
163
164 #-- Silly "do nothing" default recursive file processor
165 def echo_fname(fname): print fname
166
167 #-- "Recurse and process files" utility function
168 def recurse_files(curdir, pattern, exclusions, func=echo_fname, *args, **kw):
169     "Recursively process file pattern"
170     subdirs, files = [],[]
171     level = kw.get('level',0)
172
173     for name in os.listdir(curdir):
174         fname = os.path.join(curdir, name)
175         if name[-4:] in exclusions:
176             pass            # do not include binary file type
177         elif os.path.isdir(fname) and not os.path.islink(fname):
178             subdirs.append(fname)
179         # kludge to detect a regular expression across python versions
180         elif sys.version[0]=='1' and isinstance(pattern, re.RegexObject):
181             if pattern.match(name):
182                 files.append(fname)
183         elif sys.version[0]=='2' and type(pattern)==type(re.compile('')):
184             if pattern.match(name):
1b2d0e702ca8 Added feature [SF#526730] - search for messages capability
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff changeset
185 files.append(fname)
1b2d0e702ca8 Added feature [SF#526730] - search for messages capability
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff changeset
186 elif type(pattern) is StringType:
1b2d0e702ca8 Added feature [SF#526730] - search for messages capability
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff changeset
187 if fnmatch.fnmatch(name, pattern):
1b2d0e702ca8 Added feature [SF#526730] - search for messages capability
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff changeset
188 files.append(fname)
1b2d0e702ca8 Added feature [SF#526730] - search for messages capability
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff changeset
189
1b2d0e702ca8 Added feature [SF#526730] - search for messages capability
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff changeset
190 for fname in files:
1b2d0e702ca8 Added feature [SF#526730] - search for messages capability
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff changeset
191 apply(func, (fname,)+args)
1b2d0e702ca8 Added feature [SF#526730] - search for messages capability
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff changeset
192 for subdir in subdirs:
1b2d0e702ca8 Added feature [SF#526730] - search for messages capability
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff changeset
193 recurse_files(subdir, pattern, exclusions, func, level=level+1)
1b2d0e702ca8 Added feature [SF#526730] - search for messages capability
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff changeset
194
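# A minimal sketch of the same "recurse and process files" idea in present-day
# Python 3, built on os.walk rather than the manual recursion above. The name
# walk_matching and its return value are assumptions made for illustration;
# they are not part of this module. It is not an exact equivalent: for example,
# symlinked directories are simply not descended into here.

```python
import fnmatch
import os


def walk_matching(top, pattern, exclusions=(), func=print, *args):
    """Call func(path, *args) for every matching file under top.

    pattern is either a glob string or a compiled regular expression.
    Names whose last four characters appear in exclusions are skipped,
    mirroring the name[-4:] test in the listing.
    """
    visited = []
    for dirpath, dirnames, filenames in os.walk(top, followlinks=False):
        # Prune excluded directories in place so os.walk does not descend.
        dirnames[:] = sorted(d for d in dirnames if d[-4:] not in exclusions)
        for name in sorted(filenames):
            if name[-4:] in exclusions:
                continue
            if isinstance(pattern, str):
                matched = fnmatch.fnmatch(name, pattern)
            else:
                # Anything that is not a string is assumed to be a compiled regex.
                matched = pattern.match(name) is not None
            if matched:
                path = os.path.join(dirpath, name)
                func(path, *args)
                visited.append(path)
    return visited
```

# Usage: walk_matching('.', '*.txt', ('.gif', '.jpg')) prints every matching
# path and returns them as a list.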
#-- Data bundle for index dictionaries
class Index:
    def __init__(self, words, files, fileids):
        if words is not None: self.WORDS = words
        if files is not None: self.FILES = files
        if fileids is not None: self.FILEIDS = fileids

#-- "Split plain text into words" utility class
class TextSplitter:
    def initSplitter(self):
        prenum = string.join(map(chr, range(0,48)), '')
        num2cap = string.join(map(chr, range(58,65)), '')
        cap2low = string.join(map(chr, range(91,97)), '')
        postlow = string.join(map(chr, range(123,256)), '')
        nonword = prenum + num2cap + cap2low + postlow
        self.word_only = string.maketrans(nonword, " "*len(nonword))
        # range end is 256 so that chr(255) is also treated as a non-digit
        self.nondigits = string.join(map(chr, range(0,48)) + map(chr, range(58,256)), '')
        self.alpha = string.join(map(chr, range(65,91)) + map(chr, range(97,123)), '')
        self.ident = string.join(map(chr, range(256)), '')
        self.init = 1

    def splitter(self, text, ftype):
        "Split the contents of a text string into a list of 'words'"
        if ftype == 'text/plain':
            words = self.text_splitter(text, self.casesensitive)
        else:
            raise NotImplementedError
        return words

    def text_splitter(self, text, casesensitive=0):
        """Split text/plain string into a list of words

        In version 0.20 this function is still fairly weak at
        identifying "real" words, and excluding gibberish
        strings. As long as the indexer looks at "real" text
        files, it does pretty well; but if indexing of binary
        data is attempted, a lot of gibberish gets indexed.
        Suggestions on improving this are GREATLY APPRECIATED.
        """
        # Initialize some constants
        if not hasattr(self,'init'): self.initSplitter()

        # Speedup trick: attributes into local scope
        word_only = self.word_only
        ident = self.ident
        alpha = self.alpha
        nondigits = self.nondigits
        translate = string.translate

        # Let's adjust case if not case-sensitive
        if not casesensitive: text = string.upper(text)

        # Split the raw text
        allwords = string.split(text)

        # Finally, let's skip some words not worth indexing
        words = []
        for word in allwords:
            if len(word) > 25: continue # too long (probably gibberish)

            # Identify common patterns in non-word data (binary, UU/MIME, etc)
            num_nonalpha = len(word.translate(ident, alpha))
            numdigits = len(word.translate(ident, nondigits))
            # 1.52: num_nonalpha = len(translate(word, ident, alpha))
            # 1.52: numdigits = len(translate(word, ident, nondigits))
            if numdigits > len(word)-2: # almost all digits
                if numdigits > 5: # too many digits is gibberish
                    continue # a moderate number is year/zipcode/etc
            elif num_nonalpha*3 > len(word): # too much scattered nonalpha = gibberish
                continue

            word = word.translate(word_only) # Let's strip funny byte values
            # 1.52: word = translate(word, word_only)
            subwords = word.split() # maybe embedded non-alphanumeric
            # 1.52: subwords = string.split(word)
            for subword in subwords: # ...so we might have subwords
                if len(subword) <= 2: continue # too short a subword
                words.append(subword)
        return words

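# The heuristics in text_splitter can be sketched in present-day Python 3 as
# shown here, assuming ASCII str input. The name split_words and the two
# module-level constants are hypothetical and chosen only for illustration;
# character counts replace the Python 2 translate-table trick, and a regular
# expression split stands in for the word_only translation.

```python
import re

MAX_WORD_LEN = 25     # longer tokens are treated as gibberish
MIN_SUBWORD_LEN = 3   # subwords of one or two characters are dropped


def split_words(text, casesensitive=False):
    """Approximate the listing's text_splitter heuristics for ASCII text."""
    if not casesensitive:
        text = text.upper()
    words = []
    for word in text.split():
        if len(word) > MAX_WORD_LEN:
            continue
        numdigits = sum(c in "0123456789" for c in word)
        num_nonalpha = sum(
            not ("A" <= c <= "Z" or "a" <= c <= "z") for c in word
        )
        if numdigits > len(word) - 2:        # almost all digits
            if numdigits > 5:                # too many digits: gibberish
                continue                     # a few digits: year, zipcode, ...
        elif num_nonalpha * 3 > len(word):   # too much scattered non-alpha
            continue
        # Break the token at every run of non-alphanumeric characters.
        for subword in re.split(r"[^0-9A-Za-z]+", word):
            if len(subword) >= MIN_SUBWORD_LEN:
                words.append(subword)
    return words
```

# Usage: split_words("Hello, world! 1999 1234567 foo_bar") keeps HELLO, WORLD,
# 1999, FOO and BAR, and drops the seven-digit token.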
class ZopeTextSplitter:
    def initSplitter(self):
        import Splitter
        stop_words=(
            'am', 'ii', 'iii', 'per', 'po', 're', 'a', 'about', 'above', 'across',
            'after', 'afterwards', 'again', 'against', 'all', 'almost', 'alone',
            'along', 'already', 'also', 'although', 'always', 'am', 'among',
            'amongst', 'amoungst', 'amount', 'an', 'and', 'another', 'any',
            'anyhow', 'anyone', 'anything', 'anyway', 'anywhere', 'are', 'around',
            'as', 'at', 'back', 'be', 'became', 'because', 'become', 'becomes',
            'becoming', 'been', 'before', 'beforehand', 'behind', 'being',
            'below', 'beside', 'besides', 'between', 'beyond', 'bill', 'both',
            'bottom', 'but', 'by', 'can', 'cannot', 'cant', 'con', 'could',
            'couldnt', 'cry', 'describe', 'detail', 'do', 'done', 'down', 'due',
            'during', 'each', 'eg', 'eight', 'either', 'eleven', 'else',
            'elsewhere', 'empty', 'enough', 'even', 'ever', 'every', 'everyone',
            'everything', 'everywhere', 'except', 'few', 'fifteen', 'fifty',
            'fill', 'find', 'fire', 'first', 'five', 'for', 'former', 'formerly',
            'forty', 'found', 'four', 'from', 'front', 'full', 'further', 'get',
            'give', 'go', 'had', 'has', 'hasnt', 'have', 'he', 'hence', 'her',
            'here', 'hereafter', 'hereby', 'herein', 'hereupon', 'hers',
            'herself', 'him', 'himself', 'his', 'how', 'however', 'hundred', 'i',
            'ie', 'if', 'in', 'inc', 'indeed', 'interest', 'into', 'is', 'it',
            'its', 'itself', 'keep', 'last', 'latter', 'latterly', 'least',
            'less', 'made', 'many', 'may', 'me', 'meanwhile', 'might', 'mill',
            'mine', 'more', 'moreover', 'most', 'mostly', 'move', 'much', 'must',
            'my', 'myself', 'name', 'namely', 'neither', 'never', 'nevertheless',
            'next', 'nine', 'no', 'nobody', 'none', 'noone', 'nor', 'not',
            'nothing', 'now', 'nowhere', 'of', 'off', 'often', 'on', 'once',
            'one', 'only', 'onto', 'or', 'other', 'others', 'otherwise', 'our',
            'ours', 'ourselves', 'out', 'over', 'own', 'per', 'perhaps',
            'please', 'pre', 'put', 'rather', 're', 'same', 'see', 'seem',
            'seemed', 'seeming', 'seems', 'serious', 'several', 'she', 'should',
            'show', 'side', 'since', 'sincere', 'six', 'sixty', 'so', 'some',
            'somehow', 'someone', 'something', 'sometime', 'sometimes',
            'somewhere', 'still', 'such', 'take', 'ten', 'than', 'that', 'the',
            'their', 'them', 'themselves', 'then', 'thence', 'there',
            'thereafter', 'thereby', 'therefore', 'therein', 'thereupon', 'these',
            'they', 'thick', 'thin', 'third', 'this', 'those', 'though', 'three',
            'through', 'throughout', 'thru', 'thus', 'to', 'together', 'too',
            'toward', 'towards', 'twelve', 'twenty', 'two', 'un', 'under',
            'until', 'up', 'upon', 'us', 'very', 'via', 'was', 'we', 'well',
            'were', 'what', 'whatever', 'when', 'whence', 'whenever', 'where',
            'whereafter', 'whereas', 'whereby', 'wherein', 'whereupon',
            'wherever', 'whether', 'which', 'while', 'whither', 'who', 'whoever',
            'whole', 'whom', 'whose', 'why', 'will', 'with', 'within', 'without',
            'would', 'yet', 'you', 'your', 'yours', 'yourself', 'yourselves',
        )
        self.stop_word_dict={}
        for word in stop_words: self.stop_word_dict[word]=None
        self.splitterobj = Splitter.getSplitter()
        self.init = 1

    def goodword(self, word):
        return len(word) < 25

    def splitter(self, text, ftype):
        """never case-sensitive"""
        if not hasattr(self,'init'): self.initSplitter()
        return filter(self.goodword, self.splitterobj(text, self.stop_word_dict))


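# A rough Python 3 sketch of the stop-word idea used by ZopeTextSplitter, for
# readers without Zope's Splitter extension. The regular-expression tokenizer
# here is a stand-in and an assumption; it does not reproduce how the Zope
# Splitter tokenizes text. STOP_WORDS is a tiny illustrative subset of the
# tuple in the listing, and split_without_stop_words is a hypothetical name.

```python
import re

# Illustrative subset only; the listing above carries the full tuple.
STOP_WORDS = frozenset(["a", "about", "and", "of", "the", "to"])


def split_without_stop_words(text, stop_words=STOP_WORDS, max_len=25):
    """Lower-case the text, tokenize it, and drop stop words and long tokens."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    # Same length rule as goodword(): keep only tokens shorter than max_len.
    return [t for t in tokens if t not in stop_words and len(t) < max_len]
```

# Usage: split_without_stop_words("The Index of the Tracker") keeps only
# "index" and "tracker". A frozenset gives the same constant-time membership
# test that the listing gets from a dictionary with None values.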
#-- "Abstract" parent class for inherited indexers
#   (does not handle storage in parent, other methods are primitive)

class GenericIndexer:
    def __init__(self, **kw):
        apply(self.configure, (), kw)

    def whoami(self):
        return self.__class__.__name__

    def configure(self, REINDEX=0, CASESENSITIVE=0,
                  INDEXDB=os.environ.get('INDEXER_DB', 'TEMP_NDX.DB'),
                  ADD_PATTERN='*', QUIET=5):
        "Configure settings used by indexing and storage/retrieval"
1b2d0e702ca8 Added feature [SF#526730] - search for messages capability
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff changeset
351 self.indexdb = INDEXDB
1b2d0e702ca8 Added feature [SF#526730] - search for messages capability
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff changeset
352 self.reindex = REINDEX
1b2d0e702ca8 Added feature [SF#526730] - search for messages capability
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff changeset
353 self.casesensitive = CASESENSITIVE
1b2d0e702ca8 Added feature [SF#526730] - search for messages capability
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff changeset
354 self.add_pattern = ADD_PATTERN
1b2d0e702ca8 Added feature [SF#526730] - search for messages capability
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff changeset
355 self.quiet = QUIET
1b2d0e702ca8 Added feature [SF#526730] - search for messages capability
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff changeset
356 self.filter = None
1b2d0e702ca8 Added feature [SF#526730] - search for messages capability
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff changeset
357
    def add_files(self, dir=os.getcwd(), pattern=None, descend=1):
        self.load_index()
        exclusions = ('.zip','.pyc','.gif','.jpg','.dat','.dir')
        if not pattern:
            pattern = self.add_pattern
        recurse_files(dir, pattern, exclusions, self.add_file)
        # Rebuild the fileid index
        self.fileids = {}
        for fname in self.files.keys():
            fileid = self.files[fname][0]
            self.fileids[fileid] = fname

    def add_file(self, fname, ftype='text/plain'):
        "Index the contents of a regular file"
        if self.files.has_key(fname):   # Is file eligible for (re)indexing?
            if self.reindex:            # Reindexing enabled, cleanup dicts
                self.purge_entry(fname, self.files, self.words)
            else:                       # DO NOT reindex this file
                if self.quiet < 5: print "Skipping", fname
                return 0

        # Read in the file (if possible)
        try:
            if fname[-3:] == '.gz':
                text = gzip.open(fname).read()
            else:
                text = open(fname).read()
            if self.quiet < 5: print "Indexing", fname
        except IOError:
            return 0
        words = self.splitter(text, ftype)

        # Find new file index, and assign it to filename
        # (_TOP uses trick of negative to avoid conflict with file index)
        self.files['_TOP'] = (self.files['_TOP'][0]-1, None)
        file_index = abs(self.files['_TOP'][0])
        self.files[fname] = (file_index, len(words))

        filedict = {}
        for word in words:
            if filedict.has_key(word):
                filedict[word] = filedict[word]+1
            else:
                filedict[word] = 1

        for word in filedict.keys():
            if self.words.has_key(word):
                entry = self.words[word]
            else:
                entry = {}
            entry[file_index] = filedict[word]
            self.words[word] = entry

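The bookkeeping in add_file amounts to building an inverted index: each word maps to a dictionary of file ids and occurrence counts. A minimal standalone sketch of that idea follows, written for current Python 3 rather than the Python 2 dialect of the listing. The function name `index_text`, the whitespace tokenizer, the upper-casing and the sequential id scheme are illustrative assumptions, not part of indexer.py.

```python
from collections import Counter

def index_text(words_index, files_index, name, text):
    """Add one document to a word -> {file_id: count} inverted index."""
    # Assumption: plain whitespace splitting stands in for the real splitter.
    words = text.upper().split()
    # Assumption: ids are handed out sequentially, starting at 1.
    file_id = len(files_index) + 1
    files_index[name] = (file_id, len(words))
    for word, count in Counter(words).items():
        words_index.setdefault(word, {})[file_id] = count
    return file_id

words_index, files_index = {}, {}
index_text(words_index, files_index, "a.txt", "spam eggs spam")
index_text(words_index, files_index, "b.txt", "eggs ham")
print(words_index["SPAM"])   # {1: 2}
print(words_index["EGGS"])   # {1: 1, 2: 1}
```

Storing the per-file count alongside the id is what later allows a report to show how often each query word occurs in a matching file.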
    def add_othertext(self, identifier):
        """Index a textual source other than a plain file

        A child class might want to implement this method (or a similar one)
        in order to index textual sources such as SQL tables, URLs, clay
        tablets, or whatever else. The identifier should uniquely pick out
        the source of the text (whatever it is)
        """
        raise NotImplementedError

    def save_index(self, INDEXDB=None):
        raise NotImplementedError

    def load_index(self, INDEXDB=None, reload=0, wordlist=None):
        raise NotImplementedError

    def find(self, wordlist, print_report=0):
        "Locate files that match ALL the words in wordlist"
        self.load_index(wordlist=wordlist)
        entries = {}
        hits = copy.copy(self.fileids)      # Copy of fileids index
        for word in wordlist:
            if not self.casesensitive:
                word = string.upper(word)
            entry = self.words.get(word)    # For each word, get index
            entries[word] = entry           #   of matching files
            if not entry:                   # Nothing for this one word (fail)
                return 0
            for fileid in hits.keys():      # Eliminate hits for every non-match
                if not entry.has_key(fileid):
                    del hits[fileid]
        if print_report:
            self.print_report(hits, wordlist, entries)
        return hits

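find narrows a copy of the file-id table word by word, so only files containing every query word survive. The same logic can be sketched with set intersection. This is a hedged illustration in Python 3; `find_all` and the sample data are hypothetical, not part of the module. Unlike the listing, which returns 0 when a word is unknown, the sketch returns an empty dict.

```python
def find_all(words_index, fileids, wordlist):
    """Return {file_id: name} for files containing every word in wordlist."""
    hits = set(fileids)
    for word in wordlist:
        entry = words_index.get(word.upper())
        if not entry:            # unknown word: no file can match them all
            return {}
        hits &= set(entry)       # keep only files that also hold this word
    return {fid: fileids[fid] for fid in sorted(hits)}

# Sample data (assumed): word -> {file_id: occurrences}, file_id -> name
words_index = {"SPAM": {1: 2}, "EGGS": {1: 1, 2: 1}, "HAM": {2: 1}}
fileids = {1: "a.txt", 2: "b.txt"}

print(find_all(words_index, fileids, ["eggs"]))          # {1: 'a.txt', 2: 'b.txt'}
print(find_all(words_index, fileids, ["eggs", "ham"]))   # {2: 'b.txt'}
print(find_all(words_index, fileids, ["spam", "ham"]))   # {}
```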
    def print_report(self, hits={}, wordlist=[], entries={}):
        # Figure out what to actually print (based on QUIET level)
        output = []
        for fileid,fname in hits.items():
            message = fname
            if self.quiet <= 3:
                wordcount = self.files[fname][1]
                matches = 0
                countmess = '\n'+' '*13+`wordcount`+' words; '
                for word in wordlist:
                    if not self.casesensitive:
                        word = string.upper(word)
                    occurs = entries[word][fileid]
                    matches = matches+occurs
                    countmess = countmess +`occurs`+' '+word+'; '
                message = string.ljust('[RATING: '
                                       +`1000*matches/wordcount`+']',13)+message
                if self.quiet <= 2: message = message +countmess +'\n'
            if self.filter:     # Using an output filter
                if fnmatch.fnmatch(message, self.filter):
                    output.append(message)
            else:
                output.append(message)

        if self.quiet <= 5:
            print string.join(output,'\n')
        sys.stderr.write('\n'+`len(output)`+' files matched wordlist: '+
                         `wordlist`+'\n')
        return output

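The rating printed by print_report is the number of query-word occurrences per thousand words of the file. As a rough Python 3 sketch (the helper name `rating` is an assumption; `//` reproduces the integer division that `/` performs on ints in the Python 2 listing):

```python
def rating(occurrences, wordcount):
    """Query-word hits per 1000 words of the file, truncated to an integer."""
    return 1000 * sum(occurrences) // wordcount

# Two query words occurring 2 and 1 times in a 300-word file
print(rating([2, 1], 300))   # 10
```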
    def purge_entry(self, fname, file_dct, word_dct):
        "Remove a file from file index and word index"
        try:        # The easy part, cleanup the file index
            # file_dct maps fname -> (file_index, wordcount); keep only the id
            file_index = file_dct[fname][0]
            del file_dct[fname]
        except KeyError:
            return  # Unknown file: nothing to purge from the word index
        # The much harder part, cleanup the word index
        for word, occurs in word_dct.items():
            if occurs.has_key(file_index):
                del occurs[file_index]
                word_dct[word] = occurs

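Purging a file means dropping its row from the file table and then removing its id from every word's posting dictionary. A small Python 3 sketch of that two-step cleanup follows, under the assumption (taken from add_file) that the file table stores (file_id, wordcount) tuples. `purge` is a hypothetical helper, not a function of the module.

```python
def purge(fname, files, words):
    """Remove fname from the file table and from every word's postings."""
    try:
        file_id = files.pop(fname)[0]   # entries are (file_id, wordcount)
    except KeyError:
        return False                    # file was never indexed
    for postings in words.values():
        postings.pop(file_id, None)     # no-op when the word is not in this file
    return True

files = {"a.txt": (1, 3), "b.txt": (2, 2)}
words = {"SPAM": {1: 2}, "EGGS": {1: 1, 2: 1}, "HAM": {2: 1}}
purge("a.txt", files, words)
print(files)   # {'b.txt': (2, 2)}
print(words)   # {'SPAM': {}, 'EGGS': {2: 1}, 'HAM': {2: 1}}
```

The word pass touches every entry in the index, which is why the listing calls it the much harder part.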
    def index_loaded(self):
        return ( hasattr(self,'fileids') and
                 hasattr(self,'files')   and
                 hasattr(self,'words')   )

#-- Provide an actual storage facility for the indexes (i.e. shelve)
class ShelveIndexer(GenericIndexer, TextSplitter):
    """Concrete Indexer utilizing [shelve] for storage

    Unfortunately, [shelve] proves far too slow in indexing, while
    creating monstrously large indexes. Not recommended, at least under
    the default dbm's tested. Also, class may be broken because
    shelves do not, apparently, support the .values() and .items()
    methods. Fixing this is a low priority, but the sample code is
    left here.
    """
    def load_index(self, INDEXDB=None, reload=0, wordlist=None):
        INDEXDB = INDEXDB or self.indexdb
        import shelve
        self.words = shelve.open(INDEXDB+".WORDS")
        self.files = shelve.open(INDEXDB+".FILES")
        self.fileids = shelve.open(INDEXDB+".FILEIDS")
        if not self.files:      # New index
            self.files['_TOP'] = (0,None)

    def save_index(self, INDEXDB=None):
        INDEXDB = INDEXDB or self.indexdb
        pass

class FlatIndexer(GenericIndexer, TextSplitter):
    """Concrete Indexer utilizing flat-file for storage

    See the comments in the referenced article for details; in
    brief, this indexer has about the same timing as the best in
    -creating- indexes and the storage requirements are
    reasonable. However, actually -using- a flat-file index is
    more than an order of magnitude worse than the best indexer
    (ZPickleIndexer wins overall).

    On the other hand, FlatIndexer creates a wonderfully easy to
    parse database format if you have a reason to transport the
    index to a different platform or programming language. And
    should you perform indexing as part of a long-running
    process, the overhead of initial file parsing becomes
    irrelevant.
    """
    def load_index(self, INDEXDB=None, reload=0, wordlist=None):
        # Unless reload is indicated, do not load twice
        if self.index_loaded() and not reload: return 0
        # Ok, now let's actually load it
        INDEXDB = INDEXDB or self.indexdb
        self.words = {}
        self.files = {'_TOP':(0,None)}
        self.fileids = {}
        try:  # Read index contents
            for line in open(INDEXDB).readlines():
                fields = string.split(line)
                if fields[0] == '-':  # Read a file/fileid line
                    fileid = eval(fields[2])
                    wordcount = eval(fields[3])
                    fname = fields[1]
                    self.files[fname] = (fileid, wordcount)
                    self.fileids[fileid] = fname
                else:  # Read a word entry (dict of hits)
                    entries = {}
                    word = fields[0]
                    for n in range(1,len(fields),2):
                        fileid = eval(fields[n])
                        occurs = eval(fields[n+1])
                        entries[fileid] = occurs
                    self.words[word] = entries
        except:
            pass  # New index

    def save_index(self, INDEXDB=None):
        INDEXDB = INDEXDB or self.indexdb
        tab, lf, sp = '\t','\n',' '
        indexdb = open(INDEXDB,'w')
        for fname,entry in self.files.items():
            indexdb.write('- '+fname +tab +`entry[0]` +tab +`entry[1]` +lf)
        for word,entry in self.words.items():
            indexdb.write(word +tab+tab)
            for fileid,occurs in entry.items():
                indexdb.write(`fileid` +sp +`occurs` +sp)
            indexdb.write(lf)

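The flat-file layout that FlatIndexer reads and writes can be illustrated with a small round trip. This is a minimal sketch in Python 3, not the module's code: the helper names `dump_flat_index` and `parse_flat_index` are hypothetical, `int()` replaces the `eval()` calls, and keeping the `'_TOP'` bookkeeping entry out of the fileid map is an assumption of this sketch.

```python
import io

def dump_flat_index(files, words, out):
    """Write file records ("- name id count") and word records to out."""
    for fname, (fileid, wordcount) in files.items():
        out.write("- %s\t%r\t%r\n" % (fname, fileid, wordcount))
    for word, entry in words.items():
        out.write(word + "\t\t")
        for fileid, occurs in entry.items():
            out.write("%r %r " % (fileid, occurs))
        out.write("\n")

def parse_flat_index(lines):
    """Rebuild (files, fileids, words) from flat-index lines."""
    files, fileids, words = {}, {}, {}
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if fields[0] == "-":  # file record: "- fname fileid wordcount"
            fname = fields[1]
            fileid = int(fields[2])
            wordcount = None if fields[3] == "None" else int(fields[3])
            files[fname] = (fileid, wordcount)
            if fname != "_TOP":  # assumption: '_TOP' is not a real file
                fileids[fileid] = fname
        else:  # word record: "word fileid occurs fileid occurs ..."
            pairs = fields[1:]
            words[fields[0]] = {int(pairs[i]): int(pairs[i + 1])
                                for i in range(0, len(pairs), 2)}
    return files, fileids, words

# Illustrative data only
files = {"_TOP": (2, None), "a.txt": (1, 3), "b.txt": (2, 5)}
words = {"spam": {1: 2, 2: 1}, "eggs": {2: 4}}
buf = io.StringIO()
dump_flat_index(files, words, buf)
parsed_files, parsed_fileids, parsed_words = parse_flat_index(
    buf.getvalue().splitlines())
```

Because every field is separated by whitespace, a filename containing a space would not survive this round trip; the same limitation applies to any parser that splits lines on whitespace.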
class PickleIndexer(GenericIndexer, TextSplitter):
    def load_index(self, INDEXDB=None, reload=0, wordlist=None):
        # Unless reload is indicated, do not load twice
        if self.index_loaded() and not reload: return 0
        # Ok, now let's actually load it
        import cPickle
        INDEXDB = INDEXDB or self.indexdb
        try:
            pickle_str = open(INDEXDB,'rb').read()
            db = cPickle.loads(pickle_str)
        except:  # New index
            db = Index({}, {'_TOP':(0,None)}, {})
        self.words, self.files, self.fileids = db.WORDS, db.FILES, db.FILEIDS

    def save_index(self, INDEXDB=None):
        import cPickle
        INDEXDB = INDEXDB or self.indexdb
        db = Index(self.words, self.files, self.fileids)
        open(INDEXDB,'wb').write(cPickle.dumps(db, 1))

class XMLPickleIndexer(PickleIndexer):
    """Concrete Indexer utilizing XML for storage

    While this is, as expected, a verbose format, the possibility
    of using XML as a transport format for indexes might be
    useful. However, [xml_pickle] is in need of some redesign to
    avoid gross inefficiency when creating very large
    (multi-megabyte) output files (fixed in [xml_pickle] version
    0.48 or above)
    """
    def load_index(self, INDEXDB=None, reload=0, wordlist=None):
        # Unless reload is indicated, do not load twice
        if self.index_loaded() and not reload: return 0
        # Ok, now let's actually load it
        from gnosis.xml.pickle import XML_Pickler
        INDEXDB = INDEXDB or self.indexdb
        try:  # XML file exists
            xml_str = open(INDEXDB).read()
            db = XML_Pickler().loads(xml_str)
        except:  # New index
            db = Index({}, {'_TOP':(0,None)}, {})
        self.words, self.files, self.fileids = db.WORDS, db.FILES, db.FILEIDS

    def save_index(self, INDEXDB=None):
        from gnosis.xml.pickle import XML_Pickler
        INDEXDB = INDEXDB or self.indexdb
        db = Index(self.words, self.files, self.fileids)
        open(INDEXDB,'w').write(XML_Pickler(db).dumps())

class ZPickleIndexer(PickleIndexer):
    def load_index(self, INDEXDB=None, reload=0, wordlist=None):
        # Unless reload is indicated, do not load twice
        if self.index_loaded() and not reload: return 0
        # Ok, now let's actually load it
        import cPickle, zlib
        INDEXDB = INDEXDB or self.indexdb
        try:
            pickle_str = zlib.decompress(open(INDEXDB+'!','rb').read())
            db = cPickle.loads(pickle_str)
        except:  # New index
            db = Index({}, {'_TOP':(0,None)}, {})
        self.words, self.files, self.fileids = db.WORDS, db.FILES, db.FILEIDS

    def save_index(self, INDEXDB=None):
        import cPickle, zlib
        INDEXDB = INDEXDB or self.indexdb
        db = Index(self.words, self.files, self.fileids)
        pickle_fh = open(INDEXDB+'!','wb')
        pickle_fh.write(zlib.compress(cPickle.dumps(db, 1)))


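The compress-then-pickle storage used by ZPickleIndexer can be tried in isolation. The following is a minimal Python 3 sketch under stated assumptions, not the module's code: `save_compressed` and `load_compressed` are hypothetical helpers, a plain dict stands in for the `Index` object, and only a missing file (rather than any exception) is treated as "new index". Files are closed explicitly with `with` blocks.

```python
import os
import pickle
import tempfile
import zlib

def save_compressed(obj, path):
    """Pickle obj, compress the resulting bytes, and write them to path."""
    with open(path, "wb") as fh:
        fh.write(zlib.compress(pickle.dumps(obj, protocol=2)))

def load_compressed(path, default):
    """Return the object stored at path, or default if the file is missing."""
    try:
        with open(path, "rb") as fh:
            return pickle.loads(zlib.decompress(fh.read()))
    except FileNotFoundError:
        return default

# Illustrative data only; a dict stands in for the Index object
index = {"words": {"spam": {1: 2}},
         "files": {"_TOP": (1, None), "a.txt": (1, 3)}}
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "indexdb!")  # same '!' suffix as above
    missing = load_compressed(path, default={})
    save_compressed(index, path)
    restored = load_compressed(path, default={})
```

Unpickling executes whatever the pickle stream describes, so an index file should only be loaded from a trusted location.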
class SlicedZPickleIndexer(ZPickleIndexer):
    segments = "ABCDEFGHIJKLMNOPQRSTUVWXYZ#-!"
    def load_index(self, INDEXDB=None, reload=0, wordlist=None):
        # Unless reload is indicated, do not load twice
        if self.index_loaded() and not reload: return 0
        # Ok, now let's actually load it
        import cPickle, zlib
        INDEXDB = INDEXDB or self.indexdb
        db = Index({}, {'_TOP':(0,None)}, {})
        # Identify the relevant word-dictionary segments
        if not wordlist:
            segments = self.segments
        else:
            segments = ['-','#']
            for word in wordlist:
                segments.append(string.upper(word[0]))
        # Load the segments
        for segment in segments:
            try:
                pickle_str = zlib.decompress(open(INDEXDB+segment,'rb').read())
                dbslice = cPickle.loads(pickle_str)
                if dbslice.__dict__.get('WORDS'):  # If it has some words, add them
                    for word,entry in dbslice.WORDS.items():
                        db.WORDS[word] = entry
                if dbslice.__dict__.get('FILES'):  # If it has some files, add them
                    db.FILES = dbslice.FILES
                if dbslice.__dict__.get('FILEIDS'):  # If it has fileids, add them
                    db.FILEIDS = dbslice.FILEIDS
            except:
                pass  # No biggie, couldn't find this segment
        self.words, self.files, self.fileids = db.WORDS, db.FILES, db.FILEIDS

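The segment-selection step in load_index decides which slice files to read for a given query. A minimal Python 3 sketch of that step follows; `segments_for` is a hypothetical helper, and unlike the loop above it skips initials that are already listed, so no slice is read twice.

```python
import string

# Same characters as the class attribute: 26 letters plus '#', '-' and '!'
SEGMENTS = string.ascii_uppercase + "#-!"

def segments_for(wordlist=None):
    """Return the segment suffixes to read for a query."""
    if not wordlist:
        return list(SEGMENTS)  # no query words: load every slice
    wanted = ["-", "#"]  # always read the '-' and '#' slices
    for word in wordlist:
        initial = word[0].upper()
        if initial not in wanted:  # de-duplication is an addition of this sketch
            wanted.append(initial)
    return wanted

all_segments = segments_for()
query_segments = segments_for(["spam", "Sausage", "eggs"])
```

For a three-word query this reads at most five slice files rather than all twenty-nine, which is the point of slicing the index by initial letter.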
    def julienne(self, INDEXDB=None):
        import cPickle, zlib
        INDEXDB = INDEXDB or self.indexdb
        segments = self.segments  # all the (little) indexes
        for segment in segments:
            try:  # brutal space saver... delete all the small segments
                os.remove(INDEXDB+segment)
            except OSError:
                pass  # probably just nonexistent segment index file
        # First write the much simpler filename/fileid dictionaries
        dbfil = Index(None, self.files, self.fileids)
        open(INDEXDB+'-','wb').write(zlib.compress(cPickle.dumps(dbfil,1)))
        # The hard part is splitting the word dictionary up, of course
        letters = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
        segdicts = {}  # Need batch of empty dicts
        for segment in letters+'#':
            segdicts[segment] = {}
        for word, entry in self.words.items():  # Split into segment dicts
            initchar = string.upper(word[0])
            if initchar in letters:
                segdicts[initchar][word] = entry
            else:
1b2d0e702ca8 Added feature [SF#526730] - search for messages capability
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff changeset
700 segdicts['#'][word] = entry
1b2d0e702ca8 Added feature [SF#526730] - search for messages capability
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff changeset
701 for initchar in letters+'#':
1b2d0e702ca8 Added feature [SF#526730] - search for messages capability
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff changeset
702 db = Index(segdicts[initchar], None, None)
1b2d0e702ca8 Added feature [SF#526730] - search for messages capability
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff changeset
703 pickle_str = cPickle.dumps(db, 1)
1b2d0e702ca8 Added feature [SF#526730] - search for messages capability
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff changeset
704 filename = INDEXDB+initchar
1b2d0e702ca8 Added feature [SF#526730] - search for messages capability
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff changeset
705 pickle_fh = open(filename,'wb')
1b2d0e702ca8 Added feature [SF#526730] - search for messages capability
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff changeset
706 pickle_fh.write(zlib.compress(pickle_str))
1b2d0e702ca8 Added feature [SF#526730] - search for messages capability
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff changeset
707 os.chmod(filename,0664)
1b2d0e702ca8 Added feature [SF#526730] - search for messages capability
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff changeset
708
1b2d0e702ca8 Added feature [SF#526730] - search for messages capability
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff changeset
709 save_index = julienne
1b2d0e702ca8 Added feature [SF#526730] - search for messages capability
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff changeset
710
1b2d0e702ca8 Added feature [SF#526730] - search for messages capability
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff changeset
711 PreferredIndexer = SlicedZPickleIndexer
1b2d0e702ca8 Added feature [SF#526730] - search for messages capability
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff changeset
712
1b2d0e702ca8 Added feature [SF#526730] - search for messages capability
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff changeset
713 #-- If called from command-line, parse arguments and take actions
1b2d0e702ca8 Added feature [SF#526730] - search for messages capability
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff changeset
714 if __name__ == '__main__':
1b2d0e702ca8 Added feature [SF#526730] - search for messages capability
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff changeset
715 import time
1b2d0e702ca8 Added feature [SF#526730] - search for messages capability
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff changeset
716 start = time.time()
1b2d0e702ca8 Added feature [SF#526730] - search for messages capability
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff changeset
717 search_words = [] # Word search list (if specified)
1b2d0e702ca8 Added feature [SF#526730] - search for messages capability
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff changeset
718 opts = 0 # Any options specified?
1b2d0e702ca8 Added feature [SF#526730] - search for messages capability
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff changeset
719 if len(sys.argv) < 2:
1b2d0e702ca8 Added feature [SF#526730] - search for messages capability
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff changeset
720 pass # No options given
1b2d0e702ca8 Added feature [SF#526730] - search for messages capability
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff changeset
721 else:
1b2d0e702ca8 Added feature [SF#526730] - search for messages capability
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff changeset
722 upper = string.upper
1b2d0e702ca8 Added feature [SF#526730] - search for messages capability
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff changeset
723 dir = os.getcwd() # Default to indexing from current directory
1b2d0e702ca8 Added feature [SF#526730] - search for messages capability
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff changeset
724 descend = 1 # Default to recursive indexing
1b2d0e702ca8 Added feature [SF#526730] - search for messages capability
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff changeset
725 ndx = PreferredIndexer()
1b2d0e702ca8 Added feature [SF#526730] - search for messages capability
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff changeset
726 for opt in sys.argv[1:]:
1b2d0e702ca8 Added feature [SF#526730] - search for messages capability
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff changeset
727 if opt in ('-h','/h','-?','/?','?','--help'): # help screen
1b2d0e702ca8 Added feature [SF#526730] - search for messages capability
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff changeset
728 print __shell_usage__
1b2d0e702ca8 Added feature [SF#526730] - search for messages capability
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff changeset
729 opts = -1
1b2d0e702ca8 Added feature [SF#526730] - search for messages capability
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff changeset
730 break
1b2d0e702ca8 Added feature [SF#526730] - search for messages capability
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff changeset
731 elif opt[0] in '/-': # a switch!
1b2d0e702ca8 Added feature [SF#526730] - search for messages capability
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff changeset
732 opts = opts+1
1b2d0e702ca8 Added feature [SF#526730] - search for messages capability
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff changeset
733 if upper(opt[1:]) == 'INDEX': # Index files
1b2d0e702ca8 Added feature [SF#526730] - search for messages capability
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff changeset
734 ndx.quiet = 0
1b2d0e702ca8 Added feature [SF#526730] - search for messages capability
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff changeset
735 pass # Use defaults if no other options
1b2d0e702ca8 Added feature [SF#526730] - search for messages capability
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff changeset
736 elif upper(opt[1:]) == 'REINDEX': # Reindex
1b2d0e702ca8 Added feature [SF#526730] - search for messages capability
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff changeset
737 ndx.reindex = 1
1b2d0e702ca8 Added feature [SF#526730] - search for messages capability
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff changeset
738 elif upper(opt[1:]) == 'CASESENSITIVE': # Case sensitive
1b2d0e702ca8 Added feature [SF#526730] - search for messages capability
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff changeset
739 ndx.casesensitive = 1
1b2d0e702ca8 Added feature [SF#526730] - search for messages capability
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff changeset
740 elif upper(opt[1:]) in ('NORECURSE','LOCAL'): # No recursion
1b2d0e702ca8 Added feature [SF#526730] - search for messages capability
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff changeset
741 descend = 0
1b2d0e702ca8 Added feature [SF#526730] - search for messages capability
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff changeset
742 elif upper(opt[1:4]) == 'DIR': # Dir to index
1b2d0e702ca8 Added feature [SF#526730] - search for messages capability
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff changeset
743 dir = opt[5:]
1b2d0e702ca8 Added feature [SF#526730] - search for messages capability
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff changeset
744 elif upper(opt[1:8]) == 'INDEXDB': # Index specified
1b2d0e702ca8 Added feature [SF#526730] - search for messages capability
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff changeset
745 ndx.indexdb = opt[9:]
1b2d0e702ca8 Added feature [SF#526730] - search for messages capability
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff changeset
746 sys.stderr.write(
1b2d0e702ca8 Added feature [SF#526730] - search for messages capability
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff changeset
747 "Use of INDEXER_DB environment variable is STRONGLY recommended.\n")
1b2d0e702ca8 Added feature [SF#526730] - search for messages capability
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff changeset
748 elif upper(opt[1:6]) == 'REGEX': # RegEx files to index
1b2d0e702ca8 Added feature [SF#526730] - search for messages capability
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff changeset
749 ndx.add_pattern = re.compile(opt[7:])
1b2d0e702ca8 Added feature [SF#526730] - search for messages capability
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff changeset
750 elif upper(opt[1:5]) == 'GLOB': # Glob files to index
1b2d0e702ca8 Added feature [SF#526730] - search for messages capability
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff changeset
751 ndx.add_pattern = opt[6:]
1b2d0e702ca8 Added feature [SF#526730] - search for messages capability
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff changeset
752 elif upper(opt[1:7]) in ('OUTPUT','FORMAT'): # How should results look?
1b2d0e702ca8 Added feature [SF#526730] - search for messages capability
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff changeset
753 opts = opts-1 # this is not an option for indexing purposes
1b2d0e702ca8 Added feature [SF#526730] - search for messages capability
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff changeset
754 level = upper(opt[8:])
1b2d0e702ca8 Added feature [SF#526730] - search for messages capability
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff changeset
755 if level in ('ALL','EVERYTHING','VERBOSE', 'MAX'):
1b2d0e702ca8 Added feature [SF#526730] - search for messages capability
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff changeset
756 ndx.quiet = 0
1b2d0e702ca8 Added feature [SF#526730] - search for messages capability
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff changeset
757 elif level in ('RATINGS','SCORES','HIGH'):
1b2d0e702ca8 Added feature [SF#526730] - search for messages capability
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff changeset
758 ndx.quiet = 3
1b2d0e702ca8 Added feature [SF#526730] - search for messages capability
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff changeset
759 elif level in ('FILENAMES','NAMES','FILES','MID'):
1b2d0e702ca8 Added feature [SF#526730] - search for messages capability
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff changeset
760 ndx.quiet = 5
1b2d0e702ca8 Added feature [SF#526730] - search for messages capability
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff changeset
761 elif level in ('SUMMARY','MIN'):
1b2d0e702ca8 Added feature [SF#526730] - search for messages capability
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff changeset
762 ndx.quiet = 9
1b2d0e702ca8 Added feature [SF#526730] - search for messages capability
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff changeset
763 elif upper(opt[1:7]) == 'FILTER': # Regex filter output
1b2d0e702ca8 Added feature [SF#526730] - search for messages capability
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff changeset
764 opts = opts-1 # this is not an option for indexing purposes
1b2d0e702ca8 Added feature [SF#526730] - search for messages capability
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff changeset
765 ndx.filter = opt[8:]
1b2d0e702ca8 Added feature [SF#526730] - search for messages capability
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff changeset
766 elif len(opt) == 2 and opt[1] in string.digits: # single-digit quiet level
1b2d0e702ca8 Added feature [SF#526730] - search for messages capability
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff changeset
767 opts = opts-1
1b2d0e702ca8 Added feature [SF#526730] - search for messages capability
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff changeset
768 ndx.quiet = int(opt[1])
1b2d0e702ca8 Added feature [SF#526730] - search for messages capability
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff changeset
769 else:
1b2d0e702ca8 Added feature [SF#526730] - search for messages capability
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff changeset
770 search_words.append(opt) # Search words
1b2d0e702ca8 Added feature [SF#526730] - search for messages capability
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff changeset
771
1b2d0e702ca8 Added feature [SF#526730] - search for messages capability
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff changeset
772 if opts > 0:
1b2d0e702ca8 Added feature [SF#526730] - search for messages capability
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff changeset
773 ndx.add_files(dir=dir)
1b2d0e702ca8 Added feature [SF#526730] - search for messages capability
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff changeset
774 ndx.save_index()
1b2d0e702ca8 Added feature [SF#526730] - search for messages capability
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff changeset
775 if search_words:
1b2d0e702ca8 Added feature [SF#526730] - search for messages capability
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff changeset
776 ndx.find(search_words, print_report=1)
1b2d0e702ca8 Added feature [SF#526730] - search for messages capability
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff changeset
777 if not opts and not search_words:
1b2d0e702ca8 Added feature [SF#526730] - search for messages capability
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff changeset
778 sys.stderr.write("Perhaps you would like to use the --help option?\n")
1b2d0e702ca8 Added feature [SF#526730] - search for messages capability
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff changeset
779 else:
1b2d0e702ca8 Added feature [SF#526730] - search for messages capability
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff changeset
780 sys.stderr.write('Processed in %.3f seconds (%s)\n'
1b2d0e702ca8 Added feature [SF#526730] - search for messages capability
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff changeset
781 % (time.time()-start, ndx.whoami()))
1b2d0e702ca8 Added feature [SF#526730] - search for messages capability
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff changeset
782
682
b4d13f7cc6c4 Oops. Forgot to include cvs keywords in file.
Roche Compaan <rochecompaan@users.sourceforge.net>
parents: 681
diff changeset
783 #
b4d13f7cc6c4 Oops. Forgot to include cvs keywords in file.
Roche Compaan <rochecompaan@users.sourceforge.net>
parents: 681
diff changeset
784 #$Log: not supported by cvs2svn $
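
The julienne method above splits the word dictionary by initial letter and writes each slice as a zlib-compressed pickle. The following is a minimal standalone sketch of that idea, not the module's own code: it assumes Python 3 (so pickle stands in for cPickle), stores plain dicts instead of Index objects, and the names split_by_initial, save_segments and load_segment are hypothetical.

```python
import os
import pickle
import string
import tempfile
import zlib

LETTERS = string.ascii_uppercase


def split_by_initial(words):
    """Group a word -> entry mapping into one dict per initial letter."""
    segdicts = {segment: {} for segment in LETTERS + '#'}
    for word, entry in words.items():
        initchar = word[:1].upper()
        # Anything that does not start with A-Z (digits, punctuation,
        # non-ASCII letters, the empty string) lands in the '#' segment.
        if len(initchar) == 1 and initchar in LETTERS:
            segdicts[initchar][word] = entry
        else:
            segdicts['#'][word] = entry
    return segdicts


def save_segments(prefix, words):
    """Write one zlib-compressed pickle per segment; return the paths."""
    paths = []
    for segment, segdict in split_by_initial(words).items():
        path = prefix + segment
        with open(path, 'wb') as fh:
            fh.write(zlib.compress(pickle.dumps(segdict, protocol=1)))
        paths.append(path)
    return paths


def load_segment(prefix, segment):
    """Read back a single segment without touching the others."""
    with open(prefix + segment, 'rb') as fh:
        return pickle.loads(zlib.decompress(fh.read()))


if __name__ == '__main__':
    words = {'apple': {1: 2}, 'Avocado': {2: 1}, 'zebra': {1: 1}, '42': {3: 1}}
    with tempfile.TemporaryDirectory() as tmp:
        prefix = os.path.join(tmp, 'index')
        save_segments(prefix, words)
        print(sorted(load_segment(prefix, 'A')))
        print(load_segment(prefix, '#'))
```

Keeping one file per initial letter means a search only has to decompress and unpickle the segments for the letters its words start with, at the cost of rewriting every segment on save.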

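The command-line block above treats any argument starting with '-' or '/' as a switch and everything else as a search word, reading switch values by fixed-offset slicing (for example opt[5:] after DIR). The sketch below shows the same switch/search-word split in a simplified form. It is an illustration under assumptions, not the module's parser: it targets Python 3, partitions on '=' rather than slicing at fixed offsets, ignores the help aliases, and parse_args is a hypothetical name.

```python
def parse_args(argv):
    """Split arguments into (switches, search_words).

    Switches look like -NAME or /NAME=value; names are upper-cased so
    matching is case-insensitive, as in the original option loop.
    """
    switches = {}
    search_words = []
    for opt in argv:
        if len(opt) > 1 and opt[0] in ('-', '/'):
            # Assumption: '=' separates the switch name from its value.
            name, _, value = opt[1:].partition('=')
            switches[name.upper()] = value
        else:
            # A bare '-' or an empty string is kept as a search word
            # instead of being indexed into.
            search_words.append(opt)
    return switches, search_words


if __name__ == '__main__':
    print(parse_args(['-DIR=/tmp/docs', '/norecurse', 'spam', 'eggs']))
```

Checking the length before indexing avoids the IndexError that a lone '-' or an empty argument would otherwise trigger.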