annotate roundup/indexer.py @ 681:1b2d0e702ca8 search_indexing-0-4-2-branch

Added feature [SF#526730] - search for messages capability
author Roche Compaan <rochecompaan@users.sourceforge.net>
date Wed, 03 Apr 2002 11:55:57 +0000
parents
children b4d13f7cc6c4
#!/usr/bin/env python

"""Create full-text indexes and search them

Notes:

  See http://gnosis.cx/publish/programming/charming_python_15.txt
  for a detailed discussion of this module.

  This version requires Python 1.6+. It turns out that the use
  of string methods rather than [string] module functions is
  enough faster in a tight loop so as to provide a quite
  remarkable 25% speedup in overall indexing. However, only FOUR
  lines in TextSplitter.text_splitter() were changed away from
  Python 1.5 compatibility. Those lines are followed by comments
  beginning with "# 1.52: " that show the old forms. Python
  1.5 users can restore these lines, and comment out those just
  above them.

Classes:

    GenericIndexer      -- Abstract class
    TextSplitter        -- Mixin class
    Index
    ShelveIndexer
    FlatIndexer
    XMLPickleIndexer
    PickleIndexer
    ZPickleIndexer
    SlicedZPickleIndexer

Functions:

    echo_fname(fname)
    recurse_files(...)

Index Formats:

    *Indexer.files:     filename --> (fileid, wordcount)
    *Indexer.fileids:   fileid --> filename
    *Indexer.words:     word --> {fileid1:occurs, fileid2:occurs, ...}
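
  As an illustration only, the three mappings listed under Index
  Formats can be sketched with plain dictionaries. The helper names
  build_index and find_all are hypothetical and are not part of this
  module. The sketch assumes modern Python 3 and a naive whitespace
  split in place of TextSplitter, and it reads text from an in-memory
  mapping rather than from files on disk.

```python
# Hypothetical sketch of the three index mappings; not this module's code.
def build_index(docs):
    # docs: mapping of filename -> text (a stand-in for files on disk)
    files, fileids, words = {}, {}, {}
    for fileid, (fname, text) in enumerate(sorted(docs.items()), 1):
        tokens = text.lower().split()            # naive splitter (assumption)
        files[fname] = (fileid, len(tokens))     # filename --> (fileid, wordcount)
        fileids[fileid] = fname                  # fileid --> filename
        for word in tokens:
            occurs = words.setdefault(word, {})  # word --> {fileid: occurs}
            occurs[fileid] = occurs.get(fileid, 0) + 1
    return files, fileids, words

def find_all(words, fileids, wordlist):
    # Return the filenames that contain every word in wordlist.
    hits = None
    for word in wordlist:
        ids = set(words.get(word.lower(), {}))
        hits = ids if hits is None else hits & ids
    return sorted(fileids[i] for i in (hits or ()))

docs = {'a.txt': 'spam eggs spam', 'b.txt': 'eggs bacon'}
files, fileids, words = build_index(docs)
```

  Keeping fileids as a separate reverse mapping lets the words
  dictionary store small integer ids instead of repeating each
  filename once per word.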

Module Usage:

  There are a few ways to use this module. Just to utilize existing
  functionality, something like the following is a likely
  pattern:

      import gnosis.indexer as indexer
      index = indexer.MyFavoriteIndexer()     # For some concrete Indexer
      index.load_index('myIndex.db')
      index.add_files(dir='/this/that/otherdir', pattern='*.txt')
      hits = index.find(['spam','eggs','bacon'])
      index.print_report(hits)

  To customize the basic classes, something like the following is likely:

      class MySplitter:
          def splitter(self, text, ftype):
              "Perform much better splitting than default (for filetypes)"
              # ...
              return words

      class MyIndexer(indexer.GenericIndexer, MySplitter):
          def load_index(self, INDEXDB=None):
              "Retrieve three dictionaries from clever storage method"
              # ...
              self.words, self.files, self.fileids = WORDS, FILES, FILEIDS
          def save_index(self, INDEXDB=None):
              "Save three dictionaries to clever storage method"

      index = MyIndexer()
      # ...etc...

Benchmarks:

  As we know, there are lies, damn lies, and benchmarks. Take
  the below with an adequate dose of salt. In version 0.10 of
  the concrete indexers, some performance was tested. The
  test case was a set of mail/news archives that were about
  43 MB, and 225 files. In each case, an index was generated
  (if possible), and a search for the words "xml python" was
  performed.

    - Index w/ PickleIndexer:     482s, 2.4 MB
    - Search w/ PickleIndexer:    1.74s
    - Index w/ ZPickleIndexer:    484s, 1.2 MB
    - Search w/ ZPickleIndexer:   1.77s
    - Index w/ FlatIndexer:       492s, 2.6 MB
    - Search w/ FlatIndexer:      53s
    - Index w/ ShelveIndexer:     (dumbdbm) Many minutes, tens of MB
    - Search w/ ShelveIndexer:    Aborted before completely indexed
    - Index w/ ShelveIndexer:     (dbhash) Long time (partial crash), 10 MB
    - Search w/ ShelveIndexer:    N/A. Too many glitches
    - Index w/ XMLPickleIndexer:  Memory error (xml_pickle uses bad string
                                  composition for large output)
    - Search w/ XMLPickleIndexer: N/A
    - grep search (xml|python):   20s (cached: <5s)
    - 'srch' utility (python):    12s
"""
__shell_usage__ = """
Shell Usage: [python] indexer.py [options] [search_words]

    -h, /h, -?, /?, ?, --help:    Show this help screen
    -index:                       Add files to index
    -reindex:                     Refresh files already in the index
                                  (can take much more time)
    -casesensitive:               Maintain the case of indexed words
                                  (can lead to MUCH larger indices)
    -norecurse, -local:           Only index starting dir, not subdirs
    -dir=<directory>:             Starting directory for indexing
                                  (default is current directory)
    -indexdb=<database>:          Use specified index database
                                  (environ variable INDEXER_DB is preferred)
    -regex=<pattern>:             Index files matching regular expression
    -glob=<pattern>:              Index files matching glob pattern
    -filter=<pattern>:            Only display results matching pattern
    -output=<opt>, -format=<opt>: How much detail on matches?
    -<digit>:                     Quiet level (0=verbose ... 9=quiet)

Output/format options are ALL/EVERYTHING/VERBOSE, RATINGS/SCORES,
FILENAMES/NAMES/FILES, SUMMARY/REPORT"""

__version__ = "$Revision: 1.1.2.1 $"
__author__=["David Mertz (mertz@gnosis.cx)",]
__thanks_to__=["Pat Knight (p.knight@ktgroup.co.uk)",
               "Gregory Popovitch (greg@gpy.com)", ]
__copyright__="""
  This file is released to the public domain. I (dqm) would
  appreciate it if you choose to keep derived works under terms
  that promote freedom, but obviously am giving up any rights
  to compel such.
"""

__history__="""
  0.1    Initial version.

  0.11   Tweaked TextSplitter after some random experimentation.

  0.12   Added SlicedZPickleIndexer (best choice, so far).

  0.13   Pat Knight pointed out need for binary open()'s of
         certain files under Windows.

  0.14   Added '-filter' switch to search results.

  0.15   Added direct read of gzip files

  0.20   Gregory Popovitch did some profiling on TextSplitter,
         and provided both huge speedups to the Python version
         and hooks to a C extension class (ZopeTextSplitter).
         A little refactoring by him and me (dqm) has nearly
         doubled the speed of indexing

  0.30   Module refactored into gnosis package. This is a
         first pass, and various documentation and test cases
         should be added later.
"""
import string, re, os, fnmatch, sys, copy, gzip
from types import *

#-- Silly "do nothing" default recursive file processor
def echo_fname(fname): print fname

#-- "Recurse and process files" utility function
def recurse_files(curdir, pattern, exclusions, func=echo_fname, *args, **kw):
    "Recursively process file pattern"
    subdirs, files = [],[]
    level = kw.get('level',0)

    for name in os.listdir(curdir):
        fname = os.path.join(curdir, name)
        if name[-4:] in exclusions:
            pass            # do not include binary file type
        elif os.path.isdir(fname) and not os.path.islink(fname):
            subdirs.append(fname)
        # kludge to detect a regular expression across python versions
        elif sys.version[0]=='1' and isinstance(pattern, re.RegexObject):
            if pattern.match(name):
                files.append(fname)
        elif sys.version[0]=='2' and type(pattern)==type(re.compile('')):
            if pattern.match(name):
                files.append(fname)
        elif type(pattern) is StringType:
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff changeset
185 if fnmatch.fnmatch(name, pattern):
1b2d0e702ca8 Added feature [SF#526730] - search for messages capability
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff changeset
186 files.append(fname)
1b2d0e702ca8 Added feature [SF#526730] - search for messages capability
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff changeset
187
1b2d0e702ca8 Added feature [SF#526730] - search for messages capability
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff changeset
188 for fname in files:
1b2d0e702ca8 Added feature [SF#526730] - search for messages capability
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff changeset
189 apply(func, (fname,)+args)
1b2d0e702ca8 Added feature [SF#526730] - search for messages capability
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff changeset
190 for subdir in subdirs:
1b2d0e702ca8 Added feature [SF#526730] - search for messages capability
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff changeset
191 recurse_files(subdir, pattern, exclusions, func, level=level+1)
1b2d0e702ca8 Added feature [SF#526730] - search for messages capability
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff changeset
192
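For readers on a current interpreter, the directory walk in recurse_files can be sketched in Python 3 roughly as follows. This is an illustrative rewrite, not the module's code: it assumes Python 3.8 or later (for re.Pattern), uses print as the default callback, sorts directory entries for a deterministic order, and forwards the extra positional arguments when recursing, which the listing above does not do. The level keyword is omitted.

```python
import fnmatch
import os
import re


def recurse_files(curdir, pattern, exclusions, func=print, *args):
    """Call func(path, *args) for each matching file under curdir."""
    subdirs, files = [], []
    for name in sorted(os.listdir(curdir)):
        fname = os.path.join(curdir, name)
        if name[-4:] in exclusions:
            continue                      # skip excluded (binary) suffixes
        if os.path.isdir(fname) and not os.path.islink(fname):
            subdirs.append(fname)         # descend later, never through symlinks
        elif isinstance(pattern, re.Pattern):
            if pattern.match(name):       # compiled regular expression
                files.append(fname)
        elif isinstance(pattern, str):
            if fnmatch.fnmatch(name, pattern):   # shell-style glob
                files.append(fname)

    # Files in this directory are processed before any subdirectory.
    for fname in files:
        func(fname, *args)
    for subdir in subdirs:
        recurse_files(subdir, pattern, exclusions, func, *args)
```

Passing a list's append method as func collects the matching paths instead of printing them.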
#-- Data bundle for index dictionaries
class Index:
    def __init__(self, words, files, fileids):
        if words is not None:   self.WORDS = words
        if files is not None:   self.FILES = files
        if fileids is not None: self.FILEIDS = fileids

#-- "Split plain text into words" utility function
class TextSplitter:
    def initSplitter(self):
        prenum  = string.join(map(chr, range(0,48)), '')
        num2cap = string.join(map(chr, range(58,65)), '')
        cap2low = string.join(map(chr, range(91,97)), '')
        postlow = string.join(map(chr, range(123,256)), '')
        nonword = prenum + num2cap + cap2low + postlow
        self.word_only = string.maketrans(nonword, " "*len(nonword))
        self.nondigits = string.join(map(chr, range(0,48)) + map(chr, range(58,255)), '')
        self.alpha = string.join(map(chr, range(65,91)) + map(chr, range(97,123)), '')
        self.ident = string.join(map(chr, range(256)), '')
        self.init = 1

    def splitter(self, text, ftype):
        "Split the contents of a text string into a list of 'words'"
        if ftype == 'text/plain':
            words = self.text_splitter(text, self.casesensitive)
        else:
            raise NotImplementedError
        return words

    def text_splitter(self, text, casesensitive=0):
        """Split text/plain string into a list of words

        In version 0.20 this function is still fairly weak at
        identifying "real" words, and excluding gibberish
        strings.  As long as the indexer looks at "real" text
        files, it does pretty well; but if indexing of binary
        data is attempted, a lot of gibberish gets indexed.
        Suggestions on improving this are GREATLY APPRECIATED.
        """
        # Initialize some constants
        if not hasattr(self,'init'): self.initSplitter()

        # Speedup trick: attributes into local scope
        word_only = self.word_only
        ident = self.ident
        alpha = self.alpha
        nondigits = self.nondigits
        translate = string.translate

        # Let's adjust case if not case-sensitive
        if not casesensitive: text = string.upper(text)

        # Split the raw text
        allwords = string.split(text)

        # Finally, let's skip some words not worth indexing
        words = []
        for word in allwords:
            if len(word) > 25: continue         # too long (probably gibberish)

            # Identify common patterns in non-word data (binary, UU/MIME, etc)
            num_nonalpha = len(word.translate(ident, alpha))
            numdigits    = len(word.translate(ident, nondigits))
            # 1.52: num_nonalpha = len(translate(word, ident, alpha))
            # 1.52: numdigits    = len(translate(word, ident, nondigits))
            if numdigits > len(word)-2:         # almost all digits
                if numdigits > 5:               # too many digits is gibberish
                    continue                    # a moderate number is year/zipcode/etc
            elif num_nonalpha*3 > len(word):    # too much scattered nonalpha = gibberish
                continue

            word = word.translate(word_only)    # Let's strip funny byte values
            # 1.52: word = translate(word, word_only)
            subwords = word.split()             # maybe embedded non-alphanumeric
            # 1.52: subwords = string.split(word)
            for subword in subwords:            # ...so we might have subwords
                if len(subword) <= 2: continue  # too short a subword
                words.append(subword)
        return words

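The filtering rules in text_splitter can be tried out on a modern interpreter with a sketch like the one below. It is a Python 3 approximation under stated assumptions, not the module's code: split_words is a hypothetical name, the byte-oriented translate tables are replaced by plain character counting, and a regular expression split stands in for the translate-then-split step. Only ASCII letters and digits count as word characters, matching the tables built in initSplitter.

```python
import re


def split_words(text, casesensitive=False):
    """Split text into index-worthy words (sketch of the heuristics above)."""
    if not casesensitive:
        text = text.upper()

    words = []
    for word in text.split():
        if len(word) > 25:
            continue                      # too long, probably gibberish

        numdigits = sum(c in "0123456789" for c in word)
        num_nonalpha = sum(
            not ("A" <= c <= "Z" or "a" <= c <= "z") for c in word
        )
        if numdigits > len(word) - 2:     # almost all digits
            if numdigits > 5:             # long digit runs are dropped
                continue                  # short ones (year, zip code) are kept
        elif num_nonalpha * 3 > len(word):
            continue                      # too much scattered non-alpha

        # Treat every non-alphanumeric character as a separator.
        for subword in re.split(r"[^0-9A-Za-z]+", word):
            if len(subword) > 2:          # drop subwords of one or two characters
                words.append(subword)
    return words
```

With case folding left on, a call such as split_words("Hello, world! 2002") keeps the short number and strips the punctuation.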
class ZopeTextSplitter:
    def initSplitter(self):
        import Splitter
        stop_words=(
            'am', 'ii', 'iii', 'per', 'po', 're', 'a', 'about', 'above', 'across',
            'after', 'afterwards', 'again', 'against', 'all', 'almost', 'alone',
            'along', 'already', 'also', 'although', 'always', 'am', 'among',
            'amongst', 'amoungst', 'amount', 'an', 'and', 'another', 'any',
            'anyhow', 'anyone', 'anything', 'anyway', 'anywhere', 'are', 'around',
            'as', 'at', 'back', 'be', 'became', 'because', 'become', 'becomes',
            'becoming', 'been', 'before', 'beforehand', 'behind', 'being',
            'below', 'beside', 'besides', 'between', 'beyond', 'bill', 'both',
            'bottom', 'but', 'by', 'can', 'cannot', 'cant', 'con', 'could',
            'couldnt', 'cry', 'describe', 'detail', 'do', 'done', 'down', 'due',
            'during', 'each', 'eg', 'eight', 'either', 'eleven', 'else',
            'elsewhere', 'empty', 'enough', 'even', 'ever', 'every', 'everyone',
            'everything', 'everywhere', 'except', 'few', 'fifteen', 'fifty',
            'fill', 'find', 'fire', 'first', 'five', 'for', 'former', 'formerly',
            'forty', 'found', 'four', 'from', 'front', 'full', 'further', 'get',
            'give', 'go', 'had', 'has', 'hasnt', 'have', 'he', 'hence', 'her',
            'here', 'hereafter', 'hereby', 'herein', 'hereupon', 'hers',
            'herself', 'him', 'himself', 'his', 'how', 'however', 'hundred', 'i',
            'ie', 'if', 'in', 'inc', 'indeed', 'interest', 'into', 'is', 'it',
            'its', 'itself', 'keep', 'last', 'latter', 'latterly', 'least',
            'less', 'made', 'many', 'may', 'me', 'meanwhile', 'might', 'mill',
            'mine', 'more', 'moreover', 'most', 'mostly', 'move', 'much', 'must',
            'my', 'myself', 'name', 'namely', 'neither', 'never', 'nevertheless',
            'next', 'nine', 'no', 'nobody', 'none', 'noone', 'nor', 'not',
            'nothing', 'now', 'nowhere', 'of', 'off', 'often', 'on', 'once',
            'one', 'only', 'onto', 'or', 'other', 'others', 'otherwise', 'our',
            'ours', 'ourselves', 'out', 'over', 'own', 'per', 'perhaps',
            'please', 'pre', 'put', 'rather', 're', 'same', 'see', 'seem',
            'seemed', 'seeming', 'seems', 'serious', 'several', 'she', 'should',
            'show', 'side', 'since', 'sincere', 'six', 'sixty', 'so', 'some',
            'somehow', 'someone', 'something', 'sometime', 'sometimes',
            'somewhere', 'still', 'such', 'take', 'ten', 'than', 'that', 'the',
            'their', 'them', 'themselves', 'then', 'thence', 'there',
            'thereafter', 'thereby', 'therefore', 'therein', 'thereupon', 'these',
            'they', 'thick', 'thin', 'third', 'this', 'those', 'though', 'three',
            'through', 'throughout', 'thru', 'thus', 'to', 'together', 'too',
            'toward', 'towards', 'twelve', 'twenty', 'two', 'un', 'under',
            'until', 'up', 'upon', 'us', 'very', 'via', 'was', 'we', 'well',
            'were', 'what', 'whatever', 'when', 'whence', 'whenever', 'where',
            'whereafter', 'whereas', 'whereby', 'wherein', 'whereupon',
            'wherever', 'whether', 'which', 'while', 'whither', 'who', 'whoever',
            'whole', 'whom', 'whose', 'why', 'will', 'with', 'within', 'without',
            'would', 'yet', 'you', 'your', 'yours', 'yourself', 'yourselves',
        )
        self.stop_word_dict={}
        for word in stop_words: self.stop_word_dict[word]=None
        self.splitterobj = Splitter.getSplitter()
        self.init = 1

    def goodword(self, word):
        return len(word) < 25

    def splitter(self, text, ftype):
        """never case-sensitive"""
        if not hasattr(self,'init'): self.initSplitter()
        return filter(self.goodword, self.splitterobj(text, self.stop_word_dict))


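The two-stage filter used by ZopeTextSplitter (drop stop words inside the splitter, then drop overlong words with goodword) can be illustrated without Zope installed. The sketch below is Python 3 and rests on assumptions: simple_splitter is a hypothetical stand-in for the object returned by Splitter.getSplitter(), whose real tokenizing rules are not shown in this listing, and STOP_WORDS is a small illustrative subset of the tuple above.

```python
import re

# Illustrative subset only; the listing above defines the full tuple.
STOP_WORDS = {"a", "about", "and", "of", "the"}


def simple_splitter(text, stop_words):
    """Hypothetical stand-in for the Zope splitter object.

    Lowercases the text, extracts runs of letters and digits, and
    drops anything found in stop_words.
    """
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [w for w in tokens if w not in stop_words]


def goodword(word):
    """Same length rule as ZopeTextSplitter.goodword."""
    return len(word) < 25


def split(text):
    """Apply stop-word removal first, then the length filter."""
    return [w for w in simple_splitter(text, STOP_WORDS) if goodword(w)]
```

A set gives the same constant-time membership test that the listing obtains from a dictionary with None values, which was the usual idiom before sets were built into the language.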
1b2d0e702ca8 Added feature [SF#526730] - search for messages capability
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff changeset
335 #-- "Abstract" parent class for inherited indexers
1b2d0e702ca8 Added feature [SF#526730] - search for messages capability
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff changeset
336 # (does not handle storage in parent, other methods are primitive)

class GenericIndexer:
    def __init__(self, **kw):
        apply(self.configure, (), kw)

    def whoami(self):
        return self.__class__.__name__

    def configure(self, REINDEX=0, CASESENSITIVE=0,
                        INDEXDB=os.environ.get('INDEXER_DB', 'TEMP_NDX.DB'),
                        ADD_PATTERN='*', QUIET=5):
        "Configure settings used by indexing and storage/retrieval"
        self.indexdb = INDEXDB
        self.reindex = REINDEX
        self.casesensitive = CASESENSITIVE
        self.add_pattern = ADD_PATTERN
        self.quiet = QUIET
        self.filter = None

    def add_files(self, dir=os.getcwd(), pattern=None, descend=1):
        self.load_index()
        exclusions = ('.zip','.pyc','.gif','.jpg','.dat','.dir')
        if not pattern:
            pattern = self.add_pattern
        recurse_files(dir, pattern, exclusions, self.add_file)
        # Rebuild the fileid index
        self.fileids = {}
        for fname in self.files.keys():
            fileid = self.files[fname][0]
            self.fileids[fileid] = fname

    def add_file(self, fname, ftype='text/plain'):
        "Index the contents of a regular file"
        if self.files.has_key(fname):   # Is file eligible for (re)indexing?
            if self.reindex:            # Reindexing enabled, cleanup dicts
                self.purge_entry(fname, self.files, self.words)
            else:                       # DO NOT reindex this file
                if self.quiet < 5: print "Skipping", fname
                return 0

        # Read in the file (if possible)
        try:
            if fname[-3:] == '.gz':
                text = gzip.open(fname).read()
            else:
                text = open(fname).read()
            if self.quiet < 5: print "Indexing", fname
        except IOError:
            return 0
        words = self.splitter(text, ftype)

        # Find new file index, and assign it to filename
        # (_TOP uses trick of negative to avoid conflict with file index)
        self.files['_TOP'] = (self.files['_TOP'][0]-1, None)
        file_index = abs(self.files['_TOP'][0])
        self.files[fname] = (file_index, len(words))

        filedict = {}
        for word in words:
            if filedict.has_key(word):
                filedict[word] = filedict[word]+1
            else:
                filedict[word] = 1

        for word in filedict.keys():
            if self.words.has_key(word):
                entry = self.words[word]
            else:
                entry = {}
            entry[file_index] = filedict[word]
            self.words[word] = entry

    def add_othertext(self, identifier):
        """Index a textual source other than a plain file

        A child class might want to implement this method (or a similar one)
        in order to index textual sources such as SQL tables, URLs, clay
        tablets, or whatever else.  The identifier should uniquely pick out
        the source of the text (whatever it is)
        """
        raise NotImplementedError

    def save_index(self, INDEXDB=None):
        raise NotImplementedError

    def load_index(self, INDEXDB=None, reload=0, wordlist=None):
        raise NotImplementedError

    def find(self, wordlist, print_report=0):
        "Locate files that match ALL the words in wordlist"
        self.load_index(wordlist=wordlist)
        entries = {}
        hits = copy.copy(self.fileids)      # Copy of fileids index
        for word in wordlist:
            if not self.casesensitive:
                word = string.upper(word)
            entry = self.words.get(word)    # For each word, get index
            entries[word] = entry           #   of matching files
            if not entry:                   # Nothing for this one word (fail)
                return 0
            for fileid in hits.keys():      # Eliminate hits for every non-match
                if not entry.has_key(fileid):
                    del hits[fileid]
        if print_report:
            self.print_report(hits, wordlist, entries)
        return hits

    def print_report(self, hits={}, wordlist=[], entries={}):
        # Figure out what to actually print (based on QUIET level)
        output = []
        for fileid,fname in hits.items():
            message = fname
            if self.quiet <= 3:
                wordcount = self.files[fname][1]
                matches = 0
                countmess = '\n'+' '*13+`wordcount`+' words; '
                for word in wordlist:
                    if not self.casesensitive:
                        word = string.upper(word)
                    occurs = entries[word][fileid]
                    matches = matches+occurs
                    countmess = countmess +`occurs`+' '+word+'; '
                message = string.ljust('[RATING: '
                                       +`1000*matches/wordcount`+']',13)+message
                if self.quiet <= 2: message = message +countmess +'\n'
            if self.filter:     # Using an output filter
                if fnmatch.fnmatch(message, self.filter):
                    output.append(message)
            else:
                output.append(message)

        if self.quiet <= 5:
            print string.join(output,'\n')
        sys.stderr.write('\n'+`len(output)`+' files matched wordlist: '+
                         `wordlist`+'\n')
        return output

    def purge_entry(self, fname, file_dct, word_dct):
        "Remove a file from file index and word index"
        try:        # The easy part, cleanup the file index
            # file_dct values are (file_index, wordcount) tuples, while the
            # word index is keyed by the integer file_index alone
            file_index = file_dct[fname][0]
            del file_dct[fname]
        except KeyError:
            return  # fname was never indexed, so there is nothing to purge
        # The much harder part, cleanup the word index
        for word, occurs in word_dct.items():
            if occurs.has_key(file_index):
                del occurs[file_index]
                word_dct[word] = occurs

    def index_loaded(self):
        return ( hasattr(self,'fileids') and
                 hasattr(self,'files')   and
                 hasattr(self,'words')      )
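
For illustration only, and not part of the module: a minimal sketch of the two dictionaries that GenericIndexer maintains (files and words) and of the ALL-words lookup performed by find(). The helper names add_text and find_all are hypothetical, plain whitespace splitting stands in for the real splitter, lookups are always case-insensitive, and an empty dict (rather than 0) is returned when a word is missing.

```python
def add_text(files, words, name, text):
    """Index `text` under `name`, using the same layout as add_file()."""
    tokens = text.upper().split()      # crude stand-in for self.splitter()
    # '_TOP' holds a negative counter so it never collides with a file index
    top = files.get('_TOP', (0, None))[0] - 1
    files['_TOP'] = (top, None)
    fileid = abs(top)
    files[name] = (fileid, len(tokens))
    # words maps each word to {fileid: number of occurrences}
    for token in tokens:
        entry = words.setdefault(token, {})
        entry[fileid] = entry.get(fileid, 0) + 1
    return fileid


def find_all(files, words, wordlist):
    """Return {fileid: name} for sources containing every word in wordlist."""
    hits = dict((val[0], key) for key, val in files.items() if key != '_TOP')
    for word in wordlist:
        entry = words.get(word.upper())
        if not entry:                  # one missing word fails the whole query
            return {}
        for fileid in list(hits):      # drop every source lacking this word
            if fileid not in entry:
                del hits[fileid]
    return hits


files, words = {}, {}
add_text(files, words, 'a.txt', 'spam eggs spam')
add_text(files, words, 'b.txt', 'eggs and ham')
result = find_all(files, words, ['spam', 'eggs'])
```

Starting from every known source and deleting non-matches, as find() does, keeps the loop simple; intersecting the per-word entries directly would be an alternative when the index holds many sources.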

#-- Provide an actual storage facility for the indexes (i.e. shelve)
class ShelveIndexer(GenericIndexer, TextSplitter):
    """Concrete Indexer utilizing [shelve] for storage

    Unfortunately, [shelve] proves far too slow in indexing, while
    creating monstrously large indexes.  Not recommended, at least under
    the default dbm's tested.  Also, class may be broken because
    shelves do not, apparently, support the .values() and .items()
    methods.  Fixing this is a low priority, but the sample code is
    left here.
    """
    def load_index(self, INDEXDB=None, reload=0, wordlist=None):
        INDEXDB = INDEXDB or self.indexdb
        import shelve
        self.words = shelve.open(INDEXDB+".WORDS")
        self.files = shelve.open(INDEXDB+".FILES")
        self.fileids = shelve.open(INDEXDB+".FILEIDS")
        if not self.files:      # New index
            self.files['_TOP'] = (0,None)

    def save_index(self, INDEXDB=None):
        INDEXDB = INDEXDB or self.indexdb
        pass
1b2d0e702ca8 Added feature [SF#526730] - search for messages capability
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff changeset
515
class FlatIndexer(GenericIndexer, TextSplitter):
    """Concrete Indexer utilizing flat-file for storage

    See the comments in the referenced article for details; in
    brief, this indexer has about the same timing as the best in
    -creating- indexes and the storage requirements are
    reasonable.  However, actually -using- a flat-file index is
    more than an order of magnitude worse than the best indexer
    (ZPickleIndexer wins overall).

    On the other hand, FlatIndexer creates a wonderfully easy to
    parse database format if you have a reason to transport the
    index to a different platform or programming language.  And
    should you perform indexing as part of a long-running
    process, the overhead of initial file parsing becomes
    irrelevant.
    """
    def load_index(self, INDEXDB=None, reload=0, wordlist=None):
        # Unless reload is indicated, do not load twice
        if self.index_loaded() and not reload: return 0
        # Ok, now let's actually load it
        INDEXDB = INDEXDB or self.indexdb
        self.words = {}
        self.files = {'_TOP':(0,None)}
        self.fileids = {}
        try:                            # Read index contents
            for line in open(INDEXDB).readlines():
                fields = string.split(line)
                if fields[0] == '-':    # Read a file/fileid line
                    fileid = eval(fields[2])
                    wordcount = eval(fields[3])
                    fname = fields[1]
                    self.files[fname] = (fileid, wordcount)
                    self.fileids[fileid] = fname
                else:                   # Read a word entry (dict of hits)
                    entries = {}
                    word = fields[0]
                    for n in range(1,len(fields),2):
                        fileid = eval(fields[n])
                        occurs = eval(fields[n+1])
                        entries[fileid] = occurs
                    self.words[word] = entries
        except:
            pass                        # New index

    def save_index(self, INDEXDB=None):
        INDEXDB = INDEXDB or self.indexdb
        tab, lf, sp = '\t','\n',' '
        indexdb = open(INDEXDB,'w')
        # One line per file: "- <fname> TAB <fileid> TAB <wordcount>"
        for fname,entry in self.files.items():
            indexdb.write('- '+fname +tab +`entry[0]` +tab +`entry[1]` +lf)
        # One line per word: "<word> TAB TAB <fileid> <occurs> ..."
        for word,entry in self.words.items():
            indexdb.write(word +tab+tab)
            for fileid,occurs in entry.items():
                indexdb.write(`fileid` +sp +`occurs` +sp)
            indexdb.write(lf)
        indexdb.close()                 # Flush the index to disk explicitly

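# Illustrative sketch only (not part of the original module): the flat-file
# layout written by FlatIndexer.save_index() is one record per line.  File
# records look like "- <fname>\t<fileid>\t<wordcount>"; word records look
# like "<word>\t\t<fileid> <occurs> <fileid> <occurs> ...".  The helper name
# parse_flat_line_sketch is hypothetical, and it uses int() where the
# original uses eval(), so it assumes plain integer ids and counts (or the
# literal text 'None' for the '_TOP' word count).
def parse_flat_line_sketch(line):
    fields = line.split()
    if not fields:
        return None                     # Blank line: nothing to parse
    if fields[0] == '-':                # File/fileid record
        fname = fields[1]
        fileid = int(fields[2])
        if fields[3] == 'None':         # '_TOP' stores None as its count
            wordcount = None
        else:
            wordcount = int(fields[3])
        return ('file', fname, (fileid, wordcount))
    entries = {}                        # Word record: fileid -> occurrences
    for n in range(1, len(fields), 2):
        entries[int(fields[n])] = int(fields[n+1])
    return ('word', fields[0], entries)
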
class PickleIndexer(GenericIndexer, TextSplitter):
    """Concrete Indexer utilizing cPickle for storage"""
    def load_index(self, INDEXDB=None, reload=0, wordlist=None):
        # Unless reload is indicated, do not load twice
        if self.index_loaded() and not reload: return 0
        # Ok, now let's actually load it
        import cPickle
        INDEXDB = INDEXDB or self.indexdb
        try:
            pickle_str = open(INDEXDB,'rb').read()
            db = cPickle.loads(pickle_str)
        except:                         # New index
            db = Index({}, {'_TOP':(0,None)}, {})
        self.words, self.files, self.fileids = db.WORDS, db.FILES, db.FILEIDS

    def save_index(self, INDEXDB=None):
        import cPickle
        INDEXDB = INDEXDB or self.indexdb
        db = Index(self.words, self.files, self.fileids)
        open(INDEXDB,'wb').write(cPickle.dumps(db, 1))

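# Illustrative sketch only (not part of the original module): the
# PickleIndexer load/save pair amounts to a pickle round trip with a
# fall-back to an empty index.  This version uses the portable `pickle`
# module instead of cPickle, a plain dict instead of the module's Index
# class, and narrows the bare except to the errors expected for a missing or
# unreadable file.  All names ending in _sketch are hypothetical.
import pickle

def save_pickle_sketch(path, words, files, fileids):
    db = {'WORDS': words, 'FILES': files, 'FILEIDS': fileids}
    fh = open(path, 'wb')
    try:
        fh.write(pickle.dumps(db, 1))   # Protocol 1, as in the original
    finally:
        fh.close()

def load_pickle_sketch(path):
    try:
        fh = open(path, 'rb')
        try:
            return pickle.loads(fh.read())
        finally:
            fh.close()
    except (IOError, OSError, EOFError, pickle.UnpicklingError):
        # Treat a missing or corrupt file as a new, empty index
        return {'WORDS': {}, 'FILES': {'_TOP': (0, None)}, 'FILEIDS': {}}
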
class XMLPickleIndexer(PickleIndexer):
    """Concrete Indexer utilizing XML for storage

    While this is, as expected, a verbose format, the possibility
    of using XML as a transport format for indexes might be
    useful.  However, [xml_pickle] is in need of some redesign to
    avoid gross inefficiency when creating very large
    (multi-megabyte) output files (fixed in [xml_pickle] version
    0.48 or above)
    """
    def load_index(self, INDEXDB=None, reload=0, wordlist=None):
        # Unless reload is indicated, do not load twice
        if self.index_loaded() and not reload: return 0
        # Ok, now let's actually load it
        from gnosis.xml.pickle import XML_Pickler
        INDEXDB = INDEXDB or self.indexdb
        try:                            # XML file exists
            xml_str = open(INDEXDB).read()
            db = XML_Pickler().loads(xml_str)
        except:                         # New index
            db = Index({}, {'_TOP':(0,None)}, {})
        self.words, self.files, self.fileids = db.WORDS, db.FILES, db.FILEIDS

    def save_index(self, INDEXDB=None):
        from gnosis.xml.pickle import XML_Pickler
        INDEXDB = INDEXDB or self.indexdb
        db = Index(self.words, self.files, self.fileids)
        open(INDEXDB,'w').write(XML_Pickler(db).dumps())

class ZPickleIndexer(PickleIndexer):
    """Concrete Indexer utilizing a zlib-compressed pickle (INDEXDB+'!')"""
    def load_index(self, INDEXDB=None, reload=0, wordlist=None):
        # Unless reload is indicated, do not load twice
        if self.index_loaded() and not reload: return 0
        # Ok, now let's actually load it
        import cPickle, zlib
        INDEXDB = INDEXDB or self.indexdb
        try:
            pickle_str = zlib.decompress(open(INDEXDB+'!','rb').read())
            db = cPickle.loads(pickle_str)
        except:                         # New index
            db = Index({}, {'_TOP':(0,None)}, {})
        self.words, self.files, self.fileids = db.WORDS, db.FILES, db.FILEIDS

    def save_index(self, INDEXDB=None):
        import cPickle, zlib
        INDEXDB = INDEXDB or self.indexdb
        db = Index(self.words, self.files, self.fileids)
        pickle_fh = open(INDEXDB+'!','wb')
        pickle_fh.write(zlib.compress(cPickle.dumps(db, 1)))
        pickle_fh.close()               # Flush the index to disk explicitly


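# Illustrative sketch only (not part of the original module): ZPickleIndexer
# stores the same pickle as PickleIndexer, passed through zlib and written
# to INDEXDB+'!'.  The round trip below shows just the compress/decompress
# step on an in-memory object; `pickle` stands in for cPickle, and both
# function names are hypothetical.
import pickle, zlib

def zpickle_dumps_sketch(obj):
    # Pickle first, then compress the resulting byte string
    return zlib.compress(pickle.dumps(obj, 1))

def zpickle_loads_sketch(data):
    # Reverse order on the way back: decompress, then unpickle
    return pickle.loads(zlib.decompress(data))
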
class SlicedZPickleIndexer(ZPickleIndexer):
    """Concrete Indexer that loads zlib-compressed pickle segments

    Each segment file is named INDEXDB plus one segment character.
    When a wordlist is given, only the '-' and '#' segments and the
    segments for the initial letters of those words are loaded.
    """
    segments = "ABCDEFGHIJKLMNOPQRSTUVWXYZ#-!"
    def load_index(self, INDEXDB=None, reload=0, wordlist=None):
        # Unless reload is indicated, do not load twice
        if self.index_loaded() and not reload: return 0
        # Ok, now let's actually load it
        import cPickle, zlib
        INDEXDB = INDEXDB or self.indexdb
        db = Index({}, {'_TOP':(0,None)}, {})
        # Identify the relevant word-dictionary segments
        if not wordlist:
            segments = self.segments
        else:
            segments = ['-','#']
            for word in wordlist:
                segments.append(string.upper(word[0]))
        # Load the segments
        for segment in segments:
            try:
                pickle_str = zlib.decompress(open(INDEXDB+segment,'rb').read())
                dbslice = cPickle.loads(pickle_str)
                if dbslice.__dict__.get('WORDS'):   # If it has some words, add them
                    for word,entry in dbslice.WORDS.items():
                        db.WORDS[word] = entry
                if dbslice.__dict__.get('FILES'):   # If it has some files, add them
                    db.FILES = dbslice.FILES
                if dbslice.__dict__.get('FILEIDS'): # If it has fileids, add them
                    db.FILEIDS = dbslice.FILEIDS
            except:
                pass    # No biggie, couldn't find this segment
        self.words, self.files, self.fileids = db.WORDS, db.FILES, db.FILEIDS

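# Illustrative sketch only (not part of the original module): the segment
# selection done in SlicedZPickleIndexer.load_index(), pulled out as a
# stand-alone function.  It always includes '-' (the file tables) and '#'
# (words not starting with A-Z), then adds the upper-cased first character
# of each search word.  Unlike the original loop it skips duplicates and
# empty words; the function name and those two refinements are assumptions,
# not the module's behaviour.
def segments_for_sketch(wordlist, all_segments="ABCDEFGHIJKLMNOPQRSTUVWXYZ#-!"):
    if not wordlist:
        return list(all_segments)       # No filter: load every segment
    segments = ['-', '#']
    for word in wordlist:
        if not word:
            continue                    # Nothing to key a segment on
        initial = word[0].upper()
        if initial not in segments:
            segments.append(initial)
    return segments
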
    def julienne(self, INDEXDB=None):
        import cPickle, zlib
        INDEXDB = INDEXDB or self.indexdb
        segments = self.segments        # all the (little) indexes
        for segment in segments:
            try:        # brutal space saver... delete all the small segments
                os.remove(INDEXDB+segment)
            except OSError:
                pass    # probably just nonexistent segment index file
        # First write the much simpler filename/fileid dictionaries
        dbfil = Index(None, self.files, self.fileids)
        open(INDEXDB+'-','wb').write(zlib.compress(cPickle.dumps(dbfil,1)))
        # The hard part is splitting the word dictionary up, of course
        letters = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
        segdicts = {}                           # Need batch of empty dicts
        for segment in letters+'#':
            segdicts[segment] = {}
        for word, entry in self.words.items():  # Split into segment dicts
            initchar = string.upper(word[0])
            if initchar in letters:
                segdicts[initchar][word] = entry
            else:
                segdicts['#'][word] = entry
        for initchar in letters+'#':
1b2d0e702ca8 Added feature [SF#526730] - search for messages capability
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff changeset
700 db = Index(segdicts[initchar], None, None)
1b2d0e702ca8 Added feature [SF#526730] - search for messages capability
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff changeset
701 pickle_str = cPickle.dumps(db, 1)
1b2d0e702ca8 Added feature [SF#526730] - search for messages capability
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff changeset
702 filename = INDEXDB+initchar
1b2d0e702ca8 Added feature [SF#526730] - search for messages capability
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff changeset
703 pickle_fh = open(filename,'wb')
1b2d0e702ca8 Added feature [SF#526730] - search for messages capability
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff changeset
704 pickle_fh.write(zlib.compress(pickle_str))
1b2d0e702ca8 Added feature [SF#526730] - search for messages capability
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff changeset
705 os.chmod(filename,0664)
1b2d0e702ca8 Added feature [SF#526730] - search for messages capability
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff changeset
706
1b2d0e702ca8 Added feature [SF#526730] - search for messages capability
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff changeset
707 save_index = julienne
1b2d0e702ca8 Added feature [SF#526730] - search for messages capability
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff changeset
708
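The segmenting scheme used by julienne above (one compressed pickle per initial letter, plus a catch-all '#' segment) can be illustrated with a minimal Python 3 sketch. The listing itself is Python 2 and pickles Index objects; this sketch assumes plain dictionaries instead, and the helper names split_by_initial, save_segments and load_segment are hypothetical, not part of indexer.py.

```python
import os
import pickle
import string
import tempfile
import zlib

LETTERS = string.ascii_uppercase
LETTER_SET = set(LETTERS)  # set membership avoids accidental substring matches


def split_by_initial(words):
    """Split a word dictionary into one dict per initial letter, plus '#'."""
    segdicts = {segment: {} for segment in LETTERS + '#'}
    for word, entry in words.items():
        initchar = word[0].upper()
        # Anything that does not start with A-Z lands in the '#' segment
        key = initchar if initchar in LETTER_SET else '#'
        segdicts[key][word] = entry
    return segdicts


def save_segments(prefix, segdicts):
    """Write each segment as a zlib-compressed pickle named prefix+segment."""
    for segment, segdict in segdicts.items():
        with open(prefix + segment, 'wb') as fh:
            fh.write(zlib.compress(pickle.dumps(segdict, protocol=2)))


def load_segment(prefix, segment):
    """Read a single segment back without touching the other files."""
    with open(prefix + segment, 'rb') as fh:
        return pickle.loads(zlib.decompress(fh.read()))


if __name__ == '__main__':
    words = {'index': {1: 3}, 'Issue': {2: 1}, '42nd': {1: 1}}
    with tempfile.TemporaryDirectory() as tmp:
        prefix = os.path.join(tmp, 'demo_index')
        save_segments(prefix, split_by_initial(words))
        print(load_segment(prefix, 'I'))
```

The point of the split is that a lookup for a word only needs to decompress and unpickle the one segment matching its first letter, at the cost of writing 27 small files on every save.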
709 PreferredIndexer = SlicedZPickleIndexer
710
711 #-- If called from command-line, parse arguments and take actions
712 if __name__ == '__main__':
713     import time
714     start = time.time()
715     search_words = [] # Word search list (if specified)
716     opts = 0 # Any options specified?
717     if len(sys.argv) < 2:
718         pass # No options given
719     else:
720         upper = string.upper
721         dir = os.getcwd() # Default to indexing from current directory
722         descend = 1 # Default to recursive indexing
723         ndx = PreferredIndexer()
724         for opt in sys.argv[1:]:
725             if opt in ('-h','/h','-?','/?','?','--help'): # help screen
726                 print __shell_usage__
727                 opts = -1
728                 break
729             elif opt[0] in '/-': # a switch!
730                 opts = opts+1
731                 if upper(opt[1:]) == 'INDEX': # Index files
732                     ndx.quiet = 0
733                     pass # Use defaults if no other options
734                 elif upper(opt[1:]) == 'REINDEX': # Reindex
735                     ndx.reindex = 1
736                 elif upper(opt[1:]) == 'CASESENSITIVE': # Case sensitive
737                     ndx.casesensitive = 1
738                 elif upper(opt[1:]) in ('NORECURSE','LOCAL'): # No recursion
739                     descend = 0
740                 elif upper(opt[1:4]) == 'DIR': # Dir to index
741                     dir = opt[5:]
742                 elif upper(opt[1:8]) == 'INDEXDB': # Index specified
743                     ndx.indexdb = opt[9:]
744                     sys.stderr.write(
745                         "Use of INDEXER_DB environment variable is STRONGLY recommended.\n")
746                 elif upper(opt[1:6]) == 'REGEX': # RegEx files to index
747                     ndx.add_pattern = re.compile(opt[7:])
748                 elif upper(opt[1:5]) == 'GLOB': # Glob files to index
749                     ndx.add_pattern = opt[6:]
750                 elif upper(opt[1:7]) in ('OUTPUT','FORMAT'): # How should results look?
751                     opts = opts-1 # this is not an option for indexing purposes
752                     level = upper(opt[8:])
753                     if level in ('ALL','EVERYTHING','VERBOSE', 'MAX'):
754                         ndx.quiet = 0
755                     elif level in ('RATINGS','SCORES','HIGH'):
756                         ndx.quiet = 3
757                     elif level in ('FILENAMES','NAMES','FILES','MID'):
758                         ndx.quiet = 5
759                     elif level in ('SUMMARY','MIN'):
760                         ndx.quiet = 9
761                 elif upper(opt[1:7]) == 'FILTER': # Regex filter output
762                     opts = opts-1 # this is not an option for indexing purposes
763                     ndx.filter = opt[8:]
764                 elif opt[1:] in string.digits:
765                     opts = opts-1
766                     ndx.quiet = eval(opt[1])
767             else:
768                 search_words.append(opt) # Search words
769
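The option loop above recognises switches by slicing at fixed offsets: the name is compared case-insensitively, one character after the name is skipped, and the rest is taken as the value. A minimal Python 3 sketch of the same idea follows. The function parse_switch and the two tuples are hypothetical names introduced here, and the '=' used as the separator in the examples is an assumption, since the listing never checks which character it skips.

```python
# Switches that stand alone and switches that carry a value, as in the loop above
FLAG_SWITCHES = ('INDEX', 'REINDEX', 'CASESENSITIVE', 'NORECURSE', 'LOCAL')
VALUE_SWITCHES = ('DIR', 'INDEXDB', 'REGEX', 'GLOB', 'OUTPUT', 'FORMAT', 'FILTER')


def parse_switch(opt):
    """Return (NAME, value) for a '-name' or '-name<sep>value' switch.

    Flags yield (NAME, None); unrecognised switches yield (None, None).
    """
    body = opt[1:]            # drop the leading '-' or '/'
    upper = body.upper()
    if upper in FLAG_SWITCHES:
        return upper, None
    for name in VALUE_SWITCHES:
        if upper.startswith(name):
            # Skip exactly one separator character after the name
            return name, body[len(name) + 1:]
    return None, None


if __name__ == '__main__':
    for arg in ('-dir=/tmp/docs', '/NoRecurse', '-indexdb=my.idx', '-bogus'):
        print(arg, parse_switch(arg))
```

Checking the exact-match flags first mirrors the listing, where '-index' is tested in full before the 'INDEXDB' prefix is considered.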
770         if opts > 0:
771             ndx.add_files(dir=dir)
772             ndx.save_index()
773         if search_words:
774             ndx.find(search_words, print_report=1)
775     if not opts and not search_words:
776         sys.stderr.write("Perhaps you would like to use the --help option?\n")
777     else:
778         sys.stderr.write('Processed in %.3f seconds (%s)'
779                          % (time.time()-start, ndx.whoami()))
780
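The output-level handling in the option loop above maps several synonyms onto four quiet values, and a single-digit switch sets the value directly. In the listing, the test opt[1:] in string.digits is a substring check, so an empty string or a run such as '12' also passes, and eval is applied to opt[1] only. The Python 3 sketch below shows one possible stricter variant, offered as an assumption about the intended behaviour rather than as the module's own code. The names QUIET_LEVELS, quiet_from_level and quiet_from_digit_switch are hypothetical.

```python
import string

# Build a flat synonym -> quiet value table from the groups used in the listing
QUIET_LEVELS = {}
for names, quiet in (
    (('ALL', 'EVERYTHING', 'VERBOSE', 'MAX'), 0),
    (('RATINGS', 'SCORES', 'HIGH'), 3),
    (('FILENAMES', 'NAMES', 'FILES', 'MID'), 5),
    (('SUMMARY', 'MIN'), 9),
):
    for name in names:
        QUIET_LEVELS[name] = quiet


def quiet_from_level(level, default=None):
    """Look up a level name case-insensitively; unknown names give default."""
    return QUIET_LEVELS.get(level.upper(), default)


def quiet_from_digit_switch(opt):
    """Return the quiet value for a switch like '-5', or None otherwise."""
    digit = opt[1:]
    # Require exactly one ASCII digit and convert with int() instead of eval()
    if len(digit) == 1 and digit in string.digits:
        return int(digit)
    return None


if __name__ == '__main__':
    print(quiet_from_level('scores'), quiet_from_digit_switch('-5'))
```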

Roundup Issue Tracker: http://roundup-tracker.org/