annotate roundup/indexer.py @ 752:a721f4e7ebbc

Installation note for people running the tests with a CVS checkout.
author Richard Jones <richard@users.sourceforge.net>
date Tue, 28 May 2002 11:52:08 +0000
parents 51c425129b35
children 254b8d112eec
rev   line source
749
51c425129b35 Merged search_indexing-branch with HEAD
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff changeset
#!/usr/bin/env python

"""Create full-text indexes and search them

Notes:

  See http://gnosis.cx/publish/programming/charming_python_15.txt
  for a detailed discussion of this module.

  This version requires Python 1.6+. The use of string methods
  rather than [string] module functions is enough faster in a
  tight loop to provide a quite remarkable 25% speedup in overall
  indexing. However, only FOUR lines in TextSplitter.text_splitter()
  were changed away from Python 1.5 compatibility. Those lines are
  followed by comments beginning with "# 1.52: " that show the old
  forms. Python 1.5 users can restore these lines, and comment out
  those just above them.

Classes:

  GenericIndexer      -- Abstract class
  TextSplitter        -- Mixin class
  Index
  ShelveIndexer
  FlatIndexer
  XMLPickleIndexer
  PickleIndexer
  ZPickleIndexer
  SlicedZPickleIndexer

Functions:

  echo_fname(fname)
  recurse_files(...)
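
  recurse_files(), defined further down in this module, accepts either a
  compiled regular expression or a glob string as its pattern, and skips
  names whose last four characters appear in the exclusions list. The
  sketch below restates that dispatch for current Python 3 as an
  illustration only: name_matches and collect_files are hypothetical
  names, matches are returned as a sorted list instead of being handed
  to a callback, and the temporary directory exists just for the demo.

```python
import fnmatch
import os
import re
import tempfile

REGEX_TYPE = type(re.compile(''))


def name_matches(name, pattern):
    # Compiled regex: match from the start of the name.
    if isinstance(pattern, REGEX_TYPE):
        return pattern.match(name) is not None
    # Plain string: treat it as a shell-style glob.
    if isinstance(pattern, str):
        return fnmatch.fnmatch(name, pattern)
    return False


def collect_files(curdir, pattern, exclusions=()):
    # Files of the current directory come first, then each subdirectory.
    subdirs, files = [], []
    for name in sorted(os.listdir(curdir)):  # sorted only for repeatable output
        fname = os.path.join(curdir, name)
        if name[-4:] in exclusions:
            continue  # excluded suffix, e.g. '.pyc'
        if os.path.isdir(fname) and not os.path.islink(fname):
            subdirs.append(fname)
        elif name_matches(name, pattern):
            files.append(fname)
    for subdir in subdirs:
        files.extend(collect_files(subdir, pattern, exclusions))
    return files


with tempfile.TemporaryDirectory() as top:
    os.mkdir(os.path.join(top, 'sub'))
    for rel in ('a.txt', 'b.pyc', os.path.join('sub', 'c.txt')):
        open(os.path.join(top, rel), 'w').close()
    by_glob = [os.path.relpath(p, top) for p in collect_files(top, '*.txt')]
    by_regex = [os.path.relpath(p, top)
                for p in collect_files(top, re.compile(r'.*\.py'), ['.txt'])]
print(by_glob)   # a.txt and sub/c.txt (with the platform's path separator)
print(by_regex)  # ['b.pyc']
```

  Returning a list keeps the sketch easy to check; the module itself
  calls func on each matching file as it goes.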

Index Formats:

  *Indexer.files:     filename --> (fileid, wordcount)
  *Indexer.fileids:   fileid --> filename
  *Indexer.words:     word --> {fileid1:occurs, fileid2:occurs, ...}
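
  As a rough illustration of the three mappings above, the sketch below
  builds them by hand for two tiny in-memory "files". It is written for
  current Python 3 rather than the Python 1.6 this module targets.
  build_index, the sample texts, the whitespace split, and numbering
  fileids from 1 are all assumptions made for the example, not the
  module's own behaviour.

```python
def build_index(texts):
    # texts maps filename -> contents; returns (files, fileids, words).
    files = {}    # filename -> (fileid, wordcount)
    fileids = {}  # fileid -> filename
    words = {}    # word -> {fileid: occurrences}
    for fileid, fname in enumerate(sorted(texts), 1):
        tokens = texts[fname].lower().split()
        files[fname] = (fileid, len(tokens))
        fileids[fileid] = fname
        for token in tokens:
            entry = words.setdefault(token, {})
            entry[fileid] = entry.get(fileid, 0) + 1
    return files, fileids, words


files, fileids, words = build_index({
    'a.txt': 'spam eggs spam',
    'b.txt': 'eggs bacon',
})
print(files['a.txt'])   # (1, 3)
print(fileids[2])       # b.txt
print(words['eggs'])    # {1: 1, 2: 1}
```

  Keeping both files and fileids lets a search work with small integer
  ids inside words and translate back to filenames only when reporting.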

Module Usage:

  There are a few ways to use this module. Just to utilize existing
  functionality, something like the following is a likely
  pattern:

      import gnosis.indexer as indexer
      index = indexer.MyFavoriteIndexer()     # For some concrete Indexer
      index.load_index('myIndex.db')
      index.add_files(dir='/this/that/otherdir', pattern='*.txt')
      hits = index.find(['spam','eggs','bacon'])
      index.print_report(hits)

  To customize the basic classes, something like the following is likely:

      class MySplitter:
          def splitter(self, text, ftype):
              "Perform much better splitting than default (for filetypes)"
              # ...
              return words

      class MyIndexer(indexer.GenericIndexer, MySplitter):
          def load_index(self, INDEXDB=None):
              "Retrieve three dictionaries from clever storage method"
              # ...
              self.words, self.files, self.fileids = WORDS, FILES, FILEIDS
          def save_index(self, INDEXDB=None):
              "Save three dictionaries to clever storage method"

      index = MyIndexer()
      # ...etc...
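
  To make the load_index/save_index contract above more concrete, here
  is a minimal sketch of one possible "clever storage method": a single
  pickle file holding the three dictionaries. It targets current
  Python 3. PickleStore is a hypothetical stand-in that does not derive
  from GenericIndexer, and it should not be read as how the module's
  own PickleIndexer is written.

```python
import os
import pickle
import tempfile


class PickleStore:
    # Stand-in for a concrete indexer; it only covers persistence.
    def __init__(self):
        self.words, self.files, self.fileids = {}, {}, {}

    def load_index(self, INDEXDB):
        # A missing database simply leaves the three dictionaries empty.
        if not os.path.exists(INDEXDB):
            return
        with open(INDEXDB, 'rb') as fh:
            self.words, self.files, self.fileids = pickle.load(fh)

    def save_index(self, INDEXDB):
        with open(INDEXDB, 'wb') as fh:
            pickle.dump((self.words, self.files, self.fileids), fh)


with tempfile.TemporaryDirectory() as tmp:
    db = os.path.join(tmp, 'myIndex.db')
    first = PickleStore()
    first.load_index(db)            # nothing on disk yet
    first.words['spam'] = {1: 2}
    first.files['a.txt'] = (1, 3)
    first.fileids[1] = 'a.txt'
    first.save_index(db)

    second = PickleStore()
    second.load_index(db)           # reads back what first saved
print(second.words, second.files, second.fileids)
```

  Opening the file in binary mode matters on Windows, which is the
  point of the 0.13 entry in the history further down.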

Benchmarks:

  As we know, there are lies, damn lies, and benchmarks. Take
  the below with an adequate dose of salt. In version 0.10 of
  the concrete indexers, some performance was tested. The
  test case was a set of mail/news archives totalling about
  43 MB in 225 files. In each case, an index was generated
  (if possible), and a search for the words "xml python" was
  performed.

    - Index w/ PickleIndexer:     482s, 2.4 MB
    - Search w/ PickleIndexer:    1.74s
    - Index w/ ZPickleIndexer:    484s, 1.2 MB
    - Search w/ ZPickleIndexer:   1.77s
    - Index w/ FlatIndexer:       492s, 2.6 MB
    - Search w/ FlatIndexer:      53s
    - Index w/ ShelveIndexer:     (dumbdbm) Many minutes, tens of MB
    - Search w/ ShelveIndexer:    Aborted before completely indexed
    - Index w/ ShelveIndexer:     (dbhash) Long time (partial crash), 10 MB
    - Search w/ ShelveIndexer:    N/A. Too many glitches
    - Index w/ XMLPickleIndexer:  Memory error (xml_pickle uses bad string
                                  composition for large output)
    - Search w/ XMLPickleIndexer: N/A
    - grep search (xml|python):   20s (cached: <5s)
    - 'srch' utility (python):    12s
"""
#$Id: indexer.py,v 1.2 2002-05-25 07:16:24 rochecompaan Exp $

__shell_usage__ = """
Shell Usage: [python] indexer.py [options] [search_words]

    -h, /h, -?, /?, ?, --help:    Show this help screen
    -index:                       Add files to index
    -reindex:                     Refresh files already in the index
                                  (can take much more time)
    -casesensitive:               Maintain the case of indexed words
                                  (can lead to MUCH larger indices)
    -norecurse, -local:           Only index starting dir, not subdirs
    -dir=<directory>:             Starting directory for indexing
                                  (default is current directory)
    -indexdb=<database>:          Use specified index database
                                  (environ variable INDEXER_DB is preferred)
    -regex=<pattern>:             Index files matching regular expression
    -glob=<pattern>:              Index files matching glob pattern
    -filter=<pattern>:            Only display results matching pattern
    -output=<opt>, -format=<opt>: How much detail on matches?
    -<digit>:                     Quiet level (0=verbose ... 9=quiet)

Output/format options are ALL/EVERYTHING/VERBOSE, RATINGS/SCORES,
FILENAMES/NAMES/FILES, SUMMARY/REPORT"""

__version__ = "$Revision: 1.2 $"
__author__=["David Mertz (mertz@gnosis.cx)",]
__thanks_to__=["Pat Knight (p.knight@ktgroup.co.uk)",
               "Gregory Popovitch (greg@gpy.com)", ]
__copyright__="""
    This file is released to the public domain. I (dqm) would
    appreciate it if you choose to keep derived works under terms
    that promote freedom, but obviously am giving up any rights
    to compel such.
"""

__history__="""
    0.1    Initial version.

    0.11   Tweaked TextSplitter after some random experimentation.

    0.12   Added SlicedZPickleIndexer (best choice, so far).

    0.13   Pat Knight pointed out need for binary open()'s of
           certain files under Windows.

    0.14   Added '-filter' switch to search results.

    0.15   Added direct read of gzip files.

    0.20   Gregory Popovitch did some profiling on TextSplitter,
           and provided both huge speedups to the Python version
           and hooks to a C extension class (ZopeTextSplitter).
           A little refactoring by him and me (dqm) has nearly
           doubled the speed of indexing.

    0.30   Module refactored into gnosis package. This is a
           first pass, and various documentation and test cases
           should be added later.
"""
import string, re, os, fnmatch, sys, copy, gzip
from types import *

#-- Silly "do nothing" default recursive file processor
def echo_fname(fname): print fname

#-- "Recurse and process files" utility function
def recurse_files(curdir, pattern, exclusions, func=echo_fname, *args, **kw):
    "Recursively process file pattern"
    subdirs, files = [],[]
    level = kw.get('level',0)

    for name in os.listdir(curdir):
        fname = os.path.join(curdir, name)
        if name[-4:] in exclusions:
            pass # do not include binary file type
        elif os.path.isdir(fname) and not os.path.islink(fname):
            subdirs.append(fname)
        # kludge to detect a regular expression across python versions
        elif sys.version[0]=='1' and isinstance(pattern, re.RegexObject):
            if pattern.match(name):
                files.append(fname)
        elif sys.version[0]=='2' and type(pattern)==type(re.compile('')):
            if pattern.match(name):
                files.append(fname)
        elif type(pattern) is StringType:
            if fnmatch.fnmatch(name, pattern):
                files.append(fname)

    for fname in files:
        apply(func, (fname,)+args)
    for subdir in subdirs:
        # pass the extra positional args down so func gets them at every level
        apply(recurse_files, (subdir, pattern, exclusions, func)+args,
              {'level': level+1})
51c425129b35 Merged search_indexing-branch with HEAD
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff changeset
194
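The tail of `recurse_files` above accepts either a compiled regular expression or a glob string as `pattern`, then processes the matching files before descending into subdirectories. A minimal sketch of the same dispatch in current Python 3 follows. The names `name_matches` and `walk_matching` are hypothetical and do not appear in the module, and `os.walk` is swapped in for the hand-written recursion.

```python
import fnmatch
import os
import re

# Same trick as the original: take the type of a compiled pattern
# rather than relying on a version-specific class name.
_REGEX_TYPE = type(re.compile(""))


def name_matches(name, pattern):
    """Return True if name matches a compiled regex or a glob string."""
    if isinstance(pattern, _REGEX_TYPE):
        return pattern.match(name) is not None
    if isinstance(pattern, str):
        return fnmatch.fnmatch(name, pattern)
    return False


def walk_matching(top, pattern, exclusions=()):
    """Yield paths under top whose base name matches pattern.

    Names whose last four characters are in exclusions are skipped,
    mirroring the name[-4:] check used by the original function.
    """
    for dirpath, dirnames, filenames in os.walk(top):
        for name in filenames:
            if name[-4:] in exclusions:
                continue
            if name_matches(name, pattern):
                yield os.path.join(dirpath, name)
```

Like the original, this does not follow symbolic links to directories, since `os.walk` leaves `followlinks` off by default.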
#-- Data bundle for index dictionaries
class Index:
    def __init__(self, words, files, fileids):
        if words is not None: self.WORDS = words
        if files is not None: self.FILES = files
        if fileids is not None: self.FILEIDS = fileids

#-- "Split plain text into words" utility function
class TextSplitter:
    def initSplitter(self):
        prenum = string.join(map(chr, range(0,48)), '')
        num2cap = string.join(map(chr, range(58,65)), '')
        cap2low = string.join(map(chr, range(91,97)), '')
        postlow = string.join(map(chr, range(123,256)), '')
        nonword = prenum + num2cap + cap2low + postlow
        self.word_only = string.maketrans(nonword, " "*len(nonword))
        self.nondigits = string.join(map(chr, range(0,48)) + map(chr, range(58,255)), '')
        self.alpha = string.join(map(chr, range(65,91)) + map(chr, range(97,123)), '')
        self.ident = string.join(map(chr, range(256)), '')
        self.init = 1

    def splitter(self, text, ftype):
        "Split the contents of a text string into a list of 'words'"
        if ftype == 'text/plain':
            words = self.text_splitter(text, self.casesensitive)
        else:
            raise NotImplementedError
        return words

    def text_splitter(self, text, casesensitive=0):
        """Split text/plain string into a list of words

        In version 0.20 this function is still fairly weak at
        identifying "real" words, and excluding gibberish
        strings. As long as the indexer looks at "real" text
        files, it does pretty well; but if indexing of binary
        data is attempted, a lot of gibberish gets indexed.
        Suggestions on improving this are GREATLY APPRECIATED.
        """
        # Initialize some constants
        if not hasattr(self,'init'): self.initSplitter()

        # Speedup trick: attributes into local scope
        word_only = self.word_only
        ident = self.ident
        alpha = self.alpha
        nondigits = self.nondigits
        translate = string.translate

        # Let's adjust case if not case-sensitive
        if not casesensitive: text = string.upper(text)

        # Split the raw text
        allwords = string.split(text)

        # Finally, let's skip some words not worth indexing
        words = []
        for word in allwords:
            if len(word) > 25: continue # too long (probably gibberish)

            # Identify common patterns in non-word data (binary, UU/MIME, etc)
            num_nonalpha = len(word.translate(ident, alpha))
            numdigits = len(word.translate(ident, nondigits))
            # 1.52: num_nonalpha = len(translate(word, ident, alpha))
            # 1.52: numdigits = len(translate(word, ident, nondigits))
            if numdigits > len(word)-2: # almost all digits
                if numdigits > 5: # too many digits is gibberish
                    continue # a moderate number is year/zipcode/etc
            elif num_nonalpha*3 > len(word): # too much scattered nonalpha = gibberish
                continue

            word = word.translate(word_only) # Let's strip funny byte values
            # 1.52: word = translate(word, word_only)
            subwords = word.split() # maybe embedded non-alphanumeric
            # 1.52: subwords = string.split(word)
            for subword in subwords: # ...so we might have subwords
                if len(subword) <= 2: continue # too short a subword
                words.append(subword)
        return words

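The filtering rules in `text_splitter` above can be hard to follow through the `translate` calls, so here is a rough Python 3 restatement of the same heuristic. It is only a sketch: the function names `worth_indexing` and `split_words` are hypothetical, it assumes ASCII input, and it counts characters directly instead of using translation tables.

```python
import re
import string

DIGITS = set(string.digits)
LETTERS = set(string.ascii_letters)


def worth_indexing(word):
    """Apply the length, digit and non-alpha checks to one token."""
    if len(word) > 25:
        return False                      # too long, probably gibberish
    num_digits = sum(c in DIGITS for c in word)
    num_nonalpha = sum(c not in LETTERS for c in word)
    if num_digits > len(word) - 2:        # almost all digits
        return num_digits <= 5            # keep short numbers (year, zip code)
    return num_nonalpha * 3 <= len(word)  # drop scattered non-letters


def split_words(text, casesensitive=False):
    """Split text into index-worthy words, upper-casing by default."""
    if not casesensitive:
        text = text.upper()
    words = []
    for word in text.split():
        if not worth_indexing(word):
            continue
        # Anything outside 0-9, A-Z, a-z separates subwords.
        for subword in re.split(r"[^0-9A-Za-z]+", word):
            if len(subword) > 2:          # skip very short subwords
                words.append(subword)
    return words
```

For example, `split_words("Roundup issue-tracker 2002 1234567")` keeps `ROUNDUP`, `ISSUE`, `TRACKER` and `2002`, and drops the seven-digit number.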
class ZopeTextSplitter:
    def initSplitter(self):
        import Splitter
        stop_words=(
            'am', 'ii', 'iii', 'per', 'po', 're', 'a', 'about', 'above', 'across',
            'after', 'afterwards', 'again', 'against', 'all', 'almost', 'alone',
            'along', 'already', 'also', 'although', 'always', 'am', 'among',
            'amongst', 'amoungst', 'amount', 'an', 'and', 'another', 'any',
            'anyhow', 'anyone', 'anything', 'anyway', 'anywhere', 'are', 'around',
            'as', 'at', 'back', 'be', 'became', 'because', 'become', 'becomes',
            'becoming', 'been', 'before', 'beforehand', 'behind', 'being',
            'below', 'beside', 'besides', 'between', 'beyond', 'bill', 'both',
            'bottom', 'but', 'by', 'can', 'cannot', 'cant', 'con', 'could',
            'couldnt', 'cry', 'describe', 'detail', 'do', 'done', 'down', 'due',
            'during', 'each', 'eg', 'eight', 'either', 'eleven', 'else',
            'elsewhere', 'empty', 'enough', 'even', 'ever', 'every', 'everyone',
            'everything', 'everywhere', 'except', 'few', 'fifteen', 'fifty',
            'fill', 'find', 'fire', 'first', 'five', 'for', 'former', 'formerly',
            'forty', 'found', 'four', 'from', 'front', 'full', 'further', 'get',
            'give', 'go', 'had', 'has', 'hasnt', 'have', 'he', 'hence', 'her',
            'here', 'hereafter', 'hereby', 'herein', 'hereupon', 'hers',
            'herself', 'him', 'himself', 'his', 'how', 'however', 'hundred', 'i',
            'ie', 'if', 'in', 'inc', 'indeed', 'interest', 'into', 'is', 'it',
            'its', 'itself', 'keep', 'last', 'latter', 'latterly', 'least',
            'less', 'made', 'many', 'may', 'me', 'meanwhile', 'might', 'mill',
            'mine', 'more', 'moreover', 'most', 'mostly', 'move', 'much', 'must',
            'my', 'myself', 'name', 'namely', 'neither', 'never', 'nevertheless',
            'next', 'nine', 'no', 'nobody', 'none', 'noone', 'nor', 'not',
            'nothing', 'now', 'nowhere', 'of', 'off', 'often', 'on', 'once',
            'one', 'only', 'onto', 'or', 'other', 'others', 'otherwise', 'our',
            'ours', 'ourselves', 'out', 'over', 'own', 'per', 'perhaps',
            'please', 'pre', 'put', 'rather', 're', 'same', 'see', 'seem',
            'seemed', 'seeming', 'seems', 'serious', 'several', 'she', 'should',
            'show', 'side', 'since', 'sincere', 'six', 'sixty', 'so', 'some',
            'somehow', 'someone', 'something', 'sometime', 'sometimes',
            'somewhere', 'still', 'such', 'take', 'ten', 'than', 'that', 'the',
            'their', 'them', 'themselves', 'then', 'thence', 'there',
            'thereafter', 'thereby', 'therefore', 'therein', 'thereupon', 'these',
            'they', 'thick', 'thin', 'third', 'this', 'those', 'though', 'three',
            'through', 'throughout', 'thru', 'thus', 'to', 'together', 'too',
            'toward', 'towards', 'twelve', 'twenty', 'two', 'un', 'under',
            'until', 'up', 'upon', 'us', 'very', 'via', 'was', 'we', 'well',
            'were', 'what', 'whatever', 'when', 'whence', 'whenever', 'where',
            'whereafter', 'whereas', 'whereby', 'wherein', 'whereupon',
            'wherever', 'whether', 'which', 'while', 'whither', 'who', 'whoever',
            'whole', 'whom', 'whose', 'why', 'will', 'with', 'within', 'without',
            'would', 'yet', 'you', 'your', 'yours', 'yourself', 'yourselves',
            )
        self.stop_word_dict={}
        for word in stop_words: self.stop_word_dict[word]=None
        self.splitterobj = Splitter.getSplitter()
        self.init = 1

    def goodword(self, word):
        return len(word) < 25

    def splitter(self, text, ftype):
        """never case-sensitive"""
        if not hasattr(self,'init'): self.initSplitter()
        return filter(self.goodword, self.splitterobj(text, self.stop_word_dict))


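`ZopeTextSplitter` above delegates tokenizing to Zope's `Splitter` module, passing it the stop-word dictionary, and then drops words of 25 characters or more through `goodword`. Where that module is not available, the general idea can be approximated in plain Python 3. This sketch is a stand-in and not Zope's algorithm: the lower-casing, the `[a-z0-9]+` tokenization rule, the name `split_without_stop_words` and the abbreviated `STOP_WORDS` sample are all assumptions made for illustration.

```python
import re

# A small sample only; the class above carries the full stop-word list.
STOP_WORDS = frozenset(["a", "about", "and", "the", "with"])


def split_without_stop_words(text, stop_words=STOP_WORDS, max_len=25):
    """Lower-case text, tokenize it, and drop stop words and long tokens."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in stop_words and len(t) < max_len]
```

Calling `split_without_stop_words("The indexer works with a stop-word list")` returns `['indexer', 'works', 'stop', 'word', 'list']`.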
#-- "Abstract" parent class for inherited indexers
#   (does not handle storage in parent, other methods are primitive)

class GenericIndexer:
    def __init__(self, **kw):
        apply(self.configure, (), kw)

    def whoami(self):
        return self.__class__.__name__

    def configure(self, REINDEX=0, CASESENSITIVE=0,
                        INDEXDB=os.environ.get('INDEXER_DB', 'TEMP_NDX.DB'),
                        ADD_PATTERN='*', QUIET=5):
        "Configure settings used by indexing and storage/retrieval"
        self.indexdb = INDEXDB
        self.reindex = REINDEX
        self.casesensitive = CASESENSITIVE
        self.add_pattern = ADD_PATTERN
        self.quiet = QUIET
        self.filter = None

    def add_files(self, dir=os.getcwd(), pattern=None, descend=1):
        self.load_index()
        exclusions = ('.zip','.pyc','.gif','.jpg','.dat','.dir')
        if not pattern:
            pattern = self.add_pattern
        recurse_files(dir, pattern, exclusions, self.add_file)
        # Rebuild the fileid index
        self.fileids = {}
        for fname in self.files.keys():
            fileid = self.files[fname][0]
            self.fileids[fileid] = fname

    def add_file(self, fname, ftype='text/plain'):
        "Index the contents of a regular file"
        if self.files.has_key(fname):   # Is file eligible for (re)indexing?
            if self.reindex:            # Reindexing enabled, cleanup dicts
                self.purge_entry(fname, self.files, self.words)
            else:                       # DO NOT reindex this file
                if self.quiet < 5: print "Skipping", fname
                return 0

        # Read in the file (if possible)
        try:
            if fname[-3:] == '.gz':
                text = gzip.open(fname).read()
            else:
                text = open(fname).read()
            if self.quiet < 5: print "Indexing", fname
        except IOError:
            return 0
        words = self.splitter(text, ftype)

        # Find new file index, and assign it to filename
        # (_TOP uses trick of negative to avoid conflict with file index)
        self.files['_TOP'] = (self.files['_TOP'][0]-1, None)
        file_index = abs(self.files['_TOP'][0])
        self.files[fname] = (file_index, len(words))

        filedict = {}
        for word in words:
            if filedict.has_key(word):
                filedict[word] = filedict[word]+1
            else:
                filedict[word] = 1

        for word in filedict.keys():
            if self.words.has_key(word):
                entry = self.words[word]
            else:
                entry = {}
            entry[file_index] = filedict[word]
            self.words[word] = entry

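The bookkeeping in add_file can be hard to follow inline, so here is a minimal sketch of the same index update in modern Python 3. It is an illustration only, not part of the module: the function name add_document and the pre-tokenized input are assumptions, and collections.Counter stands in for the hand-rolled filedict loop.

```python
from collections import Counter

def add_document(files, words, name, tokens):
    """Add one tokenized document to an in-memory inverted index.

    files maps name -> (file_index, word_count) and also holds the
    '_TOP' counter; words maps word -> {file_index: occurrences}.
    (Hypothetical helper, mirroring the dict layout used by add_file.)
    """
    # '_TOP' counts downward, so its stored value never collides
    # with a real (positive) file index.
    top = files['_TOP'][0] - 1
    files['_TOP'] = (top, None)
    file_index = abs(top)
    files[name] = (file_index, len(tokens))
    # Count each word once per document, then merge into the word index.
    for word, count in Counter(tokens).items():
        words.setdefault(word, {})[file_index] = count
    return file_index

files = {'_TOP': (0, None)}
words = {}
add_document(files, words, 'a.txt', ['SPAM', 'EGGS', 'SPAM'])
add_document(files, words, 'b.txt', ['EGGS'])
print(files)
print(words)
```

Each word ends up with one small dict of per-file counts, which is what find and print_report consume later.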
    def add_othertext(self, identifier):
        """Index a textual source other than a plain file

        A child class might want to implement this method (or a similar one)
        in order to index textual sources such as SQL tables, URLs, clay
        tablets, or whatever else.  The identifier should uniquely pick out
        the source of the text (whatever it is)
        """
        raise NotImplementedError

    def save_index(self, INDEXDB=None):
        raise NotImplementedError

    def load_index(self, INDEXDB=None, reload=0, wordlist=None):
        raise NotImplementedError

    def find(self, wordlist, print_report=0):
        "Locate files that match ALL the words in wordlist"
        self.load_index(wordlist=wordlist)
        entries = {}
        hits = copy.copy(self.fileids)      # Copy of fileids index
        for word in wordlist:
            if not self.casesensitive:
                word = string.upper(word)
            entry = self.words.get(word)    # For each word, get index
            entries[word] = entry           #   of matching files
            if not entry:                   # Nothing for this one word (fail)
                return 0
            for fileid in hits.keys():      # Eliminate hits for every non-match
                if not entry.has_key(fileid):
                    del hits[fileid]
        if print_report:
            self.print_report(hits, wordlist, entries)
        return hits

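The elimination loop in find amounts to an AND query over the word index. A minimal Python 3 sketch of that idea follows; find_all is a hypothetical standalone function, and unlike find it returns an empty dict (rather than 0) when a word is missing, which is an assumption made here for a uniform return type.

```python
def find_all(words, fileids, wordlist, casesensitive=False):
    """Return {file_index: name} for files containing every word in wordlist."""
    hits = dict(fileids)                 # start with every known file
    for word in wordlist:
        if not casesensitive:
            word = word.upper()          # the index stores upper-cased words
        entry = words.get(word)
        if not entry:                    # one missing word fails the whole query
            return {}
        # Keep only the files that also contain this word.
        hits = {fid: name for fid, name in hits.items() if fid in entry}
    return hits

words = {'SPAM': {1: 2}, 'EGGS': {1: 1, 2: 1}}
fileids = {1: 'a.txt', 2: 'b.txt'}
print(find_all(words, fileids, ['spam', 'eggs']))
print(find_all(words, fileids, ['eggs']))
print(find_all(words, fileids, ['ham']))
```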
    def print_report(self, hits={}, wordlist=[], entries={}):
        # Figure out what to actually print (based on QUIET level)
        output = []
        for fileid,fname in hits.items():
            message = fname
            if self.quiet <= 3:
                wordcount = self.files[fname][1]
                matches = 0
                countmess = '\n'+' '*13+`wordcount`+' words; '
                for word in wordlist:
                    if not self.casesensitive:
                        word = string.upper(word)
                    occurs = entries[word][fileid]
                    matches = matches+occurs
                    countmess = countmess +`occurs`+' '+word+'; '
                message = string.ljust('[RATING: '
                                       +`1000*matches/wordcount`+']',13)+message
                if self.quiet <= 2: message = message +countmess +'\n'
            if self.filter:     # Using an output filter
                if fnmatch.fnmatch(message, self.filter):
                    output.append(message)
            else:
                output.append(message)

        if self.quiet <= 5:
            print string.join(output,'\n')
        sys.stderr.write('\n'+`len(output)`+' files matched wordlist: '+
                         `wordlist`+'\n')
        return output

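The rating printed by print_report is the number of matched-word occurrences per thousand words of the file, using integer division. A small Python 3 sketch of that arithmetic, with hypothetical helper names rating and format_hit (note the explicit // to reproduce the integer division the original gets from / on ints):

```python
def rating(occurrences, wordcount):
    """Occurrences of the query words per 1000 words of the file."""
    matches = sum(occurrences)
    return 1000 * matches // wordcount

def format_hit(fname, occurrences, wordcount):
    """Build the '[RATING: n]' prefix, padded to 13 columns, plus the name."""
    label = '[RATING: %d]' % rating(occurrences, wordcount)
    return label.ljust(13) + fname

# Two query words occurring 3 and 2 times in a 250-word file.
print(rating([3, 2], 250))
print(format_hit('notes.txt', [3, 2], 250))
```

A file with a word count of zero would raise ZeroDivisionError here, as it would in the original expression.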
    def purge_entry(self, fname, file_dct, word_dct):
        "Remove a file from file index and word index"
        try:        # The easy part, cleanup the file index
            file_index = file_dct[fname][0]     # entries are (file_index, wordcount)
            del file_dct[fname]
        except KeyError:
            return  # Unknown file: nothing to purge from the word index
        # The much harder part, cleanup the word index
        for word, occurs in word_dct.items():
            if occurs.has_key(file_index):
                del occurs[file_index]
                word_dct[word] = occurs

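For comparison, a minimal Python 3 sketch of the same purge on plain dicts. The function name purge and its boolean return value are assumptions for illustration; the key point is that word entries are keyed by the integer file index, not by the whole (file_index, wordcount) tuple.

```python
def purge(files, words, name):
    """Drop one file from the file index and from every word entry."""
    entry = files.pop(name, None)
    if entry is None:
        return False                    # unknown file, nothing to clean up
    file_index = entry[0]               # entry is (file_index, wordcount)
    for occurs in words.values():
        occurs.pop(file_index, None)    # no-op for words the file lacked
    return True

files = {'_TOP': (-2, None), 'a.txt': (1, 3), 'b.txt': (2, 1)}
words = {'SPAM': {1: 2}, 'EGGS': {1: 1, 2: 1}}
print(purge(files, words, 'a.txt'))
print(words)
```

Like purge_entry, this leaves words with an empty occurrence dict in place, and it still has to visit every word in the index.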
    def index_loaded(self):
        return ( hasattr(self,'fileids') and
                 hasattr(self,'files')   and
                 hasattr(self,'words')   )

#-- Provide an actual storage facility for the indexes (i.e. shelve)
class ShelveIndexer(GenericIndexer, TextSplitter):
    """Concrete Indexer utilizing [shelve] for storage

    Unfortunately, [shelve] proves far too slow in indexing, while
    creating monstrously large indexes.  Not recommended, at least under
    the default dbms tested.  Also, the class may be broken because
    shelves do not, apparently, support the .values() and .items()
    methods.  Fixing this is a low priority, but the sample code is
    left here.
    """
    def load_index(self, INDEXDB=None, reload=0, wordlist=None):
        INDEXDB = INDEXDB or self.indexdb
        import shelve
        self.words = shelve.open(INDEXDB+".WORDS")
        self.files = shelve.open(INDEXDB+".FILES")
        self.fileids = shelve.open(INDEXDB+".FILEIDS")
        if not self.files.has_key('_TOP'):  # New index
            self.files['_TOP'] = (0,None)

    def save_index(self, INDEXDB=None):
        INDEXDB = INDEXDB or self.indexdb
        pass

class FlatIndexer(GenericIndexer, TextSplitter):
    """Concrete Indexer utilizing flat-file for storage

    See the comments in the referenced article for details; in
    brief, this indexer has about the same timing as the best in
    -creating- indexes and the storage requirements are
    reasonable.  However, actually -using- a flat-file index is
    more than an order of magnitude worse than the best indexer
    (ZPickleIndexer wins overall).

    On the other hand, FlatIndexer creates a wonderfully easy to
    parse database format if you have a reason to transport the
    index to a different platform or programming language.  And
    should you perform indexing as part of a long-running
    process, the overhead of initial file parsing becomes
    irrelevant.
    """
    def load_index(self, INDEXDB=None, reload=0, wordlist=None):
        # Unless reload is indicated, do not load twice
        if self.index_loaded() and not reload: return 0
        # Ok, now let's actually load it
        INDEXDB = INDEXDB or self.indexdb
        self.words = {}
        self.files = {'_TOP':(0,None)}
        self.fileids = {}
        try:                            # Read index contents
            for line in open(INDEXDB).readlines():
                fields = string.split(line)
                if fields[0] == '-':    # Read a file/fileid line
                    fileid = eval(fields[2])
                    wordcount = eval(fields[3])
                    fname = fields[1]
                    self.files[fname] = (fileid, wordcount)
                    self.fileids[fileid] = fname
                else:                   # Read a word entry (dict of hits)
                    entries = {}
                    word = fields[0]
                    for n in range(1,len(fields),2):
                        fileid = eval(fields[n])
                        occurs = eval(fields[n+1])
                        entries[fileid] = occurs
                    self.words[word] = entries
        except:
            pass                        # New index

    def save_index(self, INDEXDB=None):
        INDEXDB = INDEXDB or self.indexdb
        tab, lf, sp = '\t','\n',' '
        indexdb = open(INDEXDB,'w')
        for fname,entry in self.files.items():
            indexdb.write('- '+fname +tab +`entry[0]` +tab +`entry[1]` +lf)
        for word,entry in self.words.items():
            indexdb.write(word +tab+tab)
            for fileid,occurs in entry.items():
                indexdb.write(`fileid` +sp +`occurs` +sp)
            indexdb.write(lf)
        indexdb.close()                 # Flush the index to disk

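The flat-file layout written by save_index and read back by load_index is simple enough to round-trip in a few lines. The sketch below is a Python 3 illustration under stated assumptions: dump_index and parse_index are hypothetical names, the index is serialized to a string instead of a file, and ast.literal_eval replaces eval for decoding the numeric and None fields.

```python
import ast

def dump_index(files, words):
    """Serialize to the flat format: '- name<TAB>id<TAB>count' lines for
    files, then 'word<TAB><TAB>id occurs id occurs ...' lines for words."""
    lines = []
    for name, (file_index, wordcount) in files.items():
        lines.append('- %s\t%r\t%r' % (name, file_index, wordcount))
    for word, entry in words.items():
        pairs = ' '.join('%r %r' % item for item in entry.items())
        lines.append('%s\t\t%s' % (word, pairs))
    return '\n'.join(lines) + '\n'

def parse_index(text):
    """Rebuild (files, fileids, words) from the flat format."""
    files, fileids, words = {}, {}, {}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == '-':            # file line: name, index, word count
            name = fields[1]
            file_index = ast.literal_eval(fields[2])
            wordcount = ast.literal_eval(fields[3])   # 'None' for '_TOP'
            files[name] = (file_index, wordcount)
            fileids[file_index] = name
        else:                           # word line: alternating id/occurs
            values = [ast.literal_eval(f) for f in fields[1:]]
            words[fields[0]] = dict(zip(values[0::2], values[1::2]))
    return files, fileids, words

files = {'_TOP': (-1, None), 'a.txt': (1, 3)}
words = {'SPAM': {1: 2}, 'EGGS': {1: 1}}
text = dump_index(files, words)
loaded_files, loaded_fileids, loaded_words = parse_index(text)
print(text)
```

Because fields are split on whitespace, a file name containing spaces, or an indexed word that is exactly "-", would not survive the round trip; the same limitation applies to the format as written above.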
class PickleIndexer(GenericIndexer, TextSplitter):
    def load_index(self, INDEXDB=None, reload=0, wordlist=None):
        # Unless reload is indicated, do not load twice
        if self.index_loaded() and not reload: return 0
        # Ok, now let's actually load it
        import cPickle
        INDEXDB = INDEXDB or self.indexdb
        try:
            pickle_str = open(INDEXDB,'rb').read()
            db = cPickle.loads(pickle_str)
        except:     # New index
            db = Index({}, {'_TOP':(0,None)}, {})
        self.words, self.files, self.fileids = db.WORDS, db.FILES, db.FILEIDS
51c425129b35 Merged search_indexing-branch with HEAD
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff changeset
588
51c425129b35 Merged search_indexing-branch with HEAD
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff changeset
589 def save_index(self, INDEXDB=None):
51c425129b35 Merged search_indexing-branch with HEAD
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff changeset
590 import cPickle
51c425129b35 Merged search_indexing-branch with HEAD
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff changeset
591 INDEXDB = INDEXDB or self.indexdb
51c425129b35 Merged search_indexing-branch with HEAD
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff changeset
592 db = Index(self.words, self.files, self.fileids)
51c425129b35 Merged search_indexing-branch with HEAD
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff changeset
593 open(INDEXDB,'wb').write(cPickle.dumps(db, 1))
51c425129b35 Merged search_indexing-branch with HEAD
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff changeset
594
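The load-or-create pattern used by PickleIndexer can be tried outside the module. The following is a minimal sketch in Python 3, not the module's own code: it stores a plain dict instead of the module's Index object so that it stays self-contained, it catches a narrower set of exceptions than the bare except above, and the function names, file name and sample entries are hypothetical.

```python
import os
import pickle
import tempfile

def save_index(path, words, files, fileids):
    """Write the three index tables to one pickle file (protocol 1, as above)."""
    db = {'WORDS': words, 'FILES': files, 'FILEIDS': fileids}
    with open(path, 'wb') as fh:
        pickle.dump(db, fh, protocol=1)

def load_index(path):
    """Return the stored tables, or a fresh empty index if none can be read."""
    try:
        with open(path, 'rb') as fh:
            return pickle.load(fh)
    except (OSError, EOFError, pickle.UnpicklingError):
        # Same starting state the indexers above use for a new index
        return {'WORDS': {}, 'FILES': {'_TOP': (0, None)}, 'FILEIDS': {}}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'index.db')       # hypothetical file name
    fresh = load_index(path)                   # no file yet: empty index
    save_index(path,
               {'python': {1: 3}},             # word -> {fileid: occurrences}
               {'_TOP': (1, None), 'a.txt': (1, 42)},
               {1: 'a.txt'})
    reloaded = load_index(path)
```

Falling back to an empty index keeps a first run and a later run on the same code path, at the cost of silently hiding an unreadable index file.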
class XMLPickleIndexer(PickleIndexer):
    """Concrete Indexer utilizing XML for storage

    While this is, as expected, a verbose format, the possibility
    of using XML as a transport format for indexes might be
    useful. However, [xml_pickle] is in need of some redesign to
    avoid gross inefficiency when creating very large
    (multi-megabyte) output files (fixed in [xml_pickle] version
    0.48 or above)
    """
    def load_index(self, INDEXDB=None, reload=0, wordlist=None):
        # Unless reload is indicated, do not load twice
        if self.index_loaded() and not reload: return 0
        # Ok, now let's actually load it
        from gnosis.xml.pickle import XML_Pickler
        INDEXDB = INDEXDB or self.indexdb
        try: # XML file exists
            xml_str = open(INDEXDB).read()
            db = XML_Pickler().loads(xml_str)
        except: # New index
            db = Index({}, {'_TOP':(0,None)}, {})
        self.words, self.files, self.fileids = db.WORDS, db.FILES, db.FILEIDS

    def save_index(self, INDEXDB=None):
        from gnosis.xml.pickle import XML_Pickler
        INDEXDB = INDEXDB or self.indexdb
        db = Index(self.words, self.files, self.fileids)
        open(INDEXDB,'w').write(XML_Pickler(db).dumps())

class ZPickleIndexer(PickleIndexer):
    def load_index(self, INDEXDB=None, reload=0, wordlist=None):
        # Unless reload is indicated, do not load twice
        if self.index_loaded() and not reload: return 0
        # Ok, now let's actually load it
        import cPickle, zlib
        INDEXDB = INDEXDB or self.indexdb
        try:
            pickle_str = zlib.decompress(open(INDEXDB+'!','rb').read())
            db = cPickle.loads(pickle_str)
        except: # New index
            db = Index({}, {'_TOP':(0,None)}, {})
        self.words, self.files, self.fileids = db.WORDS, db.FILES, db.FILEIDS

    def save_index(self, INDEXDB=None):
        import cPickle, zlib
        INDEXDB = INDEXDB or self.indexdb
        db = Index(self.words, self.files, self.fileids)
        pickle_fh = open(INDEXDB+'!','wb')
        pickle_fh.write(zlib.compress(cPickle.dumps(db, 1)))


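ZPickleIndexer differs from PickleIndexer only in passing the pickle string through zlib. A minimal in-memory sketch of that round trip in Python 3 follows; the helper names pack and unpack and the sample word table are hypothetical and do not appear in the module.

```python
import pickle
import zlib

def pack(obj):
    """Pickle an object (protocol 1, as above) and compress the result."""
    return zlib.compress(pickle.dumps(obj, protocol=1))

def unpack(blob):
    """Reverse of pack(): decompress, then unpickle."""
    return pickle.loads(zlib.decompress(blob))

# Hypothetical word table: word -> {fileid: occurrences}
words = {'word%d' % i: {1: i} for i in range(1000)}

raw_size = len(pickle.dumps(words, protocol=1))
packed = pack(words)
restored = unpack(packed)
```

Word tables are repetitive, so the compressed form is typically much smaller than the raw pickle; the price is that the whole blob must be decompressed before any entry can be read.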
class SlicedZPickleIndexer(ZPickleIndexer):
    segments = "ABCDEFGHIJKLMNOPQRSTUVWXYZ#-!"
    def load_index(self, INDEXDB=None, reload=0, wordlist=None):
        # Unless reload is indicated, do not load twice
        if self.index_loaded() and not reload: return 0
        # Ok, now let's actually load it
        import cPickle, zlib
        INDEXDB = INDEXDB or self.indexdb
        db = Index({}, {'_TOP':(0,None)}, {})
        # Identify the relevant word-dictionary segments
        if not wordlist:
            segments = self.segments
        else:
            segments = ['-','#']
            for word in wordlist:
                segments.append(string.upper(word[0]))
        # Load the segments
        for segment in segments:
            try:
                pickle_str = zlib.decompress(open(INDEXDB+segment,'rb').read())
                dbslice = cPickle.loads(pickle_str)
                if dbslice.__dict__.get('WORDS'): # If it has some words, add them
                    for word,entry in dbslice.WORDS.items():
                        db.WORDS[word] = entry
                if dbslice.__dict__.get('FILES'): # If it has some files, add them
                    db.FILES = dbslice.FILES
                if dbslice.__dict__.get('FILEIDS'): # If it has fileids, add them
                    db.FILEIDS = dbslice.FILEIDS
            except:
                pass # No biggie, couldn't find this segment
        self.words, self.files, self.fileids = db.WORDS, db.FILES, db.FILEIDS

    def julienne(self, INDEXDB=None):
        import cPickle, zlib
        INDEXDB = INDEXDB or self.indexdb
        segments = self.segments # all the (little) indexes
        for segment in segments:
            try: # brutal space saver... delete all the small segments
                os.remove(INDEXDB+segment)
            except OSError:
                pass # probably just nonexistent segment index file
        # First write the much simpler filename/fileid dictionaries
        dbfil = Index(None, self.files, self.fileids)
        open(INDEXDB+'-','wb').write(zlib.compress(cPickle.dumps(dbfil,1)))
        # The hard part is splitting the word dictionary up, of course
        letters = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
        segdicts = {} # Need batch of empty dicts
        for segment in letters+'#':
            segdicts[segment] = {}
        for word, entry in self.words.items(): # Split into segment dicts
            initchar = string.upper(word[0])
            if initchar in letters:
                segdicts[initchar][word] = entry
            else:
                segdicts['#'][word] = entry
        for initchar in letters+'#':
            db = Index(segdicts[initchar], None, None)
            pickle_str = cPickle.dumps(db, 1)
            filename = INDEXDB+initchar
            pickle_fh = open(filename,'wb')
            pickle_fh.write(zlib.compress(pickle_str))
            os.chmod(filename,0664)

    save_index = julienne

PreferredIndexer = SlicedZPickleIndexer

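The slicing done by julienne, and the matching segment selection in SlicedZPickleIndexer.load_index, can be looked at without any file I/O. The sketch below is a Python 3 illustration under stated assumptions: slice_words and segments_for are hypothetical helper names, words are assumed to be non-empty strings, and unlike the method above segments_for skips duplicate initials.

```python
import string

LETTERS = string.ascii_uppercase

def slice_words(words):
    """Split a word dictionary into one dictionary per initial letter.

    Words that do not start with A-Z go into the '#' segment.
    """
    segdicts = {seg: {} for seg in LETTERS + '#'}
    for word, entry in words.items():
        initchar = word[0].upper()
        seg = initchar if initchar in LETTERS else '#'
        segdicts[seg][word] = entry
    return segdicts

def segments_for(wordlist):
    """Segments a search has to load: '-' (file tables), '#', and one per initial."""
    segments = ['-', '#']
    for word in wordlist:
        seg = word[0].upper()
        if seg not in segments:
            segments.append(seg)
    return segments

# Hypothetical sample data: word -> {fileid: occurrences}
sliced = slice_words({'apple': {1: 2}, 'Avocado': {2: 1}, '42nd': {1: 1}})
needed = segments_for(['python', 'perl', 'zope'])
```

Because a search only needs the segments for the initials of its search words, most of the word dictionary never has to be decompressed or unpickled.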
51c425129b35 Merged search_indexing-branch with HEAD
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff changeset
713 #-- If called from command-line, parse arguments and take actions
51c425129b35 Merged search_indexing-branch with HEAD
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff changeset
714 if __name__ == '__main__':
51c425129b35 Merged search_indexing-branch with HEAD
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff changeset
715 import time
51c425129b35 Merged search_indexing-branch with HEAD
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff changeset
716 start = time.time()
51c425129b35 Merged search_indexing-branch with HEAD
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff changeset
717 search_words = [] # Word search list (if specified)
51c425129b35 Merged search_indexing-branch with HEAD
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff changeset
718 opts = 0 # Any options specified?
51c425129b35 Merged search_indexing-branch with HEAD
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff changeset
719 if len(sys.argv) < 2:
51c425129b35 Merged search_indexing-branch with HEAD
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff changeset
720 pass # No options given
51c425129b35 Merged search_indexing-branch with HEAD
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff changeset
721 else:
51c425129b35 Merged search_indexing-branch with HEAD
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff changeset
722 upper = string.upper
51c425129b35 Merged search_indexing-branch with HEAD
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff changeset
723 dir = os.getcwd() # Default to indexing from current directory
51c425129b35 Merged search_indexing-branch with HEAD
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff changeset
724 descend = 1 # Default to recursive indexing
51c425129b35 Merged search_indexing-branch with HEAD
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff changeset
725 ndx = PreferredIndexer()
51c425129b35 Merged search_indexing-branch with HEAD
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff changeset
726 for opt in sys.argv[1:]:
51c425129b35 Merged search_indexing-branch with HEAD
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff changeset
727 if opt in ('-h','/h','-?','/?','?','--help'): # help screen
51c425129b35 Merged search_indexing-branch with HEAD
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff changeset
728 print __shell_usage__
51c425129b35 Merged search_indexing-branch with HEAD
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff changeset
729 opts = -1
51c425129b35 Merged search_indexing-branch with HEAD
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff changeset
730 break
51c425129b35 Merged search_indexing-branch with HEAD
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff changeset
731 elif opt[0] in '/-': # a switch!
51c425129b35 Merged search_indexing-branch with HEAD
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff changeset
732 opts = opts+1
51c425129b35 Merged search_indexing-branch with HEAD
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff changeset
733 if upper(opt[1:]) == 'INDEX': # Index files
51c425129b35 Merged search_indexing-branch with HEAD
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff changeset
734 ndx.quiet = 0
51c425129b35 Merged search_indexing-branch with HEAD
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff changeset
735 pass # Use defaults if no other options
51c425129b35 Merged search_indexing-branch with HEAD
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff changeset
736 elif upper(opt[1:]) == 'REINDEX': # Reindex
51c425129b35 Merged search_indexing-branch with HEAD
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff changeset
737 ndx.reindex = 1
51c425129b35 Merged search_indexing-branch with HEAD
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff changeset
738 elif upper(opt[1:]) == 'CASESENSITIVE': # Case sensitive
51c425129b35 Merged search_indexing-branch with HEAD
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff changeset
739 ndx.casesensitive = 1
51c425129b35 Merged search_indexing-branch with HEAD
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff changeset
740 elif upper(opt[1:]) in ('NORECURSE','LOCAL'): # No recursion
51c425129b35 Merged search_indexing-branch with HEAD
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff changeset
741 descend = 0
51c425129b35 Merged search_indexing-branch with HEAD
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff changeset
742 elif upper(opt[1:4]) == 'DIR': # Dir to index
51c425129b35 Merged search_indexing-branch with HEAD
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff changeset
743 dir = opt[5:]
51c425129b35 Merged search_indexing-branch with HEAD
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff changeset
744 elif upper(opt[1:8]) == 'INDEXDB': # Index specified
51c425129b35 Merged search_indexing-branch with HEAD
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff changeset
745 ndx.indexdb = opt[9:]
51c425129b35 Merged search_indexing-branch with HEAD
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff changeset
746 sys.stderr.write(
51c425129b35 Merged search_indexing-branch with HEAD
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff changeset
747 "Use of INDEXER_DB environment variable is STRONGLY recommended.\n")
51c425129b35 Merged search_indexing-branch with HEAD
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff changeset
748 elif upper(opt[1:6]) == 'REGEX': # RegEx files to index
51c425129b35 Merged search_indexing-branch with HEAD
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff changeset
749 ndx.add_pattern = re.compile(opt[7:])
51c425129b35 Merged search_indexing-branch with HEAD
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff changeset
750 elif upper(opt[1:5]) == 'GLOB': # Glob files to index
51c425129b35 Merged search_indexing-branch with HEAD
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff changeset
751 ndx.add_pattern = opt[6:]
51c425129b35 Merged search_indexing-branch with HEAD
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff changeset
752 elif upper(opt[1:7]) in ('OUTPUT','FORMAT'): # How should results look?
51c425129b35 Merged search_indexing-branch with HEAD
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff changeset
753 opts = opts-1 # this is not an option for indexing purposes
51c425129b35 Merged search_indexing-branch with HEAD
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff changeset
754 level = upper(opt[8:])
51c425129b35 Merged search_indexing-branch with HEAD
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff changeset
755 if level in ('ALL','EVERYTHING','VERBOSE', 'MAX'):
51c425129b35 Merged search_indexing-branch with HEAD
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff changeset
756 ndx.quiet = 0
51c425129b35 Merged search_indexing-branch with HEAD
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff changeset
757 elif level in ('RATINGS','SCORES','HIGH'):
51c425129b35 Merged search_indexing-branch with HEAD
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff changeset
758 ndx.quiet = 3
51c425129b35 Merged search_indexing-branch with HEAD
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff changeset
759 elif level in ('FILENAMES','NAMES','FILES','MID'):
51c425129b35 Merged search_indexing-branch with HEAD
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff changeset
760 ndx.quiet = 5
51c425129b35 Merged search_indexing-branch with HEAD
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff changeset
761 elif level in ('SUMMARY','MIN'):
51c425129b35 Merged search_indexing-branch with HEAD
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff changeset
762 ndx.quiet = 9
51c425129b35 Merged search_indexing-branch with HEAD
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff changeset
763 elif upper(opt[1:7]) == 'FILTER': # Regex filter output
51c425129b35 Merged search_indexing-branch with HEAD
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff changeset
764 opts = opts-1 # this is not an option for indexing purposes
51c425129b35 Merged search_indexing-branch with HEAD
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff changeset
765 ndx.filter = opt[8:]
51c425129b35 Merged search_indexing-branch with HEAD
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff changeset
766 elif opt[1:] in string.digits:
51c425129b35 Merged search_indexing-branch with HEAD
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff changeset
767 opts = opts-1
51c425129b35 Merged search_indexing-branch with HEAD
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff changeset
768 ndx.quiet = eval(opt[1])
769 else:
770 search_words.append(opt) # Search words
771
772 if opts > 0:
773 ndx.add_files(dir=dir)
774 ndx.save_index()
775 if search_words:
776 ndx.find(search_words, print_report=1)
777 if not opts and not search_words:
778 sys.stderr.write("Perhaps you would like to use the --help option?\n")
779 else:
780 sys.stderr.write('Processed in %.3f seconds (%s)'
781 % (time.time()-start, ndx.whoami()))
782
783 #
784 #$Log: not supported by cvs2svn $
785 #Revision 1.1.2.3 2002/04/03 12:05:15 rochecompaan
786 #Removed dos control characters.
787 #
788 #Revision 1.1.2.2 2002/04/03 12:01:55 rochecompaan
789 #Oops. Forgot to include cvs keywords in file.
790 #

Roundup Issue Tracker: http://roundup-tracker.org/