annotate roundup/backends/indexer_xapian.py @ 4944:fe2d72103cc8
Fix two line-break accidents in devel and responsive milestone.item.html
| author | Thomas Arendsen Hein <thomas@intevation.de> |
|---|---|
| date | Tue, 25 Nov 2014 16:04:17 +0100 |
| parents | 3ff1a288fb9c |
| children | 67fad01d2009 |
| rev | changeset | parent | author | summary | source lines |
|---|---|---|---|---|---|
| 3295 | a615cc230160 | | Richard Jones <richard@users.sourceforge.net> | added Xapian indexer; replaces standard indexers if Xapian is available | 1-6, 8-10, 12, 14-32, 41-54, 56-58, 62-67, 73, 82, 84, 86, 89-93, 95-102, 107-109, 111, 113-114 |
| 3544 | 5cd1c83dea50 | 3295 | Richard Jones <richard@users.sourceforge.net> | Features and fixes. | 7, 11, 13, 83 |
| 3547 | 7728ee93efd2 | 3544 | Richard Jones <richard@users.sourceforge.net> | fix reindexing in Xapian | 55, 68-72, 74-76, 78 |
| 3555 | 91c495476db3 | 3547 | Richard Jones <richard@users.sourceforge.net> | pre-release stuff and test fix | 34-40 |
| 3887 | c7363442cdbb | 3555 | Justus Pendleton <jpend@users.sourceforge.net> | change xapian stemmer to use "new" API | 33, 85, 94 |
| 4252 | 2ff6f39aa391 | 3932 | Bernhard Reiter <Bernhard.Reiter@intevation.de> | Indexers behaviour made more consistent regarding length of indexed words... | 79-81, 103-106 |
| 4378 | 477f2a47cbca | 4252 | Bernhard Reiter <Bernhard.Reiter@intevation.de> | - Indexer Xapian, made Xapian 1.2 compatible. | 59-61 |
| 4470 | 21a95ba01a42 | 4378 | Bernhard Reiter <Bernhard.Reiter@intevation.de> | Fix search for xapian 1.2 issue2550676. | 112 |
| 4511 | 931370d96c34 | 4470 | Bernhard Reiter <Bernhard.Reiter@intevation.de> | Xapian indexing improved: | 77, 87-88 |
| 4841 | 3ff1a288fb9c | 4570 | Thomas Arendsen Hein <thomas@intevation.de> | issue2550583, issue2550635 Do not limit results with Xapian indexer | 110 |

''' This implements the full-text indexer using the Xapian indexer.
'''
import re, os

import xapian

from roundup.backends.indexer_common import Indexer as IndexerBase

# TODO: we need to delete documents when a property is *reindexed*

class Indexer(IndexerBase):
    def __init__(self, db):
        IndexerBase.__init__(self, db)
        self.db_path = db.config.DATABASE
        self.reindex = 0
        self.transaction_active = False

    def _get_database(self):
        index = os.path.join(self.db_path, 'text-index')
        return xapian.WritableDatabase(index, xapian.DB_CREATE_OR_OPEN)

    def save_index(self):
        '''Save the changes to the index.'''
        if not self.transaction_active:
            return
        database = self._get_database()
        database.commit_transaction()
        self.transaction_active = False

    def close(self):
        '''close the indexing database'''
        pass

    def rollback(self):
        if not self.transaction_active:
            return
        database = self._get_database()
        database.cancel_transaction()
        self.transaction_active = False

    def force_reindex(self):
        '''Force a reindexing of the database.  This essentially
        empties the tables ids and index and sets a flag so
        that the databases are reindexed'''
        self.reindex = 1

    def should_reindex(self):
        '''returns True if the indexes need to be rebuilt'''
        return self.reindex

    def add_text(self, identifier, text, mime_type='text/plain'):
        ''' "identifier" is (classname, itemid, property) '''
        if mime_type != 'text/plain':
            return
        if not text: text = ''

        # open the database and start a transaction if needed
        database = self._get_database()

        # XXX: Xapian now supports transactions,
        #      but there is a call to save_index() missing.
        #if not self.transaction_active:
            #database.begin_transaction()
            #self.transaction_active = True

        # TODO: allow configuration of other languages
        stemmer = xapian.Stem("english")

        # We use the identifier twice: once in the actual "text" being
        # indexed so we can search on it, and again as the "data" being
        # indexed so we know what we're matching when we get results
        identifier = '%s:%s:%s'%identifier

        # create the new document
        doc = xapian.Document()
        doc.set_data(identifier)
        doc.add_term(identifier, 0)

        for match in re.finditer(r'\b\w{%d,%d}\b'
                                 % (self.minlength, self.maxlength),
                                 text.upper()):
            word = match.group(0)
            if self.is_stopword(word):
                continue
            term = stemmer(word)
            doc.add_posting(term, match.start(0))

        database.replace_document(identifier, doc)

    def find(self, wordlist):
        '''look up all the words in the wordlist.
        If none are found return an empty dictionary
        * more rules here
        '''
        if not wordlist:
            return {}

        database = self._get_database()

        enquire = xapian.Enquire(database)
        stemmer = xapian.Stem("english")
        terms = []
        for term in [word.upper() for word in wordlist
                     if self.minlength <= len(word) <= self.maxlength]:
            if not self.is_stopword(term):
                terms.append(stemmer(term))
        query = xapian.Query(xapian.Query.OP_AND, terms)

        enquire.set_query(query)
        matches = enquire.get_mset(0, database.get_doccount())

        return [tuple(m.document.get_data().split(':'))
            for m in matches]

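
The word-splitting loop in `add_text()` and the identifier round trip between `add_text()` and `find()` can be tried out without Xapian installed. The sketch below is illustrative only and is not part of `indexer_xapian.py`: the names `tokenize`, `pack_identifier` and `unpack_identifier` are hypothetical, the length limits and stopword set are assumed stand-ins for whatever `IndexerBase` provides, and the stemming step is left out because it needs the `xapian` module.

```python
import re

# Assumed illustrative values; the real indexer reads minlength, maxlength
# and its stopword list from IndexerBase.
MINLENGTH = 2
MAXLENGTH = 25
STOPWORDS = frozenset(["A", "AND", "IS", "OF", "THE"])


def tokenize(text, minlength=MINLENGTH, maxlength=MAXLENGTH,
             stopwords=STOPWORDS):
    """Return (word, position) pairs, mirroring the loop in add_text().

    Words are upper-cased, must be minlength..maxlength characters long,
    and stopwords are skipped. Stemming is deliberately omitted here.
    """
    pattern = r'\b\w{%d,%d}\b' % (minlength, maxlength)
    postings = []
    for match in re.finditer(pattern, text.upper()):
        word = match.group(0)
        if word in stopwords:
            continue
        postings.append((word, match.start(0)))
    return postings


def pack_identifier(identifier):
    """Join (classname, itemid, property) the way add_text() does."""
    return '%s:%s:%s' % identifier


def unpack_identifier(data):
    """Split stored document data back into a tuple, as find() does."""
    return tuple(data.split(':'))


if __name__ == '__main__':
    print(tokenize("The cat is on a mat"))
    print(unpack_identifier(pack_identifier(('issue', '1', 'title'))))
```

Note that the round trip returns every element as a string, and that it relies on none of the three identifier parts containing a colon.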
