changeset 5096:e74c3611b138

- issue2550636, issue2550909: Added support for Whoosh indexer. Also adds a new config.ini setting called indexer to select the indexer. See ``doc/upgrading.txt`` for details. Initial patch done by David Wolever. Patch modified (see ticket or below for changes), docs updated and committed.

  I have an outstanding issue with test/test_indexer.py: I have to comment out all imports and tests for indexers I don't have (i.e. mysql, postgres), otherwise no tests run. With that change made, the dbm, sqlite (rdbms), xapian and whoosh indexers all pass the indexer tests.

  Changes summary:

  1) Support native back ends dbm and rdbms (the original patch only fell through to dbm).
  2) Developed a whoosh stopfilter to not index stopwords or words outside the maxlength and minlength limits defined in indexer_common.py. Required to pass the extremewords test_indexer test. I also removed a call to .lower on the input text, as the tokenizer I chose lowercases automatically.
  3) Added support for max/min length to find. This was needed to pass the extremewords test.
  4) Added back a call to save_index in add_text. This allowed all but two tests to pass.
  5) Fixed a call to: results = searcher.search(query.Term("identifier", identifier)) which had an extra parameter that is an error under current whoosh.
  6) Set limit=None in the search call for find(); otherwise it only returns 10 items. This allowed it to pass the manyresults test.

  Also, due to changes in the roundup code, removed the ``from roundup.anypy.sets_ import set`` import from indexer_whoosh, since we use the Python builtin set.
author John Rouillard <rouilj@ieee.org>
date Sat, 25 Jun 2016 20:10:03 -0400
parents d3ba0b254dbb
children 156cbc1d182c
files CHANGES.txt doc/features.txt doc/installation.txt doc/upgrading.txt roundup/backends/back_anydbm.py roundup/backends/back_mysql.py roundup/backends/back_postgresql.py roundup/backends/back_sqlite.py roundup/backends/indexer_common.py roundup/backends/indexer_whoosh.py roundup/backends/rdbms_common.py roundup/configuration.py test/test_indexer.py
diffstat 13 files changed, 275 insertions(+), 12 deletions(-)
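
The selection order implied by the new indexer setting can be sketched as follows. This is a minimal, self-contained illustration and not the patch's get_indexer(): the function name choose_indexer, the NATIVE_BY_DBTYPE table and the available argument are assumptions introduced here so the logic runs without Xapian or Whoosh installed. It also raises ValueError for an unknown name, where the patch raises AssertionError.

```python
# Hypothetical sketch of the indexer selection order; these names are
# illustrative and are not part of Roundup's API.
NATIVE_BY_DBTYPE = {
    "anydbm": "indexer_dbm",
    "sqlite": "indexer_rdbms",
    "postgres": "indexer_rdbms",
    "mysql": "indexer_rdbms",
}


def choose_indexer(setting, dbtype, available):
    """Return the name of the indexer module that would be used.

    setting   -- value of the 'indexer' option ("" means auto-detect)
    dbtype    -- value of the Database.dbtype class attribute
    available -- set of optional indexers that can be imported
    """
    name = setting
    if not name:
        # Auto-detect: prefer xapian, then whoosh, then the native indexer.
        for candidate in ("xapian", "whoosh"):
            if candidate in available:
                return "indexer_" + candidate
        name = "native"

    if name in ("xapian", "whoosh"):
        # A forced indexer that is not installed is an error, not a fallback.
        if name not in available:
            raise ImportError("indexer %r is not installed" % name)
        return "indexer_" + name

    if name == "native" and dbtype in NATIVE_BY_DBTYPE:
        # The native indexer depends on the database back end.
        return NATIVE_BY_DBTYPE[dbtype]

    raise ValueError("Invalid indexer: %r" % name)
```

Note that an explicit setting never falls back silently: forcing whoosh on a system without Whoosh fails at import time instead of quietly using the native indexer.
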
--- a/CHANGES.txt	Wed Jun 22 21:29:14 2016 -0400
+++ b/CHANGES.txt	Sat Jun 25 20:10:03 2016 -0400
@@ -73,6 +73,11 @@
   for description. Merge request at:
     https://sourceforge.net/p/roundup/code/merge-requests/1/
   Patch supplied by kinggreedy. Applied/tested by John Rouillard
+- issue2550636, issue2550909: Added support for Whoosh indexer.
+  Also adds new config.ini setting called indexer to select
+  indexer. See ``doc/upgrading.txt`` for details. Initial patch
+  done by David Wolever. Patch modified, docs added and committed
+  by John Rouillard.
 
 Fixed:
 
--- a/doc/features.txt	Wed Jun 22 21:29:14 2016 -0400
+++ b/doc/features.txt	Sat Jun 25 20:10:03 2016 -0400
@@ -47,7 +47,7 @@
    support them (sqlite, mysql and postgresql)
  - indexed text searching giving fast responses to searches across all
    messages and indexed string properties
- - support for the Xapian full-text indexing engine for large trackers
+ - support for the Xapian or Whoosh full-text indexing engine for large trackers
 
 *documented*
  - documentation exists for installation, upgrading, maintenance, users and
--- a/doc/installation.txt	Wed Jun 22 21:29:14 2016 -0400
+++ b/doc/installation.txt	Sat Jun 25 20:10:03 2016 -0400
@@ -67,6 +67,20 @@
 
   Roundup requires Xapian 1.0.0 or newer.
 
+Whoosh full-text indexer
+  The Whoosh_ full-text indexer is also supported and will be used by
+  default if it is available (and Xapian is not installed). This is
+  recommended if you are anticipating a large number of issues (> 5000).
+
+  You may install Whoosh at any time, even after a tracker has been
+  installed and used. You will need to run the "roundup-admin reindex"
+  command if the tracker has existing data.
+
+  Roundup was tested with Whoosh 2.5.7, but earlier versions in the
+  2.0 series may work. Whoosh is a pure python indexer so it is slower
+  than Xapian, but should be useful for moderately sized trackers.
+  It uses the StandardAnalyzer which is suited for Western languages.
+
 pyopenssl
   If pyopenssl_ is installed the roundup-server can be configured
   to serve trackers over SSL. If you are going to serve roundup via
@@ -88,6 +102,7 @@
   You can run Roundup as a Windows service if pywin32_ is installed.
 
 .. _Xapian: http://xapian.org/
+.. _Whoosh: https://bitbucket.org/mchaput/whoosh/wiki/Home
 .. _pytz: http://www.python.org/pypi/pytz
 .. _Olson tz database: http://www.twinsun.com/tz/tz-link.htm
 .. _pyopenssl: http://pyopenssl.sourceforge.net
--- a/doc/upgrading.txt	Wed Jun 22 21:29:14 2016 -0400
+++ b/doc/upgrading.txt	Sat Jun 25 20:10:03 2016 -0400
@@ -30,7 +30,7 @@
 backend being used for a tracker. The backend is now configured in the
 ``config.ini`` file using the ``backend`` option located in the ``[rdbms]``
 section. For example if ``db/backend_name`` file contains ``sqlite``, a new
-entry in the ``config.ini`` will need to be created::
+entry in the tracker's ``config.ini`` will need to be created::
 
   [rdbms]
 
@@ -47,6 +47,24 @@
 ``db/`` if you have configured the ``database`` option in the ``[main]``
 section of the ``config.ini`` file to be something other than ``db``.
 
+New config file option 'indexer' added
+--------------------------------------
+
+With support for the Whoosh indexer, a new config file option has been
+added. You can force Roundup to use a particular text indexer by
+setting this value in the [main] section of the tracker's
+``config.ini`` file (usually placed right before indexer_stopwords)::
+
+  [main]
+
+  ...
+
+  # Force Roundup to use a particular text indexer.
+  # If no indexer is supplied, the first available indexer
+  # will be used in the following order:
+  # Possible values: xapian, whoosh, native (internal).
+  indexer =
+
 html/_generic.404.html in trackers use page template
 ----------------------------------------------------
 
--- a/roundup/backends/back_anydbm.py	Wed Jun 22 21:29:14 2016 -0400
+++ b/roundup/backends/back_anydbm.py	Sat Jun 25 20:10:03 2016 -0400
@@ -33,10 +33,7 @@
 from roundup.backends.blobfiles import FileStorage
 from roundup.backends.sessions_dbm import Sessions, OneTimeKeys
 
-try:
-    from roundup.backends.indexer_xapian import Indexer
-except ImportError:
-    from roundup.backends.indexer_dbm import Indexer
+from roundup.backends.indexer_common import get_indexer
 
 def db_exists(config):
     # check for the user db
@@ -140,7 +137,17 @@
     - check the timestamp of the class file and nuke the cache if it's
       modified. Do some sort of conflict checking on the dirty stuff.
     - perhaps detect write collisions (related to above)?
+
+    attributes:
+      dbtype:
+        holds the value for the type of db. It is used by indexer to
+        identify the database type so it can import the correct indexer
+        module when using native text search mode.
     """
+
+    dbtype = "anydbm"
+
+
     def __init__(self, config, journaltag=None):
         """Open a hyperdatabase given a specifier to some storage.
 
@@ -167,7 +174,7 @@
         self.newnodes = {}      # keep track of the new nodes by class
         self.destroyednodes = {}# keep track of the destroyed nodes by class
         self.transactions = []
-        self.indexer = Indexer(self)
+        self.indexer = get_indexer(config, self)
         self.security = security.Security(self)
         os.umask(config.UMASK)
 
--- a/roundup/backends/back_mysql.py	Wed Jun 22 21:29:14 2016 -0400
+++ b/roundup/backends/back_mysql.py	Sat Jun 25 20:10:03 2016 -0400
@@ -110,8 +110,19 @@
 
 
 class Database(rdbms_common.Database):
+    """ Mysql DB backend implementation
+
+    attributes:
+      dbtype:
+        holds the value for the type of db. It is used by indexer to
+        identify the database type so it can import the correct indexer
+        module when using native text search mode.
+    """
+
     arg = '%s'
 
+    dbtype = "mysql"
+
     # used by some code to switch styles of query
     implements_intersect = 0
 
--- a/roundup/backends/back_postgresql.py	Wed Jun 22 21:29:14 2016 -0400
+++ b/roundup/backends/back_postgresql.py	Sat Jun 25 20:10:03 2016 -0400
@@ -151,8 +151,19 @@
                 self.db.rollback()
 
 class Database(rdbms_common.Database):
+    """Postgres DB backend implementation
+
+    attributes:
+      dbtype:
+        holds the value for the type of db. It is used by indexer to
+        identify the database type so it can import the correct indexer
+        module when using native text search mode.
+    """
+
     arg = '%s'
 
+    dbtype = "postgres"
+
     # used by some code to switch styles of query
     implements_intersect = 1
 
--- a/roundup/backends/back_sqlite.py	Wed Jun 22 21:29:14 2016 -0400
+++ b/roundup/backends/back_sqlite.py	Sat Jun 25 20:10:03 2016 -0400
@@ -34,12 +34,23 @@
     shutil.rmtree(config.DATABASE)
 
 class Database(rdbms_common.Database):
+    """Sqlite DB backend implementation
+
+    attributes:
+      dbtype:
+        holds the value for the type of db. It is used by indexer to
+        identify the database type so it can import the correct indexer
+        module when using native text search mode.
+    """
+
     # char to use for positional arguments
     if sqlite_version in (2,3):
         arg = '?'
     else:
         arg = '%s'
 
+    dbtype = "sqlite"
+
     # used by some code to switch styles of query
     implements_intersect = 1
 
--- a/roundup/backends/indexer_common.py	Wed Jun 22 21:29:14 2016 -0400
+++ b/roundup/backends/indexer_common.py	Sat Jun 25 20:10:03 2016 -0400
@@ -107,3 +107,41 @@
                             node_dict[linkprop].append(nodeid)
         return nodeids
 
+def get_indexer(config, db):
+    indexer_name = getattr(config, "INDEXER", "")
+    if not indexer_name:
+        # Try everything
+        try:
+            from indexer_xapian import Indexer
+            return Indexer(db)
+        except ImportError:
+            pass
+
+        try:
+            from indexer_whoosh import Indexer
+            return Indexer(db)
+        except ImportError:
+            pass
+
+        indexer_name = "native" # fallback to native full text search
+
+    if indexer_name == "xapian":
+        from indexer_xapian import Indexer
+        return Indexer(db)
+
+    if indexer_name == "whoosh":
+        from indexer_whoosh import Indexer
+        return Indexer(db)
+
+    if indexer_name == "native":
+        # load proper native indexing based on database type
+        if db.dbtype == "anydbm":
+            from roundup.backends.indexer_dbm import Indexer
+            return Indexer(db)
+
+        if db.dbtype in ("sqlite", "postgres", "mysql"):
+            from roundup.backends.indexer_rdbms import Indexer
+            return Indexer(db)
+
+    raise AssertionError("Invalid indexer: %r" %(indexer_name))
+
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/roundup/backends/indexer_whoosh.py	Sat Jun 25 20:10:03 2016 -0400
@@ -0,0 +1,129 @@
+''' This implements the full-text indexer using Whoosh.
+'''
+import re, os
+
+from whoosh import fields, qparser, index, query, analysis
+
+from roundup.backends.indexer_common import Indexer as IndexerBase
+
+class Indexer(IndexerBase):
+    def __init__(self, db):
+        IndexerBase.__init__(self, db)
+        self.db_path = db.config.DATABASE
+        self.reindex = 0
+        self.writer = None
+        self.index = None
+        self.deleted = set()
+
+    def _get_index(self):
+        if self.index is None:
+            path = os.path.join(self.db_path, 'whoosh-index')
+            if not os.path.exists(path):
+                # StandardAnalyzer lowercases all words. Configure it to
+                # block stopwords and words with lengths not between
+                # self.minlength and self.maxlength from indexer_common
+                stopfilter =  analysis.StandardAnalyzer( #stoplist=self.stopwords,
+                                                        minsize=self.minlength,
+                                                        maxsize=self.maxlength)
+                os.mkdir(path)
+                schema = fields.Schema(identifier=fields.ID(stored=True,
+                                                            unique=True),
+                                       content=fields.TEXT(analyzer=stopfilter))
+                index.create_in(path, schema)
+            self.index = index.open_dir(path)
+        return self.index
+
+    def save_index(self):
+        '''Save the changes to the index.'''
+        if not self.writer:
+            return
+        self.writer.commit()
+        self.deleted = set()
+        self.writer = None
+
+    def close(self):
+        '''close the indexing database'''
+        pass
+
+    def rollback(self):
+        if not self.writer:
+            return
+        self.writer.cancel()
+        self.deleted = set()
+        self.writer = None
+
+    def force_reindex(self):
+        '''Force a reindexing of the database.  This essentially
+        empties the tables ids and index and sets a flag so
+        that the databases are reindexed'''
+        self.reindex = 1
+
+    def should_reindex(self):
+        '''returns True if the indexes need to be rebuilt'''
+        return self.reindex
+
+    def _get_writer(self):
+        if self.writer is None:
+            self.writer = self._get_index().writer()
+        return self.writer
+
+    def _get_searcher(self):
+        return self._get_index().searcher()
+
+    def add_text(self, identifier, text, mime_type='text/plain'):
+        ''' "identifier" is  (classname, itemid, property) '''
+        if mime_type != 'text/plain':
+            return
+
+        if not text:
+            text = u''
+
+        if not isinstance(text, unicode):
+            text = unicode(text, "utf-8", "replace")
+
+        # We use the identifier twice: once in the actual "text" being
+        # indexed so we can search on it, and again as the "data" being
+        # indexed so we know what we're matching when we get results
+        identifier = u"%s:%s:%s"%identifier
+
+        # FIXME need to enhance this to handle the whoosh.store.LockError
+        # that may be raised if there is already another process with a lock.
+        writer = self._get_writer()
+
+        # Whoosh gets upset if a document is deleted twice in one transaction,
+        # so we keep a list of the documents we have so far deleted to make
+        # sure that we only delete them once.
+        if identifier not in self.deleted:
+            searcher = self._get_searcher()
+            results = searcher.search(query.Term("identifier", identifier))
+            if len(results) > 0:
+                writer.delete_by_term("identifier", identifier)
+                self.deleted.add(identifier)
+
+        # Note: no '.lower()' call is needed here; the StandardAnalyzer
+        # used for the content field lowercases the text itself.
+        writer.add_document(identifier=identifier, content=text)
+        self.save_index()
+
+    def find(self, wordlist):
+        '''look up all the words in the wordlist.
+        If none are found return an empty dictionary
+        * more rules here
+        '''
+
+        wordlist = [ word for word in wordlist
+                     if (self.minlength <= len(word) <= self.maxlength) and
+                        not self.is_stopword(word.upper()) ]
+
+        if not wordlist:
+            return {}
+
+        searcher = self._get_searcher()
+        q = query.And([ query.FuzzyTerm("content", word.lower())
+                        for word in wordlist ])
+
+        results = searcher.search(q, limit=None)
+
+        return [tuple(result["identifier"].split(':'))
+                for result in results]
+
--- a/roundup/backends/rdbms_common.py	Wed Jun 22 21:29:14 2016 -0400
+++ b/roundup/backends/rdbms_common.py	Sat Jun 25 20:10:03 2016 -0400
@@ -64,10 +64,7 @@
 
 # support
 from roundup.backends.blobfiles import FileStorage
-try:
-    from roundup.backends.indexer_xapian import Indexer
-except ImportError:
-    from roundup.backends.indexer_rdbms import Indexer
+from roundup.backends.indexer_common import get_indexer
 from roundup.backends.sessions_rdbms import Sessions, OneTimeKeys
 from roundup.date import Range
 
@@ -172,7 +169,7 @@
         self.config, self.journaltag = config, journaltag
         self.dir = config.DATABASE
         self.classes = {}
-        self.indexer = Indexer(self)
+        self.indexer = get_indexer(config, self)
         self.security = security.Security(self)
 
         # additional transaction support for external files and the like
--- a/roundup/configuration.py	Wed Jun 22 21:29:14 2016 -0400
+++ b/roundup/configuration.py	Sat Jun 25 20:10:03 2016 -0400
@@ -540,6 +540,11 @@
             "email?"),
         (BooleanOption, "email_registration_confirmation", "yes",
             "Offer registration confirmation by email or only through the web?"),
+        (Option, "indexer", "",
+            "Force Roundup to use a particular text indexer.\n"
+            "If no indexer is supplied, the first available indexer\n"
+            "will be used in the following order:\n"
+            "Possible values: xapian, whoosh, native (internal)."),
         (WordListOption, "indexer_stopwords", "",
             "Additional stop-words for the full-text indexer specific to\n"
             "your tracker. See the indexer source for the default list of\n"
--- a/test/test_indexer.py	Wed Jun 22 21:29:14 2016 -0400
+++ b/test/test_indexer.py	Sat Jun 25 20:10:03 2016 -0400
@@ -39,6 +39,12 @@
     skip_xapian = pytest.skip(
         "Skipping Xapian indexer tests: 'xapian' not installed")
 
+try:
+    import whoosh
+    skip_whoosh = lambda func, *args, **kwargs: func
+except ImportError:
+    skip_whoosh = pytest.skip(
+        "Skipping Whoosh indexer tests: 'whoosh' not installed")
 
 class db:
     class config(dict):
@@ -150,6 +156,16 @@
     def tearDown(self):
         shutil.rmtree('test-index')
 
+@skip_whoosh
+class WhooshIndexerTest(IndexerTest):
+    def setUp(self):
+        if os.path.exists('test-index'):
+            shutil.rmtree('test-index')
+        os.mkdir('test-index')
+        from roundup.backends.indexer_whoosh import Indexer
+        self.dex = Indexer(db)
+    def tearDown(self):
+        shutil.rmtree('test-index')
 
 @skip_xapian
 class XapianIndexerTest(IndexerTest):
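
The try/except guard added to test/test_indexer.py is one instance of a general pattern: decide at import time whether an optional dependency is present, then either leave the test class alone or mark it as skipped. The sketch below shows that pattern with the standard library's unittest.skip instead of the pytest helper used in the patch. The helper name skip_unless_importable, the two test classes and the module name no_such_indexer_module_xyz are hypothetical.

```python
import importlib
import unittest


def skip_unless_importable(module_name):
    """Return a class decorator for tests that need an optional module.

    If module_name imports, the decorator returns the class unchanged;
    otherwise it marks the whole class as skipped.
    """
    try:
        importlib.import_module(module_name)
    except ImportError:
        return unittest.skip(
            "Skipping tests: %r not installed" % module_name)
    return lambda cls: cls


@skip_unless_importable("json")  # stdlib module, always importable
class AlwaysRuns(unittest.TestCase):
    def test_ok(self):
        self.assertTrue(True)


@skip_unless_importable("no_such_indexer_module_xyz")  # made-up name
class SkippedWhenMissing(unittest.TestCase):
    def test_never_runs(self):
        self.fail("should have been skipped")
```

Because the check happens inside the helper, a missing back end only skips its own test class and the rest of the module still loads and runs.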

Roundup Issue Tracker: http://roundup-tracker.org/