view roundup/token.py @ 5096:e74c3611b138

- issue2550636, issue2550909: Added support for the Whoosh indexer. Also adds a new config.ini setting called ``indexer`` to select the indexer. See ``doc/upgrading.txt`` for details. Initial patch by David Wolever; patch modified (see ticket or below for changes), docs updated and committed.

  I have an outstanding issue with test/test_indexer.py: I have to comment out all imports and tests for indexers I don't have (i.e. mysql, postgres), otherwise no tests run. With that change made, the dbm, sqlite (rdbms), xapian and whoosh indexers all pass the indexer tests.

  Changes summary (illustrated by the sketches below):

  1) Support the native back ends dbm and rdbms (the original patch only fell through to dbm).

  2) Developed a Whoosh stopfilter so that stopwords, and words outside the maxlength and minlength limits defined in index_common.py, are not indexed. Required to pass the extremewords test in test_indexer. Also removed a call to .lower() on the input text, as the tokenizer I chose lowercases automatically.

  3) Added support for max/min length to find(). This was also needed to pass the extremewords test.

  4) Added back a call to save_index in add_text. This allowed all but two tests to pass.

  5) Fixed a call to ``results = searcher.search(query.Term("identifier", identifier))`` which had an extra parameter that is an error under current Whoosh.

  6) Set limit=None in the search call for find(), otherwise it returns only 10 items. This allowed it to pass the manyresults test.

  Also, due to changes in the roundup code, removed the ``from roundup.anypy.sets_ import set`` import in indexer_whoosh, since we use the Python builtin set.
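The new setting selects the indexer by name in config.ini. A minimal example, assuming the option lives in the ``[main]`` section (``doc/upgrading.txt`` has the authoritative details)::

    [main]
    # one of: native (dbm/rdbms), xapian, whoosh
    indexer = whoosh

And a minimal sketch of the Whoosh usage the summary describes, assuming Whoosh 2.x. The ``identifier`` field name follows the ``query.Term`` call in item 5; the schema, directory name and length limits (2 and 25) are illustrative stand-ins, not the committed indexer_whoosh code::

    import os
    from whoosh import analysis, fields, index, query

    # Tokenize and lowercase (so no explicit .lower() is needed), then
    # drop stopwords and words outside the length limits (item 2).
    analyzer = (analysis.RegexTokenizer()
                | analysis.LowercaseFilter()
                | analysis.StopFilter(minsize=2, maxsize=25))

    schema = fields.Schema(identifier=fields.ID(stored=True, unique=True),
                           content=fields.TEXT(analyzer=analyzer))

    if not os.path.exists('indexdir'):
        os.mkdir('indexdir')        # create_in needs an existing directory
    ix = index.create_in('indexdir', schema)

    writer = ix.writer()
    writer.add_document(identifier=u'issue1', content=u'Some searchable text')
    writer.commit()                 # cf. the save_index call restored in item 4

    with ix.searcher() as searcher:
        # limit=None returns every hit instead of the default 10 (item 6)
        results = searcher.search(query.Term('content', u'searchable'),
                                  limit=None)
        matches = [hit['identifier'] for hit in results]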
author John Rouillard <rouilj@ieee.org>
date Sat, 25 Jun 2016 20:10:03 -0400
parents 6e3e4f24c753
children bc16d91b7a50

#
# Copyright (c) 2001 Richard Jones, richard@bofh.asn.au.
# This module is free software, and you may redistribute it and/or modify
# under the same terms as Python, so long as this copyright message and
# disclaimer are retained in their original form.
#
# This module is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
# 

"""This module provides the tokeniser used by roundup-admin.
"""
__docformat__ = 'restructuredtext'

def token_split(s, whitespace=' \r\n\t', quotes='\'"',
        escaped={'r':'\r', 'n':'\n', 't':'\t'}):
    r'''Split the string up into tokens. An occurrence of a ``'`` or ``"`` in
    the input will cause the splitter to ignore whitespace until a matching
    quote char is found. Embedded non-matching quote chars are treated as
    ordinary characters and kept in the token.

    Whitespace and quoting characters may be escaped using a backslash.
    ``\r``, ``\n`` and ``\t`` are converted to carriage-return, newline and
    tab.  All other backslashed characters are left as-is.

    Valid examples::

           hello world      (2 tokens: hello, world)
           "hello world"    (1 token: hello world)
           "Roch'e" Compaan (2 tokens: Roch'e Compaan)
           Roch\'e Compaan  (2 tokens: Roch'e Compaan)
           address="1 2 3"  (1 token: address=1 2 3)
           \\               (1 token: \)
           \n               (1 token: a newline)
           \o               (1 token: \o)

    Invalid examples::

           "hello world     (no matching quote)
           Roch'e Compaan   (no matching quote)
    '''
    l = []
    pos = 0
    NEWTOKEN = 'newtoken'
    TOKEN = 'token'
    QUOTE = 'quote'
    ESCAPE = 'escape'
    quotechar = ''
    state = NEWTOKEN
    oldstate = ''    # one-level state stack ;)
    length = len(s)
    finish = 0
    token = ''
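    # The loop below is a small state machine over the four states above:
    #   NEWTOKEN -> TOKEN   on the first ordinary character
    #   NEWTOKEN/TOKEN -> QUOTE   on a quote char; whitespace is then literal
    #   QUOTE -> TOKEN      on the matching quote char (other quotes are kept)
    #   NEWTOKEN/TOKEN -> ESCAPE -> TOKEN   on a backslash, consuming one
    #                       char; a backslash inside quotes is kept literally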
    while True:
        # end of string, finish off the current token
        if pos == length:
            if state == QUOTE: raise ValueError("unmatched quote")
            elif state == TOKEN: l.append(token)
            break
        c = s[pos]
        if state == NEWTOKEN:
            # looking for a new token
            if c in quotes:
                # quoted token
                state = QUOTE
                quotechar = c
                pos = pos + 1
                continue
            elif c in whitespace:
                # skip whitespace
                pos = pos + 1
                continue
            elif c == '\\':
                pos = pos + 1
                oldstate = TOKEN
                state = ESCAPE
                continue
            # otherwise we have a token
            state = TOKEN
        elif state == TOKEN:
            if c in whitespace:
                # have a token, and have just found a whitespace terminator
                l.append(token)
                pos = pos + 1
                state = NEWTOKEN
                token = ''
                continue
            elif c in quotes:
                # have a token, just found embedded quotes
                state = QUOTE
                quotechar = c
                pos = pos + 1
                continue
            elif c == '\\':
                pos = pos + 1
                oldstate = state
                state = ESCAPE
                continue
        elif state == QUOTE and c == quotechar:
            # in a quoted token and found a matching quote char
            pos = pos + 1
            # now we're looking for whitespace
            state = TOKEN
            continue
        elif state == ESCAPE:
            # escaped-char conversions (t, r, n)
            # TODO: octal, hexdigit
            state = oldstate
            if c in escaped:
                c = escaped[c]
        # just add this char to the token and move along
        token = token + c
        pos = pos + 1
    return l

# vim: set filetype=python ts=4 sw=4 et si
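
A quick check of the docstring examples, shown as an illustrative interactive session (assuming the module is importable as roundup.token)::

    >>> from roundup.token import token_split
    >>> token_split('hello world')
    ['hello', 'world']
    >>> token_split('"hello world"')
    ['hello world']
    >>> token_split('address="1 2 3"')
    ['address=1 2 3']
    >>> token_split(r"Roch\'e Compaan")
    ["Roch'e", 'Compaan']
    >>> token_split('"hello world')
    Traceback (most recent call last):
        ...
    ValueError: unmatched quote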
