annotate test/test_token.py @ 5096:e74c3611b138

- issue2550636, issue2550909: Added support for the Whoosh indexer. Also adds a new config.ini setting called indexer to select the indexer. See ``doc/upgrading.txt`` for details. Initial patch done by David Wolever. Patch modified (see ticket or below for changes), docs updated and committed.

  I have an outstanding issue with test/test_indexer.py: I have to comment out all imports and tests for indexers I don't have (i.e. mysql, postgres), otherwise no tests run. With that change made, the dbm, sqlite (rdbms), xapian and whoosh indexes all pass the indexer tests.

  Changes summary:

  1) Support the native back ends dbm and rdbms (the original patch only fell through to dbm).
  2) Developed a whoosh stopfilter so that stopwords, and words outside the maxlength and minlength limits defined in index_common.py, are not indexed. Required to pass the extremewords test_indexer test. I also removed a call to .lower on the input text, as the tokenizer I chose lowercases automatically.
  3) Added support for max/min length to find. This was needed to pass the extremewords test.
  4) Added back a call to save_index in add_text. This allowed all but two tests to pass.
  5) Fixed a call to results = searcher.search(query.Term("identifier", identifier)), which had an extra parameter that is an error under current whoosh.
  6) Set limit=None in the search call for find(); otherwise it only returns 10 items. This allowed it to pass the manyresults test.

  Also, due to changes in the roundup code, removed the line "from roundup.anypy.sets_ import set" from indexer_whoosh, since we use the Python builtin set.
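Items 2 and 3 of the changes summary describe dropping stopwords and words outside the length limits. The following is only a rough, library-free sketch of that filtering idea, not the whoosh stopfilter from the patch. The function name, the stopword set and the two limits are hypothetical placeholders; the real values live in index_common.py, which is not shown on this page.

```python
# Hypothetical limits and stopwords, chosen only for illustration.
# The real values are defined in roundup's index_common.py.
MIN_WORD_LENGTH = 2
MAX_WORD_LENGTH = 25
STOPWORDS = frozenset(["a", "and", "the", "to"])


def indexable_words(words, stopwords=STOPWORDS,
                    minlength=MIN_WORD_LENGTH, maxlength=MAX_WORD_LENGTH):
    """Yield only the words that would be worth indexing.

    A word is skipped when it is shorter than minlength, longer than
    maxlength, or listed in stopwords. Words are assumed to be
    lowercased already, as the commit message says the tokenizer does.
    """
    for word in words:
        if not (minlength <= len(word) <= maxlength):
            continue  # outside the length limits
        if word in stopwords:
            continue  # too common to be a useful search term
        yield word
```

Applying the same filter to the search terms in find() is what keeps indexing and searching consistent: a word that was never indexed should not be searched for either.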
author John Rouillard <rouilj@ieee.org>
date Sat, 25 Jun 2016 20:10:03 -0400
parents 364c54991861
children 6971c9249c6d
All lines of this file are annotated with revision 470, changeset 9f7320624bc2 ("Added better tokenising to roundup-admin - handles spaces and stuff."), by Richard Jones <richard@users.sourceforge.net>.

#
# Copyright (c) 2001 Richard Jones
# This module is free software, and you may redistribute it and/or modify
# under the same terms as Python, so long as this copyright message and
# disclaimer are retained in their original form.
#
# This module is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

import unittest, time

from roundup.token import token_split

class TokenTestCase(unittest.TestCase):
    def testValid(self):
        l = token_split('hello world')
        self.assertEqual(l, ['hello', 'world'])

    def testIgnoreExtraSpace(self):
        l = token_split('hello world ')
        self.assertEqual(l, ['hello', 'world'])

    def testQuoting(self):
        l = token_split('"hello world"')
        self.assertEqual(l, ['hello world'])
        l = token_split("'hello world'")
        self.assertEqual(l, ['hello world'])

    def testEmbedQuote(self):
        l = token_split(r'Roch\'e Compaan')
        self.assertEqual(l, ["Roch'e", "Compaan"])
        l = token_split('address="1 2 3"')
        self.assertEqual(l, ['address=1 2 3'])

    def testEscaping(self):
        l = token_split('"Roch\'e" Compaan')
        self.assertEqual(l, ["Roch'e", "Compaan"])
        l = token_split(r'hello\ world')
        self.assertEqual(l, ['hello world'])
        l = token_split(r'\\')
        self.assertEqual(l, ['\\'])
        l = token_split(r'\n')
        self.assertEqual(l, ['\n'])

    def testBadQuote(self):
        self.assertRaises(ValueError, token_split, '"hello world')
        self.assertRaises(ValueError, token_split, "Roch'e Compaan")

# vim: set filetype=python ts=4 sw=4 et si
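
The test cases above pin down the behaviour expected of token_split: whitespace separates tokens, single or double quotes group words, a backslash escapes the next character, and an unterminated quote raises ValueError. As an illustration only, here is a minimal sketch of a splitter that satisfies those cases. It is not the code in roundup.token; the name split_tokens_sketch is hypothetical, and the choices to honour backslashes inside quotes and to raise on a trailing backslash are assumptions that the tests do not cover.

```python
def split_tokens_sketch(text, whitespace=" \r\n\t", quotes="'\""):
    """Split text into tokens, honouring quotes and backslash escapes.

    Illustrative sketch only; not roundup.token.token_split.
    """
    escapes = {"n": "\n", "r": "\r", "t": "\t"}
    tokens = []
    current = []      # characters of the token being built
    in_token = False  # True once a token has started, even if still empty
    quote = None      # the currently open quote character, if any
    i = 0
    while i < len(text):
        ch = text[i]
        if ch == "\\":
            # Backslash: map n, r and t to control characters and take
            # any other following character literally.
            # Assumption: this applies inside quotes as well.
            i += 1
            if i == len(text):
                raise ValueError("dangling backslash")
            current.append(escapes.get(text[i], text[i]))
            in_token = True
        elif quote is not None:
            if ch == quote:
                quote = None        # closing quote; the token may continue
            else:
                current.append(ch)  # whitespace is kept inside quotes
        elif ch in quotes:
            quote = ch              # opening quote, possibly mid-token
            in_token = True
        elif ch in whitespace:
            if in_token:
                tokens.append("".join(current))
                current = []
                in_token = False
        else:
            current.append(ch)
            in_token = True
        i += 1
    if quote is not None:
        raise ValueError("unterminated quote")
    if in_token:
        tokens.append("".join(current))
    return tokens
```

Because an opening quote does not end the current token, address="1 2 3" comes out as the single token address=1 2 3, which is the behaviour testEmbedQuote checks.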

Roundup Issue Tracker: http://roundup-tracker.org/