Token-Aware Edit Distance

March 2020

Algorithm by Trey Jones.
Core code improvements by Zbyszko Papierski.
Post-fork documentation and code updates, and general code diminishments, also by Trey Jones.

Forked from the Wikimedia Search Platform team's Glent project.

Post-Fork Changes

  • Modified this ReadMe to be less Glent-oriented
  • Added command-line driver and documentation
  • TODO?: add optimization code that can run using the driver
  • TODO?: add full edit distance table output

Background & Rationale

Edit distance is a commonly used metric for string similarity. Levenshtein distance measures the number of insertions, deletions, and substitutions required to transform one string into another. Damerau–Levenshtein distance also allows for transpositions (swaps).

Glent is a project/repository intended to improve search suggestions made on-wiki (commonly called "Did you mean" or "DYM" suggestions). A review of the suggestions made by Glent's Method 1 ("similar queries") revealed a number of common problems among poor-quality suggestions. Token-Aware Edit Distance is intended to address those concerns related to the edit distance calculation, and to generate a more fine-grained score.

The algorithm may also be useful outside the context of Glent.

Issues Addressed

Token-Aware Edit Limits
Using edit distance on short strings of a few words can lead to some undesirable behavior.

A default edit limit of, say, 2 edits per word is reasonable, with a lower limit for very short words. This makes perfect sense when applied to individual words—allowing two edits to IQ would let it change into any other two-letter word! With much longer texts, like a paragraph or more, changing two letters—even if it completely changes a single word—doesn't really impact the overall similarity of the two paragraphs.

Strings made up of a handful of words, though, are a different matter. Allowing two edits to bar band can result in a band, which has a fairly different meaning. Normally we wouldn't allow a three-letter word like bar to have two edits, so we should block it from happening to the word bar as part of a longer string, too. Hence, the "token-aware" part of Token-Aware Edit Distance, which allows us to apply edit limits at the token level rather than the string level.

Use setPerTokenLimit(boolean perTokenLimit) to configure whether edit limits are applied per-token. By default, perTokenLimit is true.
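
Here is a minimal sketch of per-token limits. The setters are the ones documented in this ReadMe; the no-argument constructor and the getDistance(String, String) call are stand-ins for illustration—see the API docs and the test file for the actual construction details.

    TokenAwareEditDistance taed = new TokenAwareEditDistance(); // hypothetical construction
    taed.setPerTokenLimit(true);    // the default: limits apply to each token
    taed.setDefaultNormLimit(0.4f); // allow each token to change by up to 40%

    // "bar" is three letters, so its per-token limit is min(2, 3 * 0.4) = 1.2 edits.
    // Turning "bar" into "a" takes two edits, over the per-token limit, so the
    // whole pair should be rejected.
    float d = taed.getDistance("bar band", "a band"); // expect Float.POSITIVE_INFINITY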

Normalized Edit Limits (Percentages)
To simplify the configuration of variable edit limits, rather than specify specific cut-offs (e.g., zero edits for one or two letters, one edit for three or four letters, two edits for five or more letters), it makes more sense to specify the proportional limit as a fraction of the total length. This also allows for very small edits, like duplicate letters, even in very short strings. So changing od to odd is not one edit, but perhaps only 0.1 edits, which is only 5% of the length of the original two-letter word. (Note that this kind of math may also affect the effective length of a word; if inserting or deleting a duplicate letter only costs 0.1 edits, then odd is arguably only 2.1 letters long.)

Normalized edit limits work in conjunction with the hard limit on edits, and whichever is lower is used. For example, if you have the normalized edit limit set to 0.4 (i.e., 40%) and the plain edit limit set to 2, then a string or token of length 8 would only be allowed 2 edits (i.e., min(2, 8*0.4)).

String pairs that are over either limit are assigned an edit distance of Float.POSITIVE_INFINITY.

Use setDefaultLimit(float defaultLimit) to configure the maximum number of edits for the whole string. By default, defaultLimit is 2.0f. Set defaultLimit to 0f for no hard limit on edits per string.

defaultLimit is also used for early termination of the edit distance computation. For example, if the limit is 2 and we can compute that a minimum of 3 edits are necessary, we can terminate early (returning a distance of Float.POSITIVE_INFINITY).

Use setDefaultNormLimit(float defaultNormLimit) to configure the maximum proportion of edits allowed for the whole string. By default, defaultNormLimit is 0f. This value is interpreted as a percentage, so 0.4 means that a string or token (see perTokenLimit above) can be changed by up to 40%. Set defaultNormLimit to 0f for no proportional per-string or per-token limit.

If both defaultLimit and defaultNormLimit are set to 0f, then the full edit distance will be computed and returned.

Note that in some cases setting defaultNormLimit can result in a larger total edit distance, because some edit paths are blocked. This is usually a good thing.
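
As a concrete configuration, here is the min() interaction described above (same hypothetical construction as the earlier sketch):

    TokenAwareEditDistance taed = new TokenAwareEditDistance();
    taed.setDefaultLimit(2.0f);     // hard cap: at most 2 edits (the default)
    taed.setDefaultNormLimit(0.4f); // proportional cap: at most 40% of the length

    // length 8: min(2, 8 * 0.4) = min(2, 3.2) = 2 edits allowed
    // length 3: min(2, 3 * 0.4) = min(2, 1.2) = 1.2 edits allowed
    // length 2: min(2, 2 * 0.4) = 0.8 edits allowed—still room for a cheap edit
    //           like a 0.05-cost duplicated letter (see duplicateCost below)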

Length Normalization Type
When we talk about lower edit distance limits for shorter strings, words, or tokens, sometimes our edits can cross the boundary between what is or is not "shorter". If, say, we allow two edits for five-letter words, but only one edit for three-letter words, how do we handle brand being edited (with two deletions) to ban, and is that different from inserting letters in the opposite direction?

To address this, we can specify that edits are counted against the longer string or token, against the shorter string or token, or against the first string or token. Specifying the first string or token as the one to measure against allows us to skew the limit based on the "original" string (e.g., user input). Thus, brand could be shortened to ban (two edits to a five-letter word) but ban could be blocked from lengthening to brand (two edits to a three-letter word). Which choice is the best is an empirical or philosophical question. In the case of Glent "similar queries", it seems that being more aggressive and normalizing against the shorter string more often gives better results. Your mileage may vary.

Use setNormType(NormType normType) to specify the length normalization type. By default, normType is NormType.MAX; other possible values are NormType.MIN and NormType.FIRST.
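
Continuing the hypothetical sketch from above, NormType.FIRST makes the limit asymmetric:

    TokenAwareEditDistance taed = new TokenAwareEditDistance();
    taed.setNormType(NormType.FIRST); // measure limits against the first string
    taed.setDefaultNormLimit(0.4f);

    // "brand" -> "ban": the limit is 5 * 0.4 = 2.0, so two deletions come in
    // right at the limit and are allowed.
    taed.getDistance("brand", "ban"); // finite distance
    // "ban" -> "brand": the limit is 3 * 0.4 = 1.2, so two insertions are blocked.
    taed.getDistance("ban", "brand"); // expect Float.POSITIVE_INFINITY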

Initial Letter Change Penalty
While any typo that can happen probably will happen at some point, changing the first letter of a word seems like a bigger change than changing any other letter, so it should carry a greater weight.

Use setTokenInitialPenalty(float tokenInitialPenalty) to configure the additional penalty for changing the first letter of a token. By default, tokenInitialPenalty is 0.25f.
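
With the default costs, a token-initial substitution should weigh in at the usual substitution cost plus the penalty (hypothetical sketch as above):

    TokenAwareEditDistance taed = new TokenAwareEditDistance();
    taed.setTokenInitialPenalty(0.25f); // the default

    // "cat" -> "bat": token-initial substitution, roughly 1.0 + 0.25 = 1.25 edits
    // "cat" -> "cot": word-internal substitution, the usual 1.0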

Penalty for Changing the Number of Tokens
Splitting up words—especially by turning letters into spaces or deleting them—can really change a query. An egregious example is abbys: changing the first b and the y each into a space gives a b s, which is very different! So, we penalize both changing the number of tokens and having to change a letter to a space (or vice versa).

Use setTokenDeltaPenalty(float tokenDeltaPenalty) to configure the additional penalty for increasing or decreasing the number of tokens; the penalty is per net token added or lost. By default, tokenDeltaPenalty is 0.25f.

Use setTokenSepSubstPenalty(float tokenSepSubstPenalty) to configure the additional penalty (in addition to the usual substitution cost) for converting a space (or other token-separator character) to any other character, or vice versa. By default, tokenSepSubstPenalty is 0.5f.
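
Using the default costs, the abbys example should stack up roughly like this (hypothetical sketch as above):

    TokenAwareEditDistance taed = new TokenAwareEditDistance();
    taed.setTokenDeltaPenalty(0.25f);   // the default; per net token added or lost
    taed.setTokenSepSubstPenalty(0.5f); // the default; per letter<->space substitution

    // "abbys" -> "a b s": two letter-to-space substitutions at 1.0 + 0.5 each,
    // plus going from one token to three tokens at 2 * 0.25, for roughly 3.5
    // edits total—instead of a plain 2.0 without the token-aware penalties.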

Spaceless Equality Exception
One exception to the penalty for changing the number of tokens is that strings that are identical except for spaces or punctuation should have a lower edit distance. So, abbysplace and abby's place are obviously very similar.

Use setSpaceOnlyCost(float spaceOnlyCost) to configure the cost of inserting or deleting spaces (or the token-separator character you specify) in strings that differ only by spaces. By default, spaceOnlyCost is 0.1f.

If spaceOnlyCost is equal to the usual insertion/deletion cost, then there will be no discount.
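
A sketch with a made-up pair of strings that differ only by a single space (hypothetical construction as above):

    TokenAwareEditDistance taed = new TokenAwareEditDistance();
    taed.setSpaceOnlyCost(0.1f); // the default

    // The strings are identical except for one space, so inserting it costs
    // 0.1 rather than the usual insDelCost of 1.0.
    taed.getDistance("rockpaper", "rock paper"); // roughly 0.1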

Duplicated Letters Discount
Repeated letters are a common form of typo—appolonius vs apollonius, for example—and should not count as heavily as inserting or deleting a random letter.

Use setDuplicateCost(float duplicateCost) to configure the cost of inserting or deleting duplicated letters (instead of the usual insertion/deletion cost). By default, duplicateCost is 0.05f.
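
For the apollonius example above, with the default costs (hypothetical sketch as above):

    TokenAwareEditDistance taed = new TokenAwareEditDistance();
    taed.setDuplicateCost(0.05f); // the default

    // "appolonius" -> "apollonius": deleting the duplicated p and inserting a
    // duplicated l are both duplicate edits, so roughly 2 * 0.05 = 0.1 total,
    // rather than 2.0 for a random deletion plus insertion.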

Number Change Penalty
Changes in numbers are more distinct than changes in letters—e.g., 1990s championship vs 1920s championship—so such changes should have a higher cost.

Use setDigitChangePenalty(float digitChangePenalty) to configure the additional cost of changing one digit to another (on top of the usual substitution cost). By default, digitChangePenalty is 0.33f. (Note that if the digitChangePenalty plus the usual substitution penalty is more than double the insertion/deletion cost, it will have no effect, because it will be cheaper to delete one digit and insert the other as two separate operations.)
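
For the championship example above, with the default costs (hypothetical sketch as above):

    TokenAwareEditDistance taed = new TokenAwareEditDistance();
    taed.setDigitChangePenalty(0.33f); // the default

    // "1990s championship" -> "1920s championship": one digit-to-digit
    // substitution, roughly 1.0 + 0.33 = 1.33 edits. Still cheaper than a
    // separate deletion plus insertion (2.0), so the penalty takes effect.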

Transpositions/Swaps Discount
And of course we should be able to discount transpositions—swapped letters, like queit vs quiet—as with Damerau–Levenshtein distance.

The swap cost, the default insertion/deletion cost, and the substitution cost are all configurable.

Use setInsDelCost(float insDelCost) to configure the cost of inserting or deleting a letter. By default, insDelCost is 1.0f.

Use setSubstCost(float substCost) to configure the cost of changing one letter to another. By default, substCost is 1.0f.

Use setSwapCost(float swapCost) to configure the cost of transposing/swapping two adjacent letters. By default, swapCost is 1.25f.
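
With the defaults, a swap beats the equivalent pair of substitutions (hypothetical sketch as above):

    TokenAwareEditDistance taed = new TokenAwareEditDistance();
    taed.setInsDelCost(1.0f); // the default
    taed.setSubstCost(1.0f);  // the default
    taed.setSwapCost(1.25f);  // the default

    // "queit" -> "quiet": one swap of the adjacent e and i at 1.25 beats two
    // substitutions at 2.0, so the swap path wins.
    taed.getDistance("queit", "quiet"); // roughly 1.25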

Tokenization (& Locale)

Obviously if the edit distance function is going to take tokens into account, it needs to determine what those tokens are. Also, in the typical case, we probably want to ignore most punctuation—e.g., dog and dog! are the same in most use-cases.

You can specify your own tokenizer that takes a string and returns an array of strings, each of which will be treated as a single token, and joined with the token-separator character, which is space by default. (You can specify a different token separator if spaces are meaningful in your text.) Your custom tokenizer should also do any lowercasing or other necessary text normalization.

For languages that don't use spaces, like Chinese, a good tokenizer that breaks a string into words can still be used. Internally, the tokens will be joined by spaces (or a different token separator if you specify one). If you don't want that to happen, you can always return a single token from your tokenizer, which might still strip punctuation and normalize whitespace, for example. Bigram tokenizers or other n-gram or shingle tokenizers—which are sometimes used for CJK languages—should probably not be used for edit distance calculations.

If you don't or can't specify a tokenizer, a simple one will be built using the token separator, token-splitting regex, and locale that you specify. Input strings have token separators (spaces by default) trimmed from both ends, the result is lowercased based on the locale (English by default), and that string is split based on the token-splitting regex ([\\p{Z}\\p{P}\\p{S}]+ by default, which matches sequences of Unicode spaces, punctuation, or symbols).

If you specify a null locale, no lowercasing will be done.

Another potentially easier but possibly less efficient tokenization option is to tokenize your string externally, then join the tokens with spaces (or your specified token separator) and let the built-in tokenizer re-tokenize the string on spaces. This saves you from having to make your tokenizer available to the edit distance code, if that's difficult or undesirable.

Use setTokenizer(Function<String, String[]> tokenizer) to specify a tokenization function. By default, tokenizer is null and an internal tokenizer will be built using the other values below.

Use setTokenSep(char tokenSep) to configure the token-separator character. By default, tokenSep is a space character.

Use setLocale(Locale locale) to configure the locale used for lowercasing. By default, locale is Locale.ENGLISH. Set locale to null for no lowercasing. locale is only used if no tokenizer is specified with setTokenizer().

Use setTokenSplit(String tokenSplit) to configure the regular expression used for splitting strings into tokens. By default, tokenSplit is [\\p{Z}\\p{P}\\p{S}]+. This may be too aggressive for some use cases. If you remove \\p{S} (Unicode symbols) from the regex, it may be advisable to add \\+ in its place when working with strings that may have moved across the internet, since + gets used instead of space with some regularity. tokenSplit is only used if no tokenizer is specified with setTokenizer().
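
Here is a hypothetical configuration sketch tying the tokenization options together. The lambda matches the documented Function<String, String[]> signature; the particular normalization it does is just an example.

    import java.util.Locale;

    TokenAwareEditDistance taed = new TokenAwareEditDistance();

    // Option 1: supply a custom tokenizer; it should handle its own lowercasing
    // and any other normalization.
    taed.setTokenizer(s ->
        s.trim().toLowerCase(Locale.GERMAN).split("[\\p{Z}\\p{P}]+"));

    // Option 2: skip setTokenizer() and configure the built-in tokenizer instead.
    taed.setTokenSep(' ');                       // the default separator
    taed.setLocale(Locale.ENGLISH);              // locale for lowercasing
    taed.setTokenSplit("[\\p{Z}\\p{P}\\p{S}]+"); // the default splitting regex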

More Info

See the API docs in target/apidocs/ for more information on class structure and functions.

See TokenAwareEditDistanceTest.java in src/test/... for simple examples of building and configuring a token-aware edit distance object.

On Wikimedia Commons: Video presentation and slides providing an overview of a new token-aware edit distance algorithm, with a review of basic edit distance and some of its variations, and a look at some of the data from the Glent search suggestions project that led to the development of the algorithm.
