Sublime Forum

Fuzzy string matching

#1

I’m a new Sublime Text 2 user and I absolutely fell in love with the fuzzy logic string matching for opening files and projects.

I’m curious if anyone knows how this was implemented. Some quick searching on the topic reveals some dynamic programming techniques and algorithms such as the Levenshtein distance computing algorithm or other preprocessing methods using suffix trees, but these seem to be more focused on spell checking rather than mapping a few characters to a subset of strings, such as “csh” mapping to “CacheSerializer.php”.

I would be more than grateful for any ideas you may have on the topic. Thanks!

0 Likes

Sublimetext2 fuzzy search algorithim
#2

In terms of finding matching strings, I think it just checks that a particular string (eg: a file path, or command name) contains all the letters you’ve searched for in the correct order. a search for “foobar” could be expressed in a regex like so: .*f.*o.*o.*b.*a.r. so it would match “foobar”, “blah/foobar”, “foo/blah/bar”, “football bars” and so on. The next part is ordering it to get the most relevant result to the top. I’m not sure of the algorithm used here, but I have noticed that if there is less distance between the matching letters, that pushes it up the list. eg: “blah/foobar” beats “foo/blah/bar”. For the Open Anything dialog, I believe it also gives preference to recently-accessed files too.

0 Likes

#3

I’ve been wondering about this too. Clearly there are some heuristics at work here where previous selections influence future predictions, but discounting those for the moment I think I have an algorithm that is comparable.

Begin with a list of choices. Each entry gets an integer score. Entries with scores > 0 are displayed in Ascending order (so score 1 at the top).

  1. assign each possible match a score of 1. Assign a pointer to the beginning of each string.
  2. a character is entered.
  3. for each string, increment the pointer until the character is found.
    3a) if you reach the end of the string, remove the string from the list of candidates. Else:
    3b) add the number of increments to the score for the string.
  4. sort the list by score
  5. goto 2.

One behavior this algorithm will have that I’m not 100% sure is a behavior of ST2 is that this algorithm greatly prefers matches at the beginning of the string to the end. Depending on the type of data you’re matching, this can be good or bad.

Starting with this algorithm you can add in the features we observe in ST2, such as backspace restoring deleted items and heuristic matching tricks, and I think you get pretty close to what ST2 does. I was thinking about this today because I would love to implement it for powershell scripts.

p.s. somehow I ended up with two forum accounts here by accident so I’m reposting this.

0 Likes

#4

If you want to use similar fuzzy matching in plugins without feeding it through the quick panel or writing your own algorithm, you can try using difflib.

[code]>>> get_close_matches(‘appel’, ‘ape’, ‘apple’, ‘peach’, ‘puppy’])
‘apple’, ‘ape’]

import keyword
get_close_matches(‘wheel’, keyword.kwlist)
‘while’]

get_close_matches(‘apple’, keyword.kwlist)
]

get_close_matches(‘accept’, keyword.kwlist)
‘except’]
[/code]

You can control the fuzziness and the number of return values.

[code]difflib.get_close_matches(word, possibilities, n], cutoff])
Return a list of the best “good enough” matches. word is a sequence for which close matches are desired (typically a string), and possibilities is a list of sequences against which to match word (typically a list of strings).

Optional argument n (default 3) is the maximum number of close matches to return; n must be greater than 0.

Optional argument cutoff (default 0.6) is a float in the range [0, 1]. Possibilities that don’t score at least that similar to word are ignored.

The best (no more than n) matches among the possibilities are returned in a list, sorted by similarity score, most similar first.[/code]

There are also other functions as well for different purposes or different applications.

docs.python.org/library/difflib.html

0 Likes

#5

Thanks for the tip, facelessuser, that is really cool.

I wanted this for a powershell script of mine, so implemented it as a c# powershell module. It should be simple enough to adapt to another project. There is a class with the fuzzy logic I described below (with one change), a WPF user control that tries to emulate ST2 (to me at least), a WPF dialog that will host the control, and one poweshell cmdlet.

https://bitbucket.org/dwarburton/fuzzyselector

the one change i made to the algorithm from my earlier post was to increase the cost for spaces between characters after each match. So after the first character is matched then the score increases 2x for each place the pointer moves and 3x after the second match and so on. This means the algorithm will prefer strings that match characters in clusters over those where the matches are spread about, or at least that is the idea.

I haven’t thought of a good way to implement choice memoization, but I would like to. Persistent storage would be ideal but that ties it to a concrete implementation a little bit.

0 Likes