Home Download Buy Blog Forum Support

Fuzzy string matching

Fuzzy string matching

Postby gui.apprentice on Tue Feb 28, 2012 7:00 pm

I'm a new Sublime Text 2 user and I absolutely fell in love with the fuzzy logic string matching for opening files and projects.

I'm curious if anyone knows how this was implemented. Some quick searching on the topic reveals some dynamic programming techniques and algorithms such as the Levenshtein distance computing algorithm or other preprocessing methods using suffix trees, but these seem to be more focused on spell checking rather than mapping a few characters to a subset of strings, such as "csh" mapping to "CacheSerializer.php".

I would be more than grateful for any ideas you may have on the topic. Thanks!
gui.apprentice
 
Posts: 1
Joined: Tue Feb 28, 2012 6:51 pm

Re: Fuzzy string matching

Postby spadgos on Thu Mar 01, 2012 9:32 am

In terms of finding matching strings, I think it just checks that a particular string (eg: a file path, or command name) contains all the letters you've searched for in the correct order. a search for "foobar" could be expressed in a regex like so: .*f.*o.*o.*b.*a.*r.* so it would match "foobar", "blah/foobar", "foo/blah/bar", "football bars" and so on. The next part is ordering it to get the most relevant result to the top. I'm not sure of the algorithm used here, but I have noticed that if there is less distance between the matching letters, that pushes it up the list. eg: "blah/foobar" beats "foo/blah/bar". For the Open Anything dialog, I believe it also gives preference to recently-accessed files too.
spadgos
 
Posts: 121
Joined: Thu Oct 06, 2011 12:49 am

Re: Fuzzy string matching

Postby segfault on Tue Apr 10, 2012 5:39 pm

I've been wondering about this too. Clearly there are some heuristics at work here where previous selections influence future predictions, but discounting those for the moment I think I have an algorithm that is comparable.

Begin with a list of choices. Each entry gets an integer score. Entries with scores > 0 are displayed in Ascending order (so score 1 at the top).

1) assign each possible match a score of 1. Assign a pointer to the beginning of each string.
2) a character is entered.
3) for each string, increment the pointer until the character is found.
3a) if you reach the end of the string, remove the string from the list of candidates. Else:
3b) add the number of increments to the score for the string.
4) sort the list by score
5) goto 2.

One behavior this algorithm will have that I'm not 100% sure is a behavior of ST2 is that this algorithm greatly prefers matches at the beginning of the string to the end. Depending on the type of data you're matching, this can be good or bad.

Starting with this algorithm you can add in the features we observe in ST2, such as backspace restoring deleted items and heuristic matching tricks, and I think you get pretty close to what ST2 does. I was thinking about this today because I would *love* to implement it for powershell scripts.

p.s. somehow I ended up with two forum accounts here by accident so I'm reposting this.
segfault
 
Posts: 7
Joined: Thu Feb 02, 2012 2:37 pm

Re: Fuzzy string matching

Postby facelessuser on Tue Apr 10, 2012 11:13 pm

If you want to use similar fuzzy matching in plugins without feeding it through the quick panel or writing your own algorithm, you can try using difflib.

Code: Select all
>>> get_close_matches('appel', ['ape', 'apple', 'peach', 'puppy'])
['apple', 'ape']
>>> import keyword
>>> get_close_matches('wheel', keyword.kwlist)
['while']
>>> get_close_matches('apple', keyword.kwlist)
[]
>>> get_close_matches('accept', keyword.kwlist)
['except']


You can control the fuzziness and the number of return values.
Code: Select all
difflib.get_close_matches(word, possibilities[, n][, cutoff])
Return a list of the best “good enough” matches. word is a sequence for which close matches are desired (typically a string), and possibilities is a list of sequences against which to match word (typically a list of strings).

Optional argument n (default 3) is the maximum number of close matches to return; n must be greater than 0.

Optional argument cutoff (default 0.6) is a float in the range [0, 1]. Possibilities that don’t score at least that similar to word are ignored.

The best (no more than n) matches among the possibilities are returned in a list, sorted by similarity score, most similar first.


There are also other functions as well for different purposes or different applications.

http://docs.python.org/library/difflib.html
facelessuser
 
Posts: 1576
Joined: Tue Apr 05, 2011 7:38 pm

Re: Fuzzy string matching

Postby segfault on Mon Apr 16, 2012 9:23 pm

Thanks for the tip, facelessuser, that is really cool.

I wanted this for a powershell script of mine, so implemented it as a c# powershell module. It should be simple enough to adapt to another project. There is a class with the fuzzy logic I described below (with one change), a WPF user control that tries to emulate ST2 (to me at least), a WPF dialog that will host the control, and one poweshell cmdlet.

https://bitbucket.org/dwarburton/fuzzyselector

the one change i made to the algorithm from my earlier post was to increase the cost for spaces between characters after each match. So after the first character is matched then the score increases 2x for each place the pointer moves and 3x after the second match and so on. This means the algorithm will prefer strings that match characters in clusters over those where the matches are spread about, or at least that is the idea.

I haven't thought of a good way to implement choice memoization, but I would like to. Persistent storage would be ideal but that ties it to a concrete implementation a little bit.
segfault
 
Posts: 7
Joined: Thu Feb 02, 2012 2:37 pm


Return to Plugin Development

Who is online

Users browsing this forum: Yahoo [Bot] and 12 guests