Re: Minimum levenshtein distance for a set of words



Keith Ramsay wrote:
> weinzier wrote:
> |What I am looking for is an algorithm which takes
> | - set of words
> | - costs for insert, delete, replace
> |and gives
> | - minimum levenshtein distance
> | - tells me how many words have this distance (one, many, infinite)
> | - set of all words which have the minimum distance if finite
>
> I'm guessing that by "minimum distance" you mean the
> minimum of the maximum distance from s to one of the
> given strings.

What I meant was the word which minimizes the sum of distances from s
to any of the words.

> |I am sure there is literature about this out there.
> |Again any hints welcome...
>
> The most famous reference, perhaps, is _Time Warps, String Edits,
> and Macromolecules: The Theory and Practice of Sequence Comparison_,
> David Sankoff and Joseph Kruskal eds.

Sounds promising, it is in my local bib, can't get hold of it before
weekend...

> The algorithm in there that perhaps comes closest to doing
> the job is one that would give you the string having the
> smallest *total* distance to the given strings.

Seems that this is *exactly* what I was looking for.

> It doesn't
> take much of a modification to get the string having the
> smallest weighted sum of distances to the given strings.
> There's a tweak for getting all the possible strings achieving
> the minimum, just as it's possible to generate all the
> sequences of insertions and deletions that achieve the minimum--
> although the number can of course increase quickly.

The weighted sum stuff I do not really need.
Does it work with arbitrary cost values for the levenshtein distance?

>
> The algorithm solves the problem for each set of strings
> that you can get by taking an initial segment of each of
> the original strings respectively. Consequently the time
> required grows quickly with the number of strings. I don't
> suppose that anybody knows of any terribly efficient
> algorithm known for the problems, although this is better
> than a more naive brute-force approach.

What do you mean by brute force?
Since the words I am looking for need not be in the inital set the only
brute force method I can think of is to check *any* word (up to a
certain length).

What would be much of interest is a way of knowing how many solutions
are there without doing the whole lot of computations.
Some rules (with proofs) like for a linear system of equations.
For example for a set A with two elements and cost values all the same
the solution is always A.
I suppose that under certain circumstances the solution is the longest
common subsequence.

.



Relevant Pages

  • Re: Algorithm/Theory help: Patterns, comparing, calculating distance between?
    ... The definition and computation distances between strings is a heavily ... To get a handle on the literature, look for Levenshtein distance (check ... Then you find the minimum cost sequence of edits that are required to ... it is common to define a positive cost for each non-trivial edit ...
    (comp.ai)
  • Re: Minimum levenshtein distance for a set of words
    ... | - tells me how many words have this distance ... The algorithm in there that perhaps comes closest to doing ... smallest *total* distance to the given strings. ... Then I suppose by adjusting the weights you could find the ...
    (sci.math)
  • Re: Distance normalized TSP algorithm
    ... Make a larger graph containing two of these ... Define "match well with distance," please. ... tend to cost more, so it costs more to fly to a distant city than to ... which lead me to the distance normalized algorithm where all ...
    (comp.lang.java.programmer)
  • Re: "Looksalike" algorithm
    ... This is the Levenshtein distance: ... problem on an algorithm that finds the "edit distance" between two strings ... Googling for "edit distance" should hopefully return pages for this. ... I'm looking around for a VB.NET algorithm that can do non-exact ...
    (microsoft.public.dotnet.languages.vb)
  • Re: Best Job Skill --> .NET or Java
    ... strings, ... But the same basic brute-force algorithm was ... It compiled a histogram of trigrams, ... finds one random trigram that is unique, it expands that one, ...
    (comp.programming)