Perl Weekly Challenge 96: Edit Distance

You are given two strings $S1 and $S2.

Write a script to find out the minimum operations required to convert $S1 into $S2. The operations can be insert, remove or replace a character. Please check out [the] Wikipedia page for more information.

Overview

What we are asked to do, is to calculate the Levenshtein distance between two strings. If we have two strings, $source and $dest, we can recursively define how to go from $source to $dest using inserts, deletes, and replacements. Let head ($str) be the first character of a string $str, and tail ($str) be the string, except the first character. Then:

  • If $source is the empty string, we add each of the characters of $dest.
  • If $dest is the empty string, we delete each of the characters of $source.
  • If head ($source) equals head ($dest), then we recurse with tail ($source) and tail ($dest).
  • Otherwise, we either:
    • add head ($dest), and recurse with $source and tail ($dest).
    • delete head ($dest), and recurse with tail ($source) and $dest.
    • replace head ($source) with head ($tail), and recurse with tail ($source) and tail ($dest).

Since we just want to number of operations, we can translate this. Let \text{lev }(s, d) be the Levenshtein distance between strings s and d. Let |t| be the length of the string t, \text{head }(t) the first character of the string t and \text{tail }(t) the string t minus the first character. Then:

\text{lev }(s, d) = \begin{cases} |a| & \text{if }|b| = 0\\ |b| & \text{if }|a| = 0\\ \text{lev }(\text{tail }(s), \text{tail }(d)) & \text{if }\text{head }(s) = \text{head }(d) \\ 1 + \text{min } \begin{cases}\text{lev }(\text{tail }(s), d) \\              \text{lev }(s, \text{tail }(d)) \\              \text{lev }(\text{tail }(s), \text{tail }(d)) \end{cases} & \text{otherwise} \end{cases}

We can rearrange this. Define \text{hneq }(s, d) as:

\text{hneq }(s, d) = \begin{cases} 1, & \text{if }\text{head }(s) \neq \text{head }(d) \\ 0, & \text{if }\text{head }(s) = \text{head }(d) \end{cases}

That is, \text{hneq }(s, d) is 1 if the first characters differ, and 0 if they are equal. Then:

\text{lev }(s, d) = \begin{cases} |a| + |b| & \text{if }|a| = 0 \text{ or }|b| = 0 \\ \text{min }\begin{cases} \text{lev }(s, \text{tail }(d)) + 1 \\ \text{lev }(\text{tail }(s), d) + 1 \\ \text{lev }(\text{tail }(s), \text{tail }(d)) + \text{hneq }(s, d) \end{cases} & \text{otherwise} \end{cases}

We can use this to construct a recursive algorithm. However, this would require \mathcal{O} (|s| \cdot |d|) recursive calls. Instead, we will be using an iterative algorithm, the Wagner Fischer algorithm. This is a more or less straight forward implementation of the above definition — except that it works the other way around. Instead of iterating through the strings, finding the shortest number of edits from the current positions to the end of the strings, it keeps track of the shortest path from the beginning to the current position. Which boils down to the same thing, it’s just finding the shortest number of edits of the reversed strings.

We will be calculating a 2-dimensional array distance. distance [i] [j] will be the minimum number of edits required to change the first i characters of the source string (s) to the first j characters of the destination string (d). We then have, in pseudo-code:

for j = 1 .. |d| do
    distance [0] [j] = j for j = 1 .. |d|
end
for i = 1 .. |s| do
    distance [i] [0] = i for i = 1 .. |s|
done
for i = 1 .. |s| do
    for j = 1 .. |d| do
        cost = s [i] == d [j] ? 0 : 1
        distance [i] [j] = min (distance [i - 1] [j]     + 1,
                                distance [i]     [j - 1] + 1,
                                distance [i - 1] [j - 1] + cost)
    done
done
return distance [|s|] [|d|]

Note that strings are indexed 1-based.

This uses \Theta\ (|s| \cdot |d|) memory. We can improve on this. Note that we are using the current and previous rows of the array, that is, in the inner loop, we only index by i and i - 1. With a little rearrangement and deletion of rows we no longer use, we can reduce the memory to \mathcal{O}\ (|s| + |d|):

for i = 0 .. |s| do
    for j = 0 .. |d| do
        if i == 0 || j == 0
        then
            distance [i] [j] = i + j
        else
            cost = s [i] == d [j] ? 0 : 1
            distance [i] [j] = min (distance [i - 1] [j]     + 1,
                                    distance [i]     [j - 1] + 1,
                                    distance [i - 1] [j - 1] + cost)
        end
    end
    if i > 0
    then
        delete distance [i]
    end
end
return distance [|s|] [|d|]

By just using two one dimensional arrays, and swapping the strings if the destination string is larger than the source string, we could reduce the memory usage to \mathcal{O}\ (\text{min }(|s|, |d|)), but we decided not to bother.

Implementation of the algorithm above in various languages is now easy.

Solutions

Perl

sub LevenshteinDistance ($first, $second) {
    my $distance;
    for (my $i = 0; $i <= length ($first); $i ++) {
        for (my $j = 0; $j <= length ($second); $j ++) {
            $$distance [$i] [$j] =
                $i == 0 || $j == 0 ? $i + $j
              : min ($$distance [$i - 1] [$j]     + 1,
                     $$distance [$i]     [$j - 1] + 1,
                     $$distance [$i - 1] [$j - 1] +
                        (substr ($first,  $i - 1, 1) eq
                         substr ($second, $j - 1, 1) ? 0 : 1))
        }
        undef $$distance [$i - 1] if $i;
    }
    $$distance [-1] [-1];
}

Find the full program on GitHub.

AWK

{
    for (i = 0; i <= length ($1); i ++) {
        for (j = 0; j <= length ($2); j ++) {
            distance [i, j] = i == 0 || j == 0 ? i + j\
                     : min(distance [i - 1, j]     + 1,
                           distance [i,     j - 1] + 1,
                           distance [i - 1, j - 1] +\
                           (substr ($1, i, 1) == substr ($2, j, 1) ? 0 : 1))
        }
    }
    print distance [length ($1), length ($2)]
}

Find the full program on GitHub.

C

size_t min3 (size_t a, size_t b, size_t c) {
    return a < b ? a < c ? a : c
                 : b < c ? b : c;
}

size_t LevenshteinDistance (char * first,  size_t n,
                            char * second, size_t m) {
      
    size_t ** distance;
    if ((distance = malloc ((n + 1) * sizeof (size_t *))) == NULL) {
        fprintf (stderr, "Out of memory\n");
        exit (1);
    }
    for (size_t i = 0; i <= n; i ++) {
        if ((distance [i] = malloc ((m + 1) * sizeof (size_t))) == NULL) {
            fprintf (stderr, "Out of memory\n");
            exit (1);
        }
        for (size_t j = 0; j <= m; j ++) {
            distance [i] [j] = i == 0 || j == 0 ? i + j
                   : min3 (distance [i - 1] [j]     + 1,
                           distance [i]     [j - 1] + 1,
                           distance [i - 1] [j - 1] +
                             (first [i - 1] == second [j - 1] ? 0 : 1));
        }
        if (i) {
            free (distance [i - 1]);
        }
    }
    size_t out = distance [n] [m];
    free (distance [n]);
    free (distance);
    return (out);
}

Find the full program on GitHub.

Lua

function LevenshteinDistance (first, second)
    local n = first  : len ()
    local m = second : len ()
    distance = {}
    for i = 0, n do
        distance [i] = {}
        for j = 0, m do
            if i == 0 or j == 0
            then
                distance [i] [j] = i + j
            else
                local cost = 1
                if string . sub (first,  i, i) ==
                   string . sub (second, j, j)
                then
                    cost = 0
                end
                distance [i] [j] = math . min (
                    distance [i - 1] [j]     + 1,
                    distance [i]     [j - 1] + 1,
                    distance [i - 1] [j - 1] + cost
                )
            end
        end

        if i > 0
        then
            distance [i - 1] = nil
        end
    end
    return distance [n] [m]
end

Find the full program on GitHub.

Node.js

function LevenshteinDistance (strings) {
    let first  = strings [0] . split ("");
    let second = strings [1] . split ("");

    let distance = [];

    let N = first  . length;
    let M = second . length;

    for (let i = 0; i <= N; i ++) {
        distance [i] = [];
        for (let j = 0; j <= M; j ++) {
            distance [i] [j] = 
                i == 0 || j == 0 ? i + j
              : Math . min (distance [i - 1] [j]     + 1,
                            distance [i]     [j - 1] + 1,
                            distance [i - 1] [j - 1] +
                               (first  [i - 1] ==
                                second [j - 1] ? 0 : 1))
        }
        if (i) {
            distance [i - 1] = 0
        }
    }
    return distance [N] [M]
}

Find the full program on GitHub.

Python

def LevenshteinDistance (first, second):
    n = len (first)
    m = len (second)
    distance = []
    for i in range (n + 1):
        distance . append ([])
        for j in range (m + 1):
            distance [i] . append (i + j if i == 0 or j == 0 else
                         min (distance [i - 1] [j]     + 1,
                              distance [i]     [j - 1] + 1,
                              distance [i - 1] [j - 1] +
                                 (0 if first  [i - 1] ==
                                       second [j - 1] else 1)))
    return distance [n] [m]

Find the full program on GitHub.

Ruby

def LevenshteinDistance (first, second)
    n = first  . length
    m = second . length
    distance = []
    for i in 0 .. n do
        distance [i] = []
        for j in 0 .. m do
            distance [i] [j] = i == 0 || j == 0 ? i + j
                    : [distance [i - 1] [j]     + 1,
                       distance [i]     [j - 1] + 1,
                       distance [i - 1] [j - 1] +
                          (first [i - 1] == second [j - 1] ? 0 : 1)] . min
        end

        if i > 1
            distance [i - 1] = nil
        end
    end
    return distance [n] [m]
end

Find the full program on GitHub.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s