# Perl Weekly Challenge 96: Edit Distance

## Challenge

You are given two strings $S1 and $S2.

Write a script to find out the minimum operations required to convert $S1 into $S2. The operations can be insert, remove or replace a character. Please check out [the] Wikipedia page for more information.

## Overview

What we are asked to do, is to calculate the Levenshtein distance between two strings. If we have two strings, $source and $dest, we can recursively define how to go from $source to $dest using inserts, deletes, and replacements. Let head ($str) be the first character of a string $str, and tail ($str) be the string, except the first character. Then: • If $source is the empty string, we add each of the characters of $dest. • If $dest is the empty string, we delete each of the characters of $source. • If head ($source) equals head ($dest), then we recurse with tail ($source) and tail ($dest). • Otherwise, we either: • add head ($dest), and recurse with $source and tail ($dest).
• delete head ($dest), and recurse with tail ($source) and $dest. • replace head ($source) with head ($tail), and recurse with tail ($source) and tail ($dest). Since we just want to number of operations, we can translate this. Let $\text{lev }(s, d)$ be the Levenshtein distance between strings $s$ and $d$. Let $|t|$ be the length of the string $t$, $\text{head }(t)$ the first character of the string $t$ and $\text{tail }(t)$ the string $t$ minus the first character. Then: $\text{lev }(s, d) = \begin{cases} |a| & \text{if }|b| = 0\\ |b| & \text{if }|a| = 0\\ \text{lev }(\text{tail }(s), \text{tail }(d)) & \text{if }\text{head }(s) = \text{head }(d) \\ 1 + \text{min } \begin{cases}\text{lev }(\text{tail }(s), d) \\ \text{lev }(s, \text{tail }(d)) \\ \text{lev }(\text{tail }(s), \text{tail }(d)) \end{cases} & \text{otherwise} \end{cases}$ We can rearrange this. Define $\text{hneq }(s, d)$ as: $\text{hneq }(s, d) = \begin{cases} 1, & \text{if }\text{head }(s) \neq \text{head }(d) \\ 0, & \text{if }\text{head }(s) = \text{head }(d) \end{cases}$ That is, $\text{hneq }(s, d)$ is 1 if the first characters differ, and 0 if they are equal. Then: $\text{lev }(s, d) = \begin{cases} |a| + |b| & \text{if }|a| = 0 \text{ or }|b| = 0 \\ \text{min }\begin{cases} \text{lev }(s, \text{tail }(d)) + 1 \\ \text{lev }(\text{tail }(s), d) + 1 \\ \text{lev }(\text{tail }(s), \text{tail }(d)) + \text{hneq }(s, d) \end{cases} & \text{otherwise} \end{cases}$ We can use this to construct a recursive algorithm. However, this would require $\mathcal{O} (|s| \cdot |d|)$ recursive calls. Instead, we will be using an iterative algorithm, the Wagner Fischer algorithm. This is a more or less straight forward implementation of the above definition — except that it works the other way around. Instead of iterating through the strings, finding the shortest number of edits from the current positions to the end of the strings, it keeps track of the shortest path from the beginning to the current position. Which boils down to the same thing, it’s just finding the shortest number of edits of the reversed strings. We will be calculating a 2-dimensional array distance. distance [i] [j] will be the minimum number of edits required to change the first i characters of the source string (s) to the first j characters of the destination string (d). We then have, in pseudo-code: for j = 1 .. |d| do distance  [j] = j for j = 1 .. |d| end for i = 1 .. |s| do distance [i]  = i for i = 1 .. |s| done for i = 1 .. |s| do for j = 1 .. |d| do cost = s [i] == d [j] ? 0 : 1 distance [i] [j] = min (distance [i - 1] [j] + 1, distance [i] [j - 1] + 1, distance [i - 1] [j - 1] + cost) done done return distance [|s|] [|d|]  Note that strings are indexed 1-based. This uses $\Theta\ (|s| \cdot |d|)$ memory. We can improve on this. Note that we are using the current and previous rows of the array, that is, in the inner loop, we only index by i and i - 1. With a little rearrangement and deletion of rows we no longer use, we can reduce the memory to $\mathcal{O}\ (|s| + |d|)$: for i = 0 .. |s| do for j = 0 .. |d| do if i == 0 || j == 0 then distance [i] [j] = i + j else cost = s [i] == d [j] ? 0 : 1 distance [i] [j] = min (distance [i - 1] [j] + 1, distance [i] [j - 1] + 1, distance [i - 1] [j - 1] + cost) end end if i > 0 then delete distance [i] end end return distance [|s|] [|d|]  By just using two one dimensional arrays, and swapping the strings if the destination string is larger than the source string, we could reduce the memory usage to $\mathcal{O}\ (\text{min }(|s|, |d|))$, but we decided not to bother. Implementation of the algorithm above in various languages is now easy. # Solutions ## Perl sub LevenshteinDistance ($first, $second) { my$distance;
for (my $i = 0;$i <= length ($first);$i ++) {
for (my $j = 0;$j <= length ($second);$j ++) {
$$distance [i] [j] = i == 0 || j == 0 ? i + j : min ($$distance [$i - 1] [$j]     + 1,
$$distance [i] [j - 1] + 1,$$distance [$i - 1] [$j - 1] +
(substr ($first,$i - 1, 1) eq
substr ($second,$j - 1, 1) ? 0 : 1))
}
undef $$distance [i - 1] if i; }$$distance [-1] [-1];
}


Find the full program on GitHub.

## AWK

{
for (i = 0; i <= length ($1); i ++) { for (j = 0; j <= length ($2); j ++) {
distance [i, j] = i == 0 || j == 0 ? i + j\
: min(distance [i - 1, j]     + 1,
distance [i,     j - 1] + 1,
distance [i - 1, j - 1] +\
(substr ($1, i, 1) == substr ($2, j, 1) ? 0 : 1))
}
}
print distance [length ($1), length ($2)]
}


Find the full program on GitHub.

## C

size_t min3 (size_t a, size_t b, size_t c) {
return a < b ? a < c ? a : c
: b < c ? b : c;
}

size_t LevenshteinDistance (char * first,  size_t n,
char * second, size_t m) {

size_t ** distance;
if ((distance = malloc ((n + 1) * sizeof (size_t *))) == NULL) {
fprintf (stderr, "Out of memory\n");
exit (1);
}
for (size_t i = 0; i <= n; i ++) {
if ((distance [i] = malloc ((m + 1) * sizeof (size_t))) == NULL) {
fprintf (stderr, "Out of memory\n");
exit (1);
}
for (size_t j = 0; j <= m; j ++) {
distance [i] [j] = i == 0 || j == 0 ? i + j
: min3 (distance [i - 1] [j]     + 1,
distance [i]     [j - 1] + 1,
distance [i - 1] [j - 1] +
(first [i - 1] == second [j - 1] ? 0 : 1));
}
if (i) {
free (distance [i - 1]);
}
}
size_t out = distance [n] [m];
free (distance [n]);
free (distance);
return (out);
}


Find the full program on GitHub.

## Lua

function LevenshteinDistance (first, second)
local n = first  : len ()
local m = second : len ()
distance = {}
for i = 0, n do
distance [i] = {}
for j = 0, m do
if i == 0 or j == 0
then
distance [i] [j] = i + j
else
local cost = 1
if string . sub (first,  i, i) ==
string . sub (second, j, j)
then
cost = 0
end
distance [i] [j] = math . min (
distance [i - 1] [j]     + 1,
distance [i]     [j - 1] + 1,
distance [i - 1] [j - 1] + cost
)
end
end

if i > 0
then
distance [i - 1] = nil
end
end
return distance [n] [m]
end


Find the full program on GitHub.

## Node.js

function LevenshteinDistance (strings) {
let first  = strings  . split ("");
let second = strings  . split ("");

let distance = [];

let N = first  . length;
let M = second . length;

for (let i = 0; i <= N; i ++) {
distance [i] = [];
for (let j = 0; j <= M; j ++) {
distance [i] [j] =
i == 0 || j == 0 ? i + j
: Math . min (distance [i - 1] [j]     + 1,
distance [i]     [j - 1] + 1,
distance [i - 1] [j - 1] +
(first  [i - 1] ==
second [j - 1] ? 0 : 1))
}
if (i) {
distance [i - 1] = 0
}
}
return distance [N] [M]
}


Find the full program on GitHub.

## Python

def LevenshteinDistance (first, second):
n = len (first)
m = len (second)
distance = []
for i in range (n + 1):
distance . append ([])
for j in range (m + 1):
distance [i] . append (i + j if i == 0 or j == 0 else
min (distance [i - 1] [j]     + 1,
distance [i]     [j - 1] + 1,
distance [i - 1] [j - 1] +
(0 if first  [i - 1] ==
second [j - 1] else 1)))
return distance [n] [m]


Find the full program on GitHub.

## Ruby

def LevenshteinDistance (first, second)
n = first  . length
m = second . length
distance = []
for i in 0 .. n do
distance [i] = []
for j in 0 .. m do
distance [i] [j] = i == 0 || j == 0 ? i + j
: [distance [i - 1] [j]     + 1,
distance [i]     [j - 1] + 1,
distance [i - 1] [j - 1] +
(first [i - 1] == second [j - 1] ? 0 : 1)] . min
end

if i > 1
distance [i - 1] = nil
end
end
return distance [n] [m]
end


Find the full program on GitHub.