Class StringDistance

java.lang.Object
com.github.yellowstonegames.text.StringDistance

public class StringDistance extends Object
Estimates how different two String inputs are. This can be configured with different costs for inserting, deleting, replacing, or swapping characters.
Uses the Damerau-Levenshtein Algorithm, an extension to the Levenshtein Algorithm which solves the edit distance problem between a source string and a target string with the following operations:
  • Character Insertion
  • Character Deletion
  • Character Replacement
  • Adjacent Character Swap
Note that the adjacent character swap operation is an edit that may be applied when two adjacent characters in the source string match two adjacent characters in the target string, but in reverse order, rather than a general allowance for adjacent character swaps.
This implementation allows the client to specify the costs of the various edit operations with the restriction that the cost of two swap operations must not be less than the cost of a delete operation followed by an insert operation. This restriction is required to preclude two swaps involving the same character being required for optimality which, in turn, enables a fast dynamic programming solution.
The running time of the Damerau-Levenshtein algorithm is O(n*m) where n is the length of the source string and m is the length of the target string. This implementation consumes O(n*m) space on the first call to distance(CharSequence, CharSequence), but won't always consume more space than that - the total this uses is O(n*m) for the largest value of n*m requested on any call to distance() on the same instance.
  • Field Summary

    Fields
    Modifier and Type
    Field
    Description
    final int
    The cost to delete one char from anywhere in source on the way to target.
    final int
    The cost to insert one char into anywhere in source on the way to target.
    final int
    The cost to change one char in source to any other at the same position, on the way to target.
    final int
    The cost to switch the positions of two adjacent chars in source on the way to target.
  • Constructor Summary

    Constructors
    Constructor
    Description
    Default constructor; sets all costs to 1.
    StringDistance(int deleteCost, int insertCost, int replaceCost, int swapCost)
    Used to customize the costs of different modifications.
  • Method Summary

    Modifier and Type
    Method
    Description
    int
    Compute the Damerau-Levenshtein distance between the specified source string and the specified target string.

    Methods inherited from class Object

    clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
  • Field Details

    • deleteCost

      public final int deleteCost
      The cost to delete one char from anywhere in source on the way to target. The following must be true: swapCost * 2 >= insertCost + deleteCost
    • insertCost

      public final int insertCost
      The cost to insert one char into anywhere in source on the way to target. The following must be true: swapCost * 2 >= insertCost + deleteCost
    • replaceCost

      public final int replaceCost
      The cost to change one char in source to any other at the same position, on the way to target.
    • swapCost

      public final int swapCost
      The cost to switch the positions of two adjacent chars in source on the way to target. The following must be true: swapCost * 2 >= insertCost + deleteCost
  • Constructor Details

    • StringDistance

      public StringDistance()
      Default constructor; sets all costs to 1.
    • StringDistance

      public StringDistance(int deleteCost, int insertCost, int replaceCost, int swapCost)
      Used to customize the costs of different modifications.
      Parameters:
      deleteCost - the cost of deleting a character.
      insertCost - the cost of inserting a character.
      replaceCost - the cost of replacing a character.
      swapCost - the cost of swapping two adjacent characters. The following must be true: swapCost * 2 >= insertCost + deleteCost
  • Method Details

    • distance

      public int distance(CharSequence source, CharSequence target)
      Compute the Damerau-Levenshtein distance between the specified source string and the specified target string.
      Parameters:
      source - the starting String; insertion, deletion, etc. are considered as if applied to this
      target - the goal String; we are measuring the edit distance to get from source to target
      Returns:
      the edit distance from source to target with the configured costs, as an int