Class StringDistance
java.lang.Object
com.github.yellowstonegames.text.StringDistance
Estimates how different two String inputs are. This can be configured with
different costs for inserting, deleting, replacing, or swapping characters.
Uses the Damerau-Levenshtein Algorithm, an extension to the Levenshtein Algorithm which solves the edit distance problem between a source string and a target string with the following operations:
This implementation allows the client to specify the costs of the various edit operations with the restriction that the cost of two swap operations must not be less than the cost of a delete operation followed by an insert operation. This restriction is required to preclude two swaps involving the same character being required for optimality which, in turn, enables a fast dynamic programming solution.
The running time of the Damerau-Levenshtein algorithm is O(n*m) where n is the length of the source string and m is the length of the target string. This implementation consumes O(n*m) space on the first call to
Uses the Damerau-Levenshtein Algorithm, an extension to the Levenshtein Algorithm which solves the edit distance problem between a source string and a target string with the following operations:
- Character Insertion
- Character Deletion
- Character Replacement
- Adjacent Character Swap
This implementation allows the client to specify the costs of the various edit operations with the restriction that the cost of two swap operations must not be less than the cost of a delete operation followed by an insert operation. This restriction is required to preclude two swaps involving the same character being required for optimality which, in turn, enables a fast dynamic programming solution.
The running time of the Damerau-Levenshtein algorithm is O(n*m) where n is the length of the source string and m is the length of the target string. This implementation consumes O(n*m) space on the first call to
distance(CharSequence, CharSequence), but won't always consume more
space than that - the total this uses is O(n*m) for the largest value of n*m
requested on any call to distance() on the same instance.-
Field Summary
FieldsModifier and TypeFieldDescriptionfinal intThe cost to delete one char from anywhere in source on the way to target.final intThe cost to insert one char into anywhere in source on the way to target.final intThe cost to change one char in source to any other at the same position, on the way to target.final intThe cost to switch the positions of two adjacent chars in source on the way to target. -
Constructor Summary
ConstructorsConstructorDescriptionDefault constructor; sets all costs to 1.StringDistance(int deleteCost, int insertCost, int replaceCost, int swapCost) Used to customize the costs of different modifications. -
Method Summary
Modifier and TypeMethodDescriptionintdistance(CharSequence source, CharSequence target) Compute the Damerau-Levenshtein distance between the specified source string and the specified target string.
-
Field Details
-
deleteCost
public final int deleteCostThe cost to delete one char from anywhere in source on the way to target. The following must be true:swapCost * 2 >= insertCost + deleteCost -
insertCost
public final int insertCostThe cost to insert one char into anywhere in source on the way to target. The following must be true:swapCost * 2 >= insertCost + deleteCost -
replaceCost
public final int replaceCostThe cost to change one char in source to any other at the same position, on the way to target. -
swapCost
public final int swapCostThe cost to switch the positions of two adjacent chars in source on the way to target. The following must be true:swapCost * 2 >= insertCost + deleteCost
-
-
Constructor Details
-
StringDistance
public StringDistance()Default constructor; sets all costs to 1. -
StringDistance
public StringDistance(int deleteCost, int insertCost, int replaceCost, int swapCost) Used to customize the costs of different modifications.- Parameters:
deleteCost- the cost of deleting a character.insertCost- the cost of inserting a character.replaceCost- the cost of replacing a character.swapCost- the cost of swapping two adjacent characters. The following must be true:swapCost * 2 >= insertCost + deleteCost
-
-
Method Details
-
distance
Compute the Damerau-Levenshtein distance between the specified source string and the specified target string.- Parameters:
source- the starting String; insertion, deletion, etc. are considered as if applied to thistarget- the goal String; we are measuring the edit distance to get from source to target- Returns:
- the edit distance from source to target with the configured costs, as an int
-