检测两个字符串之间的差异

问题描述 投票:31回答:5

我有2个字符串

string a = "foo bar";
string b = "bar foo";

我想检测从ab的变化。从ab我需要改变什么角色?

我认为必须对每个字符进行迭代,并检测它是否被添加,删除或保持相等。所以这是我的表达结果

'f' Remove
'o' Remove
'o' Remove
' ' Remove
'b' Equal
'a' Equal
'r' Equal
' ' Add
'f' Add
'o' Add
'o' Add

结果的类和枚举:

public enum Operation { Add,Equal,Remove };
public class Difference
{
    public Operation op { get; set; }
    public char c { get; set; }
}

这是我的解决方案,但“删除”案例对我来说并不清楚代码的外观

public static List<Difference> CalculateDifferences(string left, string right)
{
    int count = 0;
    List<Difference> result = new List<Difference>();
    foreach (char ch in left)
    {
        int index = right.IndexOf(ch, count);
        if (index == count)
        {
            count++;
            result.Add(new Difference() { c = ch, op = Operation.Equal });
        }
        else if (index > count)
        {
            string add = right.Substring(count, index - count);
            result.AddRange(add.Select(x => new Difference() { c = x, op = Operation.Add }));
            count += add.Length;
        }
        else
        {
            //Remove?
        }
    }
    return result;
}

对于删除的字符,代码必须如何?


更新 - 添加了一些示例

例1:

string a = "foobar";
string b = "fooar";

预期结果:

'f' Equal
'o' Equal
'o' Equal
'b' Remove
'a' Equal
'r' Equal

例2:

string a = "asdfghjk";
string b = "wsedrftr";

预期结果:

'a' Remove
'w' Add
's' Equal
'e' Add
'd' Equal
'r' Add
'f' Equal
'g' Remove
'h' Remove
'j' Remove
'k' Remove
't' Add
'r' Add

更新:

这是Dmitry和ingen的答案之间的比较:https://dotnetfiddle.net/MJQDAO

c# string algorithm
5个回答
19
投票

您正在寻找(最小)编辑距离/(最小)编辑序列。你可以在这里找到这个过程的理论:

https://web.stanford.edu/class/cs124/lec/med.pdf

让我们实现(最简单的)Levenshtein距离/序列算法(详见qazxsw poi)。让我们从辅助类开始(我已经改变了一些你的实现):

https://en.wikipedia.org/wiki/Levenshtein_distance

据我所知,我们没有任何编辑操作,但添加+删除;这就是为什么我把 public enum EditOperationKind : byte { None, // Nothing to do Add, // Add new character Edit, // Edit character into character (including char into itself) Remove, // Delete existing character }; public struct EditOperation { public EditOperation(char valueFrom, char valueTo, EditOperationKind operation) { ValueFrom = valueFrom; ValueTo = valueTo; Operation = valueFrom == valueTo ? EditOperationKind.None : operation; } public char ValueFrom { get; } public char ValueTo {get ;} public EditOperationKind Operation { get; } public override string ToString() { switch (Operation) { case EditOperationKind.None: return $"'{ValueTo}' Equal"; case EditOperationKind.Add: return $"'{ValueTo}' Add"; case EditOperationKind.Remove: return $"'{ValueFrom}' Remove"; case EditOperationKind.Edit: return $"'{ValueFrom}' to '{ValueTo}' Edit"; default: return "???"; } } } 放在editCost = 2insertCost = 1(如果是领带:int removeCost = 1insert + remove我们放edit)。现在我们准备实施Levenstein算法:

insert + remove

演示:

public static EditOperation[] EditSequence(
  string source, string target, 
  int insertCost = 1, int removeCost = 1, int editCost = 2) {

  if (null == source)
    throw new ArgumentNullException("source");
  else if (null == target)
    throw new ArgumentNullException("target");

  // Forward: building score matrix

  // Best operation (among insert, update, delete) to perform 
  EditOperationKind[][] M = Enumerable
    .Range(0, source.Length + 1)
    .Select(line => new EditOperationKind[target.Length + 1])
    .ToArray();

  // Minimum cost so far
  int[][] D = Enumerable
    .Range(0, source.Length + 1)
    .Select(line => new int[target.Length + 1])
    .ToArray();

  // Edge: all removes
  for (int i = 1; i <= source.Length; ++i) {
    M[i][0] = EditOperationKind.Remove;
    D[i][0] = removeCost * i;
  }

  // Edge: all inserts 
  for (int i = 1; i <= target.Length; ++i) {
    M[0][i] = EditOperationKind.Add;
    D[0][i] = insertCost * i;
  }

  // Having fit N - 1, K - 1 characters let's fit N, K
  for (int i = 1; i <= source.Length; ++i)
    for (int j = 1; j <= target.Length; ++j) {
      // here we choose the operation with the least cost
      int insert = D[i][j - 1] + insertCost;
      int delete = D[i - 1][j] + removeCost;
      int edit = D[i - 1][j - 1] + (source[i - 1] == target[j - 1] ? 0 : editCost);

      int min = Math.Min(Math.Min(insert, delete), edit);

      if (min == insert) 
        M[i][j] = EditOperationKind.Add;
      else if (min == delete)
        M[i][j] = EditOperationKind.Remove;
      else if (min == edit)
        M[i][j] = EditOperationKind.Edit;

      D[i][j] = min;
    }

  // Backward: knowing scores (D) and actions (M) let's building edit sequence
  List<EditOperation> result = 
    new List<EditOperation>(source.Length + target.Length);

  for (int x = target.Length, y = source.Length; (x > 0) || (y > 0);) {
    EditOperationKind op = M[y][x];

    if (op == EditOperationKind.Add) {
      x -= 1;
      result.Add(new EditOperation('\0', target[x], op));
    }
    else if (op == EditOperationKind.Remove) {
      y -= 1;
      result.Add(new EditOperation(source[y], '\0', op));
    }
    else if (op == EditOperationKind.Edit) {
      x -= 1;
      y -= 1;
      result.Add(new EditOperation(source[y], target[x], op));
    }
    else // Start of the matching (EditOperationKind.None)
      break;
  }

  result.Reverse();

  return result.ToArray();
}

结果:

var sequence = EditSequence("asdfghjk", "wsedrftr"); 

Console.Write(string.Join(Environment.NewLine, sequence));

8
投票

我会在这里提出一个并不是最有效的算法,但很容易推理。

我们首先介绍一些基础:

1)订单很重要

'a' Remove
'w' Add
's' Equal
'e' Add
'd' Equal
'r' Add
'f' Equal
'g' Remove
'h' Remove
'j' Remove
'k' Remove
't' Add
'r' Add

即使“bar”和“foo”出现在两个字符串中,“bar”也需要删除并稍后再添加。这也告诉我们string before = "bar foo" string after = "foo bar" 字符串为我们提供了我们感兴趣的字符顺序,我们首先想要“foo”。

2)订单超过计数

另一种看待它的方式是,一些角色可能永远不会轮到他们。

after

只有“酒吧”的大胆的角色,在“abracadabra”中得到他们的发言权。即使我们在两个字符串中都有两个b,但只有第一个才算数。当我们到达“bar bar”中的第二个b时,“abracadabra”中的第二个b已经通过,当我们正在寻找第一次出现'r'时。

3)障碍

障碍是两个字符串中存在的字符,需要考虑顺序和计数。这已经暗示一个集合可能不是最合适的数据结构,因为我们会失去数量。

输入

string before = "abracadabra"
string after = "bar bar"

我们得到(伪代码)

string before = "pinata"
string after = "accidental"

“皮纳塔”

“偶然”

让我们按照执行流程:

  • 'a'是第一道屏障,它也是var barriers = { 'a', 't', 'a' } 的第一个字符,因此可以删除after中第一个'a'前面的所有内容。 “pinata” - >“ata”
  • 第二个障碍是't',它不在我们的before字符串的下一个位置,所以我们可以在它们之间插入所有内容。 “ata” - >“accidenta”
  • 第三个障碍'a'已经处于下一个位置,因此我们可以在不做任何实际工作的情况下进入下一个障碍。
  • 没有更多障碍,但我们的字符串长度仍然低于after,所以会有一些后期处理。 “accidenta” - >“偶然”

注意'我'和'n'不再玩,再次,订单超过计数。


履行

我们已经确定了这个订单和计数问题,我想到了after

Queue

3
投票

我测试了上面的3个例子,它正确而完美地返回了预期的结果。

static public List<Difference> CalculateDifferences(string before, string after)
{
    List<Difference> result = new List<Difference>();
    Queue<char> barriers = new Queue<char>();

    #region Preprocessing
    int index = 0;
    for (int i = 0; i < after.Length; i++)
    {
        // Look for the first match starting at index
        int match = before.IndexOf(after[i], index);
        if (match != -1)
        {
            barriers.Enqueue(after[i]);
            index = match + 1;
        }
    }
    #endregion

    #region Queue Processing
    index = 0;
    while (barriers.Any())
    {
        char barrier = barriers.Dequeue();
        // Get the offset to the barrier in both strings, 
        // ignoring the part that's already been handled
        int offsetBefore = before.IndexOf(barrier, index) - index;
        int offsetAfter = after.IndexOf(barrier, index) - index;
        // Remove prefix from 'before' string
        if (offsetBefore > 0)
        {
            RemoveChars(before.Substring(index, offsetBefore), result);
            before = before.Substring(offsetBefore);
        }
        // Insert prefix from 'after' string
        if (offsetAfter > 0)
        {
            string substring = after.Substring(index, offsetAfter);
            AddChars(substring, result);
            before = before.Insert(index, substring);
            index += substring.Length;
        }
        // Jump over the barrier
        KeepChar(barrier, result);
        index++;
    }
    #endregion

    #region Post Queue processing
    if (index < before.Length)
    {
        RemoveChars(before.Substring(index), result);
    }
    if (index < after.Length)
    {
        AddChars(after.Substring(index), result);
    }
    #endregion

    return result;
}

static private void KeepChar(char barrier, List<Difference> result)
{
    result.Add(new Difference()
    {
        c = barrier,
        op = Operation.Equal
    });
}

static private void AddChars(string substring, List<Difference> result)
{
    result.AddRange(substring.Select(x => new Difference()
    {
        c = x,
        op = Operation.Add
    }));
}

static private void RemoveChars(string substring, List<Difference> result)
{
    result.AddRange(substring.Select(x => new Difference()
    {
        c = x,
        op = Operation.Remove
    }));
}

3
投票

这可能是另一种解决方案,完整代码和评论。但是,您的第一个原始示例的结果是反转的:

        int flag = 0;
        int flag_2 = 0;

        string a = "asdfghjk";
        string b = "wsedrftr";

        char[] array_a = a.ToCharArray();
        char[] array_b = b.ToCharArray();

        for (int i = 0,j = 0, n= 0; i < array_b.Count(); i++)
        {   
            //Execute 1 time until reach first equal character   
            if(i == 0 && a.Contains(array_b[0]))
            {
                while (array_a[n] != array_b[0])
                {
                    Console.WriteLine(String.Concat(array_a[n], " : Remove"));
                    n++;
                }
                Console.WriteLine(String.Concat(array_a[n], " : Equal"));
                n++;
            }
            else if(i == 0 && !a.Contains(array_b[0]))
            {
                Console.WriteLine(String.Concat(array_a[n], " : Remove"));
                n++;
                Console.WriteLine(String.Concat(array_b[0], " : Add"));
            }


            else
            {
                if(n < array_a.Count())
                {
                    if (array_a[n] == array_b[i])
                    {
                        Console.WriteLine(String.Concat(array_a[n], " : Equal"));
                        n++;
                    }
                    else
                    {
                        flag = 0;
                        for (int z = n; z < array_a.Count(); z++)
                        {                              
                            if (array_a[z] == array_b[i])
                            {
                                flag = 1;
                                break;
                            }                                                              
                        }

                        if (flag == 0)
                        {
                            flag_2 = 0;
                            for (int aa = i; aa < array_b.Count(); aa++)
                            {
                                for(int bb = n; bb < array_a.Count(); bb++)
                                {
                                    if (array_b[aa] == array_a[bb])
                                    {
                                        flag_2 = 1;
                                        break;
                                    }
                                }
                            }

                            if(flag_2 == 1)
                            {
                                Console.WriteLine(String.Concat(array_b[i], " : Add"));
                            }
                            else
                            {
                                for (int z = n; z < array_a.Count(); z++)
                                {
                                    Console.WriteLine(String.Concat(array_a[z], " : Remove"));
                                    n++;
                                }
                                 Console.WriteLine(String.Concat(array_b[i], " : Add"));
                            }

                        }
                        else
                        {
                            Console.WriteLine(String.Concat(array_a[n], " : Remove"));
                            i--;
                            n++;
                        }

                    }
                }
                else
                {
                    Console.WriteLine(String.Concat(array_b[i], " : Add"));
                }

            }

        }//end for


        MessageBox.Show("Done");


    //OUTPUT CONSOLE:
    /*
    a : Remove
    w : Add
    s : Equal
    e : Add
    d : Equal
    r : Add
    f : Equal
    g : Remove
    h : Remove
    j : Remove
    k : Remove
    t : Add
    r : Add
    */  

1
投票

如果您想要在两个字符串之间进行精确比较,则必须阅读并理解class Program { enum CharState { Add, Equal, Remove } struct CharResult { public char c; public CharState state; } static void Main(string[] args) { string a = "asdfghjk"; string b = "wsedrftr"; while (true) { Console.WriteLine("Enter string a (enter to quit) :"); a = Console.ReadLine(); if (a == string.Empty) break; Console.WriteLine("Enter string b :"); b = Console.ReadLine(); List<CharResult> result = calculate(a, b); DisplayResults(result); } Console.WriteLine("Press a key to exit"); Console.ReadLine(); } static List<CharResult> calculate(string a, string b) { List<CharResult> res = new List<CharResult>(); int i = 0, j = 0; char[] array_a = a.ToCharArray(); char[] array_b = b.ToCharArray(); while (i < array_a.Length && j < array_b.Length) { //For the current char in a, we check for the equal in b int index = b.IndexOf(array_a[i], j); if (index < 0) //not found, this char should be removed { res.Add(new CharResult() { c = array_a[i], state = CharState.Remove }); i++; } else { //we add all the chars between B's current index and the index while (j < index) { res.Add(new CharResult() { c = array_b[j], state = CharState.Add }); j++; } //then we say the current is the same res.Add(new CharResult() { c = array_a[i], state = CharState.Equal }); i++; j++; } } while (i < array_a.Length) { //b is now empty, we remove the remains res.Add(new CharResult() { c = array_a[i], state = CharState.Remove }); i++; } while (j < array_b.Length) { //a has been treated, we add the remains res.Add(new CharResult() { c = array_b[j], state = CharState.Add }); j++; } return res; } static void DisplayResults(List<CharResult> results) { foreach (CharResult r in results) { Console.WriteLine($"'{r.c}' - {r.state}"); } } } 。通过使用该算法,您可以精确计算两个字符串之间的相似率,也可以回溯算法以获得第二个字符串的变化链。该算法也是自然语言处理的重要指标。

还有其他一些好处,需要时间学习。

在此链接中有一个Levenshtein距离的C#版本:

Levenshtein Distance

© www.soinside.com 2019 - 2024. All rights reserved.