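For example, I have a string "BHEEMAVARAM ANGADHRAM" in one column of a DataFrame and another string "B GANGADHRAM" in a different column of the same DataFrame. For every row, I want to split the strings from these two columns at each space character, store the words in two separate lists, and check whether the lists contain similar items. Where they do, I want to iterate over the characters to find the character-level differences. The script should be written in Python and store the difference count in the DataFrame. What changes do I need to make to the code below, and how can I find out which word had its characters changed?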
# Function to compare strings and count differences
def compare_strings(string1, string2):
    list1 = string1.split()  # Split string1 into a list of words
    list2 = string2.split()  # Split string2 into a list of words
    common_words = set(list1) & set(list2)  # Find common words
    differences = []
    for word in common_words:
        index1 = list1.index(word)
        index2 = list2.index(word)
        # Compare characters in common words
        for char1, char2 in zip(string1[index1:], string2[index2:]):
            if char1 != char2:
                differences.append((word, char1, char2))
    return len(differences)

df['differences'] = df.apply(lambda row: compare_strings(str(row['FatherName']), str(row['Original_Fname'])), axis=1)
For example, for many of these strings the output in characterchanges_trainee is wrong. What could be the reason? Also, can spaCy (NLP) be used for this task?
You can use difflib.
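Example that counts the minimal number of differences (a substituted character counts as one removal plus one addition):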
import difflib
import pandas as pd

df = pd.DataFrame({'A': ['LODUGU VARDHAN KUMAR', 'TALARI KAMMAGIRI RAJU', 'CHUNCHULA CHENNAKESAVA RAO', 'UPPARA VEERANARAYANASWAMI'],
                   'B': ['LODOGU VARDHAN KUMAR', 'TALARI KAMBAGIRI RAJU', 'CHUNCULA CHENNAKESAVA RAO', 'UPPARA VEERANARAYANA SWAMI']})

def ndiff(a, b):
    # Count every character that difflib marks as added ('+') or removed ('-')
    return sum(x[0] != ' ' for x in difflib.ndiff(a, b))

df['ndiff'] = [ndiff(a, b) for a, b in zip(df['A'], df['B'])]
Output:
                            A                           B  ndiff
0        LODUGU VARDHAN KUMAR        LODOGU VARDHAN KUMAR      2
1       TALARI KAMMAGIRI RAJU       TALARI KAMBAGIRI RAJU      2
2  CHUNCHULA CHENNAKESAVA RAO   CHUNCULA CHENNAKESAVA RAO      1
3   UPPARA VEERANARAYANASWAMI  UPPARA VEERANARAYANA SWAMI      1
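
The count above covers the whole string, but the question also asks which word changed. A rough sketch of one way to get that, assuming it is acceptable to align the two word lists with difflib.SequenceMatcher first and then count character differences only inside the word groups that do not match. The function name word_changes is hypothetical and not part of the code above:

```python
import difflib

def word_changes(a, b):
    """Return (text_in_a, text_in_b, n_changed_chars) for each differing word group."""
    words_a, words_b = a.split(), b.split()
    # Align the two word lists; "equal" blocks are words present in both
    matcher = difflib.SequenceMatcher(None, words_a, words_b, autojunk=False)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        old = " ".join(words_a[i1:i2])  # word(s) only on the A side
        new = " ".join(words_b[j1:j2])  # word(s) only on the B side
        # Same character count as ndiff() above, restricted to this group
        n = sum(line[0] != " " for line in difflib.ndiff(old, new))
        changes.append((old, new, n))
    return changes

print(word_changes("LODUGU VARDHAN KUMAR", "LODOGU VARDHAN KUMAR"))
# [('LODUGU', 'LODOGU', 2)]
```

To apply it to the DataFrame, something like df['changes'] = [word_changes(a, b) for a, b in zip(df['A'], df['B'])] should work. Note that when no word is shared between the two strings, the whole strings end up compared as a single group.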