How can I efficiently remove repeated consecutive phrases in Python (for large amounts of text)? For example:
original = "hello ... hello ... hi test this test this"
new = "hello ... hi test this"
original_2 = "hello ... space hello ..."
new_2 = "hello ... space hello ..."
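I have seen many solutions for duplicated words, but could not find much for duplicated word pairs, triples, and so on.
As suggested in the comments above, split() can be used for this, so here is an example script that does it: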
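original = "hello ... hello ... hi test this test this"
words = original.split()  # split on whitespace
seen = set()  # words that have already been emitted
unique_words = []
for word in words:
    if word not in seen:
        # keep only the first occurrence of each word
        unique_words.append(word)
        seen.add(word)
new = ' '.join(unique_words)
print(new)  # hello ... hi test this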
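This works for this short phrase and can be adapted for large strings. Note, however, that it drops every repeated word anywhere in the text, not only consecutive repeats, so original_2 from the question would come out as "hello ... space" instead of staying unchanged.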
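Since the question is about phrases that repeat back to back rather than words that repeat anywhere, a token-based approach may be closer to the goal. The following is only a minimal sketch under a few assumptions: the function name dedupe_consecutive_phrases and the max_len parameter (the longest phrase, in words, that is checked) are hypothetical, words are taken to be whitespace-separated tokens, and runs of whitespace are collapsed to single spaces in the result.

```python
def dedupe_consecutive_phrases(text, max_len=5):
    """Collapse word sequences (up to max_len words) that repeat back to back.

    Hypothetical helper for illustration; max_len is an assumed bound that
    keeps the work per word small on large inputs.
    """
    out = []
    for word in text.split():
        out.append(word)
        collapsed = True
        # After each append, check whether the tail of the output now ends
        # with the same n-word phrase twice in a row; if so, drop one copy.
        while collapsed:
            collapsed = False
            for n in range(1, max_len + 1):
                if len(out) < 2 * n:
                    break  # not enough words for two phrases of length n
                if out[-2 * n:-n] == out[-n:]:
                    del out[-n:]
                    collapsed = True
                    break
    return " ".join(out)


print(dedupe_consecutive_phrases("hello ... hello ... hi test this test this"))
# hello ... hi test this
print(dedupe_consecutive_phrases("hello ... space hello ..."))
# hello ... space hello ...
```

Each word is compared against at most max_len candidate phrase lengths, so the cost grows roughly linearly with the text length for a fixed max_len. Raising max_len catches longer repeated phrases at the price of more comparisons per word.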