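I can't find the fastest way to check whether a substring occurs in a string as a whole word or term. I'm currently using RegEx, but I need to run thousands of checks and RegEx is very slow.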
There are several ways to approach this. The simplest check is 'substring in string':
substring = "programming"
string = "Python is a high-level programming language"
substring in string
>>> True
However, when the substring has to match as a whole word or term, this is a naive solution, because it also matches inside longer words:
substring = "program"
string = "Python is a high-level programming language"
substring in string
>>> True
Another solution is to split the string into a list of words and check whether the substring is in that list:
substring = "program"
string = "Python is a high-level programming language"
substring in string.split()
>>> False
However, this doesn't work if the substring is a multi-word term. To handle that case, another solution is to use RegEx:
import re
substring = "high-level program"
string = "Python is a high-level programming language"
re.search(r"\b{}\b".format(re.escape(substring)), string) is not None
>>> False
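However, my biggest problem is that this solution is really slow when you need to run thousands of checks.
To mitigate this, I wrote a few approaches that, although faster than RegEx (for my use case), are still much slower than 'substring in string':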
substring = "high-level program"
string = "Python is a high-level programming language"
all([word in string.split() for word in substring.split()])
>>> False
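Although simple, the approach above is not suitable because it ignores the word order of the substring: it returns True when the substring is "programming high-level", unlike the RegEx solution. So I wrote another approach that checks whether the substring appears in a list of n-grams, where each n-gram has the same number of words as the substring: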
from nltk import ngrams
substring = "high-level program"
string = "Python is a high-level programming language"
ngram = list(ngrams(string.split(), len(substring.split())))
substring in [" ".join(tuples) for tuples in ngram]
>>> False
Edit: here is a less slow version that uses the same principle but only built-in functions:
substring = "high-level program"
string = "Python is a high-level programming language"
length = len(substring.split())
words = string.split()
ngrams = [" ".join(words[i:i+length]) for i in range(len(words) - length + 1)]
substring in ngrams
>>> False
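Does anyone know a faster way to find a substring in a string as a whole word or term?
Simply loop over the string, slice it at each position using the substring's length, and compare each slice with the substring. Return True if they are equal.
Illustration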
strs = "Coding"
substr = "ding"
slen = 4
i = 0
check = strs[i:slen+i] == substr
# 1st iteration (i = 0)
# strs[0:4+0] == "ding"
# "Codi" == "ding"  -> False
# 2nd iteration (i = 1)
# strs[1:4+1] == "ding"
# "odin" == "ding"  -> False
# 3rd iteration (i = 2)
# strs[2:4+2] == "ding"
# "ding" == "ding"  -> True
Solution
def str_exist(string, substring, slen):
    # Compare every slice of length slen against the substring.
    for i in range(len(string)):
        if string[i:slen+i] == substring:
            return True
    return False
substring = "high-level program"
string = "Python is a high-level programming language"
slen = len(substring)
print(str_exist(string, substring, slen))
Output
True
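The loop above reports any occurrence, including "program" inside "programming", which is why it prints True here. A minimal sketch of one way to restrict it to whole words follows. It assumes words are separated by spaces only, and the name str_exist_whole is hypothetical, not part of the original answer:

```python
def str_exist_whole(string, substring):
    """Return True if substring occurs in string bounded by spaces or the string ends.

    Assumption: words are separated by spaces only (no punctuation handling).
    """
    slen = len(substring)
    start = string.find(substring)
    while start != -1:
        end = start + slen
        # The match must start the string or be preceded by a space.
        left_ok = start == 0 or string[start - 1] == " "
        # The match must end the string or be followed by a space.
        right_ok = end == len(string) or string[end] == " "
        if left_ok and right_ok:
            return True
        # Not a whole-word match: keep searching from the next position.
        start = string.find(substring, start + 1)
    return False


string = "Python is a high-level programming language"
print(str_exist_whole(string, "high-level program"))      # False
print(str_exist_whole(string, "high-level programming"))  # True
```

Using str.find to jump between candidate positions avoids slicing at every index in Python code, but whether it is faster in practice would need to be measured on the real data.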
Take a look at this. I've added comments to the code to make it easier to understand what the algorithm does.
def check_substr(S: str, sub_str: str) -> bool:
    """
    This function tells whether the given sub-string
    in a string is present or not.
    Parameters
    S: str: The original string
    sub_str: str: The sub-string to be checked
    Returns
    result: boolean: Whether the string is present or not
    """
    i = 0
    pointer = 0
    while (i < len(S)):
        # This means that we are already in that word
        # whose sub-part is already matched. For eg:
        # `program` in `programming`. Therefore we are
        # going to skip the rest of the word and check
        # the next word instead.
        if (S[i] != ' ' and pointer == len(sub_str)):
            while (i < len(S) and S[i] != ' '):
                i += 1
            i += 1
            pointer = 0
            if (i >= len(S)):
                break
        # If we encounter a space, we check whether we
        # have already found the sub-string or not.
        elif (S[i] == ' ' and pointer == len(sub_str)):
            break
        if (S[i] == sub_str[pointer]):
            pointer += 1
        else:
            # If the current element of the original
            # string matched with the first element of
            # the sub-string then we increment the
            # pointer by 1. Otherwise we set it to 0.
            pointer = 1 if (S[i] == sub_str[0]) else 0
        i += 1
    return pointer == len(sub_str)
S = "Python is a high-level programming"
print(check_substr(S, "high-level program"))
print(check_substr(S, "programming language"))
False
False
Time complexity: O(n)
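As @PGHE pointed out in the comments, we could also check for punctuation, not just spaces. Since the OP didn't mention anything about punctuation, I'm leaving this answer as it is.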
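As a rough illustration of the punctuation point, one option is to replace punctuation with spaces before running any space-based whole-word check. This is only a sketch under assumptions that are not in the original answer: the helper name normalize is hypothetical, and hyphens are kept so that "high-level" stays a single word.

```python
import string as string_module

# Map every ASCII punctuation character except "-" to a space.
# Assumption: hyphenated words such as "high-level" should stay intact.
_PUNCT_TO_SPACE = str.maketrans(
    {ch: " " for ch in string_module.punctuation if ch != "-"}
)


def normalize(text):
    """Replace punctuation with spaces and collapse runs of whitespace."""
    return " ".join(text.translate(_PUNCT_TO_SPACE).split())


S = "Python is a high-level, general-purpose programming language."
print("language" in S.split())             # False: the token is "language."
print("language" in normalize(S).split())  # True
```

The normalized string can then be passed to a space-delimited check in place of the raw text.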
Add a space on both sides of both the substring and the string, then test 'substring in string'.
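A minimal sketch of this idea, assuming words are separated by single spaces and no punctuation touches the term. The function name contains_term is hypothetical:

```python
def contains_term(string, substring):
    """Whole-word/term check: pad both sides with a space, then use `in`."""
    return f" {substring} " in f" {string} "


string = "Python is a high-level programming language"
print(contains_term(string, "program"))                 # False
print(contains_term(string, "high-level program"))      # False
print(contains_term(string, "high-level programming"))  # True
print(contains_term(string, "language"))                # True
```

The padding makes the first and last words behave like interior words, so a single membership test covers every position in the string.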