找到两个字符串之间的公共子串

问题描述 投票:48回答:13

我想比较2个字符串并保持匹配,在比较失败的地方分开。

所以,如果我有2个字符串 -

string1 = apples
string2 = appleses

answer = apples

另一个例子,因为字符串可能有多个单词。

string1 = apple pie available
string2 = apple pies

answer = apple pie

我确信有一种简单的Python方法可以做到这一点,但我无法解决,任何帮助和解释都表示赞赏。

python string algorithm time-complexity dynamic-programming
13个回答
14
投票

它被称为最长公共子串问题。在这里,我提出了一个简单易懂但效率低下的解决方案。由于该算法的复杂度为O(N ^ 2),因此需要很长时间才能为大字符串生成正确的输出。

def longestSubstringFinder(string1, string2):
    answer = ""
    len1, len2 = len(string1), len(string2)
    for i in range(len1):
        match = ""
        for j in range(len2):
            if (i + j < len1 and string1[i + j] == string2[j]):
                match += string2[j]
            else:
                if (len(match) > len(answer)): answer = match
                match = ""
    return answer

print longestSubstringFinder("apple pie available", "apple pies")
print longestSubstringFinder("apples", "appleses")
print longestSubstringFinder("bapples", "cappleses")

产量

apple pie
apples
apples

0
投票

返回第一个最长的公共子字符串:

def compareTwoStrings(string1, string2):
    list1 = list(string1)
    list2 = list(string2)

    match = []
    output = ""
    length = 0

    for i in range(0, len(list1)):

        if list1[i] in list2:
            match.append(list1[i])

            for j in range(i + 1, len(list1)):

                if ''.join(list1[i:j]) in string2:
                    match.append(''.join(list1[i:j]))

                else:
                    continue
        else:
            continue

    for string in match:

        if length < len(list(string)):
            length = len(list(string))
            output = string

        else:
            continue

    return output

0
投票

首先是一个辅助功能,改编自itertools pairwise recipe以产生子串。

import itertools
def n_wise(iterable, n = 2):
    '''n = 2 -> (s0,s1), (s1,s2), (s2, s3), ...

    n = 3 -> (s0,s1, s2), (s1,s2, s3), (s2, s3, s4), ...'''
    a = itertools.tee(iterable, n)
    for x, thing in enumerate(a[1:]):
        for _ in range(x+1):
            next(thing, None)
    return zip(*a)

然后一个函数迭代子字符串,最长的第一个,并测试成员资格。 (效率未考虑)

def foo(s1, s2):
    '''Finds the longest matching substring
    '''
    # the longest matching substring can only be as long as the shortest string
    #which string is shortest?
    shortest, longest = sorted([s1, s2], key = len)
    #iterate over substrings, longest substrings first
    for n in range(len(shortest)+1, 2, -1):
        for sub in n_wise(shortest, n):
            sub = ''.join(sub)
            if sub in longest:
                #return the first one found, it should be the longest
                return sub

s = "fdomainster"
t = "exdomainid"
print(foo(s,t))

>>> 
domain
>>> 

0
投票

这是称为“最长序列发现者”的课堂问题。我已经给出了一些对我有用的简单代码,我的输入也是一个序列的列表,也可以是一个字符串,可能会帮助你:

def longest_substring(list1,list2):
    both=[]
    if len(list1)>len(list2):
        small=list2
        big=list1
    else:
        small=list1
        big=list2
    removes=0
    stop=0
    for i in small:
        for j in big:
            if i!=j:
                removes+=1
                if stop==1:
                    break
            elif i==j:
                both.append(i)
                for q in range(removes+1):
                    big.pop(0)
                stop=1
                break
        removes=0
    return both

0
投票
def LongestSubString(s1,s2):
    left = 0
    right =len(s2)
    while(left<right):
        if(s2[left] not in s1):
            left = left+1
        else:
            if(s2[left:right] not in s1):
                right = right - 1
            else:
                return(s2[left:right])

s1 = "pineapple"
s2 = "applc"
print(LongestSubString(s1,s2))

115
投票

为了完整起见,标准库中的difflib提供了大量的序列比较实用程序。例如find_longest_match在字符串上使用时找到最长的公共子字符串。使用示例:

from difflib import SequenceMatcher

string1 = "apple pie available"
string2 = "come have some apple pies"

match = SequenceMatcher(None, string1, string2).find_longest_match(0, len(string1), 0, len(string2))

print(match)  # -> Match(a=0, b=15, size=9)
print(string1[match.a: match.a + match.size])  # -> apple pie
print(string2[match.b: match.b + match.size])  # -> apple pie

32
投票
def common_start(sa, sb):
    """ returns the longest common substring from the beginning of sa and sb """
    def _iter():
        for a, b in zip(sa, sb):
            if a == b:
                yield a
            else:
                return

    return ''.join(_iter())
>>> common_start("apple pie available", "apple pies")
'apple pie'

或者有点奇怪的方式:

def stop_iter():
    """An easy way to break out of a generator"""
    raise StopIteration

def common_start(sa, sb):
    return ''.join(a if a == b else stop_iter() for a, b in zip(sa, sb))

哪个可能更具可读性

def terminating(cond):
    """An easy way to break out of a generator"""
    if cond:
        return True
    raise StopIteration

def common_start(sa, sb):
    return ''.join(a for a, b in zip(sa, sb) if terminating(a == b))

8
投票

人们也可以考虑使用os.path.commonprefix来处理字符,因此可以用于任何字符串。

import os
common = os.path.commonprefix(['apple pie available', 'apple pies'])
assert common == 'apple pie'

5
投票

Evo's相同,但要比较任意数量的字符串:

def common_start(*strings):
    """ Returns the longest common substring
        from the beginning of the `strings`
    """
    def _iter():
        for z in zip(*strings):
            if z.count(z[0]) == len(z):  # check all elements in `z` are the same
                yield z[0]
            else:
                return

    return ''.join(_iter())

3
投票

使用第一个答案修复错误:

def longestSubstringFinder(string1, string2):
    answer = ""
    len1, len2 = len(string1), len(string2)
    for i in range(len1):
        for j in range(len2):
            lcs_temp=0
            match=''
            while ((i+lcs_temp < len1) and (j+lcs_temp<len2) and string1[i+lcs_temp] == string2[j+lcs_temp]):
                match += string2[j+lcs_temp]
                lcs_temp+=1
            if (len(match) > len(answer)):
                answer = match
    return answer

print longestSubstringFinder("dd apple pie available", "apple pies")
print longestSubstringFinder("cov_basic_as_cov_x_gt_y_rna_genes_w1000000", "cov_rna15pcs_as_cov_x_gt_y_rna_genes_w1000000")
print longestSubstringFinder("bapples", "cappleses")
print longestSubstringFinder("apples", "apples")

1
投票

尝试:

import itertools as it
''.join(el[0] for el in it.takewhile(lambda t: t[0] == t[1], zip(string1, string2)))

它从两个字符串的开头进行比较。


1
投票

这不是最有效的方法,但这是我能想到的,它的工作原理。如果有人可以改进它,请做。它的作用是制作一个矩阵并将1放在字符匹配的位置。然后它扫描矩阵以找到1s的最长对角线,跟踪它的开始和结束位置。然后它返回输入字符串的子字符串,其中起始位置和结束位置作为参数。

注意:这只找到一个最长的公共子字符串。如果有多个,你可以创建一个数组来存储结果并返回它。此外,它区分大小写(苹果派,苹果派)将返回pple馅饼。

def longestSubstringFinder(str1, str2):
answer = ""

if len(str1) == len(str2):
    if str1==str2:
        return str1
    else:
        longer=str1
        shorter=str2
elif (len(str1) == 0 or len(str2) == 0):
    return ""
elif len(str1)>len(str2):
    longer=str1
    shorter=str2
else:
    longer=str2
    shorter=str1

matrix = numpy.zeros((len(shorter), len(longer)))

for i in range(len(shorter)):
    for j in range(len(longer)):               
        if shorter[i]== longer[j]:
            matrix[i][j]=1

longest=0

start=[-1,-1]
end=[-1,-1]    
for i in range(len(shorter)-1, -1, -1):
    for j in range(len(longer)):
        count=0
        begin = [i,j]
        while matrix[i][j]==1:

            finish=[i,j]
            count=count+1 
            if j==len(longer)-1 or i==len(shorter)-1:
                break
            else:
                j=j+1
                i=i+1

        i = i-count
        if count>longest:
            longest=count
            start=begin
            end=finish
            break

answer=shorter[int(start[0]): int(end[0])+1]
return answer

1
投票
def matchingString(x,y):
    match=''
    for i in range(0,len(x)):
        for j in range(0,len(y)):
            k=1
            # now applying while condition untill we find a substring match and length of substring is less than length of x and y
            while (i+k <= len(x) and j+k <= len(y) and x[i:i+k]==y[j:j+k]):
                if len(match) <= len(x[i:i+k]):
                   match = x[i:i+k]
                k=k+1
    return match  

print matchingString('apple','ale') #le
print matchingString('apple pie available','apple pies') #apple pie     
© www.soinside.com 2019 - 2024. All rights reserved.