Python:子序列搜索

问题描述 投票:0回答:1

我有一个序列'abccabac'和一个子序列'abc'。我需要得到序列中子序列'abc'的所有出现的索引。这种记忆效率高的方式是什么?

例:

输入:

**sequence**: 'abccabac' **subsequence**: 'abc'  

输出:

 012   
 013  
 017  
 057  
 457
python sequence sequences subsequence
1个回答
2
投票

我能想到的绝对最直接的方式是使用qazxsw poi

itertools.combinations

如果您的实际序列包含许多甚至不在子序列中的字符,那么在启动import itertools sequence = "abccabac" subsequence = "abc" for combo in itertools.combinations(enumerate(sequence), len(subsequence)): if "".join(pair[1] for pair in combo) == subsequence: print([pair[0] for pair in combo]) 之前过滤掉不相关的字符肯定会提供显着的效率提升:

combinations

除了加入每个组合,你可以使用char_generator = (pair for pair in enumerate(sequence) if pair[1] in subsequence) for combo in itertools.combinations(char_generator, len(subsequence)): ... ,它只检查字符,直到发现一个不相等:

all

这是我刚刚放在一起的另一种选择,我想这个算法可以进一步优化,但这是我能想到的最好的。 char_generator = (pair for pair in enumerate(sequence) if pair[1] in subsequence) for combo in itertools.combinations(char_generator, len(subsequence)): if all(pair[1]==char for pair,char in zip(combo,subsequence)): print([pair[0] for pair in combo]) 为子序列的每个字母创建一个索引列表,然后递归的每个级别的sub_sequence_generator遍历在最后一个字母的指示之后开始的子序列的下一个字母的所有索引。

sub_helper

在内存方面,这将为子序列的每个字符创建一个列表,其中包含该字符所在的主序列中的索引。这些列表的元组,以及不断更新的索引列表和每次组合import itertools import bisect def sub_helper(cur_i, combo, all_options): #cur_i is the index of the letter in the subsequence #cur_idx is the index of the sequence which contains that letter if cur_i>0: prev_i = combo[cur_i-1] else: prev_i = -1 cur_options = all_options[cur_i] start_i = bisect.bisect(cur_options,prev_i) cur_i_gen = itertools.islice(cur_options,start_i,None) if cur_i+1 == len(all_options): #last one: for cur_idx in cur_i_gen: combo[cur_i] = cur_idx yield tuple(combo) else: for cur_idx in cur_i_gen: combo[cur_i] = cur_idx yield from sub_helper(cur_i+1, combo, all_options) def sub_sequence_generator(sequence,sub_seq): indices_map = {c:[] for c in set(sub_seq)} #create a list for each character for i,c in enumerate(sequence): try: indices_map[c].append(i,) #the helper is simpler if they are all stored as single element tuples except KeyError: pass # now we have indices for each character of the sub_sequence # we just need to put them in a sequence that corelates to the subsequence chr_indices = tuple(indices_map[c] for c in sub_seq) return sub_helper(0,[None]*len(chr_indices), chr_indices) sequence = "abccabac" subsequence = "abc" for idxs in sub_sequence_generator(sequence,subsequence): print(idxs) ed时的元组,以及像yield这样的迭代器,所以这是非常高效的内存,但是因为我没有对其进行广泛测试,所以我不能给出任何保证它没有bug。

© www.soinside.com 2019 - 2024. All rights reserved.