I searched around and did find similar questions. This answer gives the longest sequence of CHARACTERS that may not belong to all strings in the input list. This answer returns the longest common sequence of words that must belong to all strings in the input list.
I am looking for a combination of the two solutions above. That is, I want the longest common sequence of WORDS that may not occur in every phrase of the input list.
Here are some examples of the expected behaviour.
['exterior lighting', 'interior lighting']
--> 'lighting'
['ambient lighting', 'ambient light']
--> 'ambient'
['led turn signal lamp', 'turn signal lamp', 'signal and ambient lamp', 'turn signal light']
--> 'turn signal lamp'
['ambient lighting', 'infrared light']
--> ''
Thank you!
This code will also sort the words by how common they are in your list: it counts every word across the list, drops the words that occur only once, and sorts the rest by count.
lst = ['led turn signal lamp', 'turn signal lamp', 'signal and ambient lamp', 'turn signal light']

d = {}
d_words = {}

# count every word across all phrases
for i in lst:
    for j in i.split():
        if j in d:
            d[j] = d[j] + 1
        else:
            d[j] = 1

# keep only the words that occur more than once
for k, v in d.items():
    if v != 1:
        d_words[k] = v

# sort by count, most frequent first
sorted_words = sorted(d_words, key=d_words.get, reverse=True)
print(sorted_words)
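For comparison, the same count-filter-sort pipeline can be written more compactly with `collections.Counter` (a sketch, not part of the original code):

```python
from collections import Counter

lst = ['led turn signal lamp', 'turn signal lamp',
       'signal and ambient lamp', 'turn signal light']

# count every word across all phrases
counts = Counter(word for phrase in lst for word in phrase.split())

# keep only words that occur more than once, most frequent first
sorted_words = [w for w, c in counts.most_common() if c > 1]
print(sorted_words)
```

`Counter.most_common()` already returns entries sorted by count in descending order, so no separate `sorted()` call is needed.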
A fairly crude solution, but I think it works.
from nltk.util import everygrams
import pandas as pd

def get_word_sequence(phrases):
    ngrams = []
    for phrase in phrases:
        tokens = phrase.split()
        ngrams.append(list(everygrams(tokens)))
    ngrams = [i for j in ngrams for i in j]  # flatten the list of lists

    counts_per_ngram_series = pd.Series(ngrams).value_counts()
    counts_per_ngram_df = pd.DataFrame({'ngram': counts_per_ngram_series.index,
                                        'count': counts_per_ngram_series.values})
    # discard the pandas Series
    del counts_per_ngram_series

    # filter out the ngrams that appear only once
    counts_per_ngram_df = counts_per_ngram_df[counts_per_ngram_df['count'] > 1]
    if not counts_per_ngram_df.empty:
        # populate the ngramsize column
        counts_per_ngram_df['ngramsize'] = counts_per_ngram_df['ngram'].str.len()
        # sort by ngramsize and then by count, both descending
        counts_per_ngram_df.sort_values(['ngramsize', 'count'], inplace=True, ascending=[False, False])
        # get the top ngram
        top_ngram = " ".join(*counts_per_ngram_df.head(1).ngram.values)
        return top_ngram
    return ''
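The nltk/pandas dependencies aren't strictly required here: the same longest-repeated-word-sequence logic can be sketched with the standard library alone (the function name `longest_common_ngram` is my own, not from the answer above):

```python
from collections import Counter

def longest_common_ngram(phrases):
    """Longest run of words that appears in more than one phrase;
    ties are broken by frequency."""
    counts = Counter()
    for phrase in phrases:
        tokens = phrase.split()
        # count every contiguous word sequence of every length
        for i in range(len(tokens)):
            for j in range(i + 1, len(tokens) + 1):
                counts[tuple(tokens[i:j])] += 1
    # keep only the n-grams that appear more than once
    repeated = [ng for ng, c in counts.items() if c > 1]
    if not repeated:
        return ''
    # prefer the longest n-gram, then the most frequent one
    best = max(repeated, key=lambda ng: (len(ng), counts[ng]))
    return ' '.join(best)

print(longest_common_ngram(['exterior lighting', 'interior lighting']))
# --> 'lighting'
```

This reproduces the examples from the question, including the empty-string case when no word is shared by at least two phrases.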