Python: matching phrases in dictionary values to sentences (dictionary keys) and producing output based on the match

Question    Votes: 1    Answers: 1

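I have a dictionary in which each key is a sentence and each value is a list of specific words or phrases from that sentence.

For example: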

dict1 = {'it is lovely weather and it is kind of warm':['lovely weather', 'it is kind of warm'],'and the weather is rainy and cold':['rainy and cold'],'the temperature is ok':['temperature']}

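I want to tag the output for each sentence according to whether each word or phrase appears in the dictionary value.

For this example the output would be (where 0 means not in the values and 1 means in the values):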

*
it 0
is 0
lovely weather 1 (combined because it's a phrase)
and 0
it is kind of warm 1 (combined because it's a phrase)
*
and 0
the 0
weather 0
is 0
rainy and cold 1 (combined because it's a phrase)
...(and so on)...

I can get something like this to work, but only by hard-coding the number of words in the phrase:

for k, v in dict1.items():
    for phrase in v:  # each value is a list of phrases
        words_in_val = phrase.split()
        if len(words_in_val) == 1:
            words = k.split()
            for each_word in words:
                if phrase == each_word:
                    print(each_word + '\t' + '1')
                else:
                    print(each_word + '\t' + '0')

        if len(words_in_val) == 2:
            words = k.split()
            for index, item in enumerate(words[:-1]):
                if words[index] == words_in_val[0]:
                    if words[index + 1] == words_in_val[1]:
                        words[index] = ' '.join(words_in_val)
                        del words[index + 1]
                        # ....something like this...

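My problem is that I can see this starting to get messy, and in theory the phrases I want to match could contain any number of words, although usually fewer than 10.

Does anyone have a better idea of how to do this?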

python
1 Answer
0
votes

So here is what I would do:

dict1 = {'it is lovely weather and it is kind of warm':['lovely weather', 'it is kind of warm'],'and the weather is rainy and cold':['rainy and cold'],'the temperature is ok':['temperature']}

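def tag_sentences(sentence_phrases):
    tag_id = 1  # 0 means "not in a phrase"; each matched phrase gets its own id
    tagged_results = []
    for sentence, phrases in sentence_phrases.items():
        words = sentence.split()
        phrases_split = [phrase.split() for phrase in phrases]
        positions_keeper = {}  # phrase index -> number of phrase words matched so far
        sentence_results = [(word, 0) for word in words]
        for word_index, word in enumerate(words):
            for index, phrase in enumerate(phrases_split):
                position = positions_keeper.get(index, 0)
                if phrase[position] != word:
                    # Mismatch: restart this phrase, possibly at the current word
                    # (a simple restart, not a full KMP-style fallback)
                    position = 0
                    if phrase[0] != word:
                        positions_keeper[index] = 0
                        continue
                if len(phrase) > position + 1:
                    positions_keeper[index] = position + 1
                else:
                    # Last word of the phrase: tag it and the phrase words before it
                    for i in range(len(phrase)):
                        sentence_results[word_index - i] = (words[word_index - i], tag_id)
                    tag_id += 1
                    positions_keeper[index] = 0
        tagged_results.append(sentence_results)
    return tagged_results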

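def print_tagged_results(tagged_results):
    for tagged_result in tagged_results:
        memory = 0  # tag id of the previous word (0 = not in a phrase)
        memory_sentence = ""  # words of the phrase collected so far
        for result, tag_id in tagged_result:
            # The tag changed, so the phrase being collected is complete
            if memory != 0 and memory != tag_id:
                print(memory_sentence + "1")
                memory_sentence = ""
            if tag_id == 0:
                print(result, 0)
            else:
                memory_sentence += result + " "
            memory = tag_id
        # Flush a phrase that ends the sentence
        if memory != 0:
            print(memory_sentence + "1")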

tagged_results = tag_sentences(dict1)
print_tagged_results(tagged_results)

This basically does the following:

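  1. First, I build a list of tags in the format [(it, 0), (is, 0), (lovely, 0) ...]
  2. In the tag list, 0 means the word is not in a group, and any other integer marks words that belong together (words tagged 1 form one group, words tagged 2 form another)
  3. I iterate over each word and track where it matches the start of a phrase
  4. If the word is the end of a phrase, I tag that word and all the earlier words that matched the same phrase with the same ID
  5. If it is not the end, I keep the position and move on to the next iteration.
  6. At the end I have a tag list in the format [(it, 0), (is, 0), (lovely, 1) ... (kind, 2), (of, 2), ...]
  7. This will not work if one phrase is a sub-phrase of another, but your example never says how that case should be handled.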
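
For the sub-phrase case in the last point, one possible alternative is to compare each phrase against a same-length slice of the sentence and to try longer phrases first. The following is only a sketch under assumptions that are not in the question: the helper name tag_sentence is hypothetical, and "the longest phrase wins" is just one way overlaps could be resolved.

```python
def tag_sentence(sentence, phrases):
    """Return (chunk, flag) pairs; flag is 1 for a matched phrase, else 0."""
    words = sentence.split()
    # Assumption: try longer phrases first, so a phrase wins over any
    # shorter phrase it contains.
    phrase_words = sorted((p.split() for p in phrases), key=len, reverse=True)
    result = []
    i = 0
    while i < len(words):
        for pw in phrase_words:
            # Compare the phrase with a same-length slice of the sentence.
            if pw and words[i:i + len(pw)] == pw:
                result.append((" ".join(pw), 1))
                i += len(pw)
                break
        else:
            # No phrase starts at this word.
            result.append((words[i], 0))
            i += 1
    return result


dict1 = {
    'it is lovely weather and it is kind of warm': ['lovely weather', 'it is kind of warm'],
    'and the weather is rainy and cold': ['rainy and cold'],
    'the temperature is ok': ['temperature'],
}

for sentence, phrases in dict1.items():
    print("*")
    for chunk, flag in tag_sentence(sentence, phrases):
        print(chunk, flag)
```

Because the slice comparison works for any phrase length, nothing is hard-coded per word count. The cost is roughly (words in sentence) x (number of phrases) comparisons, which should be fine for phrases of fewer than 10 words.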
