How do I check a list of tokenized sentences for specific words and then tag them as 1 or 0?


I am trying to map specific words from some lists onto another list of tokenized sentences: if a word from a category is found in a sentence, I append 1 to that category's tag list and 0 to the rest. For example:

category_a=["stain","sweat","wet","burn"]
category_b=["love","bad","favorite"]
category_c=["packaging","delivery"]
tokenized_sentences=['this deodorant does not stain my clothes','i love this product','i sweat all day']
for i in category_a:
    for j in tokenized_sentences:
          if(i in nltk.word_tokenize(j)):
                 list_a.append(j)
                 tag_a,tag_b,tag_c=([],)*3
                 tag_a.append(1)
                 tag_b.append(0)
                 tag_c.append(0)
                 final=tag_a+tag_b+tag_c

and similarly for category_b and category_c.

Expected output:this deodorant does not stain my clothes-->[1,0,0]
                i love this product-->[0,1,0]
                i sweat all day-->[1,0,0]
                great fragrance-->[0,0,0]

I get duplicated output for each sentence, for example: i love this product-->[1,0,0] i love this product-->[1,0,0], and also output like this: [i love this product, i sweat all day]-->[0,1,0]

Also, if a sentence has words from two different categories, e.g. 'this product does not stain and i love it',
the expected output would be [1,1,0].

How do I get the output in the required format?

python list nltk tokenize
2 Answers

0 votes

This should do the job:

category_a = ["stain", "sweat", "wet", "burn"]
category_b = ["love", "bad", "favorite"]
category_c = ["packaging", "delivery"]
sentences = ['this deodorant does not stain my clothes', 'i love this product', 'i sweat all day']

results = []

for sentence in sentences:
    cat_a = 0
    cat_b = 0
    cat_c = 0
    for word in sentence.split():
        if cat_a == 0:
            cat_a = 1 if word in category_a else 0
        if cat_b == 0:
            cat_b = 1 if word in category_b else 0
        if cat_c == 0:
            cat_c = 1 if word in category_c else 0

    results.append((sentence, [cat_a, cat_b, cat_c]))


print(results)

This code checks whether each sentence contains a word from each of the given categories and stores each sentence together with its result as a tuple. All tuples are appended to a list called results.

Output:

[('this deodorant does not stain my clothes', [1, 0, 0]), ('i love this product', [0, 1, 0]), ('i sweat all day', [1, 0, 0])]

0 votes

Your order of comparisons is off - I do not understand what this part is doing:

         tag_a,tag_b,tag_c=([],)*3
         tag_a.append(1)
         tag_b.append(0)
         tag_c.append(0)
         final=tag_a+tag_b+tag_c

You never end up checking the right thing there.
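To make the problem with that quoted snippet concrete: `([],) * 3` repeats a reference to a single list object, so all three names end up bound to the same list and every append shows up in all of them. A small standalone demonstration:

```python
# ([],) * 3 repeats a reference to ONE list, not three separate lists
tag_a, tag_b, tag_c = ([],) * 3
tag_a.append(1)

print(tag_a is tag_b is tag_c)  # True: all three names share one object
print(tag_b, tag_c)             # [1] [1]: the append is visible everywhere

# Independent lists require three separate literals
tag_a, tag_b, tag_c = [], [], []
tag_a.append(1)
print(tag_b, tag_c)             # [] []
```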

This is how it can work:

import nltk

category_a=["stain","sweat","wet","burn"]
category_b=["love","bad","favorite"]
category_c=["packaging","delivery"]
tokenized_sentences=['this deodorant does not stain my clothes',
                     'i love this product','i sweat all day']

for j in tokenized_sentences:
    r = []
    for c in [category_a,category_b,category_c]:
        print(nltk.word_tokenize(j), c) # debug print: what is being compared here
        if any( w in c for w in nltk.word_tokenize(j)):
            r.append(1)
        else:
            r.append(0)
    print(r) # print the result

Output:

['this', 'deodorant', 'does', 'not', 'stain', 'my', 'clothes'] ['stain', 'sweat', 'wet', 'burn']
['this', 'deodorant', 'does', 'not', 'stain', 'my', 'clothes'] ['love', 'bad', 'favorite']
['this', 'deodorant', 'does', 'not', 'stain', 'my', 'clothes'] ['packaging', 'delivery']
[1, 0, 0]

['i', 'love', 'this', 'product'] ['stain', 'sweat', 'wet', 'burn']
['i', 'love', 'this', 'product'] ['love', 'bad', 'favorite']
['i', 'love', 'this', 'product'] ['packaging', 'delivery']
[0, 1, 0]

['i', 'sweat', 'all', 'day'] ['stain', 'sweat', 'wet', 'burn']
['i', 'sweat', 'all', 'day'] ['love', 'bad', 'favorite']
['i', 'sweat', 'all', 'day'] ['packaging', 'delivery']
[1, 0, 0] 
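The same idea can also be written more compactly with set intersection, and it naturally handles sentences that match more than one category (the [1,1,0] case asked about). A sketch, using plain str.split instead of nltk.word_tokenize to stay dependency-free:

```python
category_a = ["stain", "sweat", "wet", "burn"]
category_b = ["love", "bad", "favorite"]
category_c = ["packaging", "delivery"]
# Sets make the membership test a single intersection per category
categories = [set(category_a), set(category_b), set(category_c)]

sentences = ['this deodorant does not stain my clothes',
             'i love this product',
             'i sweat all day',
             'this product does not stain and i love it']

# 1 if the sentence shares any word with the category, else 0
results = {s: [1 if set(s.split()) & cat else 0 for cat in categories]
           for s in sentences}

for s, tags in results.items():
    print(f"{s}-->{tags}")
```

This prints 'this product does not stain and i love it'-->[1, 1, 0], since that sentence contains both 'stain' (category_a) and 'love' (category_b).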