Gensim Phrases: handling sentences with a lot of punctuation


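I am trying to use gensim Phrases to learn phrases (multi-word terms with a special meaning) from my own corpus.

Suppose I have a corpus about car brands. After removing punctuation and stop words and tokenizing the sentences, it looks like this, for example: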

sent1 = 'aston martin is a car brand'
sent2 = 'audi is a car brand'
sent3 = 'bmw is a car brand'
...

With this, I want to use gensim Phrases to learn the phrases, so that the output looks like:

from gensim.models import Phrases
sents = [sent1, sent2, sent3, ...]
# Phrases expects each sentence as a list of tokens, not a raw string
sents_stream = [sent.split() for sent in sents]
bigram = Phrases(sents_stream)

for tokens in sents_stream:
    print(bigram[tokens])

# Output should look like:
['aston_martin', 'car', 'brand']
['audi', 'car', 'brand']
['bmw', 'car', 'brand']
...

However, if many of the sentences contain a lot of punctuation:

sent1 = 'aston martin is a car brand'
sent2 = 'audi is a car brand'
sent3 = 'bmw is a car brand'
sent4 = 'jaguar, aston martin, mini cooper are british car brand'
sent5 = 'In all brand, I love jaguar, aston martin and mini cooper'
...

then the output looks like:

from gensim.models import Phrases
sents = [sent1, sent2, sent3, sent4, sent5, ...]
# Phrases expects each sentence as a list of tokens, not a raw string
sents_stream = [sent.split() for sent in sents]
bigram = Phrases(sents_stream)

for tokens in sents_stream:
    print(bigram[tokens])

# Output looks like:
['aston', 'martin', 'car', 'brand']
['audi', 'car', 'brand']
['bmw', 'car', 'brand']
['jaguar', 'aston', 'martin_mini', 'cooper', 'british', 'car', 'brand']
['all', 'brand', 'love', 'jaguar', 'aston', 'martin_mini', 'cooper']
...

In this case, how should I handle sentences with a lot of punctuation so that the martin_mini case does not occur and the output looks like this:

['aston', 'martin', 'car', 'brand']
['audi', 'car', 'brand']
['bmw', 'car', 'brand']
['jaguar', 'aston_martin', 'mini_cooper', 'british', 'car', 'brand'] # Change
['all', 'brand', 'love', 'jaguar', 'aston_martin', 'mini_cooper'] # Change
...

Thank you very much for your help!


python nlp gensim phrase
1 answer

Punctuation may not be the main factor behind your unsatisfactory results.
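To make that concrete, here is a rough sketch in plain Python (no gensim needed) that counts bigrams in the five example sentences two ways: with punctuation simply deleted, and with each punctuation mark treated as a boundary that a bigram may not cross. It then applies the formula used by gensim's default scorer, `(bigram_count - min_count) / count_a / count_b * vocab_size`. The helper names `segments`, `count` and `score` are made up for this illustration, stop words are not removed, and the vocabulary size here counts single words only, so the numbers are a simplified model and will not match what `Phrases` computes internally.

```python
import re
from collections import Counter

sents = [
    "aston martin is a car brand",
    "audi is a car brand",
    "bmw is a car brand",
    "jaguar, aston martin, mini cooper are british car brand",
    "In all brand, I love jaguar, aston martin and mini cooper",
]

def segments(sent, split_on_punct):
    """Return token lists; optionally start a new list at each punctuation mark."""
    text = sent.lower()
    if split_on_punct:
        # Each run of punctuation ends the current segment
        parts = re.split(r"[^\w\s]+", text)
    else:
        # Punctuation is deleted, so words on either side become neighbours
        parts = [re.sub(r"[^\w\s]+", " ", text)]
    return [p.split() for p in parts if p.split()]

def count(sentences, split_on_punct):
    """Count single words and adjacent word pairs."""
    unigrams, bigrams = Counter(), Counter()
    for sent in sentences:
        for tokens in segments(sent, split_on_punct):
            unigrams.update(tokens)
            bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams

def score(a, b, unigrams, bigrams, min_count):
    """Simplified version of the default Phrases scoring formula."""
    return (bigrams[(a, b)] - min_count) / unigrams[a] / unigrams[b] * len(unigrams)

for split_on_punct in (False, True):
    unigrams, bigrams = count(sents, split_on_punct)
    print("split on punctuation:", split_on_punct)
    for pair in [("aston", "martin"), ("martin", "mini"), ("mini", "cooper")]:
        print("  ", pair, "count =", bigrams[pair],
              "score(min_count=5) =", score(*pair, unigrams, bigrams, 5),
              "score(min_count=1) =", score(*pair, unigrams, bigrams, 1))
```

In these five sentences, splitting on punctuation removes the single `martin mini` co-occurrence, so that pair can no longer be learned. The `aston martin` pair, though, is seen only three times, which is below the default `min_count=5`, and in this simplified model its score stays under the default `threshold=10.0` even with `min_count=1`. With a corpus this small, the amount of data and the `min_count` and `threshold` settings are likely to matter at least as much as the punctuation does. If you do want the boundary behaviour, one option is to pass the per-segment token lists to `Phrases` instead of whole sentences.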
