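I am currently trying to use gensim Phrases to learn phrases/special meanings from my own corpus.
Suppose I have a corpus about car brands, prepared by removing punctuation and stopwords and tokenizing the sentences, e.g.: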
    sent1 = 'aston martin is a car brand'
    sent2 = 'audi is a car brand'
    sent3 = 'bmw is a car brand'
    ...
With this in place, I want gensim Phrases to learn from the corpus so that the output looks like:
    from gensim.models import Phrases

    sents = [sent1, sent2, sent3, ...]
    sents_stream = [sent.split() for sent in sents]
    bigram = Phrases(sents_stream)

    # Apply the model to the token lists, not to the raw strings
    for sent in sents_stream:
        print(bigram[sent])

    # Output should be like:
    # ['aston_martin', 'car', 'brand']
    # ['audi', 'car', 'brand']
    # ['bmw', 'car', 'brand']
    # ...
However, if many of the sentences contain punctuation:
    sent1 = 'aston martin is a car brand'
    sent2 = 'audi is a car brand'
    sent3 = 'bmw is a car brand'
    sent4 = 'jaguar, aston martin, mini cooper are british car brand'
    sent5 = 'In all brand, I love jaguar, aston martin and mini cooper'
    ...
Then the output looks like:
    from gensim.models import Phrases

    sents = [sent1, sent2, sent3, sent4, sent5, ...]
    sents_stream = [sent.split() for sent in sents]
    bigram = Phrases(sents_stream)

    # Apply the model to the token lists, not to the raw strings
    for sent in sents_stream:
        print(bigram[sent])

    # Output looks like:
    # ['aston', 'martin', 'car', 'brand']
    # ['audi', 'car', 'brand']
    # ['bmw', 'car', 'brand']
    # ['jaguar', 'aston', 'martin_mini', 'cooper', 'british', 'car', 'brand']
    # ['all', 'brand', 'love', 'jaguar', 'aston', 'martin_mini', 'cooper']
    # ...
In this situation, how should I handle sentences with a lot of punctuation so that the martin_mini case does not occur and the output looks like this:
    ['aston', 'martin', 'car', 'brand']
    ['audi', 'car', 'brand']
    ['bmw', 'car', 'brand']
    ['jaguar', 'aston_martin', 'mini_cooper', 'british', 'car', 'brand']  # Change
    ['all', 'brand', 'love', 'jaguar', 'aston_martin', 'mini_cooper']  # Change
    ...
Thank you very much for your help!
Punctuation may not be the main reason your results are unsatisfactory.
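One way to see this is that Phrases decides which pairs to join from co-occurrence counts, controlled by its `min_count` and `threshold` parameters (defaults 5 and 10.0). The sketch below recomputes a Mikolov-style score by hand on the five sample sentences, using the tokens shown in the question's output. It is only an approximation and does not call gensim: the `score` helper is hypothetical, and the vocabulary size is simplified to the number of distinct tokens, which is an assumption rather than gensim's exact bookkeeping.

```python
from collections import Counter

# Toy corpus from the question, tokenized, with stopwords and punctuation removed.
sentences = [
    ["aston", "martin", "car", "brand"],
    ["audi", "car", "brand"],
    ["bmw", "car", "brand"],
    ["jaguar", "aston", "martin", "mini", "cooper", "british", "car", "brand"],
    ["all", "brand", "love", "jaguar", "aston", "martin", "mini", "cooper"],
]

# Count single tokens and adjacent token pairs across the corpus.
unigrams = Counter(tok for sent in sentences for tok in sent)
bigrams = Counter(pair for sent in sentences for pair in zip(sent, sent[1:]))


def score(a, b, min_count=1):
    """Hypothetical helper: Mikolov-style bigram score.

    Assumption: vocabulary size is taken as the number of distinct tokens.
    """
    return (bigrams[(a, b)] - min_count) / unigrams[a] / unigrams[b] * len(unigrams)


for pair in [("aston", "martin"), ("martin", "mini"), ("mini", "cooper")]:
    print(pair, round(score(*pair), 2))
```

On this sample the three pairs score within about one point of each other, and with `min_count=5` none of them scores above zero. Under these assumptions, stripping commas alone would not be expected to separate martin_mini from aston_martin. The amount of data and the `min_count` and `threshold` settings are likely to matter more.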