Latent Dirichlet Allocation (LDA) performance when limiting the word length in corpus documents

Question

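I am generating topics for a set of customer reviews from the Yelp dataset using Latent Dirichlet Allocation (LDA) in Python (gensim package). When generating tokens, I select only words of length >= 3 from the reviews (using RegexpTokenizer):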

from nltk.tokenize import RegexpTokenizer
tokenizer = RegexpTokenizer(r'\w{3,}')
tokens = tokenizer.tokenize(review)

This lets us filter out noisy words shorter than 3 characters when creating the corpus documents.
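To make the effect of the pattern concrete, here is a minimal sketch that applies the same regular expression with the standard library's re.findall instead of RegexpTokenizer. The sample review is made up for illustration. Note that \w also matches digits and underscores, and that three-letter stop words such as "the" still pass the length filter.

```python
import re

# Hypothetical review text, used only for illustration.
review = "I am so in love with the pad thai at Joe's, it is 10/10!"

# Same pattern as in the tokenizer above: runs of 3 or more word characters.
tokens = re.findall(r"\w{3,}", review)

print(tokens)
# → ['love', 'with', 'the', 'pad', 'thai', 'Joe']
```

The short words ("I", "am", "so", "in", "at", "it", "is") and the two-digit numbers are gone, but "with" and "the" remain, so a length filter alone does not replace a stop-word list.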

How will filtering out these words affect the performance of the LDA algorithm?

Answer 1

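Generally speaking, in English, one- and two-letter words add no information about the topic. If they add no value, they should be removed in the preprocessing step. As with most algorithms, less data also means a shorter execution time.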
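How much data the filter removes can be checked directly. The following is a rough sketch using only the standard library and two made-up reviews. It counts tokens and distinct words before and after the length filter. The actual reduction on the Yelp data will differ.

```python
import re
from collections import Counter

# Hypothetical reviews, used only for illustration.
reviews = [
    "I am so happy we went to this place, it is a gem.",
    "We had to wait an hour for a table, so no, I do not recommend it.",
]

def count_tokens(docs, pattern):
    """Count lowercase tokens matching `pattern` across all documents."""
    counts = Counter()
    for doc in docs:
        counts.update(re.findall(pattern, doc.lower()))
    return counts

all_words = count_tokens(reviews, r"\w+")       # every word
long_words = count_tokens(reviews, r"\w{3,}")   # words of length >= 3 only

total_all = sum(all_words.values())
total_long = sum(long_words.values())
print(f"tokens: {total_all} -> {total_long}")
print(f"vocabulary: {len(all_words)} -> {len(long_words)}")
```

Both the token count and the vocabulary shrink, which is where any speed-up comes from: the model has fewer word occurrences to process and fewer distinct words to track per topic.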

Answer 2

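Words shorter than 3 characters are treated as stop words. LDA builds topics, so imagine you generate this topic:

[I, he, she, they, we, and, or]

compared with:

[shark, bull, great white, hammerhead, whale]

Which one tells you more? That is why removing stop words matters. This is how I do it: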

# Helper functions to lemmatize, stem, and preprocess text.
# WordNetLemmatizer needs the WordNet data: run nltk.download('wordnet') once.
import gensim
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# Extra, corpus-specific stop words (placeholders, replace with your own).
newStopWords = {'your_stopword1', 'your_stopword2'}

# Lemmatize a token as a verb, then stem it, e.g. 'beautiful' becomes 'beauti'.
def lemmatize_stemming(text):
    return stemmer.stem(lemmatizer.lemmatize(text, pos='v'))

# Split a document into lowercase tokens, dropping gensim's built-in stop words
# (him, her, them, for, there, etc., since "their" is not a topic), the extra
# stop words above, and any token of 3 letters or fewer.
# The remaining tokens are lemmatized, stemmed, and collected in a list.
def preprocess(text):
    result = []
    for token in gensim.utils.simple_preprocess(text):
        if (token not in gensim.parsing.preprocessing.STOPWORDS
                and token not in newStopWords
                and len(token) > 3):
            result.append(lemmatize_stemming(token))
    return result
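The reason stop words crowd out topical words is frequency. The sketch below is a simplified stand-in, not the gensim pipeline above: it uses only the standard library, three made-up reviews, and a small hypothetical stop list (STOP), and compares the most frequent words before and after filtering.

```python
import re
from collections import Counter

# Hypothetical mini stop list for illustration; gensim's STOPWORDS is much larger.
STOP = {"the", "and", "was", "but", "they", "were", "with", "for", "had", "this", "that"}

# Hypothetical reviews, used only for illustration.
reviews = [
    "The sushi was fresh and the staff were friendly",
    "The sushi was bland but the ramen was great",
    "They had great ramen and the staff were quick",
]

def top_words(docs, min_len=1, stop=frozenset(), n=3):
    """Return the n most frequent lowercase words that pass both filters."""
    counts = Counter(
        w
        for doc in docs
        for w in re.findall(r"\w+", doc.lower())
        if len(w) >= min_len and w not in stop
    )
    return [w for w, _ in counts.most_common(n)]

raw_top = top_words(reviews)                          # no filtering
clean_top = top_words(reviews, min_len=3, stop=STOP)  # length + stop-word filter
print(raw_top)
print(clean_top)
```

Without filtering, "the" is the most frequent word in this toy corpus. After filtering, only content words are left to compete for the top positions, which is the effect you want in the topics LDA produces.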
