How to do topic modeling on short texts


I have a text dataset in which each response consists of 2-3 long sentences. What is the best way to do topic modeling on it?

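I have tried LDA, BERTopic, and clustering embeddings with UMAP + HDBSCAN, but none of them gave satisfactory results. I want fine-grained topics, and I want to use the same model to assign those topic labels to new text.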

topic-modeling
1 Answer
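import nltk
import pandas as pd
from nltk.tokenize import word_tokenize

# word_tokenize needs the Punkt tokenizer data; newer NLTK releases look for
# 'punkt_tab' instead of 'punkt', so both are requested here.
nltk.download('punkt', quiet=True)
nltk.download('punkt_tab', quiet=True)

# Sample data
data = {'text': ["This is a short sentence.", "Another sentence with more words.", "A longer sentence with many more words than eight."]}
df = pd.DataFrame(data)

# Split a sentence into consecutive, non-overlapping chunks of up to 8 tokens.
# Note: these are fixed-size chunks, not sliding-window 8-grams, and
# punctuation marks count as tokens.
def tokenize_and_create_grams(sentence):
    words = word_tokenize(sentence)
    if len(words) <= 8:
        # Short enough already: keep the original sentence unchanged
        return [sentence]
    else:
        ngrams = [words[i:i+8] for i in range(0, len(words), 8)]
        return [' '.join(gram) for gram in ngrams]

# Apply the function to every row; each cell of 'grams' holds a list of chunks
df['grams'] = df['text'].apply(tokenize_and_create_grams)

# Show the resulting DataFrame
print(df)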