I have a corpus of text where each response is 2-3 long sentences. What is the best way to do topic modeling on it?
I have tried LDA, BERTopic, and clustering embeddings with UMAP+HDBSCAN, but none of them gave satisfactory results. I want fine-grained topics, and I want to use the same fitted model to assign topic labels to new text.
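As a sketch of the "fit once, reuse on new text" requirement, here is a minimal TF-IDF + NMF pipeline. NMF is an assumption (it is not one of the methods tried above), and the corpus and topic count are illustrative only:

```python
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

# Illustrative corpus; replace with the real responses.
docs = [
    "battery life is too short on this phone",
    "the phone battery drains overnight",
    "shipping was slow and the box arrived damaged",
    "package arrived late with a dented box",
]

# Fit the vectorizer and the NMF topic model once on the training corpus.
vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)
nmf = NMF(n_components=2, random_state=0)
doc_topics = nmf.fit_transform(X)

# Reuse the same fitted objects to label unseen text.
new_docs = ["my battery dies after an hour"]
new_topics = nmf.transform(vectorizer.transform(new_docs))
print(new_topics.argmax(axis=1))  # index of the dominant topic per new doc
```

Both the vectorizer and the factorization are frozen after fitting, so `transform` maps new responses into the same topic space without refitting.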
import pandas as pd
import nltk
from nltk.tokenize import word_tokenize

# word_tokenize needs the 'punkt' tokenizer data (newer NLTK versions may
# also require 'punkt_tab').
nltk.download('punkt', quiet=True)

# Sample data
data = {'text': ["This is a short sentence.",
                 "Another sentence with more words.",
                 "A longer sentence with many more words than eight."]}
df = pd.DataFrame(data)

# Tokenize a sentence; if it has more than 8 tokens, split it into
# consecutive 8-token chunks, otherwise keep the sentence whole.
def tokenize_and_create_grams(sentence):
    words = word_tokenize(sentence)
    if len(words) <= 8:
        return [sentence]
    ngrams = [words[i:i+8] for i in range(0, len(words), 8)]
    return [' '.join(gram) for gram in ngrams]

# Apply the function to the DataFrame
df['grams'] = df['text'].apply(tokenize_and_create_grams)

# Show the resulting DataFrame
print(df)
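If the chunks are to be fed into a topic model, one row per chunk is usually easier to work with than a list column. `DataFrame.explode` flattens the list while keeping the original row index, so chunk-level topic labels can be mapped back to the source response (a possible next step, not in the original; the sample data here is illustrative):

```python
import pandas as pd

# Illustrative frame in the same shape as the one above: a list of
# 8-word chunks per original row.
df = pd.DataFrame({
    'text': ["A longer sentence with many more words than eight."],
    'grams': [["A longer sentence with many more words than",
               "eight ."]],
})

# One chunk per row; 'source_row' records which response each chunk came from.
chunks = (df.explode('grams')
            .reset_index()
            .rename(columns={'index': 'source_row'}))
print(chunks[['source_row', 'grams']])
```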