pyLDAvis error: AttributeError: 'CountVectorizer' object has no attribute 'get_feature_names'


I am doing topic modeling for one of my projects and am struggling to visualize the results. I believe the program is correct. In particular, when I run these lines:

vis = pyLDAvis.sklearn.prepare(bi_lda, bigram_vectorized, bivectorizer, mds='tsne')
pyLDAvis.show(vis)

I get this error:

AttributeError: 'CountVectorizer' object has no attribute 'get_feature_names'

I find this very strange and can't figure it out, because the program is correct and I am able to create an LDA model.

This is how I create the model:

import numpy as np
import pandas as pd
from tqdm import tqdm
import string
import matplotlib.pyplot as plt
from sklearn.decomposition import NMF, LatentDirichletAllocation, TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.manifold import TSNE
import concurrent.futures
import time
import pyLDAvis.sklearn
from pylab import bone, pcolor, colorbar, plot, show, rcParams, savefig
import warnings
warnings.filterwarnings('ignore')

%matplotlib inline
import os
#print(os.listdir("../input"))

# Plotly based imports for visualization
import chart_studio.plotly as py


from plotly.offline import init_notebook_mode, iplot
init_notebook_mode(connected=True)
import plotly.graph_objs as go
import plotly.figure_factory as ff


# spaCy based imports
import spacy
from spacy.lang.en.stop_words import STOP_WORDS
from spacy.lang.en import English
!python -m spacy download it_core_news_sm

Then the processing:

# Create a custom stopword list
custom_stop_words = []

# Add spaCy's built-in Italian stop words to the list
from spacy.lang.it.stop_words import STOP_WORDS as italian_stop_words
custom_stop_words.extend(italian_stop_words)

# Load the Italian model downloaded above and define the punctuation set used below
nlp = spacy.load("it_core_news_sm")
punctuations = string.punctuation

def spacy_tokenizer(sentence):
    # Use the Italian model to tokenize the sentence
    mytokens = nlp(sentence)

    # Use lemmatization to lowercase, strip, and remove stop words and punctuation
    mytokens = [ word.lemma_.lower().strip() if word.lemma_ != "-PRON-" else word.lower_ for word in mytokens ]
    mytokens = [ word for word in mytokens if word not in custom_stop_words and word not in punctuations ]
    mytokens = " ".join([i for i in mytokens])

    return mytokens

tqdm.pandas()
df["processed_description"] = df["content"].progress_apply(spacy_tokenizer)

# Creating a vectorizer
# Keep alphabetic tokens of 3+ characters; a raw string avoids an invalid-escape warning
vectorizer = CountVectorizer(min_df=5, max_df=0.9, stop_words=custom_stop_words, lowercase=True, token_pattern=r'[a-zA-Z\-][a-zA-Z\-]{2,}')
data_vectorized = vectorizer.fit_transform(df["processed_description"])
# Latent Dirichlet Allocation Model
NUM_TOPICS = 10
lda = LatentDirichletAllocation(n_components=NUM_TOPICS, max_iter=10, learning_method='online',verbose=True)
data_lda = lda.fit_transform(data_vectorized)

The problem I run into is right here:

pyLDAvis.enable_notebook()
dash = pyLDAvis.sklearn.prepare(lda, data_vectorized, vectorizer, mds='tsne')
dash

The output is always AttributeError: 'CountVectorizer' object has no attribute 'get_feature_names'. I also tried updating the libraries, but it didn't work.
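As a side note, you can confirm which API the installed scikit-learn actually exposes (a small diagnostic sketch, separate from the project code):

import sklearn
from sklearn.feature_extraction.text import CountVectorizer

print(sklearn.__version__)
# get_feature_names was deprecated in scikit-learn 1.0 and removed in 1.2;
# get_feature_names_out is its replacement
print(hasattr(CountVectorizer(), "get_feature_names"))       # False on >= 1.2
print(hasattr(CountVectorizer(), "get_feature_names_out"))   # True on >= 1.0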

Instead, if I plot it like this, it works:

svd_2d = TruncatedSVD(n_components=2)
data_2d = svd_2d.fit_transform(data_vectorized)
trace = go.Scattergl(
    x = data_2d[:,0],
    y = data_2d[:,1],
    mode = 'markers',
    marker = dict(
        color = '#FFBAD2',
        line = dict(width = 1)
    ),
    text = vectorizer.get_feature_names_out(),
    hovertext = vectorizer.get_feature_names_out(),
    hoverinfo = 'text' 
)
data = [trace]
iplot(data, filename='scatter-mode')
python lda topic-modeling pyldavis
2 Answers

1 vote

Fixed in the latest release, here: https://github.com/bmabey/pyLDAvis/pull/235
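If an older pyLDAvis is installed, upgrading it should pick up that fix (the exact release containing the PR can be checked in the project's changelog):

pip install --upgrade pyldavis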


0 votes

This error occurs when using newer versions of scikit-learn (>= 1.2). To fix it, replace any logic involving
import pyLDAvis.sklearn
...
pyLDAvis.sklearn.prepare
with
import pyLDAvis.lda_model
...
pyLDAvis.lda_model.prepare
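Applied to the code from the question, the fixed call looks like this (a minimal sketch, assuming pyLDAvis >= 3.4, where the lda_model module is available):

import pyLDAvis
import pyLDAvis.lda_model   # replaces the removed pyLDAvis.sklearn module

pyLDAvis.enable_notebook()
# Same arguments as before; only the module name changes
dash = pyLDAvis.lda_model.prepare(lda, data_vectorized, vectorizer, mds='tsne')
dash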

This should fix the problem. More background on this can be found here.