How to get the importance of "words" in NLP (TFIDF + Logistic Regression)

Question

I have a function that builds tfidf features, like this:

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

def get_tfidf_features(data, tfidf_vectorizer=None, ngram_range=(1,2)):
    """ Creates tfidf features and joins them to data as sparse columns. If no tfidf_vectorizer is given, 
    the function will train one."""

    if tfidf_vectorizer is not None:
        tfidf = tfidf_vectorizer.transform(data.Comment_text)
    else:
        # only add terms to the vocabulary that appear in at least 700 documents (min_df)
        tfidf_vectorizer = TfidfVectorizer(min_df=700, ngram_range=ngram_range, stop_words='english')
        tfidf = tfidf_vectorizer.fit_transform(data.Comment_text)

    # round the stored (non-zero) values of the scipy sparse matrix in place
    tfidf.data = tfidf.data.round(4)
    # get_feature_names_out() replaces get_feature_names() in scikit-learn >= 1.0
    tfidf_features = ['tfidf_' + word for word in tfidf_vectorizer.get_feature_names_out()]
    # pd.SparseDataFrame was removed in pandas 1.0; build a sparse-backed DataFrame instead
    tfidf = pd.DataFrame.sparse.from_spmatrix(tfidf, columns=tfidf_features)
    data = data.reset_index().join(tfidf).set_index('index')

    return data, tfidf_vectorizer, tfidf_features

X_train, tfidf_vectorizer, tfidf_features = get_tfidf_features(X_train)

I then fit a simple logistic regression like this:

logit = LogisticRegression(random_state=0, solver='lbfgs', multi_class='ovr')
logit.fit(X_train.loc[:, features].fillna(0), X_train['Hateful_or_not'])
preds = logit.predict(X_test.loc[:, features].fillna(0))

I am getting the feature importances like this:

 logit.coef_

But this gives me feature importances for the "columns", not for the words.


Answer 1

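logit.coef_ does give you the coefficient for each word feature (or bigram). For a binary problem it has shape (1, len(features)), and the coefficient of the word at position i in features sits at position i in logit.coef_[0], so the columns and the words are in one-to-one correspondence.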

Example:
