TF-IDF for text clustering analysis

Question

I want to group the short texts contained in the df['Texts'] column of a DataFrame. Examples of the sentences to analyze are as follows:

    Texts

  1 Donald Trump, Donald Trump news, Trump bleach, Trump injected bleach, bleach coronavirus.
  2 Thank you Janey.......laughing so much at this........you have saved my sanity in these mad times. Only bleach Trump is using is on his heed 🤣
  3 His more uncharitable critics said Trump had suggested that Americans drink bleach. Trump responded that he was being sarcastic.
  4 Outcry after Trump suggests injecting disinfectant as treatment.
  5 Trump Suggested 'Injecting' Disinfectant to Cure Coronavirus?
  6 The study also showed that bleach and isopropyl alcohol killed the virus in saliva or respiratory fluids in a matter of minutes.

Since I know that TF-IDF is useful for clustering, I have been using the following lines of code (following previous questions in the community):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
import re
import string

def preprocessing(line):
    line = line.lower()
    line = re.sub(r"[{}]".format(string.punctuation), " ", line)
    return line

tfidf_vectorizer = TfidfVectorizer(preprocessor=preprocessing)
tfidf = tfidf_vectorizer.fit_transform(all_text)

kmeans = KMeans(n_clusters=2).fit(tfidf) # the number of clusters could be manually changed

However, since I am working with a column of a DataFrame, I don't know how to apply the code above. Could you help me?

Answer 1

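import re
import string

from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

def preprocessing(line):
    # Lowercase the text and replace punctuation with spaces
    line = line.lower()
    line = re.sub(r"[{}]".format(string.punctuation), " ", line)
    return line

# The DataFrame column is an iterable of strings, so it can be passed directly
tfidf_vectorizer = TfidfVectorizer(preprocessor=preprocessing)
tfidf = tfidf_vectorizer.fit_transform(df['Texts'])

kmeans = KMeans(n_clusters=2).fit(tfidf)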

You only need to replace all_text with df['Texts']. It would be better to build a pipeline first and then apply the vectorizer and KMeans together.
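
The pipeline suggestion could look roughly like the sketch below. It is only an illustration, not the answerer's own code: the small DataFrame is rebuilt from the sample rows in the question (lightly shortened), the "cluster" column name is made up for this example, and n_init, random_state and the re.escape call are added here for reproducibility and robustness.

```python
import re
import string

import pandas as pd
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline


def preprocessing(line):
    # Lowercase the text and replace punctuation with spaces
    line = line.lower()
    line = re.sub(r"[{}]".format(re.escape(string.punctuation)), " ", line)
    return line


# Assumption: a stand-in for the asker's DataFrame, built from the sample rows
df = pd.DataFrame({
    "Texts": [
        "Donald Trump, Donald Trump news, Trump bleach, Trump injected bleach, bleach coronavirus.",
        "Only bleach Trump is using is on his heed",
        "His more uncharitable critics said Trump had suggested that Americans drink bleach.",
        "Outcry after Trump suggests injecting disinfectant as treatment.",
        "Trump Suggested 'Injecting' Disinfectant to Cure Coronavirus?",
        "The study also showed that bleach and isopropyl alcohol killed the virus in saliva.",
    ]
})

# Vectorizer and clusterer chained so they are fitted in a single call
pipeline = make_pipeline(
    TfidfVectorizer(preprocessor=preprocessing),
    KMeans(n_clusters=2, n_init=10, random_state=0),
)

# fit_predict returns one cluster label per row; store it next to the text
# ("cluster" is a hypothetical column name chosen for this sketch)
df["cluster"] = pipeline.fit_predict(df["Texts"])
print(df[["cluster", "Texts"]])
```

Keeping both steps in one pipeline means new texts can later be assigned to a cluster with pipeline.predict(new_texts), without repeating the vectorization step by hand.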
