How to compute tf-idf scores from a data frame column and extract the words above a minimum score threshold

Question (Votes: 1, Answers: 1)

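I have taken one column of a dataset in which every row holds a free-text description. I am trying to find the words whose tf-idf score is greater than some value n. The code below only gives me a matrix of scores, though. How can I sort and filter the scores and see the corresponding words?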

from sklearn.feature_extraction.text import TfidfVectorizer

# wineData is the asker's full DataFrame (not shown here).
# Keep only the Shiraz descriptions and lower-case them.
tempDataFrame = wineData.loc[wineData.variety == 'Shiraz', 'description'].reset_index()
tempDataFrame['description'] = tempDataFrame['description'].str.lower()

tfidf = TfidfVectorizer(analyzer='word', stop_words='english')
score = tfidf.fit_transform(tempDataFrame['description'])

Sample Data:
description
This tremendous 100% varietal wine hails from Oakville and was aged over 
three years in oak. Juicy red-cherry fruit and a compelling hint of caramel 
greet the palate, framed by elegant, fine tannins and a subtle minty tone in 
the background. Balanced and rewarding from start to finish, it has years 
ahead of it to develop further nuance. Enjoy 2022–2030.
pandas tf-idf
1 Answer
0 votes

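Since the full data frame column of wine descriptions is not available, the sample data you provided is split into three sentences to build a data frame with a single column named 'Description' and three rows. That column is passed to the tf-idf vectorizer, and a new data frame is built holding the features (tokens) and their scores. The result is then filtered with pandas.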

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

doc = ['This tremendous 100% varietal wine hails from Oakville and was aged over \
three years in oak.', 'Juicy red-cherry fruit and a compelling hint of caramel \
greet the palate, framed by elegant, fine tannins and a subtle minty tone in \
the background.', 'Balanced and rewarding from start to finish, it has years \
ahead of it to develop further nuance. Enjoy 2022–2030.']

df_1 = pd.DataFrame({'Description': doc})

tfidf = TfidfVectorizer(analyzer='word', stop_words = 'english')
score = tfidf.fit_transform(df_1['Description'])

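# New data frame containing the tfidf features and their scores
# (get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2)
df = pd.DataFrame(score.toarray(), columns=tfidf.get_feature_names_out())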

# Filter the tokens with tfidf score greater than 0.3
tokens_above_threshold = df.max()[df.max() > 0.3].sort_values(ascending=False)

tokens_above_threshold
Out[29]: 
wine          0.341426
oak           0.341426
aged          0.341426
varietal      0.341426
hails         0.341426
100           0.341426
oakville      0.341426
tremendous    0.341426
nuance        0.307461
rewarding     0.307461
start         0.307461
enjoy         0.307461
develop       0.307461
balanced      0.307461
ahead         0.307461
2030          0.307461
2022          0.307461
finish        0.307461