Word counter in Python is undercounting results

Question

As a full preface, I am a beginner and still learning. That said, here is a sample schema of my product reviews table.

Record_ID	Product_ID	Product Review Comment
1234	89847457	I love this product, it shipped fast and it is comfortable

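Here is my code. It gives me a total word count across all the reviews, plus a separate phrase count that tries to capture more context, e.g. ("flimsy", "tight") if a shirt fits tight and the quality is flimsy. The script writes new export files with the counts for both.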

import pandas as pd
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
import string
from collections import Counter
from nltk.util import ngrams
import nltk
nltk.download('punkt')

df = pd.read_excel('productsvydata.xlsx')

def preprocess_text(text):
    translator = str.maketrans('', '', string.punctuation)
    text = text.lower() 
    text = text.translate(translator)
    return text

word_counts = {}
phrase_counts = {}

unique_product_ids = df["Product_ID"].unique()

# Set the number of top words and phrases you want to keep
top_n = 100

for selected_product_id in unique_product_ids:
    selected_comments_df = df[df["Product_ID"] == selected_product_id]
    selected_comments = ' '.join(selected_comments_df["Product Review Comment"].astype(str))
    selected_comments = preprocess_text(selected_comments)
    if not selected_comments.strip():
        continue
    tokenized_words = nltk.word_tokenize(selected_comments)
    stop_words = set(ENGLISH_STOP_WORDS)
    filtered_words = [word for word in tokenized_words if word not in stop_words]
    lemmatizer = nltk.WordNetLemmatizer()
    lemmatized_words = [lemmatizer.lemmatize(word) for word in filtered_words]
    max_phrase_length = 4
    phrases = [phrase for n in range(2, max_phrase_length + 1) for phrase in ngrams(lemmatized_words, n)]
    word_counter = Counter(lemmatized_words)
    phrase_counter = Counter(phrases)

    # Get the top N words and phrases
    top_words = dict(word_counter.most_common(top_n))
    top_phrases = dict(phrase_counter.most_common(top_n))

    # Extract record_id for each Product_ID
    record_ids = selected_comments_df["record_id"].values[0]

    word_counts[(selected_product_id, record_ids)] = top_words
    phrase_counts[(selected_product_id, record_ids)] = top_phrases

word_result_data = []
phrase_result_data = []

for (product_id, record_id), top_words in word_counts.items():
    for word, count in top_words.items():
        word_result_data.append([product_id, record_id, word, count])
for (product_id, record_id), top_phrases in phrase_counts.items():
    for phrase, count in top_phrases.items():
        phrase_result_data.append([product_id, record_id, phrase, count])

word_df = pd.DataFrame(word_result_data, columns=['Product_ID', 'record_id', 'Word', 'Count'])
phrase_df = pd.DataFrame(phrase_result_data, columns=['Product_ID', 'record_id', 'Phrase', 'Count'])

word_df.to_csv('top_words_counts.csv', index=False)
phrase_df.to_csv('top_phrases_counts.csv', index=False)

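I use top_n = 100 to limit the export to the top 100 words, because there are more than 20,000 rows of data and the program will not run if I do all the words and phrases. It needs both the product ID and the record ID because that is what it joins on in my work tool.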

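The problem is that I feel the results are undercounted. I wonder whether it has to do with tokenization. For example, right now my export has 9 instances of the word "customer". In the phrase counts, ("customer", "service") appears even fewer times. If I just Ctrl+F through the raw product reviews in the original document, far more people are talking about customer service. Something is going wrong in the processing, but I don't know what.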

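Can anyone suggest ways to better optimize this code and yield more results? This is very basic NLP, but again, I'm new and want to learn, and I've hit a roadblock with my output.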

Answer 1

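Although it makes lemmatization more difficult, I always recommend using sklearn's CountVectorizer, which includes stop-word removal, rather than doing things the hard way with nltk and base Python.

Also, in your preprocessing you can use the apply method to preprocess the whole review column in one go, more efficiently. I'd suggest you don't need to join all the reviews for each product and then tokenize; instead, do word/n-gram counts for each record and then just group by product ID to sum the counts: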

from sklearn.feature_extraction.text import CountVectorizer

# Preprocess the review column (fill missing reviews so .lower() does not fail)
df['Product Review Comment'] = df['Product Review Comment'].fillna('').astype(str).apply(preprocess_text)

# Instantiate CountVectorizer
cv = CountVectorizer(stop_words='english', min_df=100, ngram_range=(1,5))

# Create document-term dataframe of word counts per record
dtm = cv.fit_transform(df['Product Review Comment'])
dtm_df = pd.DataFrame(dtm.todense(), columns=cv.get_feature_names_out(), index=df.index)

# Join the product ID to the counts (text columns are left out so sum() only sees numbers)
joined_df = pd.concat([df[['Product_ID']], dtm_df], axis=1)

# Find the sum of word counts per product
word_count_df = joined_df.groupby('Product_ID').sum().reset_index()

# Flatten / convert DTM from wide format to long format
long_df = pd.melt(word_count_df, id_vars=['Product_ID'], var_name='var', value_name='value')

# Find 100 top words per product (sort first; head() alone keeps rows in their existing order)
top_df = long_df.sort_values('value', ascending=False).groupby('Product_ID').head(100)

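For lemmatization, you would need to write your own tokenizer function that includes it and pass it to CountVectorizer through the tokenizer parameter. Also, if you have a large dataset, you may want to set min_df higher so that your document-term matrix does not get too large. Keep in mind that min_df drops every term that appears in fewer documents than the threshold; if you are only concerned with the top terms (over the whole dataset), a higher value should be fine.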

Hope this helps!
