Logistic regression model produces 100% accuracy

Question

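I have scraped Amazon reviews for a product and am now trying to train a logistic regression model on them to classify customer reviews. It reports 100% accuracy, and I cannot understand why. Here is a sample of my dataset: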

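Name	Stars	Title	Date	Description
二苯泮	5	5.0 out of 5 stars	N/A	Very good perfume. Recommended seller - Sun Fragrances
圣克沙阿	5	5.0 out of 5 stars	N/A	Yes
马诺兰吉达姆	5	5.0 out of 5 stars	N/A	This perfume ranks third.. good one :)
穆克蒂卡	5	5.0 out of 5 stars	N/A	I received Versace Bright as a gift on my 25th birthday. The fragrance lasts at least 24 hours. I love it. It is one of my best collections.
megh	5	5.0 out of 5 stars	N/A	I have this perfume, but did not buy it online. The smell is amazing. It stays for at least 2 days even if you take a bath or wash your clothes. I got a lot of compliments..
里亚	5	5.0 out of 5 stars	N/A	Bought it from elsewhere, the fragrance is great, the pure rose scent lasts a long time, and my boyfriend also loves this perfume I bought.
manisha.chauhan0091	5	5.0 out of 5 stars	N/A	It is light and long-lasting, I love it
UPS	1	1.0 out of 5 stars	N/A	Definitely fake. The fragrance lasts only 15 minutes. Also very irritating to the skin.
萨那	1	1.0 out of 5 stars	N/A	A scam game. Counterfeit product. Don't fall for it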
朱莉安娜·苏亚雷斯·费雷拉	N/A	Ótimo produto	N/A	Produto verdadeiro, com cheio da riqueza, não fixa muito, mas é delicioso. Dura na minha pele umas 3 horas e depois fica um cheirinho leve... Super recomendo

Here is my code:

import re
import nltk
import numpy as np
import pandas as pd
from bs4 import BeautifulSoup
from nltk.sentiment import SentimentIntensityAnalyzer
from nltk.tokenize import word_tokenize
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.utils.class_weight import compute_class_weight

# Ensure necessary NLTK datasets and models are downloaded
# nltk.download('punkt')
# nltk.download('vader_lexicon')

# Load the data
df = pd.read_csv("reviews.csv")  # Make sure to replace 'reviews.csv' with your actual file path

# Preprocess data
df['Stars'] = df['Stars'].fillna(3.0)  # Handle missing values
df['Title'] = df['Title'].str.lower()  # Standardize text formats
df['Description'] = df['Description'].str.lower()
df = df.drop(['Name', 'Date'], axis=1)  # Drop unnecessary columns
print(df)


# Categorize sentiment based on star ratings
def categorize_sentiment(stars):
    if stars >= 4.0:
        return 'Positive'
    elif stars <= 2.0:
        return 'Negative'
    else:
        return 'Neutral'


df['Sentiment'] = df['Stars'].apply(categorize_sentiment)


# Clean and tokenize text
def clean_text(text):
    text = BeautifulSoup(text, "html.parser").get_text()
    letters_only = re.sub("[^a-zA-Z]", " ", text)
    return letters_only.lower()


def tokenize(text):
    return word_tokenize(text)


df['Clean_Description'] = df['Description'].apply(clean_text)
df['Tokens'] = df['Clean_Description'].apply(tokenize)

# Apply NLTK's VADER for sentiment analysis
sia = SentimentIntensityAnalyzer()


def get_sentiment(text):
    score = sia.polarity_scores(text)
    if score['compound'] >= 0.05:
        return 'Positive'
    elif score['compound'] <= -0.05:
        return 'Negative'
    else:
        return 'Neutral'


df['NLTK_Sentiment'] = df['Clean_Description'].apply(get_sentiment)
print("df['NLTK_Sentiment'].value_counts()")
print(df['NLTK_Sentiment'].value_counts())

# Prepare data for machine learning
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(tokenizer=tokenize)
X = vectorizer.fit_transform(df['Clean_Description'])
y = df['NLTK_Sentiment'].apply(lambda x: 1 if x == 'Positive' else 0)

# Split the dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=80)

# Train a Logistic Regression model

# Compute class weights
class_weights = compute_class_weight('balanced', classes=np.unique(y_train), y=y_train)
class_weights_dict = dict(enumerate(class_weights))
print(f"class_weights_dict {class_weights_dict}")
# Apply to Logistic Regression
# model = LogisticRegression(class_weight=class_weights_dict)
model = LogisticRegression(C=0.001, penalty='l2', class_weight='balanced')

model.fit(X_train, y_train)

# Predict sentiments on the test set
predictions = model.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, predictions)
precision = precision_score(y_test, predictions, average='weighted')
recall = recall_score(y_test, predictions, average='weighted')
f1 = f1_score(y_test, predictions, average='weighted')

print(f"Accuracy: {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"F1 Score: {f1:.4f}")

Here are the results of the print statements:

NLTK_Sentiment
Positive    8000
Negative    2000
Name: count, dtype: int64

class_weights_dict {0: 2.3696682464454977, 1: 0.6337135614702155}
Accuracy: 1.0000
Precision: 1.0000
Recall: 1.0000
F1 Score: 1.0000

I cannot figure out why my model always gives 100% accuracy.

Answer 1

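Your NLTK_Sentiment column is derived from the sentiment of the Clean_Description column. Your X matrix is also derived from the Clean_Description column. In other words, the labels and the features both come from the same text.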

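You are essentially testing whether there is a linear relationship between the (TF-IDF weighted) count of each token and the VADER classification. Since VADER works by assigning each word a score between -4 and 4 and then summing those scores, the relationship is linear. (This is not entirely true: VADER can recognize some idioms such as "bad ass" and negations such as "not good", but outside of these special cases it is linear.)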
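To make the linearity argument concrete, here is a minimal sketch in plain Python. It does not use VADER itself: `TOY_LEXICON` is a hypothetical stand-in with made-up valence values, and `lexicon_label` and `linear_model_label` are illustrative helpers, not part of NLTK or scikit-learn. It shows that a "sum the word scores" labeler is exactly reproduced by a linear decision rule over bag-of-words counts, which is the kind of rule logistic regression can learn.

```python
from collections import Counter

# Hypothetical toy lexicon; these are NOT VADER's real valence values.
TOY_LEXICON = {
    "great": 3.0, "love": 3.2, "good": 1.9,
    "fake": -2.1, "scam": -2.7, "irritating": -1.8,
}


def lexicon_label(text):
    """Label a text 1 (positive) if its summed word scores exceed 0, else 0."""
    score = sum(TOY_LEXICON.get(tok, 0.0) for tok in text.split())
    return 1 if score > 0 else 0


def linear_model_label(text, vocab, weights, bias=0.0):
    """Apply a linear decision rule to the bag-of-words count vector."""
    counts = Counter(text.split())
    x = [counts.get(tok, 0) for tok in vocab]
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if z > 0 else 0


reviews = [
    "great perfume love it",
    "definitely fake a scam",
    "good smell but irritating",
    "fake fake good",
    "yes",
]

# One weight per vocabulary token, set to that token's lexicon score.
vocab = sorted({tok for r in reviews for tok in r.split()})
weights = [TOY_LEXICON.get(tok, 0.0) for tok in vocab]

rule_labels = [lexicon_label(r) for r in reviews]
linear_labels = [linear_model_label(r, vocab, weights) for r in reviews]
agreement = sum(a == b for a, b in zip(rule_labels, linear_labels)) / len(reviews)
print(f"Agreement: {agreement:.2f}")
```

The two labelers agree on every review here because a set of weights that separates the classes perfectly exists by construction. A classifier trained on such labels only has to find weights close to the lexicon's.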

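So logistic regression is essentially just recovering the word-level weights in VADER. You have given it an easy problem, which is why you get such a high score.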
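One possible way out, sketched below under some assumptions, is to take the label from something the model does not see, such as the star rating, rather than from VADER's reading of the same text. The `descriptions` and `stars` lists are hypothetical toy data standing in for the Description and Stars columns, and the 4-star threshold mirrors `categorize_sentiment` in the question. This is a sketch of the idea, not a tuned solution.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Hypothetical toy data standing in for df['Description'] and df['Stars'].
descriptions = [
    "very good perfume recommended seller",
    "the smell is amazing i got many compliments",
    "light and long lasting i love it",
    "pure rose scent lasts a long time",
    "definitely fake lasts only minutes",
    "a scam counterfeit product",
    "very irritating to the skin",
    "fake product do not buy",
]
stars = [5, 5, 5, 5, 1, 1, 1, 1]

# Labels come from the star rating, not from the text the model sees.
y = [1 if s >= 4 else 0 for s in stars]

# Stratify so both classes appear in the training and test splits.
X_train, X_test, y_train, y_test = train_test_split(
    descriptions, y, test_size=0.5, random_state=80, stratify=y
)

# The pipeline fits TF-IDF on the training split only, so no vocabulary
# statistics from the test split reach the model.
model = make_pipeline(
    TfidfVectorizer(),
    LogisticRegression(class_weight="balanced"),
)
model.fit(X_train, y_train)

accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"Accuracy: {accuracy:.4f}")
```

With real data the score should land below 100%, since star ratings are not a deterministic function of the words in the review. Reviews with a missing rating would need to be dropped or handled separately, because they have no trustworthy label.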
