Logistic regression in scikit-learn predicts a higher probability than expected for text classification

Question

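I am working on a feature that prompts users to attach an image, based on recent conversations in which images were shared.

I have implemented a prediction mechanism with scikit-learn for the case where a user tries to send a message without an image in a similar context. However, I have run into a problem: the prediction for the message "how are you?" returns 0.66, whereas ideally it should be below 0.5.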

Here is the code:

import re  # needed for re.sub below

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

messages = {'Did you receive the image yesterday?': 0,
            'I\'m going to send you a picture of my cat.': 1,
            'I\'m at the beach and I\'m taking a photo of the sunset.': 1,
            'I\'m going to send you a video of my dog playing fetch.': 1,
            'I\'m going to send you a screenshot of my computer screen.': 1,
            'Please find the attachment': 1, 'attached file': 1,
            'attached image': 1, 'attached': 1, 'yesterday?': 1, 'did you receive': 0}

pd_messages = pd.DataFrame({'text': messages.keys(),
                            'has_image': messages.values()})

features = pd_messages['text']
labels = pd_messages['has_image']

tfidf_vectorizer = TfidfVectorizer()
features_tfidf = tfidf_vectorizer.fit_transform(features)

model = LogisticRegression(solver='liblinear')
model.fit(features_tfidf, labels)

message = "how are you?"
message = re.sub(r'[^\w\s]', '', message.lower())
message_tfidf = tfidf_vectorizer.transform([message])

prediction = model.predict_proba(message_tfidf)[:, 1]

print(prediction)

Even after adding 'how are you?': 0 to the messages dictionary, the prediction is still above 0.5. Why does this happen?

Answer 1

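There are a few problems with your example:

The classes are imbalanced (9 examples labelled 1 versus 2 labelled 0), which tends to push a model such as logistic regression toward the majority class.
The training dataset is very small.
TF-IDF may not be the best choice for capturing conversational context.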
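
To make the points above concrete, here is a minimal sketch that reuses the question's data and splits the score for "how are you?" into its parts. The variable names (known_tokens, you_contribution, and so on) are illustrative and not from the original post. With the default TfidfVectorizer tokenizer, "you" is the only query word that occurs in the training vocabulary, so the logit is just the intercept plus the weight learned for "you"; printing both shows how much of the prediction comes from the class prior rather than from the message itself.

```python
import re

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Training data copied from the question.
messages = {
    "Did you receive the image yesterday?": 0,
    "I'm going to send you a picture of my cat.": 1,
    "I'm at the beach and I'm taking a photo of the sunset.": 1,
    "I'm going to send you a video of my dog playing fetch.": 1,
    "I'm going to send you a screenshot of my computer screen.": 1,
    "Please find the attachment": 1,
    "attached file": 1,
    "attached image": 1,
    "attached": 1,
    "yesterday?": 1,
    "did you receive": 0,
}
texts = list(messages.keys())
labels = list(messages.values())

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)
model = LogisticRegression(solver="liblinear").fit(X, labels)

# Same preprocessing as in the question.
query = re.sub(r"[^\w\s]", "", "how are you?".lower())
query_vec = vectorizer.transform([query])

# Which query words does the vectorizer know at all?
known_tokens = [t for t in query.split() if t in vectorizer.vocabulary_]

# Split the logit into the intercept and the contribution of "you".
you_idx = vectorizer.vocabulary_["you"]
you_contribution = query_vec[0, you_idx] * model.coef_[0, you_idx]
logit = model.decision_function(query_vec)[0]
proba = model.predict_proba(query_vec)[0, 1]

print("class counts (0, 1):", labels.count(0), labels.count(1))
print("known tokens:", known_tokens)
print("intercept:", model.intercept_[0])
print("contribution of 'you':", you_contribution)
print("P(has_image=1):", proba)
```

Because "how" and "are" never appear in the training texts, they are silently dropped at transform time, and the model has almost nothing message-specific to work with.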

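Possible solutions:

Generate more class-0 samples until your training set is balanced, or use a balancing technique such as minority oversampling or majority undersampling.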

# Add more examples with has_image=0
messages['how are you?'] = 0
messages['hello'] = 0

Increase the number of training samples.
You may want to try word embeddings.
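
As a rough illustration of the minority oversampling mentioned above, the sketch below uses sklearn.utils.resample to duplicate class-0 rows until both classes have the same count. The DataFrame layout follows the question; the names majority, minority and balanced, and the fixed random_state, are assumptions made for this example.

```python
import pandas as pd
from sklearn.utils import resample

# Same texts and labels as in the question.
df = pd.DataFrame({
    "text": [
        "Did you receive the image yesterday?",
        "I'm going to send you a picture of my cat.",
        "I'm at the beach and I'm taking a photo of the sunset.",
        "I'm going to send you a video of my dog playing fetch.",
        "I'm going to send you a screenshot of my computer screen.",
        "Please find the attachment",
        "attached file",
        "attached image",
        "attached",
        "yesterday?",
        "did you receive",
    ],
    "has_image": [0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0],
})

majority = df[df["has_image"] == 1]
minority = df[df["has_image"] == 0]

# Sample the minority class with replacement up to the majority size.
minority_upsampled = resample(
    minority,
    replace=True,
    n_samples=len(majority),
    random_state=0,  # fixed only to make the sketch reproducible
)

balanced = pd.concat([majority, minority_upsampled]).reset_index(drop=True)
counts = balanced["has_image"].value_counts()
print(counts)
```

Duplicating two rows does not add any new vocabulary, so this only corrects the class prior; genuinely new class-0 messages are still needed. LogisticRegression also accepts class_weight="balanced", which reweights the classes instead of duplicating rows.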
