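I am working on a feature that prompts users to attach an image, based on recent conversations in which images were shared. To catch the case where a user tries to send a message without an image in a similar context, I implemented a prediction mechanism using scikit-learn. However, the prediction for the message "how are you?" returns 0.66, whereas ideally it should be below 0.5.
Here is the code: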
import re

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Training messages mapped to a label: 1 = an image was attached, 0 = no image
messages = {'Did you receive the image yesterday?': 0,
            'I\'m going to send you a picture of my cat.': 1,
            'I\'m at the beach and I\'m taking a photo of the sunset.': 1,
            'I\'m going to send you a video of my dog playing fetch.': 1,
            'I\'m going to send you a screenshot of my computer screen.': 1,
            'Please find the attachment': 1, 'attached file': 1,
            'attached image': 1, 'attached': 1, 'yesterday?': 1, 'did you receive': 0}
# list() turns the dict views into plain sequences for the DataFrame constructor
pd_messages = pd.DataFrame({'text': list(messages.keys()),
                            'has_image': list(messages.values())})
features = pd_messages['text']
labels = pd_messages['has_image']
# Turn each message into a TF-IDF vector
tfidf_vectorizer = TfidfVectorizer()
features_tfidf = tfidf_vectorizer.fit_transform(features)
model = LogisticRegression(solver='liblinear')
model.fit(features_tfidf, labels)
message = "how are you?"
# Lowercase the message and strip punctuation before vectorizing
message = re.sub(r'[^\w\s]', '', message.lower())
message_tfidf = tfidf_vectorizer.transform([message])
# Probability of class 1 (the message should come with an image)
prediction = model.predict_proba(message_tfidf)[:, 1]
print(prediction)
Even after adding 'how are you?': 0 to the messages dictionary, the prediction stays above 0.5. Why does this happen?
There are a few problems with your example:
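One of them can be checked directly. With the default `TfidfVectorizer` settings, the only token of "how are you?" that the fitted vocabulary knows is "you", and 9 of the 11 training messages are labeled 1, so the score is determined by the model's intercept plus the single weight for "you". The snippet below is a rough diagnostic sketch, not a fix: it rebuilds the same model and recomputes the probability by hand. The variable names (`known`, `prob_manual`, and so on) are mine and do not appear in the question.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Same training data as in the question (1 = image attached, 0 = no image).
messages = {
    "Did you receive the image yesterday?": 0,
    "I'm going to send you a picture of my cat.": 1,
    "I'm at the beach and I'm taking a photo of the sunset.": 1,
    "I'm going to send you a video of my dog playing fetch.": 1,
    "I'm going to send you a screenshot of my computer screen.": 1,
    "Please find the attachment": 1,
    "attached file": 1,
    "attached image": 1,
    "attached": 1,
    "yesterday?": 1,
    "did you receive": 0,
}
texts = list(messages.keys())
labels = list(messages.values())

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)
model = LogisticRegression(solver="liblinear").fit(X, labels)

# Which tokens of the query does the fitted vocabulary actually contain?
tokens = vectorizer.build_analyzer()("how are you?")
known = [t for t in tokens if t in vectorizer.vocabulary_]
print("known tokens:", known)
print("positive labels:", sum(labels), "of", len(labels))

# Recompute the probability by hand: sigmoid(intercept + w . x).
x = vectorizer.transform(["how are you?"]).toarray()[0]
logit = model.intercept_[0] + x @ model.coef_[0]
prob_manual = 1.0 / (1.0 + np.exp(-logit))
prob_model = model.predict_proba(vectorizer.transform(["how are you?"]))[0, 1]
print("intercept:", model.intercept_[0])
print("manual:", prob_manual, "model:", prob_model)
```

If the intercept printed here is clearly positive, that is the label imbalance showing through: a message made mostly of unknown words falls back to roughly the base rate of the training labels.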
Possible solutions:
# Add more examples with has_image=0
messages['how are you?'] = 0
messages['hello'] = 0
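Note that adding entries to `messages` only changes the dict. The vectorizer and the model have to be refit before the new examples have any effect. A minimal sketch of that follows, assuming a helper name of my own (`fit_image_model`) and, as an optional extra that is not in the original code, `class_weight="balanced"` to offset the label imbalance:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline


def fit_image_model(messages, class_weight=None):
    """Fit TF-IDF + logistic regression on a {text: label} dict.

    Hypothetical helper for illustration; not part of the original code.
    """
    pipeline = make_pipeline(
        TfidfVectorizer(),
        LogisticRegression(solver="liblinear", class_weight=class_weight),
    )
    pipeline.fit(list(messages.keys()), list(messages.values()))
    return pipeline


messages = {
    "Did you receive the image yesterday?": 0,
    "I'm going to send you a picture of my cat.": 1,
    "I'm at the beach and I'm taking a photo of the sunset.": 1,
    "I'm going to send you a video of my dog playing fetch.": 1,
    "I'm going to send you a screenshot of my computer screen.": 1,
    "Please find the attachment": 1,
    "attached file": 1,
    "attached image": 1,
    "attached": 1,
    "yesterday?": 1,
    "did you receive": 0,
}

# Add more examples with has_image=0, then refit on the full dict.
messages["how are you?"] = 0
messages["hello"] = 0

pipeline = fit_image_model(messages, class_weight="balanced")
prob = pipeline.predict_proba(["how are you?"])[0, 1]
print(prob)
```

Whether the score ends up below 0.5 still depends on how many negative examples you add. With this little data the result is fragile, so treat the output as a sanity check and not as a tuned threshold.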