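I have some training data (TRAIN) and some test data (TEST). Each row of each DataFrame contains an observed class (X) and several binary columns (Y). BernoulliNB uses the training data to predict the probability of X given Y in the test data. I am trying to look up the probability of the observed class for each row of the test data (Pr).
Edit: I used Antoine Zambelli's suggestion to fix the code: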
import numpy as np
import pandas as pd
from sklearn.naive_bayes import BernoulliNB
BNB = BernoulliNB()
# Training Data
TRAIN = pd.DataFrame({'X' : [1,2,3,9],
'Y1': [1,1,0,0],
'Y4': [1,0,0,0]})
# Test Data
TEST = pd.DataFrame({'X' : [5,0,1,1,1,2,2,2,2],
'Y1': [1,1,0,1,0,1,0,0,0],
'Y2': [1,0,1,0,1,0,1,0,1],
'Y3': [1,1,0,1,1,0,0,0,0],
'Y4': [1,1,0,1,1,0,0,0,0]})
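# Add the information that TRAIN has none of the missing items
diff_cols = set(TEST.columns) - set(TRAIN.columns)
for i in diff_cols:
    TRAIN[i] = 0
# Put the TRAIN columns in the same order as TEST so the features line up
TRAIN = TRAIN[TEST.columns]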
# Split the data
Se_Tr_X = TRAIN['X']
Se_Te_X = TEST['X']
df_Tr_Y = TRAIN.drop('X', axis=1)
df_Te_Y = TEST.drop('X', axis=1)
# Train: Bernoulli Naive Bayes Classifier
A_F = BNB.fit(df_Tr_Y, Se_Tr_X)
# Test: Predict Probability
Ar_R = BNB.predict_proba(df_Te_Y)
df_R = pd.DataFrame(Ar_R)
# Rename the columns after the classes of X
df_R.columns = BNB.classes_
df_S = df_R.join(TEST)
# Look up the predicted probability of the observed X
# Skip X's that are not in the training data
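def get_lu(df):
    def lu(i, j):
        # Column j is the class label, row i is the observation;
        # fall back to NaN when the class is not a column of df.
        return df.get(j, {}).get(i, np.nan)
    return lu
df_S['Pr'] = [*map(get_lu(df_R), df_S.index, df_S.X)]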
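This seems to work and gives me the result (df_S):
This correctly gives "NaN" for the first two rows, because the training data contains no information about the classes X = 5 or X = 0.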
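OK, there are a few issues here. I have a complete working example below, but first the issues. The main one is the assertion that "this correctly gives NaN for the first two rows".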
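This comes down to how classification algorithms are used and what they can do. The training data contains all the information you want the algorithm to know and be able to act on. The test data is only processed in light of that information. Even if you (the person) know that a test label is 5 and that it does not appear in the training data, the algorithm does not know that. It only looks at the feature data and tries to predict a label from it. So it cannot return nan (or 5, or anything else that is not in the training set) - the nan comes from your own work going from df_R to df_S.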
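To make this concrete, here is a minimal sketch on toy data shaped like the question's. The names model, train_features, train_labels and test_features are made up for the illustration. It only checks that the output of predict_proba has one column per entry of classes_ and contains no NaN, and that every predicted label is one of the training labels.

```python
import numpy as np
import pandas as pd
from sklearn.naive_bayes import BernoulliNB

# Toy training data: labels 1, 2, 3, 9 and two binary features.
train_features = pd.DataFrame({'Y1': [1, 1, 0, 0], 'Y4': [1, 0, 0, 0]})
train_labels = pd.Series([1, 2, 3, 9])

# Test features only. Even if the "true" labels were 5 and 0,
# the model is never shown them.
test_features = pd.DataFrame({'Y1': [1, 1], 'Y4': [1, 1]})

model = BernoulliNB().fit(train_features, train_labels)
proba = model.predict_proba(test_features)
predicted = model.predict(test_features)

print(model.classes_)                         # labels seen during fit
print(proba.shape)                            # (n_test_rows, n_classes)
print(np.isnan(proba).any())                  # whether any probability is NaN
print(set(predicted) <= set(model.classes_))  # predictions drawn from classes_
```

Any NaN in the final table therefore has to be introduced by the lookup step that runs after prediction, not by the classifier.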
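This leads to the second issue, namely the line df_Te_Y = TEST .iloc[ : , 1 : ], which should be df_Te_Y = TEST .iloc[ : , 2 : ] so that it does not include the label data. Label data belongs only in the training set. Predicted labels will only ever be drawn from the set of labels that appear in the training data.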
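Positional slicing like this depends on where the label column happens to sit, so one option is to select by column name instead. The sketch below is only an illustration: split_features_labels is a hypothetical helper, and filling a missing feature with 0 assumes that "missing" means "absent", as in the question.

```python
import pandas as pd

def split_features_labels(df, label_col):
    """Hypothetical helper: split a frame into (features, labels) by column name."""
    features = df.drop(columns=[label_col])
    labels = df[label_col]
    return features, labels

TRAIN = pd.DataFrame({'X': [1, 2, 3, 9], 'Y1': [1, 1, 0, 0], 'Y4': [1, 0, 0, 0]})
# The label column is deliberately not first here.
TEST = pd.DataFrame({'Y4': [1, 0], 'X': [5, 2], 'Y1': [1, 0], 'Y2': [1, 1]})

train_features, train_labels = split_features_labels(TRAIN, 'X')
test_features, test_labels = split_features_labels(TEST, 'X')

# Give both frames the same feature columns in the same order.
# A feature missing from one frame is filled with 0 (assumed to mean "absent").
all_cols = sorted(set(train_features.columns) | set(test_features.columns))
train_features = train_features.reindex(columns=all_cols, fill_value=0)
test_features = test_features.reindex(columns=all_cols, fill_value=0)

print(list(train_features.columns))
print(list(test_features.columns))
```

With both frames aligned this way, the label column cannot leak into the features regardless of column order.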
Note: I have changed the class labels to Y and the feature data to X, since that is the standard in the literature.
from sklearn.naive_bayes import BernoulliNB
from sklearn.metrics import accuracy_score
import pandas as pd
BNB = BernoulliNB()
# Training Data
train_df = pd.DataFrame({'Y' : [1,2,3,9], 'X1': [1,1,0,0], 'X2': [0,0,0,0], 'X3': [0,0,0,0], 'X4': [1,0,0,0]})
# Test Data
test_df = pd.DataFrame({'Y' : [5,0,1,1,1,2,2,2,2],
'X1': [1,1,0,1,0,1,0,0,0],
'X2': [1,0,1,0,1,0,1,0,1],
'X3': [1,1,0,1,1,0,0,0,0],
'X4': [1,1,0,1,1,0,0,0,0]})
X = train_df.drop('Y', axis=1) # Known training data - all but 'Y' column.
Y = train_df['Y'] # Known training labels - just the 'Y' column.
X_te = test_df.drop('Y', axis=1) # Test data.
Y_te = test_df['Y'] # Only used to measure accuracy of prediction - if desired.
Ar_R = BNB.fit(X, Y).predict_proba(X_te) # Can be combined to a single line.
df_R = pd.DataFrame(Ar_R)
df_R.columns = BNB.classes_ # Rename as per class labels.
# Columns are class labels and Rows are observations.
# Each entry is a probability of that observation being assigned to that class label.
print(df_R)
predicted_labels = df_R.idxmax(axis=1).values # For each row, take the column with the highest prob in that row.
print(predicted_labels) # [1 1 3 1 3 2 3 3 3]
print(accuracy_score(Y_te, predicted_labels)) # Fraction of predictions that match Y_te (0.0 to 1.0).
print(BNB.fit(X, Y).predict(X_te)) # [1 1 3 1 3 2 3 3 3], can be used in one line if predicted_label is all we want.
# NOTE: change train_df to have 'Y': [1,2,1,9] and we get predicted_labels = [1 1 9 1 1 1 9 1 9].
# So probabilities have changed.
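If what you want is still the predicted probability of each row's observed label, the lookup can be written so that the NaN is visibly your own choice for unseen labels rather than something the classifier returns. This is a sketch under the same data as above; model, proba, col_pos and the Pr column are names chosen for the illustration.

```python
import numpy as np
import pandas as pd
from sklearn.naive_bayes import BernoulliNB

train_df = pd.DataFrame({'Y': [1, 2, 3, 9], 'X1': [1, 1, 0, 0], 'X2': [0, 0, 0, 0],
                         'X3': [0, 0, 0, 0], 'X4': [1, 0, 0, 0]})
test_df = pd.DataFrame({'Y': [5, 0, 1, 1, 1, 2, 2, 2, 2],
                        'X1': [1, 1, 0, 1, 0, 1, 0, 0, 0],
                        'X2': [1, 0, 1, 0, 1, 0, 1, 0, 1],
                        'X3': [1, 1, 0, 1, 1, 0, 0, 0, 0],
                        'X4': [1, 1, 0, 1, 1, 0, 0, 0, 0]})
feature_cols = ['X1', 'X2', 'X3', 'X4']

model = BernoulliNB().fit(train_df[feature_cols], train_df['Y'])
proba = pd.DataFrame(model.predict_proba(test_df[feature_cols]),
                     columns=model.classes_, index=test_df.index)

# Position of each observed label among the known classes; -1 means "never seen in training".
col_pos = proba.columns.get_indexer(test_df['Y'])

# Start from NaN everywhere and fill in only the rows whose label the model knows.
pr = np.full(len(test_df), np.nan)
known_rows = np.flatnonzero(col_pos >= 0)
pr[known_rows] = proba.to_numpy()[known_rows, col_pos[known_rows]]

test_df['Pr'] = pr
print(test_df[['Y', 'Pr']])
```

Rows whose observed label is 5 or 0 stay NaN only because the lookup leaves them that way; the model still assigned those rows a full probability distribution over 1, 2, 3 and 9.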
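If this does not make sense after reading through the code, I suggest looking at some tutorials or other material on classification algorithms.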