How to run a model on new data that requires pd.get_dummies

Question

I have a model that runs on the following data:

import pandas as pd
import numpy as np

# initialize list of lists 
data = [['tom', 10,1,'a'], ['tom', 15,5,'a'], ['tom', 14,1,'a'], ['tom', 15,4,'b'], ['tom', 18,1,'b'], ['tom', 15,6,'a'], ['tom', 17,3,'a']
       , ['tom', 14,7,'b'], ['tom',16 ,6,'a'], ['tom', 22,2,'a'],['matt', 10,1,'c'], ['matt', 15,5,'b'], ['matt', 14,1,'b'], ['matt', 15,4,'a'], ['matt', 18,1,'a'], ['matt', 15,6,'a'], ['matt', 17,3,'a']
       , ['matt', 14,7,'c'], ['matt',16 ,6,'b'], ['matt', 10,2,'b']]

# Create the pandas DataFrame 
df = pd.DataFrame(data, columns = ['Name', 'Attempts','Score','Category']) 

print(df.head(2))
  Name  Attempts  Score Category
0  tom        10      1        a
1  tom        15      5        a

Then I created a dummy-encoded DataFrame to use in the model, with the following code:

from sklearn.linear_model import LogisticRegression

df_dum = pd.get_dummies(df)
print(df_dum.head(2))
  Attempts  Score  Name_matt  Name_tom  Category_a  Category_b  Category_c
0        10      1          0         1           1           0           0
1        15      5          0         1           1           0           0

Then I created the following model:

#Model

X = df_dum.drop(('Score'),axis=1)
y = df_dum['Score'].values

#Training Size
train_size = int(X.shape[0]*.7)
X_train = X[:train_size]
X_test = X[train_size:]
y_train = y[:train_size]
y_test = y[train_size:]


#Fit Model
model = LogisticRegression(max_iter=1000)
model.fit(X_train,y_train)


#Send predictions back to dataframe
Z = model.predict(X_test)
zz = model.predict_proba(X_test)

df.loc[train_size:,'predictions']=Z
dfpredictions = df.dropna(subset=['predictions'])

print(dfpredictions)
    Name  Attempts  Score Category  predictions
14  matt        18      1        a          1.0
15  matt        15      6        a          1.0
16  matt        17      3        a          1.0
17  matt        14      7        c          1.0
18  matt        16      6        b          1.0
19  matt        10      2        b          1.0

Now I have new data that I want to predict on:

newdata = [['tom', 10,'a'], ['tom', 15,'a'], ['tom', 14,'a']]

newdf = pd.DataFrame(newdata, columns = ['Name', 'Attempts','Category']) 

print(newdf)

 Name  Attempts Category
0  tom        10        a
1  tom        15        a
2  tom        14        a

Then I create the dummies and run the prediction:

newpredict = pd.get_dummies(newdf)

predict = model.predict(newpredict)

Output:

ValueError: X has 3 features per sample; expecting 6

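This makes sense, because the new data contains no categories b and c and no name matt, so pd.get_dummies never creates those columns.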
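To make the mismatch concrete, here is a minimal sketch using two small hypothetical frames (`train` and `new` are illustrative names, not the real data). It only lists which dummy columns the new data fails to produce:

```python
import pandas as pd

# Hypothetical toy frames standing in for the training data and the new data
train = pd.DataFrame({"Name": ["tom", "matt"], "Attempts": [10, 15], "Category": ["a", "c"]})
new = pd.DataFrame({"Name": ["tom"], "Attempts": [12], "Category": ["a"]})

# get_dummies only creates columns for the values it actually sees
train_cols = pd.get_dummies(train).columns
new_cols = pd.get_dummies(new).columns

# Columns the model was trained on that the new data does not produce
missing = [c for c in train_cols if c not in new_cols]
print(missing)  # ['Name_matt', 'Category_c']
```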

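My question is: given that my new data will not always contain every column used in the original data, what is the best way to set up this model? I get new data every day, so I am not sure what the most efficient and error-free approach is.

This is sample data; my real dataset has 2000 columns after running pd.get_dummies. Many thanks!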

Answer 1

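Let me explain Nicolas's and BlueSkyz's suggestions in more detail.

pd.get_dummies is useful when you are sure that a given categorical variable will not gain any new categories in the production/new dataset, e.g. gender or product, based on your company's or database's internal data classification/consistency rules.

However, for most machine learning tasks you can expect new categories in the future that were not used in model training, and sklearn.preprocessing.OneHotEncoder should be the standard choice. Setting its handle_unknown parameter to 'ignore' does exactly that: it ignores new categories when the encoder is applied later. From the documentation:

Whether to raise an error or ignore if an unknown categorical feature is present during transform (default is to raise). When this parameter is set to 'ignore' and an unknown category is encountered during transform, the resulting one-hot encoded columns for this feature will be all zeros. In the inverse transform, an unknown category will be denoted as None.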
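As a small illustration of that behaviour, the sketch below fits an encoder on three categories and then transforms a frame containing an unseen one. The frames and the value "d" are made up for the example:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical data: the encoder is fitted on a, b, c only
train = pd.DataFrame({"Category": ["a", "b", "c"]})
new = pd.DataFrame({"Category": ["a", "d"]})  # "d" was never seen during fit

enc = OneHotEncoder(handle_unknown="ignore")
enc.fit(train)

# The default output is a sparse matrix; convert it to a dense array for display
encoded = enc.transform(new).toarray()
print(encoded)
# [[1. 0. 0.]
#  [0. 0. 0.]]
```

The row for "d" is all zeros, and the number of columns stays at three, so the shape still matches what a model trained on the original encoding expects.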

The full flow for your example, using OneHotEncoder, is as follows:

# Create a boolean mask for the categorical columns
categorical_feature_mask = df.dtypes == object

# Collect the categorical column names into a list for easy reference later,
# in case you have more than a couple of categorical columns
categorical_cols = df.columns[categorical_feature_mask].tolist()

# Instantiate the OneHotEncoder
# (written for scikit-learn >= 1.2; older releases use sparse=False
# and ohe.get_feature_names(...) instead)
from sklearn.preprocessing import OneHotEncoder
ohe = OneHotEncoder(handle_unknown='ignore', sparse_output=False)

# Fit the encoder on the original data and transform it
ohe.fit(df[categorical_cols])
cat_ohe = ohe.transform(df[categorical_cols])

# Create a pandas DataFrame of the one-hot encoded columns
ohe_df = pd.DataFrame(cat_ohe, columns=ohe.get_feature_names_out(categorical_cols))

# Concatenate with the original data and drop the original categorical columns
df_ohe = pd.concat([df, ohe_df], axis=1).drop(columns=categorical_cols)

# The following code is for your newdf, after training and testing on the original df
# Apply the already fitted encoder to newdf
cat_ohe_new = ohe.transform(newdf[categorical_cols])

# Create a pandas DataFrame of the one-hot encoded columns
ohe_df_new = pd.DataFrame(cat_ohe_new, columns=ohe.get_feature_names_out(categorical_cols))

# Concatenate with the new data and drop the original categorical columns
df_ohe_new = pd.concat([newdf, ohe_df_new], axis=1).drop(columns=categorical_cols)

# Predict on df_ohe_new
predict = model.predict(df_ohe_new)

Output (which you can assign back to newdf): array([1, 1, 1])

However, if you really want to use only pd.get_dummies, the following works as well:

newpredict = newpredict.reindex(labels = df_dum.columns, axis = 1, fill_value = 0).drop(columns = ['Score'])
predict = model.predict(newpredict)

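The snippet above makes sure the new dummy DataFrame (newpredict) has the same columns as the original df_dum: columns absent from the new data are filled with 0 instead of NaN, and the 'Score' column is then dropped. The output here is the same as above.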
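One detail worth seeing in isolation: reindexing to the training columns not only adds the missing dummy columns, it also discards dummy columns for categories that were never seen in training. A minimal sketch with hypothetical frames (`train`, `new` and the category "d" are made up; `dtype=int` is passed only so the dummies print as 0/1 on every pandas version):

```python
import pandas as pd

# Hypothetical training data and new data; "d" does not occur in training
train = pd.DataFrame({"Attempts": [10, 15, 14], "Score": [1, 5, 1], "Category": ["a", "b", "c"]})
new = pd.DataFrame({"Attempts": [12, 9], "Category": ["a", "d"]})

# Feature columns the model was trained on (everything except the target)
feature_cols = pd.get_dummies(train, dtype=int).drop(columns=["Score"]).columns

# Align the new dummies to the training layout:
# missing columns are added as 0, unseen ones (Category_d) are dropped
aligned = pd.get_dummies(new, dtype=int).reindex(columns=feature_cols, fill_value=0)
print(aligned)
#    Attempts  Category_a  Category_b  Category_c
# 0        12           1           0           0
# 1         9           0           0           0
```

Saving `feature_cols` alongside the model would let you apply the same alignment to each day's data without keeping the training frame around.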

Edit: one thing I forgot to add is that pd.get_dummies usually runs much faster than sklearn.preprocessing.OneHotEncoder.
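For data that arrives every day, one possible variation (not part of the flow above, just a sketch) is to bundle the encoder and the model into a single scikit-learn Pipeline, so the encoding learned at training time is reapplied automatically at prediction time. The frames below are hypothetical, and the step names "onehot", "preprocess" and "model" are arbitrary labels:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical training data shaped like the question's df
train = pd.DataFrame({
    "Name": ["tom", "tom", "matt", "matt", "tom", "matt"],
    "Attempts": [10, 15, 14, 18, 22, 10],
    "Category": ["a", "b", "c", "a", "b", "c"],
    "Score": [1, 5, 1, 4, 5, 1],
})
# New data with a name and a category that were never seen in training
new = pd.DataFrame({"Name": ["tom", "ann"], "Attempts": [10, 15], "Category": ["a", "z"]})

categorical_cols = ["Name", "Category"]

# One-hot encode the categorical columns, pass the numeric ones through unchanged
preprocess = ColumnTransformer(
    [("onehot", OneHotEncoder(handle_unknown="ignore"), categorical_cols)],
    remainder="passthrough",
)
pipe = Pipeline([("preprocess", preprocess), ("model", LogisticRegression(max_iter=1000))])

# Fit once on the training data, then predict directly on raw new rows
pipe.fit(train.drop(columns=["Score"]), train["Score"])
predictions = pipe.predict(new)
print(predictions)
```

Here the unseen values "ann" and "z" are encoded as all-zero columns rather than raising an error, and the fitted `pipe` object is the only thing that needs to be kept between runs.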
