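As you can see, I have a preprocessing function here that applies a few transformations. I have some categorical variables, which I define as categorical_cols, and I use a LabelEncoder for each of them. My task is to save the LabelEncoders for later use. The LabelEncoder itself works fine, with no problems,
but not when I save the LabelEncoders like this and then try to use them in a different preprocessing function by loading them: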
---- LabelEncoder save side ----
label_encoders = {}
for column in categorical_cols:
    label_encoder = LabelEncoder()
    taken_df[column] = label_encoder.fit_transform(taken_df[column])
    label_encoders[column] = label_encoder
with open('label_encoders.pkl', 'wb') as file:
    pickle.dump(label_encoders, file)
---- end ----
---- LabelEncoder load side ----
categorical_cols = ['from_city', 'to_city', "vehicle_type", "trailer_type"]
with open('label_encoders.pkl', 'rb') as file:
    label_encoders = pickle.load(file)
for column in categorical_cols:
    test_df[column] = label_encoders[column].fit_transform(test_df[column])
---- end ----
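Everything is the same: the columns used, and even the data, are taken from the original dataset in order to test this problem. So my questions are:
Is it possible to save the encoders for multiple columns and use them like this, or should I save one pickle file per column and use them separately?
Secondly, how can I solve this problem?
Here you can find my whole preprocessing function: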
def preprocessed_data(taken_df):
    used_cols = [....]
    taken_df = taken_df[used_cols]
    taken_df["weight"] = taken_df["weight"].str.replace(",", ".")
    taken_df["weight"] = taken_df["weight"].astype(float)
    taken_df.dropna(inplace=True)
    # Dealing with datetime columns
    taken_df["offer_date"] = pd.to_datetime(taken_df["offer_date"]).dt.tz_localize(None).dt.tz_localize("UTC")
    taken_df["cargo_load_date"] = pd.to_datetime(taken_df["cargo_load_date"]).dt.tz_localize(None).dt.tz_localize("UTC")
    taken_df["cargo_delivery_date"] = pd.to_datetime(taken_df["cargo_delivery_date"]).dt.tz_localize(None).dt.tz_localize("UTC")
    taken_df["vehicle_assignment_date"] = pd.to_datetime(taken_df["vehicle_assignment_date"]).dt.tz_localize(None).dt.tz_localize("UTC")
    vehicle_types = {
        "(?i).*(Tir|Tır).*": "TIR",
        "(?i).*(Kamyon)": "Kamyon"
    }
    taken_df.loc[:, "vehicle_type"] = taken_df.loc[:, "vehicle_type"].replace(vehicle_types, regex=True)
    # Extract the categorical columns
    categorical_cols = ['from_city', 'to_city', "vehicle_type", "trailer_type"]
    label_encoders = {}
    for column in categorical_cols:
        label_encoder = LabelEncoder()
        taken_df[column] = label_encoder.fit_transform(taken_df[column])
        label_encoders[column] = label_encoder
    with open('label_encoders.pkl', 'wb') as file:
        pickle.dump(label_encoders, file)
    # Factor weights
    weight_factor = 0.6
    delivery_time_factor = 0.4
    offer_date_factor = 0.2
    # Convert offer date as UNIX timestamp
    taken_df['offer_date'] = pd.to_datetime(taken_df['offer_date'])
    epoch = dt.datetime(1970, 1, 1, tzinfo=pytz.UTC)
    taken_df['unix_offer_date'] = (taken_df['offer_date'] - epoch).dt.total_seconds()
    # Convert delivery date as UNIX timestamp
    taken_df['cargo_delivery_date'] = pd.to_datetime(taken_df['cargo_delivery_date'])
    taken_df['unix_delivery_time'] = (taken_df['cargo_delivery_date'] - epoch).dt.total_seconds()
    # min max scaling for normalization
    scaler = MinMaxScaler()
    # normalizing the weight column
    taken_df['normalized_weight'] = scaler.fit_transform(taken_df['weight'].values.reshape(-1, 1))
    # normalization of UNIX timestamps
    taken_df['normalized_offer_date'] = scaler.fit_transform(taken_df['unix_offer_date'].values.reshape(-1, 1))
    taken_df['normalized_delivery_time'] = scaler.fit_transform(taken_df['unix_delivery_time'].values.reshape(-1, 1))
    with open('scaler.pkl', 'wb') as f:
        pickle.dump(scaler, f)
    # Calculation of priority score
    taken_df['priority_score'] = (weight_factor * taken_df['normalized_weight']) + (offer_date_factor * taken_df['normalized_offer_date']) + (delivery_time_factor * taken_df['normalized_delivery_time'])
    return taken_df
I also tried this approach, but it did not work:
encoder = LabelEncoder()
for col in categorical_cols:
    taken_df[col] = encoder.fit_transform(taken_df[col])
with open('encoder.pkl', 'wb') as f:
    pickle.dump(encoder, f)
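Once fit_transform() has been run, reproducing the same encoding with the same LabelEncoder requires the transform() method:
for column in categorical_cols:
    test_df[column] = label_encoders[column].transform(test_df[column])
fit_transform() refits the encoder on whatever data it is given, which changes its learned mapping.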
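
To make the difference concrete, here is a minimal sketch, not the asker's actual pipeline. The two-column DataFrame and its city values are made up for illustration, and the pickle round trip is done in memory rather than through label_encoders.pkl. It suggests that one pickle holding a dict of per-column encoders is enough, and that transform() reproduces the original codes while refitting does not:

```python
import pickle

import pandas as pd
from sklearn.preprocessing import LabelEncoder

categorical_cols = ["from_city", "to_city"]

# Hypothetical training data, for illustration only.
train_df = pd.DataFrame({
    "from_city": ["Ankara", "Izmir", "Bursa", "Izmir"],
    "to_city": ["Bursa", "Ankara", "Izmir", "Adana"],
})

# Fit one encoder per column and keep them all in a single dict.
label_encoders = {}
encoded_train = train_df.copy()
for column in categorical_cols:
    encoder = LabelEncoder()
    encoded_train[column] = encoder.fit_transform(train_df[column])
    label_encoders[column] = encoder

# Round-trip the whole dict through pickle (in memory here;
# pickle.dump / pickle.load on a file behaves the same way).
restored = pickle.loads(pickle.dumps(label_encoders))

# Two rows taken from the training data, as in the question's test.
test_df = train_df.iloc[[1, 3]].reset_index(drop=True)

reused = test_df.copy()
refit = test_df.copy()
for column in categorical_cols:
    # transform() applies the mapping learned during training.
    reused[column] = restored[column].transform(test_df[column])
    # Refitting learns a new mapping from the test rows alone.
    # A fresh encoder is used so that `restored` is left untouched.
    refit[column] = LabelEncoder().fit_transform(test_df[column])

print(encoded_train["from_city"].tolist())  # [0, 2, 1, 2]
print(reused["from_city"].tolist())         # [2, 2]  matches training codes
print(refit["from_city"].tolist())          # [0, 0]  different mapping
```

Note that transform() raises a ValueError if the new data contains a label the encoder did not see during fitting, so unseen categories have to be handled before encoding.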