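I am estimating ticket sales from a dataset that is both small and imbalanced. To deal with this, I use smoter (SMOTE for regression) from the smogn package. However, every time I run the model I get different predictions for the target. I think smoter produces different output data on each run. Is there a way to fix its random state?
Please advise on what I can do here; the code snippet is below.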
import time

import numpy as np
import pandas as pd
import smogn
from sklearn.utils import resample
from tqdm import tqdm

# NOTE: model_params, e_col and xgb are defined elsewhere in the project.
# xgb.get_model is not an xgboost function; presumably a project-specific helper.

def solution(df, p_bar: bool = True, params: dict = model_params):
    # sort
    try:
        df = df.sort_values(["transition_date", "event_date"], ascending=False)
    except Exception as e:
        print(e)
    # bootstrapping hyperparams
    n_samples = 140
    n_range = 40
    # tqdm
    if p_bar:
        rg = tqdm(range(len(df)))
    else:
        rg = range(len(df))
    pred_list = []
    time.sleep(0.5)
    try:
        for i in rg:
            time.sleep(0.1)
            # hold out every row that belongs to the same tour as the test row
            test_tour_id = df.iloc[i]['tour_id']
            df_without_test_tour = df[df['tour_id'] != test_tour_id].reset_index(drop=True)
            # oversample rare high sales values (this call is the non-deterministic step)
            dt = smogn.smoter(
                data=df_without_test_tour,
                y='total_sales',
                k=3,
                samp_method='extreme',
                rel_thres=0.8,
                rel_method='auto',
                rel_xtrm_type='high',
                rel_coef=2.25
            )
            test_data = df.iloc[[i]].drop(e_col, axis=1)
            test_label = [df.iloc[i]['total_sales']]
            test_prid_id = df.iloc[i]['promotion_id']
            train_data = dt.drop(e_col, axis=1)
            train_label = dt['total_sales']
            pred_tmp = []
            for j in range(n_range):
                # the same random_state keeps features and labels aligned
                x_train = resample(train_data, n_samples=n_samples, random_state=j)
                y_train = resample(train_label, n_samples=n_samples, random_state=j)
                model = xgb.get_model(x_train, y_train, params)
                pred = model.predict(test_data)
                pred_tmp.append(pred)
            # average the bootstrap predictions
            pred = np.mean(pred_tmp, axis=0)
            mape_pred = abs(test_label - pred) * 100 / pred
            mape_real = abs(test_label - pred) * 100 / test_label
            pred_list.append([test_prid_id, pred[0], mape_pred[0], mape_real[0]])
    except Exception as ex:
        print(ex)
    if p_bar:
        tqdm._instances.clear()
    pred = pd.DataFrame(pred_list, columns=['promotion_id', 'pred', 'mape_pred', 'mape_real'])
    return pd.merge(pred, df[e_col], how='left')
It turns out that SMOTE for regression has some randomness when it selects the nearest neighbors.
See the relevant line in its code: here
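I assume you are using the Python version from Nick Kunz's Repository, but I would suggest using the R one from here instead. If you are using Python for your machine learning project, consider the rpy2 Python module to communicate between R and Python.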
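If you would rather stay in Python, one thing to try is re-seeding the global random number generators right before each oversampling call. The sketch below assumes the oversampler draws from Python's `random` module and/or NumPy's global RNG; I have not verified that for every smogn release, so treat it as an assumption. `call_with_seed` and `noisy_oversample` are hypothetical names used only for illustration; `noisy_oversample` is a stand-in for `smogn.smoter`, not its actual implementation.

```python
import random

import numpy as np


def call_with_seed(fn, seed, *args, **kwargs):
    """Re-seed both global RNGs, then call fn(*args, **kwargs)."""
    random.seed(seed)
    np.random.seed(seed)
    return fn(*args, **kwargs)


def noisy_oversample(values, n_new):
    """Hypothetical stand-in for an oversampler that uses global RNG state."""
    values = np.asarray(values, dtype=float)
    # Pick random base points and random "neighbors", then interpolate
    # between them with a random gap, in the spirit of SMOTE.
    base = np.random.choice(values, size=n_new)
    neighbor = np.random.choice(values, size=n_new)
    gaps = np.array([random.random() for _ in range(n_new)])
    return base + gaps * (neighbor - base)


data = [1.0, 2.0, 5.0, 9.0]
first = call_with_seed(noisy_oversample, 42, data, n_new=5)
second = call_with_seed(noisy_oversample, 42, data, n_new=5)
print(np.array_equal(first, second))  # prints True: same seed, same synthetic points
```

In the loop from the question this would look like `dt = call_with_seed(smogn.smoter, i, data=df_without_test_tour, y='total_sales', ...)`. It only helps if the assumption about where the randomness comes from holds for your installed version, so it is also worth checking whether that version's `smoter` signature offers its own seed argument.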