Problem creating a transformer for a pipeline


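I am trying to build a pipeline whose first step is random oversampling and whose second step is a custom outlier remover, but I run into a problem when I execute the pipeline.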

Here is the code for my pipeline and the whole process:

accuracy_lst = []
precision_lst = []
recall_lst = []
f1_lst = []
auc_lst = []

kf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
log_reg_params = {"penalty": ['l1', 'l2'], 'C': [0.001, 0.01, 0.1, 1, 10, 100, 1000]}
rand_log_reg = RandomizedSearchCV(LogisticRegression(max_iter = 200), log_reg_params, n_iter=4)

for train, test in kf.split(Org_X_train, Org_y_train):
    X_train, X_test = Org_X_train.iloc[train], Org_X_train.iloc[test]
    y_train, y_test = Org_y_train.iloc[train], Org_y_train.iloc[test]
    print(X_train.index)
    pipeline = make_pipeline(RandomOverSampler(random_state=42), OutlierRemover(columns=['V14', 'V12', 'V10', 'V4', 'V11', 'V2']), rand_log_reg)

    print(X_train.index)
    model = pipeline.fit(X_train, y_train)
    best_est = rand_log_reg.best_estimator_
    prediction = best_est.predict(X_test)

    accuracy_lst.append(accuracy_score(y_test, prediction))
    precision_lst.append(precision_score(y_test, prediction))
    recall_lst.append(recall_score(y_test, prediction))
    f1_lst.append(f1_score(y_test, prediction))
    auc_lst.append(roc_auc_score(y_test, prediction))

print("Accuracy:", np.mean(accuracy_lst))
print("Precision:", np.mean(precision_lst))
print("Recall:", np.mean(recall_lst))
print("F1 Score:",  np.mean(f1_lst))
print("AUC Score:",  np.mean(auc_lst))

Here is the code for the outlier remover:

class OutlierRemover(BaseEstimator, TransformerMixin):
    def __init__(self, columns):
        self.columns = columns

    def fit(self, X, y=None):
        self.X=X
        self.y=y
        return self

    def transform(self, X, y=None):
        new_X = X.copy()
        for col in self.columns:
            q25, q75 = np.percentile(new_X[col], 25), np.percentile(new_X[col], 75)
            iqr = q75 - q25
            cut_off = iqr * 1.5
            lower, upper = q25 - cut_off, q75 + cut_off
            indices_to_drop = new_X[(new_X[col] > upper) | (new_X[col] < lower)].index
            new_X = new_X.drop(indices_to_drop)
        if y is not None:
            new_y = y.drop(indices_to_drop)
            return new_X, new_y
        else:
            return new_X

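The error is "ValueError: Found input variables with inconsistent numbers of samples: [310231, 363920]", because X gets reduced but y does not. I have tried several different approaches, but nothing has worked.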

python scikit-learn pipeline outliers k-fold
1 Answer

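If you want to modify y, you need to use (something like) imblearn's pipeline. It looks like you must already be doing so, since you use oversampling.

Then your outlier remover just needs to follow imblearn's conventions: you should define a resample method (named fit_resample in current imblearn releases) instead of transform, returning both X and y.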
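A minimal sketch of that idea, not a definitive implementation: the function below applies the same IQR rule as the question's OutlierRemover, but builds one boolean mask over all columns and applies it to X and y together, so both always keep the same number of rows. The names remove_iqr_outliers, make_outlier_sampler and the factor parameter are hypothetical and introduced only for illustration. Wrapping the function in imblearn's FunctionSampler is one possible way to expose it as a resampling step; it assumes X reaches the step as a pandas DataFrame, which is why validate=False is passed.

```python
import numpy as np
import pandas as pd


def remove_iqr_outliers(X, y, columns, factor=1.5):
    """Drop rows outside [q25 - factor*IQR, q75 + factor*IQR], from X and y alike.

    Hypothetical helper: X is assumed to be a DataFrame and y a Series or
    array of the same length. Columns are processed in order, and each
    column's quartiles are computed on the rows that are still kept.
    """
    keep = np.ones(len(X), dtype=bool)
    for col in columns:
        values = X[col].to_numpy()
        q25, q75 = np.percentile(values[keep], [25, 75])
        cut_off = (q75 - q25) * factor
        lower, upper = q25 - cut_off, q75 + cut_off
        keep &= (values >= lower) & (values <= upper)
    # The same positional mask is applied to both, so the lengths stay in sync.
    return X.loc[keep], y[keep]


def make_outlier_sampler(columns, factor=1.5):
    """Wrap the helper as an imblearn sampler (hypothetical convenience function)."""
    # Imported here so the helper above can be used without imblearn installed.
    from imblearn import FunctionSampler

    return FunctionSampler(
        func=remove_iqr_outliers,
        kw_args={"columns": columns, "factor": factor},
        validate=False,  # pass the DataFrame through unchanged so column names survive
    )
```

In the question's loop, the OutlierRemover(...) step would then be replaced by make_outlier_sampler(['V14', 'V12', 'V10', 'V4', 'V11', 'V2']), with make_pipeline imported from imblearn.pipeline. Samplers in an imblearn pipeline are applied only during fit, so the test fold is not filtered when predicting.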
