Trying to build a decision tree ML model in Python and struggling with correct output and Excel export


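I created a program that splits my data into a training set (the rows where I know the target variable), a validation set, and a test set. My goal is to fit on the training data and tune the model so that it is accurate and has a high AUC score.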

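Apologies in advance: this is my first time doing this in Python, so I don't really know where my actual problem lies. Any help with using sklearn decision trees correctly and with checking the AUC score would be much appreciated.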

Here is what I have built so far:

import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.impute import SimpleImputer
from sklearn.metrics import roc_auc_score

df = pd.read_csv("all_data.csv")

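# impute missing values with the most frequent value in each column
imp = SimpleImputer(strategy='most_frequent')
df = pd.DataFrame(imp.fit_transform(df), columns=df.columns)
# fit_transform can return an object array for mixed-type data, so restore numeric dtypes;
# otherwise the dtype check below would label-encode every column, numeric ones included
df = df.infer_objects()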

# non-numeric -> numeric
le = LabelEncoder()
for column in df.columns:
    if df[column].dtype == object:
        df[column] = le.fit_transform(df[column])

# split data
train_df = df[0:7000] #split into train which contains my target variable data
val_df = df[7000:8500]
test_df = df[8500:]

#X and y
X = train_df.drop("Click", axis=1)
y = train_df['Click']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# decision tree
dt_params = {'max_depth': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]}
dt = DecisionTreeClassifier()
dt_grid = GridSearchCV(dt, dt_params, cv=5, scoring='roc_auc')
dt_grid.fit(X_train, y_train)
dt_best = dt_grid.best_estimator_

# Predict on the 20% hold-out taken from the training rows (not test_df) and calculate the AUC score
dt_pred_proba = dt_best.predict_proba(X_test)[:, 1]
print("\nDecision Tree Positive Probs:", dt_pred_proba)
dt_auc_score = roc_auc_score(y_test, dt_pred_proba)
print("Decision Tree AUC score:", dt_auc_score)

# export with the correct identifiers (rows [7000:8500]) and their matching probabilities?

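I get an AUC score as output, but it is not great, and I have no way to check that the code is working correctly, because I am struggling to export the results to Excel to confirm that each output matches its correct identifier (7000, 7001, etc.).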
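One possible way to keep each probability tied to its row identifier is to score a slice of the DataFrame and carry its index along into the output table. The sketch below is only an illustration under stated assumptions: it uses a small synthetic DataFrame in place of all_data.csv, and the names score_rows, row_id, click_probability, f1, f2 and val_predictions.csv are hypothetical, not part of the original program. In the original code the slice would be val_df (df[7000:8500]) and the model would be dt_best.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier


def score_rows(model, rows, target="Click"):
    """Return one positive-class probability per row, keyed by the row's index label."""
    # Drop the target column if it is present so only features reach the model.
    features = rows.drop(columns=[target], errors="ignore")
    # Column 1 of predict_proba corresponds to model.classes_[1] (class 1 for 0/1 labels).
    proba = model.predict_proba(features)[:, 1]
    return pd.DataFrame({"row_id": rows.index, "click_probability": proba})


# Synthetic stand-in for all_data.csv (hypothetical feature columns f1 and f2).
rng = np.random.default_rng(0)
demo = pd.DataFrame({"f1": rng.normal(size=100), "f2": rng.normal(size=100)})
demo["Click"] = (demo["f1"] + rng.normal(scale=0.5, size=100) > 0).astype(int)

train_part = demo.iloc[:70]    # plays the role of df[0:7000]
val_part = demo.iloc[70:85]    # plays the role of df[7000:8500]

model = DecisionTreeClassifier(max_depth=3, random_state=0)
model.fit(train_part.drop(columns=["Click"]), train_part["Click"])

scored = score_rows(model, val_part)
scored.to_csv("val_predictions.csv", index=False)
# For an Excel file instead (requires an Excel writer such as openpyxl):
# scored.to_excel("val_predictions.xlsx", index=False)
```

Because slicing keeps the original index labels, row_id here runs from 70 to 84, in the same way that df[7000:8500] would yield 7000, 7001, and so on, so each probability can be checked against its source row.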

python pandas machine-learning scikit-learn decision-tree