Although the code below "works" (in the sense that it raises no errors), I am getting a very high AUC, which makes me wonder whether it is somehow skipping the kind of cross-validation I am actually trying to perform.
Each group represents the collection of data from one participant. So on each fold, all of one participant's data is held out for testing, a model is built on all of the remaining participants' data, and it is then tested on the held-out participant's data. I shuffle within each participant's group because the task order was the same for all participants. I also standardize the features. I am using the scikit-learn library.
Is anything incorrect here (or anything that increases overfitting)? Is this how LOGO is actually implemented? My data (the features) does not include the target or the task number.
Second question: if I run the model multiple times with a different "scoring" parameter each time, will the model become over-trained, etc. (see the code)? I am using the cross_validate function, and although I have seen examples that retrieve several metrics at once (AUC, accuracy, etc.), for some reason that code does not work for me. Is it OK if I just rerun the code block, changing the scoring part, to get the different values I need?
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.metrics import f1_score
from sklearn.ensemble import GradientBoostingClassifier
# First, setting a random seed for the code to be reproducible
random_seed = 200
np.random.seed(random_seed)
#Defining our variables, using our previously created dictionary
X = data_dict['data']
Y = data_dict['target']
groups = data_dict['participants']
#Calling the LeaveOneGroupOut function from scikit
logo = LeaveOneGroupOut()
# Within each participant's group, shuffling the order of tasks
X_shuffled = []
Y_shuffled = []
groups_shuffled = []
unique_groups = np.unique(groups) #Each group representing a participant
for group in unique_groups:
    group_indices = np.where(groups == group)[0]
    shuffled_indices = np.random.permutation(group_indices)
    X_shuffled.extend(X[shuffled_indices])
    Y_shuffled.extend(Y[shuffled_indices])
    groups_shuffled.extend(groups[shuffled_indices])
X_shuffled = np.array(X_shuffled)
Y_shuffled = np.array(Y_shuffled)
groups_shuffled = np.array(groups_shuffled)
# Creating the classifier
clf = GradientBoostingClassifier(random_state=random_seed)
# The pipeline first imputes missing data using the mean of each feature,
# then standardizes the features, and then fits the classifier
pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler()),  # Apply standardization
    ('clf', clf)
])
# Conducting the cross-validation with shuffling of task order within each participant's group
results_logo = cross_validate(pipeline, X_shuffled, Y_shuffled,
                              cv=logo.split(X_shuffled, Y_shuffled, groups_shuffled),
                              scoring='roc_auc', return_train_score=True, return_estimator=True)
print('auc')
print('training score: %.4f' % results_logo['train_score'].mean())
print('test score: %.4f' % results_logo['test_score'].mean())
print(results_logo['test_score'])
print(np.mean(results_logo['test_score']))
for i, (train_index, test_index) in enumerate(logo.split(X_shuffled, Y_shuffled, groups_shuffled)):
    print(f"Fold {i}:")
    print(f"  Train: index={train_index}, group={groups_shuffled[train_index]}")
    print(f"  Test:  index={test_index}, group={groups_shuffled[test_index]}")
Regarding your second question:

"If I run the model multiple times with a different 'scoring' parameter, will the model become over-trained, etc."

No. Scoring metrics are computed on an already-fitted model, so you can compute, say, accuracy and recall on the same fitted model. You are not affecting the model's coefficients, which were already determined when the model was fit, so there is no problem with using multiple lines of code, each computing a different metric.
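That said, you can avoid rerunning the block entirely: `cross_validate` accepts a list (or dict) of metric names in `scoring` and computes them all from the same fits. Here is a minimal sketch on toy data (the random `X`, alternating `y`, and six made-up participant groups are stand-ins, not your data):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import LeaveOneGroupOut, cross_validate

# Toy stand-in data: 60 samples, 5 features, 6 hypothetical participants.
# Labels alternate so every group contains both classes (roc_auc needs that).
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 5))
y = np.tile([0, 1], 30)
groups = np.repeat(np.arange(6), 10)

logo = LeaveOneGroupOut()
results = cross_validate(
    GradientBoostingClassifier(random_state=0),
    X, y,
    cv=logo.split(X, y, groups),
    scoring=['roc_auc', 'accuracy', 'recall'],  # several metrics, one call
)

# One score array per metric, with one entry per held-out participant
print(sorted(k for k in results if k.startswith('test_')))
print(len(results['test_roc_auc']))  # 6 folds, one per group
```

With a list of metric names the result keys become `test_roc_auc`, `test_accuracy`, and so on, instead of a single `test_score`; all metrics come from the same six fitted models, so this is equivalent to (but cheaper than) rerunning the block once per metric.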