I have 16 CSV files, each with roughly 11,250 rows containing 19 features and one label column. I want to use Leave One Group Out as the cross-validation scheme for feature-selection algorithms such as Sequential Forward Selection and mutual information, but I don't know how to implement this cross-validation technique. My code so far:
from mlxtend.feature_selection import SequentialFeatureSelector as SFS
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.ensemble import RandomForestClassifier as rfc
from sklearn.linear_model import LogisticRegression
from numpy import array
import pandas as pd
import matplotlib.pyplot as plt
import glob
#################
#Reading cvs file
#################
print("Reading multiple csv files")
print("")
csv_files = glob.iglob('D:/Project/csvfiles/*')
dataframe = pd.DataFrame()
print("Merging all the csv files")
print("")
#append all files together
for file in csv_files:
    df_temp = pd.read_csv(file)
    # DataFrame.append was removed in pandas 2.0; use pd.concat instead
    dataframe = pd.concat([dataframe, df_temp], ignore_index=True)
print("Dataframe is saved")
print("")
# dataframe.to_csv("Dataset.csv")
X = dataframe.drop(['Unnamed: 0', 'Labels'], axis=1)
y = dataframe['Labels']
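Leave One Group Out needs a per-row group label identifying which of the 16 files each row came from, and the natural place to record it is the concatenation loop above. A minimal sketch, assuming each file is one group (synthetic dataframes stand in here for the real `pd.read_csv` calls):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for the csv files: each element plays the
# role of one file's dataframe (3 rows, 2 feature columns here).
frames = [pd.DataFrame({"f1": range(3), "f2": range(3)}) for _ in range(4)]

parts, groups = [], []
for file_idx, df_temp in enumerate(frames):
    parts.append(df_temp)
    groups.extend([file_idx] * len(df_temp))  # one group id per row

dataframe = pd.concat(parts, ignore_index=True)
groups = np.asarray(groups)
# groups now holds one file index per row: 0,0,0,1,1,1,...
```

The resulting `groups` array is what `sklearn.model_selection.LeaveOneGroupOut` consumes to decide which rows belong to the held-out file in each fold.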
#############################
# Class for Feature Selection
#############################
class FeatureSelection:
    def __init__(self, dataframe, target):
        self.dataframe = dataframe
        self.target = target

    ###########################
    # Normalizing the dataframe
    ###########################
    def normalization(self):
        print("Performing Normalization")
        print("")
        for column in self.dataframe.columns:
            self.dataframe[column] = (self.dataframe[column] - self.dataframe[column].min()) / (self.dataframe[column].max() - self.dataframe[column].min())
        print("Normalization Completed")
        print("")

    ##############################
    # Sequential Forward Selection
    ##############################
    def sequential_forward_selection(self):
        print("Performing Sequential Forward Selection")
        print("")
        sfs = SFS(rfc(n_jobs=-1),
                  k_features='best',
                  forward=True,
                  floating=False,
                  verbose=2,
                  scoring='accuracy',  # sklearn classifiers
                  cv=5)
        sfs = sfs.fit(self.dataframe, self.target)
        print('Dictionary: ', sfs.get_metric_dict())
        print('')
        from mlxtend.plotting import plot_sequential_feature_selection as plot_sfs
        fig1 = plot_sfs(sfs.get_metric_dict(confidence_interval=0.95), kind='std_err')
        plt.title('Sequential Forward Selection (with normalization)')
        plt.grid()
        plt.show()
        print('Best Features {} with Index number {}:'.format(sfs.k_feature_names_, sfs.k_feature_idx_))
        print('')
        df = pd.DataFrame.from_dict(sfs.get_metric_dict()).T
        print(df[["feature_idx", "avg_score"]])
        print("Sequential Forward Selection completed")
        print("")

    ####################
    # Mutual Information
    ####################
    def mutual_information(self):
        print("Performing Mutual Information")
        print("")
        mic = SelectKBest(score_func=mutual_info_classif, k=15)
        mic.fit(self.dataframe, self.target)
        feature_MI_score = pd.Series(mic.scores_, index=self.dataframe.columns)
        print(feature_MI_score.sort_values(ascending=False))
        feature_MI_score.sort_values(ascending=False).plot.bar(figsize=(10, 8))
        plt.show()
        print("")
        print("Mutual Information completed")
        print("")

if __name__ == "__main__":
    featureselection = FeatureSelection(X, y)
    featureselection.normalization()
    featureselection.sequential_forward_selection()
    featureselection.mutual_information()
My task is to use Leave One Group Out as the cross-validation: it should first take one dataset (file) as the test data and the remaining 15 as training data, then repeat until every dataset has served as the test set, and find the best features from that. Any help is appreciated.
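One practical route, sketched below with synthetic data standing in for the real files: precompute the Leave One Group Out splits with sklearn and pass them as a list of `(train, test)` index pairs. `cross_val_score` documents that `cv` may be "an iterable yielding (train, test) splits", and since mlxtend's `SequentialFeatureSelector` forwards its `cv` argument to sklearn's cross-validation machinery (and has no `groups` parameter of its own), the same precomputed list should be usable there, e.g. `SFS(rfc(n_jobs=-1), ..., cv=splits)`. The shapes here (4 groups of 30 rows, 5 features) are illustrative assumptions, not the real 16-file layout:

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut, cross_val_score
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in: 4 "files" of 30 rows with 5 features each.
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 5))
y = rng.integers(0, 2, size=120)
groups = np.repeat(np.arange(4), 30)  # file index of each row

# Materialize one (train, test) index pair per group: each fold holds
# out exactly one "file" and trains on the rest.
logo = LeaveOneGroupOut()
splits = list(logo.split(X, y, groups=groups))

# Any sklearn-style CV consumer accepts such a list as `cv`.
scores = cross_val_score(LogisticRegression(), X, y, cv=splits)
print(len(splits))  # one fold per group
```

For the mutual-information side, `SelectKBest` itself does no cross-validation; to get per-group scores you would loop over the same `splits` and fit `mutual_info_classif` on each training portion, then aggregate (e.g. average) the scores across folds.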