GridSearchCV + StratifiedKFold in case of TFIDF

Question

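I am working on a classification problem where I need to predict the class of text data. I need to tune the hyperparameters of my classification model, for which I am thinking of using GridSearchCV. I also need StratifiedKFold because my data is imbalanced. I am aware that GridSearchCV internally uses StratifiedKFold if we have multiclass classification.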
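The default splitter can be checked directly. The following is a minimal sketch, assuming scikit-learn's `check_cv` helper resolves an integer `cv` the same way GridSearchCV does for a classifier; the toy labels are made up for illustration. With an integer `cv`, a classifier, and a binary or multiclass `y`, the result should be a StratifiedKFold, otherwise a plain KFold.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold, check_cv

# Imbalanced toy labels (hypothetical, for illustration only)
y = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 2, 2, 2])

# classifier=True corresponds to tuning a classifier such as LogisticRegression
cv_clf = check_cv(cv=3, y=y, classifier=True)
# classifier=False corresponds to tuning a regressor
cv_reg = check_cv(cv=3, y=y, classifier=False)

print(type(cv_clf).__name__)
print(type(cv_reg).__name__)
```

Passing an explicit `StratifiedKFold` instance as `cv` removes any dependence on this default and also allows shuffling with a fixed seed.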

I have read here that in the case of TfidfVectorizer we apply fit_transform to the training data and only transform to the test data.

This is what I have done below using StratifiedKFold.

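from sklearn import linear_model
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import StratifiedKFold

# X and y are pandas objects holding the texts and labels
tfidf_vectorizer = TfidfVectorizer()  # assumed default settings; not shown in the original
# random_state only takes effect when shuffle=True
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=5)

iteration = 0
for train_index, test_index in skf.split(X, y):
    iteration = iteration + 1
    print(f"Iteration number {iteration}")
    X_train, y_train = X.iloc[train_index], y.iloc[train_index]
    X_test, y_test = X.iloc[test_index], y.iloc[test_index]

    # Fit the vocabulary and IDF weights on the training fold only
    train_tfid = tfidf_vectorizer.fit_transform(X_train.values.astype('U'))
    test_tfid = tfidf_vectorizer.transform(X_test.values.astype('U'))

    svc_model = linear_model.SGDClassifier()
    svc_model.fit(train_tfid, y_train.values.ravel())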

The accuracy/F1 I am getting is not good, so I thought of doing hyperparameter tuning with GridSearchCV. In GridSearchCV we do

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

c_space = np.logspace(-5, 8, 15)
param_grid = {'C': c_space}

# Instantiating logistic regression classifier
logreg = LogisticRegression()

# Instantiating the GridSearchCV object
logreg_cv = GridSearchCV(logreg, param_grid, cv=5)

logreg_cv.fit(X, y)

As I understand it, logreg_cv.fit(X, y) internally splits X into X_train and X_test k times and then makes predictions to give us the best estimator.

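In my case, what should X be? If it is the X generated after fit_transform, then when X is internally split into train and test, the test data has already gone through fit_transform, whereas ideally it should only go through transform.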

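My concern is: in my case, how can I make sure that inside GridSearchCV fit_transform is applied only to the training data and transform is applied to the test data (validation data)?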

Because if fit_transform is internally applied to the entire data, that is not good practice.

Answer 1

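This is exactly the scenario where you should use a Pipeline inside GridSearchCV. First, create a pipeline with the required steps such as data preprocessing, feature selection and the model. Once you call GridSearchCV on this pipeline, the preprocessing steps are fitted on the training folds only; each validation fold is merely transformed before the model scores it.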

Read here to learn more about the model selection module in sklearn.
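The claim that the vectorizer only sees the training folds can be checked on a small scale. This is a sketch under stated assumptions: the corpus is made up so that every document carries one unique token (`word0`, `word1`, ...), and `cross_validate` with `return_estimator=True` is used in place of GridSearchCV because it hands back the pipeline fitted on each split. If there is no leakage, the unique tokens of the validation documents should be missing from each fitted vocabulary.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus: one shared token plus one token unique to each document
docs = [f"shared word{i}" for i in range(10)]
labels = [0, 1] * 5

pipe = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression()),
])

# No shuffling, so calling split() again below reproduces the same folds
skf = StratifiedKFold(n_splits=5)
result = cross_validate(pipe, docs, labels, cv=skf, return_estimator=True)

leaks = []
for fitted, (train_idx, val_idx) in zip(result["estimator"], skf.split(docs, labels)):
    vocab = fitted.named_steps["tfidf"].vocabulary_
    # A validation-only token in the vocabulary would mean the vectorizer saw that fold
    leaks.append(any(f"word{i}" in vocab for i in val_idx))

print(leaks)
```

Every entry in `leaks` is expected to be `False`, since each fold's vectorizer is fitted without the held-out documents.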

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, GridSearchCV
import numpy as np

cats = ['alt.atheism', 'sci.space']
newsgroups_train = fetch_20newsgroups(subset='train',
                                      remove=('headers', 'footers', 'quotes'),
                                      categories=cats)
X, y = newsgroups_train.data, newsgroups_train.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, stratify=y)


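# Vectorizer and classifier are chained so that both are refit on each training fold
my_pipeline = Pipeline([
    ('vectorizer', CountVectorizer(stop_words='english')),
    ('clf', LogisticRegression())
])

# Pipeline parameters are addressed as <step name>__<parameter name>
parameters = {'clf__C': np.logspace(-5, 8, 15)}

grid_search = GridSearchCV(my_pipeline, param_grid=parameters,
                           cv=10, n_jobs=-1, scoring='accuracy')
grid_search.fit(X_train, y_train)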

print(grid_search.best_params_)
# {'clf__C': 0.4393970560760795}

grid_search.score(X_test, y_test)
# 0.8981481481481481
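
The example above uses CountVectorizer; the same pattern should carry over to the setup in the question. Below is a minimal sketch, assuming a TfidfVectorizer and an SGDClassifier in the pipeline and an explicit StratifiedKFold passed as `cv`. The corpus, the parameter grid values and the `f1_macro` scoring choice are illustrative assumptions, not taken from the question.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline

# Made-up imbalanced corpus: 8 "space" documents (label 0), 4 "cooking" documents (label 1)
texts = [
    "the rocket reached orbit", "a satellite was launched today",
    "astronauts repaired the space station", "the probe landed on mars",
    "a telescope observed a distant galaxy", "the shuttle returned from orbit",
    "engineers tested the rocket engine", "the mission to the moon was delayed",
    "bake the bread in a hot oven", "stir the soup and add salt",
    "chop the onions for the sauce", "the recipe needs butter and flour",
]
labels = [0] * 8 + [1] * 4

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", SGDClassifier(random_state=0)),
])

# Both the vectorizer and the classifier can be tuned via <step>__<parameter>
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "clf__alpha": [1e-4, 1e-3, 1e-2],
}

# Explicit stratified splitter; shuffle=True is needed for random_state to matter
skf = StratifiedKFold(n_splits=4, shuffle=True, random_state=5)
search = GridSearchCV(pipeline, param_grid, cv=skf, scoring="f1_macro")

# Raw texts go in; TF-IDF is fitted inside each training fold by the pipeline
search.fit(texts, labels)

print(search.best_params_)
print(search.best_score_)
```

Here `X` is the raw text, not a precomputed TF-IDF matrix, which is what keeps fit_transform confined to the training folds.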
