How do I apply KFold with TfidfVectorizer?


I'm running into a problem when doing K-fold cross-validation with Tfidf. It gives me this error:

ValueError: setting an array element with a sequence.

I've looked at other questions about the same error, but they use train_test_split(), and K-fold works a bit differently.

for train_fold, valid_fold in kf.split(reviews_p1):
    vec = TfidfVectorizer(ngram_range=(1,1))
    reviews_p1 = vec.fit_transform(reviews_p1)

    train_x = [reviews_p1[i] for i in train_fold]        # Extract train data with train indices
    train_y = [labels_p1[i] for i in train_fold]        # Extract train data with train indices

    valid_x = [reviews_p1[i] for i in valid_fold]        # Extract valid data with cv indices
    valid_y = [labels_p1[i] for i in valid_fold]        # Extract valid data with cv indices

    svc = LinearSVC()
    model = svc.fit(X = train_x, y = train_y) # We fit the model with the fold train data
    y_pred = model.predict(valid_x)

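I actually found the cause, but I can't find a way to fix it. Basically, when we extract the training data using the cv/train indices, we get a list of sparse matrices: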

[<1x21185 sparse matrix of type '<class 'numpy.float64'>'
    with 54 stored elements in Compressed Sparse Row format>,
 <1x21185 sparse matrix of type '<class 'numpy.float64'>'
    with 47 stored elements in Compressed Sparse Row format>,
 <1x21185 sparse matrix of type '<class 'numpy.float64'>'
    with 18 stored elements in Compressed Sparse Row format>, ....]
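
For reference, the list-of-rows behaviour shown above can be reproduced on a toy corpus. This is only a sketch: the four reviews and the index array are made up, and only the list comprehension mirrors the code in the question.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up reviews, only for illustration.
docs = ["good movie", "bad movie", "great plot", "terrible acting"]
X = TfidfVectorizer().fit_transform(docs)  # a single sparse matrix, one row per review

train_fold = np.array([0, 2, 3])  # hypothetical training indices from kf.split

# Picking rows one at a time yields a plain Python list of single-row sparse objects,
# not one matrix with three rows.
rows = [X[i] for i in train_fold]
print(type(rows).__name__, len(rows))
print(rows[0].shape, X.shape)
```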

I tried applying Tfidf to the data after splitting, but that didn't work because the number of features was different between the splits.
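
The mismatch can be reproduced with a small sketch. The sentences are made up, and fitting a separate TfidfVectorizer on each split is an assumption about what was tried.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up folds, only for illustration.
train_docs = ["good movie", "bad movie", "great plot"]
valid_docs = ["terrible acting", "good plot twist ending"]

# Each vectorizer learns its own vocabulary, so the column counts differ.
n_train = TfidfVectorizer().fit_transform(train_docs).shape[1]
n_valid = TfidfVectorizer().fit_transform(valid_docs).shape[1]
print(n_train, n_valid)
```

A model fitted on the first matrix cannot score the second one, because the two have different numbers of columns.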

So, is there a way to split the data into K folds without creating a list of sparse matrices?

machine-learning data-science tf-idf tfidfvectorizer k-fold
1 Answer

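A similar question has already been answered: Do I use the same Tfidf vocabulary in k-fold cross_validation. Split the raw texts first, then fit the vectorizer on the training fold only and apply that same fitted vectorizer to the test fold, so both folds share one vocabulary: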

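from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold
from sklearn.svm import SVC

# data_x (raw texts) and data_y (labels) must be NumPy arrays so that
# they can be indexed with the index arrays returned by kf.split.
kf = KFold(n_splits=5)  # the number of folds is an assumption

for train_index, test_index in kf.split(data_x, data_y):
    x_train, x_test = data_x[train_index], data_x[test_index]
    y_train, y_test = data_y[train_index], data_y[test_index]

    # Learn the vocabulary on the training fold only, then reuse it for the test fold
    tfidf = TfidfVectorizer()
    x_train = tfidf.fit_transform(x_train)
    x_test = tfidf.transform(x_test)

    clf = SVC()
    clf.fit(x_train, y_train)
    y_pred = clf.predict(x_test)
    score = accuracy_score(y_test, y_pred)
    print(score)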
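
The same per-fold fitting can also be delegated to scikit-learn by wrapping the vectorizer and the classifier in a pipeline. The sketch below is one possible variant, not the method from the linked answer: the texts, labels, fold count, and the choice of LinearSVC (taken from the question) are all assumptions for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Made-up reviews and labels, only for illustration.
texts = ["good movie", "great plot", "loved it", "fine acting",
         "bad movie", "terrible plot", "hated it", "poor acting"]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# cross_val_score clones and refits the whole pipeline for every fold, so the
# TfidfVectorizer only ever sees the training part of that fold.
model = make_pipeline(TfidfVectorizer(), LinearSVC())
kf = KFold(n_splits=4, shuffle=True, random_state=0)
scores = cross_val_score(model, texts, labels, cv=kf, scoring="accuracy")
print(scores)
```

If the whole corpus is vectorized up front anyway, a CSR matrix can be indexed with an index array directly (for example `X[train_index]`), which avoids the list of single-row matrices. The vocabulary and IDF weights are then learned from the validation documents as well, which is why fitting inside each fold is usually preferred.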