Found input variables with inconsistent numbers of samples: [2, 144]

Question  Votes: 1  Answers: 1

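My training dataset consists of 144 feedback comments, 72 positive and 72 negative. There are two target labels, positive and negative. Consider the following code snippet: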

import pandas as pd
feedback_data = pd.read_csv('output.csv')
print(feedback_data) 
                     data    target
0      facilitates good student teacher communication.  positive
1                           lectures are very lengthy.  negative
2             the teacher is very good at interaction.  positive
3                       good at clearing the concepts.  positive
4                       good at clearing the concepts.  positive
5                                    good at teaching.  positive
6                          does not shows test copies.  negative
7                           good subjective knowledge.  positive

from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(binary = True)
cv.fit(feedback_data)
X = cv.transform(feedback_data)
X_test = cv.transform(feedback_data_test)

from sklearn import svm
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

target = [1 if i<72 else 0 for i in range(144)]
# the below line gives error
X_train, X_val, y_train, y_val = train_test_split(X, target, train_size = 0.50)

I don't understand what the problem is. Please help.

machine-learning scikit-learn
1 Answer
1
vote

You are not using CountVectorizer correctly. This is what you currently get (df here stands for your feedback_data):

from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(binary = True)
cv.fit(df)
X = cv.transform(df)
X
<2x2 sparse matrix of type '<class 'numpy.int64'>'
    with 2 stored elements in Compressed Sparse Row format>
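
The 2x2 shape comes from how pandas iterates a DataFrame: it yields the column labels, not the rows. The sketch below illustrates this on a hypothetical two-row stand-in for the feedback data (the rows and the names documents_seen and X_wrong are assumptions made for illustration):

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical two-row stand-in for the feedback data.
df = pd.DataFrame({
    "data": ["good at teaching.", "lectures are very lengthy."],
    "target": ["positive", "negative"],
})

# Iterating a DataFrame yields its column labels, not its rows.
documents_seen = list(df)

cv = CountVectorizer(binary=True)
X_wrong = cv.fit_transform(df)        # the vectorizer only sees 'data' and 'target'
vocabulary = sorted(cv.vocabulary_)   # the learned terms are the column names

print(documents_seen, X_wrong.shape, vocabulary)
```

So the 2 in the error message [2, 144] is the number of columns, not the number of comments.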

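So you can see you are not getting what you want. You are not transforming each row. You are not even fitting the CountVectorizer properly, because you pass it the whole DataFrame instead of just the corpus of comments: iterating a DataFrame yields its column names, so the vectorizer only ever sees the two strings 'data' and 'target'. To fix this, we need to make sure the counting is done on the right input. If you fit on the correct corpus: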

cv = CountVectorizer(binary = True)
cv.fit(df['data'].values)
X = cv.transform(df)
X
<2x23 sparse matrix of type '<class 'numpy.int64'>'
    with 0 stored elements in Compressed Sparse Row format>

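You can see we are getting closer to what we want: the vocabulary now has 23 terms. But transform(df) still only sees the two column names, which is why there are 2 rows and 0 stored elements. We just need to transform properly (transform each row):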

cv = CountVectorizer(binary = True)
cv.fit(df['data'].values)
X = df['data'].apply(lambda x: cv.transform([x])).values
X
array([<1x23 sparse matrix of type '<class 'numpy.int64'>'
with 5 stored elements in Compressed Sparse Row format>,
   ...
       <1x23 sparse matrix of type '<class 'numpy.int64'>'
    with 3 stored elements in Compressed Sparse Row format>], dtype=object)
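
Note that the result above is an object array of separate 1x23 matrices rather than a single matrix. As a rough sketch (not part of the original answer), the same rows can be obtained in one call by passing the text column itself to transform. The three sample comments and the names per_row, stacked and direct are assumptions for illustration, and a list comprehension stands in for the apply call:

```python
import numpy as np
import pandas as pd
from scipy.sparse import vstack
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical three-row sample of comments.
df = pd.DataFrame({"data": [
    "good at teaching.",
    "lectures are very lengthy.",
    "good subjective knowledge.",
]})

cv = CountVectorizer(binary=True)
cv.fit(df["data"])

# Per-row transform: one 1 x n_terms matrix per comment.
per_row = [cv.transform([text]) for text in df["data"]]
stacked = vstack(per_row)              # stack them into one 3 x n_terms matrix

# Single call on the column: already a 3 x n_terms matrix.
direct = cv.transform(df["data"])

same = np.array_equal(stacked.toarray(), direct.toarray())
print(stacked.shape, direct.shape, same)
```

A single 2-D matrix is the form scikit-learn estimators generally expect, so the one-call version is usually the more convenient one to carry forward.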

Now we have a more suitable X! We just need to check that it can be split:

target = [1 if i<72 else 0 for i in range(8)] # the sample dataset here has only 8 rows
# X now has 8 entries, matching the 8 labels, so the split no longer fails
X_train, X_val, y_train, y_val = train_test_split(X, target, train_size = 0.50)

It works!

Make sure you understand how CountVectorizer works so that you use it the right way.
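
For completeness, here is a minimal end-to-end sketch of one way to put this together. It goes beyond the original answer: it uses fit_transform on the text column to get a single matrix, and it derives the labels from the target column instead of from row position, since in the printed sample the rows are not sorted by class (row 1 is negative). The eight rows are the ones printed in the question; stratify, random_state and LinearSVC are choices made for this illustration only.

```python
import pandas as pd
from sklearn import svm
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

# The eight comments printed in the question.
df = pd.DataFrame({
    "data": [
        "facilitates good student teacher communication.",
        "lectures are very lengthy.",
        "the teacher is very good at interaction.",
        "good at clearing the concepts.",
        "good at clearing the concepts.",
        "good at teaching.",
        "does not shows test copies.",
        "good subjective knowledge.",
    ],
    "target": ["positive", "negative", "positive", "positive",
               "positive", "positive", "negative", "positive"],
})

cv = CountVectorizer(binary=True)
X = cv.fit_transform(df["data"])                # one row per comment
y = (df["target"] == "positive").astype(int)    # labels read from the column

# X and y now have the same number of samples, so the split succeeds.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, train_size=0.50, stratify=y, random_state=0
)

# A single sparse matrix can be passed straight to an estimator.
clf = svm.LinearSVC().fit(X_train, y_train)
print(X.shape, X_train.shape, X_val.shape)
```

On the full 144-row file the same steps should apply with feedback_data in place of df, assuming its columns are named data and target as in the printed output.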
