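I am trying to build a text classification model with scikit-learn. To start with, I am using only the TF-IDF array of the text as features. My dataset is structured as follows (it is stored in a pandas DataFrame named df):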
>>> df.head(2)
id_1  id_2  id_3  target  text
11    454   320   197     some text here
15    440   111   205     text goes here too
>>> df.info()
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   id_1    500 non-null    uint16
 1   id_2    500 non-null    uint16
 2   id_3    500 non-null    uint16
 3   target  500 non-null    uint16
 4   text    500 non-null    object
So I split the data into training and test sets, then fit a TF-IDF vectorizer and transform both sets.
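from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(df['text'], df['target'], random_state=0)
vectorizer = TfidfVectorizer(max_features=500, decode_error="ignore", ngram_range=(1, 2))
vectorizer.fit(X_train)  # learn the vocabulary from the training text only
X_train_tfidf = vectorizer.transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)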
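So far the code appears to work fine. However, I need to improve the model by including another feature. To do that, I want to add the id_1 column to my features (it may be important information for our ML model). In other words, besides my TF-IDF matrix, I also want to include this column (id_1) in my new feature set so that I can pass it as input when training the model.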
What I have tried:
X_train, X_test, y_train, y_test = train_test_split(df['text'], df['target'], random_state=0)
vectorizer = TfidfVectorizer(max_features=500, decode_error="ignore", ngram_range=(1, 2))
vectorizer.fit(X_train)
X_train_tfidf = vectorizer.transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)
X_train_all_features = pd.concat([pd.DataFrame(X_train_tfidf.toarray()), df['id_1']], axis = 1)
So the shapes of my structures are:
>>> print(X_train_tfidf.shape)
(37, 500)  # as expected (I'm loading 50 rows, so this is about 75%)
>>> print(X_train_all_features.shape)
(50, 501)  # the number of columns is as expected, but not the number of rows, because df['id_1'] was not split by train_test_split
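The row mismatch above can be reproduced on toy data. The following is a minimal sketch, assuming made-up values and hypothetical names (`full`, `subset`) that stand in for df['id_1'] and the densified TF-IDF rows. It shows that pd.concat with axis=1 aligns on index labels and keeps the union of both indexes, so the longer input decides the row count:

```python
import pandas as pd

# Hypothetical toy data, not the real dataset.
full = pd.Series([10, 20, 30, 40], name="id_1")      # index 0..3, plays the role of df['id_1']
subset = pd.DataFrame({"tfidf_0": [0.1, 0.2, 0.3]})  # index 0..2, plays the role of the training rows

# Column-wise concatenation aligns on index labels (outer join by default),
# so the result has one row per label found in either input.
combined = pd.concat([subset, full], axis=1)

print(combined.shape)
print(combined)
```

The row that exists only in `full` gets NaN in the `tfidf_0` column, which is the same effect that produced 50 rows instead of 37.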
Overall, I want to pass the structure in the image below to my ML algorithm, that is, my TF-IDF vectors together with my id_1 feature:
[image]
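I feel like I am missing something very basic, but even after all my research I still have not been able to solve my problem satisfactorily. Honestly, I am lost at this part of the problem and do not know how to proceed from here.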
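As far as I can tell, you need the index of the training observations in order to pick the matching values from df['id_1'], so you cannot simply concatenate the whole df['id_1'] column onto the TF-IDF features. Note that X_train_tfidf is a SciPy sparse matrix and carries no index of its own; the row labels live on X_train. Try replacing
X_train_all_features = pd.concat([pd.DataFrame(X_train_tfidf.toarray()), df['id_1']], axis = 1)
with the following code:
# Densify the TF-IDF matrix with the training row labels, then pick id_1 for exactly those rows
X_train_all_features = pd.DataFrame(X_train_tfidf.toarray(), index=X_train.index)
X_train_all_features['id_1'] = df.loc[X_train.index, 'id_1']
X_train_all_features.columns = X_train_all_features.columns.astype(str)  # recent scikit-learn versions reject mixed int/str column names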
Let me know whether this works.
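If densifying the TF-IDF matrix is a concern, one possible alternative is to keep everything sparse. The sketch below is not the approach from the answer above; it assumes a small made-up frame in place of df, splits the text and id_1 columns together so they stay row-aligned, and appends id_1 as one extra column with scipy.sparse.hstack. It also treats id_1 as a plain numeric feature, which is an assumption; if it is really a categorical identifier, an encoding step may be more appropriate.

```python
import pandas as pd
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Hypothetical toy frame standing in for df.
df = pd.DataFrame({
    "id_1": [11, 15, 12, 19, 11, 15, 13, 17],
    "target": [0, 1, 0, 1, 0, 1, 0, 1],
    "text": [
        "some text here", "text goes here too", "more text", "another example",
        "yet another text", "short example here", "some more words", "final text here",
    ],
})

# Split both feature columns together so id_1 follows the same rows as text.
X_train, X_test, y_train, y_test = train_test_split(
    df[["text", "id_1"]], df["target"], random_state=0
)

vectorizer = TfidfVectorizer(max_features=500, ngram_range=(1, 2))
X_train_tfidf = vectorizer.fit_transform(X_train["text"])
X_test_tfidf = vectorizer.transform(X_test["text"])

# Append id_1 as a single extra sparse column on the right.
id_train = csr_matrix(X_train[["id_1"]].to_numpy(dtype=float))
id_test = csr_matrix(X_test[["id_1"]].to_numpy(dtype=float))
X_train_all = hstack([X_train_tfidf, id_train]).tocsr()
X_test_all = hstack([X_test_tfidf, id_test]).tocsr()

print(X_train_tfidf.shape, X_train_all.shape)
```

X_train_all and X_test_all can then be passed to an estimator that accepts sparse input, with y_train and y_test unchanged.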