Merging additional data into my TF-IDF array


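I am trying to build a text classification model with scikit-learn. To start with, I am using only the tfidf array of the text as features. My dataset is structured as shown below (it is stored in a pandas DataFrame named df):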

>>>df.head(2)

       id_1    id_2    id_3    target    text
       11      454     320     197       some text here
       15      440     111     205       text goes here too

>>>df.info()

    Data columns (total 5 columns):
     #   Column    Non-Null Count   Dtype 
    ---  ------    --------------   ----- 
     0   id_1      500 non-null     uint16
     1   id_2      500 non-null     uint16
     2   id_3      500 non-null     uint16
     3   target    500 non-null     uint16
     4   text      500 non-null     object

So I split the data into training and test sets, then create the tfidf vectorizer and transform both the training and the test data.

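from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(df['text'], df['target'], random_state=0)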

vectorizer = TfidfVectorizer(max_features=500, decode_error="ignore", ngram_range=(1, 2))
vectorizer.fit(X_train)

X_train_tfidf = vectorizer.transform(X_train)
X_test_tfidf  = vectorizer.transform(X_test)

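So far the code clearly works. However, the model needs to be improved by including another feature. To that end, I would like to add the id_1 column to my features (it may be important information for our ML model). So, in addition to my tfidf matrix, I want to include this column (id_1) in my new feature set, so that I can pass everything as the input to train the model.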

What I have tried:

X_train, X_test, y_train, y_test = train_test_split(df['text'], df['target'], random_state=0)

vectorizer = TfidfVectorizer(max_features=500, decode_error="ignore", ngram_range=(1, 2))
vectorizer.fit(X_train)

X_train_tfidf = vectorizer.transform(X_train)
X_test_tfidf  = vectorizer.transform(X_test)

X_train_all_features = pd.concat([pd.DataFrame(X_train_tfidf.toarray()), df['id_1']], axis = 1)

The resulting shapes are:

>>>print(X_train_tfidf.shape)

(37, 500) # as expected (I'm loading 50 lines, so this is about 75%)

>>>print(X_train_all_features.shape)

(50, 501) # the number of columns is as expected, but not the number of rows, because df['id_1'] was not split by train_test_split
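
The row count of 50 can be reproduced with toy data. This is a minimal sketch, assuming a 50-row frame and a 37-row block whose index has been reset to 0..36 (which is what `pd.DataFrame(X_train_tfidf.toarray())` produces). The names `toy_tfidf` and `toy_df` are invented for illustration. `pd.concat(..., axis=1)` aligns its inputs on index labels with an outer join, so every label from either input survives:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins: 37 "training" rows with a fresh 0..36 index,
# and the full 50-row id_1 column with index 0..49.
toy_tfidf = pd.DataFrame(np.zeros((37, 3)))
toy_df = pd.DataFrame({"id_1": np.arange(50)})

# concat along axis=1 aligns on index labels (outer join by default).
combined = pd.concat([toy_tfidf, toy_df["id_1"]], axis=1)

print(combined.shape)                      # (50, 4)
print(combined.iloc[:, 0].isna().sum())    # 13 rows have no tfidf values
```

Beyond the row count, the first 37 rows are also likely paired with the wrong id_1 values, because row k of the shuffled training set is generally not row k of df.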

In short, I want to pass what is shown in the image below to my ML algorithm: my tfidf vectors together with my id_1 feature:

[Image: tfidf concat id_1]

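I feel like I am missing something very basic, but even after all my research I have still not been able to solve my problem satisfactorily. Honestly, I am lost at this part of the problem and do not know how to proceed from here.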

python pandas scikit-learn nlp tf-idf
1 Answer

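As far as I understand, you need the indices of the observations in X_train_tfidf in order to get the corresponding values from df['id_1'], so you cannot simply append the whole df['id_1'] column to X_train_tfidf. Try replacing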

X_train_all_features = pd.concat([pd.DataFrame(X_train_tfidf.toarray()), df['id_1']], axis = 1)

with the following code:

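# X_train_tfidf is a SciPy sparse matrix with no index, so take the row labels from X_train
X_train_all_features = pd.DataFrame(X_train_tfidf.toarray(), index=X_train.index)
X_train_all_features['id_1'] = df.loc[X_train.index, 'id_1']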

Let me know if this works.
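
An alternative that keeps the matrix sparse is to split the DataFrame itself, so that id_1 is split together with text, and then append the column with `scipy.sparse.hstack`. This is a minimal sketch under assumptions: the toy `df` below is invented stand-in data, and id_1 is used as a raw numeric column (if it is really a categorical identifier, one-hot encoding it would probably be more appropriate).

```python
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Invented toy data with the same column names as the question.
df = pd.DataFrame({
    "id_1": np.arange(8),
    "target": [0, 1, 0, 1, 0, 1, 0, 1],
    "text": ["some text here", "text goes here too", "more text", "other words",
             "some other text", "words go here", "here too", "text and words"],
})

# Split the rows once, so text, id_1 and target stay aligned.
X_train, X_test, y_train, y_test = train_test_split(
    df[["text", "id_1"]], df["target"], random_state=0
)

vectorizer = TfidfVectorizer(max_features=500, ngram_range=(1, 2))
X_train_tfidf = vectorizer.fit_transform(X_train["text"])
X_test_tfidf = vectorizer.transform(X_test["text"])

# Append id_1 as one extra sparse column to each matrix.
X_train_all = hstack([X_train_tfidf, csr_matrix(X_train[["id_1"]].to_numpy())]).tocsr()
X_test_all = hstack([X_test_tfidf, csr_matrix(X_test[["id_1"]].to_numpy())]).tocsr()

print(X_train_all.shape, X_test_all.shape)
```

The combined matrices can be passed to an estimator's fit and predict in place of the plain tfidf matrices. Since raw id_1 values are much larger than tfidf weights, scaling that column may matter for some models.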
