How to pass a user-defined function to TfidfVectorizer.fit_transform()

Question (votes: 3, answers: 1)

I have a text preprocessing function that simply removes stop words:

def text_preprocessing():
    df['text'] = df['text'].apply(word_tokenize)
    df['text']=df['text'].apply(lambda x: [item for item in x if item not in stopwords])
    new_array=[]
    for keywords in df['text']: #converts list of words into string
         P=" ".join(str(x) for x in keywords)
         new_array.append(P)
    df['text'] = new_array
    return df['text']

I want to pass text_preprocessing() to another function, tf_idf(), which gives me the feature matrix. This is basically what I am doing:

def tf_idf():
    tfidf = TfidfVectorizer()
    feature_array = tfidf.fit_transform(text_preprocessing)
    keywords_data=pd.DataFrame(feature_array.toarray(), columns=tfidf.get_feature_names())
    return keywords_data

I get the error TypeError: 'function' object is not iterable.
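
The message suggests that fit_transform received the function itself rather than the documents it returns. A minimal sketch of the difference, where `make_docs()` is a hypothetical stand-in for text_preprocessing() (not the original code) and the two sample sentences are made up for illustration:

```python
from sklearn.feature_extraction.text import TfidfVectorizer


def make_docs():
    # Hypothetical stand-in for text_preprocessing(): returns an iterable of strings
    return ["this is a test", "is this working"]


tfidf = TfidfVectorizer()

try:
    # Passing the function object: the vectorizer tries to iterate over it and fails
    tfidf.fit_transform(make_docs)
    error_message = None
except TypeError as exc:
    error_message = str(exc)

# Calling the function first: the vectorizer receives the returned documents
features = tfidf.fit_transform(make_docs())
n_docs, n_terms = features.shape
```

Applied to the code above, this would presumably mean writing tfidf.fit_transform(text_preprocessing()), with the parentheses.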

python-3.x pandas user-defined-functions tfidfvectorizer natural-language-processing
1 answer
0 votes

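Instead of building a separate function for stop word removal, you can simply pass a custom stop word list to TfidfVectorizer. As you can see in the example below, "test" is successfully excluded from the Tfidf vocabulary.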

import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Setting up
numbers = np.random.randint(1, 5, 3)
text = ['This is a test.', 'Is this working?', "Let's see."]
df = pd.DataFrame({'text': text, 'numbers': numbers})

# Define custom stop words and instantiate TfidfVectorizer with them
my_stopwords = ['test'] # the list can be longer
tfidf = TfidfVectorizer(stop_words=my_stopwords)
text_tfidf = tfidf.fit_transform(df['text'])

# Optional - concatenating tfidf with df
# (get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() on 1.0+)
df_tfidf = pd.DataFrame(text_tfidf.toarray(), columns=tfidf.get_feature_names_out())
df = pd.concat([df, df_tfidf], axis=1)

# Initial df
df
Out[133]: 
   numbers              text
0        2   This is a test.
1        4  Is this working?
2        3        Let's see.

tfidf.vocabulary_
Out[134]: {'this': 3, 'is': 0, 'working': 4, 'let': 1, 'see': 2}

# Final df
df
Out[136]: 
   numbers              text        is       let       see      this   working
0        2   This is a test.  0.707107  0.000000  0.000000  0.707107  0.000000
1        4  Is this working?  0.517856  0.000000  0.000000  0.517856  0.680919
2        3        Let's see.  0.000000  0.707107  0.707107  0.000000  0.000000
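
If a user-defined function should still be part of the vectorizer, TfidfVectorizer also accepts callables through its `preprocessor`, `tokenizer`, and `analyzer` parameters. A minimal sketch using `analyzer`, assuming a hypothetical `my_analyzer` function and a regex that imitates the default token pattern. It is only an illustration, not the method used in the answer above:

```python
import re

from sklearn.feature_extraction.text import TfidfVectorizer

MY_STOPWORDS = {"test"}  # hypothetical stop word set, can be longer


def my_analyzer(doc):
    # Lowercase, keep word tokens of two or more characters, drop stop words
    tokens = re.findall(r"\b\w\w+\b", doc.lower())
    return [token for token in tokens if token not in MY_STOPWORDS]


docs = ["This is a test.", "Is this working?", "Let's see."]

# A callable analyzer receives each raw document and returns its tokens
tfidf = TfidfVectorizer(analyzer=my_analyzer)
text_tfidf = tfidf.fit_transform(docs)
vocabulary = sorted(tfidf.vocabulary_)
```

With a callable `analyzer`, the vectorizer's own lowercasing, tokenization, and n-gram steps are bypassed, so the function has to do all of that itself. A pandas column such as df['text'] can be passed in place of `docs`.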