In a text classification task, how do I write feature extractor classes for a pipeline?


I am building a text authorship attribution model. The classifier is an SVM (linear kernel), and I want to evaluate it with cross_val_score from sklearn.model_selection.

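My question is how to feed different features (mostly custom ones, not transformers that ship with the library) to the classifier through a pipeline, so that the classifier is trained on all of them (e.g. average sentence length, punctuation frequency, vocabulary richness).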

This code, which uses the library's standard tf-idf transformer, works well:

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn import svm
from sklearn.pipeline import Pipeline

from sklearn import preprocessing

# example of data
data = [['Anton', "The revival of the 2015 festival ushered in live music from iconic Filipino talents such as Barbie Almalbis, Kevin Toy’s, and Hilera, which had the beaches of San Juan flowing with good vibes."],
        ['Anton', "Tip: For a hassle-free experience, make sure to pre-book online with Biyaheroes.com, which makes public transport so much easier, even for first-time commuters. With their real-time seat and schedule selectors, commuters get a very useful overview of their trip schedules so they can plan ahead."],
        ['Anton', "Hungry surfers and sun worshipers can easily walk along the beach and on the parallel road, where lanes of restaurants offer a wide array of cuisines. There are also a number of cafes and food stalls to choose from."],
        ['Brendan', 'Today, I’m back here again, and again reminded of what makes Alberta such a brilliant place to travel: its diversity.  I left the edge of the snow-covered Rocky Mountains in the morning, and by midday I’m here in the dry heat of the desert and prairies looking down on a valley of still water and stone figures.'],
        ['Brendan', "A life in which I spent my nights sipping exotic drinks, nibbling on strange foods, and diving head first into the local night life. All at the same time I feel scared. But unlike most people this is the part I love. I love being scared, because travel has taught when you’re scared you’re probably about to embark on something incredible."],
        ['Brendan', "Of the 44 kilometers of trail, about 25 of those take hikers above the treeline.  And well the trail isn’t exactly super challenging, most of it is fairly flat aside from a couple sections, it does take you to parts of the mountains that usually require extreme hikes to get to."],
        ['Dave', 'If anyone has a fun personality and wants to start living abroad, I’d definitely recommend applying to be  tour guide around Europe!'],
        ['Dave', 'I found myself a decent job, a great shared house to live in, and had an amazing crew to hang out with every weekend.  I was no longer a nomad.  Sydney became more than just another travel destination, it became my second home.'],
        ['Dave', "I immediately fell in love with the long-term backpacking culture, the budget travel options in South-East Asia, and treating the world as my classroom.  Traveling during your twenties is so important, and I’m so happy I figured out this was an option!"],
        ['Derek', "The other day I received an email from a reader asking me to confirm the proper way to bargain in foreign countries. The ‘proper way’ that was mentioned is something that I’ve heard from travelers all the time. It’s the 50% rule. And to me, the rule is wrong."],
        ['Derek', "If you see something you want to purchase, visit 2-3 other shops nearby that sell the same thing or something similar. Ask how much it costs at each of the shops. This will give you a general idea of a true starting price for negotiations. If one shop quotes you $50, another quotes $35 and another one quotes you $20, you know the actual price is below $20."],
        ['Derek', "As travel becomes more and more popular and commonplace though, such tourist crowds seem to be the norm all over the world. Walking down the street in many destinations requires a lot of focus in order to avoid bumping into strollers, lost tourists and group leaders that don’t seem to mind taking over the sidewalks."]]

df = pd.DataFrame(data, columns = ['author', 'text'])

# define data set
X = df['text']

# define labels set; transform non-numerical labels to numerical labels
labelEncoder = preprocessing.LabelEncoder()
y = labelEncoder \
       .fit(df['author'].unique()) \
       .transform(df['author'].values)

# create pipeline
pipeline = Pipeline([
       ('tf_idf', TfidfVectorizer()),
       ('classifier', svm.SVC(kernel='linear'))
])

# cross-validation
scores_pipe = cross_val_score(pipeline, X, y, scoring='accuracy', cv=2)
mean_pipe_score = scores_pipe.mean()
print("Accuracy for tf-idf:", mean_pipe_score)

The problem appears when I try to create a custom transformer class (adapted from an example linked in the original post). I get warnings and accuracy = nan.

import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn import svm
from sklearn.pipeline import Pipeline
from sklearn import preprocessing
from sklearn.base import BaseEstimator, TransformerMixin
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# example of data
data = [['Anton', "The revival of the 2015 festival ushered in live music from iconic Filipino talents such as Barbie Almalbis, Kevin Toy’s, and Hilera, which had the beaches of San Juan flowing with good vibes."],
        ['Anton', "Tip: For a hassle-free experience, make sure to pre-book online with Biyaheroes.com, which makes public transport so much easier, even for first-time commuters. With their real-time seat and schedule selectors, commuters get a very useful overview of their trip schedules so they can plan ahead."],
        ['Anton', "Hungry surfers and sun worshipers can easily walk along the beach and on the parallel road, where lanes of restaurants offer a wide array of cuisines. There are also a number of cafes and food stalls to choose from."],
        ['Brendan', 'Today, I’m back here again, and again reminded of what makes Alberta such a brilliant place to travel: its diversity.  I left the edge of the snow-covered Rocky Mountains in the morning, and by midday I’m here in the dry heat of the desert and prairies looking down on a valley of still water and stone figures.'],
        ['Brendan', "A life in which I spent my nights sipping exotic drinks, nibbling on strange foods, and diving head first into the local night life. All at the same time I feel scared. But unlike most people this is the part I love. I love being scared, because travel has taught when you’re scared you’re probably about to embark on something incredible."],
        ['Brendan', "Of the 44 kilometers of trail, about 25 of those take hikers above the treeline.  And well the trail isn’t exactly super challenging, most of it is fairly flat aside from a couple sections, it does take you to parts of the mountains that usually require extreme hikes to get to."],
        ['Dave', 'If anyone has a fun personality and wants to start living abroad, I’d definitely recommend applying to be  tour guide around Europe!'],
        ['Dave', 'I found myself a decent job, a great shared house to live in, and had an amazing crew to hang out with every weekend.  I was no longer a nomad.  Sydney became more than just another travel destination, it became my second home.'],
        ['Dave', "I immediately fell in love with the long-term backpacking culture, the budget travel options in South-East Asia, and treating the world as my classroom.  Traveling during your twenties is so important, and I’m so happy I figured out this was an option!"],
        ['Derek', "The other day I received an email from a reader asking me to confirm the proper way to bargain in foreign countries. The ‘proper way’ that was mentioned is something that I’ve heard from travelers all the time. It’s the 50% rule. And to me, the rule is wrong."],
        ['Derek', "If you see something you want to purchase, visit 2-3 other shops nearby that sell the same thing or something similar. Ask how much it costs at each of the shops. This will give you a general idea of a true starting price for negotiations. If one shop quotes you $50, another quotes $35 and another one quotes you $20, you know the actual price is below $20."],
        ['Derek', "As travel becomes more and more popular and commonplace though, such tourist crowds seem to be the norm all over the world. Walking down the street in many destinations requires a lot of focus in order to avoid bumping into strollers, lost tourists and group leaders that don’t seem to mind taking over the sidewalks."]]

df = pd.DataFrame(data, columns = ['author', 'text'])

# define data set
X = df['text']

# define labels set; transform non-numerical labels to numerical labels
labelEncoder = preprocessing.LabelEncoder()
y = labelEncoder \
       .fit(df['author'].unique()) \
       .transform(df['author'].values)

# extracts given columns from df
class ColumnSelector(BaseEstimator, TransformerMixin):

       def __init__( self, feature_names ):
              self._feature_names = feature_names

       def fit( self, X, y = None ):
              return self

       def transform( self, X, y = None ):
              return X[self._feature_names]

# counts the frequency of ! among all the chars
def count_exclamationMark(text):
       counter = 0

       for char in text:
              if char == "!":
                     counter +=1
       return counter / len(text)

# Transforming column of text data into frequencies of !
class ExclamationTransformer(BaseEstimator,TransformerMixin):
       #Class Constructor
       def __init__(self, exclamation = True):
              self._exclamation = exclamation

       #Return self, nothing else to do here
       def fit( self, X, y = None ):
              return self

       # Custom transform method that creates the aforementioned feature and drops the redundant column
       def transform(self, X, y = None):
              #Check if needed
              if self._exclamation:
                     #create new column

                     X['exclamations'] = X['text'].apply(count_exclamationMark)
                     X = X.drop('text', axis = 1)
              #Converting any infinity values in the dataset to Nan
              X = X.replace([ np.inf, -np.inf ], np.nan)
              #returns a numpy array
              return X.values


# When I apply these classes manually in a train-and-test approach, everything works
columns = ['text']
selector = ColumnSelector(columns)
a = selector.transform(df)

exclamator = ExclamationTransformer(exclamation=1)
b = exclamator.transform(a)

X_train, X_test, y_train, y_test = train_test_split(b, y, test_size=0.3,random_state=1, stratify=y)

clf = svm.SVC(kernel='linear')
clf.fit(X_train, y_train)
predictions = clf.predict(X_test)

print("Accuracy for exclamations: ",accuracy_score(y_test, predictions))
# Output: Accuracy for exclamations:  0.25


pipeline = Pipeline([
       ('text_extraction',ColumnSelector(columns)),
       ('exclamations', ExclamationTransformer(exclamation=1)),
       ('classifier', svm.SVC(kernel='linear'))
])

# At this point I get the warning and the error listed below
scores_pipe = cross_val_score(pipeline, df, y, scoring='accuracy', cv=2)

mean_pipe_score = scores_pipe.mean()

print("Accuracy for exclamations:", mean_pipe_score)
#Output: Accuracy for exclamations: nan

Warning messages:

FutureWarning: From version 0.24, get_params will raise an AttributeError if a parameter cannot be retrieved as an instance attribute. Previously it would return None.
  FutureWarning)
FitFailedWarning: Estimator fit failed. The score on this train-test partition for these parameters will be set to nan. Details: 
KeyError: None
  FitFailedWarning)

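I have spent several hours on this and still cannot tell what is wrong, nor how to feed custom features into a pipeline, let alone combine several custom features with a typical vectorizer. Does anyone know why this happens or how to fix it?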

python pipeline cross-validation feature-extraction text-classification
1 Answer

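In a class that derives from (BaseEstimator, TransformerMixin), store each constructor argument under an attribute with exactly the same name, and use that same name in the methods. cross_val_score clones the pipeline for every fold, and cloning rebuilds each step from get_params(), which looks up attributes named after the __init__ parameters. With self._feature_names, the lookup for feature_names finds nothing and returns None (that is the FutureWarning), so the clone is built with feature_names=None and X[None] raises the KeyError: None. Calling transform by hand worked because nothing was cloned. The same naming rule applies to self._exclamation in ExclamationTransformer. For example: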

class ColumnSelector(BaseEstimator, TransformerMixin):

   def __init__(self, feature_names):
       self.feature_names = feature_names

   def fit(self, X, y=None):
       return self

   def transform(self, X, y=None):
       return X[self.feature_names]
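
The question also asks how to combine custom features with a standard vectorizer. One possible way, not taken from the answer above, is scikit-learn's FeatureUnion, which stacks the outputs of several transformers side by side. The following is a minimal sketch under stated assumptions: the six-row DataFrame is made up for illustration, ExclamationTransformer is rewritten here without its constructor flag so that it has no parameters to mismatch, and the mixin is listed before BaseEstimator, which is the order the scikit-learn developer guide recommends.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin, clone
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.svm import SVC


class ColumnSelector(TransformerMixin, BaseEstimator):
    """Returns the requested column(s) of a DataFrame."""

    def __init__(self, feature_names):
        # Same name as the constructor argument, so get_params()/clone() can find it.
        self.feature_names = feature_names

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        return X[self.feature_names]


class ExclamationTransformer(TransformerMixin, BaseEstimator):
    """Maps an iterable of texts to one column: share of '!' among all characters."""

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        # max(..., 1) guards against division by zero on empty strings.
        freqs = [text.count("!") / max(len(text), 1) for text in X]
        return np.array(freqs, dtype=float).reshape(-1, 1)


# Hypothetical toy data, only to make the sketch runnable.
df = pd.DataFrame({
    "author": ["A", "A", "A", "B", "B", "B"],
    "text": [
        "What a trip! I loved it!",
        "Great food! Great people!",
        "We surfed all day! Amazing!",
        "The trail is long and mostly flat.",
        "Prices vary between shops nearby.",
        "The bus leaves early in the morning.",
    ],
})

# With matching names, a clone keeps the constructor argument.
selector_copy = clone(ColumnSelector("text"))

# Each branch selects the text column, then extracts its own features;
# FeatureUnion concatenates the resulting columns.
features = FeatureUnion([
    ("tf_idf", Pipeline([
        ("select", ColumnSelector("text")),
        ("vectorize", TfidfVectorizer()),
    ])),
    ("exclamations", Pipeline([
        ("select", ColumnSelector("text")),
        ("count", ExclamationTransformer()),
    ])),
])

pipeline = Pipeline([
    ("features", features),
    ("classifier", SVC(kernel="linear")),
])

scores = cross_val_score(pipeline, df, df["author"], scoring="accuracy", cv=2)
print("Accuracy for tf-idf + exclamations:", scores.mean())
```

Further custom features (average sentence length, vocabulary richness, and so on) could be added as extra branches of the FeatureUnion in the same way. Because tf-idf values and hand-made frequencies live on different scales, scaling the custom columns before the SVM may be worth trying.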