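When I try to use a Pipeline to combine a pair of transformers, the second transformer (Log) does not seem to be applied.
I tried simplifying the Log transformer so that it performs a simple addition, but the same problem persists.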
import pandas as pd
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.base import BaseEstimator, TransformerMixin
class Impute(BaseEstimator, TransformerMixin):
    def __init__(self, columns=None, value='mean'):
        """
        columns: A list of columns to apply the imputation to.
        value:
            - "mean": Fills in missing values with mean of training data
            - number: Fills in values with that number
            - dictionary: Fills in values where dictionary keys are column names
        """
        self.columns = columns
        self.value = value

    def fit(self, X, y=None):
        if self.columns is None:
            self.columns = X.columns
        if isinstance(self.value, str):
            if self.value == "mean":
                self.value = X[self.columns].mean()
            elif self.value == 'median':
                self.value = X[self.columns].median()
        return self

    def transform(self, X):
        X[self.columns] = X[self.columns].fillna(self.value)
        return X

class Log(BaseEstimator, TransformerMixin):
    def __init__(self, columns=None, offset_value=0):
        """
        offset_value: a value to specify to handle invalid outputs such as log(0) or log(negative values)
        """
        self.columns = columns
        self.offset_value = offset_value

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        X_new = X.copy()
        X_new[self.columns] = np.log(X_new[self.columns] + self.offset_value)
        return X_new

###########################
temp = pd.DataFrame([[590, 3, None, "2018-01-01"], [0, 2, 3, "2018-01-01"],
                     [590, 2, 4, "2019-01-01"], [None, None, 4, "2018-01-01"],
                     [850, None, 4, "2018-01-01"]],
                    columns=["credit_score", "n_cats", "premium", "fix_date"])
print(temp)
impute = Impute(columns=["credit_score", "n_cats", "premium"], value="mean")
impute.fit(temp)
temp = impute.transform(temp)
log = Log(columns=["credit_score", "n_cats", "premium"], offset_value=1)
log.fit(temp)
temp = log.transform(temp)
temp
###########################
temp = pd.DataFrame([[590, 3, None, "2018-01-01"], [0, 2, 3, "2018-01-01"],
                     [590, 2, 4, "2019-01-01"], [None, None, 4, "2018-01-01"],
                     [850, None, 4, "2018-01-01"]],
                    columns=["credit_score", "n_cats", "premium", "fix_date"])
print(temp)
impute = Impute(columns=["credit_score", "n_cats", "premium"], value="mean")
log = Log(columns=["credit_score", "n_cats", "premium"], offset_value=1)
steps = [("impute", impute),
         ("log", log)]
pipe = Pipeline(steps)
pipe.fit(temp)
pipe.transform(temp)
temp

When the transformers are applied separately, the output is:
   credit_score    n_cats   premium    fix_date
0      6.381816  1.386294  1.558145  2018-01-01
1      0.000000  1.098612  1.386294  2018-01-01
2      6.381816  1.098612  1.609438  2019-01-01
3      6.231465  1.203973  1.609438  2018-01-01
4      6.746412  1.203973  1.609438  2018-01-01

When I use the pipeline instead, the output is:
   credit_score    n_cats  premium    fix_date
0         590.0  3.000000     3.75  2018-01-01
1           0.0  2.000000     3.00  2018-01-01
2         590.0  2.000000     4.00  2019-01-01
3         507.5  2.333333     4.00  2018-01-01
4         850.0  2.333333     4.00  2018-01-01
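
The problem is a difference in how the transform methods of your Impute and Log classes are implemented. In Impute you modify X in place (without copying) and then return it. In Log, however, you first copy X, make the modifications to that copy, and then return the copy. That is why temp ends up imputed but not log-transformed: Impute changed temp in place during pipe.fit, while the Log result exists only in the DataFrame returned by pipe.transform, which your code discards.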
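
To make the difference concrete, here is a minimal sketch that reduces the two transform styles to plain functions. The names fill_in_place and add_one_on_copy and the column "a" are made up for illustration; they only mirror what Impute.transform and Log.transform do.

```python
import pandas as pd


def fill_in_place(X):
    # Mirrors Impute.transform: writes into the caller's DataFrame and returns it.
    X["a"] = X["a"].fillna(0)
    return X


def add_one_on_copy(X):
    # Mirrors Log.transform: works on a copy, so the caller's DataFrame is untouched.
    X_new = X.copy()
    X_new["a"] = X_new["a"] + 1
    return X_new


df = pd.DataFrame({"a": [1.0, None]})

out1 = fill_in_place(df)
print(out1 is df)           # True: the very same object came back
print(df["a"].tolist())     # [1.0, 0.0]: df itself was changed

out2 = add_one_on_copy(df)
print(out2 is df)           # False: a new DataFrame came back
print(df["a"].tolist())     # [1.0, 0.0]: df does not see the addition
print(out2["a"].tolist())   # [2.0, 1.0]: only the returned copy has it
```

If the return value of the second function is thrown away, its work is lost, which is the situation with pipe.transform(temp) in the question.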

The quick fix is to look at the return value of the pipeline rather than at temp:
pipe = Pipeline(steps)
pipe.fit(temp)
new_df = pipe.transform(temp)
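
In general, it is better practice not to modify the original DataFrame X at all and to apply modifications only to a copy of it. That way the transform method always returns a brand-new DataFrame, and your original DataFrame stays intact.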
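
As a rough sketch of that advice, Impute could be rewritten along the following lines. The class name ImputeCopy and the attributes columns_ and fill_ are my own choices, not part of your code. Storing fitted state in trailing-underscore attributes follows the usual scikit-learn convention and avoids overwriting the constructor parameters in fit, which your version also does.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class ImputeCopy(BaseEstimator, TransformerMixin):
    """Hypothetical variant of Impute that never modifies its input."""

    def __init__(self, columns=None, value="mean"):
        self.columns = columns
        self.value = value

    def fit(self, X, y=None):
        # Learned state lives in trailing-underscore attributes, so the
        # constructor parameters stay exactly as the user passed them.
        self.columns_ = list(X.columns) if self.columns is None else list(self.columns)
        if isinstance(self.value, str) and self.value == "mean":
            self.fill_ = X[self.columns_].mean()
        elif isinstance(self.value, str) and self.value == "median":
            self.fill_ = X[self.columns_].median()
        else:
            self.fill_ = self.value
        return self

    def transform(self, X):
        # Work on a copy so the caller's DataFrame is left untouched.
        X_new = X.copy()
        X_new[self.columns_] = X_new[self.columns_].fillna(self.fill_)
        return X_new


df = pd.DataFrame({"credit_score": [590.0, None, 850.0],
                   "n_cats": [3.0, 2.0, None]})
imputer = ImputeCopy(columns=["credit_score", "n_cats"])
result = imputer.fit(df).transform(df)

print(df.isna().sum().sum())      # 2: the original still has its missing values
print(result.isna().sum().sum())  # 0: only the returned copy is filled
```

With both transformers written this way, the pipeline result must always be taken from the return value of pipe.transform, and the input DataFrame can be reused safely.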