I have a dataframe as shown below. I applied MinMaxScaler on the Price column and used the inverse_transform function to get back the original price values, but the results are wrong. Please advise me on this.
DataFrame:
Date Customer Price
1/6/2019 A 142404534.13
1/7/2019 A 989.34
1/8/2019 A 45444.57
1/9/2019 A 574343.10
1/10/2019 A 23232.34
1/1/2019 A 923423.00
1/2/2019 A 332.00
1/3/2019 A 2342323.24
1/4/2019 A 232.00
1/5/2019 A 65.70
1/6/2019 B 875.46
1/7/2019 B 142466027340.03
1/8/2019 B 25.17
1/9/2019 B 1.01
1/10/2019 B 1.00
1/10/2019 B 57.61
1/6/2019 B 232232.78
1/7/2019 B 15.20
1/8/2019 B 44.56
1/9/2019 B 2323254.45
1/10/2019 B 395.45
1/10/2019 B 23423454.92
1/6/2019 C 34.12
1/7/2019 C 89.34
1/8/2019 C 44.57
1/9/2019 C 343.10
1/10/2019 C 232.34
MinMaxScaler code on df:
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
df['Price'] = df['Price'].apply(lambda x: '{:.2f}'.format(x))
scaler=MinMaxScaler()
dff = df.groupby('Customer').Price.transform(lambda s:scaler.fit_transform(s.values.reshape(-1,1)).ravel())
dff = pd.DataFrame(dff)
dff['Price'] = dff['Price'].apply(lambda x: '{:.2f}'.format(x))
dff = pd.concat([dff['Price'] , df['Customer']] , axis=1)
dff output:
Price Customer
0 1.00 A
1 0.00 A
2 0.00 A
3 0.00 A
4 0.00 A
5 0.01 A
6 0.00 A
7 0.02 A
8 0.00 A
9 0.00 A
10 0.00 B
11 1.00 B
12 0.00 B
.
.
.
.
20 0.00 B
21 0.00 B
22 0.00 C
23 0.18 C
24 0.03 C
25 1.00 C
26 0.64 C
inverse_transform code to get the actual price values back:
dd = dff.groupby('Customer').Price.transform(lambda s: scaler.inverse_transform(s.values.reshape(-1,1)).ravel())
dd = pd.DataFrame(dd)
dd['Price'] = dd['Price'].apply(lambda x: '{:.2f}'.format(x))
dd = pd.concat([dd['Price'] , df['Customer']] , axis=1)
dd output:
Price Customer
0 343.10 A
1 34.12 A
2 34.12 A
3 34.12 A
4 34.12 A
5 37.21 A
6 34.12 A
7 40.30 A
8 34.12 A
9 34.12 A
10 34.12 B
11 343.10 B
12 34.12 B
13 34.12 B
.
.
.
.
.
18 34.12 B
19 34.12 B
20 34.12 B
21 34.12 B
22 34.12 C
23 89.74 C
24 43.39 C
25 343.10 C
26 231.87 C
Please help and advise me on this.
You have multiple issues in your script here. First:
df['Price'] = df['Price'].apply(lambda x: '{:.2f}'.format(x))
This converts the Price column to strings, so no numeric operation will work on that dtype. If you only want to control the displayed float precision, use pd.set_option('display.float_format', lambda x: '%.2f' % x) after the pandas import line instead.
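If it helps, here is a minimal sketch (with made-up illustrative prices) showing that display.float_format only changes how numbers are printed, while the column stays numeric:

```python
import pandas as pd

# Format only the *display* of floats; the underlying dtype is untouched
pd.set_option('display.float_format', lambda x: '%.2f' % x)

df = pd.DataFrame({'Price': [142404534.13, 989.34, 65.70]})
print(df)                  # values render with two decimals
print(df['Price'].dtype)   # float64 -> numeric operations still work
print(df['Price'].mean())  # arithmetic is fine, unlike with strings
```

Compare that with the .apply('{:.2f}'.format) version, where df['Price'].mean() would fail because the column holds strings.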
If I understand correctly, you want to scale the values to the 0-1 range per customer. If that is the case, you cannot do it in one line like this:
df.groupby('Customer').Price.transform(lambda s: scaler.fit_transform(s.values.reshape(-1,1)).ravel())
because each per-group iteration refits the same scaler object and overwrites the min/max parameters of the previous group. So in the end, when you call scaler.inverse_transform(), you only have the parameters of the last group.
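To see why a single shared scaler loses the earlier groups' parameters, consider this minimal sketch (toy ranges, not your data):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()

scaler.fit(np.array([[0.0], [10.0]]))    # "group A": range 0-10
scaler.fit(np.array([[0.0], [100.0]]))   # "group B": refitting overwrites A's min/max

# 0.5 now inverse-maps using group B's parameters (0.5 * 100 = 50.0),
# so group A's scaled values can no longer be recovered correctly.
print(scaler.inverse_transform(np.array([[0.5]]))[0, 0])  # 50.0, not 5.0
```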
Solution:
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

pd.set_option('display.float_format', lambda x: '%.7f' % x)

scalers_dict = {}  # Prepare place to hold one scaler per Customer
df = pd.read_csv('stack_data.csv', parse_dates=['Date'])
df['Price_scaled'] = None
df['Price_inverse_scaled'] = None

# Loop over Customers, scale Price and save each scaler for further use
for customer in pd.unique(df['Customer']):
    scalers_dict[customer] = MinMaxScaler()
    scaled_price = scalers_dict[customer].fit_transform(df[df['Customer']==customer].Price.values.reshape(-1, 1))
    df.loc[df['Customer']==customer, 'Price_scaled'] = scaled_price

# Loop over Customers and inverse the scaled values with each saved scaler
for customer in pd.unique(df['Customer']):
    inverse_scale = scalers_dict[customer].inverse_transform(df[df['Customer']==customer].Price_scaled.values.reshape(-1, 1))
    df.loc[df['Customer']==customer, 'Price_inverse_scaled'] = inverse_scale
Output (I set the precision to .7f because the maximum values are so much larger than the rest that most scaled values would display as 0.00 at two decimal places):
Date Customer Price Price_scaled Price_inverse_scaled
0 2019-01-06 A 142404534.1300000 1.0000000 142404534.1300000
1 2019-01-07 A 989.3400000 0.0000065 989.3400000
2 2019-01-08 A 45444.5700000 0.0003187 45444.5700000
3 2019-01-09 A 574343.1000000 0.0040327 574343.1000000
4 2019-01-10 A 23232.3400000 0.0001627 23232.3400000
5 2019-01-01 A 923423.0000000 0.0064840 923423.0000000
6 2019-01-02 A 332.0000000 0.0000019 332.0000000
7 2019-01-03 A 2342323.2400000 0.0164479 2342323.2400000
8 2019-01-04 A 232.0000000 0.0000012 232.0000000
9 2019-01-05 A 65.7000000 0.0000000 65.7000000
10 2019-01-06 B 875.4600000 0.0000000 875.4600000
11 2019-01-07 B 142466027340.0299988 1.0000000 142466027340.0299988
12 2019-01-08 B 25.1700000 0.0000000 25.1700000
13 2019-01-09 B 1.0100000 0.0000000 1.0100000
14 2019-01-10 B 1.0000000 0.0000000 1.0000000
15 2019-01-10 B 57.6100000 0.0000000 57.6100000
16 2019-01-06 B 232232.7800000 0.0000016 232232.7800000
17 2019-01-07 B 15.2000000 0.0000000 15.2000000
18 2019-01-08 B 44.5600000 0.0000000 44.5600000
19 2019-01-09 B 2323254.4500000 0.0000163 2323254.4500000
20 2019-01-10 B 395.4500000 0.0000000 395.4500000
21 2019-01-10 B 23423454.9200000 0.0001644 23423454.9200000
22 2019-01-06 C 34.1200000 0.0000000 34.1200000
23 2019-01-07 C 89.3400000 0.1787171 89.3400000
24 2019-01-08 C 44.5700000 0.0338210 44.5700000
25 2019-01-09 C 343.1000000 1.0000000 343.1000000
26 2019-01-10 C 232.3400000 0.6415302 232.3400000
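As a side note, since min-max scaling is just (x - min) / (max - min), the per-customer scaling and its inverse can also be written directly with groupby(...).transform, using the group min/max columns in place of stored scaler objects. A vectorized sketch on a small sample of the data:

```python
import pandas as pd

df = pd.DataFrame({
    'Customer': ['A', 'A', 'A', 'B', 'B'],
    'Price':    [65.70, 989.34, 142404534.13, 1.00, 142466027340.03],
})

# Broadcast each customer's min/max back onto its own rows
grp = df.groupby('Customer')['Price']
mn = grp.transform('min')
mx = grp.transform('max')

df['Price_scaled'] = (df['Price'] - mn) / (mx - mn)
# Inverse: undo the scaling with the same per-group parameters
df['Price_inverse_scaled'] = df['Price_scaled'] * (mx - mn) + mn
```

This avoids the explicit loops and the scalers_dict, at the cost of keeping the min/max columns around if you need the inverse later.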