I want to do a special kind of fillna() on the following dataset:
name,spend,received
A,1012,1200
A,?,1500
B,1300,?
B,2000,2500
B,?,?
C,?,?
C,?,?
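In this dataset, ? stands for any non-integer value, such as na or ???.
Each ? in the spend column of the A, B and C rows must be replaced with the mean of that group, i.e. ? should become np.mean(A), np.mean(B) or np.mean(C) respectively.
For C there are no other values, so the replacement can only be 0.
We cannot apply fillna(np.mean) directly in this case.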
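To make the problem concrete, here is a minimal sketch. It assumes the data is loaded with pd.read_csv and na_values="?" (that loading step is my assumption, not part of the problem statement). It shows that a plain column-wide fill uses the overall mean, whereas what is wanted is one mean per name:

```python
import pandas as pd
from io import StringIO

csv_text = """name,spend,received
A,1012,1200
A,?,1500
B,1300,?
B,2000,2500
B,?,?
C,?,?
C,?,?"""

# Treat "?" as missing while parsing, so both value columns become float.
df = pd.read_csv(StringIO(csv_text), na_values="?")

# A plain fill with the column mean ignores the groups entirely:
# every missing spend gets the mean of 1012, 1300 and 2000.
global_fill = df["spend"].fillna(df["spend"].mean())

# The per-group means are what is actually wanted.
# Group C has no observed values, so its mean is NaN.
group_means = df.groupby("name")["spend"].mean()

print(global_fill)
print(group_means)
```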
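Here is one solution.
import numpy as np
import pandas as pd
# Turn the "?" markers into NaN, then convert both value columns to numbers.
df = df.replace("?", np.nan)
df.spend = pd.to_numeric(df.spend)
df.received = pd.to_numeric(df.received)
# Fill each missing spend with the mean of its name group.
df.loc[df.spend.isna(), "spend"] = df.groupby("name").transform("mean").loc[df.spend.isna(), "spend"]
# Groups with no values at all (C) still hold NaN, so fall back to 0.
df["spend"] = df.spend.fillna(0)
Result:
name spend received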
0 A 1012.0 1200.0
1 A 1012.0 1500.0
2 B 1300.0 NaN
3 B 2000.0 2500.0
4 B 1650.0 NaN
5 C 0.0 NaN
6 C 0.0 NaN
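The result above still has NaN in the second value column, because only spend was filled. If both columns should get the same treatment, one possible sketch is below. It builds the frame inline so that it runs on its own; value_cols and group_means are names I introduced for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "name": list("AABBBCC"),
    "spend": ["1012", "?", "1300", "2000", "?", "?", "?"],
    "received": ["1200", "1500", "?", "2500", "?", "?", "?"],
})

value_cols = ["spend", "received"]  # hypothetical helper name

# Replace the "?" markers, then convert every value column to numeric.
df[value_cols] = df[value_cols].replace("?", np.nan).apply(pd.to_numeric)

# One row-aligned frame of group means, covering both columns at once.
group_means = df.groupby("name")[value_cols].transform("mean")

# Fill from the group means first, then use 0 where a group had no values.
df[value_cols] = df[value_cols].fillna(group_means).fillna(0)

print(df)
```

Passing a DataFrame to fillna fills each cell from the matching row and column, which is why group_means is built with transform rather than with a plain aggregation.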
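Solution:
Use pd.read_csv(..., na_values='?') to turn the ? markers into NaN while reading the file. The key line is then
df['spend'] = df.groupby('name')['spend'].transform(lambda s: s.fillna(s.mean())).fillna(0)
Here transform is used rather than apply so that the result keeps the original row index and can be assigned straight back to the column.
Code: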
import pandas as pd
from io import StringIO
dat = """name,spend,received
A,1012,1200
A,?,1500
B,1300,?
B,2000,2500
B,?,?
C,?,?
C,?,?"""
df = pd.read_csv(StringIO(dat), na_values='?')
name spend received
0 A 1012.0 1200.0
1 A NaN 1500.0
2 B 1300.0 NaN
3 B 2000.0 2500.0
4 B NaN NaN
5 C NaN NaN
6 C NaN NaN
df['spend'] = df.groupby('name')['spend'].transform(lambda s: s.fillna(s.mean())).fillna(0)
name spend received
0 A 1012.0 1200.0
1 A 1012.0 1500.0
2 B 1300.0 NaN
3 B 2000.0 2500.0
4 B 1650.0 NaN
5 C 0.0 NaN
6 C 0.0 NaN
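The question mentions that the placeholder may also look like na or ???. As a rough sketch, na_values accepts a list, so several markers can be declared at once. The sample data below is a hypothetical variant of the original with the markers mixed, and the loop over both columns is my addition:

```python
import pandas as pd
from io import StringIO

# Hypothetical variant of the sample data with several different markers.
dat = """name,spend,received
A,1012,1200
A,na,1500
B,1300,???
B,2000,2500
B,?,?
C,?,na
C,???,?"""

# Every string listed in na_values is parsed as NaN.
df = pd.read_csv(StringIO(dat), na_values=["?", "???", "na"])

# Same group-wise fill, written with transform and applied to both columns.
for col in ["spend", "received"]:
    df[col] = (
        df.groupby("name")[col]
        .transform(lambda s: s.fillna(s.mean()))
        .fillna(0)
    )

print(df)
```

This only works when the set of markers is known in advance. For arbitrary strings, coercing with pd.to_numeric(errors='coerce') is the more general route.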
Assuming ? can also be an arbitrary string:
import pandas as pd
import numpy as np
idx = ['A'] * 3 + ['B'] * 3 + ['C'] * 3
data = np.random.random_sample((9,2))
df = pd.DataFrame(index=idx, data=data[::], columns=['spend', 'recieved'])
df.index.name = 'name'
df.iloc[2, 1] = np.nan
df['spend'] = df['spend'].astype(object)  # cast to object first so the column can hold a string
df.iloc[1, 0] = 'ABCD'
df.iloc[4:6, 0] = np.nan
df
name spend recieved
A 0.197366 0.467532
A ABCD 0.256184
A 0.559562 NaN
B 0.59835 0.415382
B NaN 0.163827
B NaN 0.759888
C 0.897332 0.025344
C 0.782683 0.428465
C 0.201591 0.601339
Then:
df = df.apply(pd.to_numeric, errors='coerce')
df['spend'] = df['spend'].groupby(level=0).transform(lambda x: x.fillna(x.mean()).fillna(0))
df['recieved'] = df['recieved'].groupby(level=0).transform(lambda x: x.fillna(x.mean()).fillna(0))
This produces:
name spend recieved
A 0.197366 0.467532
A 0.378464 0.256184
A 0.559562 0.361858
B 0.598350 0.415382
B 0.598350 0.163827
B 0.598350 0.759888
C 0.897332 0.025344
C 0.782683 0.428465
C 0.201591 0.601339
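If this is needed in more than one place, the coercion and fill steps could be wrapped in a small helper. The sketch below is only one way to do it: fill_with_group_mean and its parameters are hypothetical names, and it assumes the group key is a regular column rather than the index.

```python
import numpy as np
import pandas as pd

def fill_with_group_mean(df, key, cols, default=0):
    """Return a copy of df in which each column in cols is numeric and
    missing or non-numeric entries are replaced by the mean of their
    key group, or by default when the group has no valid values."""
    out = df.copy()
    for col in cols:
        # Anything that cannot be parsed as a number becomes NaN.
        numeric = pd.to_numeric(out[col], errors="coerce")
        # Row-aligned group means; NaN for groups without valid values.
        group_mean = numeric.groupby(out[key]).transform("mean")
        out[col] = numeric.fillna(group_mean).fillna(default)
    return out

example = pd.DataFrame({
    "name": ["A", "A", "B", "B", "C"],
    "spend": [10, "ABCD", 4, np.nan, "?"],
})
result = fill_with_group_mean(example, "name", ["spend"])
print(result)
```

The helper works on a copy, so the input frame keeps its original values, including the non-numeric markers.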