I have data in the following tabular form:
Name  Mas      Sce
M     ( (87)   83
      (91)
      (97) )
T     (77)     76
R     (60)     32
G     (95)     20
M     ( (50)   89
      (50)
      (99) )
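Some of my records span multiple rows, for example the M case. The data is enclosed in parentheses.
I tried dropping duplicates. That works when each record is a single row, but now several rows belong together as one group.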
import pandas as pd
d = {'Name': ['M', None, None, 'T', 'R', 'G', 'M', '', ''],
     'Mas': ['( (87)', '(91)', '(97) )', '(77)', '(60)', '(95)', '( (50)', '(50)', '(99) )'],
     'Sce': ['83', '', '', '76', '32', '20', '89', '', '']}
df = pd.DataFrame(d)
df['Name'] = df['Name'].ffill()
print(df)
df.drop_duplicates(subset='Name', keep='first', inplace=True)
print(df)
I want to remove records that occur more than once, in this case the second M. Expected output:
Name  Mas      Sce
M     ( (87)   83
      (91)
      (97) )
T     (77)     76
R     (60)     32
G     (95)     20
Try:
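import numpy as np

# assumes df is the original frame built from d (before ffill / drop_duplicates)
# make the `Name` column consistent -> change "" and None to NaN
df["Name"] = np.where(df["Name"].isin(["", None]), np.nan, df["Name"])

# create a mask of what to keep and what to discard:
# named rows get their duplicated() flag, continuation rows inherit it via ffill
mask = ~(
    pd.Series(
        np.where(df["Name"].notna(), df["Name"].duplicated(keep="first"), np.nan),
        index=df.index,
    )
    .ffill()
    .astype(bool)
)

# print final df
print(df[mask])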
Prints:
  Name     Mas Sce
0    M  ( (87)  83
1  NaN    (91)
2  NaN  (97) )
3    T    (77)  76
4    R    (60)  32
5    G    (95)  20
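
An alternative sketch of the same idea, in case the mask above is hard to follow: give every multi-row record a group id and drop whole groups whose leading name has already been seen. The helper name `drop_repeated_groups` and its `key` parameter are made up for illustration, and the sketch assumes that a continuation row is marked by an empty string or None/NaN in `Name`.

```python
import pandas as pd


def drop_repeated_groups(df, key="Name"):
    """Hypothetical helper: keep only the first multi-row record per name."""
    # A row starts a new record when its key is neither missing nor "".
    starts = df[key].fillna("").ne("")
    # Running count of start rows: continuation rows share their start row's id.
    group_id = starts.cumsum()
    # Flag start rows whose name already appeared on an earlier start row.
    repeated_start = starts & df[key].where(starts).duplicated(keep="first")
    # Spread that flag from each start row to the rest of its group.
    repeated_group = repeated_start.groupby(group_id).transform("any")
    return df[~repeated_group]


d = {'Name': ['M', None, None, 'T', 'R', 'G', 'M', '', ''],
     'Mas': ['( (87)', '(91)', '(97) )', '(77)', '(60)', '(95)', '( (50)', '(50)', '(99) )'],
     'Sce': ['83', '', '', '76', '32', '20', '89', '', '']}
df = pd.DataFrame(d)

result = drop_repeated_groups(df)
print(result)
```

Unlike the `np.where` version, this leaves the `Name` column as it was, so continuation rows keep their original None or "" markers. It should keep rows 0 to 5 of the sample data, though it has only been reasoned through on that sample.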