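I have two columns in string format that hold either a single word or a comma-separated combination of words. col1
will always contain exactly one word. In this example I use the word Dog in col1,
but the word varies in the real data, so please don't solve this with a regex that is specific to Dog.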
import pandas as pd

df = pd.DataFrame({"col1": ["Dog", "Dog", "Dog", "Dog"],
                   "col2": ["Cat, Mouse", "Dog", "Cat", "Dog, Mouse"]})
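I want to check whether the word in col1 appears in the col2 string and, if it does, remove that word from col2.
Keep in mind that if col2 contains more words, I want to keep the rest of the string. So this: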
col1 col2
0 Dog Cat, Mouse
1 Dog Dog
2 Dog Cat
3 Dog Dog, Mouse
should become this:
col1 col2
0 Dog Cat, Mouse
1 Dog
2 Dog Cat
3 Dog Mouse
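(^,|,$) handles a comma at the start or end of the string.
(,\s|,) matches the commas left behind after the replace operation.
{1,} matches one or more consecutive occurrences, so a run of leftover commas collapses into a single comma.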
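# regex=True is needed on pandas >= 2.0, where str.replace defaults to literal matching.
df['col2'] = (
    df['col2'].str.replace("|".join(df['col1'].unique()), "", regex=True)
    .str.strip()
    .str.replace(r"(?:^,|,$)", "", regex=True)       # leading or trailing comma
    .str.replace(r"(?:,\s|,){1,}", ",", regex=True)  # collapse leftover commas
)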
col1 col2
0 Dog Cat,Mouse
1 Dog
2 Dog Cat
3 Dog Mouse
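To see what each replacement in the chain contributes, here is a minimal sketch that runs the same three steps on a single string with the standard `re` module. The helper name `trace_cleanup` is made up for illustration, and it takes one word rather than the joined list of unique col1 values; it is meant as a way to inspect the intermediate strings, not as a replacement for the pandas code.

```python
import re

def trace_cleanup(text, word):
    """Apply the three replacements one at a time and return every intermediate string."""
    # Step 1: remove the word, then trim whitespace at both ends.
    step1 = re.sub(re.escape(word), "", text).strip()
    # Step 2: remove a comma sitting at the very start or very end.
    step2 = re.sub(r"(?:^,|,$)", "", step1)
    # Step 3: collapse each run of leftover commas (with optional space) into one ",".
    step3 = re.sub(r"(?:,\s|,){1,}", ",", step2)
    return step1, step2, step3

print(trace_cleanup("Dog, Mouse", "Dog"))
print(trace_cleanup("Cat, Dog, Mouse", "Dog"))
```

On these inputs the final strings are `" Mouse"` and `"Cat,Mouse"`. The leading space survives in the first case because the strip runs before the comma is removed, so a final `.str.strip()` may be worth adding if that matters for your data.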
IIUC:
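import re
# The pattern is built per row from col1, so nothing is hard-coded to "Dog".
# re.escape keeps regex metacharacters in the word from being interpreted.
df['col2'] = [re.sub(fr"{re.escape(word)}[\s,]*", "", sentence)
              for word, sentence in zip(df.col1, df.col2)]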
df
col1 col2
0 Dog Cat, Mouse
1 Dog
2 Dog Cat
3 Dog Mouse
Another DataFrame, this time with Dog in the middle:
df = pd.DataFrame({"col1": ["Dog", "Dog", "Dog", "Dog","Dog"],
"col2": ["Cat, Mouse", "Dog", "Cat", "Dog, Mouse", "Cat, Dog, Mouse"]})
df
col1 col2
0 Dog Cat, Mouse
1 Dog Dog
2 Dog Cat
3 Dog Dog, Mouse
4 Dog Cat, Dog, Mouse
Applying the code above:
col1 col2
0 Dog Cat, Mouse
1 Dog
2 Dog Cat
3 Dog Mouse
4 Dog Cat, Mouse
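If you would rather avoid regex altogether, one possible alternative is to split each string on commas and drop the items equal to the col1 word. This is only a sketch: `drop_word` is a hypothetical helper name, and it assumes items are separated by commas and should match the word exactly (case-sensitive, whole item).

```python
def drop_word(word, sentence):
    """Remove every item equal to `word` from a comma-separated string."""
    # Split on commas and trim the whitespace around each item.
    items = [item.strip() for item in sentence.split(",")]
    # Keep non-empty items that differ from the word.
    kept = [item for item in items if item and item != word]
    return ", ".join(kept)

rows = [("Dog", "Cat, Mouse"), ("Dog", "Dog"), ("Dog", "Cat"),
        ("Dog", "Dog, Mouse"), ("Dog", "Cat, Dog, Mouse"), ("Dog", "Cat, Dog")]
print([drop_word(w, s) for w, s in rows])
# ['Cat, Mouse', '', 'Cat', 'Mouse', 'Cat, Mouse', 'Cat']
```

With the DataFrame it can be applied the same way as the list comprehension above: `df['col2'] = [drop_word(w, s) for w, s in zip(df.col1, df.col2)]`. Because it compares whole items, a value such as `Dogma` is left alone, and no stray comma remains when the word is the last item.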