*I am reposting this question because I left out some important details in my previous one.
I have a DataFrame like the one shown below:
Email-adress Body
abcd@gmail Hi, I am xxxx. ======= ABCD corporation Chris =======
asdff@gmail Thank you for the information. Bruh bruh. ------CDDD inc name-----
From the Body column of this DataFrame, I want to remove all signatures, so that the output looks like this:
output
Email-adress Body
abcd@gmail Hi, I am xxxx.
asdff@gmail Thank you for the information. Bruh bruh.
I was given the following code sample:
stri = stri.split()
for i in range(len(stri)):
    if ('====' in stri[i]) or ('----' in stri[i]):
        stri = stri[:i]
        break
print(' '.join(stri))
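However, this works by assigning the body text to "stri", and it only handles a single string. I would like to know how to apply it to every row, so that the signature is removed from each one.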
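One way to reuse the loop above for every row is to wrap it in a function and pass that function to `Series.apply`. This is only a sketch: the function name `strip_signature` is made up for illustration, and the DataFrame is rebuilt from the two sample rows in the question.

```python
import pandas as pd

def strip_signature(text):
    """Keep the words that come before the first token containing '====' or '----'."""
    kept = []
    for token in text.split():
        if '====' in token or '----' in token:
            break  # everything from here on is treated as the signature
        kept.append(token)
    return ' '.join(kept)

# Sample data copied from the question
df = pd.DataFrame({
    'Email-adress': ['abcd@gmail', 'asdff@gmail'],
    'Body': [
        'Hi, I am xxxx. ======= ABCD corporation Chris =======',
        'Thank you for the information. Bruh bruh. ------CDDD inc name-----',
    ],
})

# apply() calls strip_signature once per value in the Body column
df['Body'] = df['Body'].apply(strip_signature)
print(df)
```

Note that splitting and re-joining on whitespace also collapses any repeated spaces or line breaks in the body, which may or may not be what you want.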
Use a regular expression:
import re

def remove_sign(row):
    # removes a span that starts and ends with '=' (or starts and ends with '-')
    return re.sub(r"=.*=|-.*-", "", row)

df['Body'] = df['Body'].apply(remove_sign)
df:
Email-adress Body
0 abcd@gmail Hi, I am xxxx.
1 asdff@gmail Thank you for the information. Bruh bruh.
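One caveat with the pattern above: it matches on single `=` or `-` characters, so a body containing ordinary hyphenated words (or only one delimiter run) may be cut in the wrong place or left with trailing spaces. A stricter variant, sketched below, assumes a signature always begins with a run of at least four `=` or `-` characters, which mirrors the `'===='` / `'----'` check in the question. The name `remove_sign_strict` and the second test string are made up for illustration.

```python
import re

# Assumption: the signature starts at the first run of 4+ '=' or 4+ '-'.
# \s* also swallows the whitespace just before the delimiter;
# re.DOTALL lets '.' cross line breaks so multi-line signatures are removed too.
SIGNATURE = re.compile(r"\s*(?:={4,}|-{4,}).*", flags=re.DOTALL)

def remove_sign_strict(text):
    """Remove everything from the first long delimiter run to the end of the text."""
    return SIGNATURE.sub("", text)

print(remove_sign_strict("Hi, I am xxxx. ======= ABCD corporation Chris ======="))
print(remove_sign_strict("Re-send the e-mail, please. ------CDDD inc name-----"))
```

It can be applied to the column in the same way as before, with `df['Body'].apply(remove_sign_strict)`.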
Try this:
# captures the leading run of word characters (\w), whitespace, commas and periods
df.Body.str.extract(r'([\w,.\s]+)')
0
0 Hi, I am xxxx.
1 Thank you for the information. Bruh bruh.
OR:
# capture the characters that come before the first = or -
# the ? after the + sign makes the match non-greedy (as short as possible)
df.Body.str.extract(r'(.+?(?=[=-]))')
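Both `extract` calls above return a new object and leave `df` unchanged, and the lookahead version keeps the space in front of the delimiter. A possible way to write the result back into the column is sketched here, with the DataFrame rebuilt from the sample rows in the question. The `fillna` step is an addition of mine: `str.extract` yields NaN for rows where the pattern does not match, so it falls back to the original text for those rows.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    'Email-adress': ['abcd@gmail', 'asdff@gmail'],
    'Body': [
        'Hi, I am xxxx. ======= ABCD corporation Chris =======',
        'Thank you for the information. Bruh bruh. ------CDDD inc name-----',
    ],
})

# expand=False returns a Series (not a one-column DataFrame) for a single group
body = df['Body'].str.extract(r'(.+?(?=[=-]))', expand=False)

# rows without any '=' or '-' come back as NaN; keep their original text
body = body.fillna(df['Body'])

# drop the whitespace left in front of the delimiter
df['Body'] = body.str.strip()
print(df)
```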