I have two data frames: df1 is a list of mailboxes and email IDs, and df2 is a list of approved domains.
I read both data frames from an Excel sheet:
xls = pd.ExcelFile(input_file_shared_mailbox)
df = pd.read_excel(xls, sheet_name = sheet_name_shared_mailbox)
I want to keep only the records in df1 where df1['Email_Id'] contains one of the domains in df2['approved_domain'].
print(df1)
Mailbox Email_Id
0 mailbox1 [email protected]
1 mailbox2 [email protected]
2 mailbox3 [email protected]
print(df2)
approved_domain
0 msn.com
1 gmail.com
and I want df3 to show basically this:
print (df3)
Mailbox Email_Id
0 mailbox1 [email protected]
1 mailbox3 [email protected]
This is the code I have right now. I think it is close, but I can't figure out exactly what is wrong with the syntax:
df3 = df1[df1['Email_Id'].apply(lambda x: [item for item in x if item in df2['Approved_Domains'].tolist()])]
But I get this error:
TypeError: unhashable type: 'list'
I have spent a lot of time searching forums for a solution, but I couldn't find what I was looking for. Any help is appreciated.
df1 = {'MailBox': ['mailbox1', 'mailbox2', 'mailbox3'], 'Email_Id': ['[email protected]', '[email protected]', '[email protected]']}
df2 = {'approved_domain':['msn.com', 'gmail.com']}
mailboxes, emails = zip(  # unzip the columns
    *filter(  # keep only the rows with an approved domain
        lambda i: any([  # i = ('mailbox1', '[email protected]')
            approved_domain in i[1] for approved_domain in df2['approved_domain']
        ]),
        zip(df1['MailBox'], df1['Email_Id'])  # zip the columns
    )
)
df3 = {
    'MailBox': mailboxes,
    'Email_Id': emails
}
print(df3)
Output:
> {'MailBox': ('mailbox1', 'mailbox3'), 'Email_Id': ('[email protected]', '[email protected]')}
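The same substring test can also be applied to real pandas DataFrames rather than plain dicts. The sketch below is only one way to do it: the sample addresses are made up (the originals are redacted), and it assumes the column names from the question. The key point is that the lambda returns a single True/False per row, which is what boolean indexing expects; the lambda in the question returns a list per row instead.

```python
import pandas as pd

# Hypothetical sample addresses; the real ones are redacted in the post.
df1 = pd.DataFrame({
    'MailBox': ['mailbox1', 'mailbox2', 'mailbox3'],
    'Email_Id': ['alice@msn.com', 'bob@yahoo.com', 'carol@gmail.com'],
})
df2 = pd.DataFrame({'approved_domain': ['msn.com', 'gmail.com']})

approved = df2['approved_domain'].tolist()

# Build a boolean mask: True when any approved domain occurs in the address.
mask = df1['Email_Id'].apply(
    lambda email: any(domain in email for domain in approved)
)

# Boolean indexing keeps only the rows where the mask is True.
df3 = df1[mask]
print(df3)
```

Note that a plain substring test would also accept an address such as someone@notmsn.com; comparing the part after the '@' is stricter.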
A few pointers:
1. Split your email_address column into two separate columns
df1[['add', 'domain']] = df1['email_address'].str.split('@', n=1, expand=True)
2. Then drop the 'add' column to keep the data frame clean
df1 = df1.drop('add',axis =1)
3. Get a new data frame with only the desired rows by selecting the values in the 'domain' column that match a value in the 'approved_domain' column
df_new = df1[df1['domain'].isin(df2['approved_domain'])]
4. Drop the 'domain' column from df_new
df_new = df_new.drop('domain',axis = 1)
This will be the result:
    mailbox email_address
1  mailbox2 [email protected]
2  mailbox3 [email protected]
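The steps above can be sketched end to end as follows. This is a minimal sketch, not a definitive version: the addresses are made up because the originals are redacted, and the column names mailbox and email_address follow this answer's naming rather than the question's. It keeps the rows whose domain is in the approved list, which is what the question asks for; prefix the mask with ~ to get the complement instead.

```python
import pandas as pd

# Hypothetical addresses; the real ones are redacted in the post.
df1 = pd.DataFrame({
    'mailbox': ['mailbox1', 'mailbox2', 'mailbox3'],
    'email_address': ['alice@msn.com', 'bob@yahoo.com', 'carol@gmail.com'],
})
df2 = pd.DataFrame({'approved_domain': ['msn.com', 'gmail.com']})

# Split once on '@' and keep only the part after it (the domain).
domain = df1['email_address'].str.split('@', n=1, expand=True)[1]

# Keep the rows whose domain exactly matches an approved domain.
df_new = df1[domain.isin(df2['approved_domain'])]
print(df_new)
```

Holding the domain in a separate Series means no helper columns are added to df1, so there is nothing to drop afterwards.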