I have two DataFrames, shown below:
import pandas as pd
df1 = pd.DataFrame(
{
"Server": ["Server1", "Server1","Server1","Server1","Server1"],
"FileName": [
"2020-05-01T18:18:00Z/Server1/file1",
"2020-05-01T18:18:13Z/Server1/file2",
"2020-05-01T18:20:47Z/Server1/file3",
"2020-05-01T18:21:46Z/Server1/file4",
"2020-05-01T18:24:43Z/Server1/file5",
],
}
)
df2 = pd.DataFrame(
{
"Server": ["Server1", "Server1","Server1","Server1","Server1"],
"FileName": [
"2020-05-01T18:18:00Z/Server1/file1",
"2020-05-01T18:18:13Z/Server1/file2",
"2020-05-01T18:20:47Z/Server1/file3",
"2020-05-01T18:33:08Z/Server1/file6",
"2020-05-01T18:33:11Z/Server1/file7",
],
}
)
df1:
FileName Server
0 2020-05-01T18:18:00Z/Server1/file1 Server1
1 2020-05-01T18:18:13Z/Server1/file2 Server1
2 2020-05-01T18:20:47Z/Server1/file3 Server1
3 2020-05-01T18:21:46Z/Server1/file4 Server1
4 2020-05-01T18:24:43Z/Server1/file5 Server1
df2:
FileName Server
0 2020-05-01T18:18:00Z/Server1/file1 Server1
1 2020-05-01T18:18:13Z/Server1/file2 Server1
2 2020-05-01T18:20:47Z/Server1/file3 Server1
3 2020-05-01T18:33:08Z/Server1/file6 Server1
4 2020-05-01T18:33:11Z/Server1/file7 Server1
I want the files from df1 that are not in df2. The Server column is irrelevant here. I want the following DataFrame:
FileName Server
0 2020-05-01T18:21:46Z/Server1/file4 Server1
1 2020-05-01T18:24:43Z/Server1/file5 Server1
I have achieved this by iterating over every value. Is there a simpler way to do this?
rows = []
for index1, row1 in df1.iterrows():
    flag = 0
    # Scan df2 for a row with the same FileName
    for index2, row2 in df2.iterrows():
        if row1['FileName'] == row2['FileName']:
            flag = 1
    if flag == 0:
        # DataFrame.append was removed in pandas 2.0, so collect rows in a list instead
        rows.append({'Server': row1['Server'], 'FileName': row1['FileName']})
df = pd.DataFrame(rows)
print(df)
I'm not sure how efficient this is, but you can use this one-liner instead of looping over the DataFrames.
result = pd.DataFrame(df1.merge(df2, how = 'outer' ,indicator=True).loc[lambda x : x['_merge']=='left_only'])
del result["_merge"] #You can keep this _merge column
print(result)
Server FileName
3 Server1 2020-05-01T18:21:46Z/Server1/file4
4 Server1 2020-05-01T18:24:43Z/Server1/file5
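Note that merge joins on every column the two frames share, so the one-liner above matches on both Server and FileName. Since the question says Server is irrelevant, a variant that joins on FileName alone may be closer to the intent. The following is only a sketch on made-up data (the short file names and the Server2 value are illustrative assumptions, not from the question):

```python
import pandas as pd

# Illustrative data: the Server values differ, so a merge on all shared
# columns would wrongly report every df1 row as "left_only".
df1 = pd.DataFrame({"Server": ["Server1"] * 3, "FileName": ["a", "b", "c"]})
df2 = pd.DataFrame({"Server": ["Server2"] * 2, "FileName": ["a", "x"]})

# Left anti-join on FileName only: keep df1 rows with no match in df2.
# Passing just the key column of df2 avoids Server_x / Server_y suffixes,
# and drop_duplicates() prevents df1 rows from being repeated.
merged = df1.merge(
    df2[["FileName"]].drop_duplicates(),
    on="FileName",
    how="left",
    indicator=True,
)
only_in_df1 = merged.loc[merged["_merge"] == "left_only"].drop(columns="_merge")
print(only_in_df1)
```

A left merge is enough here because rows that exist only in df2 are discarded anyway.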
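This works for the sample data, but only because df1 and df2 have the same length and index and their shared rows sit at the same positions; it compares row by row rather than testing membership: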
df1[df1['FileName'] != df2['FileName']].reset_index(drop=True)
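To see why the element-wise comparison is fragile, here is a small sketch with made-up values (the Series a and b are illustrative, not from the question). When the same names appear in a different order, the positional comparison reports them as missing, while a membership test does not:

```python
import pandas as pd

# Illustrative data: "f1" and "f2" exist in both Series, but in swapped order.
a = pd.Series(["f1", "f2", "f3"])
b = pd.Series(["f2", "f1", "f9"])

# Positional comparison: row i of a is compared only with row i of b.
positional = a[a != b].tolist()

# Membership test: each value of a is checked against all values of b.
membership = a[~a.isin(b)].tolist()

print(positional)  # ['f1', 'f2', 'f3']
print(membership)  # ['f3']
```

In addition, comparing two Series whose indexes are not identically labeled (for example, frames of different lengths) raises a ValueError instead of returning a result.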
You can use the isin method:
df1[~df1['FileName'].isin(df2['FileName'])]
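For completeness, here is the isin approach run end to end on the data from the question. The reset_index call is an optional addition (not part of the answer above) so that the index matches the 0, 1 numbering in the desired output:

```python
import pandas as pd

df1 = pd.DataFrame({
    "Server": ["Server1"] * 5,
    "FileName": [
        "2020-05-01T18:18:00Z/Server1/file1",
        "2020-05-01T18:18:13Z/Server1/file2",
        "2020-05-01T18:20:47Z/Server1/file3",
        "2020-05-01T18:21:46Z/Server1/file4",
        "2020-05-01T18:24:43Z/Server1/file5",
    ],
})
df2 = pd.DataFrame({
    "Server": ["Server1"] * 5,
    "FileName": [
        "2020-05-01T18:18:00Z/Server1/file1",
        "2020-05-01T18:18:13Z/Server1/file2",
        "2020-05-01T18:20:47Z/Server1/file3",
        "2020-05-01T18:33:08Z/Server1/file6",
        "2020-05-01T18:33:11Z/Server1/file7",
    ],
})

# Boolean mask: True where the df1 file name also occurs anywhere in df2.
in_df2 = df1["FileName"].isin(df2["FileName"])

# Keep the rows that are NOT in df2 and renumber the index from 0.
result = df1[~in_df2].reset_index(drop=True)
print(result)
```

Unlike the nested loops, this does not depend on row order or on the two frames having the same length.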