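I have a large number of rows that each contain a list of integer values, and I only want to keep the rows whose list contains the value 1. So in the example below, I would keep the second and third rows but drop the first: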
0 100033364389 [10, 11, 12, 2, 5, 7, 8, 9]
1 100036364396 [10, 11, 12, 1, 2, 3, 4, 5, 6, 7, 8, 9]
2 100077364447 [10, 11, 12, 1, 2, 3, 4, 5, 6, 7, 8, 9]
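But all I get is an empty table with only the column names.
When I set the condition with == it works perfectly, but with != it does not.
Do you have any ideas about this?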
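Assuming the lists are in a month column, use a simple list comprehension with boolean indexing, which should be among the fastest options: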
out = df[[1 in l for l in df['month']]]
Output:
number month
1 100036364396 [10, 11, 12, 1, 2, 3, 4, 5, 6, 7, 8, 9]
2 100077364447 [10, 11, 12, 1, 2, 3, 4, 5, 6, 7, 8, 9]
Input used:
import pandas as pd

df = pd.DataFrame({'number': [100033364389, 100036364396, 100077364447],
                   'month': [[10, 11, 12, 2, 5, 7, 8, 9],
                             [10, 11, 12, 1, 2, 3, 4, 5, 6, 7, 8, 9],
                             [10, 11, 12, 1, 2, 3, 4, 5, 6, 7, 8, 9]]})
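To make the individual steps explicit, here is a minimal sketch on the same sample data. The names `mask`, `with_one` and `without_one` are only illustrative. The inverted filter at the end is a guess at what the `!=` attempt in the question was aiming for, since that code is not shown.

```python
import pandas as pd

df = pd.DataFrame({'number': [100033364389, 100036364396, 100077364447],
                   'month': [[10, 11, 12, 2, 5, 7, 8, 9],
                             [10, 11, 12, 1, 2, 3, 4, 5, 6, 7, 8, 9],
                             [10, 11, 12, 1, 2, 3, 4, 5, 6, 7, 8, 9]]})

# One boolean per row: True when the list in 'month' contains 1.
# Wrapping it in a Series aligned on df.index lets us negate it with ~.
mask = pd.Series([1 in l for l in df['month']], index=df.index)

with_one = df[mask]      # rows whose list contains 1
without_one = df[~mask]  # rows whose list does not contain 1
```

Comparing the column directly with `==` or `!=` compares each whole cell against the scalar rather than testing membership, so a per-row `in` check like the one above is probably what is needed here.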
Use boolean indexing:
df1 = df[[1 in x for x in df['month']]]
Or:
df1 = df[df['month'].apply(lambda x: 1 in x)]
print (df1)
number month
1 100036364396 [10, 11, 12, 1, 2, 3, 4, 5, 6, 7, 8, 9]
2 100077364447 [10, 11, 12, 1, 2, 3, 4, 5, 6, 7, 8, 9]
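If the column might also hold missing values (an assumption, the question does not say so), a plain `1 in x` would raise a `TypeError` on a NaN cell. A hedged sketch of a small wrapper that guards against that; `rows_containing` is a hypothetical helper name, not a pandas function, and the tiny frame below is made-up illustration data:

```python
import numpy as np
import pandas as pd

def rows_containing(df, column, value):
    """Return the rows of df whose list in `column` contains `value`."""
    # Cells that are not list-like (e.g. NaN) are treated as "no match"
    mask = [isinstance(x, (list, tuple, set)) and value in x
            for x in df[column]]
    return df[mask]

# Illustration data: the last row has a missing 'month' value
demo = pd.DataFrame({'number': [1, 2, 3],
                     'month': [[2, 5], [1, 2], np.nan]})
out = rows_containing(demo, 'month', 1)
```

On clean data such as the sample in the question, this should select the same rows as the one-liners above.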
Performance on 300k rows built from the sample data:
#[300000 rows x 2 columns]
df = pd.concat([df] * 100000, ignore_index=True)
print (df)
%timeit df[[1 in x for x in df['month']]]
150 ms ± 9.81 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit df[df['month'].apply(lambda x: 1 in x)]
100 ms ± 9.46 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
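`%timeit` only works in IPython or Jupyter. A rough sketch of the same comparison with the standard `timeit` module is below; the repetition count of 3 and the variable names are arbitrary choices, and absolute numbers will depend on the machine and the pandas version, so treat any result as indicative only.

```python
import timeit
import pandas as pd

small = pd.DataFrame({'number': [100033364389, 100036364396, 100077364447],
                      'month': [[10, 11, 12, 2, 5, 7, 8, 9],
                                [10, 11, 12, 1, 2, 3, 4, 5, 6, 7, 8, 9],
                                [10, 11, 12, 1, 2, 3, 4, 5, 6, 7, 8, 9]]})

# Repeat the 3 sample rows 100000 times to get 300000 rows
big = pd.concat([small] * 100000, ignore_index=True)

# Total seconds for 3 runs of each variant
t_comp = timeit.timeit(lambda: big[[1 in x for x in big['month']]], number=3)
t_apply = timeit.timeit(lambda: big[big['month'].apply(lambda x: 1 in x)], number=3)

print(f"list comprehension: {t_comp / 3 * 1000:.1f} ms per run")
print(f"apply:              {t_apply / 3 * 1000:.1f} ms per run")
```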