How can I achieve the equivalents of SQL's IN and NOT IN?
I have a list with the required values. Here's the scenario:
import pandas as pd
df = pd.DataFrame({'countries':['US','UK','Germany','China']})
countries = ['UK','China']
# pseudo-code:
df[df['countries'] not in countries]
My current way of doing this is as follows:
df = pd.DataFrame({'countries':['US','UK','Germany','China']})
countries = pd.DataFrame({'countries':['UK','China'], 'matched':True})
# IN
df.merge(countries,how='inner',on='countries')
# NOT IN
not_in = df.merge(countries,how='left',on='countries')
not_in = not_in[pd.isnull(not_in['matched'])]
But this seems like a horrible kludge. Can anyone improve on it?
You can use pd.Series.isin.
For "IN", use:
something.isin(somewhere)
Or for "NOT IN":
~something.isin(somewhere)
As a worked example:
>>> df
  countries
0        US
1        UK
2   Germany
3     China
>>> countries
['UK', 'China']
>>> df.countries.isin(countries)
0    False
1     True
2    False
3     True
Name: countries, dtype: bool
>>> df[df.countries.isin(countries)]
  countries
1        UK
3     China
>>> df[~df.countries.isin(countries)]
  countries
0        US
2   Germany
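For anyone who prefers a script to a REPL session, the same idea can be sketched as below. The variable names (in_mask, in_rows, not_in_rows, not_in_and_short) are only illustrative, and the extra length condition is an assumption added to show how the mask combines with other filters.

```python
import pandas as pd

df = pd.DataFrame({'countries': ['US', 'UK', 'Germany', 'China']})
countries = ['UK', 'China']

# Boolean mask: True where the value appears in the list (SQL IN)
in_mask = df['countries'].isin(countries)

in_rows = df[in_mask]        # SQL: WHERE countries IN ('UK', 'China')
not_in_rows = df[~in_mask]   # SQL: WHERE countries NOT IN ('UK', 'China')

# Combining with another condition: the comparison needs parentheses
# because & binds more tightly than <=.
not_in_and_short = df[~in_mask & (df['countries'].str.len() <= 2)]

print(in_rows)
print(not_in_rows)
print(not_in_and_short)
```

Keeping the mask in its own variable makes it easy to reuse for both the IN and the NOT IN case.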
Alternative solution that uses the .query() method:
In [5]: df.query("countries in @countries")
Out[5]:
  countries
1        UK
3     China
In [6]: df.query("countries not in @countries")
Out[6]:
  countries
0        US
2   Germany
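As a rough, self-contained sketch of the query approach, the snippet below also combines a membership test with a second condition. The score column and its values are made up purely for illustration; they are not part of the original example.

```python
import pandas as pd

df = pd.DataFrame({
    'countries': ['US', 'UK', 'Germany', 'China'],
    'score': [1, 2, 3, 4],   # hypothetical column, for illustration only
})
countries = ['UK', 'China']

# @countries refers to the local Python variable, while the bare name
# `countries` refers to the DataFrame column.
in_rows = df.query("countries in @countries")
not_in_rows = df.query("countries not in @countries")

# Membership tests can be combined with other conditions using `and` / `or`
combined = df.query("countries not in @countries and score > 1")

print(in_rows)
print(not_in_rows)
print(combined)
```

The @ prefix is what lets the column and the local list share the same name without clashing.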
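How to implement 'in' and 'not in' for a pandas DataFrame?
pandas offers two methods: Series.isin and DataFrame.isin, for Series and DataFrames respectively.
The most common scenario is applying an isin condition on a specific column to filter rows in a DataFrame.
import numpy as np
import pandas as pd
df = pd.DataFrame({'countries': ['US', 'UK', 'Germany', np.nan, 'China']})
df
  countries
0        US
1        UK
2   Germany
3       NaN
4     China
c1 = ['UK', 'China']             # list
c2 = {'Germany'}                 # set
c3 = pd.Series(['China', 'US'])  # Series
c4 = np.array(['US', 'UK'])      # array
df['countries'].isin(c1)
0    False
1     True
2    False
3    False
4     True
Name: countries, dtype: bool
# `in` operation
df[df['countries'].isin(c1)]
  countries
1        UK
4     China
# `not in` operation
df[~df['countries'].isin(c1)]
  countries
0        US
2   Germany
3       NaN
Series.isin accepts various types of input. The following are all valid ways of getting what you want:
# Filter with `set` (tuples work too)
df[df['countries'].isin(c2)]
  countries
2   Germany
# Filter with another Series
df[df['countries'].isin(c3)]
  countries
0        US
4     China
# Filter with array
df[df['countries'].isin(c4)]
  countries
0        US
1        UK
Filtering on many columns
Sometimes you will want to apply an 'in' membership check with some search terms over multiple columns:
df2 = pd.DataFrame({
    'A': ['x', 'y', 'z', 'q'], 'B': ['w', 'a', np.nan, 'x'], 'C': np.arange(4)})
df2
   A    B  C
0  x    w  0
1  y    a  1
2  z  NaN  2
3  q    x  3
c1 = ['x', 'w', 'p']
To apply the isin condition to both columns "A" and "B", use DataFrame.isin:
df2[['A', 'B']].isin(c1)
       A      B
0   True   True
1  False  False
2  False  False
3  False   True
From this, to retain rows where at least one column is True, we can use any along axis=1 (that is, across the columns of each row):
df2[['A', 'B']].isin(c1).any(axis=1)
0     True
1    False
2    False
3     True
dtype: bool
df2[df2[['A', 'B']].isin(c1).any(axis=1)]
   A  B  C
0  x  w  0
3  q  x  3
Note that if you want to search every column, you can just omit the column selection step and do
df2.isin(c1).any(axis=1)
Similarly, to retain rows where all columns are True, use all in the same manner as before:
df2[df2[['A', 'B']].isin(c1).all(axis=1)]
   A  B  C
0  x  w  0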
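When each column should be checked against its own set of values, DataFrame.isin also accepts a dict keyed by column name. The sketch below assumes a made-up dict called allowed; the search terms in it are hypothetical and chosen only to make the result easy to follow.

```python
import numpy as np
import pandas as pd

df2 = pd.DataFrame({
    'A': ['x', 'y', 'z', 'q'],
    'B': ['w', 'a', np.nan, 'x'],
    'C': np.arange(4),
})

# Hypothetical per-column search terms: keys are column names,
# values are the values to look for in that column.
allowed = {'A': ['x', 'q'], 'B': ['w', 'a']}

# Columns that are not named in the dict come back as all False
mask = df2.isin(allowed)

both = df2[mask[['A', 'B']].all(axis=1)]    # A matches AND B matches
either = df2[mask[['A', 'B']].any(axis=1)]  # A matches OR B matches

print(mask)
print(both)
print(either)
```

Selecting mask[['A', 'B']] before calling all matters here, because the unlisted column C would otherwise make every row fail the all check.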
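Notable mentions: numpy.isin, query, list comprehensions (string data)
In addition to the methods described above, you can also use the NumPy equivalent, numpy.isin:
c1 = ['UK', 'China']  # back to the country list used earlier
# `in` operation
df[np.isin(df['countries'], c1)]
  countries
1        UK
4     China
# `not in` operation
df[np.isin(df['countries'], c1, invert=True)]
  countries
0        US
2   Germany
3       NaN
Why is it worth considering? NumPy functions are usually a bit faster than their pandas equivalents because of lower overhead. Since this is an elementwise operation that does not depend on index alignment, there are very few situations where this method is not an appropriate replacement for pandas' isin.
pandas routines are usually iterative when working with strings, because string operations are hard to vectorise. There is a lot of evidence to suggest that list comprehensions will be faster here. In that case we resort to a plain `in` check inside a list comprehension.
It is a lot more unwieldy to specify, however, so don't use it unless you know what you're doing.
Lastly, there is also DataFrame.query, which is covered in another answer here. numexpr FTW!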
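The list-comprehension variant mentioned in this answer could look roughly like the sketch below. The helper names (c1_set, pandas_mask, numpy_mask, list_mask) are illustrative, and converting the list to a set first is an assumption made here to keep each lookup cheap. No timing claims are made; the snippet only checks that the three approaches select the same rows.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'countries': ['US', 'UK', 'Germany', np.nan, 'China']})
c1 = ['UK', 'China']

# Build a set once so that each `in` check is a hash lookup
c1_set = set(c1)

# A list of booleans with one entry per row works as a filter mask
in_rows = df[[x in c1_set for x in df['countries']]]
not_in_rows = df[[x not in c1_set for x in df['countries']]]

# Sanity check: pandas, NumPy and the list comprehension agree
pandas_mask = df['countries'].isin(c1).to_numpy()
numpy_mask = np.isin(df['countries'].to_numpy(dtype=object), c1)
list_mask = np.array([x in c1_set for x in df['countries']])

print(in_rows)
print(not_in_rows)
print((pandas_mask == numpy_mask).all(), (pandas_mask == list_mask).all())
```

Note that the NaN row ends up in the "not in" result with all three approaches, so drop missing values first if that is not wanted.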
# Use ~ to negate a boolean mask (NumPy rejects unary minus on booleans)
df[~df["A"].isin([3, 6])]