Find (only) the first row satisfying a given condition in a pandas DataFrame

Question

I have a dataframe df with a very long column of random positive integers:

df = pd.DataFrame({'n': np.random.randint(1, 10, size = 10000)})

I want to determine the index of the first even number in the column. One way to do this is:

df[df.n % 2 == 0].iloc[0]

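But this involves a lot of operations (generating the boolean mask df.n % 2 == 0, evaluating df on that mask, and finally taking the first item) and is very slow. A loop like this is much quicker: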

for j in range(len(df)):
    if df.n.iloc[j] % 2 == 0:
        break

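partly because the first result will probably be in the first few rows. Is there any pandas method that does this with similar performance? Thank you.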

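NOTE: This condition (being an even number) is just an example. I'm looking for a solution that works for any kind of condition on the values, i.e., a fast one-line alternative to: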

df[ conditions on df.n ].iloc[0]

Answer 1

I did some timings, and yes, using a generator will normally give you quicker results:

df = pd.DataFrame({'n': np.random.randint(1, 10, size = 10000)})

%timeit df[df.n % 2 == 0].iloc[0]
%timeit df.iloc[next(k for k,v in df.iterrows() if v.n % 2 == 0)]
%timeit df.iloc[next(t[0] for t in df.itertuples() if t.n % 2 == 0)]

I get:

1000 loops, best of 3: 1.09 ms per loop
1000 loops, best of 3: 619 µs per loop # <-- iterrows generator
1000 loops, best of 3: 1.1 ms per loop
10000 loops, best of 3: 25 µs per loop # <--- your solution

But when you increase the size:

df = pd.DataFrame({'n': np.random.randint(1, 10, size = 1000000)})

the difference disappears:

10 loops, best of 3: 40.5 ms per loop 
10 loops, best of 3: 40.7 ms per loop # <--- iterrows
10 loops, best of 3: 56.9 ms per loop

Your solution is the fastest, so why not use it?

for j in range(len(df)):
    if df.n.iloc[j] % 2 == 0:
        break
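
One caveat with the generator versions above: next() raises StopIteration when no row matches. A minimal sketch, assuming the column is called n as in the question, passes a default to next() so the no-match case returns None instead. The helper name first_even_position is made up for illustration, and it returns a position (for use with iloc) rather than an index label.

```python
import pandas as pd

def first_even_position(df):
    """Return the position of the first even value in column 'n', or None."""
    # itertuples() yields rows lazily, so the scan stops at the first match;
    # the second argument to next() is returned when nothing matches.
    return next(
        (pos for pos, t in enumerate(df.itertuples()) if t.n % 2 == 0),
        None,
    )

df = pd.DataFrame({'n': [3, 5, 8, 1, 4]})
pos = first_even_position(df)                                    # row holding 8
no_match = first_even_position(pd.DataFrame({'n': [1, 3, 5]}))   # nothing even
```

With a position in hand, df.iloc[pos] gives the row, and the caller can test for None before indexing.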

Answer 2

I decided, for fun, to play with a few possibilities. I take a dataframe:

MAX = 10**7
df = pd.DataFrame({'n': range(MAX)})

(not random this time.) I want to find the first row for which n >= N for some value of N. I timed the following four versions:

def getfirst_pandas(condition, df):
    return df[condition(df)].iloc[0]

def getfirst_iterrows_loop(condition, df):
    for index, row in df.iterrows():
        if condition(row):
            return index, row
    return None

def getfirst_for_loop(condition, df):
    for j in range(len(df)):
        if condition(df.iloc[j]):
            return j
    return None  # no row satisfied the condition

def getfirst_numpy_argmax(condition, df):
    array = df.to_numpy()  # as_matrix() was removed in pandas 1.0
    imax  = np.argmax(condition(array))
    return df.index[imax]

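with N = powers of ten. Of course the numpy (optimized C) code is expected to be faster than for loops in Python, but I wanted to see for which values of N the Python loops are still okay.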

I timed the lines:

getfirst_pandas(lambda x: x.n >= N, df)
getfirst_iterrows_loop(lambda x: x.n >= N, df)
getfirst_for_loop(lambda x: x.n >= N, df)
getfirst_numpy_argmax(lambda x: x >= N, df.n)

for N = 1, 10, 100, 1000, .... This is a log-log plot of the performance:

[Figure: log-log plot of the performance]

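The simple for loop is fine as long as the "first True position" is expected to be near the beginning, but it gets bad after that. np.argmax is the safest solution.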
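
One thing to watch with np.argmax: on an all-False mask it returns 0, which looks exactly like a match in the first row. A minimal sketch of a guard, where first_true_label is a hypothetical helper name and not part of numpy or pandas:

```python
import numpy as np
import pandas as pd

def first_true_label(mask, index):
    """Return the index label of the first True in mask, or None if there is none."""
    mask = np.asarray(mask, dtype=bool)
    if mask.size == 0:              # argmax raises ValueError on an empty array
        return None
    pos = int(np.argmax(mask))      # argmax of an all-False array is 0
    if not mask[pos]:               # so confirm that position really holds True
        return None
    return index[pos]

df = pd.DataFrame({'n': range(10)})
found = first_true_label(df['n'].to_numpy() >= 7, df.index)
missing = first_true_label(df['n'].to_numpy() >= 100, df.index)
```

The extra check costs one element lookup, so it does not change the timing picture.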

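As you can see from the plot, the times for pandas and argmax remain (almost) constant, because they always scan the whole array. It would be perfect to have a np or pandas method that doesn't.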
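
A possible middle ground is to evaluate the condition block by block and stop at the first block that contains a hit. This is only a sketch of that idea, not a built-in numpy or pandas method; the helper name and the chunk_size default are arbitrary assumptions, and I have not benchmarked it against the four versions above.

```python
import numpy as np
import pandas as pd

def first_true_position_chunked(values, condition, chunk_size=65536):
    """Return the position of the first element satisfying condition, or None."""
    values = np.asarray(values)
    for start in range(0, len(values), chunk_size):
        # The condition is vectorized, but only over one block at a time.
        mask = condition(values[start:start + chunk_size])
        if mask.any():
            # Stop at the first block with a hit; later blocks are never touched.
            return start + int(np.argmax(mask))
    return None

df = pd.DataFrame({'n': range(10**6)})
pos = first_true_position_chunked(df['n'].to_numpy(), lambda x: x >= 1000)
none_found = first_true_position_chunked(df['n'].to_numpy(), lambda x: x < 0)
```

An early match only costs one block of work, while a late or missing match still costs roughly one full vectorized pass.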

Answer 3

Zip the index and the column together, then loop over that for faster looping. zip gives the fastest looping performance, faster than iterrows() or itertuples().

for j in zip(df.index, df.n):
    if j[1] % 2 == 0:
        index_position = j[0]
        break
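
The same zip idea extends to conditions on several columns. A minimal sketch, where first_label_where is a hypothetical helper and the cats/dogs frame is only sample data:

```python
import pandas as pd

def first_label_where(df, columns, predicate):
    """Return the index label of the first row whose values in `columns` satisfy predicate."""
    # zip pairs each index label with that row's values from the chosen columns.
    for label, *values in zip(df.index, *(df[c] for c in columns)):
        if predicate(*values):
            return label
    return None  # no row matched

df = pd.DataFrame({'cats': [1, 2, 3, 4], 'dogs': [2, 4, 6, 8]},
                  index=['Alice', 'Bob', 'Charlie', 'Eve'])
label = first_label_where(df, ['cats', 'dogs'], lambda c, d: c + d >= 7)
```

The predicate receives plain scalar values rather than a row Series, which is where the speed-up over iterrows() comes from.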

Answer 4

One option that lets you iterate over rows and stop as soon as you're satisfied is DataFrame.iterrows, which is pandas' row iterator.

In this case you could implement something like this:

def get_first_row_with(condition, df):
    for index, row in df.iterrows():
        if condition(row):
            return index, row
    return None # Condition not met on any row in entire DataFrame

Then, given a DataFrame, e.g.:

df = pd.DataFrame({
                    'cats': [1,2,3,4], 
                    'dogs': [2,4,6,8]
                  }, 
                  index=['Alice', 'Bob', 'Charlie', 'Eve'])

You could use it as:

def some_condition(row):
    return row.cats + row.dogs >= 7

index, row = get_first_row_with(some_condition, df)

# Use results however you like, e.g.:
print('{} is the first person to have at least 7 pets.'.format(index))
print('They have {} cats and {} dogs!'.format(row.cats, row.dogs))

Which would output:

Charlie is the first person to have at least 7 pets.
They have 3 cats and 6 dogs!
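
Note that the direct unpacking index, row = get_first_row_with(...) fails with a TypeError when no row matches, because the function returns None in that case. A small sketch of handling that, reusing the function and sample frame from this answer (the threshold of 100 pets is just an example chosen so that nothing matches):

```python
import pandas as pd

def get_first_row_with(condition, df):
    for index, row in df.iterrows():
        if condition(row):
            return index, row
    return None  # Condition not met on any row in entire DataFrame

df = pd.DataFrame({'cats': [1, 2, 3, 4], 'dogs': [2, 4, 6, 8]},
                  index=['Alice', 'Bob', 'Charlie', 'Eve'])

# Check for None before unpacking the (index, row) pair.
result = get_first_row_with(lambda row: row.cats + row.dogs >= 100, df)
if result is None:
    message = 'Nobody has that many pets.'
else:
    index, row = result
    message = '{} is the first match.'.format(index)
```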
