Indexed lookup on a pandas DataFrame. Why so slow? How to speed it up? [duplicate]

Question

This question already has an answer here:

What is the performance impact of non-unique indexes in pandas? 2 answers

Suppose I have a pandas Series that I want to use as a multimap (multiple values per index key):

import numpy as np
import pandas as pd

# intval -> data1
a = pd.Series(data=-np.arange(100000),
              index=np.random.randint(0, 50000, 100000))

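I'd like to select (as quickly as possible) all the values from a whose index label also appears in another index b (like an inner join, or a merge, but for Series).

a may have duplicates in its index.
b may not have duplicates, and it is not necessarily a subset of a's index. To give pandas the best possible chance, let's assume b can also be provided as a sorted Index object: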

     b = pd.Index(np.unique(np.random.randint(30000, 100000, 100000))).sort_values()

So we would have something like this:

                      target  
   a        b         result
3  0        3      3  0
3  1        7      8  3 
4  2        8      ...     
8  3      ...
9  4
...

I'm also only interested in the values of the result (the index [3,8,...] is not needed).

If a had no duplicates, we would simply do:

a.reindex(b)  # Cannot reindex a duplicate axis
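To make the failure concrete, here is a minimal sketch on toy data mirroring the table above. The names `a_dup`, `a_unique` and `hits` are mine, not from the question, and the sketch assumes a pandas version in which reindexing a Series with duplicate labels raises `ValueError`.

```python
import pandas as pd

# Toy data mirroring the table above; a_dup carries the label 3 twice.
a_dup = pd.Series([0, 1, 2, 3, 4], index=[3, 3, 4, 8, 9])
a_unique = pd.Series([0, 2, 3, 4], index=[3, 4, 8, 9])
b = pd.Index([3, 7, 8])

# With a unique index, reindex works; labels missing from a (here 7)
# become NaN, so they are dropped to keep only the real matches.
hits = a_unique.reindex(b).dropna()

# With duplicate labels, reindex refuses to run.
try:
    a_dup.reindex(b)
    reindex_failed = False
except ValueError:
    reindex_failed = True

print(hits.to_numpy(), reindex_failed)
```

Note that `reindex` upcasts the values to float because of the intermediate NaN, which is one more reason it is not a drop-in solution here.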

Since & keeps a's duplicates, we can't do this either:

# b is already an Index, so it is used directly (an Index has no .index attribute)
d = a[a.index & b]
d = a.loc[a.index & b]  # same
d = a.get(a.index & b)  # same
print(d.shape)

So I think we need to do something like this:

common = (a.index & b).unique()
a.loc[common]

... which is cumbersome, and also surprisingly slow. It's not building the list of labels to select that is slow:

%timeit (a.index & b).unique()
# 100 loops, best of 3: 3.39 ms per loop
%timeit (a.index & b).unique().sort_values()
# 100 loops, best of 3: 4.19 ms per loop

... so it looks like it really is retrieving the values that is slow:

common = ((a.index & b).unique()).sort_values()

%timeit a.loc[common]
#10 loops, best of 3: 43.3 ms per loop

%timeit a.get(common)
#10 loops, best of 3: 42.1 ms per loop

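... That's around 20 operations per second. Not exactly zippy! Why is it so slow?

Surely there must be a fast way to look up a set of values from a pandas DataFrame? I don't want to get an indexed object out. Really, all I'm asking for is a merge over sorted indexes, or (slower) hashed int lookups. Either way, this should be an extremely fast operation, not something that runs 20 times per second on my 3 GHz machine.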
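One commonly suggested way to sidestep label-based lookup is a boolean membership mask. The following is a sketch under stated assumptions, not a benchmarked answer: it assumes the order of the returned values does not matter, it uses `Index.isin` to build the mask, and it uses `Index.intersection` instead of `&` for the cross-check (pandas 2.x no longer treats `&` on an Index as a set operation). The seed and the names `mask`, `values` and `reference` are arbitrary.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility only

# Same shapes as in the question: a has a non-unique integer index.
a = pd.Series(data=-np.arange(100_000),
              index=rng.integers(0, 50_000, 100_000))
b = pd.Index(np.unique(rng.integers(30_000, 100_000, 100_000)))

# Boolean mask: True where a's index label also occurs in b.
mask = a.index.isin(b)

# Plain NumPy array of the matching values, in a's original row order.
values = a.to_numpy()[mask]

# Cross-check against the label-based lookup from the question.
common = a.index.intersection(b).unique()
reference = a.loc[common].to_numpy()

print(values.shape, reference.shape)
```

The two results contain the same values but possibly in a different order, so compare them after sorting. Timing `a.to_numpy()[mask]` against `a.loc[common]` with `%timeit` on your own pandas version is the only reliable way to see whether this helps.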

Also:

Profiling a.loc[common] gives:

ncalls  tottime  percall  cumtime   percall filename:lineno(function)
# All the time spent here.
40      1.01     0.02525  1.018     0.02546 ~:0(<method 'get_indexer_non_unique' indexing.py:1443(_has_valid_type)
...
# seems to be called a lot.
1500    0.000582 3.88e-07 0.000832  5.547e-07 ~:0(<isinstance>)

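PS. I previously posted a similar question about why Series.map is so slow: Why is pandas.series.map so shockingly slow? The reason there was lazy under-the-hood indexing. That doesn't seem to be what is happening here.

Update:

For a similarly sized a and common, where a's index is unique: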

%timeit a.loc[common]
# 1000 loops, best of 3: 760 µs per loop

... as @jpp points out. The non-unique index is likely to blame.

Answer 1

Repeated index labels are guaranteed to slow down your indexing operations. You can modify your inputs to confirm this for yourself:

a = pd.Series(data=-np.arange(100000), index=np.random.randint(0, 50000, 100000))
%timeit a.loc[common]  # 34.1 ms

a = pd.Series(data=-np.arange(100000), index=np.arange(100000))
%timeit a.loc[common]  # 6.86 ms
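For readers without IPython, the same comparison can be sketched with the standard `timeit` module. This is only an illustration: the names `a_dup`, `a_unique`, `t_dup` and `t_unique` are mine, `Index.intersection` stands in for `&`, and the absolute numbers will differ by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # arbitrary seed
n = 100_000

a_dup = pd.Series(-np.arange(n), index=rng.integers(0, 50_000, n))  # duplicates
a_unique = pd.Series(-np.arange(n), index=np.arange(n))             # unique
b = pd.Index(np.unique(rng.integers(30_000, 100_000, n)))

# Labels present in both a_dup and b; all of them also exist in a_unique.
common = a_dup.index.intersection(b).unique().sort_values()

# Average wall-clock time per lookup over a handful of runs.
runs = 5
t_dup = timeit.timeit(lambda: a_dup.loc[common], number=runs) / runs
t_unique = timeit.timeit(lambda: a_unique.loc[common], number=runs) / runs

print(f"non-unique index: {t_dup * 1e3:.2f} ms per lookup")
print(f"unique index:     {t_unique * 1e3:.2f} ms per lookup")
```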

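As this related question mentions:

When the index is unique, pandas uses a hash table to map keys to values, which is O(1). When the index is non-unique and sorted, pandas uses binary search, which is O(log N). When the index is in random order, pandas has to check every key in the index, which is O(N).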
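Which of these three cases applies can be checked from the index itself. The sketch below uses the `Index.is_unique` and `Index.is_monotonic_increasing` properties and `Series.sort_index`; the variable names are illustrative. Note that the quoted complexities describe single-key lookups, so whether sorting also helps a list lookup such as `a.loc[common]` is something to measure rather than assume.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # arbitrary seed
n = 100_000
labels = rng.integers(0, 50_000, n)

unique_idx = pd.Series(-np.arange(n), index=np.arange(n))  # hash-table case
dup_unsorted = pd.Series(-np.arange(n), index=labels)      # linear-scan case
dup_sorted = dup_unsorted.sort_index()                     # binary-search case

for name, s in [("unique", unique_idx),
                ("duplicates, unsorted", dup_unsorted),
                ("duplicates, sorted", dup_sorted)]:
    print(f"{name:22s} is_unique={s.index.is_unique} "
          f"is_monotonic_increasing={s.index.is_monotonic_increasing}")

# A single-key lookup returns the same set of values either way;
# only the search strategy behind it differs.
key = labels[0]
from_unsorted = np.sort(np.atleast_1d(dup_unsorted.loc[key]))
from_sorted = np.sort(np.atleast_1d(dup_sorted.loc[key]))
```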
