Python集合模块中的defaultdict真的比使用setdefault更快吗?

问题描述 投票:0回答:2

我见过其他 Python 程序员在以下用例中使用集合模块中的 defaultdict:

from collections import defaultdict

s = [('yellow', 1), ('blue', 2), ('yellow', 3), ('blue', 4), ('red', 1)]

def main():
    d = defaultdict(list)
    for k, v in s:
        d[k].append(v)

我通常通过使用 setdefault 来解决这个问题:

def main():
    d = {}
    for k, v in s:
        d.setdefault(k, []).append(v)

文档实际上确实声称使用defaultdict更快,但我在测试自己时发现事实恰恰相反:

$ python -mtimeit -s "from withsetdefault import main; s = [('yellow', 1), ('blue', 2), ('yellow', 3), ('blue', 4), ('red', 1)];" "main()"
100000 loops, best of 3: 4.51 usec per loop
$ python -mtimeit -s "from withdefaultdict import main; s = [('yellow', 1), ('blue', 2), ('yellow', 3), ('blue', 4), ('red', 1)];" "main()"
100000 loops, best of 3: 5.38 usec per loop

我设置测试的方式有问题吗?

作为参考,我使用的是 Python 2.7.3 [GCC 4.2.1(Apple Inc. build 5666)

python collections defaultdict setdefault python-collections
2个回答
25
投票

是的,有一些“错误”:

您已将

(default)dict
的创建而不是设置放入语句中。构造一个新的
defaultdict
比普通的
dict
更昂贵,通常这不是您应该在程序中分析的瓶颈 - 毕竟,您构建了一次数据结构,但多次使用它们。

如果您进行如下测试,您会发现

defaultdict
操作确实更快:

>>> import timeit
>>> setup1 = """from collections import defaultdict
... s = [('yellow', 1), ('blue', 2), ('yellow', 3), ('blue', 4), ('red', 1)]
... d = defaultdict(list)"""
>>> stmt1 = """for k, v in s:
...     d[k].append(v)"""
>>> setup2 = """s = [('yellow', 1), ('blue', 2), ('yellow', 3), ('blue', 4), ('red', 1)]
... d = {}"""
>>> stmt2 = """for k, v in s:
...     d.setdefault(k, []).append(v)"""
>>> timeit.timeit(setup=setup1, stmt=stmt1)
1.0283400125194078
>>> timeit.timeit(setup=setup2, stmt=stmt2)
1.7767367580925395

Win7 x64 上的 Python 2.7.3。


0
投票

添加到蒂姆的答案中,当您创建多个默认字典时,添加元素会变得更加昂贵。虽然初始化会增加时间,但主要区别在于将元素添加到多个默认字典所需的时间。

这里有一些代码来显示观察结果

import time
from collections import defaultdict

def time_taken_mult_dd(sz):
    t1 = t2 = 0
    for i1 in range(1000000):
        start1 = time.time()
        d = defaultdict(int)
        end1 = time.time()
        
        start2 = time.time()
        for i2 in range(sz):
            d[i2] += 1
        end2 = time.time()
        
        t1 += end1 - start1
        t2 += end2 - start2
    
    print(f"Time taken for initialization {t1:5.2f}e-06")
    print(f"Time taken for addition       {t2:5.2f}e-06")

def time_taken_one_dd(sz):
    t = 0
    d = defaultdict(int)
    for i1 in range(1000000):
        
        start = time.time()
        for i2 in range(sz):
            d[i2] += 1
        end = time.time()
        
        t += end - start
    
    print(f"Time taken for addition       {t:5.2f}e-06")

这是输出。观察初始化时间如何保持不变,而添加元素却变得更加耗时

>>> time_taken_one_dd(10)
Time taken for addition        1.59e-06
>>> time_taken_mult_dd(10)
Time taken for initialization  0.41e-06
Time taken for addition        2.64e-06
>>> 
>>> time_taken_one_dd(50)
Time taken for addition        6.46e-06
>>> time_taken_mult_dd(50)
Time taken for initialization  0.59e-06
Time taken for addition       11.80e-06
>>> 
>>> time_taken_one_dd(100)
Time taken for addition       12.61e-06
>>> time_taken_mult_dd(100)
Time taken for initialization  0.71e-06
Time taken for addition       23.05e-06

© www.soinside.com 2019 - 2024. All rights reserved.