我见过其他 Python 程序员在以下用例中使用集合模块中的 defaultdict:
from collections import defaultdict
s = [('yellow', 1), ('blue', 2), ('yellow', 3), ('blue', 4), ('red', 1)]
def main():
d = defaultdict(list)
for k, v in s:
d[k].append(v)
我通常通过使用 setdefault 来解决这个问题:
def main():
d = {}
for k, v in s:
d.setdefault(k, []).append(v)
文档实际上确实声称使用defaultdict更快,但我在测试自己时发现事实恰恰相反:
$ python -mtimeit -s "from withsetdefault import main; s = [('yellow', 1), ('blue', 2), ('yellow', 3), ('blue', 4), ('red', 1)];" "main()"
100000 loops, best of 3: 4.51 usec per loop
$ python -mtimeit -s "from withdefaultdict import main; s = [('yellow', 1), ('blue', 2), ('yellow', 3), ('blue', 4), ('red', 1)];" "main()"
100000 loops, best of 3: 5.38 usec per loop
我设置测试的方式有问题吗?
作为参考,我使用的是 Python 2.7.3 [GCC 4.2.1(Apple Inc. build 5666)
是的,有一些“错误”:
您已将
(default)dict
的创建而不是设置放入语句中。构造一个新的 defaultdict
比普通的 dict
更昂贵,通常这不是您应该在程序中分析的瓶颈 - 毕竟,您构建了一次数据结构,但多次使用它们。
如果您进行如下测试,您会发现
defaultdict
操作确实更快:
>>> import timeit
>>> setup1 = """from collections import defaultdict
... s = [('yellow', 1), ('blue', 2), ('yellow', 3), ('blue', 4), ('red', 1)]
... d = defaultdict(list)"""
>>> stmt1 = """for k, v in s:
... d[k].append(v)"""
>>> setup2 = """s = [('yellow', 1), ('blue', 2), ('yellow', 3), ('blue', 4), ('red', 1)]
... d = {}"""
>>> stmt2 = """for k, v in s:
... d.setdefault(k, []).append(v)"""
>>> timeit.timeit(setup=setup1, stmt=stmt1)
1.0283400125194078
>>> timeit.timeit(setup=setup2, stmt=stmt2)
1.7767367580925395
Win7 x64 上的 Python 2.7.3。
添加到蒂姆的答案中,当您创建多个默认字典时,添加元素会变得更加昂贵。虽然初始化会增加时间,但主要区别在于将元素添加到多个默认字典所需的时间。
这里有一些代码来显示观察结果
import time
from collections import defaultdict
def time_taken_mult_dd(sz):
t1 = t2 = 0
for i1 in range(1000000):
start1 = time.time()
d = defaultdict(int)
end1 = time.time()
start2 = time.time()
for i2 in range(sz):
d[i2] += 1
end2 = time.time()
t1 += end1 - start1
t2 += end2 - start2
print(f"Time taken for initialization {t1:5.2f}e-06")
print(f"Time taken for addition {t2:5.2f}e-06")
def time_taken_one_dd(sz):
t = 0
d = defaultdict(int)
for i1 in range(1000000):
start = time.time()
for i2 in range(sz):
d[i2] += 1
end = time.time()
t += end - start
print(f"Time taken for addition {t:5.2f}e-06")
这是输出。观察初始化时间如何保持不变,而添加元素却变得更加耗时
>>> time_taken_one_dd(10)
Time taken for addition 1.59e-06
>>> time_taken_mult_dd(10)
Time taken for initialization 0.41e-06
Time taken for addition 2.64e-06
>>>
>>> time_taken_one_dd(50)
Time taken for addition 6.46e-06
>>> time_taken_mult_dd(50)
Time taken for initialization 0.59e-06
Time taken for addition 11.80e-06
>>>
>>> time_taken_one_dd(100)
Time taken for addition 12.61e-06
>>> time_taken_mult_dd(100)
Time taken for initialization 0.71e-06
Time taken for addition 23.05e-06