Fastest way to uniquify a list of dicts by a rule?


I have a list of dicts:

list1 = [
  { 'T': 1234, 'V': 10, 'O': 1 },
  { 'T': 2345, 'V': 50, 'O': 5 },
  { 'T': 2345, 'V': 30, 'O': 3 },
  { 'T': 3456, 'V': 40, 'O': 91 },
]

I need to uniquify these:

  • T should be unique
  • Whichever dict has the larger V should take precedence

Which should produce:

[
  {'T': 1234, 'V': 10, 'O': 1}, 
  {'T': 2345, 'V': 50, 'O': 5}, 
  {'T': 3456, 'V': 40, 'O': 91}
]

I came up with this:

interm = {o['T']: o for o in list1}
for o in list1:
  if o['V'] > interm[o['T']]['V']:
    interm[o['T']] = o

But this effectively iterates over the list twice and sets dict values multiple times. It feels like it could be improved, but I don't know how to go about it.

Is there a faster way to achieve this?

python
5 Answers
2 votes

Assuming your list is already sorted by T, you can simply keep track of the maximum-V element in a single pass, replacing it whenever a larger one is found:

list1 = [
    { 'T': 1234, 'V': 10, 'O': 1 },
    { 'T': 2345, 'V': 50, 'O': 5 },
    { 'T': 2345, 'V': 30, 'O': 3 },
    { 'T': 3456, 'V': 40, 'O': 91 },
] 

unique = {}
for dic in list1:
    key = dic['T']
    found = unique.get(key)

    # If value found and doesn't exceed current maximum, just ignore
    if found and dic['V'] <= found['V']:
        continue

    # otherwise just update normally
    unique[key] = dic

print(list(unique.values()))
# [{'T': 1234, 'V': 10, 'O': 1}, {'T': 2345, 'V': 50, 'O': 5}, {'T': 3456, 'V': 40, 'O': 91}]

If your list is not guaranteed to be sorted by T, you can sort it beforehand using T as the sort key:

from operator import itemgetter

sorted(list1, key=itemgetter('T'))

Using operator.itemgetter above is the same as using:

sorted(list1, key=lambda x: x['T'])
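
Putting the two pieces of this answer together, the pre-sort and the single-pass loop can be wrapped in one small helper. The sketch below is only illustrative (the name dedupe_max_v is made up, not from the original answer); note that the dict-based maximum tracking handles input in any order, so the sort is only needed if the result should come back ordered by T:

from operator import itemgetter

def dedupe_max_v(rows):
    # Keep one dict per 'T', preferring the one with the largest 'V'.
    unique = {}
    # Sorting first only makes the resulting values() come out ordered by 'T';
    # the dict lookup itself works for rows in any order.
    for dic in sorted(rows, key=itemgetter('T')):
        found = unique.get(dic['T'])
        if found is None or dic['V'] > found['V']:
            unique[dic['T']] = dic
    return list(unique.values())

print(dedupe_max_v(list1))   # list1 as defined above
# [{'T': 1234, 'V': 10, 'O': 1}, {'T': 2345, 'V': 50, 'O': 5}, {'T': 3456, 'V': 40, 'O': 91}]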

3 votes

Assuming list1 is already sorted by T, you can use itertools.groupby:

from itertools import groupby

li = [
  { 'T': 1234, 'V': 10, 'O': 1 },
  { 'T': 2345, 'V': 50, 'O': 5 },
  { 'T': 2345, 'V': 30, 'O': 3 },
  { 'T': 3456, 'V': 40, 'O': 91 },
]

output = [max(group, key=lambda d: d['V'])
          for _, group in groupby(li, key=lambda d: d['T'])]

print(output)
# [{'T': 1234, 'V': 10, 'O': 1}, {'T': 2345, 'V': 50, 'O': 5}, {'T': 3456, 'V': 40, 'O': 91}]

If it is not, groupby can still be combined with a sort for an O(n log n) solution:

order_by_t = lambda d: d['T']

li.sort(key=order_by_t)

output = [max(group, key=lambda d: d['V'])
          for _, group in groupby(li, key=order_by_t)]

2 votes

Here is a step-by-step approach. It iterates over your list and builds a new list:

list1 = [
  { 'T': 1234, 'V': 10, 'O': 1 },
  { 'T': 2345, 'V': 50, 'O': 5 },
  { 'T': 2345, 'V': 30, 'O': 3 },
  { 'T': 3456, 'V': 40, 'O': 91 },
]

# add this step if not already sorted by T
# list1 = sorted(list1, key = lambda x: x["T"]) 

list2 = []
for e in list1:
    t, v, o = e["T"], e["V"], e["O"]

    # we already stored something and same T
    if list2 and list2[-1]["T"] == t:

        # smaller V ?
        if list2[-1]["V"] < v:
            # overwrite dict elements
            list2[-1]["V"] = v
            list2[-1]["O"] = o

    # did not store anything or other T
    else:
        list2.append(e)

print(list2)

Output:

[{'T': 1234, 'O': 1, 'V': 10}, 
 {'T': 2345, 'O': 5, 'V': 50}, 
 {'T': 3456, 'O': 91, 'V': 40}]
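
One detail worth noting about this approach: list2 stores references to the very same dict objects as list1, so the in-place updates of "V" and "O" above also modify the corresponding dicts inside list1. If the input list must stay untouched, a copy can be stored instead; a minimal hedged variant (an added assumption, not part of the original answer):

list2 = []
for e in list1:
    if list2 and list2[-1]["T"] == e["T"]:
        if list2[-1]["V"] < e["V"]:
            # replace the whole stored dict instead of mutating it,
            # so the dicts inside list1 are never modified
            list2[-1] = dict(e)
    else:
        # dict(e) makes a shallow copy, keeping list1 untouched
        list2.append(dict(e))

print(list2)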

2 votes

The question asked for the "fastest" way, so I timed the current approaches with the given data. RoadRunner's appears to work fastest on this data set, with mine second and Deep Space's solution third.

>>> import timeit
>>> timeit.timeit(p1,setup=up)        # https://stackoverflow.com/a/54957067/7505395
2.5858893489556913
>>> timeit.timeit(p2,setup=up)        # https://stackoverflow.com/a/54957090/7505395
0.8051884429499854
>>> timeit.timeit(p3,setup=up)        # https://stackoverflow.com/a/54957156/7505395
0.7680418536661247

The test code:

up = """from itertools import groupby

li = [
{ 'T': 1234, 'V': 10, 'O': 1 },
{ 'T': 2345, 'V': 50, 'O': 5 },
{ 'T': 2345, 'V': 30, 'O': 3 },
{ 'T': 3456, 'V': 40, 'O': 91 },
]"""

Source: https://stackoverflow.com/a/54957067/7505395

p1 = """
# li.sort(key=lambda x:x["T"]) # for the random data
output = [max(group, key=lambda d: d['V'])
        for _, group in groupby(li, key=lambda d: d['T'])]
"""

Source: https://stackoverflow.com/a/54957090/7505395

p2 = """
# li.sort(key=lambda x:x["T"]) # for the random data
list2 = []
for e in li:
    t, v, o = e["T"], e["V"], e["O"]

    # we already stored something and same T
    if list2 and list2[-1]["T"] == t:

        # smaller V ?
        if list2[-1]["V"] < v:
            # overwrite dict elements
            list2[-1]["V"] = v
            list2[-1]["O"] = o

    # did not store anything or other T
    else:
        list2.append(e)
"""

Source: https://stackoverflow.com/a/54957156/7505395

p3 = """
unique = {}
for dic in li:
    key = dic['T']
    found = unique.get(key)

    # If value found and doesn't exceed current maximum, just ignore
    if found and dic['V'] <= found['V']:
        continue

    # otherwise just update normally
    unique[key] = dic 
"""

Edit (random 10k data - sorted and unsorted) to see whether it depends on the data:

Random data: 10000 data points

T in [1, 100] - V in [10, 20, ..., 200] - "O" in [1, 1000000]

up = """
from itertools import groupby
import random

random.seed(42)

def r():
    # few T so we get plenty of dupes
    return {"T":random.randint(1,100), "V":random.randint(1,20)*10, "O":random.randint(1,1000000)}

li = [ r() for _ in range(10000)]
# li.sort(key=lambda x:x["T"])   # uncommented for pre-sorted run
"""

Source: https://stackoverflow.com/a/54957067/7505395

p1 = """
li.sort(key=lambda x:x["T"]) # needs sorting, commented for pre-sorted run
output = [max(group, key=lambda d: d['V'])
          for _, group in groupby(li, key=lambda d: d['T'])]
"""

Source: https://stackoverflow.com/a/54957090/7505395

p2 = """
li.sort(key=lambda x:x["T"]) # needs sorting, commented for pre-sorted run
list2 = []
for e in li:
    t, v, o = e["T"], e["V"], e["O"]

    # we already stored something and same T
    if list2 and list2[-1]["T"] == t:

        # smaller V ?
        if list2[-1]["V"] < v:
            # overwrite dict elements
            list2[-1]["V"] = v
            list2[-1]["O"] = o

    # did not store anything or other T
    else:
        list2.append(e)
"""

Source: https://stackoverflow.com/a/54957156/7505395

p3 = """
unique = {}
for dic in li:
    key = dic['T']
    found = unique.get(key)

    # If value found and doesn't exceed current maximum, just ignore
    if found and dic['V'] <= found['V']:
        continue

    # otherwise just update normally
    unique[key] = dic
"""

Source: https://stackoverflow.com/a/54957363/7505395


p4 = """ 
t_v = {}
result = []
for row in li:
    if not t_v.get(row['T']):
        t_v[row['T']] = (row['V'], len(result))
        result.append(row)
        continue

    if row['V'] > t_v[row['T']][0]:
        t_v[row['T']] = (row['V'], t_v[row['T']][1])
        result[t_v[row['T']][1]] = row
"""

Results with the sort done inside p1 / p2:

import timeit
timeit.timeit(p1,setup=up, number=100)       0.4958197257468498      4th
timeit.timeit(p2,setup=up, number=100)       0.4506078658396253      3rd
timeit.timeit(p3,setup=up, number=100)       0.24399979946368378     1st
timeit.timeit(p4,setup=up, number=100)       0.2561938286132954      2nd

Results for pre-sorted data:

timeit.timeit(p1,setup=up, number=100)       0.3046940103986765      3rd
timeit.timeit(p2,setup=up, number=100)       0.33943337437485366     4th
timeit.timeit(p3,setup=up, number=100)       0.2795306502784811      1st
timeit.timeit(p4,setup=up, number=100)       0.29027710723995326     2nd

2 votes

To do this in a single loop over an unsorted list, I created a lookup table that stores information about the current result list. The lookup table stores each 'T' as a key, together with its 'V' value and the index of that item in the result list.

While looping over the data, you check each row's 'T' value against the lookup table's keys.

If the key does not exist, add it.

If it does exist, compare its stored value with the row's 'V' value.

If the current row's 'V' is greater, you can replace the previous row in the result using the stored index.


arr = [
    {'T': 2345, 'V': 50, 'O': 5},
    {'T': 1234, 'V': 10, 'O': 1},
    {'T': 2345, 'V': 30, 'O': 3},
    {'T': 3456, 'V': 40, 'O': 91},
]


def filter_out_lowest_values(arr):
    lookup = {}
    result = []
    for row in arr:
        row_key, row_value = row['T'], row['V']
        if not lookup.get(row_key):
            lookup[row_key] = (row_value, len(result))
            result.append(row)
            continue

        lookup_value, result_index = lookup[row_key][0], lookup[row_key][1]
        if row_value > lookup_value:
            lookup[row_key] = (row_value, result_index)
            result[result_index] = row

    return result


print(filter_out_lowest_values(arr))

To answer what the fastest way to uniquify the list is, see the benchmarks below.

It depends heavily on the data provided: the length of the list, whether or not it is already sorted, and the number of unique keys all play a role.

According to my benchmarks, Patrick Artner's ranks fastest on sorted lists. My own is fastest on unsorted lists, once its lookup table is fully populated.

Benchmark comparison

For each value of n, each script was run 100 times and the fastest (min) run time was plotted.
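
The exact harness is not reproduced on this page; the following is only a minimal sketch of how per-function min / avg / max timings like the ones below could be collected with timeit.repeat (the data generator and the single road_runner wrapper are illustrative stand-ins, not the author's actual script):

import random
import timeit
from statistics import mean

random.seed(42)

def make_data(n):
    # rows shaped like the question's dicts, with few distinct 'T' values
    return [{"T": random.randint(1, 100),
             "V": random.randint(1, 20) * 10,
             "O": random.randint(1, 1000000)} for _ in range(n)]

def road_runner(li):
    # single-pass dict approach from one of the answers above
    unique = {}
    for dic in li:
        found = unique.get(dic["T"])
        if found is None or dic["V"] > found["V"]:
            unique[dic["T"]] = dic
    return list(unique.values())

def benchmark(func, data, runs=100):
    # time `runs` individual calls and report min / avg / max seconds
    times = timeit.repeat(lambda: func(data), number=1, repeat=runs)
    return min(times), mean(times), max(times)

if __name__ == "__main__":
    for n in (10, 100, 1000, 10000, 25000):
        lo, avg, hi = benchmark(road_runner, make_data(n))
        print(f"N={n:<6} | {lo:.9f} | {avg:.9f} | {hi:.9f} | road_runner |")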

Unsorted Data Benchmarks

N = 10

| min | avg | max | func | name |
|---|---|---|---|---|
| 0.000006437 | 0.000007293 | 0.000022173 | sarcoma | sarcoma |
| 0.000007153 | 0.000007646 | 0.000017881 | road_runner_with_sort | RoadRunner |
| 0.000007868 | 0.000008337 | 0.000013351 | patrick_artner_with_sort | Patrick_Artner |
| 0.000015497 | 0.000017719 | 0.000026703 | deep_space_with_sort | DeepSpace |

N = 100

| min | avg | max | func | name |
|---|---|---|---|---|
| 0.000043154 | 0.000045519 | 0.000057936 | road_runner_with_sort | RoadRunner |
| 0.000053883 | 0.000056396 | 0.000069141 | sarcoma | sarcoma |
| 0.000055075 | 0.000057223 | 0.000063181 | patrick_artner_with_sort | Patrick_Artner |
| 0.000135660 | 0.000145028 | 0.000174046 | deep_space_with_sort | DeepSpace |

N = 1000

| min | avg | max | func | name |
|---|---|---|---|---|
| 0.000294447 | 0.000559096 | 0.000992775 | road_runner_with_sort | RoadRunner |
| 0.000327826 | 0.000374844 | 0.000650883 | patrick_artner_with_sort | Patrick_Artner |
| 0.000344276 | 0.000605364 | 0.002207994 | sarcoma | sarcoma |
| 0.000758171 | 0.001031160 | 0.002290487 | deep_space_with_sort | DeepSpace |

N = 10000

| min | avg | max | func | name |
|---|---|---|---|---|
| 0.003607988 | 0.003875387 | 0.005285978 | road_runner_with_sort | RoadRunner |
| 0.003780127 | 0.004181504 | 0.005370378 | sarcoma | sarcoma |
| 0.003986597 | 0.004258037 | 0.006756544 | patrick_artner_with_sort | Patrick_Artner |
| 0.007097244 | 0.007444410 | 0.009983778 | deep_space_with_sort | DeepSpace |

N = 25000

| min | avg | max | func | name |
|---|---|---|---|---|
| 0.009672165 | 0.010055504 | 0.011536598 | sarcoma | sarcoma |
| 0.019844294 | 0.022260010 | 0.027792931 | road_runner_with_sort | RoadRunner |
| 0.020462751 | 0.022415347 | 0.029330730 | patrick_artner_with_sort | Patrick_Artner |
| 0.024955750 | 0.027981100 | 0.031506777 | deep_space_with_sort | DeepSpace |

Sorted Data Benchmarks

N = 10

| min | avg | max | func | name |
|---|---|---|---|---|
| 0.000002861 | 0.000003138 | 0.000005960 | road_runner | RoadRunner |
| 0.000002861 | 0.000003231 | 0.000012398 | patrick_artner | Patrick_Artner |
| 0.000004292 | 0.000004461 | 0.000007629 | sarcoma | sarcoma |
| 0.000008821 | 0.000009136 | 0.000011921 | deep_space | DeepSpace |

N = 100

| min | avg | max | func | name |
|---|---|---|---|---|
| 0.000020027 | 0.000020833 | 0.000037909 | road_runner | RoadRunner |
| 0.000021458 | 0.000024126 | 0.000087738 | patrick_artner | Patrick_Artner |
| 0.000033140 | 0.000034373 | 0.000049591 | sarcoma | sarcoma |
| 0.000072241 | 0.000073054 | 0.000085592 | deep_space | DeepSpace |

N = 1000

| min | avg | max | func | name |
|---|---|---|---|---|
| 0.000200748 | 0.000207791 | 0.000290394 | patrick_artner | Patrick_Artner |
| 0.000207186 | 0.000219207 | 0.000277519 | road_runner | RoadRunner |
| 0.000333071 | 0.000369296 | 0.000570774 | sarcoma | sarcoma |
| 0.000635624 | 0.000721800 | 0.001362801 | deep_space | DeepSpace |

N = 10000

| min | avg | max | func | name |
|---|---|---|---|---|
| 0.002717972 | 0.002925014 | 0.003932238 | patrick_artner | Patrick_Artner |
| 0.002796888 | 0.003489044 | 0.004799843 | road_runner | RoadRunner |
| 0.004704714 | 0.005460148 | 0.008680582 | sarcoma | sarcoma |
| 0.005549192 | 0.006385834 | 0.009561062 | deep_space | DeepSpace |

N = 25000

| min | avg | max | func | name |
|---|---|---|---|---|
| 0.010142803 | 0.011239243 | 0.015279770 | patrick_artner | Patrick_Artner |
| 0.011211157 | 0.012368391 | 0.014696836 | road_runner | RoadRunner |
| 0.014389753 | 0.015374193 | 0.022623777 | sarcoma | sarcoma |
| 0.016021967 | 0.016560717 | 0.019297361 | deep_space | DeepSpace |

The benchmark script can be found at:
