Get the keys of the items with the least count from a list of key-value tuples - Python

Question

The input is an unsorted list of tuples:

x = [('herr', 1),
     ('dapao', 1),
     ('cino', 1),
     ('o', 38),
     ('tiao', 2),
     ('tut', 1),
     ('poh', 6),
     ('micheal', 1),
     ('orh', 1),
     ('horlick', 3),
     ('si', 1),
     ('tai', 1),
     ('titlo', 1),
     ('siew', 17),
     ('da', 1),
     ('halia', 2)]

The goal is to find the last n keys with the least counts, i.e. the desired output:

['orh', 'si', 'tai', 'titlo', 'da']

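I've tried doing this by:

first converting the list of tuples to a dict
casting the dict into a Counter
then taking the [-n:] slice of the list of tuples from Counter.most_common()
casting the list of tuples in [-n:] to a dict
getting the keys and then casting them into a list

i.e.

from collections import Counter

n = 5
list(dict(Counter(dict(x)).most_common()[-n:]).keys())

Is there a less convoluted way to get the same output?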

I could also do this:

from operator import itemgetter
output, *_ = zip(*sorted(x, key=itemgetter(1))[n:])
list(output)

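But now I've merely swapped out Counter.most_common for sorted and itemgetter. And I would still need zip(*list) to extract the keys, by unpacking the first value from each list of tuples after the zip.

There must be a simpler way.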

NOTE

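Note that the question is not asking to sort, but to extract the first elements of the tuples from the original list as given. The criterion for extraction is the last n items with the lowest value in the second element.

The answers from the possible duplicate linked still require a step to unpack the list of sorted tuples and to extract the top n of the list of first elements.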

Answer 1

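"The goal is to find the last n keys with the least counts"

Given this definition of the goal, neither of your two solutions fits. The one with Counter goes through a dict, which did not guarantee any key order before Python 3.7, so you will not get the last keys but some n keys with the least value. The second solution has incorrect slicing ([n:] instead of [:n]), and once that is fixed it returns the first n keys with the least value.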

Since the sorted implementation is stable, the second solution can be rewritten like this to fit the goal:

def author_2():
    output, *_ = zip(*sorted(reversed(l), key=lambda v: v[1])[:n])
    return list(reversed(output))
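
To see why reversing the input before a stable sort picks the last tied keys, here is a small self-contained check. It is only an illustration of the idea above; the names first_n and last_n are made up for this sketch.

```python
x = [('herr', 1), ('dapao', 1), ('cino', 1), ('o', 38),
     ('tiao', 2), ('tut', 1), ('poh', 6), ('micheal', 1),
     ('orh', 1), ('horlick', 3), ('si', 1), ('tai', 1),
     ('titlo', 1), ('siew', 17), ('da', 1), ('halia', 2)]
n = 5

# sorted() is stable: ties keep their input order, so without reversing
# we get the *first* n keys with the smallest count.
first_n = [key for key, _ in sorted(x, key=lambda kv: kv[1])[:n]]

# Reversing the input first puts the *last* tied keys in front;
# reversing the slice again restores their original relative order.
last_n = [key for key, _ in sorted(reversed(x), key=lambda kv: kv[1])[:n]][::-1]

print(first_n)  # ['herr', 'dapao', 'cino', 'tut', 'micheal']
print(last_n)   # ['orh', 'si', 'tai', 'titlo', 'da']
```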

But it is a better idea to use heapq, the stdlib tool for questions like "the n smallest/largest values from an iterable" (as Martijn Pieters pointed out, nsmallest and nlargest are also stable, and the docs really do say so, although only implicitly). This matters especially if the real list you have to deal with is big (for small n it should be faster than sorted, as the docs describe):

def prop_1():
    rev_result = heapq.nsmallest(n, reversed(l), key=lambda v: v[1])
    return [item[0] for item in rev_result][::-1]

You can improve performance even further, but at the cost of order (sort stability), i.e. you get some n keys with the least value instead of the last n keys with the least value. To do that you need to keep a "heapified" list and use it as your internal data structure instead of a plain list (if you don't change the list and need the bottom n only once, it will not give a performance benefit). You can push to and pop from the list, for example:

_p2_heap = None
def prop_2():
    global _p2_heap
    if not _p2_heap:
        _p2_heap = []
        for item in l:
            heapq.heappush(_p2_heap, item[::-1])

    return [item[1] for item in heapq.nsmallest(n, _p2_heap)][::-1]

Here is the complete module you can use to benchmark the solutions (the timeit commands import it as tst):

import heapq
from collections import Counter

l = [
    ('herr', 1), ('dapao', 1),
    ('cino', 1), ('o', 38),
    ('tiao', 2), ('tut', 1),
    ('poh', 6), ('micheal', 1),
    ('orh', 1), ('horlick', 3),
    ('si', 1), ('tai', 1),
    ('titlo', 1), ('siew', 17),
    ('da', 1), ('halia', 2)
]
n = 5

def author_1():
    return list(dict(Counter(dict(l)).most_common()[-n:]).keys())

def author_2():
    output, *_ = zip(*sorted(reversed(l), key=lambda v: v[1])[:n])
    return list(reversed(output))

def prop_1():
    rev_result = heapq.nsmallest(n, reversed(l), key=lambda v: v[1])
    return [item[0] for item in rev_result][::-1]

_p2_heap = None
def prop_2():
    global _p2_heap
    if not _p2_heap:
        _p2_heap = []
        for item in l:
            heapq.heappush(_p2_heap, item[::-1])

    return [item[1] for item in heapq.nsmallest(n, _p2_heap)][::-1]
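
The "keep a heapified list and push/pop from it" idea can be sketched as a small wrapper. This is a minimal sketch, not part of the module above: CountHeap, its method names and the 'zero' key are all hypothetical, and like prop_2 it gives up the "last n" ordering among ties.

```python
import heapq


class CountHeap:
    """Hypothetical helper that keeps (count, key) pairs heapified between queries."""

    def __init__(self, pairs=()):
        # Store the count first so the heap orders by count, then by key.
        self._heap = [(count, key) for key, count in pairs]
        heapq.heapify(self._heap)

    def push(self, key, count):
        heapq.heappush(self._heap, (count, key))

    def pop_smallest(self):
        count, key = heapq.heappop(self._heap)
        return key, count

    def bottom(self, n):
        # nsmallest works on any iterable and leaves the heap untouched.
        return [key for _, key in heapq.nsmallest(n, self._heap)]


counts = CountHeap([('o', 38), ('tiao', 2), ('poh', 6)])
counts.push('zero', 0)
print(counts.bottom(2))       # ['zero', 'tiao']
print(counts.pop_smallest())  # ('zero', 0)
print(counts.bottom(2))       # ['tiao', 'poh']
```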

And here are the timeit results:

$ python -m timeit -s "import tst" "tst.author_1()"
100000 loops, best of 3: 7.72 usec per loop
$ python -m timeit -s "import tst" "tst.author_2()"
100000 loops, best of 3: 3.7 usec per loop
$ python -m timeit -s "import tst" "tst.prop_1()"
100000 loops, best of 3: 5.51 usec per loop
$ python -m timeit -s "import tst" "tst.prop_2()"
100000 loops, best of 3: 3.96 usec per loop

But if we make l = l * 1000, the difference becomes noticeable:

$ python -m timeit -s "import tst" "tst.author_1()"
1000 loops, best of 3: 263 usec per loop
$ python -m timeit -s "import tst" "tst.author_2()"
100 loops, best of 3: 2.72 msec per loop
$ python -m timeit -s "import tst" "tst.prop_1()"
1000 loops, best of 3: 1.65 msec per loop
$ python -m timeit -s "import tst" "tst.prop_2()"
1000 loops, best of 3: 767 usec per loop
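
If you would rather run the comparison from inside Python than from the shell, something along these lines should work. It is only a sketch: by_sorted and by_heap are stand-in names for the sorted-based and nsmallest-based functions above, the repeat counts are arbitrary, and the absolute numbers will differ from machine to machine.

```python
import heapq
import timeit

l = [('herr', 1), ('dapao', 1), ('cino', 1), ('o', 38),
     ('tiao', 2), ('tut', 1), ('poh', 6), ('micheal', 1),
     ('orh', 1), ('horlick', 3), ('si', 1), ('tai', 1),
     ('titlo', 1), ('siew', 17), ('da', 1), ('halia', 2)] * 1000
n = 5


def by_sorted():
    # Stable sort over the reversed list, then restore the original order.
    picked = sorted(reversed(l), key=lambda v: v[1])[:n]
    return [key for key, _ in picked][::-1]


def by_heap():
    # Same selection, but without sorting the whole list.
    picked = heapq.nsmallest(n, reversed(l), key=lambda v: v[1])
    return [key for key, _ in picked][::-1]


# Both approaches should agree before their timings are compared.
assert by_sorted() == by_heap()

for fn in (by_sorted, by_heap):
    best = min(timeit.repeat(fn, number=20, repeat=3))
    print('%s: %.1f usec per call' % (fn.__name__, best / 20 * 1e6))
```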

Answer 2

Since we are trying to find the n elements in order from min to max, we cannot simply filter out those which do not have the smallest second element. We also have the second aim of maintaining order, which rules out merely sorting on each tuple's second element.

The final filtering pass of my solution is linear in the length of x, which is the best you can do for that step, since we are creating a new list that depends on a pre-existing list. The sort before it costs O(n log n).

It works by creating a set (unordered) of the first element of each of the first n tuples of x, after x has been reversed ([::-1]) and then sorted on the second element. The neat trick here is that, since we slice before converting to a set, the order among tuples with equal second elements is still respected when choosing which ones to keep.

Now, the neat thing about using a set is that lookup is O(1) (effectively instant), because the elements are stored by their hashes, so calling __contains__ is much faster than with a list.

Finally we need a list comprehension to perform the final filter of x:

>>> n = 5
>>> s = {i[0] for i in sorted(x[::-1], key=lambda t: t[1])[:n]}
>>> [i for i, _ in x if i in s]
['orh', 'si', 'tai', 'titlo', 'da']

And a test to show that it also works with n = 11:

['herr', 'dapao', 'cino', 'tut', 'micheal', 'orh', 'si', 'tai', 'titlo', 'da', 'halia']

Answer 3

Using a list comprehension and sorted:

[key for key,value in sorted(x, key=lambda y: y[1], reverse=True)][-n:]

or

[key for key,value in sorted(reversed(x), key=lambda y: y[1])][:n][::-1]

where n is the number of keys you want in the result. Note that the latter, with [::-1], is more expensive because it slices the list once more to reverse it.

from timeit import default_timer

def timeit(method, *args, **kwargs):
    start = default_timer()
    result = method(*args, **kwargs)
    end = default_timer()
    print('%s:\n(timing: %fs)\n%s\n' % (method.__name__, (end - start), result))

def with_copy(x, n):
    return [key for key,value in sorted(reversed(x), key=lambda y: y[1])][:n][::-1]

def without_copy(x, n):
    return [key for key,value in sorted(x, key=lambda y: y[1], reverse=True)][-n:]

x = [('herr', 1), ('dapao', 1), ('cino', 1), ('o', 38),
     ('tiao', 2), ('tut', 1), ('poh', 6), ('micheal', 1),
     ('orh', 1), ('horlick', 3), ('si', 1), ('tai', 1),
     ('titlo', 1), ('siew', 17), ('da', 1), ('halia', 2)]

n = 5
timeit(with_copy, x, n)
timeit(without_copy, x, n)

n = 11
timeit(with_copy, x, n)
timeit(without_copy, x, n)

Results with n = 5:

with_copy:
(timing: 0.000026s)
['orh', 'si', 'tai', 'titlo', 'da']

without_copy:
(timing: 0.000018s)
['orh', 'si', 'tai', 'titlo', 'da']

Results with n = 11:

with_copy:
(timing: 0.000019s)
['halia', 'herr', 'dapao', 'cino', 'tut', 'micheal', 'orh', 'si', 'tai', 'titlo', 'da']

without_copy:
(timing: 0.000013s)
['halia', 'herr', 'dapao', 'cino', 'tut', 'micheal', 'orh', 'si', 'tai', 'titlo', 'da']

Answer 4

No sorting is needed in this solution.

Short solution:

import numpy as np
n = 5
x = [('herr', 1),
     ('dapao', 1),
     ('cino', 1),
     ('o', 38),
     ('tiao', 2),
     ('tut', 1),
     ('poh', 6),
     ('micheal', 1),
     ('orh', 1),
     ('horlick', 3),
     ('si', 1),
     ('tai', 1),
     ('titlo', 1),
     ('siew', 17),
     ('da', 1),
     ('halia', 2)]

x = np.array(x)  # make the list a numpy array
names = x[:, 0]
numbers = x[:, 1].astype(int)
least_count = np.take(names, np.where(numbers == np.min(numbers)))[0][-n:]
print(least_count)

Output of the above solution:

['orh', 'si', 'tai', 'titlo', 'da']

Explanation of the solution, with comments:

import numpy as np

x = [('herr', 1), ('dapao', 1), ('cino', 1), ('o', 38),
     ('tiao', 2), ('tut', 1), ('poh', 6), ('micheal', 1),
     ('orh', 1), ('horlick', 3), ('si', 1), ('tai', 1),
     ('titlo', 1), ('siew', 17), ('da', 1), ('halia', 2)]

x = np.array(x)  # make the list a numpy array

# ==========================================
# split the array into names and numbers
# ==========================================
names = x[:, 0]
numbers = x[:, 1].astype(int)

mini = np.min(numbers)  # find the minimum in the numbers array
idx = np.where(numbers == mini)  # find the indices where the minimum occurs in the numbers array
least_count = np.take(names, idx)[0]  # use those indices to access the names array
print(least_count)
least_count = least_count.tolist()  # convert the numpy array to a list
n = 5  # say n is 5
print(least_count[-n:])  # now simple slicing extracts the last n elements

Output of the above explanation:

['herr' 'dapao' 'cino' 'tut' 'micheal' 'orh' 'si' 'tai' 'titlo' 'da']
['orh', 'si', 'tai', 'titlo', 'da']

Answer 5

Just use a heap; it will give you the desired keys.

import heapq

x = [('herr', 1),
('dapao', 1),
('cino', 1),
('o', 38),
('tiao', 2),
('tut', 1),
('poh', 6),
('micheal', 1),
('orh', 1),
('horlick', 3),
('si', 1),
('tai', 1),
('titlo', 1),
('siew', 17),
('da', 1),
('halia', 2)]

# (count, -index, key): ties on count are broken in favour of later items
heap = [(item[1],-index,item[0]) for index, item in enumerate(x)]
heapq.heapify(heap)

# the keys come out latest first; reverse the list to match the question's order
print(list(map(lambda item : item[2], heapq.nsmallest(5, heap))))

heapq.nsmallest(n, iterable, key=None) has a key argument; you can use it instead of -index as I did.
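
One way the key-argument variant might look is sketched below. This is an interpretation of the hint above rather than the answer author's code; the names smallest and keys are made up for the example.

```python
import heapq

x = [('herr', 1), ('dapao', 1), ('cino', 1), ('o', 38),
     ('tiao', 2), ('tut', 1), ('poh', 6), ('micheal', 1),
     ('orh', 1), ('horlick', 3), ('si', 1), ('tai', 1),
     ('titlo', 1), ('siew', 17), ('da', 1), ('halia', 2)]
n = 5

# Each element is (index, (key, count)). The key function orders by count
# first and by negative index second, so later ties are preferred.
smallest = heapq.nsmallest(n, enumerate(x), key=lambda p: (p[1][1], -p[0]))

# The picks come out latest first; reverse to restore the original order.
keys = [item[0] for _, item in smallest][::-1]
print(keys)  # ['orh', 'si', 'tai', 'titlo', 'da']
```

No tuples have to be built up front this way, and the original list is left untouched.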

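Answer 6

[k for k,v in sorted(x, key=lambda x: x[1])[:n]]

where x is the list of (key, count) tuples and n is the desired number of keys.

You may also adjust the sort criteria to include the keys themselves, if their order is important:

[k for k,v in sorted(x, key=lambda x: (x[1], x[0]))[:n]]

Answer 7

Edit @alvas:

mi = min(x, key=lambda x: x[1])[1]
r = [a[0] for a in x if a[1] == mi][-5:]

will produce the output you want.

You can use this:

sorted(x, key=lambda x: x[1])

Please refer to this (possible duplicate):

Sort a list of tuples by 2nd item (integer value)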
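
The min-then-filter idea can be wrapped in a function to make one property visible: it only ever returns keys tied at the minimum count, so it may return fewer than n keys. This is a sketch under that reading; last_n_at_minimum is a made-up name, and n is assumed to be positive.

```python
def last_n_at_minimum(pairs, n):
    """Return up to n of the last keys whose count equals the minimum count."""
    lowest = min(count for _, count in pairs)
    tied = [key for key, count in pairs if count == lowest]
    return tied[-n:]  # assumes n > 0; tied[-0:] would return the whole list


x = [('herr', 1), ('dapao', 1), ('cino', 1), ('o', 38),
     ('tiao', 2), ('tut', 1), ('poh', 6), ('micheal', 1),
     ('orh', 1), ('horlick', 3), ('si', 1), ('tai', 1),
     ('titlo', 1), ('siew', 17), ('da', 1), ('halia', 2)]

print(last_n_at_minimum(x, 5))        # ['orh', 'si', 'tai', 'titlo', 'da']
print(len(last_n_at_minimum(x, 11)))  # 10, because only ten keys have count 1
```

If n can exceed the number of ties, one of the sort- or heap-based answers is the safer choice.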

Answer 8

If you don't want to reinvent the wheel, you can use pandas. Performance should be good, since it is based on NumPy, which uses C under the hood instead of pure Python.

Short answer

df = pd.DataFrame(x, columns=['name', 'count'])
df = df.sort_values(by='count', kind='mergesort', ascending=False).tail(n)
print(df['name'].tolist())

Result

['orh', 'si', 'tai', 'titlo', 'da']

Expanded, working example with comments

import pandas as pd

n = 5
x = [('herr', 1),
     ('dapao', 1),
     ('cino', 1),
     ('o', 38),
     ('tiao', 2),
     ('tut', 1),
     ('poh', 6),
     ('micheal', 1),
     ('orh', 1),
     ('horlick', 3),
     ('si', 1),
     ('tai', 1),
     ('titlo', 1),
     ('siew', 17),
     ('da', 1),
     ('halia', 2)]

# Put the data in a dataframe.
df = pd.DataFrame(x, columns=['name', 'count'])

# Get the last n rows having the smallest 'count'.
# Mergesort is used instead of quicksort (default) since a stable sort is needed
# to get the *last* n smallest items instead of just *any* n smallest items.
df = df.sort_values(by='count', kind='mergesort', ascending=False).tail(n)

# Print the 'name' column as a list (since a list is what you asked for).
print(df['name'].tolist())

Answer 9

[i[0] for i in sorted(reversed(x), key=lambda x: x[1])[:n]]

Almost exactly the same answer as @Stacksonstacks, except that this one actually gives you the 'desired output' (if you put n = 5).

Answer 10

You don't need any imports for this task; you can also do it the following way:

n = 5
output = []

# Search and store the n least numbers
leastNbs = [a[1] for a in sorted(x, key=lambda x: x[1])[:n]]

# Iterate over the list of tuples starting from the end
# in order to find the tuples including one of the n least numbers
for name, nb in reversed(x):
    if nb in leastNbs:
        output.append(name)  # Store the string in output

# Keep only the n last strings (starting from the end)
output = list(reversed(output[:n]))

print(output)

Output:

['orh', 'si', 'tai', 'titlo', 'da']

Answer 11

Here's my suggestion:

x = [('herr', 1),
     ('dapao', 1),
     ('cino', 1),
     ('o', 38),
     ('tiao', 2),
     ('tut', 1),
     ('poh', 6),
     ('micheal', 1),
     ('orh', 1),
     ('horlick', 3),
     ('si', 1),
     ('tai', 1),
     ('titlo', 1),
     ('siew', 17),
     ('da', 1),
     ('halia', 2)]

n = 5
result = [name[0] for name in sorted(x, key=lambda i: i[1], reverse=True)[-n:]]
print(result)

Answer 12

This is a clean, simple approach that doesn't use Python idioms:

m = x[0][1]
l = []

for elem in x:
    if m > elem[1]:
        l = [elem[0]]
        m = elem[1]
    elif m == elem[1]:
        l.append(elem[0])

print(l[-5:])

It's sort of a fusion of a min search and filtering. m stores the min value so far, and l stores the list of elements that have that min count. Both are reset when a lower value is found.

This can be modified to hold only 5 elements, so no slicing would be needed at the end.
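
The "hold only 5 elements" variant could look roughly like this, using a bounded deque so that older ties fall off automatically. It is a sketch of that suggestion, not the answer author's code; last_n_min_keys is a hypothetical name.

```python
from collections import deque


def last_n_min_keys(pairs, n):
    """Single pass: keep at most the last n keys that share the lowest count."""
    lowest = None
    keep = deque(maxlen=n)  # silently discards the oldest key once it holds n
    for key, count in pairs:
        if lowest is None or count < lowest:
            # Found a new minimum: forget everything collected so far.
            lowest = count
            keep.clear()
        if count == lowest:
            keep.append(key)
    return list(keep)


x = [('herr', 1), ('dapao', 1), ('cino', 1), ('o', 38),
     ('tiao', 2), ('tut', 1), ('poh', 6), ('micheal', 1),
     ('orh', 1), ('horlick', 3), ('si', 1), ('tai', 1),
     ('titlo', 1), ('siew', 17), ('da', 1), ('halia', 2)]

print(last_n_min_keys(x, 5))  # ['orh', 'si', 'tai', 'titlo', 'da']
```

Like the loop above, this only considers keys tied at the minimum, so it returns fewer than n keys when there are fewer ties.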
