Counting elements that appear only once in a list: how to optimize? Very slow

Question · Votes: 0 · Answers: 1

I have a large list of usernames (~60,000 strings). Each username represents one submission. Some users made only a single submission, i.e. they are "one-time users", so their username appears exactly once in the list. Others made multiple submissions (returning users), so their username can appear many times. I want to count these one-time users and derive a few statistics from the list. These are the variables I'm currently computing:

import time

start_time = time.time()

users = ["UserA", "UserB", "UserC", "UserA", "UserA", "UserA", "UserB", "UserB", "UserD"] # ...just a sample, this goes up to ~60,000 elements
print(f"1: got users list. Time elapsed: {time.time() - start_time}")

one_time_users = [user for user in users if users.count(user) == 1]
print(f"2: got one-time users list. Time elapsed: {time.time() - start_time}")

returning_users = [user for user in users if users.count(user) != 1]
print(f"3: got returning users list. Time elapsed: {time.time() - start_time}")

frequencies = [users.count(user) for user in set(users)]
print(f"4: calculated frequencies list. Time elapsed: {time.time() - start_time}")

sorted_frequencies = sorted(frequencies, reverse=True) # Descending order, largest first
print(f"5: got sorted frequencies list. Time elapsed: {time.time() - start_time}")

top_ten_frequencies_sum = sum(sorted_frequencies[:10])
print(f"6: got top 10 frequencies sum. Time elapsed: {time.time() - start_time}")

top_ten_frequencies_percentage = round(((top_ten_frequencies_sum / len(users)) * 100), 2)
print(f"7: got top 10 frequencies percentage. Time elapsed: {time.time() - start_time}")

average_submissions_per_user = round(len(users) / len(set(users)), 2)
print(f"8: got average submissions per user. Time elapsed: {time.time() - start_time}")

This is extremely slow. Here is my output:

1: got users list. Time elapsed: 0.41695237159729004
2: got one-time users list. Time elapsed: 48.26731848716736
3: got returning users list. Time elapsed: 101.88410639762878
4: calculated frequencies list. Time elapsed: 104.39784860610962
5: got sorted frequencies list. Time elapsed: 104.39850783348083
6: got top 10 frequencies sum. Time elapsed: 104.39853930473328
7: got top 10 frequencies percentage. Time elapsed: 104.39856457710266
8: got average submissions per user. Time elapsed: 104.4005241394043

As you can see, the list comprehensions take by far the most time. Could someone explain to me:

  1. Why this is so slow (what the time complexity is).
  2. Whether collections.Counter() would be a better choice, and how best to apply it here.

Thanks!

python python-3.x list time-complexity list-comprehension
1 Answer

0 votes

You can improve this by using a Counter. In step 2, `users.count(user)` scans the whole list for every element, making the comprehension O(n²), and you repeat that scan for the same user every time it appears. In step 3 you iterate and count all over again, when you could simply subtract the one-time users you just computed from the set of all users. Example:

>>> from collections import Counter
>>> one_time_users = {user for user, cnt in Counter(users).items() if cnt == 1}
>>> one_time_users
{'UserC', 'UserD'}
>>> returning_users = set(users) - one_time_users
>>> returning_users
{'UserB', 'UserA'}

Or, more directly:

from collections import Counter

one_time_users, returning_users = [], []
for user, cnt in Counter(users).items():
    if cnt == 1:
        one_time_users.append(user)
    else:
        returning_users.append(user)