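Context: I have a list of organization IDs, where each organization ID has multiple account-ID/email pairs. Within an organization, each email is associated with exactly one account ID. Not every email is in every organization, but some emails are in several organizations or even all of them. For example, if there are 5 organizations, each one has an unknown number of account-ID/email pairs. Account IDs are unique regardless of which organization they belong to, but some emails appear in multiple organizations with a different account ID in each.
My data has the following structure, and I am trying to do this in Python: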
# Note: Each AccountID value is unique across the board.
# Note: Emails are unique per organization, but can be in multiple organizations.
[
    [
        # The value for OrganizationID is the same throughout the list of dictionaries.
        {
            "some-email A": "AccountID",
            "OrganizationID": "Organization A"  # <- The ID is just a string of numbers.
        },
        {
            "some-email B": "AccountID",
            "OrganizationID": "Organization A"
        },
        {
            "some-email C": "AccountID",
            "OrganizationID": "Organization A"
        },
        ...
    ],
    ...
    [
        {
            "some-email C": "AccountID",  # <- Also in Organization A, but with a different account ID
            "OrganizationID": "Organization LK"
        },
        {
            "some-email K": "AccountID",
            "OrganizationID": "Organization LK"
        },
        ...
    ],
    ...
]
Order does not matter! My end goal is to convert it into the following new data structure.
# Note: Reference is just a list of strings where each string is
# a concatenation of the "OrganizationID:AccountID" of the respective email.
[
    {
        "Email": "some-email A",
        "Reference": [
            "[Organization A]:[Account ID of "some-email A" in Organization A if exists]",
            ...
            "[Organization X]:[Account ID of "some-email A" in Organization X if exists]",
            ...
        ]
    },
    ...
    {
        "Email": "some-email C",
        "Reference": [
            "[Organization A]:[Account ID of "some-email C" in Organization A if exists]",
            ...
            "[Organization LK]:[Account ID of "some-email C" in Organization LK if exists]",
            ...
        ]
    },
]
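My current data set has over 1,000 organizations, each with an arbitrary number of accounts. Some organizations may have only one or two accounts, while others have 600 or more. No organization has zero accounts.
Edit: My current solution is below, but I would like to see whether there is a more efficient way to solve this.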
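re = list()   # result: list of {"Email": ..., "Reference": [...]}
seen = set()  # emails that already have an entry in re
for _p in dt:  # <- this is the first data set list(list(dict()))
    for x in _p:  # <- each dictionary in the list(dict())
        # The email is whichever key is not "OrganizationID".
        em = next(k for k in x if k != "OrganizationID")  # <- some-email key
        if em not in seen:
            seen.add(em)
            re.append({
                "Email": em,
                "Reference": [x["OrganizationID"] + ":" + x[em]]
            })
        else:
            # Linear scan of re to find the existing entry for this email.
            d = next(i for i in re if i['Email'] == em)
            d["Reference"].append(x["OrganizationID"] + ":" + x[em])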
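Because of the way your data is structured, what you are doing is going to require nested for loops, but I think you will get better performance if you drop the `if em not in seen` clause, since it requires its own lookup in a set that does not need to exist. Not having to create the set in the first place reduces overhead. Here is my approach: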
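A minimal sketch of one way to do this, not necessarily the code originally posted: group the reference strings in a dictionary keyed by email, then build the final list once at the end. The function name `build_references` and the sample data are hypothetical, and the sketch assumes each dictionary holds exactly one email key next to "OrganizationID".

```python
from collections import defaultdict


def build_references(dt):
    """Group "OrganizationID:AccountID" strings by email in a single pass.

    Assumes each dict has exactly one key besides "OrganizationID",
    and that this key is the email (hypothetical helper, for illustration).
    """
    refs_by_email = defaultdict(list)  # email -> list of reference strings
    for org_accounts in dt:            # one list of dicts per organization
        for entry in org_accounts:     # one dict per account
            org_id = entry["OrganizationID"]
            for key, account_id in entry.items():
                if key != "OrganizationID":  # the remaining key is the email
                    refs_by_email[key].append(f"{org_id}:{account_id}")
    # Convert the grouping into the requested list-of-dicts shape.
    return [
        {"Email": email, "Reference": refs}
        for email, refs in refs_by_email.items()
    ]


# Made-up sample data in the shape described in the question.
sample = [
    [
        {"a@example.com": "101", "OrganizationID": "1"},
        {"c@example.com": "103", "OrganizationID": "1"},
    ],
    [
        {"c@example.com": "907", "OrganizationID": "2"},
    ],
]
result = build_references(sample)
```

The dictionary replaces both the `seen` set and the `next(...)` scan over the result list, so each account is handled with a single hash lookup instead of a search through every email collected so far.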