Finding groups of overlapping lists


I have a DataFrame in which each row holds a list of categorical values, like this:

df = pd.DataFrame({'ID': ['A', 'B', 'C', 'D', 'E', 'F'], 'Destination': [['x', 'y'], ['m', 'n'], ['x', 'k'], ['x', 'k', 'y'], ['m'], ['p', 'h']]})
| ID | Destination |
| -------- | -------- |
| A   | [x,y]   |
| B   | [m,n]   |
| C   | [x,k]   |
| D   | [x,k,y]   |
| E   | [m]   |
| F   | [p,h]   |

I want to cluster the rows so that all Destination lists (or tuples, if that helps) that overlap end up in the same group.

I would like to end up with a result like this:

| Destination_Group| ID_Group |
| -------- | -------- |
| [x,k,y]   | [A,C,D]   |
| [m,n]   | [B,E]   |
| [p,h]   | [F]   |

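I don't know in advance how many Destination_Group values there will be. The real table is much longer than 6 rows, so I would like to avoid iterative approaches as far as possible.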

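I'm not sure whether k-means clustering or a Cartesian merge would help here, or whether I should be looking for something else. Any help is appreciated!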

python list cluster-analysis
1 Answer
import collections
import pandas as pd

df = pd.DataFrame({'ID':['A','B','C','D','E','F'],'Destination':[['x','y'],['m','n'],['x','k'],['x','k','y'],['m'],['p','h']]})

clusters = collections.defaultdict(list)

# invert the dataframe so you can find overlapping IDs
for _i,row in df.iterrows():
    for d in row['Destination']:
        clusters[d].append(row['ID'])

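answer = []  # clustered IDs
seen = set()  # IDs that we've already processed
# Walk the candidate clusters from biggest to smallest. If any ID in a
# candidate has already been seen, it belongs to a cluster we've already
# processed, so the candidate is skipped.
# Note: this greedy pass does not merge clusters that are linked only
# through a chain of IDs (A overlaps B, B overlaps C, but A and C share nothing).
for keys in sorted(clusters.values(), key=len, reverse=True):
    if any(key in seen for key in keys):
        continue

    seen.update(keys)
    answer.append(keys)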

# now that we have all the clusters, create the new dataframe
id_groups = answer
answer = {"Destination_Group": [], "ID_Group": []}
for cluster in id_groups:
    # union of the Destination lists of every ID in this cluster
    destinations = set()
    for k in cluster:
        destinations.update(df.loc[df.ID == k, 'Destination'].iloc[0])

    answer["Destination_Group"].append(destinations)
    answer["ID_Group"].append(sorted(cluster))

answer = pd.DataFrame(answer)

You get this result:

  Destination_Group   ID_Group
0         {k, y, x}  [A, C, D]
1            {m, n}     [B, E]
2            {h, p}        [F]
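
The greedy pass above gives the expected groups for the sample data, but it can miss overlaps that only exist through a chain (for example `['x','y']`, `['y','z']`, `['z','w']`). One way to cover that case is to treat IDs and destinations as nodes of a graph and take its connected components. The following is only a sketch of that idea using a small union-find in plain Python; the function name `group_overlapping` and the tagged `("id", ...)` / `("dest", ...)` node keys are illustrative choices, not part of the original answer, and it assumes IDs are unique.

```python
from collections import defaultdict


def group_overlapping(destinations_by_id):
    """Group IDs whose destination lists overlap, directly or through a chain."""
    parent = {}

    def find(node):
        # Register unseen nodes, then walk up to the root (with path halving).
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Link every ID to each of its destinations. Nodes are tagged so that an
    # ID named 'x' cannot collide with a destination named 'x'.
    for id_, dests in destinations_by_id.items():
        find(("id", id_))  # also registers IDs with an empty list
        for dest in dests:
            union(("id", id_), ("dest", dest))

    # Collect the members of each connected component.
    groups = defaultdict(lambda: {"Destination_Group": set(), "ID_Group": set()})
    for kind, value in list(parent):
        column = "ID_Group" if kind == "id" else "Destination_Group"
        groups[find((kind, value))][column].add(value)

    return [
        {name: sorted(members) for name, members in group.items()}
        for group in groups.values()
    ]


data = {
    "A": ["x", "y"], "B": ["m", "n"], "C": ["x", "k"],
    "D": ["x", "k", "y"], "E": ["m"], "F": ["p", "h"],
}
result = group_overlapping(data)
chain = group_overlapping({"A": ["x", "y"], "B": ["y", "z"], "C": ["z", "w"]})
```

With the DataFrame from the question, the input dictionary could be built with `dict(zip(df.ID, df.Destination))` and the returned list of records passed to `pd.DataFrame(...)`. This still loops in Python, but each ID/destination pair is visited once, so it should scale roughly linearly with the total number of list entries.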