I have a DataFrame with a column of categorical lists, like this:
df = pd.DataFrame({'ID': ['A','B','C','D','E','F'], 'Destination': [['x','y'],['m','n'],['x','k'],['x','k','y'],['m'],['p','h']]})
| ID | Destination |
| -------- | -------- |
| A | [x,y] |
| B | [m,n] |
| C | [x,k] |
| D | [x,k,y] |
| E | [m] |
| F | [p,h] |
I want to cluster Destination across all overlapping lists (or tuples, if that helps).
I would like to end up with a result like this:
| Destination_Group| ID_Group |
| -------- | -------- |
| [x,k,y] | [A,C,D] |
| [m,n] | [B,E] |
| [p,h] | [F] |
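I don't know in advance how many Destination_Group values there are. The real table is much longer than 6 rows, so I'd like to avoid iterative approaches as far as possible.
I don't know whether k-means clustering or a Cartesian merge would help here, or whether I should be looking for something else. Any help is appreciated!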
import collections
import pandas as pd
df = pd.DataFrame({'ID': ['A','B','C','D','E','F'], 'Destination': [['x','y'],['m','n'],['x','k'],['x','k','y'],['m'],['p','h']]})
clusters = collections.defaultdict(list)
# invert the dataframe so you can find overlapping IDs
for _i, row in df.iterrows():
    for d in row['Destination']:
        clusters[d].append(row['ID'])
answer = []   # clustered IDs
seen = set()  # IDs that we've already processed
for keys in sorted(clusters.values(), key=len, reverse=True):
    # we're doing this from the biggest cluster, so if we've already seen
    # one of these IDs, it was already in some cluster we've already processed
    if any(key in seen for key in keys):
        continue
    for k in keys:
        seen.add(k)
    answer.append(keys)
# now that we have all the clusters, create the new dataframe
keys = answer
answer = {"Destination_Group": [], "ID_Group": []}
for cluster in keys:
    answer["ID_Group"].append(cluster)
    s = set()
    for k in cluster:
        s.update(df[df.ID == k]['Destination'].reset_index()['Destination'][0])
    answer['Destination_Group'].append(s)
answer["ID_Group"] = [sorted(s) for s in answer["ID_Group"]]
answer = pd.DataFrame(answer)
You get this result:
Destination_Group ID_Group
0 {k, y, x} [A, C, D]
1 {m, n} [B, E]
2 {h, p} [F]
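One caveat, if "overlapping" is meant transitively: the greedy pass above skips any candidate cluster that contains an ID it has already seen. With rows like [a,b], [b,c], [c,d], the third row ends up in its own group even though it overlaps the second. If chains like that should merge, the problem becomes finding connected components. Below is a minimal sketch using a union-find over destination values. The function name `group_overlapping` is made up for illustration, and skipping rows with an empty list is an assumption, since the question doesn't say what should happen to them.

```python
from collections import defaultdict


def group_overlapping(ids, destinations):
    """Group IDs whose destination lists overlap, directly or through a chain."""
    parent = {}  # union-find forest over destination values

    def find(x):
        # Walk up to the root, halving the path along the way.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    pairs = [(id_, list(dests)) for id_, dests in zip(ids, destinations)]

    # All destinations that appear in the same row belong to the same group.
    for _id, dests in pairs:
        for d in dests:
            union(dests[0], d)

    dest_groups = defaultdict(set)
    id_groups = defaultdict(list)
    for id_, dests in pairs:
        if not dests:
            continue  # assumption: rows with an empty list are skipped
        root = find(dests[0])
        dest_groups[root].update(dests)
        id_groups[root].append(id_)

    # Groups come out in order of first appearance in the input.
    return [(sorted(dest_groups[r]), sorted(id_groups[r])) for r in id_groups]


ids = ['A', 'B', 'C', 'D', 'E', 'F']
destinations = [['x', 'y'], ['m', 'n'], ['x', 'k'], ['x', 'k', 'y'], ['m'], ['p', 'h']]
groups = group_overlapping(ids, destinations)
# groups == [(['k', 'x', 'y'], ['A', 'C', 'D']),
#            (['m', 'n'], ['B', 'E']),
#            (['h', 'p'], ['F'])]
```

With the DataFrame from the question, something like `pd.DataFrame(group_overlapping(df['ID'], df['Destination']), columns=['Destination_Group', 'ID_Group'])` should give the table you asked for. This still loops in Python, but it touches each destination value once rather than filtering the DataFrame once per ID, so it should cope better with a long table.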