I have two pandas DataFrames in which each row is a person, and each person's responses are stored as a list:
import numpy as np
import pandas as pd

df_1 = pd.DataFrame({'ID': ['a', 'b', 'c', 'd', 'e', 'f'],
                     'response': [["apple", "berry", "cherry"],
                                  ["pear", "pineapple", "plum"],
                                  ["blue_berry"],
                                  ["orange", "lemon"],
                                  ["tomato", "pumpkin"],
                                  ["avocado", "strawberry"]],
                     'group': [1, 2, 1, 2, 1, 2]})
df_2 = pd.DataFrame({'ID': ['A', 'B', 'C', 'D', 'E', 'F'],
                     'response': [["pear", "plum", "cherry"],
                                  ["orange", "lemon", "lime", "pineapple"],
                                  ["pumpkin"],
                                  ["tomato", "strawberry"],
                                  ["avocado", "apple"],
                                  ["berry", "cherry", "apple"]],
                     'group': [1, 2, 1, 2, 1, 2]})
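I'm trying to build a matrix whose row and column indices are both the 'ID' and 'group' values, and whose cells hold the pairwise Jensen-Shannon divergence computed from 'response'. My end goal is to visualise it as a heatmap to assess the reliability between people's responses, but first I'm struggling to get my data into the right matrix form.
I don't know how to turn these DataFrames into a square matrix and then compute the JSD with the following function: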
def jsdiv(P, Q):
    """Compute the Jensen-Shannon divergence between two probability distributions.

    Input
    -----
    P, Q : array-like
        Probability distributions of equal length that sum to 1
    """
    def _kldiv(A, B):
        return np.sum([v for v in A * np.log2(A/B) if not np.isnan(v)])

    P = np.array(P)
    Q = np.array(Q)
    M = 0.5 * (P + Q)
    return 0.5 * (_kldiv(P, M) + _kldiv(Q, M))
First, you need to combine the two DataFrames. I'd suggest the following approach:
import pandas as pd
import numpy as np
from scipy.spatial.distance import jensenshannon

df_1 = pd.DataFrame({'ID': ['a', 'b', 'c', 'd', 'e', 'f'],
                     'response': [["apple", "berry", "cherry"],
                                  ["pear", "pineapple", "plum"],
                                  ["blue_berry"],
                                  ["orange", "lemon"],
                                  ["tomato", "pumpkin"],
                                  ["avocado", "strawberry"]],
                     'group': [1, 2, 1, 2, 1, 2]})
df_2 = pd.DataFrame({'ID': ['A', 'B', 'C', 'D', 'E', 'F'],
                     'response': [["pear", "plum", "cherry"],
                                  ["orange", "lemon", "lime", "pineapple"],
                                  ["pumpkin"],
                                  ["tomato", "strawberry"],
                                  ["avocado", "apple"],
                                  ["berry", "cherry", "apple"]],
                     'group': [1, 2, 1, 2, 1, 2]})

# Stack both frames so that every person is one row.
df_combined = pd.concat([df_1, df_2], axis=0).reset_index(drop=True)

# Vocabulary of every fruit mentioned by anyone.
# A set has no guaranteed order, but the order stays fixed within one run.
all_fruits = {fruit for sublist in df_combined['response'] for fruit in sublist}

def response_to_prob_dist(response, all_fruits):
    # Relative frequency of each fruit within this person's response.
    fruit_count = {fruit: response.count(fruit) / len(response) for fruit in all_fruits}
    # One entry per fruit in the vocabulary; fruits not mentioned get 0.
    return [fruit_count[fruit] if fruit in response else 0 for fruit in all_fruits]

df_combined['prob_dist'] = df_combined['response'].apply(lambda x: response_to_prob_dist(x, all_fruits))
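This gives you a DataFrame like the following (the positions inside prob_dist depend on the iteration order of the all_fruits set, so they may differ between runs):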
  ID                 response  group  \
0  a   [apple, berry, cherry]      1
1  b  [pear, pineapple, plum]      2
2  c             [blue_berry]      1
3  d          [orange, lemon]      2
4  e        [tomato, pumpkin]      1

                                           prob_dist
0  [0, 0, 0, 0, 0.3333333333333333, 0, 0, 0, 0, 0...
1  [0, 0, 0, 0.3333333333333333, 0, 0, 0, 0, 0.33...
2       [0, 0, 1.0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
3     [0, 0.5, 0, 0, 0, 0.5, 0, 0, 0, 0, 0, 0, 0, 0]
4     [0.5, 0, 0, 0, 0, 0, 0.5, 0, 0, 0, 0, 0, 0, 0]
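An alternative way to get the same per-person distributions, sketched here under the assumption that every response list is non-empty, is to explode the lists and count with pd.crosstab. The names long, counts and probs are only illustrative, and the small frame below stands in for df_combined. Unlike the set-based version, the columns come out in sorted order, so the layout is reproducible between runs.

```python
import pandas as pd

# Small stand-in for df_combined (same kind of data as above).
df = pd.DataFrame({
    'ID': ['a', 'c', 'A'],
    'response': [["apple", "berry", "cherry"],
                 ["blue_berry"],
                 ["pear", "plum", "cherry"]],
})

# One row per (person, fruit) pair.
long = df.explode('response').reset_index(drop=True)

# Count how often each person named each fruit; columns are sorted fruit names.
counts = pd.crosstab(long['ID'], long['response'])

# crosstab sorts the row labels, so restore the original order of people.
counts = counts.reindex(df['ID'].tolist())

# Normalise each row so that it sums to 1.
probs = counts.div(counts.sum(axis=1), axis=0)
print(probs.round(3))
```

Each row of probs can then be used in place of the corresponding prob_dist list, for example via probs.to_numpy().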
You can then apply your function:
def jsdiv(P, Q):
    """Compute the Jensen-Shannon divergence between two probability distributions.

    Input
    -----
    P, Q : array-like
        Probability distributions of equal length that sum to 1
    """
    def _kldiv(A, B):
        # Terms with A == 0 evaluate to NaN and are dropped; NumPy may emit
        # RuntimeWarnings for the 0/0 and log2(0) cases along the way.
        return np.sum([v for v in A * np.log2(A/B) if not np.isnan(v)])

    P = np.array(P)
    Q = np.array(Q)
    M = 0.5 * (P + Q)
    return 0.5 * (_kldiv(P, M) + _kldiv(Q, M))

n = len(df_combined)
jsd_matrix = np.zeros((n, n))

# Fill the upper triangle and mirror it, since the divergence is symmetric.
for i in range(n):
    for j in range(i+1, n):
        jsd_matrix[i, j] = jsdiv(df_combined['prob_dist'].iloc[i], df_combined['prob_dist'].iloc[j])
        jsd_matrix[j, i] = jsd_matrix[i, j]

jsd_df = pd.DataFrame(jsd_matrix, index=df_combined['ID'], columns=df_combined['ID'])
print(jsd_df.head())
This gives you:
ID    a    b    c    d    e    f         A         B         C    D         E  \
ID
a   0.0  1.0  1.0  1.0  1.0  1.0  0.666667  1.000000  1.000000  1.0  0.595437
b   1.0  0.0  1.0  1.0  1.0  1.0  0.333333  0.712642  1.000000  1.0  1.000000
c   1.0  1.0  0.0  1.0  1.0  1.0  1.000000  1.000000  1.000000  1.0  1.000000
d   1.0  1.0  1.0  0.0  1.0  1.0  1.000000  0.311278  1.000000  1.0  1.000000
e   1.0  1.0  1.0  1.0  0.0  1.0  1.000000  1.000000  0.311278  0.5  1.000000

ID    F
ID
a   0.0
b   1.0
c   1.0
d   1.0
e   1.0
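The many 1.0 entries are expected: with base-2 logarithms the Jensen-Shannon divergence is 1 when two people share no fruit at all and 0 when their distributions are identical (as for a and F). The sketch below checks these cases with a variant of the hand-written function that relies on scipy.special.rel_entr, which treats the terms with a zero probability as 0 and so needs no NaN filtering. The name jsdiv_rel_entr is made up for this illustration.

```python
import numpy as np
from scipy.special import rel_entr

def jsdiv_rel_entr(P, Q):
    """Jensen-Shannon divergence in bits (hypothetical helper)."""
    P = np.asarray(P, dtype=float)
    Q = np.asarray(Q, dtype=float)
    M = 0.5 * (P + Q)
    # rel_entr(x, y) is x * ln(x / y) element-wise, and 0 where x == 0.
    kl_pm = rel_entr(P, M).sum()
    kl_qm = rel_entr(Q, M).sum()
    # Divide by ln(2) to convert from nats to bits.
    return 0.5 * (kl_pm + kl_qm) / np.log(2)

# Distributions over (apple, berry, cherry, pear, plum, blue_berry)
a = [1/3, 1/3, 1/3, 0, 0, 0]   # person a
A = [0, 0, 1/3, 1/3, 1/3, 0]   # person A, shares only "cherry" with a
c = [0, 0, 0, 0, 0, 1]         # person c, shares nothing with a

print(jsdiv_rel_entr(a, a))    # identical distributions
print(jsdiv_rel_entr(a, c))    # disjoint support
print(jsdiv_rel_entr(a, A))    # partial overlap
```

Up to floating-point rounding, the three calls should give 0, 1 and 2/3, which matches the a/F, a/c and a/A cells in the table above.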
That said, I don't really see the need for your own function. You can use SciPy's jensenshannon directly. It returns the Jensen-Shannon distance (the square root of the divergence), so square it to get the divergence:
def jsdiv(P, Q):
    return jensenshannon(P, Q, base=2)**2

n = len(df_combined)
jsd_matrix = np.zeros((n, n))

for i in range(n):
    for j in range(i+1, n):
        jsd_matrix[i, j] = jsdiv(df_combined['prob_dist'].iloc[i], df_combined['prob_dist'].iloc[j])
        jsd_matrix[j, i] = jsd_matrix[i, j]

jsd_df = pd.DataFrame(jsd_matrix, index=df_combined['ID'], columns=df_combined['ID'])
print(jsd_df.head())
This gives you the same result:
ID    a    b    c    d    e    f         A         B         C    D         E  \
ID
a   0.0  1.0  1.0  1.0  1.0  1.0  0.666667  1.000000  1.000000  1.0  0.595437
b   1.0  0.0  1.0  1.0  1.0  1.0  0.333333  0.712642  1.000000  1.0  1.000000
c   1.0  1.0  0.0  1.0  1.0  1.0  1.000000  1.000000  1.000000  1.0  1.000000
d   1.0  1.0  1.0  0.0  1.0  1.0  1.000000  0.311278  1.000000  1.0  1.000000
e   1.0  1.0  1.0  1.0  0.0  1.0  1.000000  1.000000  0.311278  0.5  1.000000

ID    F
ID
a   0.0
b   1.0
c   1.0
d   1.0
e   1.0
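The explicit double loop can also be replaced by SciPy's pairwise-distance helpers. This is only a sketch, assuming the distributions are stacked into a 2-D array with one row per person; the tiny array X below stands in for np.array(df_combined['prob_dist'].tolist()). pdist accepts a callable metric, and squareform expands the condensed result into a symmetric matrix with zeros on the diagonal.

```python
import numpy as np
import pandas as pd
from scipy.spatial.distance import jensenshannon, pdist, squareform

# Hand-made stand-in data: three people over six fruits.
ids = ['a', 'c', 'A']
X = np.array([
    [1/3, 1/3, 1/3, 0, 0, 0],   # a
    [0, 0, 0, 0, 0, 1],         # c
    [0, 0, 1/3, 1/3, 1/3, 0],   # A
])

def jsdiv(p, q):
    # Squared Jensen-Shannon distance (base 2) = divergence in bits.
    return jensenshannon(p, q, base=2) ** 2

# pdist evaluates jsdiv once for every unordered pair of rows.
condensed = pdist(X, metric=jsdiv)

# squareform turns the condensed vector into a full symmetric matrix.
jsd_df = pd.DataFrame(squareform(condensed), index=ids, columns=ids)
print(jsd_df.round(6))
```

For twelve people the speed difference is negligible, so this is mostly a matter of taste; it mainly removes the manual mirroring of the upper triangle.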
Your definition of jsdiv makes things more complicated than they need to be.
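The question also asked for the matrix to be labelled by group as well as ID. One way to do that, sketched below with made-up values standing in for jsd_matrix, is to build a pandas MultiIndex from the 'group' and 'ID' columns and sort on it so that members of the same group sit next to each other. The variable names meta and labels are illustrative only.

```python
import numpy as np
import pandas as pd

# Stand-ins for df_combined[['ID', 'group']] and jsd_matrix (values are made up).
meta = pd.DataFrame({'ID': ['a', 'b', 'c'], 'group': [1, 2, 1]})
jsd_matrix = np.array([
    [0.0, 0.2, 0.5],
    [0.2, 0.0, 0.9],
    [0.5, 0.9, 0.0],
])

# Two-level labels: (group, ID) for every person.
labels = pd.MultiIndex.from_frame(meta[['group', 'ID']])
jsd_df = pd.DataFrame(jsd_matrix, index=labels, columns=labels)

# Sort rows and columns so that each group forms a contiguous block.
jsd_df = jsd_df.sort_index(axis=0).sort_index(axis=1)
print(jsd_df)
```

The sorted frame keeps the matrix square and symmetric, and it can be handed to a plotting function such as seaborn.heatmap, where the group blocks then show up along the diagonal.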