Split-half reliability using Jensen-Shannon divergence

Question

I have two pandas DataFrames in which each row is a person, and each person's response data is stored as a list:

df_1 = pd.DataFrame({'ID': ['a', 'b', 'c', 'd', 'e', 'f'], 'response': [["apple", "berry", "cherry"],
             ["pear", "pineapple", "plum"],
             ["blue_berry"],
             ["orange", "lemon"],
             ["tomato", "pumpkin"],
             ["avocado", "strawberry"]], 'group': [1, 2, 1, 2, 1, 2]})

df_2 = pd.DataFrame({'ID': ['A', 'B','C', 'D', 'E', 'F'], 'response': [["pear", "plum", "cherry"],
                 ["orange", "lemon", "lime", "pineapple"],
                 ["pumpkin"],
                 ["tomato", "strawberry"],
                 ["avocado", "apple"],
                 ["berry", "cherry", "apple"]], 'group': [1, 2, 1, 2, 1, 2]})

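I am trying to build a matrix whose row and column indices are each person's ID and group, and whose cells hold the pairwise Jensen-Shannon divergence scores computed from response. My end goal is to visualize this as a heatmap to assess the reliability between people's responses, but first I am struggling to get my data into the right matrix form.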

I don't know how to turn these DataFrames into a square matrix and then compute the JSD with the following function:

def jsdiv(P, Q):
    """Compute the Jensen-Shannon divergence between two probability distributions.

    Input
    -----
    P, Q : array-like
        Probability distributions of equal length that sum to 1
    """

    def _kldiv(A, B):
        return np.sum([v for v in A * np.log2(A/B) if not np.isnan(v)])

    P = np.array(P)
    Q = np.array(Q)

    M = 0.5 * (P + Q)

    return 0.5 * (_kldiv(P, M) +_kldiv(Q, M))

Answer 1

First, you need to combine the two DataFrames. I suggest the following approach:

import pandas as pd
import numpy as np
from scipy.spatial.distance import jensenshannon
from itertools import combinations

df_1 = pd.DataFrame({'ID': ['a', 'b', 'c', 'd', 'e', 'f'], 
                     'response': [["apple", "berry", "cherry"],
                                  ["pear", "pineapple", "plum"],
                                  ["blue_berry"],
                                  ["orange", "lemon"],
                                  ["tomato", "pumpkin"],
                                  ["avocado", "strawberry"]], 
                     'group': [1, 2, 1, 2, 1, 2]})

df_2 = pd.DataFrame({'ID': ['A', 'B', 'C', 'D', 'E', 'F'], 
                     'response': [["pear", "plum", "cherry"],
                                  ["orange", "lemon", "lime", "pineapple"],
                                  ["pumpkin"],
                                  ["tomato", "strawberry"],
                                  ["avocado", "apple"],
                                  ["berry", "cherry", "apple"]], 
                     'group': [1, 2, 1, 2, 1, 2]})

# Stack both DataFrames so that every person is one row
df_combined = pd.concat([df_1, df_2], axis=0).reset_index(drop=True)

# Vocabulary: every distinct item that appears in any response
all_fruits = set([fruit for sublist in pd.concat([df_1['response'], df_2['response']]).tolist() for fruit in sublist])

def response_to_prob_dist(response, all_fruits):
    # Relative frequency of each vocabulary item in this response (0 if absent)
    fruit_count = {fruit: response.count(fruit) / len(response) for fruit in all_fruits}
    return [fruit_count[fruit] if fruit in response else 0 for fruit in all_fruits]

df_combined['prob_dist'] = df_combined['response'].apply(lambda x: response_to_prob_dist(x, all_fruits))

This gives you a DataFrame like the following:

  ID                 response  group  \
0  a   [apple, berry, cherry]      1   
1  b  [pear, pineapple, plum]      2   
2  c             [blue_berry]      1   
3  d          [orange, lemon]      2   
4  e        [tomato, pumpkin]      1   

                                           prob_dist  
0  [0, 0, 0, 0, 0.3333333333333333, 0, 0, 0, 0, 0...  
1  [0, 0, 0, 0.3333333333333333, 0, 0, 0, 0, 0.33...  
2       [0, 0, 1.0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]  
3     [0, 0.5, 0, 0, 0, 0.5, 0, 0, 0, 0, 0, 0, 0, 0]  
4     [0.5, 0, 0, 0, 0, 0, 0.5, 0, 0, 0, 0, 0, 0, 0]

You can then apply your function:

def jsdiv(P, Q):
    """Compute the Jensen-Shannon divergence between two probability distributions.

    Input
    -----
    P, Q : array-like
        Probability distributions of equal length that sum to 1
    """

    def _kldiv(A, B):
        return np.sum([v for v in A * np.log2(A/B) if not np.isnan(v)])

    P = np.array(P)
    Q = np.array(Q)

    M = 0.5 * (P + Q)

    return 0.5 * (_kldiv(P, M) +_kldiv(Q, M))

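n = len(df_combined)
jsd_matrix = np.zeros((n, n))  # the diagonal stays 0: each person vs. themselves

# JSD is symmetric, so compute the upper triangle and mirror it
for i in range(n):
    for j in range(i+1, n):
        jsd_matrix[i, j] = jsdiv(df_combined['prob_dist'].iloc[i], df_combined['prob_dist'].iloc[j])
        jsd_matrix[j, i] = jsd_matrix[i, j]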

jsd_df = pd.DataFrame(jsd_matrix, index=df_combined['ID'], columns=df_combined['ID'])

jsd_df.head()

This gives you:

ID    a    b    c    d    e    f         A         B         C    D         E  \
ID                                                                              
a   0.0  1.0  1.0  1.0  1.0  1.0  0.666667  1.000000  1.000000  1.0  0.595437   
b   1.0  0.0  1.0  1.0  1.0  1.0  0.333333  0.712642  1.000000  1.0  1.000000   
c   1.0  1.0  0.0  1.0  1.0  1.0  1.000000  1.000000  1.000000  1.0  1.000000   
d   1.0  1.0  1.0  0.0  1.0  1.0  1.000000  0.311278  1.000000  1.0  1.000000   
e   1.0  1.0  1.0  1.0  0.0  1.0  1.000000  1.000000  0.311278  0.5  1.000000   

ID    F  
ID       
a   0.0  
b   1.0  
c   1.0  
d   1.0  
e   1.0
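To see where an entry such as 0.666667 for the pair (a, A) comes from, here is a minimal standard-library sketch that recomputes it by hand. The helper names to_dist and jsd_base2 are hypothetical, and the vocabulary is restricted to the items of these two responses, which is an assumption that does not change the result because items absent from both responses contribute nothing to the sum.

```python
import math
from collections import Counter

def to_dist(response, vocab):
    """Relative frequency of each vocabulary item in one response."""
    counts = Counter(response)
    return [counts[w] / len(response) for w in vocab]

def jsd_base2(p, q):
    """Jensen-Shannon divergence in bits between two distributions."""
    def kl(a, b):
        # Terms with a == 0 contribute 0 by convention, so skip them
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)
    m = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

resp_a = ["apple", "berry", "cherry"]   # person a from df_1
resp_A = ["pear", "plum", "cherry"]     # person A from df_2

vocab = sorted(set(resp_a) | set(resp_A))
p = to_dist(resp_a, vocab)
q = to_dist(resp_A, vocab)

value = jsd_base2(p, q)
print(round(value, 6))
```

The two responses share one of three items, so two thirds of each person's probability mass has no counterpart in the other response, and each such item contributes exactly one bit per unit of mass. Identical responses give 0 and responses with no shared items give 1, which matches the many 1.0 entries in the matrix.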

However, I don't quite follow your function. You can use jensenshannon from scipy.spatial.distance instead:

def jsdiv(P, Q):
    
    return jensenshannon(P, Q, base=2)**2  

n = len(df_combined)
jsd_matrix = np.zeros((n, n))  

for i in range(n):
    for j in range(i+1, n):  
        jsd_matrix[i, j] = jsdiv(df_combined['prob_dist'].iloc[i], df_combined['prob_dist'].iloc[j])
        jsd_matrix[j, i] = jsd_matrix[i, j]  

jsd_df = pd.DataFrame(jsd_matrix, index=df_combined['ID'], columns=df_combined['ID'])

print(jsd_df.head())

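That is, SciPy's function can be called directly. It returns the Jensen-Shannon distance, which is the square root of the divergence, hence the squaring.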

This gives you the same result:

ID    a    b    c    d    e    f         A         B         C    D         E  \
ID                                                                              
a   0.0  1.0  1.0  1.0  1.0  1.0  0.666667  1.000000  1.000000  1.0  0.595437   
b   1.0  0.0  1.0  1.0  1.0  1.0  0.333333  0.712642  1.000000  1.0  1.000000   
c   1.0  1.0  0.0  1.0  1.0  1.0  1.000000  1.000000  1.000000  1.0  1.000000   
d   1.0  1.0  1.0  0.0  1.0  1.0  1.000000  0.311278  1.000000  1.0  1.000000   
e   1.0  1.0  1.0  1.0  0.0  1.0  1.000000  1.000000  0.311278  0.5  1.000000   

ID    F  
ID       
a   0.0  
b   1.0  
c   1.0  
d   1.0  
e   1.0

Your definition of jsdiv overcomplicates things.
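The question also asked for rows and columns labeled by both ID and group, which the matrix above does not yet do. The following is one possible sketch, not the only way: it uses a four-person subset of the question's data (an assumption made to keep the example short), builds the pairwise matrix with scipy.spatial.distance.pdist and a callable metric, and labels both axes with a (group, ID) MultiIndex. The names vocab, X and labels are illustrative.

```python
import numpy as np
import pandas as pd
from scipy.spatial.distance import jensenshannon, pdist, squareform

# Illustrative subset of the question's data (assumption: four people only)
df = pd.DataFrame({
    "ID": ["a", "b", "A", "F"],
    "response": [["apple", "berry", "cherry"],
                 ["pear", "pineapple", "plum"],
                 ["pear", "plum", "cherry"],
                 ["berry", "cherry", "apple"]],
    "group": [1, 2, 1, 2],
})

# Sorted vocabulary so the column order is reproducible between runs
vocab = sorted({item for resp in df["response"] for item in resp})

# One row per person, one column per vocabulary item
X = np.array([[resp.count(w) / len(resp) for w in vocab]
              for resp in df["response"]])

# jensenshannon returns a distance; squaring it gives the divergence
condensed = pdist(X, metric=lambda p, q: jensenshannon(p, q, base=2) ** 2)
jsd = squareform(condensed)  # symmetric matrix with zeros on the diagonal

# Label both axes with (group, ID) and sort so each group forms a block
labels = pd.MultiIndex.from_frame(df[["group", "ID"]])
jsd_df = (pd.DataFrame(jsd, index=labels, columns=labels)
            .sort_index(axis=0)
            .sort_index(axis=1))

print(jsd_df.round(3))
```

Sorting by group first places people from the same group next to each other, so within-group and between-group blocks are easy to compare once the frame is passed to a heatmap function such as seaborn.heatmap.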
