Extracting correlations for gene subsets based on a key -> value data frame


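I have two data frames. The first contains a gene-gene correlation matrix, 1484 x 1484 (each cell holds the correlation between genes i and j). The second holds key -> value style information that maps each complex to a protein, and looks like this: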

                       Complex            Protein_ID
1                      BCL6-HDAC4 complex       Bcl6
125                    BCL6-HDAC5 complex      Hdac5
249                    BCL6-HDAC7 complex       Bcl6
373 Multisubunit ACTR coactivator complex      Ep300
497                   Condensin I complex       Smc2
621                                BLOC-3       Hps4

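I want to extract from my matrix the correlations between genes that belong to the same complex and store them in a new data frame, so that I end up with the gene-gene correlation values for each complex. Ideally it would look like this: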

#this is a simulated data.frame

                    Complex                                Correlation values
                    BCL6-HDAC4 complex                     0.64
                    BCL6-HDAC4 complex                     -0.25
                    Multisubunit ACTR coactivator complex  0.31
                    Multisubunit ACTR coactivator complex  0.30

Any ideas on how to get there?

r key-value
1 Answer
library(data.table) # >= V1.15.0
df <-
  melt(data.table(cors), variable.name = "col", value.name="cor" # matrix to long df
  )[, let(i = rleid(col), j = rowid(col))][, col := NULL         # add i and j cols
  ][i < j                                                        # distinct correlations
  ][, let(Complex = lkps$Complex[i], Complex2 = lkps$Complex[j]) # get Complex for i and j
  ][Complex == Complex2][, Complex2 := NULL]                     # keep where same Complex
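
The pipeline above looks up `lkps$Complex[i]` and `lkps$Complex[j]` by position, so it assumes that row k of `lkps` describes row/column k of `cors`. For comparison, here is a minimal base R sketch of the same positional idea without data.table. The names `complex_of`, `idx`, `same` and `res_base` are made up for this illustration, and the rows come out in a different order than in the data.table version.

```r
# Toy data built the same way as the example data in this answer
set.seed(1)
n_genes <- 10
cors <- cor(matrix(rnorm(n_genes * 50), nrow = 50, ncol = n_genes))
complex_of <- sample(c("Complex A", "Complex B", "Complex C"),
                     n_genes, replace = TRUE)

# All (row, col) index pairs above the diagonal: each gene pair exactly once
idx <- which(upper.tri(cors), arr.ind = TRUE)

# Keep only pairs whose two genes map to the same complex (positional lookup)
same <- complex_of[idx[, "row"]] == complex_of[idx[, "col"]]

res_base <- data.frame(
  cor     = cors[idx][same],               # matrix indexing by (row, col) pairs
  i       = idx[same, "row"],
  j       = idx[same, "col"],
  Complex = complex_of[idx[same, "row"]]
)
```

Each complex with n member genes contributes choose(n, 2) rows, which is a quick sanity check on the output.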

Example data (10 genes, 3 groups; only the first 6 columns of the correlation matrix are shown):

set.seed(1)
n_genes <- 10
cors <- cor(matrix(rnorm(n_genes * 50), nrow = 50, ncol = n_genes))
lkps <- data.frame(
  Complex = sample(c("Complex A", "Complex B", "Complex C"), n_genes, replace = TRUE),
  Protein_ID = replicate(n_genes, paste0(sample(c(letters, LETTERS), 4, replace = TRUE), collapse = "")))

> cors
             [,1]         [,2]         [,3]        [,4]         [,5]        [,6]
 [1,]  1.00000000 -0.039087178  0.026287227 -0.27185574  0.013674895 -0.11933102
 [2,] -0.03908718  1.000000000  0.003552006 -0.02391178  0.039833039  0.02218480
 [3,]  0.02628723  0.003552006  1.000000000  0.21648782  0.127791868  0.12197135
 [4,] -0.27185574 -0.023911775  0.216487818  1.00000000 -0.082713154 -0.24277681
 [5,]  0.01367489  0.039833039  0.127791868 -0.08271315  1.000000000  0.09888519
 [6,] -0.11933102  0.022184800  0.121971345 -0.24277681  0.098885194  1.00000000
 [7,]  0.19468192  0.006755358 -0.074116195  0.12591453  0.184806771 -0.14283941
 [8,] -0.14785348 -0.255064246 -0.054761988 -0.03252786  0.004459162  0.03851846
 [9,]  0.02336706  0.198299294  0.069506207  0.14657036  0.183043022 -0.10887799
[10,] -0.36678892  0.240101899  0.031648477  0.17387651  0.131315992 -0.12944992

> lkps
     Complex Protein_ID
1  Complex C       jMXs
2  Complex C       ruTw
3  Complex A       zoCU
4  Complex C       PCev
5  Complex A       aWvm
6  Complex B       vfRO
7  Complex A       GxvG
8  Complex B       jSsh
9  Complex B       lkpQ
10 Complex B       ufxz

Result:

            cor     i     j   Complex
          <num> <int> <int>    <char>
 1: -0.03908718     1     2 Complex C
 2: -0.27185574     1     4 Complex C
 3: -0.02391178     2     4 Complex C
 4:  0.12779187     3     5 Complex A
 5: -0.07411620     3     7 Complex A
 6:  0.18480677     5     7 Complex A
 7:  0.03851846     6     8 Complex B
 8: -0.10887799     6     9 Complex B
 9: -0.12944992     6    10 Complex B
10: -0.05267148     8     9 Complex B
11:  0.04892611     8    10 Complex B
12:  0.18778267     9    10 Complex B
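
In the question's lookup table the same protein (for example Bcl6) appears under more than one complex, so a purely positional match may not fit the real data. The following is a hedged base R sketch that matches by gene name instead. It assumes the correlation matrix carries gene names as row and column names matching `Protein_ID`, which the question does not confirm. The toy data, the Hdac4 row, and the names `cors_named`, `lkps_named` and `complex_cors` are invented for this illustration.

```r
# Hypothetical toy data: a named correlation matrix and a lookup table
set.seed(1)
genes <- c("Bcl6", "Hdac4", "Hdac5", "Ep300", "Smc2")
cors_named <- cor(matrix(rnorm(length(genes) * 50), nrow = 50,
                         dimnames = list(NULL, genes)))
lkps_named <- data.frame(
  Complex    = c("BCL6-HDAC4 complex", "BCL6-HDAC4 complex",
                 "BCL6-HDAC5 complex", "BCL6-HDAC5 complex",
                 "Condensin I complex"),
  Protein_ID = c("Bcl6", "Hdac4", "Bcl6", "Hdac5", "Smc2"),
  stringsAsFactors = FALSE
)

# For each complex, collect the pairwise correlations among its members
complex_cors <- function(cors, lkps) {
  by_complex <- split(as.character(lkps$Protein_ID), lkps$Complex)
  out <- lapply(names(by_complex), function(cx) {
    # Members of this complex that are actually present in the matrix
    ids <- intersect(unique(by_complex[[cx]]), rownames(cors))
    if (length(ids) < 2) return(NULL)      # fewer than two genes: no pair
    pairs <- t(combn(ids, 2))              # every unordered pair once
    data.frame(Complex = cx,
               gene_i  = pairs[, 1],
               gene_j  = pairs[, 2],
               cor     = cors[pairs],      # index by (rowname, colname) pairs
               stringsAsFactors = FALSE)
  })
  do.call(rbind, Filter(Negate(is.null), out))
}

res <- complex_cors(cors_named, lkps_named)
```

Here `res` has one row per within-complex gene pair, and a gene that belongs to several complexes contributes to each of them. Complexes with fewer than two genes present in the matrix are dropped.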