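I have two data frames. The first contains a gene-gene correlation matrix, 1484 x 1484 (each cell holds the correlation between genes i and j). The second holds key -> value information mapping each complex to its proteins, and it looks like this: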
    Complex                                Protein_ID
1   BCL6-HDAC4 complex                     Bcl6
125 BCL6-HDAC5 complex                     Hdac5
249 BCL6-HDAC7 complex                     Bcl6
373 Multisubunit ACTR coactivator complex  Ep300
497 Condensin I complex                    Smc2
621 BLOC-3                                 Hps4
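I would like to extract from my matrix the correlations between genes that belong to the same complex and store them in a new data frame that holds the gene-gene correlation values for each complex. Ideally it would look like this: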
# this is a simulated data.frame
Complex                                 Correlation values
BCL6-HDAC4 complex                       0.64
BCL6-HDAC4 complex                      -0.25
Multisubunit ACTR coactivator complex    0.31
Multisubunit ACTR coactivator complex    0.30
Any ideas on how to get there?
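library(data.table) # >= 1.15.0, needed for let()
# assumes row i of lkps describes row/column i of cors
df <-
  melt(data.table(cors), measure.vars = seq_len(ncol(cors)),      # naming measure.vars avoids the guessing warning
       variable.name = "col", value.name = "cor"                  # matrix to long format
  )[, let(i = rleid(col), j = rowid(col))][, col := NULL          # i = column index, j = row index
  ][i < j                                                         # keep each gene pair once, drop the diagonal
  ][, let(Complex = lkps$Complex[i], Complex2 = lkps$Complex[j])  # look up the Complex of genes i and j
  ][Complex == Complex2][, Complex2 := NULL]                      # keep pairs within the same Complex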
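The chain above matches genes to complexes by position. In the question's lookup table a gene can appear in more than one complex (Bcl6 is listed twice), and the real matrix presumably carries gene names. A minimal base R sketch of a name-based variant follows, assuming the matrix has gene names as dimnames that match `Protein_ID`. The objects `cors_named`, `lkps_multi`, and `res_named` are hypothetical toy data made up for illustration, not the asker's data.

```r
# Hypothetical toy inputs: a correlation matrix with gene names as dimnames,
# and a lookup in which g1 belongs to two complexes.
set.seed(1)
genes <- paste0("g", 1:5)
expr <- matrix(rnorm(50 * 5), nrow = 50, ncol = 5, dimnames = list(NULL, genes))
cors_named <- cor(expr)
lkps_multi <- data.frame(
  Complex    = c("A", "A", "A", "B", "B", "B", "C"),
  Protein_ID = c("g1", "g2", "g3", "g1", "g4", "not_in_matrix", "g5")
)

# One character vector of gene names per complex
groups <- split(lkps_multi$Protein_ID, lkps_multi$Complex)

parts <- lapply(names(groups), function(cx) {
  # Keep only genes that are present in the matrix, each once
  ids <- intersect(unique(groups[[cx]]), rownames(cors_named))
  if (length(ids) < 2) return(NULL)             # no gene pair to report
  sub <- cors_named[ids, ids, drop = FALSE]     # sub-matrix for this complex
  idx <- which(upper.tri(sub), arr.ind = TRUE)  # each pair once, no diagonal
  data.frame(Complex = cx,
             gene1 = ids[idx[, "row"]],
             gene2 = ids[idx[, "col"]],
             cor = sub[idx])
})

# Stack the per-complex pieces, skipping complexes without a pair
res_named <- do.call(rbind, Filter(Negate(is.null), parts))
res_named
```

In this sketch a gene that belongs to several complexes contributes pairs to each of them, and complexes with fewer than two genes in the matrix are dropped.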
Sample data (10 genes, 3 groups; only the first 6 columns of the correlation matrix are shown):
set.seed(1)
n_genes <- 10
cors <- cor(matrix(rnorm(n_genes * 50), nrow = 50, ncol = n_genes))
lkps <- data.frame(
Complex = sample(c("Complex A", "Complex B", "Complex C"), n_genes, replace = TRUE),
Protein_ID = replicate(n_genes, paste0(sample(c(letters, LETTERS), 4, replace = TRUE), collapse = "")))
> cors
[,1] [,2] [,3] [,4] [,5] [,6]
[1,] 1.00000000 -0.039087178 0.026287227 -0.27185574 0.013674895 -0.11933102
[2,] -0.03908718 1.000000000 0.003552006 -0.02391178 0.039833039 0.02218480
[3,] 0.02628723 0.003552006 1.000000000 0.21648782 0.127791868 0.12197135
[4,] -0.27185574 -0.023911775 0.216487818 1.00000000 -0.082713154 -0.24277681
[5,] 0.01367489 0.039833039 0.127791868 -0.08271315 1.000000000 0.09888519
[6,] -0.11933102 0.022184800 0.121971345 -0.24277681 0.098885194 1.00000000
[7,] 0.19468192 0.006755358 -0.074116195 0.12591453 0.184806771 -0.14283941
[8,] -0.14785348 -0.255064246 -0.054761988 -0.03252786 0.004459162 0.03851846
[9,] 0.02336706 0.198299294 0.069506207 0.14657036 0.183043022 -0.10887799
[10,] -0.36678892 0.240101899 0.031648477 0.17387651 0.131315992 -0.12944992
> lkps
Complex Protein_ID
1 Complex C jMXs
2 Complex C ruTw
3 Complex A zoCU
4 Complex C PCev
5 Complex A aWvm
6 Complex B vfRO
7 Complex A GxvG
8 Complex B jSsh
9 Complex B lkpQ
10 Complex B ufxz
Result:
cor i j Complex
<num> <int> <int> <char>
1: -0.03908718 1 2 Complex C
2: -0.27185574 1 4 Complex C
3: -0.02391178 2 4 Complex C
4: 0.12779187 3 5 Complex A
5: -0.07411620 3 7 Complex A
6: 0.18480677 5 7 Complex A
7: 0.03851846 6 8 Complex B
8: -0.10887799 6 9 Complex B
9: -0.12944992 6 10 Complex B
10: -0.05267148 8 9 Complex B
11: 0.04892611 8 10 Complex B
12: 0.18778267 9 10 Complex B
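As a cross-check without data.table, the same positional logic can be sketched in base R: build a logical matrix marking pairs that share a complex, restrict it to the upper triangle, and read off the matching cells. This is only an illustrative alternative, assuming (as the data.table version does) that row i of `lkps` describes row/column i of `cors`. The names `same_complex` and `res_base` are introduced here for the sketch.

```r
# Recreate the sample data so the block runs on its own
set.seed(1)
n_genes <- 10
cors <- cor(matrix(rnorm(n_genes * 50), nrow = 50, ncol = n_genes))
lkps <- data.frame(
  Complex = sample(c("Complex A", "Complex B", "Complex C"), n_genes, replace = TRUE),
  Protein_ID = replicate(n_genes, paste0(sample(c(letters, LETTERS), 4, replace = TRUE), collapse = ""))
)

# TRUE where genes i and j share a Complex, restricted to i < j
same_complex <- outer(lkps$Complex, lkps$Complex, "==") & upper.tri(cors)

# Row/column positions of the TRUE cells
idx <- which(same_complex, arr.ind = TRUE)

res_base <- data.frame(
  cor = cors[idx],
  i = idx[, "row"],
  j = idx[, "col"],
  Complex = lkps$Complex[idx[, "row"]]
)

# Sort by i, then j, to mirror the ordering of the data.table result
res_base <- res_base[order(res_base$i, res_base$j), ]
res_base
```

Because the correlation matrix is symmetric, taking the upper triangle here and the `i < j` filter in the data.table chain select the same set of gene pairs.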