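I am running a GSEA analysis (with the clusterProfiler package) and want to perform a leading-edge analysis. To do that, I need to extract the raw data from the gseaResult object.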
#FYI my code looks like this:
GSEA_GO <- gseGO(geneList=gene_list, keyType = "SYMBOL", OrgDb = org.Hs.eg.db)
View(data.frame(GSEA_GO@result))
#after extraction and data transformation, this is a reprex of what I end with:
#one letter being a gene name (included in the leading edge), and "GSx" being a gene set
GS1 <- c("a", "b", "c", "d", "e", "f")
GS2 <- c("b", "c", "d", "e", "f", "g")
GS3 <- c("a", "b", "c", NA,NA,NA)
GS4 <- c("a", "d", "e", "g", NA, NA)
GS5 <- c("a", "b", "c", "d", NA, NA)
df <- data.frame(rbind(GS1, GS2, GS3, GS4, GS5))
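To go further, I have to convert this table into another one in which each column is a gene and each cell records whether that gene is present (= 1) or absent (= 0) in the gene set (i.e. the row).
Of course, I have hundreds of genes and hundreds of gene sets, and I don't want to do everything by hand with ifelse. Can anyone point me in the right direction? Thanks!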
There may be a more elegant way to do this, but I would melt the table and then cast it again:
# create id column
df$id <- rownames(df)
# melted
df_melt <- df |>
data.table::as.data.table() |>
data.table::melt(id.vars = "id") |>
na.omit()
> head(df_melt)
id variable value
1: GS1 X1 a
2: GS2 X1 b
3: GS3 X1 a
4: GS4 X1 a
5: GS5 X1 a
6: GS1 X2 b
Then you can cast it back to wide format, with one column per gene:
# wide
df_wide <- data.table::dcast(df_melt, id ~ value)
> df_wide
id a b c d e f g
1: GS1 a b c d e f <NA>
2: GS2 <NA> b c d e f g
3: GS3 a b c <NA> <NA> <NA> <NA>
4: GS4 a <NA> <NA> d e <NA> g
5: GS5 a b c d <NA> <NA> <NA>
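As an aside, the cast itself can be asked to count occurrences instead of carrying the gene names over, which should give a 0/1 table directly. This is only a sketch: it assumes each gene appears at most once per gene set (otherwise a count could exceed 1), and `df_bin` is just an illustrative name.

```r
library(data.table)

# Same toy data as in the question
GS1 <- c("a", "b", "c", "d", "e", "f")
GS2 <- c("b", "c", "d", "e", "f", "g")
GS3 <- c("a", "b", "c", NA, NA, NA)
GS4 <- c("a", "d", "e", "g", NA, NA)
GS5 <- c("a", "b", "c", "d", NA, NA)
df <- data.frame(rbind(GS1, GS2, GS3, GS4, GS5))
df$id <- rownames(df)

# Long format: one row per (gene set, gene) pair, padding NAs dropped
df_melt <- na.omit(melt(as.data.table(df), id.vars = "id"))

# Count how often each gene occurs in each gene set;
# combinations that never occur are filled with length(<empty>) = 0
df_bin <- dcast(df_melt, id ~ value,
                fun.aggregate = length, value.var = "value")
```

With the toy data, `df_bin` has one row per gene set and one integer column per gene.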
Then, in df_wide, you can recode every column (except id) to 1 if the gene is present and 0 if it is absent.
# get the gene (letter) columns only; take them from df_wide, since df still has columns X1..X6
genes <- colnames(df_wide)[colnames(df_wide) != "id"]
# change all gene cols to be 1 if present, 0 if absent
df_wide[, (genes) := lapply(.SD, function(x) ifelse(is.na(x), 0, 1)), .SDcols = genes]
> df_wide
id a b c d e f g
1: GS1 1 1 1 1 1 1 0
2: GS2 0 1 1 1 1 1 1
3: GS3 1 1 1 0 0 0 0
4: GS4 1 0 0 1 1 0 1
5: GS5 1 1 1 1 0 0 0
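If you would rather skip the padded intermediate table, the same matrix can probably be built straight from the GSEA result. The sketch below uses base R only and assumes the result table has an `ID` column and a `core_enrichment` column holding the leading-edge genes as a "/"-separated string; `res` is a small mock standing in for `GSEA_GO@result`, and `gene_sets`, `long` and `presence` are illustrative names.

```r
# Mock of the two relevant columns; with real data this would be GSEA_GO@result
res <- data.frame(
  ID = c("GS1", "GS2", "GS3"),
  core_enrichment = c("a/b/c/d/e/f", "b/c/d/e/f/g", "a/b/c"),
  stringsAsFactors = FALSE
)

# Split each "/"-separated string into a vector of genes, named by gene set
gene_sets <- setNames(strsplit(res$core_enrichment, "/", fixed = TRUE), res$ID)

# Long table: 'values' holds the gene, 'ind' the gene set it came from
long <- stack(gene_sets)

# Cross-tabulate gene set by gene, then cap at 1 in case a gene is listed twice
counts <- unclass(table(long$ind, long$values))
presence <- (counts > 0) * 1L
```

`presence` is an integer matrix with gene sets as row names and genes as column names, so no NA padding or ifelse recoding is needed.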