Split a data.frame column into other columns

Question

I have a large data.frame with several columns, but my 9th column contains semicolon-separated data:

    gtf$V9
    1                 gene_id CUFF.1; transcript_id CUFF.1.1; FPKM 7.0762407256; frac 1.000000; conf_lo 4.347062; conf_hi 9.805420; cov 25.616962;
    2  gene_id CUFF.1; transcript_id CUFF.1.1; exon_number 1; FPKM 7.0762407256; frac 1.000000; conf_lo 4.347062; conf_hi 9.805420; cov 25.616962;
    3  gene_id CUFF.1; transcript_id CUFF.1.1; exon_number 2; FPKM 7.0762407256; frac 1.000000; conf_lo 4.347062; conf_hi 9.805420; cov 25.616962;
    4  gene_id CUFF.1; transcript_id CUFF.1.1; exon_number 3; FPKM 7.0762407256; frac 1.000000; conf_lo 4.347062; conf_hi 9.805420; cov 25.616962;

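So I would like to split this column into separate columns and later merge them with the rest of the data.frame (the columns before the 9th).

I tried some code, without success:

    head(gtf$V9, sep = ";",stringsAsFactors = FALSE)

or

    new_df <- matrix(gtf$V9, ncol=7, byrow=TRUE) # sep = ";"

The same goes for as.data.frame, data.frame and as.matrix.

I also tried write.csv and re-importing the file with sep=";", but the data.frame is too large and my computer lags.

Any suggestions?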

Answer 1

Another option is to use the splitstackshape package (which also loads data.table). Using:

    library(splitstackshape)
    cSplit(cSplit(df, 'V9', sep = ';', direction = 'long'), 'V9', sep = ' ')[, dcast(.SD, cumsum(V9_1 == 'gene_id') ~ V9_1)]

gives:

       V9_1  conf_hi  conf_lo       cov exon_number         FPKM     frac gene_id transcript_id
    1:    1 9.805420 4.347062 25.616962          NA 7.0762407256 1.000000  CUFF.1      CUFF.1.1
    2:    2 9.805420 4.347062 25.616962           1 7.0762407256 1.000000  CUFF.1      CUFF.1.1
    3:    3 9.805420 4.347062 25.616962           2 7.0762407256 1.000000  CUFF.1      CUFF.1.1
    4:    4 9.805420 4.347062 25.616962           3 7.0762407256 1.000000  CUFF.1      CUFF.1.1

Answer 2

You can do a strsplit() inside sapply().

If you know how many objects there are in V9, you can run a for loop over them:

    # V9 is separated by semicolons, so split on ';'
    for (i in 1:number_of_max_objects_in_V9) {
      gtf[ncol(gtf)+1] = sapply(1:nrow(gtf), function(x) strsplit(gtf$V9[x], ';')[[1]][i])
    }

If you don't know how many objects V9 has, just run str_count on the ';' in gtf$V9, like this:

    library(stringr)
    # str_count is vectorised, so it can be applied to the whole column at once
    number_of_max_objects_in_V9 <- max(str_count(gtf$V9, ';'))

Answer 3

    # example dataset (only variable of interest included)
    df = data.frame(V9=c("gene_id CUFF.1; transcript_id CUFF.1.1; FPKM 7.0762407256; frac 1.000000; conf_lo 4.347062; conf_hi 9.805420; cov 25.616962;",
                         "gene_id CUFF.1; transcript_id CUFF.1.1; exon_number 1; FPKM 7.0762407256; frac 1.000000; conf_lo 4.347062; conf_hi 9.805420; cov 25.616962;",
                         "gene_id CUFF.1; transcript_id CUFF.1.1; exon_number 2; FPKM 7.0762407256; frac 1.000000; conf_lo 4.347062; conf_hi 9.805420; cov 25.616962;",
                         "gene_id CUFF.1; transcript_id CUFF.1.1; exon_number 3; FPKM 7.0762407256; frac 1.000000; conf_lo 4.347062; conf_hi 9.805420; cov 25.616962;"),
                    stringsAsFactors = FALSE)

    library(dplyr)
    library(tidyr)

    df %>%
      mutate(id = row_number()) %>%                  # flag row ids (will need those to reshape data later)
      separate_rows(V9, sep="; ") %>%                # split strings and create new rows
      separate(V9, c("name","value"), sep=" ") %>%   # separate column name from value
      mutate(value = gsub(";","",value)) %>%         # remove ; when necessary
      spread(name, value)                            # reshape data

    #   id  conf_hi  conf_lo       cov exon_number         FPKM     frac gene_id transcript_id
    # 1  1 9.805420 4.347062 25.616962        <NA> 7.0762407256 1.000000  CUFF.1      CUFF.1.1
    # 2  2 9.805420 4.347062 25.616962           1 7.0762407256 1.000000  CUFF.1      CUFF.1.1
    # 3  3 9.805420 4.347062 25.616962           2 7.0762407256 1.000000  CUFF.1      CUFF.1.1
    # 4  4 9.805420 4.347062 25.616962           3 7.0762407256 1.000000  CUFF.1      CUFF.1.1

You can join this dataset back to your initial dataset using the row ids (id). You will also need to create the id column in your original dataset.
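The join on row ids might look roughly like the sketch below. It is only an illustration under assumptions: the small `gtf` built here is a made-up stand-in for the real data (the `V1` column and the shortened `V9` strings are hypothetical), and `attrs` and `result` are names chosen for this example.

```r
library(dplyr)
library(tidyr)

# Hypothetical stand-in for the real gtf: V1 is an invented extra column,
# V9 holds shortened versions of the attribute strings from the question.
gtf <- data.frame(
  V1 = c("chr1", "chr1"),
  V9 = c("gene_id CUFF.1; transcript_id CUFF.1.1; FPKM 7.0762407256;",
         "gene_id CUFF.1; transcript_id CUFF.1.1; exon_number 1; FPKM 7.0762407256;"),
  stringsAsFactors = FALSE
)

# Create the row id in the original data first, so both tables share a key
gtf <- gtf %>% mutate(id = row_number())

# Same split-and-reshape steps as in the answer, applied to id and V9 only
attrs <- gtf %>%
  select(id, V9) %>%
  separate_rows(V9, sep = "; ") %>%
  separate(V9, c("name", "value"), sep = " ") %>%
  mutate(value = gsub(";", "", value)) %>%
  spread(name, value)

# Drop the raw V9 column and attach the new attribute columns by id
result <- gtf %>%
  select(-V9) %>%
  left_join(attrs, by = "id")
```

All of the new columns come out as character, so numeric fields such as FPKM would still need converting (for example with as.numeric) before doing arithmetic on them.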
