Split a data.frame column into other columns

Question  Votes: 2  Answers: 3

I have a large data.frame with several columns, but my 9th column holds semicolon-separated data:

    gtf$V9
1                 gene_id CUFF.1; transcript_id CUFF.1.1; FPKM 7.0762407256; frac 1.000000; conf_lo 4.347062; conf_hi 9.805420; cov 25.616962;
2  gene_id CUFF.1; transcript_id CUFF.1.1; exon_number 1; FPKM 7.0762407256; frac 1.000000; conf_lo 4.347062; conf_hi 9.805420; cov 25.616962;
3  gene_id CUFF.1; transcript_id CUFF.1.1; exon_number 2; FPKM 7.0762407256; frac 1.000000; conf_lo 4.347062; conf_hi 9.805420; cov 25.616962;
4  gene_id CUFF.1; transcript_id CUFF.1.1; exon_number 3; FPKM 7.0762407256; frac 1.000000; conf_lo 4.347062; conf_hi 9.805420; cov 25.616962;

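So I want to split this column into other columns and later merge them with the rest of the data.frame (the other columns before the 9th).

I tried some code, without success: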

head(gtf$V9, sep = ";",stringsAsFactors = FALSE) 

or

new_df <- matrix(gtf$V9, ncol=7, byrow=TRUE) # sep = ";"

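The same happens with as.data.frame, data.frame and as.matrix.

I also tried write.csv and then importing the file with sep=";", but the data.frame is too large and my computer lags.

Any suggestions?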

r dataframe split multiple-columns
3 Answers
3
votes

Another option is to use the splitstackshape package (which also loads data.table). Using:

    library(splitstackshape)

    # split V9 into rows on ';', split each "key value" pair on ' ',
    # then cast back to wide; every 'gene_id' marks the start of a new source row
    cSplit(cSplit(df, 'V9', sep = ';', direction = 'long'), 'V9', sep = ' ')[
      , dcast(.SD, cumsum(V9_1 == 'gene_id') ~ V9_1)]

gives:

       V9_1  conf_hi  conf_lo       cov exon_number         FPKM     frac gene_id transcript_id
    1:    1 9.805420 4.347062 25.616962          NA 7.0762407256 1.000000  CUFF.1      CUFF.1.1
    2:    2 9.805420 4.347062 25.616962           1 7.0762407256 1.000000  CUFF.1      CUFF.1.1
    3:    3 9.805420 4.347062 25.616962           2 7.0762407256 1.000000  CUFF.1      CUFF.1.1
    4:    4 9.805420 4.347062 25.616962           3 7.0762407256 1.000000  CUFF.1      CUFF.1.1

1
vote

You can do a strsplit() inside sapply().
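As a minimal base R sketch of strsplit() inside sapply(): the snippet below splits each string on ';' and pulls out one field by key rather than by position. Looking up by key matters for this data because exon_number is missing from the first row, so the same position can hold different fields in different rows. The vector v9 and the helper get_field are illustrative names that do not appear in the answer, and the sketch assumes every field has the form "key value".

```r
# Two attribute strings in the same format as gtf$V9 (shortened for the example)
v9 <- c(
  "gene_id CUFF.1; transcript_id CUFF.1.1; FPKM 7.0762407256;",
  "gene_id CUFF.1; transcript_id CUFF.1.1; exon_number 1; FPKM 7.0762407256;"
)

# get_field is a hypothetical helper: for every string, return the value
# stored under `key`, or NA where the key is absent.
get_field <- function(x, key) {
  sapply(strsplit(x, ";", fixed = TRUE), function(fields) {
    fields <- trimws(fields)                              # drop the space after each ';'
    hit <- fields[startsWith(fields, paste0(key, " "))]   # fields that begin with "key "
    if (length(hit) == 0) NA_character_ else sub("^[^ ]+ +", "", hit[1])
  })
}

fpkm <- get_field(v9, "FPKM")
exon <- get_field(v9, "exon_number")

# For comparison: the third field by position is a different key in each row
third_by_position <- sapply(strsplit(v9, ";", fixed = TRUE),
                            function(fields) trimws(fields[3]))
```

Here exon is NA for the first string, while third_by_position mixes an FPKM field with an exon_number field, which is the misalignment a purely positional split would produce.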

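If you know how many objects there are in V9, you can run a for loop over it:

    # append one new column per field; the fields in this data are separated by ';'
    for (i in seq_len(number_of_max_objects_in_V9)) {
      gtf[ncol(gtf)+1] = sapply(seq_len(nrow(gtf)), function(x) trimws(strsplit(gtf$V9[x], ';')[[1]][i]))
    }

If you don't know how many objects V9 has, just run str_count on the separator in gtf$V9 (';' for this data), like this: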

    library(stringr)
    # count the ';' separators in each string and keep the largest count
    number_of_max_objects_in_V9 <- max(str_count(gtf$V9, ';'))

1
vote

    # example dataset (only variable of interest included)
    df = data.frame(V9=c("gene_id CUFF.1; transcript_id CUFF.1.1; FPKM 7.0762407256; frac 1.000000; conf_lo 4.347062; conf_hi 9.805420; cov 25.616962;",
                         "gene_id CUFF.1; transcript_id CUFF.1.1; exon_number 1; FPKM 7.0762407256; frac 1.000000; conf_lo 4.347062; conf_hi 9.805420; cov 25.616962;",
                         "gene_id CUFF.1; transcript_id CUFF.1.1; exon_number 2; FPKM 7.0762407256; frac 1.000000; conf_lo 4.347062; conf_hi 9.805420; cov 25.616962;",
                         "gene_id CUFF.1; transcript_id CUFF.1.1; exon_number 3; FPKM 7.0762407256; frac 1.000000; conf_lo 4.347062; conf_hi 9.805420; cov 25.616962;"),
                    stringsAsFactors = FALSE)

    library(dplyr)
    library(tidyr)

    df %>%
      mutate(id = row_number()) %>%                   # flag row ids (will need those to reshape data later)
      separate_rows(V9, sep="; ") %>%                 # split strings and create new rows
      separate(V9, c("name","value"), sep=" ") %>%    # separate column name from value
      mutate(value = gsub(";","",value)) %>%          # remove ; when necessary
      spread(name, value)                             # reshape data

    #   id  conf_hi  conf_lo       cov exon_number         FPKM     frac gene_id transcript_id
    # 1  1 9.805420 4.347062 25.616962        <NA> 7.0762407256 1.000000  CUFF.1      CUFF.1.1
    # 2  2 9.805420 4.347062 25.616962           1 7.0762407256 1.000000  CUFF.1      CUFF.1.1
    # 3  3 9.805420 4.347062 25.616962           2 7.0762407256 1.000000  CUFF.1      CUFF.1.1
    # 4  4 9.805420 4.347062 25.616962           3 7.0762407256 1.000000  CUFF.1      CUFF.1.1

You can join this dataset back to the initial dataset using the row ids (id). You will also need to create id in the original dataset.
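Joining the parsed columns back by row id could look roughly like the base R sketch below. It is only an illustration under stated assumptions: the columns V1 and V4 and their values are made up to stand in for the columns before the 9th, the attribute strings are shortened, and parsed is a hand-written stand-in for the reshaped result rather than actual output of the code above.

```r
# Stand-in for the original table. V1 and V4 are hypothetical leading columns;
# V9 holds the attribute strings.
gtf <- data.frame(
  V1 = c("chr1", "chr1"),
  V4 = c(100L, 100L),
  V9 = c("gene_id CUFF.1; FPKM 7.0762407256;",
         "gene_id CUFF.1; exon_number 1; FPKM 7.0762407256;"),
  stringsAsFactors = FALSE
)

# Create the same row id in the original table that the reshaping step uses.
gtf$id <- seq_len(nrow(gtf))

# Stand-in for the reshaped result: one row per id, one column per key.
parsed <- data.frame(
  id = 1:2,
  exon_number = c(NA, "1"),
  FPKM = c("7.0762407256", "7.0762407256"),
  gene_id = c("CUFF.1", "CUFF.1"),
  stringsAsFactors = FALSE
)

# Drop the raw V9 column and join the parsed columns back on id.
combined <- merge(gtf[, names(gtf) != "V9"], parsed, by = "id")
print(combined)
```

merge() is used here only to keep the sketch free of extra packages; with dplyr already loaded, left_join(gtf, parsed, by = "id") would be an equivalent way to do the same join.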
