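I have a dataset in .txt format with no header row, and some rows have more columns than others. I've copied part of the dataset below. When I import it with readr::read_table(), the values do not line up in the same columns, and some columns end up containing values that belong to other columns. Is there a way to fix this?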
>chr22_KI270739v1_random AC:KI270739.1 gi:568335374 LN:73985 rg:chr22 rl:unlocalized M5:760fbd73515fedcc9f37737c4a722d6a AS:GRCh38
>chrY_KI270740v1_random AC:KI270740.1 gi:568335373 LN:37240 rg:chrY rl:unlocalized M5:69e42252aead509bf56f1ea6fda91405 AS:GRCh38
>chrUn_KI270302v1 AC:KI270302.1 gi:568335372 LN:2274 rl:unplaced M5:ee6dff38036f7d03478c70717643196e AS:GRCh38
>chrUn_KI270304v1 AC:KI270304.1 gi:568335371 LN:2165 rl:unplaced M5:9423c1b46a48aa6331a77ab5c702ac9d AS:GRCh38
I tried this:
all_chr_GATKbundle <- read_table("all_ch_GATKbundle.txt", col_names = F)
and the result looks like this:
24 >chrY AC:CM000686.2 gi:568336000 LN:57227415 rl:Chromosome M5:ce3e31103314a704255f3cd90369ecce AS:GRCh38
25 >chrM AC:J01415.2 gi:113200490 LN:16569 rl:Mitochondrion M5:c68f52674c9fb33aef52dcf399755519 AS:GRCh38
26 >chr1_KI270706v1_random AC:KI270706.1 gi:568335410 LN:175055 rg:chr1 rl:unlocalized M5:62def1a794b3e18192863d187af956e6
27 >chr1_KI270707v1_random AC:KI270707.1 gi:568335409 LN:32032 rg:chr1 rl:unlocalized M5:78135804eb15220565483b7cdd02f3be
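This is not proper/complete tabular data: in the sample, the first two lines have eight space-delimited elements, while the next (last) two lines have seven. Because of that, read.table, read_table, and most of the read* functions are going to have problems. It is also not fixed-width.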
It looks more like a variable-length set of key:value pairs attached to a single (first) chr* element.
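A quick way to confirm the ragged structure is to count the whitespace-separated elements on each line. This is only a minimal sketch using two lines taken from the sample above; the variable names sample_lines and n_fields are made up for illustration.

```r
# Two lines from the sample data: one with an rg: field, one without.
sample_lines <- c(
  ">chr22_KI270739v1_random AC:KI270739.1 gi:568335374 LN:73985 rg:chr22 rl:unlocalized M5:760fbd73515fedcc9f37737c4a722d6a AS:GRCh38",
  ">chrUn_KI270302v1 AC:KI270302.1 gi:568335372 LN:2274 rl:unplaced M5:ee6dff38036f7d03478c70717643196e AS:GRCh38"
)

# Number of whitespace-separated elements on each line
n_fields <- lengths(strsplit(sample_lines, "\\s+"))
n_fields
# [1] 8 7
```

If the counts differ between lines, as they do here, a plain rectangular reader cannot assign the values to consistent columns.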
You can read it into a "long" form:
txt <- c(
  ">chr22_KI270739v1_random AC:KI270739.1 gi:568335374 LN:73985 rg:chr22 rl:unlocalized M5:760fbd73515fedcc9f37737c4a722d6a AS:GRCh38",
  ">chrY_KI270740v1_random AC:KI270740.1 gi:568335373 LN:37240 rg:chrY rl:unlocalized M5:69e42252aead509bf56f1ea6fda91405 AS:GRCh38",
  ">chrUn_KI270302v1 AC:KI270302.1 gi:568335372 LN:2274 rl:unplaced M5:ee6dff38036f7d03478c70717643196e AS:GRCh38",
  ">chrUn_KI270304v1 AC:KI270304.1 gi:568335371 LN:2165 rl:unplaced M5:9423c1b46a48aa6331a77ab5c702ac9d AS:GRCh38"
)
# txt <- readLines("all_ch_GATKbundle.txt")  # use this for the real file
# Split each line on whitespace: the first element is the record name,
# the remaining elements are key:value pairs captured into a (key) and b (value).
outlong <- strsplit(txt, "\\s+") |>
  lapply(function(X)
    cbind(
      data.frame(chr = X[1]),
      strcapture("(.*):(.*)", X[-1], list(a="",b=""))
    )
  ) |>
  do.call(rbind.data.frame, args = _)  # the `_` placeholder needs R >= 4.2
outlong
# chr a b
# 1 >chr22_KI270739v1_random AC KI270739.1
# 2 >chr22_KI270739v1_random gi 568335374
# 3 >chr22_KI270739v1_random LN 73985
# 4 >chr22_KI270739v1_random rg chr22
# 5 >chr22_KI270739v1_random rl unlocalized
# 6 >chr22_KI270739v1_random M5 760fbd73515fedcc9f37737c4a722d6a
# 7 >chr22_KI270739v1_random AS GRCh38
# 8 >chrY_KI270740v1_random AC KI270740.1
# 9 >chrY_KI270740v1_random gi 568335373
# 10 >chrY_KI270740v1_random LN 37240
# 11 >chrY_KI270740v1_random rg chrY
# 12 >chrY_KI270740v1_random rl unlocalized
# 13 >chrY_KI270740v1_random M5 69e42252aead509bf56f1ea6fda91405
# 14 >chrY_KI270740v1_random AS GRCh38
# 15 >chrUn_KI270302v1 AC KI270302.1
# 16 >chrUn_KI270302v1 gi 568335372
# 17 >chrUn_KI270302v1 LN 2274
# 18 >chrUn_KI270302v1 rl unplaced
# 19 >chrUn_KI270302v1 M5 ee6dff38036f7d03478c70717643196e
# 20 >chrUn_KI270302v1 AS GRCh38
# 21 >chrUn_KI270304v1 AC KI270304.1
# 22 >chrUn_KI270304v1 gi 568335371
# 23 >chrUn_KI270304v1 LN 2165
# 24 >chrUn_KI270304v1 rl unplaced
# 25 >chrUn_KI270304v1 M5 9423c1b46a48aa6331a77ab5c702ac9d
# 26 >chrUn_KI270304v1 AS GRCh38
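One caveat about the "(.*):(.*)" pattern used above: the first group takes as much as it can, so a value that itself contained a colon would be split at the last colon rather than the first. None of the sample values contain a colon, so this does not matter for the data shown. The sketch below uses a made-up element, "k:v:w", purely to illustrate the difference; the stricter pattern "([^:]+):(.*)" is a suggestion, not something taken from the original data.

```r
# Hypothetical element whose value contains a colon (not from the real file)
x <- "k:v:w"
proto <- data.frame(a = character(), b = character())

# Original pattern: the key group extends up to the last colon
greedy <- strcapture("(.*):(.*)", x, proto)

# Stricter pattern: the key stops at the first colon
strict <- strcapture("([^:]+):(.*)", x, proto)

greedy
#     a b
# 1 k:v w
strict
#   a   b
# 1 k v:w
```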
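If you need a wide format, keep in mind that some fields (the keys in column a here, such as rg) are not present in every chr record, so those cells come out as NA: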
reshape2::dcast(outlong, chr ~ a, value.var = "b")
# chr AC AS gi LN M5 rg rl
# 1 >chr22_KI270739v1_random KI270739.1 GRCh38 568335374 73985 760fbd73515fedcc9f37737c4a722d6a chr22 unlocalized
# 2 >chrUn_KI270302v1 KI270302.1 GRCh38 568335372 2274 ee6dff38036f7d03478c70717643196e <NA> unplaced
# 3 >chrUn_KI270304v1 KI270304.1 GRCh38 568335371 2165 9423c1b46a48aa6331a77ab5c702ac9d <NA> unplaced
# 4 >chrY_KI270740v1_random KI270740.1 GRCh38 568335373 37240 69e42252aead509bf56f1ea6fda91405 chrY unlocalized
or perhaps
tidyr::pivot_wider(outlong, id_cols = "chr", names_from = "a", values_from = "b")
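If you would rather avoid extra packages, base R's reshape() can do the same widening. The sketch below is only an illustration under stated assumptions: it rebuilds a small subset of outlong by hand so that it runs on its own, and the clean-up steps (dropping the leading ">" and converting LN to integer) are optional suggestions rather than part of the approach above. The names outlong_small and wide are made up for the example.

```r
# A small hand-built subset of the long table (assumed shape: chr, a, b)
outlong_small <- data.frame(
  chr = c(rep(">chr22_KI270739v1_random", 3), rep(">chrUn_KI270302v1", 2)),
  a   = c("AC", "LN", "rg", "AC", "LN"),
  b   = c("KI270739.1", "73985", "chr22", "KI270302.1", "2274")
)

# Long to wide: one row per chr, one column per key; missing keys become NA
wide <- reshape(outlong_small, idvar = "chr", timevar = "a", direction = "wide")

# reshape() prefixes the new columns with "b."; strip that prefix
names(wide) <- sub("^b\\.", "", names(wide))

# Optional clean-up: drop the leading ">" and make LN numeric
wide$chr <- sub("^>", "", wide$chr)
wide$LN  <- as.integer(wide$LN)
wide
```

Every value arrives as a character string after the split, so numeric fields such as LN need an explicit conversion whichever widening function you use.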