How to import a dataset with a varying number of columns in R


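I have a dataset in .txt format with no header, and its rows do not all have the same number of columns. Part of the dataset is copied below. When I import it with readr::read_table(), the values do not line up in the same columns, and some columns contain values that belong to other columns. Is there a way to fix this?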

>chr22_KI270739v1_random  AC:KI270739.1  gi:568335374  LN:73985  rg:chr22  rl:unlocalized  M5:760fbd73515fedcc9f37737c4a722d6a  AS:GRCh38
>chrY_KI270740v1_random  AC:KI270740.1  gi:568335373  LN:37240  rg:chrY  rl:unlocalized  M5:69e42252aead509bf56f1ea6fda91405  AS:GRCh38
>chrUn_KI270302v1  AC:KI270302.1  gi:568335372  LN:2274  rl:unplaced  M5:ee6dff38036f7d03478c70717643196e  AS:GRCh38
>chrUn_KI270304v1  AC:KI270304.1  gi:568335371  LN:2165  rl:unplaced  M5:9423c1b46a48aa6331a77ab5c702ac9d  AS:GRCh38

I tried this:

all_chr_GATKbundle <- read_table("all_ch_GATKbundle.txt", col_names = F)

and the result looks like this:

24 >chrY                    AC:CM000686.2 gi:568336000 LN:57227415  rl:Chromosome    M5:ce3e31103314a704255f3cd90369ecce AS:GRCh38                          
 25 >chrM                    AC:J01415.2   gi:113200490 LN:16569     rl:Mitochondrion M5:c68f52674c9fb33aef52dcf399755519 AS:GRCh38                          
 26 >chr1_KI270706v1_random  AC:KI270706.1 gi:568335410 LN:175055    rg:chr1          rl:unlocalized                      M5:62def1a794b3e18192863d187af956e6
 27 >chr1_KI270707v1_random  AC:KI270707.1 gi:568335409 LN:32032     rg:chr1          rl:unlocalized                      M5:78135804eb15220565483b7cdd02f3be
1 Answer

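This is not proper, complete tabular data: the first two rows have eight whitespace-separated elements, and the last two have seven. Because of that, read.table, read_table, and most read* functions will have trouble with it. It is not fixed-width either.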
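One way to confirm the ragged structure before choosing a reader is to count the fields on each line. The snippet below is a minimal sketch using base R's count.fields() on two of the sample lines from the question; the variable names (sample_lines, n_fields) are illustrative only.

```r
# Two header lines copied from the question: one with an rg: field, one without
sample_lines <- c(
  ">chr22_KI270739v1_random  AC:KI270739.1  gi:568335374  LN:73985  rg:chr22  rl:unlocalized  M5:760fbd73515fedcc9f37737c4a722d6a  AS:GRCh38",
  ">chrUn_KI270302v1  AC:KI270302.1  gi:568335372  LN:2274  rl:unplaced  M5:ee6dff38036f7d03478c70717643196e  AS:GRCh38"
)

# count.fields() returns the number of whitespace-separated fields per line;
# quoting and comment handling are switched off so every character is literal
con <- textConnection(sample_lines)
n_fields <- count.fields(con, sep = "", quote = "", comment.char = "")
close(con)

n_fields         # one count per line
table(n_fields)  # more than one distinct count means the rows are ragged
```

With the real file, passing the file name (for example "all_ch_GATKbundle.txt") instead of a text connection should give the same kind of summary.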

It looks more like a variable-length set of key:value pairs attached to a single leading chr* element on each line.
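To make that structure concrete, here is a small sketch that takes a single line apart by hand. It assumes each token after the first contains a colon and splits at the first colon, so a value that itself contains a colon stays intact; the names id, keys, and values are illustrative.

```r
# One sample line from the question (this record has no rg: field)
line <- ">chrUn_KI270302v1  AC:KI270302.1  gi:568335372  LN:2274  rl:unplaced  M5:ee6dff38036f7d03478c70717643196e  AS:GRCh38"

# Split on runs of whitespace: first token is the record ID, the rest are pairs
parts <- strsplit(line, "\\s+")[[1]]
id <- parts[1]
kv <- parts[-1]

# Key = text before the first colon, value = text after the first colon
keys   <- sub(":.*$", "", kv)
values <- sub("^[^:]*:", "", kv)

id
setNames(values, keys)  # named character vector: one entry per key
```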

You can read it into a "long" form:

txt <- c(">chr22_KI270739v1_random  AC:KI270739.1  gi:568335374  LN:73985  rg:chr22  rl:unlocalized  M5:760fbd73515fedcc9f37737c4a722d6a  AS:GRCh38", ">chrY_KI270740v1_random  AC:KI270740.1  gi:568335373  LN:37240  rg:chrY  rl:unlocalized  M5:69e42252aead509bf56f1ea6fda91405  AS:GRCh38", ">chrUn_KI270302v1  AC:KI270302.1  gi:568335372  LN:2274  rl:unplaced  M5:ee6dff38036f7d03478c70717643196e  AS:GRCh38", ">chrUn_KI270304v1  AC:KI270304.1  gi:568335371  LN:2165  rl:unplaced  M5:9423c1b46a48aa6331a77ab5c702ac9d  AS:GRCh38" )
# txt <- readLines("all_ch_GATKbundle.txt")

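# Split each line on whitespace: the first token is the record ID,
# the remaining tokens are key:value pairs.
outlong <- strsplit(txt, "\\s+") |>
  lapply(function(X)
    cbind(
      data.frame(chr = X[1]),
      # a = key (before the colon), b = value (after the colon)
      strcapture("(.*):(.*)", X[-1], list(a="",b=""))
    )
  ) |>
  # the `_` placeholder in a native pipe requires R >= 4.2
  do.call(rbind.data.frame, args = _)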
outlong
#                         chr  a                                b
# 1  >chr22_KI270739v1_random AC                       KI270739.1
# 2  >chr22_KI270739v1_random gi                        568335374
# 3  >chr22_KI270739v1_random LN                            73985
# 4  >chr22_KI270739v1_random rg                            chr22
# 5  >chr22_KI270739v1_random rl                      unlocalized
# 6  >chr22_KI270739v1_random M5 760fbd73515fedcc9f37737c4a722d6a
# 7  >chr22_KI270739v1_random AS                           GRCh38
# 8   >chrY_KI270740v1_random AC                       KI270740.1
# 9   >chrY_KI270740v1_random gi                        568335373
# 10  >chrY_KI270740v1_random LN                            37240
# 11  >chrY_KI270740v1_random rg                             chrY
# 12  >chrY_KI270740v1_random rl                      unlocalized
# 13  >chrY_KI270740v1_random M5 69e42252aead509bf56f1ea6fda91405
# 14  >chrY_KI270740v1_random AS                           GRCh38
# 15        >chrUn_KI270302v1 AC                       KI270302.1
# 16        >chrUn_KI270302v1 gi                        568335372
# 17        >chrUn_KI270302v1 LN                             2274
# 18        >chrUn_KI270302v1 rl                         unplaced
# 19        >chrUn_KI270302v1 M5 ee6dff38036f7d03478c70717643196e
# 20        >chrUn_KI270302v1 AS                           GRCh38
# 21        >chrUn_KI270304v1 AC                       KI270304.1
# 22        >chrUn_KI270304v1 gi                        568335371
# 23        >chrUn_KI270304v1 LN                             2165
# 24        >chrUn_KI270304v1 rl                         unplaced
# 25        >chrUn_KI270304v1 M5 9423c1b46a48aa6331a77ab5c702ac9d
# 26        >chrUn_KI270304v1 AS                           GRCh38

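If you need a wide format, keep in mind that some fields (the keys in column a) are not present in every chr record, so those cells come out as NA: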

reshape2::dcast(outlong, chr ~ a, value.var = "b")
#                        chr         AC     AS        gi    LN                               M5    rg          rl
# 1 >chr22_KI270739v1_random KI270739.1 GRCh38 568335374 73985 760fbd73515fedcc9f37737c4a722d6a chr22 unlocalized
# 2        >chrUn_KI270302v1 KI270302.1 GRCh38 568335372  2274 ee6dff38036f7d03478c70717643196e  <NA>    unplaced
# 3        >chrUn_KI270304v1 KI270304.1 GRCh38 568335371  2165 9423c1b46a48aa6331a77ab5c702ac9d  <NA>    unplaced
# 4  >chrY_KI270740v1_random KI270740.1 GRCh38 568335373 37240 69e42252aead509bf56f1ea6fda91405  chrY unlocalized

Or perhaps:

tidyr::pivot_wider(outlong, id_cols = "chr", names_from = "a", values_from = "b")
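If you would rather avoid extra packages, base R's reshape() can do the same long-to-wide step. This is a sketch on a small hand-built subset of the long table (three keys only, built inline so the block runs on its own); outwide is an illustrative name.

```r
# A reduced long table in the same shape as outlong: chr = record ID,
# a = key, b = value. The second record has no "rg" key.
outlong <- data.frame(
  chr = c(rep(">chr22_KI270739v1_random", 3), rep(">chrUn_KI270302v1", 2)),
  a   = c("AC", "LN", "rg", "AC", "LN"),
  b   = c("KI270739.1", "73985", "chr22", "KI270302.1", "2274")
)

# One row per chr, one column per key; missing keys become NA
outwide <- reshape(outlong, idvar = "chr", timevar = "a", direction = "wide")

# reshape() prefixes the new columns with "b."; strip that prefix
names(outwide) <- sub("^b\\.", "", names(outwide))
rownames(outwide) <- NULL
outwide
```

All values are still character at this point, so numeric fields such as LN would need an explicit conversion if you plan to compute with them.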