R: How do I convert a selection of every 5 rows into a single row?


I have a dataset in the following (tidy) format:

SAMPLE, MARKER, ALLELE, LENGTH, PEAK
BRIS01, B100, allele 1, NA, 126.95
BRIS01, B100, allele 2, 160, 159.72
BRIS01, B100, allele 3, 162, 162.01
BRIS02, B100, allele 1, 152, 151.4
BRIS02, B100, allele 2, NA, NA
BRIS02, B100, allele 3, NA, NA

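In total, each sample has an entry for each of 14 markers, and each marker has an entry for each of 5 alleles, even if the entry is just "NA". I am not sure how many samples there are.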

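I have spent all day trying to restructure it into the following format, so that for each sample all the allele values for each marker sit next to each other, but to no avail: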

                MARKER 1                              MARKER 2      MARKER 3
      SAMPLE 1, NA, 126.95, 160, 159.72, 162, 162.01, LENGTH, PEAK, LENGTH, PEAK
      SAMPLE 2, 152, 151.4, NA, NA, NA, NA,           LENGTH, PEAK, LENGTH, PEAK

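In case the format looks a bit confusing, hopefully this helps: each row should have 141 columns. The first column should contain the sample name, followed by the allele length and peak size for each of the 5 alleles of each marker. For example: sample, marker 1 length 1, marker 1 peak 1, marker 1 length 2, marker 1 peak 2, ..., marker 2 length 1, marker 2 peak 1, and so on. It is a little counterintuitive, but imagine a column header for each marker, with sub-columns for the length and peak of each allele.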

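I have tried dplyr, tidyr, melt, cast, dcast, reshape, reshape2, transpose... but I am not very good at R and have had no luck. Using length and peak as sub-columns is probably not very good/tidy data in practice, but this is how my boss has asked for the data to be laid out for interpretation. Any feedback is appreciated!

Thanks!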

Edit: I ran the following code as suggested:

ultra_wide <-
  wide %>%
  group_by(SAMPLE, MARKER) %>%
  gather(key = "VARS", value = "VALS", c(LENGTH, PEAK)) %>%
  spread(MARKER, VALS) %>%
  summarize(MARKER1 = paste(c(B100), collapse = ", "), 
            MARKER2 = paste(c(B132), collapse = ", "),
            MARKER3 = paste(c(BL13), collapse = ", "),
            MARKER4 = paste(c(BT06), collapse = ", "),
            MARKER5 = paste(c(BT09), collapse = ", "),
            MARKER6 = paste(c(BT30), collapse = ", "),
            MARKER7 = paste(c(BTMS0044), collapse = ", "),
            MARKER8 = paste(c(BTMS0067), collapse = ", "),
            MARKER9 = paste(c(BTMS0106), collapse = ", "),
            MARKER10 = paste(c(B116), collapse = ", "),
            MARKER11 = paste(c(B118), collapse = ", "),
            MARKER12 = paste(c(B119), collapse = ", "),
            MARKER13 = paste(c(BT20), collapse = ", "),
            MARKER14 = paste(c(BTMS0114), collapse = ", "))

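However, the command did nothing, because the following error occurred:

Error: Duplicate identifiers for rows (76, 77, 78, 79, 80), (30671, 30672, 30673, 30674, 30675), (81, 82, 83, 84, 85), (30676, 30677, 30678, 30679, 30680)

The message then continued for several more lines.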

r dataframe data-structures dplyr reshape
1 Answer

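Data input

First, please post code that recreates your data frame, so that the next person can simply copy and paste it and look at the data frame themselves. Here I have just tried to recreate the data frame from your description, in particular the part where you mention that each marker has five alleles.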

# Vectors for dataframe

library(tidyverse)

SAMPLE <- c(rep("BRIS01", 5), rep("BRIS02", 5))
MARKER <- c(rep("B100", 5), rep("B200", 5))
ALLELE <- rep(paste("allele",1:5), times = 2)
LENGTH <- c(NA, 160, 162, 152, NA, NA, 160:163)
PEAK <- c(126.95,   159.72, 162.01, 151.4,  NA, NA, 150:153)

marker_data <- data.frame(SAMPLE, MARKER, ALLELE, LENGTH, PEAK, stringsAsFactors = FALSE)

marker_data
#>    SAMPLE MARKER   ALLELE LENGTH   PEAK
#> 1  BRIS01   B100 allele 1     NA 126.95
#> 2  BRIS01   B100 allele 2    160 159.72
#> 3  BRIS01   B100 allele 3    162 162.01
#> 4  BRIS01   B100 allele 4    152 151.40
#> 5  BRIS01   B100 allele 5     NA     NA
#> 6  BRIS02   B200 allele 1     NA     NA
#> 7  BRIS02   B200 allele 2    160 150.00
#> 8  BRIS02   B200 allele 3    161 151.00
#> 9  BRIS02   B200 allele 4    162 152.00
#> 10 BRIS02   B200 allele 5    163 153.00

Note that in data.frame I passed the option stringsAsFactors = FALSE, because factor variables tend to be tricky to work with.
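As a small illustration of why factors can be tricky, here is a minimal sketch with made-up values (the vector names are hypothetical and not part of the original data). Converting a factor directly to numeric returns its internal level codes rather than the numbers it displays. Since R 4.0.0, stringsAsFactors = FALSE is the default for data.frame, so the option mainly matters on older versions.

```r
# Hypothetical values, used only to illustrate the pitfall
lengths_chr <- c("160", "152", "162")
lengths_fct <- factor(lengths_chr)

# Converting the character vector gives the actual numbers
from_chr <- as.numeric(lengths_chr)

# Converting the factor gives its internal level codes instead
from_fct <- as.numeric(lengths_fct)

# The usual workaround: go through character first
from_fct_fixed <- as.numeric(as.character(lengths_fct))

print(from_chr)
print(from_fct)
print(from_fct_fixed)
```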

"Spreading" your data

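As for your output, I took what you showed in tabular form as the result you want. Without more data, it is hard to get the 141 columns you want per row. The key to the answer is to "spread" the MARKER column after "gathering" (or, as it is more commonly called, "melting") the columns that hold the values (i.e. the LENGTH and PEAK columns). Before spreading, however, you should create a column with unique values, in case spread encounters identical rows. Finally, you have to summarize to get one row per sample, although you may want to loop over the MARKER1-MARKER14 columns for more efficient code. Anyway, I hope this helps.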
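To see why the unique-value column matters, here is a minimal sketch on hypothetical toy data (the `dup` data frame is made up for illustration). When two rows share every identifying column, spread() cannot decide which value belongs in the cell and stops with an error. The exact wording of that error differs between tidyr versions, so the sketch only checks that an error is raised.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data: the same SAMPLE/MARKER/ALLELE combination appears twice
dup <- data.frame(
  SAMPLE = c("S1", "S1"),
  MARKER = c("B100", "B100"),
  ALLELE = c("allele 1", "allele 1"),
  LENGTH = c(160, 162),
  stringsAsFactors = FALSE
)

# Without a unique row id, spread() cannot tell the two rows apart and stops
res_err <- tryCatch(
  spread(dup, MARKER, LENGTH),
  error = function(e) e
)
print(inherits(res_err, "error"))

# A per-group row number makes every remaining key combination unique
res_ok <- dup %>%
  group_by(SAMPLE, MARKER) %>%
  mutate(i = row_number()) %>%
  ungroup() %>%
  spread(MARKER, LENGTH)

print(res_ok)
```

The pipeline that follows applies the same idea, a row number added before spread(), to the recreated example data.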

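marker_m <- 
  marker_data %>% 
  group_by(SAMPLE, MARKER) %>%
  # Stack LENGTH and PEAK into key/value pairs
  gather(VARS, VALS, c(LENGTH, PEAK)) %>%
  # Per-group row number, so every row has a unique identifier for spread()
  mutate(i = row_number()) %>%
  # One column per marker
  spread(MARKER, VALS) %>% 
  # Collapse each marker column into one string per sample
  summarize(MARKER1 = paste(c(B100), collapse = ", "), MARKER2 = paste(c(B200), collapse = ", "))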

marker_m
#> # A tibble: 2 x 3
#>   SAMPLE MARKER1                                                  MARKER2 
#>   <chr>  <chr>                                                    <chr>   
#> 1 BRIS01 NA, 126.95, 160, 159.72, 162, 162.01, 152, 151.4, NA, NA NA, NA,~
#> 2 BRIS02 NA, NA, NA, NA, NA, NA, NA, NA, NA, NA                   NA, NA,~
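If separate columns are preferred over comma-separated strings, one possible alternative is sketched below. It is not the approach used above: it assumes a tidyr version that provides pivot_longer() and pivot_wider() (1.0.0 or later), and the names VAR, VALUE and marker_wide are made up for this example. The toy data is recreated so that the block runs on its own.

```r
library(dplyr)
library(tidyr)

# Same toy data as above, recreated so this block is self-contained
marker_data <- data.frame(
  SAMPLE = rep(c("BRIS01", "BRIS02"), each = 5),
  MARKER = rep(c("B100", "B200"), each = 5),
  ALLELE = rep(paste("allele", 1:5), times = 2),
  LENGTH = c(NA, 160, 162, 152, NA, NA, 160:163),
  PEAK   = c(126.95, 159.72, 162.01, 151.4, NA, NA, 150:153),
  stringsAsFactors = FALSE
)

marker_wide <- marker_data %>%
  # "allele 1" -> "1", so the new column names stay short
  mutate(ALLELE = sub("allele ", "", ALLELE)) %>%
  # Stack LENGTH and PEAK so each row holds a single value
  pivot_longer(cols = c(LENGTH, PEAK), names_to = "VAR", values_to = "VALUE") %>%
  # One column per marker/allele/variable combination, one row per sample
  pivot_wider(
    id_cols     = SAMPLE,
    names_from  = c(MARKER, ALLELE, VAR),
    values_from = VALUE
  )

print(dim(marker_wide))
print(names(marker_wide))
```

On this toy data the result has 2 rows and 21 columns (SAMPLE plus 2 markers x 5 alleles x 2 variables), with names such as B100_1_LENGTH and B100_1_PEAK. With 14 markers the same pattern would give the 141 columns asked for. This sketch assumes each SAMPLE/MARKER/ALLELE combination occurs only once; the duplicated rows reported in the question would need to be resolved first.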