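Reshaping data from wide to long format when id variables are encoded in the column headers

Question  Votes: 5  Answers: 3

I am fairly new to R and have wide-format data like the following: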

subject_id   age    sex  treat1.1.param1    treat1.1.param2   treat1.2.param1   treat1.2.param2
-----------------------------------------------------------------------------------------------
1             23     M         1                  2                  3                   4
2             25     W         5                  6                  7                   8

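This is data for multiple subjects under a given treatment (here treat1), with multiple parameters (here param1 and param2) measured over several rounds of repeated measurement (here rounds 1 and 2). As shown above, the treatment, round, and parameter that an entry belongs to are encoded in the column header.

I would like the data in long format, for example: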

subject_id  age sex treatment   round       param1      param2
------------------------------------------------------------------------------------------
1           23   M   treat1      1           1          2
1           23   M   treat1      2           3          4
2           25   W   treat1      1           5          6
2           25   W   treat1      2           7          8

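That is, the id variables identifying a single observation are subject_id, treatment, and round. But because the latter two are encoded in the column headers with dots as separators, I don't know how to convert from wide to long format. All my attempts based on standard examples (using reshape2 or tidyr) failed. Since in reality I have 12 treatments with 12 rounds each and about 50 parameters per round, a largely manual approach won't help me much.

Can someone help me and provide a code example?

Many thanks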

Jan

r long-integer
3 answers
4
votes

We can use pivot_longer from tidyr, specifying the names_to and names_pattern arguments.

tidyr::pivot_longer(df, 
                    cols = starts_with("treat"), 
                    names_to = c("treatment", "round", ".value"), 
                    names_pattern =  "(\\w+)\\.(\\d+)\\.(\\w+)")

#  subject_id   age sex   treatment round param1 param2
#       <int> <int> <fct> <chr>     <chr>  <int>  <int>
#1          1    23 M     treat1    1          1      2
#2          1    23 M     treat1    2          3      4
#3          2    25 W     treat1    1          5      6
#4          2    25 W     treat1    2          7      8

Data

df <- structure(list(subject_id = 1:2, age = c(23L, 25L), sex = structure(1:2, 
.Label = c("M", "W"), class = "factor"), 
treat1.1.param1 = c(1L, 5L), treat1.1.param2 = c(2L, 6L), 
treat1.2.param1 = c(3L, 7L), treat1.2.param2 = c(4L, 8L)), 
class = "data.frame", row.names = c(NA, -2L))
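To see what the three capture groups in names_pattern pick out, the same regular expression can be applied to the column names with base R. This is only an illustrative sketch: the third column name is a made-up two-digit case that does not appear in the question's data, and the labels given to the groups are chosen here for readability. In pivot_longer, the group mapped to ".value" (the last one) supplies the names of the output columns instead of being stored in a column of its own.

```r
# Two column names from the question plus one hypothetical two-digit example
cols <- c("treat1.1.param1", "treat1.2.param2", "treat12.10.param50")
pattern <- "(\\w+)\\.(\\d+)\\.(\\w+)"

# regexec() + regmatches() return the full match followed by each capture group
parts <- do.call(rbind, regmatches(cols, regexec(pattern, cols)))
colnames(parts) <- c("column", "treatment", "round", "param")
parts
```

Because the pattern uses \\d+ for the round, it should also cover the 12 rounds mentioned in the question.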

1
votes

You can use gather, separate, and spread from tidyr.

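library(dplyr)  # provides %>% and vars()

# Example data with simplified column names (round_<n>_param_<m>)
tibble::tibble(subject_id = 1:2,
               age = c(23,25),
               sex = c("M", "W"),
               round_1_param_1 = c(1,5),
               round_1_param_2 = c(2,6),
               round_2_param_1 = c(3,7),
               round_2_param_2 = c(4,8)) %>% 
  # stack all measurement columns into key/value pairs
  tidyr::gather("key", "value", -subject_id, -age, -sex) %>% 
  # split the key at the literal "param" into a round part and a param part
  tidyr::separate(key, c("round", "param"), sep = "param") %>%
  # keep only the numbers; extract_numeric() is deprecated in favor of readr::parse_number()
  dplyr::mutate_at(vars("round", "param"), ~ tidyr::extract_numeric(.)) %>% 
  # one column per parameter
  tidyr::spread(param, value)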

# A tibble: 4 x 6
  subject_id   age sex   round   `1`   `2`
       <int> <dbl> <chr> <dbl> <dbl> <dbl>
1          1    23 M         1     1     2
2          1    23 M         2     3     4
3          2    25 W         1     5     6
4          2    25 W         2     7     8
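The example above uses underscore-separated names and ends up with parameter columns called `1` and `2`. A minimal sketch of the same gather/separate/spread idea applied to the dotted column names from the question might look like the following. It assumes tidyr is installed; the data frame is rebuilt from the question's table, and the names long and res are chosen here for illustration.

```r
library(tidyr)

# Data rebuilt from the table in the question
df <- data.frame(subject_id = 1:2, age = c(23L, 25L), sex = c("M", "W"),
                 treat1.1.param1 = c(1L, 5L), treat1.1.param2 = c(2L, 6L),
                 treat1.2.param1 = c(3L, 7L), treat1.2.param2 = c(4L, 8L))

# Stack all measurement columns into key/value pairs
long <- gather(df, "key", "value", -subject_id, -age, -sex)

# Split the key at each literal dot; convert = TRUE turns round into an integer
long <- separate(long, key, c("treatment", "round", "param"),
                 sep = "\\.", convert = TRUE)

# One column per parameter, keeping the original parameter names
res <- spread(long, param, value)
res
```

gather and spread are superseded by pivot_longer and pivot_wider in current tidyr, so this is mainly useful where the older functions are already in use.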

0
votes

Here is a possible data.table approach,

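library(data.table)

# Assumption: df is the question's data as a data.frame; melt() and [ ] need a data.table
dd <- as.data.table(df)

# \\d+ (rather than \\d) so that two-digit rounds such as 12 also match
dcast(melt(dd, id.vars = c("subject_id", "age", 'sex'))
      [, .(subject_id, age, sex, gsub('(\\w+)\\.\\d+\\.\\w+', '\\1', variable),
                                 gsub('\\w+\\.(\\d+)\\.\\w+', '\\1', variable),
                                 gsub('\\w+\\.\\d+\\.(\\w+)', '\\1', variable), value)],
      subject_id + age + sex + V4 + V5 ~ V6)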

which gives,

   subject_id age sex     V4 V5 param1 param2
1:          1  23   M treat1  1      1      2
2:          1  23   M treat1  2      3      4
3:          2  25   W treat1  1      5      6
4:          2  25   W treat1  2      7      8
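The V4 and V5 columns in this output hold the treatment and the round. As a hedged variation on the same idea, the sketch below splits the melted column names with tstrsplit so that the new columns get descriptive names. It assumes data.table is installed; the data is rebuilt from the question's table, and the names long and res are chosen here for illustration.

```r
library(data.table)

# Data rebuilt from the table in the question
dd <- data.table(subject_id = 1:2, age = c(23L, 25L), sex = c("M", "W"),
                 treat1.1.param1 = c(1L, 5L), treat1.1.param2 = c(2L, 6L),
                 treat1.2.param1 = c(3L, 7L), treat1.2.param2 = c(4L, 8L))

# Wide to long: one row per subject and original measurement column
long <- melt(dd, id.vars = c("subject_id", "age", "sex"))

# Split "treat1.1.param1" at the literal dots into three named columns
long[, c("treatment", "round", "param") :=
       tstrsplit(as.character(variable), ".", fixed = TRUE)]

# Back to one column per parameter
res <- dcast(long, subject_id + age + sex + treatment + round ~ param,
             value.var = "value")
res
```

Splitting on a fixed dot avoids writing three separate regular expressions, and it is not limited to single-digit round numbers.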