How to gather columns in several steps in R without losing the grouping


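I need to convert a wide dataset to a long one: it has 16 columns that must collapse into 4. Each set of 4 columns holds information that belongs together, and that information must not get "lost" in the transformation.

I have data from a ranking task run in four blocks, which basically gives me a dataset where the information is split into four groups in wide format, i.e. first image, first sex, first score, second image, second sex, second score...

I have tried various combinations of group_by() and gather(), but I am not getting anywhere close.

I have read "Reshaping multiple sets of measurement columns (wide format) into single columns (long format)", but I am afraid I am none the wiser.

I have made some sample data for one participant, and also an example of what I would like the data to look like.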


library(tidyverse)

sample_dat <- data.frame(subject_id = rep("sj1", 4),
                         first_pick = rep(1, 4),
                         first_image_pick = (c("a", "b", "c", "d")),
                         first_pick_neuro = rep("TD", 4),
                         first_pick_sex = rep("F", 4),
                         second_pick = rep(2, 4),
                         second_image_pick = (c("e", "f", "g", "h")),
                         second_pick_neuro = rep("TD", 4),
                         second_pick_sex = rep("M", 4),
                         third_pick = rep(3, 4),
                         third_image_pick = (c("i", "j", "k", "l")),
                         third_pick_neuro = rep("DS", 4),
                         third_pick_sex = rep("F", 4),
                         fourth_pick = rep(4, 4),
                         fourth_image_pick = (c("m", "n", "o", "p")),
                         fourth_pick_neuro = rep("DS", 4),
                         fourth_pick_sex = rep("M", 4))

Expected output:


final_data <- data.frame(subject_id = rep("sj1", 16),
                         image = c("a", "b", "c", "d",
                                   "e", "f", "g", "h",
                                   "i", "j", "k", "l",
                                   "m", "n", "o", "p"),
                         rank = rep(c(1, 2, 3, 4), each = 4), # from the numbers in the first_pick, second_pick etc. 
                         neuro = rep(c("TD", "DS"), each = 8),
                         sex = rep(c("F", "M", "F", "M"), each = 4))

So far I have tried this, but it only gathers the image columns and duplicates all the other information on every row:


sample_dat_long <- sample_dat %>%
  group_by(subject_id) %>%
  gather(Pick, Image,
         first_image_pick,
         second_image_pick,
         third_image_pick,
         fourth_image_pick)  

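So essentially, I do not want to lose the information that belongs to each image (pick, sex, neuro) when I gather the data.

Any help would be great!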

r reshape
3 Answers
2 votes

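We can do this with melt() from data.table, which accepts multiple measure patterns when reshaping from "wide" to "long" format. Here, the columns whose names contain the substrings "image", "neuro" and "sex" are reshaped into separate columns to get the expected output. Note that the rank column produced this way is the index of the column group (a factor), not the values stored in first_pick, second_pick, etc.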

library(data.table)
melt(setDT(sample_dat), measure = patterns("image", "neuro", "sex"), 
   value.name = c("image", "neuro", "sex"), variable.name = 'rank')[, 
    .(subject_id, rank, image, neuro, sex)]
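
If the rank should come from the values stored in the *_pick columns rather than from the group index, a fourth pattern can be added. The following is only a sketch under a few assumptions: the data are cut down to two blocks for brevity, the regular expression "^[a-z]+_pick$" is my own guess at a pattern that matches first_pick but not first_image_pick, and block is a hypothetical name for the helper column that is dropped at the end.

```r
library(data.table)

# Toy stand-in for the question's data: one subject, two blocks only
dt <- data.table(
  subject_id = rep("sj1", 2),
  first_pick = rep(1, 2), first_image_pick = c("a", "b"),
  first_pick_neuro = rep("TD", 2), first_pick_sex = rep("F", 2),
  second_pick = rep(2, 2), second_image_pick = c("e", "f"),
  second_pick_neuro = rep("TD", 2), second_pick_sex = rep("M", 2)
)

# One pattern per output column; each pattern selects one column per block
long <- melt(
  dt,
  id.vars = "subject_id",
  measure.vars = patterns("^[a-z]+_pick$", "image", "neuro", "sex"),
  value.name = c("rank", "image", "neuro", "sex"),
  variable.name = "block"   # hypothetical helper column holding the group index
)

# The group index is no longer needed once rank carries the real values
long[, block := NULL]
print(long)
```

This relies on the columns of every block appearing in the same order, since melt() pairs the matches of each pattern by position.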

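1 vote

I think you can do it column by column, since in the end you only need 4 columns. Get the indices of the columns that should go into the first output column (if I understood correctly):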

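  # every 4th column starting at 2: the *_pick (rank) columns
  ind1 = seq(2, ncol(sample_dat), 4)
  column1 = gather(sample_dat[, ind1])[2]

Then repeat for the other 3 columns:

  # every 4th column starting at 3: the *_image_pick columns
  ind2 = seq(3, ncol(sample_dat), 4)
  column2 = gather(sample_dat[, ind2])[2]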

You could even do these 4 columns with a for loop instead of "by hand", and then combine them back into a data frame.
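
A loop along those lines could look like the sketch below. It is only an illustration under stated assumptions: it uses base R's unlist() in place of gather(), the data are reduced to two blocks, the output names in out_names are hypothetical, and it assumes every block has the same four columns in the same order.

```r
# Toy stand-in for the question's data: one subject, two blocks only
sample_dat <- data.frame(
  subject_id = rep("sj1", 2),
  first_pick = rep(1, 2), first_image_pick = c("a", "b"),
  first_pick_neuro = rep("TD", 2), first_pick_sex = rep("F", 2),
  second_pick = rep(2, 2), second_image_pick = c("e", "f"),
  second_pick_neuro = rep("TD", 2), second_pick_sex = rep("M", 2),
  stringsAsFactors = FALSE
)

out_names <- c("rank", "image", "neuro", "sex")  # hypothetical output names
block_size <- length(out_names)
n_blocks <- (ncol(sample_dat) - 1) / block_size  # subject_id is not part of a block

cols <- list()
for (k in seq_along(out_names)) {
  # k-th column of every block, e.g. 2, 6, ... for the rank columns
  ind <- seq(k + 1, ncol(sample_dat), by = block_size)
  # stack those columns on top of each other, block after block
  cols[[out_names[k]]] <- unlist(sample_dat[ind], use.names = FALSE)
}

# subject_id has to be repeated once per block to line up with the stacked columns
long <- cbind(
  data.frame(subject_id = rep(sample_dat$subject_id, times = n_blocks),
             stringsAsFactors = FALSE),
  as.data.frame(cols, stringsAsFactors = FALSE)
)
print(long)
```

Because the columns are picked by position, this breaks silently if a block has its columns in a different order, so checking names(sample_dat)[ind] inside the loop is a cheap safeguard.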


1 vote

It is worth thinking about good column names (i.e. "<variable_chr>.<time_num>"). But we can fix that right away.

pfx <- c("first", "second", "third", "fourth")

names(sample_dat)[-1] <- sapply(names(sample_dat)[-1], function(x) {
  x <- gsub("_pick", "", x)
  if (lengths(strsplit(x, "_")) == 2)
    sub("(^.*)_(.*)", paste("\\2", which(pfx == sub("(^.*)_.+", "\\1", x)), sep="."), x)
  else
    paste0("rank.", which(pfx == x))
})

names(sample_dat)  # good names now
# [1] "subject_id" "rank.1"     "image.1"    "neuro.1"    "sex.1"      "rank.2"    
# [7] "image.2"    "neuro.2"    "sex.2"      "rank.3"     "image.3"    "neuro.3"   
# [13] "sex.3"      "rank.4"     "image.4"    "neuro.4"    "sex.4" 

After that we can easily use reshape():

reshape(sample_dat, idvar="subject_id", varying=2:17, direction="long", 
        new.row.names=seq(ncol(sample_dat) - 1))
#    subject_id time rank image neuro sex
# 1         sj1    1    1     a    TD   F
# 2         sj1    1    1     b    TD   F
# 3         sj1    1    1     c    TD   F
# 4         sj1    1    1     d    TD   F
# 5         sj1    2    2     e    TD   M
# 6         sj1    2    2     f    TD   M
# 7         sj1    2    2     g    TD   M
# 8         sj1    2    2     h    TD   M
# 9         sj1    3    3     i    DS   F
# 10        sj1    3    3     j    DS   F
# 11        sj1    3    3     k    DS   F
# 12        sj1    3    3     l    DS   F
# 13        sj1    4    4     m    DS   M
# 14        sj1    4    4     n    DS   M
# 15        sj1    4    4     o    DS   M
# 16        sj1    4    4     p    DS   M
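
Since the question uses the tidyverse: once the names follow the "<variable>.<number>" scheme, tidyr's pivot_longer() should be able to do the same reshaping. This is a sketch rather than part of the answer above; it assumes tidyr 1.0.0 or later, uses a two-block data frame that is already renamed, and block is a hypothetical name for the column that receives the number after the dot.

```r
library(tidyr)

# Toy data that already uses the "<variable>.<number>" naming scheme
renamed <- data.frame(
  subject_id = rep("sj1", 2),
  rank.1 = rep(1, 2), image.1 = c("a", "b"),
  neuro.1 = rep("TD", 2), sex.1 = rep("F", 2),
  rank.2 = rep(2, 2), image.2 = c("e", "f"),
  neuro.2 = rep("TD", 2), sex.2 = rep("M", 2),
  stringsAsFactors = FALSE
)

# ".value" means: the part before the dot names the output column,
# the part after the dot goes into the (hypothetical) block column
long <- pivot_longer(
  renamed,
  cols = -subject_id,
  names_to = c(".value", "block"),
  names_sep = "\\."
)
print(long)
```

The rows come out grouped by original row rather than by block, so sort on block afterwards if the order of the expected output matters.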

Data

sample_dat <- structure(list(subject_id = structure(c(1L, 1L, 1L, 1L), .Label = "sj1", class = "factor"), 
    first_pick = c(1, 1, 1, 1), first_image_pick = structure(1:4, .Label = c("a", 
    "b", "c", "d"), class = "factor"), first_pick_neuro = structure(c(1L, 
    1L, 1L, 1L), .Label = "TD", class = "factor"), first_pick_sex = structure(c(1L, 
    1L, 1L, 1L), .Label = "F", class = "factor"), second_pick = c(2, 
    2, 2, 2), second_image_pick = structure(1:4, .Label = c("e", 
    "f", "g", "h"), class = "factor"), second_pick_neuro = structure(c(1L, 
    1L, 1L, 1L), .Label = "TD", class = "factor"), second_pick_sex = structure(c(1L, 
    1L, 1L, 1L), .Label = "M", class = "factor"), third_pick = c(3, 
    3, 3, 3), third_image_pick = structure(1:4, .Label = c("i", 
    "j", "k", "l"), class = "factor"), third_pick_neuro = structure(c(1L, 
    1L, 1L, 1L), .Label = "DS", class = "factor"), third_pick_sex = structure(c(1L, 
    1L, 1L, 1L), .Label = "F", class = "factor"), fourth_pick = c(4, 
    4, 4, 4), fourth_image_pick = structure(1:4, .Label = c("m", 
    "n", "o", "p"), class = "factor"), fourth_pick_neuro = structure(c(1L, 
    1L, 1L, 1L), .Label = "DS", class = "factor"), fourth_pick_sex = structure(c(1L, 
    1L, 1L, 1L), .Label = "M", class = "factor")), class = "data.frame", row.names = c(NA, 
-4L))