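I need to convert a wide dataset to a long one. It has 16 columns that must collapse into 4. Each set of 4 columns holds information that belongs together, and that information must not be "lost" in the transformation.
I have data from a ranking task with four blocks, which gives me a dataset where the information sits in wide format in four groups, i.e. first image, first sex, first score, second image, second sex, second score...
I have tried various combinations of group_by and gather(), but I am nowhere near a solution.
I have read "Reshaping multiple sets of measurement columns (wide format) into single columns (long format)", but I'm afraid I am none the wiser.
I have made some sample data for one participant, and also an example of what I would like the data to look like.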
library(tidyverse)
sample_dat <- data.frame(subject_id = rep("sj1", 4),
first_pick = rep(1, 4),
first_image_pick = (c("a", "b", "c", "d")),
first_pick_neuro = rep("TD", 4),
first_pick_sex = rep("F", 4),
second_pick = rep(2, 4),
second_image_pick = (c("e", "f", "g", "h")),
second_pick_neuro = rep("TD", 4),
second_pick_sex = rep("M", 4),
third_pick = rep(3, 4),
third_image_pick = (c("i", "j", "k", "l")),
third_pick_neuro = rep("DS", 4),
third_pick_sex = rep("F", 4),
fourth_pick = rep(4, 4),
fourth_image_pick = (c("m", "n", "o", "p")),
fourth_pick_neuro = rep("DS", 4),
fourth_pick_sex = rep("M", 4))
Expected output:
final_data <- data.frame(subject_id = rep("sj1", 16),
image = c("a", "b", "c", "d",
"e", "f", "g", "h",
"i", "j", "k", "l",
"m", "n", "o", "p"),
rank = rep(c(1, 2, 3, 4), each = 4), # from the numbers in the first_pick, second_pick etc.
neuro = rep(c("TD", "DS"), each = 8),
sex = rep(c("F", "M", "F", "M"), each = 4))
So far I have tried this, but it only duplicates all of the other information:
sample_dat_long <- sample_dat %>%
group_by(subject_id) %>%
gather(Pick, Image,
first_image_pick,
second_image_pick,
third_image_pick,
fourth_image_pick)
So essentially I don't want to lose the information attached to each image (pick, sex, neuro) when I gather the data.
Any help would be great!
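We can use melt from data.table for this, which can take multiple measure patterns to reshape from "wide" to "long" format. Here, the column names containing the substrings "image", "neuro" and "sex" are each reshaped into a separate column to get the expected output.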
library(data.table)
melt(setDT(sample_dat), measure = patterns("image", "neuro", "sex"),
value.name = c("image", "neuro", "sex"), variable.name = 'rank')[,
.(subject_id, rank, image, neuro, sex)]
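One thing to watch in the call above: the rank column it returns is the index of each matched column group (a factor), which only happens to coincide with the values stored in first_pick, second_pick and so on. A minimal sketch, assuming the rank should instead be read from the *_pick columns themselves, adds a fourth pattern for them. The regular expression "^[a-z]+_pick$" and the names dt, long_dt and block are my own choices for illustration, not part of the answer.

```r
library(data.table)

# Sample data as in the question (stringsAsFactors = FALSE keeps the strings as character)
sample_dat <- data.frame(
  subject_id = rep("sj1", 4),
  first_pick  = rep(1, 4), first_image_pick  = c("a", "b", "c", "d"),
  first_pick_neuro  = rep("TD", 4), first_pick_sex  = rep("F", 4),
  second_pick = rep(2, 4), second_image_pick = c("e", "f", "g", "h"),
  second_pick_neuro = rep("TD", 4), second_pick_sex = rep("M", 4),
  third_pick  = rep(3, 4), third_image_pick  = c("i", "j", "k", "l"),
  third_pick_neuro  = rep("DS", 4), third_pick_sex  = rep("F", 4),
  fourth_pick = rep(4, 4), fourth_image_pick = c("m", "n", "o", "p"),
  fourth_pick_neuro = rep("DS", 4), fourth_pick_sex = rep("M", 4),
  stringsAsFactors = FALSE
)

# as.data.table() copies, so sample_dat itself is left untouched
# (setDT() would convert it in place)
dt <- as.data.table(sample_dat)

# Four patterns, one per output column. "^[a-z]+_pick$" matches only
# first_pick, second_pick, ... because [a-z]+ cannot span an underscore.
long_dt <- melt(dt,
                id.vars = "subject_id",
                measure.vars = patterns("^[a-z]+_pick$", "image", "neuro", "sex"),
                value.name = c("rank", "image", "neuro", "sex"),
                variable.name = "block")

# "block" is just the group index (1..4); drop it once it is no longer needed
long_dt[, block := NULL]
long_dt
```

If the pick columns always hold 1 to 4 in column order, the shorter three-pattern call gives the same ranks, only stored as a factor.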
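I think you can do it column by column, since in the end you only need 4 columns. Get the indices of the columns that should be stacked into the first output column (if I understood correctly):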
ind1 = seq(2,length(sample_dat[1,]), 4)
column1 = gather( sample_dat[,ind1] )[2]
Then repeat for the other 3 columns:
ind2 = seq(3,length(sample_dat[1,]), 4)
column2 = gather( sample_dat[,ind2] )[2]
You could even do these 4 columns with a for loop instead of "manually". Then combine them back into a data frame.
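The loop and the final recombination could be sketched roughly as follows. This is only an illustration under a few assumptions: it uses base R (lapply and unlist) in place of gather(), it relies on the columns coming in the fixed order pick, image, neuro, sex within each block, and the names new_names, stacked and long_dat are made up here.

```r
# Sample data as in the question
sample_dat <- data.frame(
  subject_id = rep("sj1", 4),
  first_pick  = rep(1, 4), first_image_pick  = c("a", "b", "c", "d"),
  first_pick_neuro  = rep("TD", 4), first_pick_sex  = rep("F", 4),
  second_pick = rep(2, 4), second_image_pick = c("e", "f", "g", "h"),
  second_pick_neuro = rep("TD", 4), second_pick_sex = rep("M", 4),
  third_pick  = rep(3, 4), third_image_pick  = c("i", "j", "k", "l"),
  third_pick_neuro  = rep("DS", 4), third_pick_sex  = rep("F", 4),
  fourth_pick = rep(4, 4), fourth_image_pick = c("m", "n", "o", "p"),
  fourth_pick_neuro = rep("DS", 4), fourth_pick_sex = rep("M", 4),
  stringsAsFactors = FALSE
)

# Output column k is built from input columns k + 1, k + 5, k + 9, k + 13
new_names <- c("rank", "image", "neuro", "sex")
stacked <- lapply(seq_along(new_names), function(k) {
  ind <- seq(k + 1, ncol(sample_dat), by = 4)
  # as.character() guards against factor columns; unlist() stacks block after block
  unlist(lapply(sample_dat[ind], as.character), use.names = FALSE)
})
names(stacked) <- new_names

# Repeat the id once per block and bind everything back together
long_dat <- data.frame(subject_id = rep(sample_dat$subject_id, times = 4),
                       stacked, stringsAsFactors = FALSE)
long_dat$rank <- as.numeric(long_dat$rank)  # stacked as text above, so convert back
long_dat
```

Because this approach selects columns purely by position, it breaks silently if the column order ever changes, so selecting by name pattern is the safer choice for real data.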
It is worth using good column names in the first place (i.e. "<variable_chr>.<time_num>"). But we can fix that right away.
pfx <- c("first", "second", "third", "fourth")
names(sample_dat)[-1] <- sapply(names(sample_dat)[-1], function(x) {
x <- gsub("_pick", "", x)
if (lengths(strsplit(x, "_")) == 2)
sub("(^.*)_(.*)", paste("\\2", which(pfx == sub("(^.*)_.+", "\\1", x)), sep="."), x)
else
paste0("rank.", which(pfx == x))
})
names(sample_dat) # good names now
# [1] "subject_id" "rank.1" "image.1" "neuro.1" "sex.1" "rank.2"
# [7] "image.2" "neuro.2" "sex.2" "rank.3" "image.3" "neuro.3"
# [13] "sex.3" "rank.4" "image.4" "neuro.4" "sex.4"
After that we can easily use reshape.
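# subject_id is not unique here, so explicit row names are supplied:
# one per output row, i.e. 4 blocks times the number of input rows
reshape(sample_dat, idvar="subject_id", varying=2:17, direction="long",
new.row.names=seq_len(4 * nrow(sample_dat)))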
# subject_id time rank image neuro sex
# 1 sj1 1 1 a TD F
# 2 sj1 1 1 b TD F
# 3 sj1 1 1 c TD F
# 4 sj1 1 1 d TD F
# 5 sj1 2 2 e TD M
# 6 sj1 2 2 f TD M
# 7 sj1 2 2 g TD M
# 8 sj1 2 2 h TD M
# 9 sj1 3 3 i DS F
# 10 sj1 3 3 j DS F
# 11 sj1 3 3 k DS F
# 12 sj1 3 3 l DS F
# 13 sj1 4 4 m DS M
# 14 sj1 4 4 n DS M
# 15 sj1 4 4 o DS M
# 16 sj1 4 4 p DS M
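To see why the "<variable_chr>.<time_num>" naming matters, here is a small sketch on a made-up data frame (wide, score and grp are hypothetical names, not from the question). With such names, reshape can work out the variable names and the time values on its own by splitting each column name at the dot.

```r
# Toy wide data: two variables (score, grp) measured at two time points
wide <- data.frame(id = c("a", "b"),
                   score.1 = c(10, 20), grp.1 = c("x", "y"),
                   score.2 = c(11, 21), grp.2 = c("x", "z"),
                   stringsAsFactors = FALSE)

# varying = 2:5 marks every non-id column as time-varying; reshape() splits
# "score.1" into variable "score" and time 1 using the default sep = "."
long <- reshape(wide, idvar = "id", varying = 2:5, direction = "long")
long
```

Here id is unique per row, so reshape can build the row names itself and new.row.names is not needed.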
sample_dat <- structure(list(subject_id = structure(c(1L, 1L, 1L, 1L), .Label = "sj1", class = "factor"),
first_pick = c(1, 1, 1, 1), first_image_pick = structure(1:4, .Label = c("a",
"b", "c", "d"), class = "factor"), first_pick_neuro = structure(c(1L,
1L, 1L, 1L), .Label = "TD", class = "factor"), first_pick_sex = structure(c(1L,
1L, 1L, 1L), .Label = "F", class = "factor"), second_pick = c(2,
2, 2, 2), second_image_pick = structure(1:4, .Label = c("e",
"f", "g", "h"), class = "factor"), second_pick_neuro = structure(c(1L,
1L, 1L, 1L), .Label = "TD", class = "factor"), second_pick_sex = structure(c(1L,
1L, 1L, 1L), .Label = "M", class = "factor"), third_pick = c(3,
3, 3, 3), third_image_pick = structure(1:4, .Label = c("i",
"j", "k", "l"), class = "factor"), third_pick_neuro = structure(c(1L,
1L, 1L, 1L), .Label = "DS", class = "factor"), third_pick_sex = structure(c(1L,
1L, 1L, 1L), .Label = "F", class = "factor"), fourth_pick = c(4,
4, 4, 4), fourth_image_pick = structure(1:4, .Label = c("m",
"n", "o", "p"), class = "factor"), fourth_pick_neuro = structure(c(1L,
1L, 1L, 1L), .Label = "DS", class = "factor"), fourth_pick_sex = structure(c(1L,
1L, 1L, 1L), .Label = "M", class = "factor")), class = "data.frame", row.names = c(NA,
-4L))