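I have a dataset of households and their members in a flat format, with one row per household. There is a fixed number of members, and each member's answers occupy their own columns. For simplicity, assume each household has 2 members and each member answers 2 questions: age (Q1) and gender (Q2).
The file format looks like this: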
HHID, MEM_ID_1, MEM_ID_2, AGE_1, AGE_2, GENDER_1, GENDER_2
1 1 2 50 45 M F
I would like to convert it to the following format, with one row per member:
HHID MEM_ID AGE GENDER
1 1 50 M
1 2 45 F
Suppose our data frame is called test:
dput(test)
structure(list(HHID = 1L, MEM_ID_1 = 1L, MEM_ID_2 = 2L, AGE_1 = 50L,
AGE_2 = 45L, GENDER_1 = structure(1L, .Label = "Male", class = "factor"),
GENDER_2 = structure(1L, .Label = "Female", class = "factor")), class = "data.frame", row.names = c(NA,
-1L))
You can use the reshape() function on this data frame as follows:
reshape(test, direction = "long",
varying = list(c("MEM_ID_1","MEM_ID_2"), c("AGE_1","AGE_2"), c( "GENDER_1","GENDER_2")),
v.names = c("MEM_ID","AGE","GENDER"),
idvar = 'HHID')
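The reshape() function comes from base R. Broadly speaking, it can melt several groups of variables at once when you pass them through the varying argument and set direction to "long".
In your case, for example, varying is a list of three vectors of variable names, one vector per output column: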
varying = list(c("MEM_ID_1","MEM_ID_2"), c("AGE_1","AGE_2"), c( "GENDER_1","GENDER_2"))
The output is as follows:
HHID time MEM_ID AGE GENDER
1.1 1 1 1 50 Male
1.2 1 2 2 45 Female
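The output above still carries the helper column time and the row names 1.1 and 1.2 that reshape() generates. A minimal sketch of tidying this up, assuming the same one-row test data frame (rebuilt here with data.frame() rather than dput() so the block runs on its own; the name long is only illustrative):

```r
# Rebuild the one-row example; values mirror the dput() output above.
test <- data.frame(
  HHID = 1L, MEM_ID_1 = 1L, MEM_ID_2 = 2L,
  AGE_1 = 50L, AGE_2 = 45L,
  GENDER_1 = factor("Male"), GENDER_2 = factor("Female")
)

long <- reshape(test, direction = "long",
  varying = list(c("MEM_ID_1", "MEM_ID_2"),
                 c("AGE_1", "AGE_2"),
                 c("GENDER_1", "GENDER_2")),
  v.names = c("MEM_ID", "AGE", "GENDER"),
  idvar = "HHID")

long$time <- NULL                              # drop the index column added by reshape()
long <- long[order(long$HHID, long$MEM_ID), ]  # group members of a household together
rownames(long) <- NULL                         # replace "1.1", "1.2" with plain 1, 2
long
```

This leaves exactly the four requested columns: HHID, MEM_ID, AGE and GENDER.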
Alternatively, you can use tidyr::gather(), tidyr::separate(), and tidyr::spread() in sequence. Here household is the name of your data frame.
library(tidyverse)
gather
First, apply tidyr::gather() to stack every column except HHID. This gives the following result:
household %>%
gather(-HHID, key = domestic, value = value)
#> HHID domestic value
#> 1 1 MEM_ID_1 1
#> 2 1 MEM_ID_2 2
#> 3 1 AGE_1 50
#> 4 1 AGE_2 45
#> 5 1 GENDER_1 M
#> 6 1 GENDER_2 F
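Now all you have to do is split the domestic column at the underscore that precedes the trailing digit (_[0-9]). The lookahead regular expression _(?=[0-9]) matches only that underscore, so MEM_ID stays intact: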
household %>%
gather(-HHID, key = domestic, value = value) %>% # long data
separate(domestic, into = c("domestic", "vals"), sep = "_(?=[0-9])") %>% # separate the digit
spread(domestic, value) %>% # wide format
select(HHID, MEM_ID, AGE, GENDER, -vals) # just arranging columns, and excluding needless column
#> HHID MEM_ID AGE GENDER
#> 1 1 1 50 M
#> 2 1 2 45 F
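One thing to be aware of with the pipeline above: gather() stacks numbers and text into a single value column, so MEM_ID and AGE come back as character unless you pass convert = TRUE to spread(). In tidyr 1.0.0 and later, pivot_longer() can do the whole reshape in one call and keeps each column's type. The following is a sketch rather than part of the answer above; the household data frame is rebuilt from the question's example, and the names member and result are illustrative assumptions:

```r
library(tidyr)

# Example data from the question (one household, two members).
household <- data.frame(
  HHID = 1L, MEM_ID_1 = 1L, MEM_ID_2 = 2L,
  AGE_1 = 50L, AGE_2 = 45L,
  GENDER_1 = "M", GENDER_2 = "F"
)

result <- pivot_longer(
  household,
  cols = -HHID,
  # ".value" turns the first captured group into output column names;
  # the second group (the trailing digit) goes into a "member" column.
  names_to = c(".value", "member"),
  names_pattern = "(.*)_([0-9]+)"
)
result$member <- NULL  # the member index duplicates MEM_ID here, so drop it
result
```

Because the pattern is greedy, MEM_ID_1 is split into MEM_ID and 1 rather than MEM and ID_1.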