I have a dataset with content like this:
data <- read.table(header=TRUE, text='ID Age Gender M1.Date M1.Code M2.Date M2.Code M3.Other M3.Code M3.Date
ID1 34 Male 23-Oct-18 M1 02-Apr-18 M2 31 M3 14-Jul-18
ID1 34 Male 02-Sep-19 M1 12-May-18 M2 23 M3 03-Nov-18
ID1 34 Male NA NA 03-Dec-18 M2 85 M3 04-Oct-19
ID1 34 Male NA NA 31-May-19 M2 NA NA NA
ID2 21 Female 13-May-18 M1 24-Jun-18 M2 734 M3 31-Aug-18
ID2 21 Female 21-Dec-18 M1 NA NA 12 M3 08-Apr-19
ID2 21 Female NA NA NA NA 14 M3 16-Aug-19')
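For each individual (ID), there is information (Date, Code, and Other) on three phenotypes (M1, M2, M3). However, the amount and type of information available for each phenotype varies between individuals.
To analyse the data (hundreds of individuals), I would like to convert the dataset from wide to long.
Ideally, the output would look like this: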
output <- read.table(header=TRUE, text='ID Age Gender Code Date Other
ID1 34 Male F1 23-Oct-2018 NA
ID1 34 Male F1 02-Sept-2019 NA
ID1 34 Male F2 02-Apr-2018 NA
ID1 34 Male F2 12-May-2018 NA
ID1 34 Male F2 03-Dec-2018 NA
ID1 34 Male F2 31-May-2019 NA
ID1 34 Male F3 14-Jul-18 31
ID1 34 Male F3 03-Nov-18 23
ID1 34 Male F3 04-Oct-19 85
ID2 21 Female F1 13-May-2018 NA
ID2 21 Female F1 21-Dec-2018 NA
ID2 21 Female F2 24-Jun-2018 NA
ID2 21 Female F3 31-Aug-18 734
ID2 21 Female F3 08-Apr-19 12
ID2 21 Female F3 16-Aug-19 14')
How should I proceed? I tried the reshape function in R, but because the amount of information differs for each phenotype, I could not get it to work.
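This can also be done with pivot_longer from tidyr.
Because some columns are missing for some variables (for example, you have no M1.Other or M2.Other, only M3.Other), you need to add extra columns for those variables and fill them with NA.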
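As a rough illustration of that padding step, here is a minimal base R sketch on a toy data frame. The names toy, expected, and missing_cols are hypothetical and used only for this example; it assumes every phenotype should have a Date, Code, and Other column.

```r
# Toy data frame (hypothetical) that has only some of the phenotype columns
toy <- data.frame(
  ID = c("ID1", "ID2"),
  M1.Date = c("23-Oct-18", NA),
  M3.Other = c(31L, 12L)
)

# Every phenotype (M1..M3) crossed with every field (Date, Code, Other)
expected <- as.vector(
  outer(paste0("M", 1:3), c("Date", "Code", "Other"), paste, sep = ".")
)

# Create the columns that are absent from the data and fill them with NA
missing_cols <- setdiff(expected, names(toy))
toy[missing_cols] <- NA

names(toy)
```

After this, toy has one column per phenotype and field combination, so each field can be stacked in the same way.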
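It would also be easier if the variable names ended with the number instead of having it in the middle (that is, MDate_1 rather than M1.Date). The variables can be renamed to that pattern, though.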
I hope this helps.
library(dplyr)  # %>%, mutate(), across()
library(tidyr)  # expand_grid(), unite(), pivot_longer()

data <- read.table(header=TRUE, text='ID Age Gender M1.Date M1.Code M2.Date M2.Code M3.Other M3.Code M3.Date
ID1 34 Male 23-Oct-18 M1 02-Apr-18 M2 31 M3 14-Jul-18
ID1 34 Male 02-Sep-19 M1 12-May-18 M2 23 M3 03-Nov-18
ID1 34 Male NA NA 03-Dec-18 M2 85 M3 04-Oct-19
ID1 34 Male NA NA 31-May-19 M2 NA NA NA
ID2 21 Female 13-May-18 M1 24-Jun-18 M2 734 M3 31-Aug-18
ID2 21 Female 21-Dec-18 M1 NA NA 12 M3 08-Apr-19
ID2 21 Female NA NA NA NA 14 M3 16-Aug-19')

# Rename so the number comes at the end after an underscore (M1.Date -> MDate_1)
names(data) <- sub("^M(\\d)\\.(\\w+)$", "M\\2_\\1", names(data))

# Get all combinations of variables, since some columns are missing
all_vars <- expand_grid(
  value = c("MDate", "MCode", "MOther"),
  time = 1:3
) %>% unite("vars", everything())

# Add the missing variables as NA so every variable is complete for pivot_longer
missing_vars <- setdiff(all_vars$vars, names(data))
data[missing_vars] <- list(NA)

# Convert each group of columns to one type, then make the data longer.
# Parsing "%b" assumes English month abbreviations in the current locale.
data %>%
  mutate(
    across(starts_with("MDate"), ~ as.Date(.x, format = "%d-%b-%y")),
    across(starts_with("MCode"), as.character),
    across(starts_with("MOther"), as.integer)
  ) %>%
  pivot_longer(
    cols = -c(ID, Age, Gender),
    names_to = c(".value", "time"),
    names_sep = "_",
    values_drop_na = TRUE
  )
# A tibble: 15 x 7
ID Age Gender time MDate MCode MOther
<fct> <int> <fct> <chr> <date> <fct> <int>
1 ID1 34 Male 1 2018-10-23 M1 NA
2 ID1 34 Male 2 2018-04-02 M2 NA
3 ID1 34 Male 3 2018-07-14 M3 31
4 ID1 34 Male 1 2019-09-02 M1 NA
5 ID1 34 Male 2 2018-05-12 M2 NA
6 ID1 34 Male 3 2018-11-03 M3 23
7 ID1 34 Male 2 2018-12-03 M2 NA
8 ID1 34 Male 3 2019-10-04 M3 85
9 ID1 34 Male 2 2019-05-31 M2 NA
10 ID2 21 Female 1 2018-05-13 M1 NA
11 ID2 21 Female 2 2018-06-24 M2 NA
12 ID2 21 Female 3 2018-08-31 M3 734
13 ID2 21 Female 1 2018-12-21 M1 NA
14 ID2 21 Female 3 2019-04-08 M3 12
15 ID2 21 Female 3 2019-08-16 M3 14
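To get from this result to the layout requested in the question (columns Code, Date, Other, grouped by ID and then by code), the rows can be reordered and the columns renamed. The following is a minimal base R sketch, not part of the original answer; long is a hypothetical hand-built copy of the first four rows of the result, and wanted is a name used only here.

```r
# Hypothetical stand-in for the first four rows of the long result above
long <- data.frame(
  ID = c("ID1", "ID1", "ID1", "ID1"),
  Age = c(34L, 34L, 34L, 34L),
  Gender = c("Male", "Male", "Male", "Male"),
  time = c("1", "2", "3", "1"),
  MDate = as.Date(c("2018-10-23", "2018-04-02", "2018-07-14", "2019-09-02")),
  MCode = c("M1", "M2", "M3", "M1"),
  MOther = c(NA, NA, 31L, NA)
)

# Sort by individual, then phenotype code, then date; drop the helper column
wanted <- long[order(long$ID, long$MCode, long$MDate),
               c("ID", "Age", "Gender", "MCode", "MDate", "MOther")]

# Use the column names from the desired output and reset the row names
names(wanted) <- c("ID", "Age", "Gender", "Code", "Date", "Other")
rownames(wanted) <- NULL

wanted
```

The same idea should carry over to the full result by assigning the pipeline output to a variable first.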