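I have a large substance use dataset that records daily use over the past week. I'm trying to write a function that can process it easily. There are two steps I need to complete:
The data is currently labeled "sub1", "sub2", "sub3", and so on up to 6. Each has a corresponding "value" variable (e.g. "sub1_value" = Alcohol), followed by sub1_day1, sub1_day2, etc., which hold the amount used as a number. I want to create new variables based on the substance (e.g. day1_alcohol, day2_alcohol, day1_cocaine, etc.).
I don't want to copy, paste, and edit this 15 different times, so I'm trying to develop a function that can do it for me.
I've set up a reprex with 2 substances per person, 4 possible substances (c("Alcohol", "Cocaine", "Opiates", "Cannabis")), and 3 days of use.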
#Sample Data:
df <-
  data.frame(
    sub1_value = c("Alcohol", "Alcohol", "Cocaine", "Opiates", "Cannabis"),
    sub1_day1 = c(4, 3, 1, 0, 1),
    sub1_day2 = c(4, 7, 1, 0, 0),
    sub1_day3 = c(5, 6, 0, 1, 1),
    sub2_value = c("Cannabis", "Opiates", "Alcohol", "Cocaine", "Alcohol"),
    sub2_day1 = c(7, 2, 0, 0, 0),
    sub2_day2 = c(3, 2, 1, 1, 1),
    sub2_day3 = c(9, 8, 0, 1, 1)
  )
This code handles "sub1" and "sub2" for me. Is there a more efficient way to write this?
library(dplyr)

# NA_real_ keeps the "no match" branch numeric, which also works on
# dplyr versions where if_else() requires both branches to share a type.
df <- df %>%
  mutate(
    day1_alc = if_else(sub1_value == "Alcohol", sub1_day1,
                       if_else(sub2_value == "Alcohol", sub2_day1, NA_real_)),
    day2_alc = if_else(sub1_value == "Alcohol", sub1_day2,
                       if_else(sub2_value == "Alcohol", sub2_day2, NA_real_)),
    day3_alc = if_else(sub1_value == "Alcohol", sub1_day3,
                       if_else(sub2_value == "Alcohol", sub2_day3, NA_real_))
  )
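Question 2: how do I write a function that does this for all days and all substances? As I mentioned, I have many time points and substances, so I'd like to reduce the work as much as possible.
I'm expecting a dataset that keeps the original data but also has the variables day1_alc, day2_alc, day3_alc, day1_cocaine, day2_cocaine, etc., with the values copied from "sub1" into the appropriate new variable.
Thanks for any guidance on if statements or functions!
Edit: I figured out a solution, and would appreciate help with the function part.
Pivoting to long format might look like this: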
library(dplyr)
library(tidyr) # pivot_*
df %>%
mutate(rn = row_number()) %>%
pivot_longer(-c(rn, sub1_value, sub2_value), names_pattern = "(.*)_(.*)", names_to = c(".value", "day"))
# # A tibble: 15 × 6
# sub1_value sub2_value rn day sub1 sub2
# <chr> <chr> <int> <chr> <dbl> <dbl>
# 1 Alcohol Cannabis 1 day1 4 7
# 2 Alcohol Cannabis 1 day2 4 3
# 3 Alcohol Cannabis 1 day3 5 9
# 4 Alcohol Opiates 2 day1 3 2
# 5 Alcohol Opiates 2 day2 7 2
# 6 Alcohol Opiates 2 day3 6 8
# 7 Cocaine Alcohol 3 day1 1 0
# 8 Cocaine Alcohol 3 day2 1 1
# 9 Cocaine Alcohol 3 day3 0 0
# 10 Opiates Cocaine 4 day1 0 0
# 11 Opiates Cocaine 4 day2 0 1
# 12 Opiates Cocaine 4 day3 1 1
# 13 Cannabis Alcohol 5 day1 1 0
# 14 Cannabis Alcohol 5 day2 0 1
# 15 Cannabis Alcohol 5 day3 1 1
From here, we can use case_when to determine the alc column:
df %>%
mutate(rn = row_number()) %>%
pivot_longer(-c(rn, sub1_value, sub2_value), names_pattern = "(.*)_(.*)", names_to = c(".value", "day")) %>%
mutate(
alc = case_when(
sub1_value == "Alcohol" ~ sub1,
sub2_value == "Alcohol" ~ sub2,
.default = NA)
)
# # A tibble: 15 × 7
# sub1_value sub2_value rn day sub1 sub2 alc
# <chr> <chr> <int> <chr> <dbl> <dbl> <dbl>
# 1 Alcohol Cannabis 1 day1 4 7 4
# 2 Alcohol Cannabis 1 day2 4 3 4
# 3 Alcohol Cannabis 1 day3 5 9 5
# 4 Alcohol Opiates 2 day1 3 2 3
# 5 Alcohol Opiates 2 day2 7 2 7
# 6 Alcohol Opiates 2 day3 6 8 6
# 7 Cocaine Alcohol 3 day1 1 0 0
# 8 Cocaine Alcohol 3 day2 1 1 1
# 9 Cocaine Alcohol 3 day3 0 0 0
# 10 Opiates Cocaine 4 day1 0 0 NA
# 11 Opiates Cocaine 4 day2 0 1 NA
# 12 Opiates Cocaine 4 day3 1 1 NA
# 13 Cannabis Alcohol 5 day1 1 0 0
# 14 Cannabis Alcohol 5 day2 0 1 1
# 15 Cannabis Alcohol 5 day3 1 1 1
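With six slots and several substances, writing one case_when branch per slot gets repetitive. Below is a minimal sketch of a helper that applies the same first-match logic to any number of slots. The name pick_substance is hypothetical, and the helper assumes the long data has a numeric subN column and a matching subN_value column for each slot, as in the pivot above.

```r
# Hypothetical helper: for each row, return the use value from the first
# slot whose subN_value equals `substance`, or NA when no slot matches.
pick_substance <- function(long, substance) {
  # Slot columns are the ones named exactly sub1, sub2, ...
  slots <- grep("^sub[0-9]+$", names(long), value = TRUE)
  out <- rep(NA_real_, nrow(long))
  matched <- rep(FALSE, nrow(long))
  for (s in slots) {
    # %in% treats a missing substance label as "no match"
    hit <- !matched & long[[paste0(s, "_value")]] %in% substance
    out[hit] <- long[[s]][hit]
    matched <- matched | hit
  }
  out
}

# Possible usage, where `long` is the pivoted data from above:
# for (s in c("Alcohol", "Cocaine", "Opiates", "Cannabis")) {
#   long[[tolower(s)]] <- pick_substance(long, s)
# }
```

Tracking matched separately from out keeps the case_when behavior: once a slot matches, later slots are ignored even if the matched value is itself NA.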
If you need it back in wide format (I don't recommend it, but just in case):
df %>%
mutate(rn = row_number()) %>%
pivot_longer(-c(rn, sub1_value, sub2_value), names_pattern = "(.*)_(.*)", names_to = c(".value", "day")) %>%
mutate(alc = case_when(sub1_value == "Alcohol" ~ sub1, sub2_value == "Alcohol" ~ sub2, .default = NA)) %>%
pivot_wider(id_cols = c(rn, sub1_value, sub2_value), names_from = day, values_from = c(sub1, sub2, alc))
# # A tibble: 5 × 12
# rn sub1_value sub2_value sub1_day1 sub1_day2 sub1_day3 sub2_day1 sub2_day2 sub2_day3 alc_day1 alc_day2 alc_day3
# <int> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 1 Alcohol Cannabis 4 4 5 7 3 9 4 4 5
# 2 2 Alcohol Opiates 3 7 6 2 2 8 3 7 6
# 3 3 Cocaine Alcohol 1 1 0 0 1 0 0 1 0
# 4 4 Opiates Cocaine 0 0 1 0 1 1 NA NA NA
# 5 5 Cannabis Alcohol 1 0 1 0 1 1 0 1 1
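rn is only needed if you have to pivot the data back to its original wide format. If you have another field, not included here, that works well as an id-like field, it may be more appropriate, but I don't think rn is wrong here.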
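For the "function" part of the question, the same pivoting idea can be wrapped so that it covers every slot, day, and substance at once while keeping the original columns. This is only a sketch under some assumptions: add_substance_days is a hypothetical name, the columns follow the subN_value / subN_dayM naming from the sample data, and each person lists a given substance in at most one slot (otherwise pivot_wider would warn and return list-columns).

```r
library(dplyr)
library(tidyr)

# Hypothetical helper: adds one dayM_<substance> column per day and substance.
add_substance_days <- function(data) {
  data <- mutate(data, rn = row_number())

  by_substance <- data %>%
    # One row per person and slot: columns slot, value, day1, day2, ...
    pivot_longer(matches("^sub[0-9]+_"),
                 names_pattern = "^(sub[0-9]+)_(.*)$",
                 names_to = c("slot", ".value")) %>%
    # One row per person, slot, and day
    pivot_longer(matches("^day[0-9]+$"), names_to = "day", values_to = "use") %>%
    filter(!is.na(value)) %>%            # skip slots with no substance recorded
    mutate(substance = tolower(value)) %>%
    # Back to one row per person, one column per day/substance pair
    pivot_wider(id_cols = rn,
                names_from = c(day, substance),
                values_from = use,
                names_glue = "{day}_{substance}")

  data %>%
    left_join(by_substance, by = "rn") %>%
    select(-rn)
}

# Possible usage with the sample data from the question:
# df_new <- add_substance_days(df)
```

A person who never reports a substance gets NA in that substance's columns, which matches the final NA branch of the nested if_else in the question.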
Here is a tidyverse approach:
library(dplyr)
library(tidyr)
library(stringr)
df1 <- bind_rows(df[, 1:4], df[, 5:8] %>% rename_with(~colnames(df[1:4]))) %>%
rename_with(~str_replace(., ".*\\_", ""))
bind_cols(df1[1], df1 %>%
pivot_longer(-value,
names_to = "name",
values_to = "day") %>%
mutate(id =as.integer(gl(n(),3,n()))) %>%
pivot_wider(names_from = c(name, value),
values_from = day,
names_glue = "{name}_{value}")
) %>%
select(-id)
value day1_Alcohol day2_Alcohol day3_Alcohol day1_Cocaine day2_Cocaine day3_Cocaine day1_Opiates day2_Opiates day3_Opiates day1_Cannabis day2_Cannabis day3_Cannabis
1 Alcohol 4 4 5 NA NA NA NA NA NA NA NA NA
2 Alcohol 3 7 6 NA NA NA NA NA NA NA NA NA
3 Cocaine NA NA NA 1 1 0 NA NA NA NA NA NA
4 Opiates NA NA NA NA NA NA 0 0 1 NA NA NA
5 Cannabis NA NA NA NA NA NA NA NA NA 1 0 1
6 Cannabis NA NA NA NA NA NA NA NA NA 7 3 9
7 Opiates NA NA NA NA NA NA 2 2 8 NA NA NA
8 Alcohol 0 1 0 NA NA NA NA NA NA NA NA NA
9 Cocaine NA NA NA 0 1 1 NA NA NA NA NA NA
10 Alcohol 0 1 1 NA NA NA NA NA NA NA NA NA