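I'm a social scientist and work with survey data a lot. Many variables are four-point agree-disagree Likert scales with the response options "Strongly agree", "Somewhat agree", "Somewhat disagree", and "Strongly disagree", though some are six-point scales. A consistent part of my data cleaning is converting these variables into dichotomous factors (meaning they have only two response options, "Agree" and "Disagree"). Here is an example where `data` is the data frame, `x` is the original variable with all four response options, and `new_x` is the dichotomous variable: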
data %>%
  mutate(
    new_x = case_match(
      x,
      c(1:2) ~ "Agree",
      c(3:4) ~ "Disagree"
    )
  )
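Note that `case_match()` returns a character column here, not a factor. If a true factor is wanted, one option is to wrap the result in `factor()`. This is only a minimal sketch: the toy data and the level order ("Agree" first) are assumptions, not something the survey data dictates.

```r
library(dplyr)

# Hypothetical toy data using the 1-4 coding described above
data <- tibble::tibble(x = c(3, 4, 2, 1))

result <- data %>%
  mutate(
    new_x = factor(
      case_match(
        x,
        c(1:2) ~ "Agree",
        c(3:4) ~ "Disagree"
      ),
      levels = c("Agree", "Disagree") # assumed level order
    )
  )

result
```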
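The problem is that I often have to do this for more than 30 variables. I know I can use `across()` to apply the same transformation to all 30 variables, but I have to repeat this every few weeks when we receive new survey data. Instead, I'd like a function called `make_dicho()` that I can use inside `mutate()` and `across()`, so that I don't have to write out the whole `case_match()` expression every time. Here is a successful attempt at a basic version: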
library(dplyr)

# create sample data
df <- tibble::tribble(
  ~x, ~y, ~z,
  3, 2, 3,
  4, 4, 2,
  2, 3, 1,
  1, 1, 4
)
df

# create the function where values of 1-2 are "Agree" and 3-4 are "Disagree"
make_dicho <- function(var) {
  dplyr::case_match(
    var,
    c(1:2) ~ "Agree",
    c(3:4) ~ "Disagree"
  )
}

# check to see if it worked
df %>% mutate(new_x = make_dicho(x))
# success!
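Because this basic version takes nothing but the column vector, it can be handed straight to `across()`. A minimal sketch follows, restating the sample data and the function so the block runs on its own; the `new_` prefix is just an illustrative naming choice.

```r
library(dplyr)

df <- tibble::tribble(
  ~x, ~y, ~z,
  3, 2, 3,
  4, 4, 2,
  2, 3, 1,
  1, 1, 4
)

# Basic version: 1-2 become "Agree", 3-4 become "Disagree"
make_dicho <- function(var) {
  case_match(
    var,
    c(1:2) ~ "Agree",
    c(3:4) ~ "Disagree"
  )
}

# Apply the same recode to x, y, and z, keeping the originals
out <- df %>%
  mutate(across(x:z, make_dicho, .names = "new_{col}"))

out
```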
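This function works, but it is very brittle because it relies on survey designers and vendors using four response options and coding the values in one specific way. One way around this is to use the underlying metadata, which contains value labels indicating what each value means. Since most of my data comes with this metadata, I'd like to use it to decide automatically which values should be recoded as "Agree" and which as "Disagree". This complicates things considerably, because I now need to add a new argument for the data frame. Here is what I've come up with so far: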
library(tidyverse)

# add value labels to the data
data <- tribble(
  ~x, ~y, ~z,
  3, 2, 3,
  4, 4, 2,
  2, 3, 1,
  1, 1, 4
) %>%
  # add value labels
  labelled::set_value_labels(
    x = c(`Strongly agree` = 1, `Somewhat agree` = 2, `Somewhat disagree` = 3, `Strongly disagree` = 4),
    y = c(`Strongly agree` = 1, `Somewhat agree` = 2, `Somewhat disagree` = 3, `Strongly disagree` = 4),
    z = c(`Strongly agree` = 1, `Somewhat agree` = 2, `Somewhat disagree` = 3, `Strongly disagree` = 4)
  )

# write the new function
make_dicho <- function(df = NULL, var) {
  ## if var is a symbol convert it to a string
  # "Returns a naked expression of the variable"
  var <- rlang::enexpr(var)
  if (!is.character(var)) {
    # convert to a sym() object and then use as_name to make it a string
    var <- rlang::as_name(rlang::ensym(var))
  }

  # Since this is taking advantage of labelled data, it should be of class haven_labelled
  if (class(df[[var]])[1] == "haven_labelled") {
    ### Set up vectors based on the underlying attribute
    # get the named vector
    labs <- attributes(df[[var]])$labels
    # flip the names
    labs <- setNames(names(labs), labs)

    # get the agree vector by removing the strings containing "disagree" or "Disagree"
    agree_vec <- labs[!str_detect(labs, pattern = "disagree|Disagree")]
    # now flip the vector back and make it numeric
    # enframe() converts named atomic vectors or lists to one- or two-column data frames.
    agree_vec <- enframe(agree_vec) %>%
      # put the "value" column at the beginning of the df
      relocate(value) %>%
      # convert "name" to numeric
      mutate(name = as.numeric(name)) %>%
      # deframe() converts two-column data frames to a named vector or list
      deframe()

    # get the disagree vector by keeping the strings containing "disagree" or "Disagree"
    disagree_vec <- labs[str_detect(labs, pattern = "disagree|Disagree")]
    # now flip the vector back and make it numeric
    # enframe() converts named atomic vectors or lists to one- or two-column data frames.
    disagree_vec <- enframe(disagree_vec) %>%
      # put the "value" column at the beginning of the df
      relocate(value) %>%
      # convert "name" to numeric
      mutate(name = as.numeric(name)) %>%
      # deframe() converts two-column data frames to a named vector or list
      deframe()

    ### now build the case_match() call,
    # passing df[[var]] so that it knows which vector to use
    dplyr::case_match(
      df[[var]],
      agree_vec ~ "Agree",
      disagree_vec ~ "Disagree"
    )
  }
}

# test function
data %>% mutate(new_x = make_dicho(x))
This fails with the error `argument "var" is missing, with no default`. However, if I add `.` as the first argument of `make_dicho()`, it works. Like this:
data %>% mutate(new_x = make_dicho(., x))
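My first question is: how do I update my function so that it no longer needs the leading `.`? Second, how do I get it to work inside `across()`? Here is the code I used for `across()`: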
# make all three variables dichotomous factors with "new_" prefix
data %>% mutate(
  across(
    c(x:z),
    ~make_dicho(., .x),
    .names = "new_{col}"
  )
)
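This throws an error when I try to use `across()`. My guess is that it has to do with the `.` in the `make_dicho()` call and with the `df[[var]]` call inside `case_match()`. But honestly I don't know, and although it feels like I'm very close, for all I know this function could be way off.
I hope this request is easy to follow even though it is a bit complicated. Thanks for your help!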
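Here is one possible approach that does not evaluate the survey metadata and simply works from the expected coding. You still need to know the coding of each variable, pass the matching variables to `.cols`, and supply the corresponding arguments to `make_dicho` inside `mutate(across(...))`.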
library(tidyverse)

## create sample data
## say `up` means higher values indicate stronger agreement and `down`
## means lower values indicate stronger agreement. also introduce some
## "errors" you may run into (e.g., `NA`, unrealistic values).
dat <- tibble(
  sc6_up_x = c(1, 4, 5, 2),
  sc6_down_x = c(2, 5, 1000, 4),
  sc6_up_y = c(NA, 6, 1, 6),
  sc4_down_x = c(3, 4, 2, 1),
  sc4_down_y = c(2, 4, 3, 1),
  sc4_up_x = c(3, 2, 1, 4)
)
dat

## create function: agr_l:agr_u is the range of values coded as "Agree",
## dis_l:dis_u the range coded as "Disagree"; anything else becomes NA
make_dicho <- function(x_var, agr_l, agr_u, dis_l, dis_u) {
  dplyr::case_match(
    x_var,
    c(agr_l:agr_u) ~ "Agree",
    c(dis_l:dis_u) ~ "Disagree"
  )
}

## check to see if it works when using one variable
dat %>%
  mutate(new_x = make_dicho(sc6_down_x, 1, 3, 4, 6))

## apply function to several variables of interest and with appropriate
## arguments
dat %>%
  mutate(
    across(
      .cols = c(sc6_up_x, sc6_up_y),
      .fns = list(
        make_dicho = ~ make_dicho(.x, 4, 6, 1, 3)
      ),
      .names = "new_{col}"
    ),
    across(
      .cols = c(sc6_down_x),
      .fns = list(
        make_dicho = ~ make_dicho(.x, 1, 3, 4, 6)
      ),
      .names = "new_{col}"
    ),
    across(
      .cols = c(sc4_up_x),
      .fns = list(
        make_dicho = ~ make_dicho(.x, 3, 4, 1, 2)
      ),
      .names = "new_{col}"
    ),
    across(
      .cols = c(sc4_down_x, sc4_down_y),
      .fns = list(
        make_dicho = ~ make_dicho(.x, 1, 2, 3, 4)
      ),
      .names = "new_{col}"
    )
  )
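If the value labels are available, a metadata-driven variant may also be possible without a data frame argument at all. In the question's function, `make_dicho(x)` matches `x` positionally to `df`, which leaves `var` missing. Since the labels are stored as an attribute on each column vector, a function can read them from the vector it receives. The following is only a sketch under stated assumptions: `make_dicho_lab` is a hypothetical name, every label is assumed to contain "agree" or "disagree" (case-insensitive), and values whose label matches neither (for example a neutral midpoint) come out as `NA`.

```r
library(dplyr)
library(labelled)

# Sample data with value labels, as in the question
lik4 <- c(
  `Strongly agree` = 1, `Somewhat agree` = 2,
  `Somewhat disagree` = 3, `Strongly disagree` = 4
)
data <- tibble::tribble(
  ~x, ~y, ~z,
  3, 2, 3,
  4, 4, 2,
  2, 3, 1,
  1, 1, 4
) %>%
  set_value_labels(x = lik4, y = lik4, z = lik4)

# Hypothetical helper: takes the column vector only
make_dicho_lab <- function(var) {
  # Named vector of value labels: names are label text, values are codes
  labs <- attr(var, "labels", exact = TRUE)
  if (is.null(labs)) {
    stop("`var` has no value labels")
  }
  # Assumption: label text contains "agree" or "disagree"
  is_disagree <- grepl("disagree", names(labs), ignore.case = TRUE)
  is_agree <- grepl("agree", names(labs), ignore.case = TRUE) & !is_disagree
  # Compare the underlying codes against the two sets of labelled codes
  values <- unclass(var)
  case_when(
    values %in% labs[is_agree] ~ "Agree",
    values %in% labs[is_disagree] ~ "Disagree"
  )
}

# Works on a single column ...
single <- data %>% mutate(new_x = make_dicho_lab(x))

# ... and inside across()
out <- data %>%
  mutate(across(x:z, make_dicho_lab, .names = "new_{col}"))

out
```

Because the split is derived from the labels rather than from hard-coded ranges, the same call should cover four-point and six-point scales and either coding direction, as long as the label wording follows the assumption above.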