base::levels
The help file https://stat.ethz.ch/R-manual/R-devel/library/base/html/levels.html contains the following example of modifying the levels of a variable:
z <- gl(3, 2, 12, labels = c("apple", "salad", "orange"))
z
levels(z) <- c("fruit", "veg", "fruit")
z
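As a quick check of what that example does, here is a minimal sketch (the names z_before and z_after are mine, not from the help file). It suggests that assigning a repeated label merges the corresponding levels and leaves the order of the observations untouched.

```r
# Same factor as in the help example: 3 levels, each repeated twice, total length 12
z_before <- gl(3, 2, 12, labels = c("apple", "salad", "orange"))
z_after <- z_before

# A repeated label ("fruit") collapses "apple" and "orange" into a single level
levels(z_after) <- c("fruit", "veg", "fruit")

levels(z_after)            # "fruit" "veg"
table(z_before, z_after)   # cross-tabulates old levels against new ones
```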
Suppose these live inside a data frame:
mydata <- data.frame(z=gl(3, 2, 12, labels = c("apple", "salad", "orange")), n=1:12)
I want to write a function that takes the data frame and a variable name as inputs and performs the level conversion:
modify_levels <- function(df,varname,from,to) {
### MAGIC HAPPENS
}
so that modify_levels(mydata,z,from=c("apple","orange"),to="fruit")
does one part of the conversion (and modify_levels(mydata,z,from=c("salad","broccoli"),to="veg")
does the second part, even though the level broccoli may not exist in my data set).
With some non-standard evaluation voodoo, I can narrow down what I need to modify:
where_are_levels <- function(df,varname,from,to,verbose=FALSE) {
  # input checks
  if ( !is.data.frame(df) ) {
    stop("df is not a data frame")
  }
  if ( !is.factor(eval(substitute(varname),df)) ) {
    stop("df$varname is not a factor")
  }
  if (verbose==TRUE) {
    cat("df$varname is",
        paste0(substitute(df),"$",substitute(varname)))
    cat(" which evaluates to:\n")
    print( eval(substitute(varname),df) )
  }
  if (length(to)!=1) {
    stop("Substitution is ambiguous")
  }
  # figure out what the cases are with the supplied source values
  for (val in from) {
    r <- (eval(substitute(varname),df) == val)
    if (verbose==TRUE) {
      print(r)
      cat( paste0(substitute(df),"$",substitute(varname)),"==",val)
      cat(": ",sum(r), "case(s)\n")
    }
  }
}
So far, so good (the to option does nothing yet):
> where_are_levels(mydata,z,from=c("apple","orange"),to="",verbose=TRUE)
## df$varname is mydata$z which evaluates to:
## [1] apple apple salad salad orange orange apple apple salad salad orange orange
## Levels: apple salad orange
## [1] TRUE TRUE FALSE FALSE FALSE FALSE TRUE TRUE FALSE FALSE FALSE FALSE
## mydata$z == apple: 4 case(s)
## [1] FALSE FALSE FALSE FALSE TRUE TRUE FALSE FALSE FALSE FALSE TRUE TRUE
## mydata$z == orange: 4 case(s)
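Now, for the next step, I think what I need to do is append an extra level to the levels of the target variable and then change the values of that variable. In interactive work, I would do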
# to <- "fruit" # passed as a function argument
l1 <- levels(mydata$z)
levels(mydata$z) <- union(l1,to)
mydata[r,"z"] <- to
of which I can only get the first line programmatically inside the val loop:
l1 <- levels(eval(substitute(varname),df))
This would happen inside the val loop.
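Note that I want to preserve the existing levels apple and orange, rather than overhaul the whole thing (as is done in the overhaul example in the help file).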
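The interactive steps above can be sketched as a runnable snippet. This is only an illustration under my own assumptions: it selects the rows with %in% for all source values at once instead of looping over val, and the variables from and to stand in for the function arguments.

```r
mydata <- data.frame(z = gl(3, 2, 12, labels = c("apple", "salad", "orange")), n = 1:12)

from <- c("apple", "orange")   # stand-in for the function argument
to <- "fruit"                  # stand-in for the function argument

# Rows to recode, for all source values at once (no loop over val)
r <- mydata$z %in% from

# Append the new level while keeping the existing ones
levels(mydata$z) <- union(levels(mydata$z), to)

# Recode the matching rows
mydata[r, "z"] <- to

levels(mydata$z)   # "apple" "salad" "orange" "fruit"
table(mydata$z)    # apple and orange remain as levels, now with zero cases
```

Unlike the help-file example, this keeps apple and orange as (empty) levels, which is the behaviour asked for here.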
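If a solution is easier to implement from scratch by programming with dplyr, that is fine with me (although my understanding is that NSE is harder in dplyr than in base R).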
You don't need all those substitute() calls; one is enough. I'll keep all of your messages:
where_are_levels <- function(df,varname,from,to,verbose=FALSE) {
  # capture the names once, before df is modified below
  dfname <- deparse(substitute(df))
  varname <- as.character(substitute(varname))
  # input checks
  if (!is.data.frame(df)) {
    stop("df is not a data frame")
  }
  if (!is.factor(df[[varname]])) {
    stop("df$varname is not a factor")
  }
  if (verbose) {
    cat("df$varname is", paste0(dfname,"$",varname))
    cat(" which evaluates to:\n")
    print(df[[varname]])
  }
  if (length(to) != 1) {
    stop("Substitution is ambiguous")
  }
  # figure out what the cases are with the supplied source values
  r <- df[[varname]] %in% from
  # append the target level, keeping the existing ones, then recode the matching rows
  new_levels <- union(levels(df[[varname]]), to)
  df[[varname]] <- factor(df[[varname]], new_levels)
  df[[varname]] <- replace(df[[varname]], r, to)
  if (verbose) {
    print(r)
    cat(paste0(dfname,"$",varname), "%in%", paste(from, collapse=", "))
    cat(": ",sum(r), "case(s)\n")
  }
  return(df)
}
where_are_levels(mydata,z,from=c("apple","orange"),to="fruit")
       z  n
1  fruit  1
2  fruit  2
3  salad  3
4  salad  4
5  fruit  5
6  fruit  6
7  fruit  7
8  fruit  8
9  salad  9
10 salad 10
11 fruit 11
12 fruit 12
I don't think you need non-standard evaluation or any tidy magic. Just use the ordinary "[[" and levels<-:
modify_levels <- function(dfrm, cname, from=NA,to=NA) {
  # positions within the levels (not within from) that should be renamed
  pos <- which( levels(dfrm[[cname]]) %in% from )
  levels(dfrm[[cname]])[pos] <- to
  dfrm[[cname]]
} # be sure to assign the result back
Usage (here mydata$z already has apple and orange merged into fruit):
> modify_levels(mydata,'z',from=c("salad","broccoli"),to="veg")
[1] fruit fruit veg veg fruit fruit fruit fruit veg veg fruit fruit
Levels: fruit veg
But the result needs to be assigned back:
> mydata$z <- modify_levels(mydata,'z',from=c("salad","broccoli"),to="veg")
> mydata
z n
1 fruit 1
2 fruit 2
3 veg 3
4 veg 4
5 fruit 5
6 fruit 6
7 fruit 7
8 fruit 8
9 veg 9
10 veg 10
11 fruit 11
12 fruit 12
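If assigning the column back by hand feels error-prone, one possible variant returns the whole data frame instead. This is a sketch of my own, not the answerer's code: the name modify_levels_df and the input checks are hypothetical additions. Note that it renames (and therefore merges) levels, so the original apple and orange levels are not kept.

```r
# Hypothetical variant: rename levels in place and return the full data frame
modify_levels_df <- function(dfrm, cname, from, to) {
  stopifnot(is.data.frame(dfrm), is.factor(dfrm[[cname]]), length(to) == 1)
  lev <- levels(dfrm[[cname]])
  # Values of from that are not among the levels (e.g. "broccoli") are simply ignored
  lev[lev %in% from] <- to
  # levels<- merges any labels that are now duplicated
  levels(dfrm[[cname]]) <- lev
  dfrm
}

mydata <- data.frame(z = gl(3, 2, 12, labels = c("apple", "salad", "orange")), n = 1:12)
mydata <- modify_levels_df(mydata, "z", from = c("apple", "orange"), to = "fruit")
mydata <- modify_levels_df(mydata, "z", from = c("salad", "broccoli"), to = "veg")
levels(mydata$z)   # "fruit" "veg"
```

The column name is passed as a string here, which sidesteps non-standard evaluation entirely.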
You can change the function to:
where_are_levels <- function(mydata, varname, from, to, additional) {
  mydata[[varname]] <- plyr::mapvalues(mydata[[varname]], from = from, to = to)
  mydata[[varname]] <- factor(mydata[[varname]], levels = c(levels(mydata[[varname]]), additional))
  return(mydata)
}
Example:
varname="z"
from = c("apple", "salad","orange")
to = c("fruit", "veg", "fruit")
additional="Milk"
a<-where_are_levels(mydata,varname,from, to, additional)
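For readers who would rather not depend on plyr, the level mapping can be approximated in base R with match(). The helper name map_levels below is hypothetical, and this is only a sketch of the idea, not a drop-in replacement for plyr::mapvalues (it handles factors only and gives no warning for source values missing from the data).

```r
# Hypothetical base-R helper: rename factor levels pairwise, from[i] -> to[i]
map_levels <- function(f, from, to) {
  stopifnot(is.factor(f), length(from) == length(to))
  lev <- levels(f)
  idx <- match(lev, from)        # position of each level in from, NA if not listed
  hit <- !is.na(idx)
  lev[hit] <- to[idx[hit]]
  levels(f) <- lev               # duplicated labels are merged
  f
}

z <- gl(3, 2, 12, labels = c("apple", "salad", "orange"))
z2 <- map_levels(z,
                 from = c("apple", "salad", "orange", "broccoli"),
                 to   = c("fruit", "veg", "fruit", "veg"))
levels(z2)   # "fruit" "veg"

# Extra, currently unused levels can still be appended afterwards
z2 <- factor(z2, levels = c(levels(z2), "Milk"))
levels(z2)   # "fruit" "veg" "Milk"
```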