I have been using the following function to create evenly sized bin variables:
## Even Bins Function
evenbins <- function(x, bin.count = 5, order = TRUE) {
  bin.size <- rep(length(x) %/% bin.count, bin.count)
  bin.size <- bin.size + ifelse(1:bin.count <= length(x) %% bin.count, 1, 0)
  bin <- rep(1:bin.count, bin.size)
  if (order) {
    bin <- bin[rank(x, ties.method = "random")]
  }
  return(factor(bin, levels = 1:bin.count, ordered = order))
}
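This works well for binning numeric values; however, it groups the NAs into the last group (in this case the 5th bin). So when pivoted, it does something like this:
I would like to adjust the function so that NAs are removed from the initial binning and kept as NA values, so that when I group by the bin column it produces the following:
Thanks in advance for reading and for any help!!
Sample code to use: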
## Set up fake dataset
df1 <- data.frame(x = 1:450)
df2 <- data.frame(x = 1:50)
df2$x <- NA
df3 <- rbind(df1, df2)
## Even Bins Function
evenbins <- function(x, bin.count = 5, order = TRUE) {
  bin.size <- rep(length(x) %/% bin.count, bin.count)
  bin.size <- bin.size + ifelse(1:bin.count <= length(x) %% bin.count, 1, 0)
  bin <- rep(1:bin.count, bin.size)
  if (order) {
    bin <- bin[rank(x, ties.method = "random")]
  }
  return(factor(bin, levels = 1:bin.count, ordered = order))
}
df3$Bin <- evenbins(df3$x)
df3$isNA <- ifelse(is.na(df3$x), "# NA", "complete")
t1 <- cbind(
  table(df3$Bin),
  table(df3$Bin, df3$isNA)
)
This is a simple modification: count the number of NAs, remove them, and then tack them back on at the end:
evenbins <- function(x, bin.count = 5, order = TRUE) {
  n_na <- sum(is.na(x))   # number of missing values
  x <- na.omit(x)         # bin only the non-missing values
  bin.size <- rep(length(x) %/% bin.count, bin.count)
  bin.size <- bin.size + ifelse(1:bin.count <= length(x) %% bin.count, 1, 0)
  bin <- rep(1:bin.count, bin.size)
  if (order) {
    bin <- bin[rank(x, ties.method = "random")]
  }
  # append the NAs after the binned values
  return(factor(c(bin, rep(NA, n_na)), levels = 1:bin.count, ordered = order))
}
df3 <- rbind(df1, df2)
df3$Bin <- evenbins(df3$x)
df3$isNA <- ifelse(is.na(df3$x), "# NA", "complete")
cbind(
  table(df3$Bin, useNA = "always"),
  table(df3$Bin, df3$isNA, useNA = "always")
)
#         # NA complete <NA>
# 1    90    0       90    0
# 2    90    0       90    0
# 3    90    0       90    0
# 4    90    0       90    0
# 5    90    0       90    0
# <NA> 50   50        0    0
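One caveat: appending the NAs at the end lines up with the input only when the missing values are already the last elements of x, as they are in df3. If they can appear anywhere in the vector, a position-preserving variant may be safer. The following is a minimal sketch under that assumption, not part of the original answer; the name `evenbins_keep_pos` and the helper variable `ok` are hypothetical.

```r
# Hypothetical variant: bins only the non-missing values and writes the
# result back into their original positions, leaving NA where x is NA.
evenbins_keep_pos <- function(x, bin.count = 5, order = TRUE) {
  ok <- !is.na(x)                      # TRUE where x has a value
  n <- sum(ok)                         # number of values to bin
  bin.size <- rep(n %/% bin.count, bin.count)
  bin.size <- bin.size + ifelse(seq_len(bin.count) <= n %% bin.count, 1, 0)
  bin <- rep(seq_len(bin.count), bin.size)
  if (order) {
    bin <- bin[rank(x[ok], ties.method = "random")]
  }
  out <- rep(NA_integer_, length(x))   # start with all NA
  out[ok] <- bin                       # fill only the non-missing slots
  factor(out, levels = seq_len(bin.count), ordered = order)
}

# Toy input (made up for illustration) with NAs in the middle
x <- c(NA, 10, 3, NA, 7, 1, 8, 5)
b <- evenbins_keep_pos(x, bin.count = 3)
is.na(b)                      # positions 1 and 4 stay NA
table(b, useNA = "always")    # bin counts plus the NA count
```

With the toy input, the six non-missing values are split across three bins and the two NA entries remain in their original positions.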
Here is a fairly simple base R solution:
as.data.frame( table( (df3+100) %/% 100, useNA="always") , make.names = TRUE)
x Freq
1 1 99
2 2 100
3 3 100
4 4 100
5 5 51
6 <NA> 50
The key trick is to count the NAs by adding the useNA argument to table. The +100 just shifts the values so that the bin labels start at 1.
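To make the effect of useNA concrete, here is a small sketch on a made-up vector (`v` is a hypothetical name, not from the answer). It compares the default behaviour of table with the "always" and "ifany" settings.

```r
# Toy vector (made up for illustration) with two missing values
v <- c(1, 2, 2, NA, 3, NA)

table(v)                    # default: NA values are dropped from the counts
table(v, useNA = "always")  # adds an <NA> cell, even when its count is 0
table(v, useNA = "ifany")   # adds an <NA> cell only when NAs are present
```

Using "always" keeps the shape of the output stable across datasets, which is convenient when the tables are combined afterwards with cbind.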