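I have the data frame below. I want to find which percentile of "value" each "benchmark" falls at. For example, a "benchmark" of 100 is roughly the 75th percentile of "value".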
library(dplyr)

group <- c(1, 1, 1, 2, 2, 2)
benchmark <- c(100, 100, 100, 200, 200, 200)
value <- c(50, 80, 120, 150, 230, 250)
d_f <- data.frame(group, benchmark, value)

# Quantiles of value within each group
d_f %>%
  group_by(group, benchmark) %>%
  summarise(
    q25 = quantile(value, 0.25),
    q50 = quantile(value, 0.50),
    q75 = quantile(value, 0.75)
    # more percentiles can be added here
  )
Thanks!
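I think you need ecdf. The remaining question (for me) is whether your empirical cumulative distribution should be computed per group or over the whole column.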
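For context, ecdf() returns a function rather than a number: calling it on a vector builds a step function, and calling that function on a point gives the share of observations at or below that point. A minimal sketch using the first group's values from the question (the names x and Fn are only illustrative):

```r
# Values of group 1 from the question
x <- c(50, 80, 120)

# ecdf() returns a function (a right-continuous step function)
Fn <- ecdf(x)

# Evaluate it at a few points: below the minimum, at an observation,
# between observations, and at the maximum
Fn(c(49, 50, 100, 120))
# 0, 1/3, 2/3, 1
```

Multiplying the result by 100 turns the proportion into a percentile.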
Per group:
d_f %>%
  group_by(group, benchmark) %>%
  mutate(bench_pctile = ecdf(value)(benchmark) * 100)
# # A tibble: 6 x 4
# # Groups: group, benchmark [2]
# group benchmark value bench_pctile
# <dbl> <dbl> <dbl> <dbl>
# 1 1 100 50 66.7
# 2 1 100 80 66.7
# 3 1 100 120 66.7
# 4 2 200 150 33.3
# 5 2 200 230 33.3
# 6 2 200 250 33.3
Or, over the whole column, in which case we need to call ecdf before grouping:
valecdf <- ecdf(d_f$value)
d_f %>%
  group_by(group, benchmark) %>%
  mutate(bench_pctile = valecdf(benchmark) * 100)
# # A tibble: 6 x 4
# # Groups: group, benchmark [2]
# group benchmark value bench_pctile
# <dbl> <dbl> <dbl> <dbl>
# 1 1 100 50 33.3
# 2 1 100 80 33.3
# 3 1 100 120 33.3
# 4 2 200 150 66.7
# 5 2 200 230 66.7
# 6 2 200 250 66.7
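If only one number per group is wanted and dplyr is not available, the per-group variant could be sketched in base R roughly as follows. This assumes each group has a single benchmark value, as in the question's data; per_group is just an illustrative name.

```r
d_f <- data.frame(
  group     = c(1, 1, 1, 2, 2, 2),
  benchmark = c(100, 100, 100, 200, 200, 200),
  value     = c(50, 80, 120, 150, 230, 250)
)

# Split by group, then evaluate each group's ECDF at that group's benchmark.
# Assumption: benchmark is constant within a group, so the first value is used.
per_group <- sapply(
  split(d_f, d_f$group),
  function(g) ecdf(g$value)(g$benchmark[1]) * 100
)
per_group
# named vector: group 1 is 200/3 (about 66.7), group 2 is 100/3 (about 33.3)
```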
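One way to sanity-check these results is to compute the proportions directly, since ecdf(x)(t) is the share of x that is <= t:
### grouped
mean(d_f$value[1:3] <= 100)
# [1] 0.6666667
mean(d_f$value[4:6] <= 200)
# [1] 0.3333333
### ungrouped
mean(d_f$value <= 100)
# [1] 0.3333333
mean(d_f$value <= 200)
# [1] 0.6666667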
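Note that the per-group ecdf result for group 1 is 66.7, whereas the question expects roughly the 75th percentile. The gap comes from quantile()'s default (type 7) interpolating linearly between order statistics, while ecdf is a step function. If the interpolated figure is what is wanted, one possible sketch inverts that interpolation with approx(). interp_pctile is a hypothetical helper, not part of any package, and it assumes at least two distinct values per group.

```r
# Hypothetical helper: invert the linear interpolation used by
# quantile(type = 7), where the k-th smallest of n values sits at
# probability (k - 1) / (n - 1).
# Assumption: `values` holds at least two distinct numbers.
interp_pctile <- function(values, bench) {
  v <- sort(values)
  p <- seq(0, 1, length.out = length(v))
  # rule = 2 clamps benchmarks outside the observed range to 0 or 1
  approx(x = v, y = p, xout = bench, rule = 2)$y * 100
}

interp_pctile(c(50, 80, 120), 100)    # 75
interp_pctile(c(150, 230, 250), 200)  # 31.25

# Round trip: feeding the result back into quantile() recovers the benchmark
quantile(c(50, 80, 120), interp_pctile(c(50, 80, 120), 100) / 100)
```

Because approx() is vectorised over xout, the helper should also work inside the grouped pipeline, for example as mutate(bench_pctile = interp_pctile(value, benchmark)).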