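I summarize my data by group and compute the mean per group to simplify the visualization. Unfortunately, some of my groups are very large and some are nearly empty. I would like to apply a rolling mean to smooth the picture further. Here is some similar data: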
# load packages
library(haven)
library(dplyr)
library(ggplot2)
# read dta file from github
soep <- read_dta("https://github.com/MarcoKuehne/marcokuehne.github.io/blob/main/data/SOEP/soep_lebensz_en/soep_lebensz_en.dta?raw=true")
# mean satisfaction and group size per education level and sex
soep %>%
  group_by(education, sex) %>%
  summarise(across(satisf_org, mean, na.rm = TRUE),
            n = n()) %>%
  ggplot(aes(x = education, y = satisf_org, col = as.factor(sex))) +
  geom_point() +
  labs(title = "Mean Satisfaction per Education Level by Gender",
       x = "Education", y = "Mean Satisfaction", color = "Gender")
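The mean satisfaction of females with 8.5 years of education looks like an outlier. I assume that people in adjacent years of education do not differ so much that they cannot be pooled, i.e. I want to compute the mean satisfaction of all people with education 7, 8.5 and 9 (grouped by sex) and store it as the rolling mean for 8.5 (grouped by sex).
Starting from the standard grouping: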
soep %>%
  group_by(education, sex) %>%
  summarise(across(satisf_org, mean, na.rm = TRUE),
            n = n())
# A tibble: 28 × 4
# Groups: education [14]
education sex satisf_org n
<dbl> <dbl+lbl> <dbl> <int>
1 7 0 [male] 6.16 73
2 7 1 [female] 6.59 113
3 8.5 0 [male] 7.16 37
4 8.5 1 [female] 8.56 18
5 9 0 [male] 6.88 430
6 9 1 [female] 7.00 633
7 10 0 [male] 7.19 144
8 10 1 [female] 7.36 221
9 10.5 0 [male] 6.96 1538
10 10.5 1 [female] 7.02 1493
# … with 18 more rows
# ℹ Use `print(n = ...)` to see more rows
This is the number I expect:
soep %>%
  filter(sex == 1) %>%                  # only look at females
  # intended: the education levels before and after 8.5
  # (note that 7:9 is c(7, 8, 9), which does not contain 8.5)
  filter(education %in% c(7:9)) %>%
  summarise(mean(satisf_org))           # calculate the "rolling mean" per group
# A tibble: 1 × 1
`mean(satisf_org)`
<dbl>
1 6.93
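This is what I expect as the rolling mean per group for each value, i.e. 6.93 instead of 8.56.
PS: In my real data I look at age in years, and I usually have at least a few people at every age. So the rolling window can be -1 to +1 (numeric) rather than the lead/lag neighbors.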
You can group_by sex and compute the rolling mean there:
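library(dplyr)
library(slider)
soep %>%
  # mean satisfaction and group size per education level and sex
  group_by(education, sex) %>%
  summarise(across(satisf_org, mean, na.rm = TRUE),
            n = n()) %>%
  # within each sex, average each group mean with its previous and next neighbor
  group_by(sex) %>%
  mutate(rolling_mean = slide_dbl(satisf_org, mean, .before = 1, .after = 1))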
Output
# A tibble: 28 × 5
# Groups: sex [2]
education sex satisf_org n rolling_mean
<dbl> <dbl+lbl> <dbl> <int> <dbl>
1 7 0 [male] 6.16 73 6.66
2 7 1 [female] 6.59 113 7.57
3 8.5 0 [male] 7.16 37 6.73
4 8.5 1 [female] 8.56 18 7.38
5 9 0 [male] 6.88 430 7.08
6 9 1 [female] 7.00 633 7.64
7 10 0 [male] 7.19 144 7.01
8 10 1 [female] 7.36 221 7.13
9 10.5 0 [male] 6.96 1538 7.14
10 10.5 1 [female] 7.02 1493 7.20
# … with 18 more rows
# ℹ Use `print(n = ...)` to see more rows
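Note that rolling_mean above averages the group means with equal weight, which is why the female 8.5 row shows 7.38 rather than the pooled 6.93 from the question. If the goal is a mean over all people in the window, one possible approach is to weight each group mean by its size. The following is only a sketch on made-up numbers: the `grouped` tibble and the column names `pooled_mean` and `pooled_mean_value` are hypothetical, and it assumes `n` counts non-missing observations (e.g. `sum(!is.na(satisf_org))` rather than `n()`). The second column uses `slide_index_dbl()` so that the window is defined by the education value (-1 to +1), as mentioned in the PS, instead of by neighboring rows.

```r
library(dplyr)
library(slider)

# Toy per-group summary with hypothetical numbers (one sex only, for brevity)
grouped <- tibble(
  education  = c(7, 8.5, 9, 10),
  sex        = c(1, 1, 1, 1),
  satisf_org = c(6, 9, 7, 8),    # group means
  n          = c(10, 2, 20, 8)   # group sizes (assumed: non-missing observations)
)

result <- grouped %>%
  group_by(sex) %>%
  arrange(education, .by_group = TRUE) %>%   # the index must be ascending
  mutate(
    # position-based window: previous, current and next education level;
    # sum(mean * n) / sum(n) is the mean over all people in the window
    pooled_mean =
      slide_dbl(satisf_org * n, sum, .before = 1, .after = 1) /
      slide_dbl(n, sum, .before = 1, .after = 1),
    # value-based window: every level within +/- 1 of the current education value
    pooled_mean_value =
      slide_index_dbl(satisf_org * n, education, sum, .before = 1, .after = 1) /
      slide_index_dbl(n, education, sum, .before = 1, .after = 1)
  ) %>%
  ungroup()

print(result)
```

With the real data, the same two `mutate()` expressions could be applied to the summarised tibble in place of the unweighted `slide_dbl(satisf_org, mean, ...)` call. Whether the position-based or the value-based window is the better fit depends on how evenly spaced the education (or age) values are.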