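Below are my problem and my solution. The purpose of this question is to find out whether there is a simpler way to do the following, because I feel my code is overly complicated; however, I could not find a simpler solution on Stack Overflow, other sites, or YouTube.
The dataset I'm using is Los Angeles crime statistics filtered to 2020 and condensed to 10 features and 198,908 observations. Here is the head() of the dataset: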
LA2020 %>%
select(crime_description, victim_sex) %>%
head()
                                        crime_description victim_sex
1                                BATTERY - SIMPLE ASSAULT          F
2                                BATTERY - SIMPLE ASSAULT          M
3               SEX OFFENDER REGISTRANT OUT OF COMPLIANCE          X
4                VANDALISM - MISDEAMEANOR ($399 OR UNDER)          F
5 VANDALISM - FELONY ($400 & OVER, ALL CHURCH VANDALISMS)          X
6                                          RAPE, FORCIBLE          F
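What I want to do is plot "crime_description" with geom_bar in ggplot, using the "victim_sex" column for the "fill" argument. There are 129 distinct categories in the "crime_description" column, so in the geom_bar I only want to show the top 15 crimes, sorted from highest to lowest, left to right.
I have read many solutions to this problem, and most of the answers prepare the top 15 crimes first and only then go into ggplot. I was able to get it plotted; however, sorting and showing only the top 15 was the biggest hurdle for me. What I ended up doing was creating a data frame with a series of left_join calls. See below.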
LA2020 %>%
count(crime_description, sort = TRUE) %>%
rename(cnt = n) %>%
top_n(15, cnt) %>%
left_join(
(LA2020 %>%
filter(victim_sex == "M") %>%
count(crime_description)),
by = "crime_description") %>%
rename(M = n) %>%
left_join(
(LA2020 %>%
filter(victim_sex == "F") %>%
count(crime_description)),
by = "crime_description") %>%
rename(F = n) %>%
left_join(
(LA2020 %>%
filter(victim_sex == "X") %>%
count(crime_description)),
by = "crime_description") %>%
rename(X = n) %>%
left_join(
(LA2020 %>%
filter(victim_sex == "H") %>%
count(crime_description)),
by = "crime_description") %>%
rename(H = n) %>%
left_join(
(LA2020 %>%
filter(victim_sex == "O") %>%
count(crime_description)),
by = "crime_description") %>%
rename(O = n) %>%
{. ->> crime} ### saved as an object "crime"
                                          crime_description   cnt    M    F    X  H     O
 1                                         VEHICLE - STOLEN 20702   63   15   18 NA 20606
 2                                 BATTERY - SIMPLE ASSAULT 16293 8407 7808   75  1     2
 3  VANDALISM - FELONY ($400 & OVER, ALL CHURCH VANDALISMS) 12885 6056 4251 2571  1     6
 4                                                 BURGLARY 12793 6478 3361 2943 NA    11
 5                                    BURGLARY FROM VEHICLE 12675 7084 5317  265  4     5
 6           ASSAULT WITH DEADLY WEAPON, AGGRAVATED ASSAULT 11498 8194 3073  223  2     6
 7                       THEFT PLAIN - PETTY ($950 & UNDER) 10816 5193 4505 1113  2     3
 8                        INTIMATE PARTNER - SIMPLE ASSAULT 10814 2705 8088   20  1    NA
 9          THEFT FROM MOTOR VEHICLE - PETTY ($950 & UNDER)  9704 2992 2189  148 NA  4375
10                                        THEFT OF IDENTITY  8786 4387 4312   82  2     3
11                 VANDALISM - MISDEAMEANOR ($399 OR UNDER)  6951 3119 2738 1092 NA     2
12                                                  ROBBERY  6882 4210 1775  889  1     7
13 THEFT-GRAND ($950.01 & OVER)EXCPT,GUNS,FOWL,LIVESTK,PROD  5492 2903 1986  597 NA     6
14      THEFT FROM MOTOR VEHICLE - GRAND ($950.01 AND OVER)  4767 2856 1673  237 NA     1
15                   CRIMINAL THREATS - NO WEAPON DISPLAYED  4189 2008 2123   58 NA    NA
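As you can see, this data frame lists the top 15 crimes by the "cnt" column, and the subsequent columns break the total "cnt" down by victim sex. I named this data frame "crime" and used its 15 crime descriptions in a filter before piping into ggplot. Basically, I grouped, counted, and sorted the top 15 crimes, saved them as a data frame, and used that as the filter criterion on the original dataset "LA2020".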
LA2020 %>%
filter(crime_description %in% crime$crime_description) %>%
ggplot(aes(x = fct_infreq(crime_description), fill = victim_sex)) +
geom_bar(alpha = 0.5) +
scale_x_discrete(labels = abbreviate) +
theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
panel.background = element_rect(fill = "white")) +
labs(title = "Crime Distribution by Gender",
x = "Crime Description",
y = "Count",
fill = "Gender")
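Question
Is there a simpler way to accomplish what I just did? I know that I don't need all of those left_join calls and could just create a data frame containing only the top 15, but I wanted to show in a table what the ggplot would display. Could I do this without creating a separate data frame object, and pipe the result straight into ggplot instead?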
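Here is one option to simplify your code: first compute the counts by crime and sex, then filter for the top 15 crimes, for which I use semi_join. The resulting dataset can then be used for plotting. If you want the data in a more table-like format, you can reshape it to wide using e.g. tidyr::pivot_wider.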
Using some fake random example data:
library(ggplot2)
library(dplyr, warn.conflicts = FALSE)
set.seed(123)
LA2020 <- data.frame(
crime_description = sample(letters, 1000, replace = TRUE),
victim_sex = sample(c("M", "F"), 1000, replace = TRUE)
)
top15 <- LA2020 |>
count(crime_description, victim_sex) |>
semi_join(
# Top 15 crimes
LA2020 |>
count(crime_description, sort = TRUE) |>
head(15),
by = "crime_description"
) |>
mutate(
crime_description = reorder(crime_description, -n, FUN = sum)
)
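In case the two less common pieces of the block above are unfamiliar, here is a minimal sketch on a tiny hand-made table of how semi_join and reorder behave. The data frame counts, the column total, and the cut-off of 2 are made up purely for illustration; they are not part of the real data.

```r
library(dplyr, warn.conflicts = FALSE)

# Hypothetical toy counts: totals are A = 3, B = 5, C = 1
counts <- data.frame(
  crime = c("A", "A", "B", "B", "C"),
  sex   = c("F", "M", "F", "M", "F"),
  n     = c(1, 2, 3, 2, 1)
)

# The two crimes with the largest totals (B, then A)
top2 <- counts |>
  count(crime, wt = n, sort = TRUE, name = "total") |>
  head(2)

# semi_join() is a filtering join: it keeps the rows of `counts` whose
# crime appears in `top2`, and it adds no columns from `top2`
kept <- semi_join(counts, top2, by = "crime")

# reorder() sorts the factor levels by sum(-n) per crime, in increasing
# order, so the crime with the largest total becomes the first level
kept$crime <- reorder(kept$crime, -kept$n, FUN = sum)
levels(kept$crime)
```

Because semi_join only filters, the per-sex rows of the top crimes survive unchanged, which is what the stacked bars need.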
top15 |>
ggplot(
aes(crime_description, n, fill = victim_sex)
) +
geom_col() +
# scale_x_discrete(label = abbreviate) +
theme(
axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
panel.background = element_rect(fill = "white")
) +
labs(
title = "Crime Distribution by Gender",
x = "Crime Description",
y = "Count",
fill = "Gender"
)
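If you would rather not pre-count at all, another possible route is to lump the rare crimes with forcats and let geom_bar do the counting. This is only a sketch under assumptions: the data frame toy and the cut-off n_top are made up for illustration, and with tied counts fct_lump_n may keep more than n_top levels under its default tie handling.

```r
library(ggplot2)
library(dplyr, warn.conflicts = FALSE)
library(forcats)

# Hypothetical toy data: crime "A" occurs 5 times, "B" 3, "C" 2, "D" 1
toy <- data.frame(
  crime_description = rep(c("A", "B", "C", "D"), times = c(5, 3, 2, 1)),
  victim_sex = rep(c("M", "F"), length.out = 11)
)

n_top <- 2  # would be 15 for the real data

top_rows <- toy |>
  # Lump every crime outside the n_top most frequent into "Other" ...
  mutate(crime_description = fct_lump_n(crime_description, n = n_top)) |>
  # ... and drop the lumped rows
  filter(crime_description != "Other")

# fct_infreq() orders the bars from most to least frequent; the now empty
# "Other" level is dropped by the discrete x scale
p <- top_rows |>
  ggplot(aes(x = fct_infreq(crime_description), fill = victim_sex)) +
  geom_bar()
```

The intermediate top_rows is kept here only so it can be inspected; the same steps can be piped straight into ggplot without saving anything.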
top15 |>
add_count(crime_description, wt = n, name = "cnt") |>
tidyr::pivot_wider(names_from = victim_sex, values_from = n)
#> # A tibble: 15 × 4
#>    crime_description   cnt     F     M
#>    <fct>             <int> <int> <int>
#>  1 c                    43    25    18
#>  2 f                    36    15    21
#>  3 g                    45    23    22
#>  4 h                    51    24    27
#>  5 j                    52    26    26
#>  6 k                    39    16    23
#>  7 n                    37    10    27
#>  8 q                    37    19    18
#>  9 t                    41    18    23
#> 10 u                    37    16    21
#> 11 v                    37    17    20
#> 12 w                    48    23    25
#> 13 x                    37    19    18
#> 14 y                    49    24    25
#> 15 z                    38    21    17
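With the real data some crime and sex combinations never occur, which is where the NA cells in the table from the question come from. If zeros are preferred, pivot_wider takes a values_fill argument. A minimal sketch on a made-up counts table (the data frame counts below is hypothetical):

```r
library(dplyr, warn.conflicts = FALSE)

# Hypothetical counts: crime "B" has no female victims at all
counts <- data.frame(
  crime_description = c("A", "A", "B"),
  victim_sex = c("F", "M", "M"),
  n = c(3L, 2L, 4L)
)

wide <- counts |>
  add_count(crime_description, wt = n, name = "cnt") |>
  tidyr::pivot_wider(
    names_from = victim_sex,
    values_from = n,
    values_fill = 0  # missing crime/sex combinations become 0 instead of NA
  )
wide
```

Whether 0 or NA is the better choice depends on whether a missing combination should read as "none recorded" or "unknown".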