Two categorical variables: sort and show the top 15 in ggplot

Problem description (votes: 0, answers: 1)

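Below are my problem and my current solution. The purpose of this question is to find out whether there is a simpler way to accomplish the following process, because I feel my code is overly complicated; however, I could not find a simpler solution on Stack Overflow, other websites, or YouTube.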

The dataset I am using is the Los Angeles crime statistics, filtered to 2020 and condensed to 10 features and 198,908 observations. Here is the head() of the dataset:

LA2020 %>% 
  select(crime_description, victim_sex) %>% 
  head()
                                        crime_description victim_sex
1                                BATTERY - SIMPLE ASSAULT          F
2                                BATTERY - SIMPLE ASSAULT          M
3               SEX OFFENDER REGISTRANT OUT OF COMPLIANCE          X
4                VANDALISM - MISDEAMEANOR ($399 OR UNDER)          F
5 VANDALISM - FELONY ($400 & OVER, ALL CHURCH VANDALISMS)          X
6                                          RAPE, FORCIBLE          F

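What I want to do is group "crime_description" in ggplot with geom_bar and use the "fill" argument on the "victim_sex" column. The "crime_description" column has 129 distinct categories, so in geom_bar I only want to show the top 15 crimes, ordered from highest to lowest, left to right.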

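I have read many solutions to this problem, and most of the answers prepare the top 15 crimes first and only then go into ggplot. I was able to get it plotted; however, sorting and showing only the top 15 was the biggest hurdle for me. What I ended up doing was creating a data frame with a series of left_join() calls. See below.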

LA2020 %>% 
  count(crime_description, sort = TRUE) %>%
  rename(cnt = n) %>% 
  top_n(15, cnt) %>%
  left_join(
    (LA2020 %>% 
       filter(victim_sex == "M") %>% 
       count(crime_description)),
    by = "crime_description") %>% 
  rename(M = n) %>% 
  left_join(
    (LA2020 %>% 
       filter(victim_sex == "F") %>% 
       count(crime_description)),
    by = "crime_description") %>% 
  rename(F = n) %>% 
  left_join(
    (LA2020 %>% 
       filter(victim_sex == "X") %>% 
       count(crime_description)),
    by = "crime_description") %>% 
  rename(X = n) %>% 
  left_join(
    (LA2020 %>% 
       filter(victim_sex == "H") %>% 
       count(crime_description)),
    by = "crime_description") %>% 
  rename(H = n) %>% 
  left_join(
    (LA2020 %>% 
       filter(victim_sex == "O") %>% 
       count(crime_description)),
    by = "crime_description") %>% 
  rename(O = n) %>% 
  {. ->> crime} ### saved as an object "crime"
                                          crime_description   cnt    M    F    X  H     O
1                                          VEHICLE - STOLEN 20702   63   15   18 NA 20606
2                                  BATTERY - SIMPLE ASSAULT 16293 8407 7808   75  1     2
3   VANDALISM - FELONY ($400 & OVER, ALL CHURCH VANDALISMS) 12885 6056 4251 2571  1     6
4                                                  BURGLARY 12793 6478 3361 2943 NA    11
5                                     BURGLARY FROM VEHICLE 12675 7084 5317  265  4     5
6            ASSAULT WITH DEADLY WEAPON, AGGRAVATED ASSAULT 11498 8194 3073  223  2     6
7                        THEFT PLAIN - PETTY ($950 & UNDER) 10816 5193 4505 1113  2     3
8                         INTIMATE PARTNER - SIMPLE ASSAULT 10814 2705 8088   20  1    NA
9           THEFT FROM MOTOR VEHICLE - PETTY ($950 & UNDER)  9704 2992 2189  148 NA  4375
10                                        THEFT OF IDENTITY  8786 4387 4312   82  2     3
11                 VANDALISM - MISDEAMEANOR ($399 OR UNDER)  6951 3119 2738 1092 NA     2
12                                                  ROBBERY  6882 4210 1775  889  1     7
13 THEFT-GRAND ($950.01 & OVER)EXCPT,GUNS,FOWL,LIVESTK,PROD  5492 2903 1986  597 NA     6
14      THEFT FROM MOTOR VEHICLE - GRAND ($950.01 AND OVER)  4767 2856 1673  237 NA     1
15                   CRIMINAL THREATS - NO WEAPON DISPLAYED  4189 2008 2123   58 NA    NA

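As you can see, this data frame lists the top 15 crimes according to the "cnt" column, and the subsequent columns show how the total "cnt" breaks down by victim sex. I named this data frame "crime" and used its 15 crime descriptions in a filter before passing the data into ggplot. Essentially, I grouped, counted, and sorted the top 15 crimes, saved the result as a data frame, and used it as the filter criterion for the original dataset "LA2020":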

LA2020 %>% 
  filter(crime_description %in% crime$crime_description) %>% 
  ggplot(aes(x = fct_infreq(crime_description), fill = victim_sex)) +
  geom_bar(alpha = 0.5) +
  scale_x_discrete(labels = abbreviate) +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        panel.background = element_rect(fill = "white")) +
  labs(title = "Crime Distribution by Gender",
       x = "Crime Description",
       y = "Count",
       fill = "Gender")

[Figure: Crime Distribution by Gender, LA2020]
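
The filter step above only needs the names of the top 15 crimes, not the full wide table. As a minimal sketch (not the asker's code), the names could be collected with count() and slice_max(), which is the current dplyr replacement for top_n(). The small data frame below is a made-up stand-in for the real LA2020 data, and the object names top_crimes and LA2020_top are hypothetical.

```r
library(dplyr)

set.seed(1)

# Made-up stand-in for the real LA2020 data (assumption: same column names)
LA2020 <- data.frame(
  crime_description = sample(LETTERS, 2000, replace = TRUE),
  victim_sex = sample(c("M", "F", "X"), 2000, replace = TRUE)
)

# Names of the 15 most frequent crimes; with_ties = FALSE caps the result at 15
top_crimes <- LA2020 %>%
  count(crime_description, name = "cnt") %>%
  slice_max(cnt, n = 15, with_ties = FALSE) %>%
  pull(crime_description)

# Keep only the rows that belong to those crimes
LA2020_top <- LA2020 %>%
  filter(crime_description %in% top_crimes)
```

Setting with_ties = FALSE is a design choice here: the default keeps tied rows, which can return more than 15 categories.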

Question

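Is there an easier way to accomplish what I just did? I know I do not need all of those left_join() calls and could simply create a data frame containing only the top 15, but I wanted to show in a table what ggplot would display. Can I do this without creating a separate data frame object and piping it into ggplot?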


r sorting ggplot2 greatest-n-per-group geom-bar
1 answer
0 votes

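Here is one option to simplify your code: first compute the counts by crime and sex, then filter for the top 15 crimes using semi_join. The resulting dataset can then be used for plotting. If you need the data in a more table-like format, you can reshape it to wide using e.g. tidyr::pivot_wider.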
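
To make the role of semi_join concrete, here is a small sketch on made-up data (the data frames counts and keep and their values are purely illustrative). semi_join keeps the rows of the first table that have a match in the second, without adding any of the second table's columns.

```r
library(dplyr)

# Illustrative counts per crime and sex
counts <- data.frame(
  crime = c("A", "A", "B", "B", "C"),
  sex   = c("F", "M", "F", "M", "M"),
  n     = c(5, 7, 2, 1, 9)
)

# Illustrative lookup table of the crimes to keep
keep <- data.frame(crime = c("A", "C"), total = c(12, 9))

# Rows for crimes A and C survive; the `total` column is not carried over
kept <- semi_join(counts, keep, by = "crime")
kept
```

This is why no rename() or column clean-up is needed after the join, unlike with left_join.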

Using some fake random example data:

library(ggplot2)
library(dplyr, warn.conflicts = FALSE)

set.seed(123)

LA2020 <- data.frame(
  crime_description = sample(letters, 1000, replace = TRUE),
  victim_sex = sample(c("M", "F"), 1000, replace = TRUE)
)

top15 <- LA2020 |>
  count(crime_description, victim_sex) |>
  semi_join(
    # Top 15 crimes
    LA2020 |>
      count(crime_description, sort = TRUE) |>
      head(15),
    by = "crime_description"
  ) |>
  mutate(
    crime_description = reorder(crime_description, -n, FUN = sum)
  ) 

top15 |>
  ggplot(
    aes(crime_description, n, fill = victim_sex)
  ) +
  geom_col() +
  # scale_x_discrete(label = abbreviate) +
  theme(
    axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1),
    panel.grid.major = element_blank(),
    panel.grid.minor = element_blank(),
    panel.background = element_rect(fill = "white")
  ) +
  labs(
    title = "Crime Distribution by Gender",
    x = "Crime Description",
    y = "Count",
    fill = "Gender"
  )
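
The ordering of the bars comes from the reorder() call in the mutate() step. A tiny sketch with made-up numbers shows what it does: the factor levels are sorted by the sum of -n per crime, so the crime with the largest total comes first.

```r
# Made-up counts: totals are a = 3, b = 20, c = 9
d <- data.frame(
  crime = c("a", "a", "b", "b", "c", "c"),
  sex   = c("F", "M", "F", "M", "F", "M"),
  n     = c(1, 2, 10, 10, 4, 5)
)

# Sort levels by the summed negative count, i.e. largest total first
d$crime <- reorder(d$crime, -d$n, FUN = sum)
levels(d$crime)
#> [1] "b" "c" "a"
```

Using FUN = sum matters because each crime appears once per sex; the default, mean, would rank by the average per-sex count instead of the bar's total height.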


top15 |> 
  add_count(crime_description, wt = n, name = "cnt") |> 
  tidyr::pivot_wider(names_from = victim_sex, values_from = n)
#> # A tibble: 15 × 4
#>    crime_description   cnt     F     M
#>    <fct>             <int> <int> <int>
#>  1 c                    43    25    18
#>  2 f                    36    15    21
#>  3 g                    45    23    22
#>  4 h                    51    24    27
#>  5 j                    52    26    26
#>  6 k                    39    16    23
#>  7 n                    37    10    27
#>  8 q                    37    19    18
#>  9 t                    41    18    23
#> 10 u                    37    16    21
#> 11 v                    37    17    20
#> 12 w                    48    23    25
#> 13 x                    37    19    18
#> 14 y                    49    24    25
#> 15 z                    38    21    17
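
Regarding the last part of the question (plotting without a separate data frame object), one possible sketch is to do the lumping and ordering inside a single pipeline with forcats and pipe straight into ggplot. This is an alternative technique, not part of the code above; it assumes the forcats package is installed, that no real category is already named "Other", and it reuses the same fake data.

```r
library(ggplot2)
library(dplyr, warn.conflicts = FALSE)
library(forcats)

set.seed(123)

# Same fake example data as above
LA2020 <- data.frame(
  crime_description = sample(letters, 1000, replace = TRUE),
  victim_sex = sample(c("M", "F"), 1000, replace = TRUE)
)

p <- LA2020 |>
  # Collapse everything outside the 15 most frequent crimes into "Other"
  mutate(
    crime_description = fct_lump_n(crime_description, n = 15, ties.method = "first")
  ) |>
  # Drop the lumped rows, then order the remaining levels by frequency
  filter(crime_description != "Other") |>
  mutate(crime_description = fct_infreq(fct_drop(crime_description))) |>
  ggplot(aes(crime_description, fill = victim_sex)) +
  geom_bar()

p
```

The ties.method = "first" argument is used so that exactly 15 categories are kept; the default keeps all tied categories, which can yield more than 15 bars.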