How do I do a reverse anti_join?


I have a variable that looks like this:

variable

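I want the victims' nationality to "pop out", so that "Ukrainian state entities" shows up simply as "Ukraine".

So I made a dataframe that maps each demonym to its country name. I have no experience in text mining (or, honestly, in R), so I took things I saw in class and tried to put them together.

Here is my reasoning:

  1. Split "Victims" into separate words: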
d_tokenized = state_cyberattacks_csv %>%
  filter(Category == 'Government')%>%
  select(Date, Sponsor, Victims) %>%
  unnest_tokens(word, Victims)
  2. Remove the words that do not appear in the "Demonym" column of the demonym dataframe:
d_tokenized_s = d_tokenized %>%
  anti_join(demonym_list, by != "Demonym")

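I know it doesn't work because of the "!=", which makes no sense there. I tried to find other ways, using join, str_extract, str_subset, etc., but honestly I don't understand what they do.

Which function should I use?

There is also a problem with entries that contain a country name directly rather than a demonym: if I do find a way to use something like anti_join to remove everything that doesn't match "Demonym", those entries will be removed too.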

r text-mining

1 Answer

library(tidyverse)

df <- structure(list(Victims = c(
  "Ukrainian state entities",
  "Russian and Belarusian websites were targeted, including th...",
  "Belgian Federal Public Service Interior",
  "Ukrainian government agencies",
  "Government agencies of EU member states",
  "Two research institutes run by Rostec",
  "Albanian government networks",
  "Cryptocurrency applications",
  "VMware Horizon servers",
  "Cryptocurrency company employees",
  "Individual suspects within Canadian police investigations."
)), class = "data.frame", row.names = c(NA, -11L))

If only "Ukrainian state entities" needs to be replaced:

df |> mutate(Victims = str_replace(Victims, "Ukrainian state entities", "Ukraine"))
#>                                                           Victims
#> 1                                                         Ukraine
#> 2  Russian and Belarusian websites were targeted, including th...
#> 3                         Belgian Federal Public Service Interior
#> 4                                   Ukrainian government agencies
#> 5                         Government agencies of EU member states
#> 6                           Two research institutes run by Rostec
#> 7                                    Albanian government networks
#> 8                                     Cryptocurrency applications
#> 9                                          VMware Horizon servers
#> 10                               Cryptocurrency company employees
#> 11     Individual suspects within Canadian police investigations.

If every entry containing "Ukrainian" needs to be replaced:

df |> mutate(Victims = case_when(
  str_detect(Victims, "Ukrainian") ~ "Ukraine",
  TRUE ~ Victims)
)
#>                                                           Victims
#> 1                                                         Ukraine
#> 2  Russian and Belarusian websites were targeted, including th...
#> 3                         Belgian Federal Public Service Interior
#> 4                                                         Ukraine
#> 5                         Government agencies of EU member states
#> 6                           Two research institutes run by Rostec
#> 7                                    Albanian government networks
#> 8                                     Cryptocurrency applications
#> 9                                          VMware Horizon servers
#> 10                               Cryptocurrency company employees
#> 11     Individual suspects within Canadian police investigations.
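
For the join the question actually asks about, the counterpart of anti_join() in dplyr is semi_join(), which keeps only the rows that have a match. The sketch below is not from the original answer. It assumes a lookup table shaped like the question's demonym_list; the real table is not shown, so the Country column name and the example rows here are hypothetical. It also uses tidytext for unnest_tokens(), as the question does.

```r
library(dplyr)
library(tidytext)

# Hypothetical lookup table: the question's real `demonym_list` is not shown,
# so the `Country` column name and these rows are assumptions.
demonym_list <- tibble(
  Demonym = c("Ukrainian", "Belgian", "Albanian", "Canadian"),
  Country = c("Ukraine", "Belgium", "Albania", "Canada")
)

# Small stand-in for the Victims column, with an id to track the source row.
victims <- tibble(
  id = 1:4,
  Victims = c(
    "Ukrainian state entities",
    "Belgian Federal Public Service Interior",
    "Cryptocurrency applications",
    "Individual suspects within Canadian police investigations."
  )
)

# One row per word. unnest_tokens() lowercases by default, so
# to_lower = FALSE is needed for "Ukrainian" to match the Demonym column.
tokens <- victims |>
  unnest_tokens(word, Victims, to_lower = FALSE)

# semi_join(): keep only the words that appear in demonym_list$Demonym.
kept <- tokens |>
  semi_join(demonym_list, by = c("word" = "Demonym"))

# inner_join(): also keeps only the matches, and brings Country along.
countries <- tokens |>
  inner_join(demonym_list, by = c("word" = "Demonym")) |>
  select(id, Country)

print(countries)
```

Entries that already contain a country name could probably be handled the same way by adding rows to the lookup table in which the Demonym value is the country name itself (for example "Ukraine" mapped to "Ukraine"), so the join keeps them instead of dropping them.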

Created on 2023-04-21 with reprex v2.0.2
