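I have a corpus of 233 documents (ecb_corpus) and a list of multi-word expressions (ecb_final). I would like to replace every occurrence of these expressions in my corpus with the corresponding underscore-joined entry from the list.
Here is my multi-word list: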
1 euro_area
2 monetary_policy
3 price_stability
4 interest_rates
5 second_question
6 medium_term
7 first_question
8 central_banks
9 inflation_expectations
10 structural_reforms
So far I have only managed to do this for a single case, using gsub:
ecb_ready <- gsub(pattern = "interest rate", replacement= "interest_rates", ecb_corpus, ignore.case = TRUE, perl = FALSE, fixed = TRUE)
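To get the result I want, pattern should cover every such expression as it occurs in the corpus (ecb_corpus), and replacement should be the matching entry of my multi-word list (ecb_final). I have been trying to write a loop and failing completely (I am fairly new to R and unfortunately cannot get it to work yet).
Can anyone help me with the loop? Thanks a lot!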
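stringr::str_replace_all() can do this directly. That is what the help file is trying to say with its terse "Vectorised over string, pattern and replacement". Here I assume your corpus is stored in a character vector, but it could also be a list of character strings. If it is something more complex (e.g. JSON ...), you may need to do some preprocessing before feeding it into str_replace_all().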
Note that the result drops the names of the input elements, but they are easy to restore.
library(tidyverse)
(ecb_corpus <- c(
  doc_1 = "lorem ipsum interest rate gobbledygook",
  doc_2 = "lorem dolor central bank foobar"
))
#> doc_1
#> "lorem ipsum interest rate gobbledygook"
#> doc_2
#> "lorem dolor central bank foobar"
replacements <- c(
  "euro_area",
  "monetary_policy",
  "price_stability",
  "interest_rates",
  "second_question",
  "medium_term",
  "first_question",
  "central_banks",
  "inflation_expectations",
  "structural_reforms"
)
# Text to search for: underscores become spaces and a trailing "s" is dropped
targets <- replacements %>% str_replace_all("_", " ") %>% str_remove("s$")
# Named vector: the names are the patterns, the values are the replacements
(replacement_pairs <- replacements %>% set_names(targets))
#> euro area monetary policy price stability
#> "euro_area" "monetary_policy" "price_stability"
#> interest rate second question medium term
#> "interest_rates" "second_question" "medium_term"
#> first question central bank inflation expectation
#> "first_question" "central_banks" "inflation_expectations"
#> structural reform
#> "structural_reforms"
(ecb_ready <- ecb_corpus %>% str_replace_all(replacement_pairs))
#> [1] "lorem ipsum interest_rates gobbledygook"
#> [2] "lorem dolor central_banks foobar"
Created on 2019-09-28 by the reprex package (v0.3.0)
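Two follow-ups to the approach above, as a minimal sketch rather than a definitive solution. First, the singular targets also match inside text that is already plural, so "interest rates" would become "interest_ratess"; wrapping each target in word boundaries with an optional trailing "s" is one way around that. Second, the names can be copied back from the input after the replacement. The toy corpus (including doc_3), the shortened replacement list and the variable names stems and patterns are assumptions made for illustration, and the case-insensitive flag (?i) is a guess at what the ignore.case = TRUE in the question was meant to achieve.

```r
library(stringr)

# Toy corpus (hypothetical); doc_3 already contains plural, capitalised forms
ecb_corpus <- c(
  doc_1 = "lorem ipsum interest rate gobbledygook",
  doc_2 = "lorem dolor central bank foobar",
  doc_3 = "Interest rates and central banks"
)

# Shortened version of the multi-word list, for illustration only
replacements <- c("interest_rates", "central_banks", "price_stability")

# Singular phrase with spaces, e.g. "interest rate"
stems <- str_remove(str_replace_all(replacements, "_", " "), "s$")

# One regex per phrase: case-insensitive, bounded by word boundaries,
# with an optional trailing "s" so singular and plural both match
patterns <- str_c("(?i)\\b", stems, "s?\\b")

# Named vector: names are the patterns, values are the replacements
replacement_pairs <- setNames(replacements, patterns)

ecb_ready <- str_replace_all(ecb_corpus, replacement_pairs)

# Copy the document names back from the input
names(ecb_ready) <- names(ecb_corpus)

ecb_ready
```

If your version of stringr already keeps the names, the final assignment simply leaves them unchanged. Note that the phrases are treated as regular expressions here, so any entry containing regex metacharacters would need escaping first.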