Replacing elements from a multi-word list in a corpus with gsub()

Question

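I have a corpus of 233 documents (ecb_corpus) and a multi-word list (ecb_final). I want to replace every bigram and trigram in my corpus with its underscore-joined form from the multi-word list.

Here is my multi-word list: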

1   euro_area
2   monetary_policy
3   price_stability
4   interest_rates
5   second_question
6   medium_term
7   first_question
8   central_banks
9   inflation_expectations
10  structural_reforms

I have only managed to do it for a single case, using gsub:

ecb_ready <- gsub(pattern = "interest rate", replacement= "interest_rates", ecb_corpus, ignore.case = TRUE, perl = FALSE, fixed = TRUE)

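To get the result I want, pattern should cover any of the matching phrases in the corpus (ecb_corpus), and replacement should be the corresponding entry of my multi-word list (ecb_final). I have been trying to write a loop, with no success at all (I am fairly new to R and unfortunately cannot get it to run yet).

Can anyone help me with the loop?

Thank you very much!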

r loops gsub
1 Answer
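stringr::str_replace_all() can do this directly. This is what the help file is trying to communicate with the terse "Vectorised over string, pattern and replacement". In particular, when pattern is a named character vector, each name is treated as a pattern and its value as the replacement, and every pair is applied to every string.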
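The difference between the two calling styles is easiest to see on a toy input. This is a minimal sketch; the strings and the names `pairwise` and `all_pairs` are made up for illustration and are not part of the original answer.

```r
library(stringr)

# Pairwise: the i-th pattern/replacement is applied only to the i-th string
pairwise <- str_replace_all(c("a b", "a b"), c("a", "b"), c("x", "y"))
# "x b" "a y"

# Named vector: every name -> value pair is applied to every string
all_pairs <- str_replace_all(c("a b", "a b"), c(a = "x", b = "y"))
# "x y" "x y"
```

For a multi-word list, the named-vector form is the one that applies here, because each document may contain any of the phrases.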

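Here I assume your corpus is stored in a character vector, but it could also be a list of character strings. If it is something more complicated (e.g. stored as JSON ...), you may need to do some preprocessing before feeding it to str_replace_all().

Note that the result drops the names of the input elements, but it is easy to restore them.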

library(tidyverse)

(ecb_corpus <- c(
  doc_1 = c("lorem ipsum interest rate gobbledygook"),
  doc_2 = c("lorem dolor central bank foobar")
))
#>                                    doc_1
#> "lorem ipsum interest rate gobbledygook"
#>                                    doc_2
#>        "lorem dolor central bank foobar"

replacements <- c("euro_area", "monetary_policy", "price_stability",
                  "interest_rates", "second_question", "medium_term",
                  "first_question", "central_banks", "inflation_expectations",
                  "structural_reforms")

targets <- replacements %>%
  str_replace_all("_", " ") %>%
  str_remove("s$")

(replacement_pairs <- replacements %>%
  set_names(targets))
#>                euro area          monetary policy          price stability
#>              "euro_area"        "monetary_policy"        "price_stability"
#>            interest rate          second question              medium term
#>         "interest_rates"        "second_question"            "medium_term"
#>           first question             central bank    inflation expectation
#>         "first_question"          "central_banks" "inflation_expectations"
#>        structural reform
#>     "structural_reforms"

(ecb_ready <- ecb_corpus %>%
  str_replace_all(replacement_pairs))
#> [1] "lorem ipsum interest_rates gobbledygook"
#> [2] "lorem dolor central_banks foobar"
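Restoring the document names, as mentioned above, can be sketched as follows. The two-document corpus and the shortened `replacement_pairs` are toy stand-ins for the real data, and this is only one possible way to do it. Newer stringr releases may already keep the names, in which case the reassignment is simply a no-op.

```r
library(stringr)

# Toy stand-in for the real corpus (assumed to be a named character vector)
ecb_corpus <- c(
  doc_1 = "lorem ipsum interest rate gobbledygook",
  doc_2 = "lorem dolor central bank foobar"
)

# Shortened, illustrative version of the replacement pairs
replacement_pairs <- c(
  "interest rate" = "interest_rates",
  "central bank"  = "central_banks"
)

ecb_ready <- str_replace_all(ecb_corpus, replacement_pairs)

# Copy the document names back from the input
names(ecb_ready) <- names(ecb_corpus)
```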

Created on 2019-09-28 by the reprex package (v0.3.0)
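For comparison, the loop the question asks for can also be written in base R with gsub(). The following is only a sketch under some assumptions: the corpus is a character vector, the search phrase for each entry is derived by turning underscores into spaces and dropping a trailing "s" (the same rule as in the answer above), and the shortened `ecb_final` and the corpus contents are made up for illustration.

```r
# Toy stand-ins for the real data (illustrative only)
ecb_corpus <- c(
  doc_1 = "lorem ipsum interest rate gobbledygook",
  doc_2 = "lorem dolor Central Banks foobar"
)
ecb_final <- c("euro_area", "interest_rates", "central_banks")

# Search phrases: underscores become spaces, a trailing plural "s" is dropped
targets <- sub("s$", "", gsub("_", " ", ecb_final, fixed = TRUE))

# Regex per phrase: whole words only, optional plural "s"
patterns <- paste0("\\b", targets, "s?\\b")

ecb_ready <- ecb_corpus
for (i in seq_along(ecb_final)) {
  # Replace every match of the i-th phrase with the i-th multi-word token
  ecb_ready <- gsub(patterns[i], ecb_final[i], ecb_ready,
                    ignore.case = TRUE, perl = TRUE)
}
```

Note that this sketch does not use fixed = TRUE. In gsub(), ignore.case = TRUE is ignored (with a warning) when fixed = TRUE is set, so the single-case call in the question would only have matched the exact lowercase phrase.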
