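I'm trying to use the ONS baby names dataset (freely available) to replace names in a free-text response column with "Z". However, some names are also potentially important words or acronyms that I don't want removed, such as "My", "He" and "Ta". I have excluded these words from my list of names, and I'm using a regex with word boundaries to try to replace only the names I want, but for some reason it still keeps replacing "Ta", which I don't want! Is "Ta" itself some kind of regex pattern? Does anyone know why it's doing this, or how to fix it? Any help is much appreciated! Regex is not my strong point.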
# Download ONS baby names data (1996-2021) and save in the working folder
# Data source: https://www.ons.gov.uk/peoplepopulationandcommunity/birthsdeathsandmarriages/livebirths/datasets/babynamesinenglandandwalesfrom1996
filepath <- "insert your file path here"
library(tidyverse)
library(readxl)
library(textclean)
library(janitor)
#library(qdap)
#---Remove first names-----#####
# Read in ONS baby names data (1996-2021) and create a list of names
excel_sheets(paste0(filepath, "babynames1996to2021.xlsx"))
boynames <- read_excel(paste0(filepath, "babynames1996to2021.xlsx"),
                       "1", skip = 7) %>%
  select(Name)
girlnames <- read_excel(paste0(filepath, "babynames1996to2021.xlsx"),
                        "2", skip = 7) %>%
  select(Name)
#remove names which we don't want to replace from the text
firstnames <- bind_rows(boynames, girlnames) %>%
  mutate(no_char = nchar(Name)) %>%
  filter(no_char > 1) %>%     # removes single-letter names
  select(word = Name) %>%
  filter(word != "My") %>%    # removes the name "My"
  filter(word != "He") %>%    # removes the name "He"
  filter(word != "The") %>%   # removes the name "The"
  filter(word != "His") %>%   # removes the name "His"
  filter(word != "A") %>%     # removes the name "A"
  filter(word != "Now") %>%   # removes the name "Now"
  filter(word != "To") %>%    # removes the name "To"
  filter(word != "Ta")        # removes the name "Ta"
#use \\b to set word boundaries to find exact match of entire name
firstnames$word2 <- paste0("\\b",firstnames$word,"\\b")
#test text
text <- "Some text with Zoha, Zohal, and Zuzia in it."
text2 <- "Some text with A-Jay, A.J. and Aaban in it!"
text3 <- "Some text with Ta, My, and He in it"
#text as a column in a tibble (akin to our real data)
test <- tibble(comment=c(text,text2,text3))
# replace every name pattern in turn
for (i in seq_along(firstnames$word2)) {
  test$comment <- gsub(firstnames$word2[i], "Z", test$comment)
}
test
# this replaces "Ta" with "Z", which it shouldn't!
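Short answer: add
, fixed = TRUE
to the end of the gsub() call.
Long answer: some of your names contain punctuation, which is interpreted as regex syntax. For example, look at
firstnames$word[14757]
which is
T.
In a regex the "." matches any single character, so the pattern \\bT.\\b matches "Ta" (unless you use fixed = TRUE). Note that fixed = TRUE also treats the \\b markers as literal text, so it only works if the word-boundary wrappers are left out of the pattern.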
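If you want to keep the word boundaries, one alternative is to escape the regex metacharacters in each name before wrapping it in \\b. The following is a minimal sketch rather than a tested fix for the full ONS list: escape_regex() is a hypothetical helper (not from any package), names_demo is a small stand-in for firstnames$word, and perl = TRUE is an assumption made so the escaping behaves predictably.

```r
# Hypothetical helper (not part of the original code): backslash-escape
# regex metacharacters so that each name is matched literally.
escape_regex <- function(x) {
  gsub("([.\\\\|()\\[\\]{}^$*+?])", "\\\\\\1", x, perl = TRUE)
}

# Small stand-in for firstnames$word; "T." is the entry that caused the problem.
names_demo <- c("T.", "A-Jay", "Aaban")

# Unescaped, "." is a wildcard, so the pattern for "T." also matches "Ta".
grepl("\\bT.\\b", "Ta", perl = TRUE)

# Escape first, then add the word boundaries.
patterns <- paste0("\\b", escape_regex(names_demo), "\\b")

comments <- c("Some text with A-Jay, A.J. and Aaban in it!",
              "Some text with Ta, My, and He in it")

for (p in patterns) {
  comments <- gsub(p, "Z", comments, perl = TRUE)
}
comments
```

With the dot escaped, the second comment should come back unchanged. One thing to watch: \\b only matches between a word character and a non-word character, so a name that ends in punctuation (such as "T." followed by a space) will not match with a trailing \\b and may need separate handling.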