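I'm using sparklyr and have a Spark data frame with a column named word that contains words, some of which include special characters I want to remove. I was successful using regexp_replace with \\\\ before the special character, like this: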
words.sdf <- words.sdf %>%
  mutate(word = regexp_replace(word, '\\\\(', '')) %>%
  mutate(word = regexp_replace(word, '\\\\)', '')) %>%
  mutate(word = regexp_replace(word, '\\\\+', '')) %>%
  mutate(word = regexp_replace(word, '\\\\?', '')) %>%
  mutate(word = regexp_replace(word, '\\\\:', '')) %>%
  mutate(word = regexp_replace(word, '\\\\;', '')) %>%
  mutate(word = regexp_replace(word, '\\\\!', ''))
Now I'm trying to remove the backslash (\). I've tried both:
words.sdf <- words.sdf %>%
  mutate(word = regexp_replace(word, '\\\\\', ''))
and:
words.sdf <- words.sdf %>%
  mutate(word = regexp_replace(word, '\', ''))
but neither works...
You have to account for escaping on both the R side and the Java side, so what you need is "\\\\\\\\":
df <- copy_to(sc, tibble(word = "(abc\\zyx: 1)"))
df %>% mutate(regexp_replace(word, "\\\\\\\\", ""))
# Source:   lazy query [?? x 2]
# Database: spark_shell_connection
  word              `regexp_replace(word, "\\\\\\\\\\\\\\\\", "")`
  <chr>             <chr>
1 "(abc\\zyx: 1)"   (abczyx: 1)
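To see why eight backslashes end up matching one, it can help to count what survives each layer. The sketch below uses only base R, so it approximates the Spark side rather than reproducing it. It assumes Spark's SQL parser unescapes string literals (its default behaviour, as far as I know) and uses gsub with perl = TRUE as a stand-in for the Java regex engine. The names r_literal, after_sql, input and result are made up for illustration.

```r
# What you type in R source: eight backslashes between the quotes.
r_literal <- "\\\\\\\\"

# Layer 1: R's parser turns each \\ pair into one backslash, leaving four.
nchar(r_literal)   # 4

# Layer 2 (assumption): Spark's SQL parser unescapes the literal once more,
# halving four backslashes to two. Simulated with a fixed-string replace.
after_sql <- gsub("\\\\", "\\", r_literal, fixed = TRUE)
nchar(after_sql)   # 2

# Layer 3: the regex engine reads the remaining \\ as one literal backslash.
input  <- "(abc\\zyx: 1)"   # the characters (abc\zyx: 1)
result <- gsub(after_sql, "", input, perl = TRUE)
result             # "(abczyx: 1)"
```

The same counting explains the patterns in the question: '\\\\(' reaches the regex engine as \(, a literal parenthesis.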
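Depending on your exact requirements, it may be easier to match all the characters at once. For example, you can keep only word characters (\w) and whitespace (\s):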
# No + inside the brackets: there it would be a literal plus sign and would be kept.
df %>% mutate(regexp_replace(word, "[^\\\\w\\\\s]", ""))
# Source:   lazy query [?? x 2]
# Database: spark_shell_connection
  word              `regexp_replace(word, "[^\\\\\\\\w\\\\\\\\s]", "")`
  <chr>             <chr>
1 "(abc\\zyx: 1)"   abczyx 1
or only word characters:
df %>% mutate(regexp_replace(word, "[^\\\\w]", ""))
# Source:   lazy query [?? x 2]
# Database: spark_shell_connection
  word              `regexp_replace(word, "[^\\\\\\\\w]", "")`
  <chr>             <chr>
1 "(abc\\zyx: 1)"   abczyx1
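If you only want to drop the specific characters from the question, rather than everything outside \w and \s, a single character class could replace the whole chain of mutate calls. This is a sketch under assumptions: the character list is taken from the question, escape_for_spark is a hypothetical helper (not part of sparklyr) that doubles each backslash to account for Spark's extra unescaping step, and base R's gsub with perl = TRUE stands in for Spark's regex engine.

```r
# The regex as the engine should see it. Inside [...] the characters
# ( ) + ? : ; ! are literal; only the backslash needs escaping.
regex <- "[()+?:;!\\\\]"        # the characters [()+?:;!\\]

# Check the regex locally with base R (PCRE as a stand-in for Java regex).
input   <- "(abc\\zyx: 1)?!"    # the characters (abc\zyx: 1)?!
cleaned <- gsub(regex, "", input, perl = TRUE)
cleaned                         # "abczyx 1"

# Hypothetical helper: double every backslash so the pattern survives one
# more round of string-literal unescaping (assumed to happen in Spark SQL).
escape_for_spark <- function(pattern) {
  gsub("\\", "\\\\", pattern, fixed = TRUE)
}

spark_pattern <- escape_for_spark(regex)
cat(spark_pattern, "\n")        # prints [()+?:;!\\\\]
```

Typed directly as an R literal inside mutate, that pattern would be "[()+?:;!\\\\\\\\]". It follows the same doubling rule as the answer above, but it is worth verifying against your own Spark connection.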