Tokenizing with tidytext while keeping ampersands

Question

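I'm currently using the unnest_tokens() function from the tidytext package. It works exactly as I need, except that it removes ampersands (&) from the text. I'd like it to keep them while leaving everything else unchanged.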

For example:

library(tidyverse)
library(tidytext)

d <- tibble(txt = "Let's go to the Q&A about B&B, it's great!")
d %>% unnest_tokens(word, txt, token="words")

currently returns

# A tibble: 11 x 1
   word 
   <chr>
 1 let's
 2 go   
 3 to   
 4 the  
 5 q    
 6 a    
 7 about
 8 b    
 9 b    
10 it's 
11 great

but I would like it to return

# A tibble: 9 x 1
  word 
  <chr>
1 let's
2 go   
3 to   
4 the  
5 q&a       
6 about
7 b&b
8 it's
9 great

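Is there an option I can pass to unnest_tokens() to do this, or can I supply the regular expression it currently uses, adjusted by hand so that it does not split on the ampersand?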

Answer 1

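We can set token = "regex", which splits the text on a regular expression (whitespace by default) rather than on word boundaries: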

library(tidytext)
library(dplyr)
d %>% 
   unnest_tokens(word, txt, token="regex")
# A tibble: 9 x 1
#  word
#  <chr>
#1 let's
#2 go
#3 to
#4 the
#5 q&a
#6 about
#7 b&b,
#8 it's
#9 great!
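
Because the default pattern for token = "regex" splits on whitespace only, punctuation next to a word stays attached to that token. Here is a minimal sketch of one way to get closer to the output the question asks for, assuming the text is ASCII: pass a custom pattern that treats any run of characters other than letters, digits, apostrophes, and ampersands as a separator. The character class is an assumption made for this example; it is not the rule that token = "words" uses internally, so other inputs (hyphenated words, accented letters, numbers with decimal points) may tokenize differently.

```r
library(dplyr)
library(tidytext)

d <- tibble(txt = "Let's go to the Q&A about B&B, it's great!")

# Split on every run of characters that is NOT an ASCII letter, a digit,
# an apostrophe, or an ampersand. This pattern is an illustrative assumption.
tokens <- d %>%
  unnest_tokens(word, txt, token = "regex", pattern = "[^A-Za-z0-9'&]+")

tokens
```

With this pattern, "q&a" and "b&b" should each stay a single token, and the trailing comma and exclamation mark should be dropped. Extend the character class if other symbols need to survive.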
