Why does stringr::str_extract always return NA for a particular character vector?


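I have been trying to use str_extract to pull dates out of data I scraped from the World Trade Organization website. The problem is that it always returns NA on the scraped vector. When I type the same strings in by hand, however, the function works. Any ideas about what is going on?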

> country_comparison$status[1:10]
 [1] "Settled or terminated (withdrawn, mutually agreed solution) on 29 March 1995" "Implementation notified by respondent on 25 September 1997"                  
 [3] "In consultations on 4 April 1995"                                             "Implementation notified by respondent on 25 September 1997"                  
 [5] "Settled or terminated (withdrawn, mutually agreed solution) on 20 July 1995"  "Settled or terminated (withdrawn, mutually agreed solution) on 19 July 1995" 
 [7] "Settled or terminated (withdrawn, mutually agreed solution) on 5 July 1996"   "Mutually acceptable solution on implementation notified on 9 January 1998"   
 [9] "Panel established, but not yet composed on 11 October 1995"                   "Mutually acceptable solution on implementation notified on 9 January 1998"   

> country_comparison$status[1:10] %>% str_extract(pattern = "[0-9]{1,2} [A-Za-z]+ [0-9]{4}")
 [1] NA NA NA NA NA NA NA NA NA NA

> c("Settled or terminated (withdrawn, mutually agreed solution) on 29 March 1995", "Implementation notified by respondent on 25 September 1997") %>% str_extract(pattern = "[0-9]{1,2} [A-Za-z]+ [0-9]{4}")
[1] "29 March 1995"     "25 September 1997"
1 Answer

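This is a bit of a guess, but if these strings were scraped from www.wto.org, and the first one comes from https://www.wto.org/english/tratop_e/dispu_e/cases_e/ds1_e.htm, then depending on how they were collected they may contain non-breaking spaces (U+00A0) where you expect ordinary spaces.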
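One way to check this guess is to compare a scraped value with a hand-typed copy and look for non-ASCII code points. The sketch below is only an illustration: `scraped` and `typed` are hypothetical stand-ins, and the `\u00A0` characters are assumed rather than taken from the actual data.

```r
library(stringr)

# Hypothetical stand-in for one scraped value; \u00A0 is a non-breaking space.
scraped <- "In consultations on 4\u00A0April\u00A01995"
# The same text typed by hand, with ordinary spaces.
typed   <- "In consultations on 4 April 1995"

# The two strings print alike but are not the same.
identical(scraped, typed)
#> [1] FALSE

# Count the non-breaking spaces in each string.
str_count(c(scraped, typed), "\u00A0")
#> [1] 2 0

# List the distinct non-ASCII code points (160 is U+00A0).
setdiff(utf8ToInt(scraped), 0:127)
#> [1] 160
```

If the last call returns nothing for your own data, the cause is probably something other than non-breaking spaces.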

Try replacing the literal space " " in the regex with \\s so that it matches any whitespace character:

library(stringr)
s <- "Settled or terminated (withdrawn, mutually agreed solution) on 29\u00A0March\u00A01995"
# looks like a regular space:
s
#> [1] "Settled or terminated (withdrawn, mutually agreed solution) on 29 March 1995"

# until you check it with something that can highlight unusual whitespace:
stringr::str_view(s)
#> [1] │ Settled or terminated (withdrawn, mutually agreed solution) on 29{\u00a0}March{\u00a0}1995

# replacing " " in regex with \\s:
str_view(s,"[0-9]{1,2}\\s[A-Za-z]+\\s[0-9]{4}")
#> [1] │ Settled or terminated (withdrawn, mutually agreed solution) on <29{\u00a0}March{\u00a0}1995>
str_extract(s,"[0-9]{1,2}\\s[A-Za-z]+\\s[0-9]{4}")
#> [1] "29 March 1995"

Created on 2023-09-23 with reprex v2.0.2
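An alternative, if you would rather keep the original pattern, is to normalize the whitespace once when cleaning the data. This is a minimal sketch, assuming the only problem is non-breaking spaces; `status` is a hypothetical stand-in for `country_comparison$status`.

```r
library(stringr)

# Hypothetical stand-in for the scraped column, with non-breaking spaces.
status <- c(
  "Settled or terminated (withdrawn, mutually agreed solution) on 29\u00A0March\u00A01995",
  "In consultations on 4\u00A0April\u00A01995"
)

# Replace every run of whitespace, including U+00A0, with one ordinary space.
status_clean <- str_replace_all(status, "\\s+", " ")

# The pattern from the question, with literal spaces, now matches.
str_extract(status_clean, "[0-9]{1,2} [A-Za-z]+ [0-9]{4}")
#> [1] "29 March 1995" "4 April 1995"
```

Cleaning the column up front also keeps later string comparisons and joins on it from failing for the same reason.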
