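I have been trying to use str_extract to pull dates out of data I scraped from the World Trade Organization website. The problem is that, for whatever reason, it always returns NA. Yet when I type the strings in myself, the function suddenly works. Any ideas about what is going on?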
> country_comparison$status[1:10]
[1] "Settled or terminated (withdrawn, mutually agreed solution) on 29 March 1995" "Implementation notified by respondent on 25 September 1997"
[3] "In consultations on 4 April 1995" "Implementation notified by respondent on 25 September 1997"
[5] "Settled or terminated (withdrawn, mutually agreed solution) on 20 July 1995" "Settled or terminated (withdrawn, mutually agreed solution) on 19 July 1995"
[7] "Settled or terminated (withdrawn, mutually agreed solution) on 5 July 1996" "Mutually acceptable solution on implementation notified on 9 January 1998"
[9] "Panel established, but not yet composed on 11 October 1995" "Mutually acceptable solution on implementation notified on 9 January 1998"
> country_comparison$status[1:10] %>% str_extract(pattern = "[0-9]{1,2} [A-Za-z]+ [0-9]{4}")
[1] NA NA NA NA NA NA NA NA NA NA
> c("Settled or terminated (withdrawn, mutually agreed solution) on 29 March 1995", "Implementation notified by respondent on 25 September 1997") %>% str_extract(pattern = "[0-9]{1,2} [A-Za-z]+ [0-9]{4}")
[1] "29 March 1995" "25 September 1997"
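This is a bit of a guess, but if those strings were scraped from www.wto.org, and the first one comes from https://www.wto.org/english/tratop_e/dispu_e/cases_e/ds1_e.htm , then depending on how they were collected, they may contain non-breaking spaces (U+00A0) where you expect ordinary spaces.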
Try replacing the literal space " " in your regex with \\s so that it matches any whitespace character:
library(stringr)
s <- "Settled or terminated (withdrawn, mutually agreed solution) on 29\u00A0March\u00A01995"
# looks like a regular space:
s
#> [1] "Settled or terminated (withdrawn, mutually agreed solution) on 29 March 1995"
# until you check it with something that can highlight unusual whitespace:
stringr::str_view(s)
#> [1] │ Settled or terminated (withdrawn, mutually agreed solution) on 29{\u00a0}March{\u00a0}1995
# replacing " " in regex with \\s:
str_view(s,"[0-9]{1,2}\\s[A-Za-z]+\\s[0-9]{4}")
#> [1] │ Settled or terminated (withdrawn, mutually agreed solution) on <29{\u00a0}March{\u00a0}1995>
str_extract(s,"[0-9]{1,2}\\s[A-Za-z]+\\s[0-9]{4}")
#> [1] "29 March 1995"
Created on 2023-09-23 with reprex v2.0.2
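If you would rather keep the original pattern, another option is to normalize the whitespace first. The following is only a minimal sketch: the `status` vector is a made-up stand-in for `country_comparison$status`, and it assumes that U+00A0 is the only unusual whitespace character in the scraped text, which I have not verified against the actual data.

```r
library(stringr)

# Hypothetical sample standing in for the scraped status column:
# the first element uses non-breaking spaces, the second ordinary spaces
status <- c(
  "In consultations on 4\u00A0April\u00A01995",
  "Panel established, but not yet composed on 11 October 1995"
)

# Flag the elements that contain a non-breaking space (U+00A0)
has_nbsp <- str_detect(status, fixed("\u00A0"))
has_nbsp
#> [1]  TRUE FALSE

# Replace non-breaking spaces with ordinary spaces,
# then the original pattern with literal spaces works
status_clean <- str_replace_all(status, fixed("\u00A0"), " ")
dates <- str_extract(status_clean, "[0-9]{1,2} [A-Za-z]+ [0-9]{4}")
dates
#> [1] "4 April 1995"    "11 October 1995"
```

Cleaning the column once also means that later steps, such as parsing the extracted dates, do not have to account for the non-breaking spaces again.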