Extracting text between headings with regular expressions in R

Question

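I want to extract all the text between section headings, including the opening heading but excluding the closing heading. Headings are always in uppercase, are always preceded by a digit-period or digit-letter-period combination, and are always followed by whitespace. I want to keep subheadings (such as "6.1" and "7A.1") as part of the extracted string. Here is some sample text.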

example <- "5. SCOPE This document outlines what to do in case of emergency landing (ignore for non-emergency landings) on tarmac. 6. WHEELS Never land on tarmac. Unless you have lowered the plane wheel mechanism. 6.1 Lower the wheel mechanism using the switch labelled 'wheel mechanism'. 7A WARNING 7A.1 Do not forget to warn passengers."

# The output I want is:

"5. SCOPE This document outlines what to do in case of emergency landing (ignore for non-emergency landings) on tarmac."

"6. WHEELS Never land on tarmac. Unless you have lowered the plane wheel mechanism. 6.1 Lower the wheel mechanism using the switch labelled 'wheel mechanism'."

"7A WARNING 7A.1 Do not forget to warn passengers."

Using the stringr package, and with help from this post, I got this far:

library(stringr)
str_extract_all(example, "(\\d+\\w?\\.?[:blank:]+[:upper:]+)(.*?)(?=\\d+\\w?\\.?[:blank:]+[:upper:]+)")

# Explanation of my regex code:
# (\\d+\\w?\\.?[:blank:]+[:upper:]+)
# \\d+   one or more digits
# \\w?   zero or one word character (letter, digit, or underscore)
# \\.?   zero or one period
# [:blank:]+   one or more space/tab
# [:upper:]+   one or more capital letters

# (.*?)   non-greedy capture, zero or more of any character

# (?=\\d+\\w?\\.?[:blank:]+[:upper:]+)
# ?=   followed by
# \\d+   one or more digits
# \\w?   zero or one word character (letter, digit, or underscore)
# \\.?   zero or one period
# [:blank:]+   one or more space/tab
# [:upper:]+   one or more capital letters

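This is close to what I want, but two things go wrong. First, "6.1" is split into "6." and "1". Second, the text after the last section heading is not captured, and it looks as though it may have been split the same way as "6.1".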

[[1]]
[1] "5. SCOPE This document outlines what to do in case of emergency landing (ignore for non-emergency landings) on tarmac. "
[2] "6. WHEELS Never land on tarmac. Unless you have lowered the plane wheel mechanism. 6."                                  
[3] "1 Lower the wheel mechanism using the switch labelled 'wheel mechanism'. "                                              
[4] "7A WARNING 7A."

Where exactly am I going wrong?

Answer 1

You can use

example <- "5. SCOPE This document outlines what to do in case of emergency landing (ignore for non-emergency landings) on tarmac. 6. WHEELS Never land on tarmac. Unless you have lowered the plane wheel mechanism. 6.1 Lower the wheel mechanism using the switch labelled 'wheel mechanism'. 7A WARNING 7A.1 Do not forget to warn passengers."

library(stringr)
str_split(example, "(?!^)(?<!\\d[.A-Z])(?<!\\d[A-Z]\\.)\\b(?=\\d+(?:[a-zA-Z]|\\.)\\s+\\p{Lu})")

Output:

[[1]]
[1] "5. SCOPE This document outlines what to do in case of emergency landing (ignore for non-emergency landings) on tarmac. "                                       
[2] "6. WHEELS Never land on tarmac. Unless you have lowered the plane wheel mechanism. 6.1 Lower the wheel mechanism using the switch labelled 'wheel mechanism'. "
[3] "7A WARNING 7A.1 Do not forget to warn passengers."

See the R demo and the regex demo.

Details

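(?!^) - not at the start of the string
(?<!\d[.A-Z]) - not immediately preceded by a digit followed by a period or an uppercase letter
(?<!\d[A-Z]\.) - not immediately preceded by a digit, an uppercase letter, and a period
\b - a word boundary position that is...
(?=\d+(?:[a-zA-Z]|\.)\s+\p{Lu}) - immediately followed by one or more digits, then a letter or a period, then one or more whitespace characters and an uppercase letter.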
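The split leaves trailing spaces on every piece except the last. A minimal sketch of one way to wrap the pattern above and trim the result, assuming stringr is installed; `split_sections` and `expected` are hypothetical names used only for illustration, not part of stringr:

```r
library(stringr)

# The three sections the question asks for. Joining them with single
# spaces reproduces the question's sample text.
expected <- c(
  "5. SCOPE This document outlines what to do in case of emergency landing (ignore for non-emergency landings) on tarmac.",
  "6. WHEELS Never land on tarmac. Unless you have lowered the plane wheel mechanism. 6.1 Lower the wheel mechanism using the switch labelled 'wheel mechanism'.",
  "7A WARNING 7A.1 Do not forget to warn passengers."
)
example <- paste(expected, collapse = " ")

# split_sections() is a hypothetical helper, not a stringr function.
split_sections <- function(text) {
  # Zero-width split point: not at the string start, not inside a
  # subheading such as "6.1" or "7A.1", and directly before a heading.
  pattern <- "(?!^)(?<!\\d[.A-Z])(?<!\\d[A-Z]\\.)\\b(?=\\d+(?:[a-zA-Z]|\\.)\\s+\\p{Lu})"
  pieces <- str_split(text, pattern)[[1]]
  # Remove the whitespace left behind at the end of each piece.
  str_trim(pieces)
}

sections <- split_sections(example)
print(sections)
```

On the sample text this should give the three strings requested in the question, with no trailing spaces.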

Answer 2

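This comes close as well. Note, though, that the final character class stops at any character it does not list, so in the output below the first piece is cut off at the "-" in "non-emergency" and the second at the opening quote of 'wheel mechanism'.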

str_extract_all(example, "\\d[.A-Z\\d\\s]+[A-Z]{2,}[\\s(.\\w]+")
[[1]]
[1] "5. SCOPE This document outlines what to do in case of emergency landing (ignore for non"                                                    
[2] "6. WHEELS Never land on tarmac. Unless you have lowered the plane wheel mechanism. 6.1 Lower the wheel mechanism using the switch labelled "
[3] "7A WARNING 7A.1 Do not forget to warn passengers."
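For an extraction-based approach that does not depend on listing every allowed character, one option is to tighten the pattern from the question. That pattern appears to fail for two reasons: the letter and the period are both optional and a single capital letter is enough, so the lookahead also matches "1 L" in "6.1 Lower"; and the lookahead has no alternative for the end of the string, so the last section is never closed. The sketch below is one possible fix under the question's stated assumptions (headings have at least two uppercase letters); `heading`, `pattern`, and `expected` are hypothetical names:

```r
library(stringr)

# The three sections the question asks for. Joining them with single
# spaces reproduces the question's sample text.
expected <- c(
  "5. SCOPE This document outlines what to do in case of emergency landing (ignore for non-emergency landings) on tarmac.",
  "6. WHEELS Never land on tarmac. Unless you have lowered the plane wheel mechanism. 6.1 Lower the wheel mechanism using the switch labelled 'wheel mechanism'.",
  "7A WARNING 7A.1 Do not forget to warn passengers."
)
example <- paste(expected, collapse = " ")

# Start of a heading: digits, then either a period or an uppercase
# letter with an optional period, then whitespace and at least two
# uppercase letters. "6.1 Lower" and "7A.1 Do" do not qualify.
heading <- "\\d+(?:[A-Z]\\.?|\\.)\\s+\\p{Lu}{2,}"

# Lazily consume text until whitespace plus the next heading,
# or until the end of the string for the last section.
pattern <- paste0(heading, ".*?(?=\\s+", heading, "|$)")

sections <- str_extract_all(example, pattern)[[1]]
print(sections)
```

Because the lookahead consumes the whitespace before the next heading, the extracted pieces need no trimming. The `.` here does not match line breaks, so multi-line input would need a different wildcard.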
