Extracting titles from multiple lines


I have multiple files, each with a different title, and I want to extract the title from each file. Here is an example of one file:

[1] "<START"                        "ID=\"CMP-001\""                  "NO=\"1\">"                         
[4] "<NAME>Plasma-derived"          "vaccine"                         "(PDV)"                             
[7] "versus"                        "placebo"                         "by"                                
[10] "intramuscular"                "route</NAME>"                    "<DIC"                     
[13] "CHI2=\"3.6385\""              "CI_END=\"0.6042\""               "CI_START=\"0.3425\""   
[16] "CI_STUDY=\"95\""                "CI_TOTAL=\"95\""               "DF=\"3.0\""                        
[19] "TOTAL_1=\"0.6648\""           "TOTAL_2=\"0.50487622\""           "BLE=\"YES\"" 
.
.
.
 [789] "TOTAL_2=\"39\""             "WEIGHT=\"300.0\""              "Z=\"1.5443\">"    
 [792] "<NAME>Local"                "adverse"                       "events" 
 [795] "after"                      "each"                          "injection"
 [798] "of"                         "vaccine</NAME>"               "<GROUP_LABEL_1>PDV</GROUP_LABEL_1>"
 [801] "</GROUP_LABEL_2>"           "<GRAPH_LABEL_1>"              "PDV</GRAPH_LABEL_1>"

The title I expect to extract is:

Plasma-derived vaccine (PDV) versus placebo by intramuscular route

Note that the title length differs from file to file.

r extraction
1 Answer

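Here is a solution using stringr. It first collapses the vector into one long string, then captures all words/characters between each pair of "<NAME>" and "</NAME>", excluding newlines (in a stringr regex, "." does not match "\n" by default). In future, if you provide a reproducible example (for instance with dput()), people will be able to help you more easily. Hope this helps!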

library(stringr)

str_match_all(paste0(string, collapse = " "), "<NAME>(.*?)</NAME>")[[1]][,2]
[1] "Plasma-derived vaccine (PDV) versus placebo by intramuscular route"
[2] "Local adverse events after each injection of vaccine" 
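
The question asks for one title per file, while the call above returns every <NAME> element in the input. The following is a minimal sketch of one way to keep only the first title and apply that across several files. It uses base R's regexec()/regmatches() rather than stringr, purely so that it runs without extra packages. The helper name extract_first_title, the temporary files, and their contents are hypothetical stand-ins and are not part of the original answer.

```r
# Hypothetical helper (not from the original answer): return only the first
# <NAME>...</NAME> title found in a character vector of lines or tokens.
extract_first_title <- function(x) {
  # Collapse the vector into one long string, as in the answer above
  text <- paste(x, collapse = " ")
  # regexec() reports the first match together with its capture group
  m <- regmatches(text, regexec("<NAME>(.*?)</NAME>", text, perl = TRUE))[[1]]
  # Element 1 is the full match, element 2 the captured title (NA if no match)
  m[2]
}

# Two temporary files stand in for the real input files; their contents are
# made up, loosely modelled on the sample in the question.
files <- c(tempfile(fileext = ".txt"), tempfile(fileext = ".txt"))
writeLines(c('<START ID="CMP-001" NO="1">',
             "<NAME>Plasma-derived vaccine (PDV)",
             "versus placebo by intramuscular route</NAME>",
             "<NAME>Local adverse events</NAME>"), files[1])
writeLines(c('<START ID="CMP-002" NO="2">',
             "<NAME>A shorter title</NAME>"), files[2])

# One title per file, regardless of how many lines the title spans
titles <- vapply(files, function(f) extract_first_title(readLines(f)),
                 character(1), USE.NAMES = FALSE)
print(titles)
```

Because the helper collapses its input with spaces before matching, it should behave the same whether the file was read line by line or as a vector of whitespace-separated tokens like the one shown in the question.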

Data

string <- c("<START", "ID=\"CMP-001\"", "NO=\"1\">", "<NAME>Plasma-derived", "vaccine", "(PDV)", "versus", "placebo", "by", "intramuscular", "route</NAME>", "<DIC", "CHI2=\"3.6385\"", "CI_END=\"0.6042\"", "CI_START=\"0.3425\"", "CI_STUDY=\"95\"", "CI_TOTAL=\"95\"", "DF=\"3.0\"", "TOTAL_1=\"0.6648\"", "TOTAL_2=\"0.50487622\"", "BLE=\"YES\"",
            "TOTAL_2=\"39\"", "WEIGHT=\"300.0\"", "Z=\"1.5443\">", "<NAME>Local", "adverse", "events", "after", "each", "injection", "of", "vaccine</NAME>", "<GROUP_LABEL_1>PDV</GROUP_LABEL_1>", "</GROUP_LABEL_2>", "<GRAPH_LABEL_1>", "PDV</GRAPH_LABEL_1>")