Extract strings of words from a string vector

Question (votes: 1, answers: 2)

I have a character vector, Data, that looks like this:

Data
Posted by Mohit Garg on May 7, 2016
Posted by Dr. Lokesh Garg on April 8, 2018
Posted by Lokesh.G.S  on June 11, 2001
Posted by Mohit.G.S. on July 23, 2005
Posted by Dr.Mohit G Kumar Saha on August 2, 2019

I used the str_extract() function as follows:

str_extract(Data, "Posted by \\w+. \\w+ \\w+")

It produces this output:

[1] "Posted by Mohit Garg on"   "Posted by Dr. Lokesh Garg" NA                         
[4] NA                          NA  

I want the output to look like this:

[1] "Posted by Mohit Garg on"   "Posted by Dr. Lokesh Garg"  "Posted by Lokesh.G.S"                       
[4] "Posted by Mohit.G.S."                     "Posted by Dr.Mohit G Kumar Saha"
r regex stringr
2 Answers
1 vote

You can use sub to remove " on" and everything after it.

sub(" +?on.*$", "", Data)
#[1] "Posted by momon"                 "Posted by on Mohit Garg"        
#[3] "Posted by Dr. Lokesh Garg"       "Posted by Lokesh.G.S"           
#[5] "Posted by Mohit.G.S."            "Posted by Dr.Mohit G Kumar Saha"

Data:

Data <- c("Posted by momon on Monday 29 Feb 2020"
, "Posted by on Mohit Garg on May 7, 2016"
, "Posted by Dr. Lokesh Garg on April 8, 2018"
, "Posted by Lokesh.G.S  on June 11, 2001"
, "Posted by Mohit.G.S. on July 23, 2005"
, "Posted by Dr.Mohit G Kumar Saha on August 2, 2019")
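If the goal is to cut each string at its last " on ", one way to make that intent explicit is a greedy prefix captured in a group. This is only a minimal sketch, not the answer's own method: it assumes each string ends with an " on <date>" suffix, and strip_date is a hypothetical helper name.

```r
# Hypothetical helper: keep everything before the last " on " in each string.
strip_date <- function(x) {
  # (.*\\S) is greedy, so it extends to the last run of whitespace followed by "on ".
  # Strings without such a suffix are returned unchanged by sub().
  sub("^(.*\\S)\\s+on\\s.*$", "\\1", x, perl = TRUE)
}

Data <- c("Posted by momon on Monday 29 Feb 2020",
          "Posted by on Mohit Garg on May 7, 2016",
          "Posted by Lokesh.G.S  on June 11, 2001")

strip_date(Data)
# [1] "Posted by momon"         "Posted by on Mohit Garg" "Posted by Lokesh.G.S"
```

Because the prefix is greedy, an "on" inside a name ("momon") or directly after "Posted by" is kept, and only the final date part is dropped.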

2 votes

Perhaps you can try:

stringr::str_extract(df$Data, "Posted by .+?(?=\\s+on)")

#[1] "Posted by Mohit Garg" "Posted by Dr. Lokesh Garg"  "Posted by Lokesh.G.S"
#[4] "Posted by Mohit.G.S." "Posted by Dr.Mohit G Kumar Saha"

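This extracts everything from "Posted by" up to, but not including, "on". The lazy .+? stops at the first position where the lookahead (?=\\s+on) finds whitespace followed by "on", and the lookahead itself is not part of the match.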


The same in base R:

sub(".*(Posted by .+?)(?=\\s+on).*", '\\1', df$Data, perl = TRUE) 

Data

df <- structure(list(Data = c("Posted by Mohit Garg on May 7, 2016", 
"Posted by Dr. Lokesh Garg on April 8, 2018", "Posted by Lokesh.G.S  on June 11, 2001", 
"Posted by Mohit.G.S. on July 23, 2005", "Posted by Dr.Mohit G Kumar Saha on August 2, 2019"
)), class = "data.frame", row.names = c(NA, -5L))
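One difference between the two versions: sub() returns the input unchanged when the pattern does not match, whereas str_extract() returns NA. If the NA behavior is wanted without stringr, a sketch along these lines may help. It uses regexpr() and regmatches() with the same pattern as above; extract_poster is a hypothetical helper name.

```r
# Hypothetical helper: base R counterpart of str_extract() that yields NA
# for elements where the pattern does not match.
extract_poster <- function(x, pattern = "Posted by .+?(?=\\s+on)") {
  m <- regexpr(pattern, x, perl = TRUE)   # start of first match, -1 if none
  hit <- !is.na(m) & m != -1              # elements that actually matched
  out <- rep(NA_character_, length(x))
  out[hit] <- regmatches(x, m)            # regmatches() returns only the matches
  out
}

x <- c("Posted by Mohit Garg on May 7, 2016",
       "Posted by Lokesh.G.S  on June 11, 2001",
       "No author line here")

extract_poster(x)
# [1] "Posted by Mohit Garg" "Posted by Lokesh.G.S" NA
```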