R: Trim a very long string down to complete words (keeping both the beginning and the end)

Question

Suppose I have this data frame:

df <- data.frame(text = c("This is a very long sentence that I would like to trim because I might need to put it as a label somewhere",
                          "This is another very long sentence that I would also like to trim because I might need to put it as who knows what"),
                 col2 = c("1234", "5678"),
                 stringsAsFactors = FALSE)  # before R 4.0 the default would turn text into a factor, which strsplit() rejects

Following this post, I have been able to create a new column that holds the start of each sentence, cut at a complete word, which is great.

df$short_txt = sapply(strsplit(df$text, ' '), function(i) paste(i[cumsum(nchar(i)) <= 20], collapse = ' '))

> df$short_txt
[1] "This is a very long"  "This is another very"

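However, I would also like to append the complete words that fall within the last 20 characters of the string, to get something close to this output: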

> df$short_txt
[1] "This is a very long...it as a label somewhere"  "This is another very...it as who knows what"

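I don't know how to complete the sapply call to get this result. I tried adding a second piece to the paste call and changing the cumsum condition to: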

df$short_txt = sapply(strsplit(df$text, ' '), function(i) paste(i[cumsum(nchar(i)) <= 20],"...",i[cumsum(nchar(i)) >= (nchar(i)-20)], collapse = ' '))

But it does not return what I want.

Thanks for your help.
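For reference, here is a hedged sketch of one way the sapply approach could be completed. In the attempt above, nchar(i) - 20 is negative for every word, so the second condition is always TRUE, and paste() then recycles the shorter vectors element by element before collapsing. Running cumsum over the reversed word lengths selects words from the end instead. The function name trim_words and its argument n are hypothetical, and, like the original head logic, the count ignores the spaces between words.

```r
texts <- c("This is a very long sentence that I would like to trim because I might need to put it as a label somewhere",
           "This is another very long sentence that I would also like to trim because I might need to put it as who knows what")

# Hypothetical helper: keep whole words from both ends, counting letters only (no spaces).
trim_words <- function(x, n = 20) {
  sapply(strsplit(x, " ", fixed = TRUE), function(w) {
    len <- nchar(w)
    head_keep <- cumsum(len) <= n            # words counted from the start
    tail_keep <- rev(cumsum(rev(len)) <= n)  # words counted from the end
    # If the two selections already cover every word, nothing needs to be cut.
    if (sum(head_keep) + sum(tail_keep) >= length(w)) {
      return(paste(w, collapse = " "))
    }
    paste0(paste(w[head_keep], collapse = " "), "...",
           paste(w[tail_keep], collapse = " "))
  })
}

short_txt <- trim_words(texts)
short_txt
# [1] "This is a very long...it as a label somewhere"
# [2] "This is another very...put it as who knows what"
```

Because spaces are not counted, the second result keeps "put" as well, so it differs slightly from the output hoped for above.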

Answer 1

Perhaps we can use a regular expression?

gsub("^(.{20}\\S*)\\b.*\\b(\\S*.{20})$", "\\1...\\2", df$text)
# [1] "This is a very long sentence...as a label somewhere" "This is another very...it as who knows what"

Regex explanation:

^(.{20}\\S*)\\b.*\\b(\\S*.{20})$
^                              $   beginning and end of string, respectively
 (.........)        (.........)    first and second saved groups
  .{20}                  .{20}     exactly 20 characters of any kind
       \\S*          \\S*          zero or more non-space characters
            \\b  \\b               word boundaries
               .*                  anything else (including nothing)
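To see what each saved group captures, the same pattern can be reused with a replacement that keeps only one backreference. This is just an illustrative sketch; the variable names x, pattern, head_part and tail_part are made up for the example.

```r
x <- "This is a very long sentence that I would like to trim because I might need to put it as a label somewhere"
pattern <- "^(.{20}\\S*)\\b.*\\b(\\S*.{20})$"

# Keep only the first saved group: 20 characters plus the rest of the current word.
head_part <- sub(pattern, "\\1", x)
# Keep only the second saved group: the last 20 characters, extended back to a word start.
tail_part <- sub(pattern, "\\2", x)

head_part
# [1] "This is a very long sentence"
tail_part
# [1] "as a label somewhere"
```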

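This leaves out the leading "it" in the first string, because the trailing substring is already exactly 20 characters long without it.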
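A quick check of the lengths involved (the variable names are only for illustration):

```r
tail_without_it <- "as a label somewhere"
tail_with_it    <- "it as a label somewhere"

nchar(tail_without_it)  # 20: already fills the .{20} part, so \\S* matches nothing
nchar(tail_with_it)     # 23: three more characters ("it" plus a space)
```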

Here is df$text[1] with various lengths for the leading/trailing whole-word substrings:

sapply(seq(10, 24, by = 2), function(len) gsub(sprintf("^(.{%d}\\S*)\\b.*\\b(\\S*.{%d})$", len, len), "\\1...\\2", df$text[1]))
# [1] "This is a very... somewhere"                            
# [2] "This is a very...label somewhere"                       
# [3] "This is a very...label somewhere"                       
# [4] "This is a very long... label somewhere"                 
# [5] "This is a very long... a label somewhere"               
# [6] "This is a very long sentence...as a label somewhere"    
# [7] "This is a very long sentence...it as a label somewhere" 
# [8] "This is a very long sentence... it as a label somewhere"
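If this is needed in several places, the pattern could be wrapped in a small helper. The following is only a sketch: the function name trim_middle and the arguments len and sep are assumptions made for illustration, not part of any package. Strings too short for the pattern to match are returned unchanged, since gsub() leaves non-matching input alone.

```r
texts <- c("This is a very long sentence that I would like to trim because I might need to put it as a label somewhere",
           "This is another very long sentence that I would also like to trim because I might need to put it as who knows what")

# Hypothetical helper: keep about `len` characters of whole words at each end.
trim_middle <- function(x, len = 20, sep = "...") {
  # Same pattern as above, with the length filled in via sprintf().
  pattern <- sprintf("^(.{%d}\\S*)\\b.*\\b(\\S*.{%d})$", len, len)
  gsub(pattern, paste0("\\1", sep, "\\2"), x)
}

trimmed <- trim_middle(texts)
trimmed
# [1] "This is a very long sentence...as a label somewhere"
# [2] "This is another very...it as who knows what"

trim_middle("short label")
# [1] "short label"
```

With the data frame from the question, this could be used as df$short_txt <- trim_middle(df$text).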
