Extract the first two words between two characters, then delete everything between those characters


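Relatively new R user here. I am working with newspaper articles and trying to extract the author's name from the end of letters to the editor. Here is an example of my data structure. Shown below are the endings of two text strings (each string is roughly 1,000 words long, so for convenience I only include the end of each one).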

library(dplyr)    # provides mutate() and %>% used in the attempt below
library(tidyr)
library(stringr)
library(stringi)

# Build the example data frame with a single `content` column
a <- data.frame(content = c("theirs to bear.Harvey Fierstein is an actor and playwright.", 
"young nation's love.Siddharth Dhanvant Shanghvi is the author of ''The Last Song of Dusk'' and was recently a visiting fellow at FIND: India-Europe Foundation for New Dialogues."))

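I am trying to extract the author, so what I want is to take the first two words between the second-to-last period (.) and the word "is", then move the extracted words to a new column, author. There is no space between that period and the first word.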

The correct output should look like this:

print(a$author)

[1] "Harvey Fierstein"
[2] "Siddharth Dhanvant Shanghvi"

This is what I have tried so far (with some variations), but it returns NA:

a  <- a  %>%
  mutate(author = str_extract(content, "\\.[[:alpha:]]+ [[:alpha:]]+[is]]$"))

After extracting the author's name, I want to remove the entire last sentence from content, so that I am left with only:

print(a$content)

[1] "theirs to bear."
[2] "young nation's love."

Tags: r, regex

1 answer
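
A note on the attempt in the question, as far as I can tell: the pattern ends in [is]]$, which reads as "one character, either i or s, followed by a literal ], at the very end of the string". Neither example string ends that way (both end in a period), so str_extract() returns NA. Below is a minimal sketch that handles the two example strings only. It assumes the last sentence starts right after a period, contains " is ", and has no other period inside it; the vector names content, author and clean_content are placeholders.

```r
library(stringr)

content <- c(
  "theirs to bear.Harvey Fierstein is an actor and playwright.",
  "young nation's love.Siddharth Dhanvant Shanghvi is the author of ''The Last Song of Dusk'' and was recently a visiting fellow at FIND: India-Europe Foundation for New Dialogues."
)

# Capture the text between a period and " is ", provided the rest of the
# string up to the final period contains no further period.
author <- str_match(content, "\\.\\s*([^.]+?) is [^.]*\\.$")[, 2]

# Drop the last sentence: everything after a period up to the final period.
clean_content <- str_remove(content, "(?<=\\.)[^.]+\\.$")

print(author)
print(clean_content)
```

This captures the whole name rather than exactly two words, which seems to match the expected output in the question.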

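In your real application you will still need to do a lot of cleaning! I am sure your text will include line breaks and varied punctuation, and its authors may have three or more names, or only one. My toy data includes some of these cases. On top of that, the last sentence may contain abbreviated names and other common abbreviations, which makes the task harder.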

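Anyway, here are two options:
The first extracts the first two words of the last sentence. The second extracts all words before the first stopword. For the stopwords you can create your own vector or use the one from the tm package.

Check it out (the toy data authors_df and the output of tm::stopwords are at the end).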

library(tidyverse) # and maybe `tm`; the `\(x)` lambdas need R >= 4.1

# ------------------------------------
# Regexes 
# Mind the "dotall (?s)" flag
# The `_rx` suffix keeps the patterns distinct from the `last_sentence`
# column created below, which would otherwise shadow the pattern
punct_rx <- "[\\.\\!\\?]{1,5}"
last_sentence_rx <- str_glue("(?s)(?<={punct_rx})[^\\.\\!\\?]+?{punct_rx}$")

# ------------------------------------
authors_df <- authors_df %>% 
  transmute(

    # Just a helper: extracts the last sentence
    # `str_squish` helps to avoid bigger regex
    last_sentence = content %>% 
      str_extract(last_sentence_rx) %>% 
      str_squish(), 
    
    # Two-words solution
    author_first_two = last_sentence %>% 
      str_remove(punct_rx) %>% 
      word(1, 2), # from `stringr` pkg but without "str_"
    
    # Until-the-first-stopword solution
    author_stopwords = last_sentence %>% 
      str_extract_all(boundary("word")) %>%
      map_chr(
        \(x) x %>% 
          head_while(\(xx) !(xx %in% tm::stopwords())) %>%
          str_flatten(" ")),
    
    # Remove the last sentence (match it with the regex, not the column text)
    clean_content = content %>% 
      str_remove(last_sentence_rx) %>% 
      str_squish()) %>% 
  
  # Discard the helper
  select(-last_sentence) 

Output:

>   authors_df %>% 
+     mutate(clean_content = str_trunc(clean_content, 80, "center", " ... ")) %>% 
+     print(n = nrow(.))

# A tibble: 25 × 3
   author_first_two    author_stopwords              clean_content                                                                   
   <chr>               <chr>                         <chr>                                                                           
 1 Harvey Fierstein    "Harvey Fierstein"            theirs to bear.                                                                 
 2 Siddharth Dhanvant  "Siddharth Dhanvant Shanghvi" young nation's love.                                                            
 3 Roosevelt was       "Roosevelt"                   The only thing we have to fear is fear ... ert retreat into advance. Franklin D.
 4 René Descartes      "René Descartes"              I think, therefore I am. This simple,  ... rn philosophy and rational thought..?
 5 William Shakespeare "William Shakespeare"         To be or not to be, that is the questi ... ngs and arrows of outrageous fortune.
 6 Socrates is         "Socrates"                    The unexamined life is not worth livin ... truth and understanding of the world.
 7 was an              ""                            In the end, we will remember not the w ...  in adversity. Martin Luther King Jr.
 8 Lao Tzu             "Lao Tzu"                     The journey of a thousand miles begins ... ce are key to completing the journey.
 9 Winston Churchill   "Winston Churchill"           Success is not final, failure is not f ... eep pushing forward despite setbacks.
10 Wayne Gretzky       "Wayne Gretzky"               You miss 100% of the shots you dont ta ...  as missed opportunities never score.
11 Thomas Edison       "Thomas Edison"               Genius is one percent inspiration and  ...  and persistence are key to success?!
12 Socrates was        "Socrates"                    The only true wisdom is in knowing you ...  a deeper understanding of the world.
13 , known             "known"                       The only way to do great work is to lo ... e Jobs was a co-founder of Apple Inc.
14 Albert Einstein     "Albert Einstein"             Imagination is more important than kno ... ted, but imagination knows no bounds.
15 The Dalai           "The Dalai Lama"              The purpose of our lives is to be happ ... thin oneself leads to true happiness.
16 Buddha was          "Buddha"                      Do not dwell in the past, do not dream ... Mindfulness brings clarity and peace.
17 Oscar Wilde         "Oscar Wilde"                 Be yourself; everyone else is already  ... ticity leads to personal fulfillment.
18 Robert Frost        "Robert Frost"                In three words I can sum up everything ... n. Life continues despite challenges!
19 Peter Drucker       "Peter Drucker"               The best way to predict the future is  ... ive action leads to desired outcomes!
20 Thomas ALva         "Thomas ALva Jefferson"       All men are created equal. This fundam ...  foundation of democratic principles.
21 Ralph Marston       "Ralph Marston"               What you do today can improve all your ... ay the groundwork for future success.
22 Oscar Wilde         "Oscar Wilde"                 The truth is rarely pure and never sim ... ies require thoughtful consideration.
23 Heraclitus was      "Heraclitus"                  The only constant in life is change. A ... o thriving in an ever-changing world.
24 Albert Einstein     "Albert Einstein"             In the middle of difficulty lies oppor ... es often lead to unexpected success??
25 Eleanor Roosevelt   "Eleanor Roosevelt"           The future belongs to those who believ ... n lead to extraordinary achievements.

As we can see, there is still plenty of work to do:
- "Martin Luther King Jr.",
- "Franklin D. Roosevelt",
- "Steve Jobs", co-founder of "Apple Inc.".
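
All three failures above come from periods that belong to an abbreviation rather than to the end of a sentence. One possible way to reduce them, sketched here under assumptions, is to hide those periods behind a placeholder before looking for the last sentence and to restore them afterwards. The helper name last_sentence_protected, the placeholder "<dot>" and the abbreviation list are all hypothetical and certainly incomplete.

```r
library(stringr)

# Hypothetical placeholder; assumed never to occur in the real text.
dot_mark <- "<dot>"

# Illustrative abbreviation list plus single capital-letter initials.
abbrev_rx <- "\\b(Jr|Sr|Inc|Dr|Mr|Mrs|Ms|[A-Z])\\."

last_sentence_protected <- function(x) {
  # 1. Replace the period of each known abbreviation with the placeholder.
  protected <- str_replace_all(x, abbrev_rx, str_c("\\1", dot_mark))
  # 2. Take the text after the last remaining sentence-ending punctuation.
  sentence <- str_extract(protected, "(?s)(?<=[.!?])[^.!?]+[.!?]+$")
  # 3. Put the periods back and tidy the whitespace.
  str_squish(str_replace_all(sentence, fixed(dot_mark), "."))
}

examples <- c(
  "even in adversity. Martin Luther King Jr. was an American civil rights leader.",
  "retreat into advance. Franklin D. Roosevelt was the 32nd president of the United States.",
  "creativity and innovation. Steve Jobs was a co-founder of Apple Inc., known for his passion for technology and design."
)

print(last_sentence_protected(examples))
```

With the last sentence recovered in full, either of the two author extractions can then be applied to it. A capital letter that really does end a sentence would be protected by mistake, so the list needs tuning against real data.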

Names such as "Sarah Of Light" or "King Willem-Alexander of the Dutch" will cause trouble because of the "of", which is a stopword.
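
If names like these matter, one option is to stop at the first linking verb instead of the first stopword. The sketch below assumes the author's name is always followed by one of a small set of verbs; extract_author and its default verb list are hypothetical, and the example sentences are made up for illustration.

```r
library(stringr)

# Hypothetical helper: keep everything before the first linking verb.
extract_author <- function(sentence, verbs = c("is", "was", "are", "were")) {
  # Lazy capture from the start, ending at whitespace + verb + word boundary.
  pattern <- str_c("^\\s*(.+?)\\s+(?:", str_c(verbs, collapse = "|"), ")\\b")
  str_match(sentence, pattern)[, 2]
}

sentences <- c(
  "Sarah Of Light is a writer.",
  "King Willem-Alexander of the Dutch is a monarch.",
  "Harvey Fierstein is an actor and playwright."
)

print(extract_author(sentences))
```

The match is case-sensitive and returns NA when none of the verbs occurs, so sentences phrased differently still need their own handling.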

Still, even though it does not cover every possibility, I hope this code helps you build your own solution!


Toy data and stopwords

# Toy data
authors_df <- structure(
  list(
    content = c(
      "theirs to bear.Harvey Fierstein is an actor and playwright.", 
      "young nation's love.Siddharth Dhanvant Shanghvi is the author of ''The Last Song of Dusk'' and was recently a visiting fellow at FIND: India-Europe Foundation for New Dialogues.", 
      "The only thing we have to fear is fear itself. Nameless, unreasoning, unjustified terror which paralyzes needed efforts to convert retreat into advance. Franklin D. Roosevelt was the 32nd president of the United States and led the country during the Great Depression.", 
      "I think, therefore I am. This simple, yet profound statement lays the foundation for modern philosophy and rational thought..? René Descartes was a French philosopher and mathematician who revolutionized the way we approach the world.", 
      "To be or not to be, that is the question. Whether tis nobler in the mind to suffer the slings and arrows of outrageous fortune.William Shakespeare was an English playwright and poet known for his works that have had a lasting impact on literature.", 
      "The unexamined life is not worth living. By questioning everything, we grow closer to the truth and understanding of the world. Socrates is an ancient Greek philosopher known for his Socratic method of questioning and dialogue.", 
      "In the end, we will remember not the words of our enemies, but the silence of our friends. True friends speak up for whats right, even in adversity. Martin Luther King Jr. was an American civil rights leader who fought for equality and justice for all.", 
      "The journey of a thousand miles begins with one step. Patience and perseverance are key to completing the journey. Lao Tzu was an ancient Chinese philosopher and the author of the Tao Te Ching.", 
      "Success is not final, failure is not fatal: It is the courage to continue that counts. Keep pushing forward despite setbacks. Winston Churchill was the British Prime Minister during World War II, known for his leadership and stirring speeches.", 
      "You miss 100% of the shots you dont take. Keep taking chances, as missed opportunities never score.Wayne Gretzky is a retired Canadian ice hockey player known as one of the greatest of all time.", 
      "Genius is one percent inspiration and ninety-nine percent perspiration. Hard work and persistence are key to success?! Thomas Edison was an American inventor who held over 1,000 patents and was known for his dedication to his work.", 
      "The only true wisdom is in knowing you know nothing. An open mind leads to a deeper understanding of the world. Socrates was a classical Greek philosopher known for his teachings on wisdom and humility.", 
      "The only way to do great work is to love what you do. Passion fuels creativity and innovation. Steve Jobs was a co-founder of Apple Inc., known for his passion for technology and design.", 
      "Imagination is more important than knowledge. Knowledge can be limited, but imagination knows no bounds. Albert Einstein was a German-born theoretical physicist known for his groundbreaking theories on the universe.", 
      "The purpose of our lives is to be happy. Finding peace within oneself leads to true happiness. The Dalai Lama is the spiritual leader of Tibetan Buddhism, known for his teachings on compassion and self-discovery.", 
      "Do not dwell in the past, do not dream of the future, concentrate the mind on the present moment. Mindfulness brings clarity and peace. Buddha was the founder of Buddhism, known for his teachings on mindfulness and enlightenment.", 
      "Be yourself; everyone else is already taken. Authenticity leads to personal fulfillment. Oscar Wilde was an Irish playwright, poet, and novelist known for his wit and observations on human nature.", 
      "In three words I can sum up everything Ive learned about life: it goes on. Life continues despite challenges! Robert Frost was an American poet known for his insights into rural life and human experiences.", 
      "The best way to predict the future is to create it. Proactive action leads to desired outcomes! Peter Drucker was an Austrian-American management consultant and author known for his contributions to modern management.", 
      "All men are created equal. This fundamental truth is the foundation of democratic principles. Thomas Jefferson was the third President of the USA and the principal author of the Declaration of Independence.", 
      "What you do today can improve all your tomorrows. Todays efforts lay the groundwork for future success. Ralph Marston was an American sportswriter known for his insights into life and personal development.", 
      "The truth is rarely pure and never simple. Lifes complexities require thoughtful consideration. Oscar Wilde was an Irish playwright and poet known for his sharp wit and distinctive style.", 
      "The only constant in life is change. Adaptation is key to thriving in an ever-changing world. Heraclitus was an ancient Greek philosopher known for his teachings on change and impermanence.", 
      "In the middle of difficulty lies opportunity. Challenges often lead to unexpected success?? Albert Einstein was a renowned physicist known for his theory of relativity and insights into the nature of reality.", 
      "The future belongs to those who believe in the beauty of their dreams. Embracing your vision can lead to extraordinary achievements. Eleanor Roosevelt was the First Lady of the USA and an advocate for civil rights and humanitarian causes.")), 
  
  row.names = c(NA, -25L), 
  class = c("tbl_df", "tbl", "data.frame"))

# Stopwords
> dput(tm::stopwords())
c("i", "me", "my", "myself", "we", "our", "ours", "ourselves", "you", "your", 
  "yours", "yourself", "yourselves", "he", "him", "his", "himself", "she", 
  "her", "hers", "herself", "it", "its", "itself", "they", "them", "their", 
  "theirs", "themselves", "what", "which", "who", "whom", "this", "that", 
  "these", "those", "am", "is", "are", "was", "were", "be", "been", "being", 
  "have", "has", "had", "having", "do", "does", "did", "doing", "would", 
  "should", "could", "ought", "i'm", "you're", "he's", "she's", "it's", 
  "we're", "they're", "i've", "you've", "we've", "they've", "i'd", "you'd", 
  "he'd", "she'd", "we'd", "they'd", "i'll", "you'll", "he'll", "she'll", 
  "we'll", "they'll", "isn't", "aren't", "wasn't", "weren't", "hasn't", 
  "haven't", "hadn't", "doesn't", "don't", "didn't", "won't", "wouldn't", 
  "shan't", "shouldn't", "can't", "cannot", "couldn't", "mustn't", "let's", 
  "that's", "who's", "what's", "here's", "there's", "when's", "where's", 
  "why's", "how's", "a", "an", "the", "and", "but", "if", "or", "because", 
  "as", "until", "while", "of", "at", "by", "for", "with", "about", "against", 
  "between", "into", "through", "during", "before", "after", "above", "below", 
  "to", "from", "up", "down", "in", "out", "on", "off", "over", "under", 
  "again", "further", "then", "once", "here", "there", "when", "where", "why", 
  "how", "all", "any", "both", "each", "few", "more", "most", "other", "some", 
  "such", "no", "nor", "not", "only", "own", "same", "so", "than", "too", "very")