Text summarization in R

Question    Votes: 0    Answers: 4

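I have a very long text file and, with the help of R, I want to summarize the text in at least 10 to 20 lines or short sentences. How can I summarize a text in at least 10 lines using R?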

r text text-mining summarization
4 Answers
5 votes

You can try this (from the LSAfun package):

library(LSAfun)
genericSummary(D, k = 1)

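Here "D" is your text document and "k" is the number of sentences to use in the summary. (Further options are described in the package documentation.)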

For more information: http://search.r-project.org/library/LSAfun/html/genericSummary.html
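
To connect this to the question (a long text file, roughly 10 sentences), here is a minimal sketch. It assumes the file can be read with readLines() and that genericSummary() receives the whole document as one string. The helper names and the file name "long_text.txt" are hypothetical and do not come from the package.

```r
# Read a text file and collapse it into a single string,
# dropping empty lines and surrounding whitespace.
read_as_single_string <- function(path) {
  lines <- readLines(path, warn = FALSE, encoding = "UTF-8")
  lines <- trimws(lines)
  paste(lines[nzchar(lines)], collapse = " ")
}

# Hypothetical wrapper: summarize a file in k sentences with LSAfun.
summarize_file <- function(path, k = 10) {
  if (!requireNamespace("LSAfun", quietly = TRUE)) {
    stop("Package 'LSAfun' is required: install.packages('LSAfun')")
  }
  D <- read_as_single_string(path)
  LSAfun::genericSummary(D, k = k)
}

# Usage (the file name is a placeholder):
# summarize_file("long_text.txt", k = 10)
```

The text needs to contain clearly more than k sentences for the result to be a meaningful summary.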


3 votes

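There is a package called lexRankr that summarizes text in the same way that Reddit's /u/autotldr bot summarizes articles. This article has a full walkthrough of how to use it, but here is a quick example that you can test yourself in R: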

#load needed packages
library(xml2)
library(rvest)
library(lexRankr)

#url to scrape
monsanto_url = "https://www.theguardian.com/environment/2017/sep/28/monsanto-banned-from-european-parliament"

#read page html
page = xml2::read_html(monsanto_url)
#extract text from page html using selector
page_text = rvest::html_text(rvest::html_nodes(page, ".js-article__body p"))

#perform lexrank for top 3 sentences
top_3 = lexRankr::lexRank(page_text,
                          #only 1 article; repeat same docid for all of input vector
                          docId = rep(1, length(page_text)),
                          #return 3 sentences to mimic /u/autotldr's output
                          n = 3,
                          continuous = TRUE)

#reorder the top 3 sentences to be in order of appearance in article
order_of_appearance = order(as.integer(gsub("_","",top_3$sentenceId)))
#extract sentences in order of appearance
ordered_top_3 = top_3[order_of_appearance, "sentence"]

> ordered_top_3
[1] "Monsanto lobbyists have been banned from entering the European parliament after the multinational refused to attend a parliamentary hearing into allegations of regulatory interference."
[2] "Monsanto officials will now be unable to meet MEPs, attend committee meetings or use digital resources on parliament premises in Brussels or Strasbourg."                                
[3] "A Monsanto letter to MEPs seen by the Guardian said that the European parliament was not “an appropriate forum” for discussion on the issues involved."  
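
The reordering step above strips the underscore from sentenceId and sorts the resulting number. That works for a single document, but with several documents the concatenated digits can mis-sort (for example "1_10" becomes 110 while "2_1" becomes 21). A minimal sketch of a more robust ordering follows. It assumes the IDs have the form "docId_sentenceNumber" with numeric parts, and it uses a mock data frame with invented values rather than real lexRank() output. The function name order_by_appearance is hypothetical.

```r
# Mock of the columns returned by lexRank(); values invented for illustration.
top_n <- data.frame(
  docId      = c(1, 1, 1),
  sentenceId = c("1_10", "1_2", "1_7"),
  sentence   = c("Tenth sentence.", "Second sentence.", "Seventh sentence."),
  value      = c(0.9, 0.8, 0.7),
  stringsAsFactors = FALSE
)

# Split "doc_sentence" IDs into two integer keys and sort by both.
order_by_appearance <- function(df) {
  parts <- strsplit(df$sentenceId, "_", fixed = TRUE)
  doc   <- as.integer(vapply(parts, `[`, character(1), 1))
  sent  <- as.integer(vapply(parts, `[`, character(1), 2))
  df[order(doc, sent), , drop = FALSE]
}

ordered <- order_by_appearance(top_n)
ordered$sentence
```

The same function can be applied to the top_3 data frame from the example in place of the gsub-based ordering.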

0 votes

Here is an approach based on the Pegasus transformer model:

library(reticulate)

conda_Env <- conda_list()

if(any(conda_Env[, 1] == "summary") == FALSE)
{
  reticulate::conda_create(envname = "summary", packages = c("transformers", "SentencePiece"), python_version = "3.9.16")
  reticulate::conda_install(envname = "summary", packages = "torch", pip = TRUE)
}  

reticulate::use_condaenv(condaenv = "summary")
transformers <- import(module = "transformers")

tokenizer <- transformers$AutoTokenizer$from_pretrained("google/pegasus-xsum")
model <- transformers$PegasusForConditionalGeneration$from_pretrained("google/pegasus-xsum")

summarize <- function(text)
{
  inputs <- tokenizer(text, return_tensors = "pt")
  output_sequences <- model$generate(input_ids = inputs$input_ids)
  summarized_text <- tokenizer$batch_decode(output_sequences)
  return(summarized_text)
}

text <-  "Monsanto lobbyists have been banned from entering the European parliament after the multinational refused to attend a\n 
          parliamentary hearing into allegations of regulatory interference.\n
          It is the first time MEPs have used new rules to withdraw parliamentary access for firms that ignore\n
          a summons to attend parliamentary inquiries or hearings.\n
          Monsanto officials will now be unable to meet MEPs, attend committee meetings or use digital resources on parliament premises in Brussels or Strasbourg.\n
          While a formal process still needs to be worked through, a spokesman for the parliament’s president Antonio Tajani said that\n
          the leaders of all major parliamentary blocks had backed the ban in a vote this morning."
        
summarize(text)

[1] "<pad> MEPs have taken the first step in blocking access to the European Parliament for lobbying firms.</s>"
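
The decoded output above still contains the tokenizer's special markers ("<pad>" and "</s>"). One option is to pass skip_special_tokens = TRUE to tokenizer$batch_decode(). Alternatively, the markers can be removed afterwards in plain R, as in this minimal sketch. The function name strip_special_tokens and its default token list are assumptions made for illustration.

```r
# Remove tokenizer markers such as <pad> and </s> from decoded text.
strip_special_tokens <- function(x, tokens = c("<pad>", "</s>")) {
  for (tok in tokens) {
    # fixed = TRUE treats the token as a literal string, not a regex
    x <- gsub(tok, "", x, fixed = TRUE)
  }
  trimws(x)
}

raw <- "<pad> MEPs have taken the first step in blocking access to the European Parliament for lobbying firms.</s>"
clean <- strip_special_tokens(raw)
clean
```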


0 votes

Here is another approach you could consider, based on ChatGPT:

library(chatgpt)
question <- "Can you summarize the following text in one sentence : \n Monsanto lobbyists have been banned from entering the European parliament after the multinational refused to attend a\n 
             parliamentary hearing into allegations of regulatory interference.\n
             It is the first time MEPs have used new rules to withdraw parliamentary access for firms that ignore\n
             a summons to attend parliamentary inquiries or hearings.\n
             Monsanto officials will now be unable to meet MEPs, attend committee meetings or use digital resources on parliament premises in Brussels or Strasbourg.\n
             While a formal process still needs to be worked through, a spokesman for the parliament’s president Antonio Tajani said that\n
             the leaders of all major parliamentary blocks had backed the ban in a vote this morning."

Sys.setenv(OPENAI_API_KEY = "xxx")
chatgpt::reset_chat_session()
ask_chatgpt(question)

[1] "Monsanto lobbyists have been banned from the European parliament for refusing to attend a hearing on allegations of regulatory interference, marking the first time new rules have been used to withdraw parliamentary access from firms that ignore summons to attend inquiries or hearings."
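
The prompt above asks for a single sentence. To get the 10 to 20 sentences the question asks for, the prompt can be built from the text and a sentence count. This is a minimal sketch in base R. The function name build_summary_prompt and the file name in the usage comment are hypothetical, and very long documents may exceed what the model accepts in one request.

```r
# Build a summarization prompt from a character vector of lines.
build_summary_prompt <- function(text, n_sentences = 10) {
  text <- paste(trimws(text), collapse = " ")  # flatten lines into one string
  text <- gsub("\\s+", " ", text)              # collapse repeated whitespace
  sprintf("Can you summarize the following text in %d sentences:\n%s",
          as.integer(n_sentences), text)
}

prompt <- build_summary_prompt(
  c("Monsanto lobbyists have been banned from entering the European parliament.",
    "  It is the first time MEPs have used new rules to withdraw access.  "),
  n_sentences = 10
)

# Usage with the chatgpt package (the file name is a placeholder):
# ask_chatgpt(build_summary_prompt(readLines("long_text.txt"), 10))
```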