How can I randomly sample paragraphs from a corpus while excluding paragraphs that contain words from a specific list?


I have a corpus and I want to randomly sample paragraphs from it. However, the sampling must work so that paragraphs containing specific words cannot be sampled.

Here is an example.

library("quanteda")
txt <- c("PAGE 1. A single sentence.  Short sentence. Three word sentence. \n\n Quarentine is hard",
         "PAGE 2. Very short! Shorter.\n\n quarantine is very very hard",
         "Very long sentence, with three parts, separated by commas.  PAGE 3.\n\n quarantine it's good tough to focus on paper.",
         "Fiscal policy is a bad thing. \n\n SO is a great place where skilled people solve coding problems.",
         "Fiscal policy is not as good as people may think",
         "Economics is fun. \n\n I prefer Macro.")
corp <- corpus(txt, docvars = data.frame(serial = 1:6))

Doing this without any constraint is straightforward:

reshape = corpus_reshape(corp, "paragraphs")
sample = corpus_sample(reshape, 4)

# Result

[1] "Economics is fun."                                "Fiscal policy is not as good as people may think"
[3] "Fiscal policy is a bad thing."                    "Quarentine is hard"

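As you can see, the random sample includes "paragraphs" that contain fiscal policy. I would like to sample the corpus so that paragraphs or sentences in which fiscal policy appears are excluded.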

Should I remove the sentences containing this term from the original dataset before sampling? How would you do it?

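Note that in the real dataset I need to exclude sentences containing more than just one or two keywords, so please suggest something that scales easily to many words.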

Thanks a lot!

r dataframe dictionary corpus quanteda
Answers

1 vote

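If you want to exclude paragraphs that contain "fiscal policy", you need to reshape the texts into paragraphs first, then filter out the paragraphs that contain the excluded phrase, and only then sample.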

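If you filter the texts before creating the corpus, you will also exclude paragraphs that do not contain the phrase but belong to an input text that does.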

library("quanteda")
## Package version: 2.0.1
set.seed(10)

txt <- c(
  "PAGE 1. A single sentence.  Short sentence. Three word sentence. \n\n Quarentine is hard",
  "PAGE 2. Very short! Shorter.\n\n quarantine is very very hard",
  "Very long sentence, with three parts, separated by commas.  PAGE 3.\n\n quarantine it's good tough to focus on paper.",
  "Fiscal policy is a bad thing. \n\n SO is a great place where skilled people solve coding problems.",
  "Fiscal policy is not as good as people may think",
  "Economics is fun. \n\n I prefer Macro."
)

corp <- corpus(txt, docvars = data.frame(serial = 1:6)) %>%
  corpus_reshape(to = "paragraphs")
tail(corp)
## Corpus consisting of 6 documents and 1 docvar.
## text3.2 :
## "quarantine it's good tough to focus on paper."
## 
## text4.1 :
## "Fiscal policy is a bad thing."
## 
## text4.2 :
## "SO is a great place where skilled people solve coding proble..."
## 
## text5.1 :
## "Fiscal policy is not as good as people may think"
## 
## text6.1 :
## "Economics is fun."
## 
## text6.2 :
## "I prefer Macro."

Now we can subset based on a pattern match.

corp2 <- corpus_subset(corp, !grepl("fiscal policy", corp, ignore.case = TRUE))
tail(corp2)
## Corpus consisting of 6 documents and 1 docvar.
## text2.2 :
## "quarantine is very very hard"
## 
## text3.1 :
## "Very long sentence, with three parts, separated by commas.  ..."
## 
## text3.2 :
## "quarantine it's good tough to focus on paper."
## 
## text4.2 :
## "SO is a great place where skilled people solve coding proble..."
## 
## text6.1 :
## "Economics is fun."
## 
## text6.2 :
## "I prefer Macro."

corpus_sample(corp2, size = 4)
## Corpus consisting of 4 documents and 1 docvar.
## text6.2 :
## "I prefer Macro."
## 
## text1.2 :
## "Quarentine is hard"
## 
## text2.2 :
## "quarantine is very very hard"
## 
## text3.2 :
## "quarantine it's good tough to focus on paper."

The paragraphs containing "fiscal policy" are gone.
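
To scale this to many keywords, one option is to build the pattern from a character vector instead of typing a single phrase. The following is a minimal base R sketch and not part of the original answer: the `exclude` list, the plain `paragraphs` vector and the choice of word boundaries are assumptions made for illustration. The same kind of logical vector could be passed to corpus_subset() in place of the single-phrase grepl() call above.

```r
# Hypothetical exclusion list; extend with as many words or phrases as needed.
exclude <- c("fiscal policy", "quarantine")

# Paragraph-level texts taken from the example above.
paragraphs <- c(
  "Quarentine is hard",
  "quarantine is very very hard",
  "Fiscal policy is a bad thing.",
  "SO is a great place where skilled people solve coding problems.",
  "Fiscal policy is not as good as people may think",
  "Economics is fun.",
  "I prefer Macro."
)

# One alternation pattern: \b(fiscal policy|quarantine)\b
# Assumes the keywords contain no regex metacharacters.
pattern <- paste0("\\b(", paste(exclude, collapse = "|"), ")\\b")

# TRUE for paragraphs that contain none of the keywords.
keep <- !grepl(pattern, paragraphs, ignore.case = TRUE, perl = TRUE)
kept <- paragraphs[keep]

# Sample only from the paragraphs that survived the filter.
set.seed(10)
sampled <- sample(kept, size = 2)
```

Note that the misspelled "Quarentine is hard" is not matched by the keyword "quarantine", so an exact keyword list will not catch spelling variants.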

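Note that I used grepl() here, but an all-around superior alternative is the stri_detect_*() family from stringi (or the equivalent stringr wrappers such as str_detect()). These give you more control: you can use the faster fixed matching and still choose whether the match is case-sensitive.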

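# all.equal() compares only its first two arguments, so check each
# alternative against the grepl() result explicitly.
ref <- grepl("fiscal policy", txt, ignore.case = TRUE)
all(
  identical(ref, stringi::stri_detect_fixed(txt, "fiscal policy", case_insensitive = TRUE)),
  identical(ref, stringr::str_detect(txt, stringr::fixed("fiscal policy", ignore_case = TRUE)))
)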
## [1] TRUE
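
With a long keyword list, fixed matching can also be applied one keyword at a time and the results combined, which avoids having to escape regex metacharacters. This is a minimal base R sketch under that assumption; `exclude`, `contains_any` and the sample sentence mentioning C++ are hypothetical and do not come from the answer.

```r
# Hypothetical keyword list; "c++" would need escaping in a regex but not here.
exclude <- c("fiscal policy", "c++", "macro")

paragraphs <- c(
  "Fiscal policy is a bad thing.",
  "Economics is fun.",
  "I prefer Macro.",
  "SO answers C++ questions too."  # made-up paragraph for illustration
)

# Case-insensitive fixed matching: lower-case both sides, then match literally.
contains_any <- function(x, keywords) {
  x_low <- tolower(x)
  hits <- vapply(
    tolower(keywords),
    function(k) grepl(k, x_low, fixed = TRUE),
    logical(length(x))
  )
  # Rows are paragraphs, columns are keywords; flag rows with any hit.
  rowSums(matrix(hits, nrow = length(x))) > 0
}

keep <- !contains_any(paragraphs, exclude)
kept <- paragraphs[keep]
```

Because the match is literal, substrings also count: the keyword "macro" would exclude a paragraph containing "macroeconomics" as well.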

1 vote

If you have the texts, you can subset them with grepl before building the corpus, dropping those that contain "fiscal policy" (or any other words).

# ignore.case = TRUE instead of tolower(), so "I am groot" can still match
txt2 <- txt[!grepl("fiscal policy|I am groot", txt, ignore.case = TRUE)]
txt2

[1] "PAGE 1. A single sentence.  Short sentence. Three word sentence. \n\n Quarentine is hard"                             
[2] "PAGE 2. Very short! Shorter.\n\n quarantine is very very hard"                                                        
[3] "Very long sentence, with three parts, separated by commas.  PAGE 3.\n\n quarantine it's good tough to focus on paper."
[4] "Economics is fun. \n\n I prefer Macro."

Items 4 and 5 are excluded. Now do the sampling.

If you only have the corpus, extract the texts first and then use the code above.

txt <- texts(corp)
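
Filtering whole texts in this way also drops paragraphs that do not contain the phrase but share a text with one that does (for example, the "SO is a great place..." paragraph in item 4). If that matters, the texts can be split into paragraphs before filtering. The following is a minimal base R sketch, assuming paragraphs are separated by blank lines as in the example; `raw_txt` is a hypothetical name and the code is not part of the original answer.

```r
# Three of the example texts, two of which mention "fiscal policy".
raw_txt <- c(
  "Fiscal policy is a bad thing. \n\n SO is a great place where skilled people solve coding problems.",
  "Fiscal policy is not as good as people may think",
  "Economics is fun. \n\n I prefer Macro."
)

# Split each text on blank lines (assumed paragraph separator), then trim.
paragraphs <- trimws(unlist(strsplit(raw_txt, "\n\\s*\n", perl = TRUE)))

# Text-level filter: drops every paragraph of the first two texts.
text_level <- raw_txt[!grepl("fiscal policy", raw_txt, ignore.case = TRUE)]

# Paragraph-level filter: keeps the non-matching paragraph of the first text.
para_level <- paragraphs[!grepl("fiscal policy", paragraphs, ignore.case = TRUE)]

length(text_level)  # number of surviving texts
para_level          # surviving paragraphs
```

The paragraph-level result can then be sampled with sample(), or turned into a corpus and sampled with corpus_sample() as in the question.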