R - Exporting extracted text data (each instance as a row) to data.frame format

Question · Votes: 1 · Answers: 1

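I am trying to extract text from i standardized instances in i standardized .txt files and export it to a data frame in which each instance is its own row. I then want to export that data to an .xlsx file. So far I can extract the data successfully (although the algorithm pulls slightly more than the gregexpr() pattern specifies), but I can only export it back to .txt as one large block of text.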

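  1. How can I create a data frame from the extracted text of the txt files so that each instance has its own row? (Once the data is in data.frame format, I know how to export it to xlsx from there.)
  2. How can I extract only the data that falls within the pattern I set?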

With help (especially from Ben in the comments of this post), this is what I have so far:

# Txt Data Format
txt_1 <-
"A. The First:  abcdefg hijklmnop qrstuv wxyz.
 B. The Second: abcdefg hijklmnop qrstuv wxyz.
 C. The Third:  abcdefg hijklmnop qrstuv wxyz.
 D. The Fourth: abcdefg hijklmnop qrstuv wxyz.

 A. The First:  abcdefg hijklmnop qrstuv wxyz.
 B. The Second: abcdefg hijklmnop qrstuv wxyz.
 C. The Third:  abcdefg hijklmnop qrstuv wxyz.
 D. The Fourth: abcdefg hijklmnop qrstuv wxyz."

txt_2 <-
"A. The First:  abcdefg hijklmnop qrstuv wxyz.
 B. The Second: abcdefg hijklmnop qrstuv wxyz.
 C. The Third:  abcdefg hijklmnop qrstuv wxyz.
 D. The Fourth: abcdefg hijklmnop qrstuv wxyz.

 A. The First:  abcdefg hijklmnop qrstuv wxyz.
 B. The Second: abcdefg hijklmnop qrstuv wxyz.
 C. The Third:  abcdefg hijklmnop qrstuv wxyz.
 D. The Fourth: abcdefg hijklmnop qrstuv wxyz."


#################################
# Directory and Text Extraction #
#################################

dest <- "C:/Desktop/"
docs_text <- list.files(path = dest, pattern = "txt",  full.names = TRUE)

## Assumes that all the content I want to extract is between "A." and "C." in 
## the text while ignoring "C." and "D." content.

docs_list <- list.files(path = dest, pattern = "txt",  full.names = TRUE)
docs_doc <- lapply(docs_list, function(i) {
  j <- paste0(scan(i, what = character()), collapse = " ")
  regmatches(j, gregexpr("(?<=A. The First).*?(?=C. The Third)", j, perl=TRUE))
})

lapply(1:length(docs_doc),  function(i) write.table(docs_doc[i], file=paste(docs_list[i], " ", 
" ", sep="."), quote = FALSE, row.names = FALSE, col.names = FALSE, eol = " " ))

The current output looks like this: all of the text is on a single line, and more is captured than just the content between "A." and "C.":

Current Output

The desired output looks like this: only the multi-line text between "A." and "C." is captured, and each multi-line capture gets one row per instance:

Desired Output

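Any help you can provide would be greatly appreciated!

Ultimately, I am trying to develop an NLP model that can extract standardized form data from hundreds of large PDFs for an annual repository. If this post suggests that I am not thinking about the problem efficiently, I am open to suggestions.

Thanks in advance!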

r machine-learning nlp artificial-intelligence text-extraction
1 Answer
1 vote

I am using dplyr for the convenience of tibble objects and the very efficient bind_rows command:

dest <- "~"
docs_text <- list.files(path = dest, pattern = "txt",  full.names = TRUE)

library(dplyr)

docs_df <- lapply(docs_text, function(f) {
  lines <- readLines(f)
  tibble(
    file = basename(f),
    line = seq_along(lines),
    text = lines
  )
  }) %>% 
  bind_rows()

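Once you have a proper data.frame, it is easy to subset it for matching text with filter and grepl. I am using the regular expression "^A\\.|^B\\.", which finds lines that start with A. or B. (the dots are escaped so that they match a literal period rather than any character):

docs_df %>% 
  filter(grepl("^A\\.|^B\\.", text))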
#> # A tibble: 8 x 3
#>   file       line text                                         
#>   <chr>     <int> <chr>                                        
#> 1 txt_1.txt     1 A. The First:  abcdefg hijklmnop qrstuv wxyz.
#> 2 txt_1.txt     2 B. The Second: abcdefg hijklmnop qrstuv wxyz.
#> 3 txt_1.txt     6 A. The First:  abcdefg hijklmnop qrstuv wxyz.
#> 4 txt_1.txt     7 B. The Second: abcdefg hijklmnop qrstuv wxyz.
#> 5 txt_2.txt     1 A. The First:  abcdefg hijklmnop qrstuv wxyz.
#> 6 txt_2.txt     2 B. The Second: abcdefg hijklmnop qrstuv wxyz.
#> 7 txt_2.txt     6 A. The First:  abcdefg hijklmnop qrstuv wxyz.
#> 8 txt_2.txt     7 B. The Second: abcdefg hijklmnop qrstuv wxyz.
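
If one row per instance is needed rather than one row per line, the regmatches() idea from the question can be kept and its result wrapped in a data.frame. What follows is only a minimal base R sketch under a few assumptions: the string `txt` is a hypothetical in-memory stand-in for the contents of one file, every instance starts at "A. The First" and stops before the next "C. The Third", and the "A." label is kept as part of the captured text.

```r
# Hypothetical in-memory stand-in for the contents of one .txt file.
txt <- "A. The First:  abcdefg hijklmnop qrstuv wxyz.
 B. The Second: abcdefg hijklmnop qrstuv wxyz.
 C. The Third:  abcdefg hijklmnop qrstuv wxyz.
 D. The Fourth: abcdefg hijklmnop qrstuv wxyz.

 A. The First:  abcdefg hijklmnop qrstuv wxyz.
 B. The Second: abcdefg hijklmnop qrstuv wxyz.
 C. The Third:  abcdefg hijklmnop qrstuv wxyz.
 D. The Fourth: abcdefg hijklmnop qrstuv wxyz."

# (?s) lets "." match across line breaks. The dots after "A" and "C" are
# escaped so that they match literal periods only. The lazy ".*?" stops at
# the first "C. The Third" that follows, which is checked by the lookahead
# but not included in the match.
pattern <- "(?s)A\\. The First.*?(?=C\\. The Third)"

# regmatches() returns a list with one character vector per input string;
# there is a single input string here, so take the first element.
hits <- regmatches(txt, gregexpr(pattern, txt, perl = TRUE))[[1]]

# Collapse runs of whitespace (including newlines) so that each instance
# fits on one row, then drop leading and trailing blanks.
hits <- trimws(gsub("[[:space:]]+", " ", hits))

# One row per captured instance.
instances <- data.frame(
  instance = seq_along(hits),
  text = hits,
  stringsAsFactors = FALSE
)
print(instances)
```

To apply this to real files, each file would first have to be read into a single string with its line breaks intact, for example with paste(readLines(f), collapse = "\n"), and the per-file data frames could then be combined with bind_rows as above.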

To export to Excel, I suggest rio::export().
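
As a rough sketch of that export step: the `result` data frame below is a hypothetical stand-in for the filtered table, and the output path in `out_file` is an assumption chosen only so that the example does not write into your working directory. The call is guarded because rio may not be installed.

```r
# Hypothetical result table standing in for the filtered data frame.
result <- data.frame(
  file = c("txt_1.txt", "txt_1.txt"),
  line = c(1L, 2L),
  text = c("A. The First:  abcdefg hijklmnop qrstuv wxyz.",
           "B. The Second: abcdefg hijklmnop qrstuv wxyz."),
  stringsAsFactors = FALSE
)

# Assumed output location; replace with the path you actually want.
out_file <- file.path(tempdir(), "extracted_text.xlsx")

if (requireNamespace("rio", quietly = TRUE)) {
  # rio chooses the output format from the file extension,
  # so ".xlsx" produces an Excel workbook.
  rio::export(result, out_file)
} else {
  message("Package 'rio' is not installed; run install.packages('rio') first.")
}
```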
