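I have many PDFs containing code spread across multiple subfolders. I want to extract the text from these PDFs and programmatically convert them to Jupyter notebooks.
As far as I know, there is no program or package that converts PDF directly to .ipynb format. However, the pdftools package has a function, pdf_text, that extracts the text of a PDF into a character vector, and pandoc can convert Markdown to Jupyter notebook format. So I run a sequence of steps: extract each PDF into a character vector, save that vector to a text file with cat(), then convert the text file to ipynb with pandoc through a system call.
Code example: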
library(magrittr)  # provides the %>% pipe used below

# List PDF files in the working directory and all subfolders
code.files <- list.files(
  path = ".",
  pattern = "\\.pdf$",  # escape the dot so it matches a literal "."
  all.files = TRUE,
  full.names = TRUE,
  recursive = TRUE,
  ignore.case = TRUE
)
# File conversion PDF --> txt --> ipynb
lapply(
  code.files,
  function(x) {
    # Extract text from PDF
    pdftools::pdf_text(x) %>%
      # Save extracted text to file
      cat(file = xfun::with_ext(x, "txt"))
    # Create system command with the correct file name extensions
    # (shQuote protects paths that contain spaces)
    f <- paste("pandoc -f markdown -t ipynb", "-o", shQuote(xfun::with_ext(x, "ipynb")), shQuote(xfun::with_ext(x, "txt")))
    # Make system call
    system(f, intern = TRUE)
  }
)
The PDF extraction works well, but the conversion to .ipynb changes the formatting. This is the text as extracted from the PDF:
This is a WinBUGS program for the real example in Chapter 7, Section 7.2.1.
Model: Structural Equation Model with dichotomous data
Date Set Names: full1.dat, and XI.dat, where XI.dat are input initial values for xi.
Sample Size: N=837
model{
for(i in 1:N){
#measurement equation model
for(j in 1:P){y[i,j]~dnorm(mu[i,j],psi[j])I(low[z[i,j]+1],high[z[i,j]+1])}
mu[i,1]<-eta[i]
mu[i,2]<-lam[1]*eta[i]
mu[i,3]<-lam[2]*eta[i]
mu[i,4]<-xi[i,1]
mu[i,5]<-lam[3]*xi[i,1]
mu[i,6]<-lam[4]*xi[i,1]
mu[i,7]<-xi[i,2]
mu[i,8]<-lam[5]*xi[i,2]
mu[i,9]<-lam[6]*xi[i,2]
#structural equation model
xi[i,1:2]~dmnorm(u[1:2],phi[1:2,1:2])
eta[i]~dnorm(nu[i],psd)
nu[i]<-gam[1]*xi[i,1]+gam[2]*xi[i,2]
} #end of i
for(j in 1:P){psi[j]<-1.0}
for(j in 1:2){u[j]<-0.0}
#priors on loadings and coefficients
lam[1]~dnorm(3.12,4.0) lam[2]~dnorm(0.10,4.0) lam[3]~dnorm(3.32,4.0)
lam[4]~dnorm(3.10,4.0) lam[5]~dnorm(4.30,4.0) lam[6]~dnorm(3.14,4.0)
var.gam<-4.0*psd
gam[1]~dnorm(-1.0,var.gam) gam[2]~dnorm(0.86,var.gam)
#priors on precisions
psd~dgamma(8.0, 10.0)
sgd<-1/psd
phi[1:2,1:2]~dwish(R[1:2,1:2], 8)
phx[1:2,1:2]<-inverse(phi[1:2,1:2])
} # end of model
Data
list(N=837, P=9, low=c(-2000,0), high=c(0,2000),
R=structure(
.Data=c(1.0, 0.0,
0.0, 1.0),.Dim=c(2,2)),
z=structure(
.Data=c(paste the full1.dat here),.Dim=c(837,9)))
Three different Initial values
list(lam=c(0.8,0.8,0.8,0.8,0.8,0.8),gam=c(-1.2,1.0),psd=0.5,
phi=structure(
.Data=c(1.0, 0.5,
0.5,1.0),.Dim=c(2,2)),
xi=structure(
.Data=c(paste the XI.dat here),.Dim=c(837,2)))
list(lam=c(0.6,0.6,0.6,0.6,0.6,0.6),gam=c(-1.0,0.8),psd=1.0,
phi=structure(
.Data=c(1.2, 0.0,
0.0,1.2),.Dim=c(2,2)),
xi=structure(
.Data=c(paste the XI.dat here),.Dim=c(837,2)))
list(lam=c(1.0,1.0,1.0,1.0,1.0,1.0),gam=c(-1.5,1.2),psd=0.8,
phi=structure(
.Data=c(0.8,0.1,
0.1,0.8),.Dim=c(2,2)),
xi=structure(
.Data=c(paste the XI.dat here),.Dim=c(837,2)))
And this is the same text after conversion to .ipynb:
This is a WinBUGS program for the real example in Chapter 7, Section
7.2.1.
Model: Structural Equation Model with dichotomous data Date Set Names:
full1.dat, and XI.dat, where XI.dat are input initial values for xi.
Sample Size: N=837
model{ for(i in 1:N){ #measurement equation model for(j in
1:P){y\[i,j\]~dnorm(mu\[i,j\],psi\[j\])I(low\[z\[i,j\]+1\],high\[z\[i,j\]+1\])}
mu\[i,1\]\<-eta\[i\] mu\[i,2\]\<-lam\[1\]*eta\[i\]
mu\[i,3\]\<-lam\[2\]*eta\[i\] mu\[i,4\]\<-xi\[i,1\]
mu\[i,5\]\<-lam\[3\]*xi\[i,1\] mu\[i,6\]\<-lam\[4\]*xi\[i,1\]
mu\[i,7\]\<-xi\[i,2\] mu\[i,8\]\<-lam\[5\]*xi\[i,2\]
mu\[i,9\]\<-lam\[6\]*xi\[i,2\] #structural equation model
xi\[i,1:2\]~dmnorm(u\[1:2\],phi\[1:2,1:2\]) eta\[i\]~dnorm(nu\[i\],psd)
nu\[i\]\<-gam\[1\]*xi\[i,1\]+gam\[2\]*xi\[i,2\] } #end of i for(j in
1:P){psi\[j\]\<-1.0} for(j in 1:2){u\[j\]\<-0.0} #priors on loadings and
coefficients lam\[1\]~dnorm(3.12,4.0) lam\[2\]~dnorm(0.10,4.0)
lam\[3\]~dnorm(3.32,4.0) lam\[4\]~dnorm(3.10,4.0)
lam\[5\]~dnorm(4.30,4.0) lam\[6\]~dnorm(3.14,4.0) var.gam\<-4.0\*psd
gam\[1\]~dnorm(-1.0,var.gam) gam\[2\]~dnorm(0.86,var.gam) #priors on
precisions psd~dgamma(8.0, 10.0) sgd\<-1/psd
phi\[1:2,1:2\]~dwish(R\[1:2,1:2\], 8)
phx\[1:2,1:2\]\<-inverse(phi\[1:2,1:2\]) } \# end of model
Data list(N=837, P=9, low=c(-2000,0), high=c(0,2000), R=structure(
.Data=c(1.0, 0.0, 0.0, 1.0),.Dim=c(2,2)), z=structure( .Data=c(paste the
full1.dat here),.Dim=c(837,9)))
Three different Initial values
list(lam=c(0.8,0.8,0.8,0.8,0.8,0.8),gam=c(-1.2,1.0),psd=0.5,
phi=structure( .Data=c(1.0, 0.5, 0.5,1.0),.Dim=c(2,2)), xi=structure(
.Data=c(paste the XI.dat here),.Dim=c(837,2)))
list(lam=c(0.6,0.6,0.6,0.6,0.6,0.6),gam=c(-1.0,0.8),psd=1.0,
phi=structure( .Data=c(1.2, 0.0, 0.0,1.2),.Dim=c(2,2)), xi=structure(
.Data=c(paste the XI.dat here),.Dim=c(837,2)))
list(lam=c(1.0,1.0,1.0,1.0,1.0,1.0),gam=c(-1.5,1.2),psd=0.8,
phi=structure( .Data=c(0.8,0.1, 0.1,0.8),.Dim=c(2,2)), xi=structure(
.Data=c(paste the XI.dat here),.Dim=c(837,2)))
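Converting to .ipynb puts the entire body of text into a single Markdown cell, and the process adds backslashes before square brackets (e.g. \[), while dropping some line breaks.
I would very much like to preserve the formatting of the PDF after conversion to a Jupyter notebook. The formatting in the intermediate txt file is close enough, but I really want to avoid the backslashes that are added and the line breaks that are lost in the conversion to ipynb.
Does anyone have a solution?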
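To extract text from PDF files and convert it to Jupyter notebook format (.ipynb) while keeping the formatting, you can use the tabulizer package to extract the text and then build the notebook structure yourself in R, writing it out as JSON.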
install.packages("tabulizer")
install.packages("jsonlite")
library(tabulizer)
library(jsonlite)
# Replace 'path_to_pdf.pdf' with the actual path to your PDF file
pdf_text <- extract_text("path_to_pdf.pdf")
# A named empty list serializes as {} (a plain list() would become [])
empty_obj <- setNames(list(), character(0))
notebook_content <- list(
  cells = list(
    list(
      cell_type = "markdown",
      metadata = empty_obj,
      # Join pages into one string so the cell holds the whole document
      source = paste(pdf_text, collapse = "\n")
    )
  ),
  metadata = list(
    kernelspec = list(
      display_name = "R",
      language = "R",
      name = "ir"
    ),
    language_info = list(
      codemirror_mode = "r",
      file_extension = ".r",
      mimetype = "text/x-r-source",
      name = "R"
    )
  ),
  nbformat = 4,
  nbformat_minor = 2
)
# Convert the notebook content to JSON; auto_unbox = TRUE writes scalars
# as "markdown" and 4 rather than ["markdown"] and [4]
notebook_json <- toJSON(notebook_content, pretty = TRUE, auto_unbox = TRUE)
# Replace 'output_notebook.ipynb' with the desired output file name
writeLines(notebook_json, "output_notebook.ipynb")
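A Markdown cell is still rendered as Markdown when the notebook is opened, so line breaks inside it can collapse again. As a minimal sketch of one way around that, the text can go into a code cell instead, where it is shown verbatim. The helper name text_to_notebook_json is hypothetical, and the kernel metadata is simply copied from the example above; the sketch takes any character vector, so it does not depend on which PDF extractor produced the text.

```r
library(jsonlite)

# Hypothetical helper: turn extracted text into the JSON of a one-cell notebook.
# `text` is a character vector (for example, one element per PDF page).
text_to_notebook_json <- function(text) {
  # Split into individual lines; notebooks store a multi-line source as an
  # array of strings in which every line except the last ends with "\n".
  lines <- unlist(strsplit(paste(text, collapse = "\n"), "\n", fixed = TRUE))
  n <- length(lines)
  if (n > 1) lines[-n] <- paste0(lines[-n], "\n")

  # A named empty list serializes as {} instead of [].
  empty_obj <- setNames(list(), character(0))

  cell <- list(
    cell_type = "code",
    execution_count = NULL,   # written as JSON null (see null = "null" below)
    metadata = empty_obj,
    outputs = list(),         # a code cell carries an (empty) outputs array
    source = I(lines)         # I() keeps a single line as an array, not a scalar
  )

  notebook <- list(
    cells = list(cell),
    metadata = list(
      kernelspec = list(display_name = "R", language = "R", name = "ir")
    ),
    nbformat = 4L,
    nbformat_minor = 2L
  )

  toJSON(notebook, pretty = TRUE, auto_unbox = TRUE, null = "null")
}

# Example with a short snippet in place of real PDF text
json <- text_to_notebook_json(c("model{", "  mu[i,1] <- eta[i]", "}"))
# writeLines(json, "output_notebook.ipynb")
```

Because the text is never parsed as Markdown, nothing gets escaped and every line break in the extracted text is kept. The cell will not necessarily run under the R kernel (the WinBUGS model in the question would not), but it will display as written.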
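Note that preserving formatting, especially code formatting, in a PDF-to-.ipynb conversion is difficult because the underlying formats differ. You may need to experiment with this approach and make some manual adjustments to get the result you want.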
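One adjustment that stays close to the original pandoc pipeline is to wrap the extracted text in a Markdown fenced code block before converting. This is only a sketch under the assumption that pandoc passes the contents of a fenced block through verbatim, which should avoid the added backslashes and the re-wrapped lines; the result would still be a Markdown cell that contains a code block, not a code cell. The helper name wrap_as_code_block and the file names in the usage comment are hypothetical.

```r
# Hypothetical helper: wrap extracted text in a Markdown fenced code block.
wrap_as_code_block <- function(text) {
  body <- paste(text, collapse = "\n")
  # Find runs of backticks in the text so the fence can be made longer
  # than any of them (the minimum fence length is three).
  runs <- regmatches(body, gregexpr("`+", body))[[1]]
  fence <- strrep("`", max(3L, nchar(runs) + 1L))
  paste(fence, body, fence, sep = "\n")
}

# Possible use inside the conversion loop (not run here):
# writeLines(wrap_as_code_block(pdftools::pdf_text("example.pdf")), "example.txt")
# system2("pandoc", c("-f", "markdown", "-t", "ipynb", "-o", "example.ipynb", "example.txt"))
```

Choosing the fence length from the text itself keeps the block from closing early if the PDF happens to contain backticks.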