Fatal error in Ghostscript when processing multiple files

Question (votes: 1, answers: 1)

Python 3.7.5

OS: Windows Server 2016

Ghostscript version: 9.50

I'm trying to use Ghostscript to extract text from multiple PDFs in a directory. The directory currently contains two PDFs: 1234.pdf and 5678.pdf.

import os
import sys

def pdf2txt(directory,file):
    import locale
    import ghostscript
    args=[file,"-dBATCH","-dNOPAUSE","-dNOPROMPT","-sDEVICE=txtwrite","-sOutputFile="+directory+"\\output\\"+file+"-%d.txt",directory+"\\"+file]
    encoding=locale.getpreferredencoding()
    args=[a.encode(encoding) for a in args]
    print (args)
    ghostscript.Ghostscript(*args)

directory=sys.argv[1]

files=os.listdir(directory)
for file in files:
    print("Trying "+directory+"\\"+file)
    pdf2txt(directory,file)

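The problem is that the first PDF is processed without any issue, but trying to process the second PDF always makes Python fail with a fatal error. I get the same error when running the text extraction from the Python console. The only way I can extract the second file is to exit Python and start it again.

I renamed the files so that the second PDF is processed first. That PDF is then processed without any issue, and the PDF that previously succeeded now raises the fatal error. I have tried resetting the args list and the encoding variable to empty values, and calling methods on ghostscript that do not exist, such as .quit() or .exit(). I did see a post mentioning that the exit method was commented out in __init__.py. I uncommented it, but that did not help.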

C:\Users\bob\Documents>python exporter.py c:\users\bob\Desktop\PDFs
Trying c:\users\bob\Desktop\PDFs\1234.pdf
[b'1234.pdf', b'-dBATCH', b'-dNOPAUSE', b'-dNOPROMPT', b'-sDEVICE=txtwrite', b'-sOutputFile=c:\\users\\bob\\Desktop\\PDFs\\output\\1234.pdf-%d.txt', b'c:\\users\\bob\\Desktop\\PDFs\\1234.pdf']
GPL Ghostscript 9.50 (2019-10-15)
Copyright (C) 2019 Artifex Software, Inc.  All rights reserved.
This software is supplied under the GNU AGPLv3 and comes with NO WARRANTY:
see the file COPYING for details.
Processing pages 1 through 22.
Page 1
Page 2
Page 3
Page 4
Trying c:\users\bob\Desktop\PDFs\5678.pdf
[b'5678.pdf', b'-dBATCH', b'-dNOPAUSE', b'-dNOPROMPT', b'-sDEVICE=txtwrite', b'-sOutputFile=c:\\users\\bob\\Desktop\\PDFs\\output\\5678.pdf-%d.txt', b'c:\\users\\bob\\Desktop\\PDFs\\5678.pdf']
GPL Ghostscript 9.50 (2019-10-15)
Copyright (C) 2019 Artifex Software, Inc.  All rights reserved.
This software is supplied under the GNU AGPLv3 and comes with NO WARRANTY:
see the file COPYING for details.
Traceback (most recent call last):
  File "exporter.py", line 18, in <module>
    pdf2txt(directory,file)
  File "exporter.py", line 11, in pdf2txt
    ghostscript.Ghostscript(*args)
  File "C:\Program Files\Python37\lib\site-packages\ghostscript\__init__.py", line 174, in Ghostscript
    stderr=kw.get('stderr', None))
  File "C:\Program Files\Python37\lib\site-packages\ghostscript\__init__.py", line 74, in __init__
    rc = gs.init_with_args(instance, args)
  File "C:\Program Files\Python37\lib\site-packages\ghostscript\_gsprint.py", line 273, in init_with_args
    raise GhostscriptError(rc)
ghostscript._gsprint.GhostscriptError: Fatal
python ghostscript
1 answer
0 votes

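I ran into the same problem today and found that ghostscript.Ghostscript should be called in a with block. In addition, I had to call ghostscript.cleanup() before creating a new instance of ghostscript.Ghostscript.

Try this: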

import os
import sys

def pdf2txt(directory,file):
    import locale
    import ghostscript
    args=[file,"-dBATCH","-dNOPAUSE","-dNOPROMPT","-sDEVICE=txtwrite","-sOutputFile="+directory+"\\output\\"+file+"-%d.txt",directory+"\\"+file]
    encoding=locale.getpreferredencoding()
    args=[a.encode(encoding) for a in args]
    print (args)
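    # The conversion runs when the object is created; the with block scopes its lifetime.
    with ghostscript.Ghostscript(*args) as g:
        # Release the module's global Ghostscript instance so the next call creates a new one.
        ghostscript.cleanup()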

directory=sys.argv[1]

files=os.listdir(directory)
for file in files:
    print("Trying "+directory+"\\"+file)
    pdf2txt(directory,file)
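
A slightly tidier variant of the same idea, as a sketch rather than a verified fix. It assumes the installed python-ghostscript package provides ghostscript.cleanup() and context-manager support, as the code above relies on. The helper names build_txtwrite_args and list_pdfs are hypothetical and not part of any library. Here cleanup() is called after the with block has ended instead of inside it, and only .pdf files are processed, so the output subdirectory (assumed to exist already, as in the question) is not passed to Ghostscript as an input.

```python
import os


def build_txtwrite_args(directory, filename):
    """Build the argument list for one txtwrite run (hypothetical helper)."""
    source = os.path.join(directory, filename)
    # Ghostscript replaces %d in the output name with the page number.
    output = os.path.join(directory, "output", filename + "-%d.txt")
    return [
        "gs",  # placeholder program name; Ghostscript ignores the first argument
        "-dBATCH",
        "-dNOPAUSE",
        "-dNOPROMPT",
        "-sDEVICE=txtwrite",
        "-sOutputFile=" + output,
        source,
    ]


def list_pdfs(directory):
    """Return the names of regular files ending in .pdf, in sorted order."""
    return sorted(
        name
        for name in os.listdir(directory)
        if name.lower().endswith(".pdf")
        and os.path.isfile(os.path.join(directory, name))
    )


def pdf2txt(directory, filename):
    # Imported here so the helpers above work without the ghostscript package.
    import locale
    import ghostscript

    encoding = locale.getpreferredencoding()
    args = [a.encode(encoding) for a in build_txtwrite_args(directory, filename)]
    try:
        # The conversion runs when the object is created.
        with ghostscript.Ghostscript(*args):
            pass
    finally:
        # Assumption: releasing the global instance after every run lets the
        # next call start from a fresh one, as the answer above reports.
        ghostscript.cleanup()
```

To process a whole directory, call pdf2txt(directory, name) for each name returned by list_pdfs(directory). Whether the order of the with block and cleanup() matters may depend on the package version, so treat it as something to try, not a guarantee.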