Extracting hyperlinks from a PDF in Python

Question

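I have a PDF document that contains some hyperlinks, and I need to extract all of the text from it. I used the PDFMiner library and the code from https://web.archive.org/web/20150206080323/https://endlesscurious.com/2012/06/13/scraping-pdf-with-python/ to extract the text. However, it does not extract the hyperlinks.

For example, I have text that says "Check this link out" with a link attached to it. I am able to extract the words Check this link out, but what I really need is the hyperlink itself, not the words.

How do I go about doing this? Ideally I would prefer to do it in Python, but I am open to doing it in any other language as well.

I have looked at itextsharp, but have not used it. I am running on Ubuntu, and any help is greatly appreciated.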

Answer 1

    import PyPDF2

    # Uses the legacy PyPDF2 API (PdfFileReader, getPage), as in the original answer.
    key = '/Annots'
    uri = '/URI'
    ank = '/A'

    with open("file.pdf", 'rb') as PDFFile:
        PDF = PyPDF2.PdfFileReader(PDFFile)
        pages = PDF.getNumPages()
        for page in range(pages):
            print("Current Page: {}".format(page))
            pageSliced = PDF.getPage(page)
            pageObject = pageSliced.getObject()
            if key in pageObject.keys():
                ann = pageObject[key]
                for a in ann:
                    u = a.getObject()
                    # Not every annotation has an action dictionary, so check for /A first.
                    if ank in u and uri in u[ank].keys():
                        print(u[ank][uri])

Answer 2

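You can get the hyperlinks with PDFMiner. The complication (as with so much about PDFs) is that there is really no relationship between a link annotation and the text of the link, other than that both sit in the same region of the page.

Here is the code I used to get the links on a PDFPage: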

    annotationList = []
    if page.annots:
        for annotation in page.annots.resolve():
            annotationDict = annotation.resolve()
            if str(annotationDict["Subtype"]) != "/Link":
                # Skip over any annotations that are not links
                continue
            position = annotationDict["Rect"]
            uriDict = annotationDict["A"].resolve()
            # This has always been true so far.
            assert str(uriDict["S"]) == "/URI"
            # Some of my URI's have spaces.
            uri = uriDict["URI"].replace(" ", "%20")
            annotationList.append((position, uri))

I then defined a function like this:

    def getOverlappingLink(annotationList, element):
        for (x0, y0, x1, y1), url in annotationList:
            # No horizontal overlap
            if x0 > element.x1 or element.x0 > x1:
                continue
            # No vertical overlap
            if y0 > element.y1 or element.y0 > y1:
                continue
            return url
        else:
            return None

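I use it to search the list of annotations I found earlier on the page, to see whether any hyperlink occupies the same area as the LTTextBoxHorizontal I am inspecting on that page.

In my case, because PDFMiner merges too much text into a single text box, I walk through the _objs attribute of each text box and check every LTTextLineHorizontal instance to see whether it overlaps any of the annotation positions.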
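The line-by-line matching described above can be sketched without PDFMiner, using plain tuples in place of layout objects. This is only a minimal sketch, not the answerer's actual code: the `Box` type, the `overlaps` and `links_for_lines` helpers, and the sample coordinates and URL are all hypothetical, and it assumes the annotation rectangles and the text lines share one page coordinate system as `(x0, y0, x1, y1)`.

```python
from collections import namedtuple

# Hypothetical stand-in for a PDFMiner text line: a bounding box plus its text.
Box = namedtuple("Box", ["x0", "y0", "x1", "y1", "text"])


def overlaps(rect, box):
    """Return True if rect (x0, y0, x1, y1) and box share any area or edge."""
    x0, y0, x1, y1 = rect
    if x0 > box.x1 or box.x0 > x1:
        return False  # separated horizontally
    if y0 > box.y1 or box.y0 > y1:
        return False  # separated vertically
    return True


def links_for_lines(annotation_list, lines):
    """Pair each text line with the first overlapping link URL, or None."""
    pairs = []
    for line in lines:
        url = next((u for rect, u in annotation_list if overlaps(rect, line)), None)
        pairs.append((line.text, url))
    return pairs


# Made-up sample data: one link rectangle and two text lines.
annotations = [((100, 700, 220, 715), "https://example.com/a")]
lines = [
    Box(100, 702, 210, 714, "Check this link out"),
    Box(100, 650, 300, 662, "Plain paragraph text"),
]
result = links_for_lines(annotations, lines)
for text, url in result:
    print(text, "->", url)
```

Checking individual lines rather than whole text boxes keeps a link from being attributed to unrelated text that merely shares a box with it.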

Answer 3

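    import pyPdf  # legacy library; this snippet is Python 2 (print statement, has_key)

    PDFFile = open('File Location', 'rb')
    PDF = pyPdf.PdfFileReader(PDFFile)
    pages = PDF.getNumPages()
    key = '/Annots'
    uri = '/URI'
    ank = '/A'

    for page in range(pages):
        pageSliced = PDF.getPage(page)
        pageObject = pageSliced.getObject()
        if pageObject.has_key(key):
            ann = pageObject[key]
            for a in ann:
                u = a.getObject()
                # Skip annotations that have no action dictionary
                if u.has_key(ank) and u[ank].has_key(uri):
                    print u[ank][uri]

I hope this gives you the links in your PDF.
P.S.: I have not tried this extensively.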

Answer 4

You will get a semicolon-separated list of the links in the given PDF.

Answer 5

I think it is relatively easy to handle this by looking for annotations of type LNK.
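As a rough illustration of filtering annotations by type, the sketch below works on plain Python dictionaries that mirror the PDF annotation structure used in the other answers (`/Subtype`, `/A`, `/S`, `/URI`). The `link_uris` helper and the sample data are hypothetical. With a real PDF library, each annotation would typically need to be resolved from an indirect reference before being passed in.

```python
def link_uris(annotations):
    """Collect URIs from annotation dictionaries whose /Subtype is /Link."""
    uris = []
    for annot in annotations:
        if annot.get("/Subtype") != "/Link":
            continue  # e.g. /Text or /Highlight annotations
        action = annot.get("/A", {})
        # Only URI actions carry a web address; other actions are skipped.
        if action.get("/S") == "/URI" and "/URI" in action:
            uris.append(action["/URI"])
    return uris


# Made-up annotations standing in for one page's /Annots array.
sample = [
    {"/Subtype": "/Link", "/A": {"/S": "/URI", "/URI": "https://example.com"}},
    {"/Subtype": "/Text", "/Contents": "a sticky note"},
    {"/Subtype": "/Link", "/A": {"/S": "/GoTo", "/D": "chapter1"}},
    {"/Subtype": "/Link"},
]
found = link_uris(sample)
print(found)
```

Link annotations without a URI action, such as the internal jump and the action-less link in the sample, are left out.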

Answer 6

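    import PyPDF2

    pdf = PyPDF2.PdfFileReader('filename.pdf')
    urls = []
    for page in range(pdf.numPages):
        pdfPage = pdf.getPage(page)
        # Pages without annotations have no /Annots entry.
        if '/Annots' not in pdfPage:
            continue
        for item in pdfPage['/Annots']:
            try:
                # Resolve indirect references before indexing into the annotation.
                urls.append(item.getObject()['/A']['/URI'])
            except KeyError:
                # Annotation without a URI action; skip it.
                pass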
