Unable to collect all the rows under the transactions table from a PDF file


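I'm trying to extract all the rows under the transactions table from this PDF file. The script I've written only grabs the first row under the first header and the first row under the last header. How can I collect all the rows from that page?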

import os
import io
import re
import requests
import pdfplumber

pdf_url = 'https://disclosures-clerk.house.gov/public_disc/ptr-pdfs/2016/20005444.pdf'

response = requests.get(pdf_url)

with io.BytesIO(response.content) as f:
    with pdfplumber.open(f) as pdf:
        text_content = ""
        for page in pdf.pages:
            text_content += page.extract_text()

pattern = r'(?:iD owner asset transaction Date notification amount cap\.\s*type Date gains >\s*\$200\?\s*|iD owner asset transaction Date notification(?: amount)?\s*type Date\s*)\s*([^\n]+)'
matches = re.findall(pattern, text_content, re.IGNORECASE | re.DOTALL)
for match in matches:
    print(match.strip())

Current output:

JT Agnico Eagle Mines limited (AEM) S 06/29/2016 06/30/2016 $15,001 - $50,000
FIlINg STATuS: New
u.S. global Jets ETF (JETS) P 07/1/2016 07/1/2016 $1,001 - $15,000

For reference, this is the kind of line I'm interested in:

Agnico Eagle Mines limited (AEM) S 06/29/2016 06/30/2016 $15,001 - $50,000

python python-3.x web-scraping python-requests pdfplumber
1 Answer

Maybe you can use a simpler strategy: find all the lines that contain $:

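import io  # needed for io.BytesIO below

import pdfplumber
import requests

pdf_url = "https://disclosures-clerk.house.gov/public_disc/ptr-pdfs/2016/20005444.pdf"

response = requests.get(pdf_url)

with io.BytesIO(response.content) as f:
    with pdfplumber.open(f) as pdf:
        out = []
        for page in pdf.pages:
            # "or ''" guards against a page with no extractable text
            for line in (page.extract_text() or "").splitlines():
                # every transaction row ends with a dollar amount range
                if "$" in line:
                    # drop the owner code; str.removeprefix needs Python 3.9+
                    out.append(line.removeprefix("JT "))

print(out)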

Prints:

[
    "Agnico Eagle Mines limited (AEM) S 06/29/2016 06/30/2016 $15,001 - $50,000",
    "Agnico Eagle Mines limited (AEM) S 06/29/2016 06/30/2016 $1,001 - $15,000",
    "barrick gold Corporation (AbX) S 06/29/2016 06/30/2016 $1,001 - $15,000",
    "Eldorado gold Corporation Ordinary S 06/29/2016 06/30/2016 $1,001 - $15,000",
    "First Trust ISE-Revere Natural gas S 06/29/2016 06/30/2016 $1,001 - $15,000",
    "goldcorp Inc. (gg) S 06/29/2016 06/30/2016 $15,001 - $50,000",
    "goldcorp Inc. (gg) S 06/29/2016 06/30/2016 $1,001 - $15,000",
    "Kinross gold Corporation (KgC) S 06/29/2016 06/30/2016 $1,001 - $15,000",
    "Newmont Mining Corporation (NEM) S 06/29/2016 06/30/2016 $1,001 - $15,000",
    "Newmont Mining Corporation (NEM) S 06/29/2016 06/30/2016 $1,001 - $15,000",
    "North American Palladium, ltd. (PAl) S 06/29/2016 06/30/2016 $1,001 - $15,000",
    "Pan American Silver Corp. (PAAS) S 06/29/2016 06/30/2016 $1,001 - $15,000",
    "Pilot gold, Inc Ordinary Shares (PlgTF) S 06/29/2016 06/30/2016 $1,001 - $15,000",
    "Pinetree Capital ltd Ordinary Shares S 06/29/2016 06/30/2016 $1,001 - $15,000",
    "Rare Element Resources ltd. Ordinary S 06/29/2016 06/30/2016 $1,001 - $15,000",
    "Silver Wheaton Corp Common Shares S 06/29/2016 06/30/2016 $1,001 - $15,000",
    "Silver Wheaton Corp Common Shares S 06/29/2016 06/30/2016 $1,001 - $15,000",
    "SPdR S&P International dividend ETF P 07/1/2016 07/1/2016 $1,001 - $15,000",
    "u.S. global Jets ETF (JETS) P 07/1/2016 07/1/2016 $1,001 - $15,000",
    "Yamana gold Inc. Ordinary Shares S 06/29/2016 06/30/2016 $1,001 - $15,000",
]
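
If you also need the individual fields, the collected lines could be split with a regular expression. This is only a minimal sketch, assuming every row has the shape seen in the output above: an optional JT owner prefix, the asset name, a one-letter transaction type, two dates, and an amount range. ROW_RE and parse_rows are hypothetical names introduced here for illustration, and a row that wraps onto several lines in the PDF would not match.

```python
import re

# Assumed row shape (inferred from the printed lines, not from any spec):
#   [JT] <asset> <type letter> <date> <date> $low - $high
ROW_RE = re.compile(
    r"^(?:(?P<owner>JT)\s+)?"                        # optional owner code
    r"(?P<asset>.+?)\s+"                             # asset name (non-greedy)
    r"(?P<type>[A-Z])\s+"                            # transaction type, e.g. P or S
    r"(?P<transaction_date>\d{1,2}/\d{1,2}/\d{4})\s+"
    r"(?P<notification_date>\d{1,2}/\d{1,2}/\d{4})\s+"
    r"(?P<amount>\$[\d,]+\s*-\s*\$[\d,]+)"           # amount range
)


def parse_rows(lines):
    """Return one dict per line that looks like a transaction row."""
    rows = []
    for line in lines:
        match = ROW_RE.match(line.strip())
        if match:  # lines such as "FIlINg STATuS: New" are skipped
            rows.append(match.groupdict())
    return rows


if __name__ == "__main__":
    sample = [
        "JT Agnico Eagle Mines limited (AEM) S 06/29/2016 06/30/2016 $15,001 - $50,000",
        "FIlINg STATuS: New",
        "u.S. global Jets ETF (JETS) P 07/1/2016 07/1/2016 $1,001 - $15,000",
    ]
    for row in parse_rows(sample):
        print(row)
```

Feeding it the text from page.extract_text().splitlines() instead of the sample list gives one dict per transaction, with owner set to None when the line has no JT prefix.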