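I'm trying to extract all the rows under the transactions table from this PDF file. The script I wrote only captures the first row under the first and last headers. How can I collect all the rows on the page?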
import os
import io
import re
import requests
import pdfplumber

pdf_url = 'https://disclosures-clerk.house.gov/public_disc/ptr-pdfs/2016/20005444.pdf'
response = requests.get(pdf_url)
with io.BytesIO(response.content) as f:
    with pdfplumber.open(f) as pdf:
        text_content = ""
        for page in pdf.pages:
            text_content += page.extract_text()

pattern = r'(?:iD owner asset transaction Date notification amount cap\.\s*type Date gains >\s*\$200\?\s*|iD owner asset transaction Date notification(?: amount)?\s*type Date\s*)\s*([^\n]+)'
matches = re.findall(pattern, text_content, re.IGNORECASE | re.DOTALL)
for match in matches:
    print(match.strip())
Current output:
JT Agnico Eagle Mines limited (AEM) S 06/29/2016 06/30/2016 $15,001 - $50,000
FIlINg STATuS: New
u.S. global Jets ETF (JETS) P 07/1/2016 07/1/2016 $1,001 - $15,000
For reference, this is the kind of line I'm interested in:
Agnico Eagle Mines limited (AEM) S 06/29/2016 06/30/2016 $15,001 - $50,000
Perhaps you can use a simpler strategy: find every line that contains a "$":
import io  # needed for io.BytesIO below

import pdfplumber
import requests

pdf_url = "https://disclosures-clerk.house.gov/public_disc/ptr-pdfs/2016/20005444.pdf"
response = requests.get(pdf_url)

with io.BytesIO(response.content) as f:
    with pdfplumber.open(f) as pdf:
        out = []
        for page in pdf.pages:
            # Walk the page text line by line and keep only lines with an amount.
            for line in page.extract_text().splitlines():
                if "$" in line:
                    # Drop the leading owner code; str.removeprefix needs Python 3.9+.
                    out.append(line.removeprefix("JT "))

print(out)
Prints:
[
"Agnico Eagle Mines limited (AEM) S 06/29/2016 06/30/2016 $15,001 - $50,000",
"Agnico Eagle Mines limited (AEM) S 06/29/2016 06/30/2016 $1,001 - $15,000",
"barrick gold Corporation (AbX) S 06/29/2016 06/30/2016 $1,001 - $15,000",
"Eldorado gold Corporation Ordinary S 06/29/2016 06/30/2016 $1,001 - $15,000",
"First Trust ISE-Revere Natural gas S 06/29/2016 06/30/2016 $1,001 - $15,000",
"goldcorp Inc. (gg) S 06/29/2016 06/30/2016 $15,001 - $50,000",
"goldcorp Inc. (gg) S 06/29/2016 06/30/2016 $1,001 - $15,000",
"Kinross gold Corporation (KgC) S 06/29/2016 06/30/2016 $1,001 - $15,000",
"Newmont Mining Corporation (NEM) S 06/29/2016 06/30/2016 $1,001 - $15,000",
"Newmont Mining Corporation (NEM) S 06/29/2016 06/30/2016 $1,001 - $15,000",
"North American Palladium, ltd. (PAl) S 06/29/2016 06/30/2016 $1,001 - $15,000",
"Pan American Silver Corp. (PAAS) S 06/29/2016 06/30/2016 $1,001 - $15,000",
"Pilot gold, Inc Ordinary Shares (PlgTF) S 06/29/2016 06/30/2016 $1,001 - $15,000",
"Pinetree Capital ltd Ordinary Shares S 06/29/2016 06/30/2016 $1,001 - $15,000",
"Rare Element Resources ltd. Ordinary S 06/29/2016 06/30/2016 $1,001 - $15,000",
"Silver Wheaton Corp Common Shares S 06/29/2016 06/30/2016 $1,001 - $15,000",
"Silver Wheaton Corp Common Shares S 06/29/2016 06/30/2016 $1,001 - $15,000",
"SPdR S&P International dividend ETF P 07/1/2016 07/1/2016 $1,001 - $15,000",
"u.S. global Jets ETF (JETS) P 07/1/2016 07/1/2016 $1,001 - $15,000",
"Yamana gold Inc. Ordinary Shares S 06/29/2016 06/30/2016 $1,001 - $15,000",
]
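If you later need the individual columns rather than whole lines, one possible follow-up is to match each line against a row-shaped regular expression. The sketch below is only an illustration built from the rows shown above: the names ROW_RE and parse_row are hypothetical, the owner codes (JT, SP, DC) and the single-letter transaction type are assumptions about this form, and rows whose asset name wraps onto a second line are not handled.

```python
import re

# Assumed row layout, inferred from the sample output above:
#   [owner] asset  type  transaction-date  notification-date  amount-range
ROW_RE = re.compile(
    r"^(?:(?P<owner>JT|SP|DC)\s+)?"           # optional owner code (assumed set)
    r"(?P<asset>.+?)\s+"                      # asset name, matched lazily
    r"(?P<type>[A-Z])\s+"                     # single-letter type, e.g. P or S
    r"(?P<tx_date>\d{1,2}/\d{1,2}/\d{4})\s+"  # transaction date
    r"(?P<notified>\d{1,2}/\d{1,2}/\d{4})\s+" # notification date
    r"(?P<amount>\$[\d,]+\s*-\s*\$[\d,]+)"    # amount range
)


def parse_row(line):
    """Return a dict of fields for a transaction row, or None for other lines."""
    m = ROW_RE.match(line.strip())
    return m.groupdict() if m else None


sample_lines = [
    "JT Agnico Eagle Mines limited (AEM) S 06/29/2016 06/30/2016 $15,001 - $50,000",
    "FIlINg STATuS: New",
    "SPdR S&P International dividend ETF P 07/1/2016 07/1/2016 $1,001 - $15,000",
]

# Keep only the lines that look like transaction rows.
rows = [row for row in map(parse_row, sample_lines) if row is not None]
for row in rows:
    print(row["asset"], "|", row["type"], "|", row["tx_date"], "|", row["amount"])
```

Because the pattern requires two dates followed by a dollar range, it also skips any non-table line that merely happens to contain a "$", which the plain substring check would keep.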