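I am trying to scrape a web page using Playwright.
I load the page and successfully click the download button with Playwright. This opens a print dialog with a printer already selected.
I would like to select "Save as PDF" and then click the "Save" button.
Here is my current code: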
    import os

    from bs4 import BeautifulSoup
    from playwright.sync_api import sync_playwright

    # url_to_start_from and DOWNLOADED_PDF_FOLDER are defined elsewhere.
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        playwright_page = browser.new_page()
        got_error = False
        try:
            playwright_page.goto(url_to_start_from)
            print(playwright_page.title())
            html = playwright_page.content()
        except Exception as e:
            print(f"Playwright exception: {e}")
            got_error = True
        if not got_error:
            soup = BeautifulSoup(html, 'html.parser')
            # Download the PDF
            with playwright_page.expect_download() as download_info:
                playwright_page.locator("text=download").click()
            download = download_info.value
            path = download.path()  # temporary path of the finished download
            # save_as() expects a file path, not a folder
            download.save_as(os.path.join(DOWNLOADED_PDF_FOLDER, download.suggested_filename))
        browser.close()
Is there a way to do this using Playwright?
You don't actually need the print dialog. You can generate the PDF directly from Playwright by emulating the print media type.
    await page.emulateMedia({ media: "print" });
    await page.goto("https://robstarbuck.uk/cv");
    await page.pdf({ path: "./cv.pdf", format: "A4" });
This is how I generate my CV.
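Since the question uses Playwright's Python sync API, here is a rough Python sketch of the same idea. The function names `pdf_path_for` and `save_page_as_pdf`, and the way the file name is derived from the URL, are my own assumptions for illustration and not part of Playwright. As far as I know, `page.pdf()` is only supported in headless Chromium, so this keeps `headless=True`.

```python
from pathlib import Path
from urllib.parse import urlparse


def pdf_path_for(url: str, folder: str) -> Path:
    """Hypothetical helper: name the PDF after the last segment of the URL path."""
    stem = Path(urlparse(url).path).stem or "page"
    return Path(folder) / f"{stem}.pdf"


def save_page_as_pdf(url: str, folder: str) -> Path:
    """Render `url` to a PDF inside `folder` and return the file path."""
    # Imported here so pdf_path_for() stays usable without Playwright installed.
    from playwright.sync_api import sync_playwright

    out_path = pdf_path_for(url, folder)
    out_path.parent.mkdir(parents=True, exist_ok=True)
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        # Mirror the JavaScript above: render with the page's print CSS.
        page.emulate_media(media="print")
        page.goto(url)
        page.pdf(path=str(out_path), format="A4")
        browser.close()
    return out_path
```

With the names from the question, calling `save_page_as_pdf(url_to_start_from, DOWNLOADED_PDF_FOLDER)` should write the PDF straight into that folder, without clicking the download button or waiting for a download event.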
Many thanks to @KJ in the comments, who pointed out that with headless=True, Chromium won't even set up the print dialog in the first place.