Automatically convert multiple daily-updated web pages into a single PDF by date, using Python or any better tool

Question

I am preparing for an exam, and these websites upload current affairs regularly.

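Rather than reopening all of these websites every day, I would like to simplify the process: download the web pages as PDFs and merge those PDFs into one, so that in the end I only have to read a single PDF (instead of clicking through numerous pages to read about 3 pages' worth of content).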

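I did try looking for options using Python and PowerShell, but those approaches do not work on every website. I have to find the page links myself (they cannot be found on every page) and change them by hand, which I suppose does not count as automation. Since I am not into coding, all I can do is copy and paste things from the internet, and that has not worked.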

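So I am looking for an automated script that does the following:

• Produces, every day and based on the date, a single merged PDF of the latest current affairs from all of the websites listed below.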

• If one method does not work for all sites, I can run multiple scripts (this old machine sometimes struggles).

• On some sites, prelims and mains current affairs are kept separate; it would be appreciated if you could handle that as well.

• Please explain the steps so that I can recreate this for other websites in the future if needed.

Websites:

https://vajiramandravi.com/upsc-daily-current-affairs/

https://visionias.in/resources/daily_current_affairs.php?type=1

https://www.drishtiias.com/current-affairs-news-analysis-editorials

https://forumias.com/blog/subjectwise-current-affairs-for-upsc-ias-prelims-examination/

https://iasbaba.com/current-affairs-for-ias-upsc-exams/

Thanks.

PS: I do not have a 24x7 internet connection, so I can only run the script during the time window I am allowed.

Answer 1

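To automatically convert daily-updated web pages into a single PDF, you need a script that:

- Downloads the web page content as PDFs
- Merges the PDFs into one PDF
- Handles errors or changes in page structure
- Runs on a schedule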

Given your constraints, let's build a Python-based solution using the following tools:

- `requests`: fetches web page content.
- `pdfkit`: converts HTML to PDF.
- `PyPDF2`: merges PDFs.

Prerequisites:

Install the `pdfkit`, `PyPDF2`, and `requests` packages:

pip install pdfkit PyPDF2 requests

Make sure `wkhtmltopdf` (used by `pdfkit`) is installed:
- Windows: download the installer and add its install location to your PATH.
- Mac: install via Homebrew (`brew install wkhtmltopdf`).
- Linux: use your package manager (`sudo apt-get install wkhtmltopdf`).
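Before running anything else, it can help to confirm that Python can actually see the `wkhtmltopdf` binary. The following is a minimal sketch using only the standard library; the helper name `find_wkhtmltopdf` is made up for illustration and is not part of `pdfkit`.

```python
import shutil
from typing import Optional


def find_wkhtmltopdf(executable: str = "wkhtmltopdf") -> Optional[str]:
    """Return the full path of the binary, or None if it is not on PATH.

    Hypothetical helper for illustration; it only wraps shutil.which.
    """
    return shutil.which(executable)


path = find_wkhtmltopdf()
if path is None:
    print("wkhtmltopdf was not found on PATH; pdfkit will not be able to convert pages.")
else:
    print(f"wkhtmltopdf found at: {path}")
```

If the binary is installed somewhere that is not on PATH, `pdfkit` also accepts an explicit location: build a configuration with `pdfkit.configuration(wkhtmltopdf="<full path>")` and pass it as the `configuration=` argument of `pdfkit.from_url`.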

Script steps:

- Fetch the web pages
- Convert each page to a PDF
- Merge the PDFs into one
- Save the merged PDF with the current date

Here is a simple Python script:

import os
from datetime import datetime

import pdfkit
import requests
from PyPDF2 import PdfMerger  # PdfFileMerger was removed in PyPDF2 3.0

# URLs of the webpages to fetch
webpages = [
    "https://vajiramandravi.com/upsc-daily-current-affairs/",
    "https://visionias.in/resources/daily_current_affairs.php?type=1",
    "https://www.drishtiias.com/current-affairs-news-analysis-editorials",
    "https://forumias.com/blog/subjectwise-current-affairs-for-upsc-ias-prelims-examination/",
    "https://iasbaba.com/current-affairs-for-ias-upsc-exams/"
]

# Create a directory to store the PDFs if it doesn't exist
pdf_directory = "daily_pdfs"
os.makedirs(pdf_directory, exist_ok=True)

# List to store paths of generated PDFs
pdf_files = []

# Convert each webpage to a PDF and store the filename
for idx, url in enumerate(webpages, start=1):
    pdf_filename = os.path.join(pdf_directory, f"webpage_{idx}.pdf")
    try:
        pdfkit.from_url(url, pdf_filename)
    except OSError as exc:
        # pdfkit raises OSError when wkhtmltopdf fails; skip this page and keep going
        print(f"Could not convert {url}: {exc}")
        continue
    pdf_files.append(pdf_filename)

# Merge the PDFs into one
merger = PdfMerger()

for pdf_file in pdf_files:
    merger.append(pdf_file)

# Save the merged PDF with today's date
today = datetime.now().strftime("%Y-%m-%d")
merged_pdf_filename = os.path.join(pdf_directory, f"current_affairs_{today}.pdf")
merger.write(merged_pdf_filename)
merger.close()

print(f"Merged PDF saved as: {merged_pdf_filename}")

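Running the script:

- Save the script to a file, for example `convert_webpages.py`.
- Open a terminal or command prompt and navigate to the script's location.
- Run the script with `python convert_webpages.py`.
- The generated PDF is saved in the `daily_pdfs` directory, with the current date in its filename.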
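Because the output filename carries the date, a second run on the same day can be detected and skipped, which may matter when the connection is only available for a limited window. This is a sketch only: the helper names `merged_pdf_path` and `already_generated` are hypothetical, and the `current_affairs_<date>.pdf` naming is assumed to match the script.

```python
from datetime import date
from pathlib import Path
from typing import Optional


def merged_pdf_path(directory: str = "daily_pdfs", day: Optional[date] = None) -> Path:
    """Build the date-stamped output path (hypothetical helper)."""
    day = day or date.today()
    return Path(directory) / f"current_affairs_{day:%Y-%m-%d}.pdf"


def already_generated(directory: str = "daily_pdfs", day: Optional[date] = None) -> bool:
    """Return True if the merged PDF for the given day already exists."""
    return merged_pdf_path(directory, day).exists()


if already_generated():
    print("Today's PDF already exists; nothing to download.")
else:
    print(f"Today's PDF would be written to: {merged_pdf_path()}")
```

Calling `already_generated()` at the top of the script and exiting early when it returns `True` avoids downloading every page again.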

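Troubleshooting tips:

- If `pdfkit` raises an error, check that `wkhtmltopdf` is installed and on your PATH.
- If the script cannot fetch a web page, check whether the URL has changed.
- If the content on a page is dynamic (JavaScript), `pdfkit` may not render it correctly.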
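The tips above can be combined into a conversion loop that keeps going when one site fails and gives page scripts a moment to run. Treat this as a sketch under assumptions: `convert_all` and `pdfkit_convert` are hypothetical names, the 2000 ms delay is an arbitrary guess, and the delay may still not be enough for heavily dynamic pages.

```python
import os
from typing import Callable, Iterable, List, Tuple


def pdfkit_convert(url: str, output_path: str) -> None:
    """Convert one URL with pdfkit, waiting briefly for JavaScript."""
    import pdfkit  # imported lazily so convert_all can be tried without pdfkit

    # "javascript-delay" is passed to wkhtmltopdf as --javascript-delay (milliseconds).
    pdfkit.from_url(url, output_path, options={"javascript-delay": "2000"})


def convert_all(
    urls: Iterable[str],
    out_dir: str,
    convert: Callable[[str, str], None] = pdfkit_convert,
) -> Tuple[List[str], List[Tuple[str, str]]]:
    """Convert every URL, returning (converted paths, [(failed url, error message)])."""
    os.makedirs(out_dir, exist_ok=True)
    converted: List[str] = []
    failed: List[Tuple[str, str]] = []
    for idx, url in enumerate(urls, start=1):
        path = os.path.join(out_dir, f"webpage_{idx}.pdf")
        try:
            convert(url, path)
        except OSError as exc:
            # pdfkit reports wkhtmltopdf failures as OSError; record and move on.
            failed.append((url, str(exc)))
        else:
            converted.append(path)
    return converted, failed
```

A call such as `converted, failed = convert_all(webpages, "daily_pdfs")` then leaves a list of PDFs to merge and a list of sites to check by hand, for example because a URL has changed.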

With this script you can fetch multiple web pages and convert them into one merged PDF. It can be adapted to other URLs by updating the `webpages` list.
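For someone who would rather not edit Python code each time a site is added, one option is to keep the URLs in a plain text file and load them at startup. The sketch below assumes a file with one URL per line; the function name `load_urls` and the file name `urls.txt` are made up for illustration.

```python
from pathlib import Path
from typing import List


def load_urls(path: str) -> List[str]:
    """Read one URL per line, skipping blank lines and lines starting with '#'."""
    urls: List[str] = []
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            urls.append(line)
    return urls
```

In the script, `webpages = load_urls("urls.txt")` would then replace the hard-coded list.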
