How to get PDF links from a website with a Python script

Question

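I often have to download PDFs from websites, but sometimes they are not all on one page. The links are split across paginated pages, and I have to click through every page to get the links.

I am learning Python, and I want to write a script that I can give a web URL to and that extracts the PDF links from that website.

I am new to Python, so could anyone point me in the right direction?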

Answer 1

This is fairly simple with urllib2, urlparse, and lxml. Since you're new to Python, I've commented things in more detail:

# modules we're using (you'll need to download lxml)
import lxml.html, urllib2, urlparse

# the url of the page you want to scrape
base_url = 'http://www.renderx.com/demos/examples.html'

# fetch the page
res = urllib2.urlopen(base_url)

# parse the response into an xml tree
tree = lxml.html.fromstring(res.read())

# construct a namespace dictionary to pass to the xpath() call
# this lets us use regular expressions in the xpath
ns = {'re': 'http://exslt.org/regular-expressions'}

# iterate over all <a> tags whose href ends in ".pdf" (case-insensitive)
for node in tree.xpath('//a[re:test(@href, "\.pdf$", "i")]', namespaces=ns):

    # print the href, joining it to the base_url
    print urlparse.urljoin(base_url, node.attrib['href'])

Result:

http://www.renderx.com/files/demos/examples/Fund.pdf
http://www.renderx.com/files/demos/examples/FundII.pdf
http://www.renderx.com/files/demos/examples/FundIII.pdf
...
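
Note that urllib2 and urlparse are Python 2 module names; in Python 3 the same functions live in urllib.request and urllib.parse. A minimal Python 3 sketch of the same idea follows, assuming you want to stay within the standard library: it swaps lxml's XPath query for html.parser, and the names PdfLinkParser, extract_pdf_links, and fetch_pdf_links are hypothetical helpers made up for this illustration.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

# matches hrefs ending in ".pdf", case-insensitive, like the XPath test above
PDF_RE = re.compile(r"\.pdf$", re.IGNORECASE)


class PdfLinkParser(HTMLParser):
    """Collects the href of every <a> tag that points at a PDF."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href and PDF_RE.search(href):
            self.hrefs.append(href)


def extract_pdf_links(html, base_url):
    """Return absolute URLs for all PDF links found in an HTML string."""
    parser = PdfLinkParser()
    parser.feed(html)
    # join each href to the base URL so relative links become absolute
    return [urljoin(base_url, href) for href in parser.hrefs]


def fetch_pdf_links(base_url):
    """Download base_url and return the PDF links on that page."""
    with urlopen(base_url) as res:
        # assumption: the page is UTF-8; undecodable bytes are replaced
        html = res.read().decode("utf-8", errors="replace")
    return extract_pdf_links(html, base_url)
```

Calling fetch_pdf_links(base_url) with the URL from the answer should give the same kind of list as above, provided the page is still online.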

Answer 2

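If there are many pages of links, you can try the excellent framework Scrapy (http://scrapy.org/). It is easy to learn how to use it, and it can download the PDF files you need.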
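
For a small site you may not need a framework at all. As a rough sketch of what a crawler automates for the paginated case in the question, the following standard-library code visits a start page, collects PDF links, and follows other links on the same host up to a page limit. Everything here is an assumption for illustration: the names LinkParser, fetch_html, crawl_pdf_links, and max_pages are hypothetical, the fetch parameter is injected so the logic can be tried without a network, and none of this is Scrapy's API.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen


class LinkParser(HTMLParser):
    """Collects every href found on <a> tags."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def fetch_html(url):
    """Default fetcher: download a page and decode it (UTF-8 assumed)."""
    with urlopen(url) as res:
        return res.read().decode("utf-8", errors="replace")


def crawl_pdf_links(start_url, fetch=fetch_html, max_pages=20):
    """Breadth-first crawl of same-host pages; returns PDF URLs in the order found."""
    host = urlparse(start_url).netloc
    queue = deque([start_url])
    seen = {start_url}
    pdfs = []
    pages_visited = 0
    while queue and pages_visited < max_pages:
        page_url = queue.popleft()
        pages_visited += 1
        parser = LinkParser()
        parser.feed(fetch(page_url))
        for href in parser.hrefs:
            # resolve relative links and drop any #fragment
            link = urljoin(page_url, href).split("#", 1)[0]
            if link in seen:
                continue
            seen.add(link)
            if urlparse(link).path.lower().endswith(".pdf"):
                pdfs.append(link)
            elif urlparse(link).netloc == host:
                queue.append(link)  # e.g. a "page 2" or "next" link
    return pdfs
```

The max_pages limit is there so a site with many internal links does not turn a quick script into a full crawl; for anything larger, a framework such as Scrapy handles retries, throttling, and deduplication for you.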

Answer 3

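Written on my phone, so it may not be very readable.

If you are going to grab all the static pages or other content from a website, you can easily get the HTML with requests: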

import requests
# url is the address of the page you want to fetch
page_content = requests.get(url).text  # the HTML as a string

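But if you scrape something like a communications website, there will be some anti-scraping measures. (How to get around these annoying things then becomes the problem.)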

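First way: make your requests look more like a browser (a human). Add headers (you can copy them from the developer tools in Chrome or from Fiddler). Build the correct POST form; it should replicate the way the form is posted through the browser. Get the cookies and add them to your requests.

Second way: use selenium with a browser driver. Selenium drives a real browser (I use chromedriver). Remember to add chromedriver to your PATH, or load the driver executable in code (I'm not sure this is the exact setup code):

from selenium import webdriver
driver = webdriver.Chrome(path)
driver.get(url)

Because a real browser loads the page, the content is easier to scrape. Get the page HTML with:

page = driver.page_source

Some sites redirect through several pages, which can cause errors. Make your script wait for a certain element to appear:

from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

# wait up to 10 seconds (an arbitrary limit) for the element to be present
certain_element = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.ID, "youKnowThereIsAElement'sID")))

Or use an implicit wait and wait as long as you like:

driver.implicitly_wait(5)  # seconds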

You can control the website through the WebDriver. I won't describe that here; you can look up the module's documentation.
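
The first way described above (browser-like headers plus cookies) could be sketched with requests roughly as follows. The header values and the helper names make_browser_session and fetch_html are placeholders invented for this example; copy the real header values from your own browser's developer tools. A requests.Session keeps cookies between calls, so cookies set by one response are sent along with later requests.

```python
import requests

# placeholder values; replace with headers copied from your browser
BROWSER_HEADERS = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/120.0 Safari/537.36",
    "Accept-Language": "en-US,en;q=0.9",
}


def make_browser_session(extra_headers=None):
    """Return a Session that sends browser-like headers on every request."""
    session = requests.Session()
    session.headers.update(BROWSER_HEADERS)
    if extra_headers:
        session.headers.update(extra_headers)
    return session


def fetch_html(session, url, form_data=None):
    """GET the page, or POST form_data if given, and return the HTML text."""
    if form_data is None:
        response = session.get(url, timeout=10)
    else:
        response = session.post(url, data=form_data, timeout=10)
    response.raise_for_status()  # fail loudly on 4xx/5xx responses
    return response.text
```

Reusing one session for the login form and the pages behind it is usually enough for sites that only check headers and cookies; sites that build their pages with JavaScript are where the selenium approach becomes necessary.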
