Python web crawler and "getting" the HTML source code

Question

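My brother wants me to write a web crawler in Python (I'm self-taught). I know C++, Java, and a little HTML. I'm using version 2.7 and reading the Python library documentation, but I have a few questions. The concepts of httplib.HTTPConnection and request are new to me, and I don't understand whether they download an HTML script, something like a cookie, or an instance. If you do both, do you get the source of a website page? And what terms would I need to know in order to modify the page and return the modified page?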

Just for background: I need to download a page and replace any image on it with images I have.

It would also be great if you could tell me your opinion of 2.7 versus 3.1.

Answer 1

~~Use Python 2.7; it currently has more third-party libraries.~~ (Edit: see below.)

I recommend the stdlib module urllib2, which lets you fetch web resources easily. Example:

import urllib2

response = urllib2.urlopen("http://google.de")
page_source = response.read()

To parse the code, have a look at BeautifulSoup.
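Since the stated goal is to swap the images on a downloaded page, here is a rough sketch of that step. It deliberately uses the standard library's html.parser instead of BeautifulSoup so that it runs without extra installs (and it targets Python 3). The names ImageSwapper and replace_images are made up for this illustration, and the approach assumes reasonably well-formed HTML. An <img> tag that has no src attribute is left without one.

```python
from html import escape
from html.parser import HTMLParser


class ImageSwapper(HTMLParser):
    """Re-emit HTML as parsed, replacing the src of every <img> tag.

    Hypothetical helper for illustration; not part of any library.
    """

    def __init__(self, new_src):
        # convert_charrefs=False keeps entities such as &amp; as written
        super().__init__(convert_charrefs=False)
        self.new_src = new_src
        self.out = []

    def _emit_tag(self, tag, attrs, closing=">"):
        if tag != "img":
            # Every other tag is copied verbatim from the input
            self.out.append(self.get_starttag_text())
            return
        parts = [tag]
        for name, value in attrs:
            if name == "src":
                value = self.new_src
            if value is None:
                parts.append(name)  # bare attribute without a value
            else:
                parts.append('{}="{}"'.format(name, escape(value, quote=True)))
        self.out.append("<" + " ".join(parts) + closing)

    def handle_starttag(self, tag, attrs):
        self._emit_tag(tag, attrs)

    def handle_startendtag(self, tag, attrs):
        self._emit_tag(tag, attrs, closing=" />")

    def handle_endtag(self, tag):
        self.out.append("</{}>".format(tag))

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append("&{};".format(name))

    def handle_charref(self, name):
        self.out.append("&#{};".format(name))

    def handle_comment(self, data):
        self.out.append("<!--{}-->".format(data))

    def handle_decl(self, decl):
        self.out.append("<!{}>".format(decl))

    def handle_pi(self, data):
        self.out.append("<?{}>".format(data))


def replace_images(html, new_src):
    """Return html with every <img> src pointing at new_src."""
    parser = ImageSwapper(new_src)
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


if __name__ == "__main__":
    page = '<p>Hi &amp; bye</p><img src="a.png" alt="x">'
    print(replace_images(page, "mine.png"))
```

For messy real-world markup a dedicated parser such as BeautifulSoup is likely the more forgiving choice; the sketch above only shows the shape of the transformation.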

By the way, what exactly are you trying to do here?

"Just for background, I need to download a page and replace any image with ones I have"

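Edit: It's 2014 now, most of the important libraries have been ported, and you should definitely use Python 3 if you can. python-requests is a very nice high-level library that is easier to use than urllib2.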

Answer 2

An example using python3 and the requests library, as mentioned by @leoluk:

pip install requests

The script req.py:

import requests

url='http://localhost'

# in case you need a session
cd = { 'sessionid': '123..'}

r = requests.get(url, cookies=cd)
# or without a session: r = requests.get(url)
# r.content holds the raw bytes; r.text is the decoded string
print(r.text)

Now run it and you will get the HTML source of localhost:

python3 req.py

Answer 3

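If you are using Python 3.x, you don't need to install any library; this is built into the Python standard library. The old urllib2 module was split across urllib.request and urllib.error: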

from urllib import request

response = request.urlopen("https://www.google.com")
# set the correct charset below
page_source = response.read().decode('utf-8')
print(page_source)
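The hard-coded 'utf-8' above can often be replaced with whatever charset the server declares in its Content-Type header. The following is a small sketch of that idea, not a definitive implementation. The function name fetch_html and the fallback_charset parameter are invented for this example, and falling back to UTF-8 when nothing is declared is an assumption.

```python
from urllib import request


def fetch_html(url, fallback_charset="utf-8"):
    """Fetch url and decode the body using the declared charset.

    fetch_html is a hypothetical helper; the UTF-8 fallback is an
    assumption for pages that do not declare an encoding.
    """
    with request.urlopen(url) as response:
        # headers is an email.message.Message; get_content_charset()
        # returns None when Content-Type carries no charset parameter
        charset = response.headers.get_content_charset() or fallback_charset
        return response.read().decode(charset, errors="replace")


if __name__ == "__main__":
    print(fetch_html("https://www.google.com")[:200])
```

Note that a page can also declare its encoding in a meta tag inside the HTML itself, which this sketch does not inspect.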

Answer 4

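The first thing you need to do is read the HTTP spec, which explains what you can expect to receive over the wire. The data returned in the response body is the "rendered" web page, not the source. The source could be a JSP, a servlet, a CGI script, in short almost anything, and you have no access to it. You only get the HTML that the server sends you. In the case of a static HTML page, then yes, you will be seeing the "source". For anything else, you see the generated HTML, not the source code.

What do you mean when you say "modify the page and return the modified page"?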

Answer 5

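The approaches above can fail for HTTPS sites behind Cloudflare, which may reject requests that do not carry a browser-like User-Agent. You can try this to fetch the HTML over either HTTP or HTTPS: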

import requests
url = 'https://your.link.here'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36'}
response = requests.get(url, headers=headers)

if response.status_code == 200:
    print(response.text)
else:
    print(f'Request failed with status code: {response.status_code}')
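For anyone who prefers not to install requests, the same User-Agent idea can be sketched with only the standard library. The helper names build_request and fetch_bytes, the BROWSER_UA constant, and the 10-second timeout are assumptions made for this illustration. Whether a given site accepts the request still depends on that site's own checks.

```python
from urllib import request

# Same browser-like User-Agent string as in the requests example
BROWSER_UA = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36"
)


def build_request(url, user_agent=BROWSER_UA):
    """Return a urllib Request that carries a custom User-Agent header."""
    return request.Request(url, headers={"User-Agent": user_agent})


def fetch_bytes(url):
    """Fetch url and return the raw response body.

    urlopen raises urllib.error.HTTPError for 4xx and 5xx responses,
    so there is no status code to check on the success path.
    """
    with request.urlopen(build_request(url), timeout=10) as response:
        return response.read()


if __name__ == "__main__":
    body = fetch_bytes("https://your.link.here")
    print(len(body), "bytes received")
```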
