How to scrape Amazon using Python 3


I'm trying to read all the reviews for a given product, partly to learn Python and partly for a project. To simplify the task, I picked a product at random to code against.

The link I want to read is an Amazon page, and I use urllib to open it:

import urllib.request

amazon = urllib.request.urlopen('http://www.amazon.in/United-Colors-Benetton-Flip-Flops-Slippers/dp/B014CZA8P0/ref=pd_rhf_se_s_qp_1?_encoding=UTF8&pd_rd_i=B014CZA8P0&pd_rd_r=04RP223D4SF9BW7S2NP1&pd_rd_w=ZgGL6&pd_rd_wg=0PSZe&refRID=04RP223D4SF9BW7S2NP1')

After reading the link into the amazon variable, when I display amazon I get the following message:

print(amazon)
<http.client.HTTPResponse object at 0x000000DDB3796A20>

So I read around online and found that I need to use the read command to get the page source, but sometimes it gives me a web-page-like result and sometimes it doesn't:

print(amazon.read())
b''

How do I read this page and pass it to BeautifulSoup?

Edit 1

I did use requests.get, and when I checked what was in the text of the retrieved page, I found the following, which doesn't match the site's content:

print(a2)
<html>
<head>
<title>503 - Service Unavailable Error</title>
</head>
<body bgcolor="#FFFFFF" text="#000000">

<!--
        To discuss automated access to Amazon data please contact [email protected].
        For information about migrating to our APIs refer to our Marketplace APIs at https://developer.amazonservices.in/ref=rm_5_sv, or our Product Advertising API at https://affiliate-program.amazon.in/gp/advertising/api/detail/main.html/ref=rm_5_ac for advertising use cases.
-->

<center>
<a href="http://www.amazon.in/ref=cs_503_logo/">
<img src="https://images-eu.ssl-images-amazon.com/images/G/31/x-locale/communities/people/logo.gif" width=200 height=45 alt="Amazon.in" border=0></a>
<p align=center>
<font face="Verdana,Arial,Helvetica">
<font size="+2" color="#CC6600"><b>Oops!</b></font><br>
<b>It's rush hour and traffic is piling up on that page. Please try again in a short while.<br>If you were trying to place an order, it will not have been processed at this time.</b><p>

<img src="https://images-eu.ssl-images-amazon.com/images/G/02/x-locale/common/orange-arrow.gif" width=10 height=9 border=0 alt="*">
<b><a href="http://www.amazon.in/ref=cs_503_link/">Go to the Amazon.in home page to continue shopping</a></b>
</font>

</center>
</body>
</html>
Tags: python, web-scraping, urllib
4 Answers

Answer (3 votes):

With your current library, urllib, here is what you can do: use .read() to get the HTML, then pass it into BeautifulSoup as shown below. Keep in mind that Amazon is a strictly anti-scraping site. The reason you get different results is most likely that the HTML is rendered by JavaScript; for that you may have to use Selenium or Dryscrape. You may also need to pass headers/cookies and extra attributes with your request.

import urllib.request
from bs4 import BeautifulSoup

amazon = urllib.request.urlopen('http://www.amazon.in/United-Colors-Benetton-Flip-Flops-Slippers/dp/B014CZA8P0/ref=pd_rhf_se_s_qp_1?_encoding=UTF8&pd_rd_i=B014CZA8P0&pd_rd_r=04RP223D4SF9BW7S2NP1&pd_rd_w=ZgGL6&pd_rd_wg=0PSZe&refRID=04RP223D4SF9BW7S2NP1')
html = amazon.read()
soup = BeautifulSoup(html, 'html.parser')
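
If urllib keeps returning an empty body, one thing worth trying is wrapping the URL in a urllib.request.Request with a browser-like User-Agent header. A minimal sketch, not guaranteed to get past Amazon's bot detection; the short /dp/ URL here is a hypothetical stand-in for the full product link above:

import urllib.request
from bs4 import BeautifulSoup

url = 'http://www.amazon.in/dp/B014CZA8P0'  # hypothetical short form of the product URL above
req = urllib.request.Request(url, headers={
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.71 Safari/537.36',
})
with urllib.request.urlopen(req) as resp:
    html = resp.read()  # bytes; note that read() only returns the body once per response
soup = BeautifulSoup(html, 'html.parser')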

Edit: it turns out you are now using requests. Using requests and passing in my headers like this, I can get a 200 response.

import requests
from bs4 import BeautifulSoup

headers = {
    'User-Agent':'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.71 Safari/537.36'
}
response = requests.get('http://www.amazon.in/United-Colors-Benetton-Flip-Flops-Slippers/dp/B014CZA8P0/ref=pd_rhf_se_s_qp_1?_encoding=UTF8&pd_rd_i=B014CZA8P0&pd_rd_r=04RP223D4SF9BW7S2NP1&pd_rd_w=ZgGL6&pd_rd_wg=0PSZe&refRID=04RP223D4SF9BW7S2NP1',headers=headers)
soup = BeautifulSoup(response.text, 'html.parser')
print(response.status_code)  # 200
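
Since headers and cookies were both mentioned, here is a minimal sketch of using a requests.Session, which stores cookies set by earlier responses and replays them on later requests automatically (the URL is a hypothetical short form of the product link):

import requests

session = requests.Session()
session.headers.update({
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.71 Safari/537.36',
})
# Cookies from this response are kept on the session and sent with any follow-up requests.
response = session.get('http://www.amazon.in/dp/B014CZA8P0')  # hypothetical short product URL
print(response.status_code)  # expect 200 if the request was not blocked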

--- Using Dryscrape

import dryscrape
from bs4 import BeautifulSoup

sess = dryscrape.Session(base_url='http://www.amazon.in')
sess.set_header('user-agent', 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.71 Safari/537.36')  # set the header before visiting so it applies to the request
sess.visit('http://www.amazon.in/United-Colors-Benetton-Flip-Flops-Slippers/dp/B014CZA8P0/ref=pd_rhf_se_s_qp_1?_encoding=UTF8&pd_rd_i=B014CZA8P0&pd_rd_r=04RP223D4SF9BW7S2NP1&pd_rd_w=ZgGL6&pd_rd_wg=0PSZe&refRID=04RP223D4SF9BW7S2NP1')
html = sess.body()
soup = BeautifulSoup(html, 'html.parser')
print(soup)

This should give you all of the Amazon HTML now. Keep in mind that I haven't tested this code. Please refer to the dryscrape documentation for installation: https://dryscrape.readthedocs.io/en/latest/apidoc.html
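
Selenium is the other JavaScript-capable option mentioned above. A minimal sketch, assuming the selenium package and a matching Chrome driver are installed; the URL is again a hypothetical short product link:

from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.Chrome()  # assumes Chrome plus a compatible chromedriver are available
driver.get('http://www.amazon.in/dp/B014CZA8P0')  # hypothetical short product URL
html = driver.page_source  # the HTML after JavaScript has run
driver.quit()
soup = BeautifulSoup(html, 'html.parser')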

Answer (1 vote):

Personally, I would use the requests library rather than urllib; requests has more functionality.

import requests

From there, it's something like:

resp = requests.get(url)  # you can also break up your parameters and pass base_url & params here if you have multiple products to deal with
soup = BeautifulSoup(resp.text, 'html.parser')

That should respond fine, since it's a fairly simple HTTP request.

Edit: based on your error, you will have to research which parameters to pass to make your request look right. In general with requests it looks like this (with values you find yourself: check your browser's debug/developer tools to inspect your network traffic and see what is sent to Amazon when you use the browser):

url = "https://www.base.url.here"
params = {
    'param1': 'value1'
     .....
}
resp = requests.get(url,params)
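
requests URL-encodes the params dict for you, and resp.url shows exactly what was sent, which helps when reverse-engineering the query string. A small sketch; the parameter name here is a hypothetical example, not one taken from Amazon:

import requests

resp = requests.get('https://www.amazon.in/s', params={'k': 'flip flops'})  # hypothetical search parameter
print(resp.url)  # the fully encoded URL that requests actually sent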

Answer (0 votes):

For web scraping in Python 3, use the requests and BeautifulSoup modules.

Install BeautifulSoup:

pip install beautifulsoup4

Use appropriate headers when sending the request:

headers = {
    'authority': 'www.amazon.com',
    'pragma': 'no-cache',
    'cache-control': 'no-cache',
    'dnt': '1',
    'upgrade-insecure-requests': '1',
    'user-agent': 'Mozilla/5.0 (X11; CrOS x86_64 8172.45.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.64 Safari/537.36',
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'sec-fetch-site': 'none',
    'sec-fetch-mode': 'navigate',
    'sec-fetch-dest': 'document',
    'accept-language': 'en-GB,en-US;q=0.9,en;q=0.8',
}

scrap.py

from bs4 import BeautifulSoup
import requests

url = "https://www.amazon.in/s/ref=mega_elec_s23_2_3_1_1?rh=i%3Acomputers%2Cn%3A3011505031&ie=UTF8&bbn=976392031"

headers = {
    'authority': 'www.amazon.com',
    'pragma': 'no-cache',
    'cache-control': 'no-cache',
    'dnt': '1',
    'upgrade-insecure-requests': '1',
    'user-agent': 'Mozilla/5.0 (X11; CrOS x86_64 8172.45.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.64 Safari/537.36',
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'sec-fetch-site': 'none',
    'sec-fetch-mode': 'navigate',
    'sec-fetch-dest': 'document',
    'accept-language': 'en-GB,en-US;q=0.9,en;q=0.8',
}

response = requests.get(url, headers=headers)

with open("webpg.html","w", encoding="utf-8") as file: # saving html file to disk
    file.write(response.text)

bs = BeautifulSoup(response.text, "html.parser")
print(bs)  # display the HTML; use bs.prettify() for a more readable document
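
From the soup you can then pull elements out with CSS selectors. A sketch with a hypothetical selector; Amazon's markup changes frequently, so verify the class names in your browser's developer tools first:

# 'span.a-size-medium' is a hypothetical selector for result titles; inspect the live page to confirm it.
for title in bs.select('span.a-size-medium'):
    print(title.get_text(strip=True))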

Answer (0 votes):

You can't scrape Amazon directly anymore; their scraping detection is strong, and they no longer tolerate unauthorized scraping (behind the login). You have to either use a legitimate API, or scrape other sites that provide the same information as Amazon and don't mind you scraping their data directly with bs4 and requests. I've built an application for this in Python and hope to publish it on GitHub as soon as I receive my PhD this month. I've named my app "Amazon Book Scraper", and you'll find it on GitHub in June 2024.
