Good morning. I'm new to Python and I want to retrieve all the images of a product.
Here is my code:
import requests
from bs4 import BeautifulSoup as bs
page = requests.get("https://www.boulanger.com/ref/1176193")
if page.status_code == 200:
    parsedPage = bs(page.content, 'lxml')
    div_product = parsedPage.find('article',{'class':'container-xl'})
    title = div_product.find('h1',{'class':'product-title__main'}).get_text()
    title = '\n'.join([ligne.strip() for ligne in title.split('\n') if ligne.strip() != '']).replace('\n', ' ')
    ref = div_product.find('div',{'class':'product-title__ref'}).get_text()
    img_container = div_product.find_('div',{'class':'thumb'})
    foto = img_container.find('img')
    #data = {
    #    'titre': title,
    #    'ref': ref,
    #    'foto': img_link
    #    }
    print(foto)
I get the following error message:
File "C:\Users\Hervé\Downloads\Scraper\Boulanger.py", line 12, in <module>
img_container = div_product.find_('div',{'class':'thumb'})
TypeError: 'NoneType' object is not callable
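Always start by taking a look at your soup, or open the static page source with Ctrl+U, and check that all the elements you expect are actually in there.
The main point is that the page content is rendered dynamically, which requests on its own does not support. Only a small amount of information is available in the initial response, but it is enough to work with, so there is no need to fall back on alternatives such as selenium that mimic browser behavior.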
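A side note on the traceback itself, as a minimal sketch on made-up markup (the HTML below is hypothetical, not Boulanger's): the TypeError comes from the typo find_ rather than from missing content. BeautifulSoup treats an unknown attribute on a tag as a lookup for a child tag with that name, so div_product.find_ evaluates to None, and calling None fails. Had div_product itself been None, the earlier div_product.find('h1', ...) call would already have raised an AttributeError.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, only used to reproduce the error from the question.
html = '<article class="container-xl"><div class="thumb"><img src="a.jpg"></div></article>'
div_product = BeautifulSoup(html, "html.parser").find("article", {"class": "container-xl"})

# Unknown attributes are treated as child-tag lookups: there is no <find_> tag, so this is None.
typo_lookup = div_product.find_

try:
    div_product.find_("div", {"class": "thumb"})
    error_message = None
except TypeError as exc:
    error_message = str(exc)

# With the method name spelled correctly, the lookup works on this sample markup.
img_container = div_product.find("div", {"class": "thumb"})
foto_src = img_container.find("img")["src"]
print(error_message)
print(foto_src)
```

Fixing the typo removes the TypeError, but on the live page the image elements may still be absent from the static HTML, which is why the steps below take a different route.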
Step#1: Use BeautifulSoup to find the <script> that holds the product information.
data = json.loads(
    BeautifulSoup(
        requests.get(url).content
    ).select_one('script:has(+header)').text
)
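To see what the 'script:has(+header)' selector picks, here is a minimal sketch on a made-up page (the HTML and its values are assumptions for illustration; the real page is much larger). The selector matches a <script> element whose next sibling element is a <header>.

```python
import json
from bs4 import BeautifulSoup

# Hypothetical stand-in for the product page; the values are illustrative only.
sample_html = """
<html><body>
<script>{"name": "Sample product", "gtin13": "0000000000000"}</script>
<header>Site header</header>
<script>console.log("unrelated");</script>
<main>...</main>
</body></html>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# Only the first <script> is directly followed by <header>, so only it matches.
script = soup.select_one("script:has(+header)")
if script is None:
    raise SystemExit("no matching <script> found; inspect the page source")

data = json.loads(script.string)
print(data.get("gtin13"))
```

Checking for None before parsing gives a readable failure if the page layout changes and the selector stops matching.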
Step#2: Extract the value of the gtin13 identifier.
data.get("gtin13")
Step#3: Call the scene7 API with the extracted value and pull the JSON structure out of the response.
media_data = json.loads(requests.get(f'https://boulanger.scene7.com/is/image/Boulanger/{data.get("gtin13")}_mixed?req=set,json,UTF-8').text.split('(')[1][:-5])
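The split('(')[1][:-5] slice relies on the response being JSON wrapped in a callback call with a fixed five-character tail, and on the payload containing no '(' of its own. As a hedged alternative, the sketch below lets json.JSONDecoder.raw_decode find where the JSON value ends. The helper name strip_jsonp and the sample response are hypothetical; the wrapper format is only inferred from the slice above.

```python
import json


def strip_jsonp(text: str) -> dict:
    """Parse the first JSON value found inside a callback(...) wrapper."""
    start = text.index("(") + 1
    # raw_decode does not skip leading whitespace, so strip it first.
    payload = text[start:].lstrip()
    # raw_decode stops at the end of the first JSON value and ignores the tail.
    obj, _end = json.JSONDecoder().raw_decode(payload)
    return obj


# Hypothetical response, shaped like what the [:-5] slice implies.
sample = (
    '/*jsonp*/s7jsonResponse('
    '{"set": {"item": [{"i": {"n": "Boulanger/0000000000000_h_f_l_0"}}]}},"");'
)
media_data = strip_jsonp(sample)
print(media_data["set"]["item"][0]["i"]["n"])
```

With this helper the request line would become media_data = strip_jsonp(requests.get(...).text), with the URL unchanged.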
Step#4: Iterate over the items and extract the paths.
[f'https://boulanger.scene7.com/is/image/{e["i"]["n"]}' for e in media_data['set']['item']]
Last but not least, you can also add the list to data, so that all the relevant information ends up in one structured dictionary.
data['media_list'] = [f'https://boulanger.scene7.com/is/image/{e["i"]["n"]}' for e in media_data['set']['item']]
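The list comprehension above raises a KeyError or TypeError as soon as the structure deviates from what is expected. A slightly more defensive sketch follows; the function name image_urls is hypothetical, and the idea that item may arrive as a single object instead of a list for a one-image set is an unverified assumption, handled only as a precaution.

```python
def image_urls(media_data: dict,
               base: str = "https://boulanger.scene7.com/is/image/") -> list:
    """Build image URLs from a set/item structure like the one in Step#3."""
    items = media_data.get("set", {}).get("item", [])
    # Assumption: a set with one image may yield a single dict instead of a list.
    if isinstance(items, dict):
        items = [items]
    # Skip entries that lack an i.n path instead of raising KeyError.
    return [base + e["i"]["n"] for e in items if e.get("i", {}).get("n")]


# Hypothetical data with the same shape as media_data.
sample = {"set": {"item": [
    {"i": {"n": "Boulanger/0000000000000_h_f_l_0"}},
    {"i": {"n": "Boulanger/0000000000000_v_0"}},
]}}
urls = image_urls(sample)
print(urls)
```

It would be used as data['media_list'] = image_urls(media_data).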
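Complete example:
import json
import requests
from bs4 import BeautifulSoup
url = 'https://www.boulanger.com/ref/1176193'
# Step#1: load the product JSON from the <script> that directly precedes <header>
data = json.loads(
    BeautifulSoup(
        requests.get(url).content
    ).select_one('script:has(+header)').text
)
# Step#2 and Step#3: request the image set for the gtin13 and strip the wrapper around the JSON
media_data = json.loads(requests.get(f'https://boulanger.scene7.com/is/image/Boulanger/{data.get("gtin13")}_mixed?req=set,json,UTF-8').text.split('(')[1][:-5])
# Step#4: in a REPL or notebook, this expression displays the list of image URLs
[f'https://boulanger.scene7.com/is/image/{e["i"]["n"]}' for e in media_data['set']['item']]
Output: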
['https://boulanger.scene7.com/is/image/Boulanger/8806092967915_h_f_l_0',
'https://boulanger.scene7.com/is/image/Boulanger/8806092967915_h_f_l_1',
'https://boulanger.scene7.com/is/image/Boulanger/8806092967915_h_f_l_2',
'https://boulanger.scene7.com/is/image/Boulanger/8806092967915_h_f_l_3',
'https://boulanger.scene7.com/is/image/Boulanger/8806092967915_h_f_l_4',
'https://boulanger.scene7.com/is/image/Boulanger/8806092967915_h_f_l_5',
'https://boulanger.scene7.com/is/image/Boulanger/8806092967915_h_f_l_6',
'https://boulanger.scene7.com/is/image/Boulanger/8806092967915_h_f_l_7',
'https://boulanger.scene7.com/is/image/Boulanger/8806092967915_h_f_l_8',
'https://boulanger.scene7.com/is/image/Boulanger/8806092967915_h_e_l_0',
'https://boulanger.scene7.com/is/image/Boulanger/8806092967915_h_b_l_0',
'https://boulanger.scene7.com/is/image/Boulanger/8806092967915_h_b_l_1',
'https://boulanger.scene7.com/is/image/Boulanger/8806092967915_h_b_l_2',
'https://boulanger.scene7.com/is/image/Boulanger/8806092967915_h_b_l_3',
'https://boulanger.scene7.com/is/image/Boulanger/8806092967915_h_b_l_4',
'https://boulanger.scene7.com/is/image/Boulanger/8806092967915_h_b_l_5',
'https://boulanger.scene7.com/is/image/Boulanger/8806092967915_h_b_l_6',
'https://boulanger.scene7.com/is/image/Boulanger/8806092967915_v_0']