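I am trying to build a web scraper that finds product reviews on the AliExpress website. However, when I parse the HTML, the parsed code differs from what I see in Chrome's "Inspect" window, and I cannot find the reviews section in it. How can I get exactly the code that the Inspect window shows?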
from bs4 import BeautifulSoup as soup # HTML data structure
from urllib.request import urlopen as uReq # Web client
# URL to scrape from. The literal is split into adjacent strings so that
# it stays a single valid Python string.
page_url = (
    "https://www.aliexpress.com/item/4000042292255.html"
    "?spm=a2g0o.productlist.0.0.4a253632RWxaLa"
    "&algo_pvid=c73bf552-ce47-43f6-9abb-b4a994eeaa01"
    "&algo_expid=c73bf552-ce47-43f6-9abb-b4a994eeaa01-0"
    "&btsid=2c594979-4027-410a-a7a4-7246ce06ade7"
    "&ws_ab_test=searchweb0_0,searchweb201602_7,searchweb201603_53"
)
# opens the connection and downloads html page from url
uClient = uReq(page_url)
# parses html into a soup data structure to traverse html
# as if it were a json data type.
page_soup = soup(uClient.read(), "html.parser")
uClient.close()
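The content is generated dynamically by JavaScript, so the HTML that urlopen downloads does not contain it. You can scrape it by rendering the page first. Here is an example using simplified_scrapy and pyppeteer.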
from simplified_html.request_render import RequestRender
from simplified_scrapy.simplified_doc import SimplifiedDoc

# Point executablePath at the Chrome binary on your own machine.
req = RequestRender({'executablePath': '/Applications/chrome.app/Contents/MacOS/Google Chrome'})

def callback(html, url, data):
    # html is the page source after the browser has rendered it.
    doc = SimplifiedDoc(html)
    print(doc.title)

req.get('https://www.aliexpress.com/item/4000042292255.html?spm=a2g0o.productlist.0.0.4a253632RWxaLa&algo_pvid=c73bf552-ce47-43f6-9abb-b4a994eeaa01&algo_expid=c73bf552-ce47-43f6-9abb-b4a994eeaa01-0&btsid=2c594979-4027-410a-a7a4-7246ce06ade7&ws_ab_test=searchweb0_0,searchweb201602_7,searchweb201603_53', callback)
Result:
{'tag': 'title', 'html': 'Note 7 pro smartphones 4G LTE celulares 4GB RAM 64GB ROM quad core 13MP camera 18:9 IPS Android mobile phones face ID unlocked-in Cellphones from Cellphones & Telecommunications on AliExpress'}
You can find more simplified_scrapy examples here.
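If you would rather call pyppeteer directly, the same render-then-parse idea might look roughly like the sketch below. It is not the simplified_scrapy implementation. It assumes pyppeteer is installed and can launch a Chromium build, and the helper names fetch_rendered_html and extract_title are made up for this illustration. The URL is the product URL from the question with the tracking query string dropped. The reviews may still arrive later via extra requests or a separate frame, so treat the networkidle2 wait as a starting point, not a guarantee.

```python
import asyncio
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text of the first <title> element in a document."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = None

    def handle_starttag(self, tag, attrs):
        # Only the first <title> is recorded; later ones are ignored.
        if tag == "title" and self.title is None:
            self._in_title = True
            self.title = ""

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def extract_title(html):
    """Hypothetical helper: return the page title, or None if there is none."""
    parser = TitleParser()
    parser.feed(html)
    return parser.title.strip() if parser.title is not None else None


async def fetch_rendered_html(url):
    """Hypothetical helper: load url in headless Chromium and return the DOM as HTML."""
    # Imported lazily so extract_title can be used without pyppeteer installed.
    from pyppeteer import launch

    browser = await launch(headless=True)
    try:
        page = await browser.newPage()
        # Wait until network activity settles so script-built markup has a chance to appear.
        await page.goto(url, {"waitUntil": "networkidle2"})
        return await page.content()
    finally:
        await browser.close()


# Example usage (needs network access and a Chromium that pyppeteer can launch):
# html = asyncio.run(fetch_rendered_html("https://www.aliexpress.com/item/4000042292255.html"))
# print(extract_title(html))
```

The string returned by page.content() is the serialized DOM after scripts have run, which is much closer to what the Inspect window shows than the raw response from urlopen. You can pass that string to BeautifulSoup exactly as in the question.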