I'm trying to write a program that can scrape a given website. So far I have this:
from lxml import html
import requests
page = requests.get('https://www.cruiseplum.com/search#{"numPax":2,"geo":"US","portsMatchAll":true,"numOptionsShown":20,"ppdIncludesTaxTips":true,"uiVersion":"split","sortTableByField":"dd","sortTableOrderDesc":false,"filter":null}')
tree = html.fromstring(page.content)
date = tree.xpath('//*[@id="listingsTableSplit"]/tr[2]/td[1]/text()')
ship = tree.xpath('//*[@id="listingsTableSplit"]/tr[2]/td[2]/text()')
length = tree.xpath('//*[@id="listingsTableSplit"]/tr[2]/td[4]/text()')
meta = tree.xpath('//*[@id="listingsTableSplit"]/tr[2]/td[6]/text()')
price = tree.xpath('//*[@id="listingsTableSplit"]/tr[2]/td[7]/text()')
print('Date: ', date)
print('Ship: ', ship)
print('Length: ', length)
print('Meta: ', meta)
print('Price: ', price)
When I run this, the lists come back empty.
I'm new to Python and to coding in general, and I'd appreciate any help you can offer!
Thanks
I don't see an easy way around this. Clicking "Yes" triggers a JavaScript action rather than an actual redirect to a URL with different parameters.
I'd suggest using something like Selenium to do this.
First, use the percent-encoded form of the URL:
https://www.cruiseplum.com/search#{%22numPax%22:2,%22geo%22:%22US%22,%22portsMatchAll%22:true,%22numOptionsShown%22:20,%22ppdIncludesTaxTips%22:true,%22uiVersion%22:%22split%22,%22sortTableByField%22:%22dd%22,%22sortTableOrderDesc%22:false,%22filter%22:null}
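As an aside, everything after the # in that URL is a fragment, which browsers never send to the server; that is part of why requests alone can't reproduce the search. A minimal sketch using only the standard library makes this easy to see:

```python
import json
from urllib.parse import unquote, urldefrag

u = ('https://www.cruiseplum.com/search#{%22numPax%22:2,%22geo%22:%22US%22,'
     '%22portsMatchAll%22:true,%22numOptionsShown%22:20,%22ppdIncludesTaxTips%22:true,'
     '%22uiVersion%22:%22split%22,%22sortTableByField%22:%22dd%22,'
     '%22sortTableOrderDesc%22:false,%22filter%22:null}')

# Split off the fragment: only `base` would ever reach the server.
base, frag = urldefrag(u)
print(base)            # https://www.cruiseplum.com/search

# The fragment is pure client-side state; decoding it reveals the search spec.
spec = json.loads(unquote(frag))
print(spec['numPax'])  # 2
```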
Second, when you fetch the response object with requests, the data inside the table is not returned; that content stays hidden:
from lxml import html
import requests

u = 'https://www.cruiseplum.com/search#{%22numPax%22:2,%22geo%22:%22US%22,%22portsMatchAll%22:true,%22numOptionsShown%22:20,%22ppdIncludesTaxTips%22:true,%22uiVersion%22:%22split%22,%22sortTableByField%22:%22dd%22,%22sortTableOrderDesc%22:false,%22filter%22:null}'
r = requests.get(u)
t = html.fromstring(r.content)

for i in t.xpath('//tr//text()'):
    print(i)
这将返回:
Recent update: new computer-optimized interface and new filters Want to track your favorite cruises? Login or sign up to get started. Login / Sign Up Loading... Email status Unverified My favorites & alerts Log out Want to track your favorite cruises? Login or sign up to get started. Login / Sign Up Loading... Email status Unverified My favorites & alerts Log out Date Colors: (vs. selected) Lowest Price Lower Price Same Price Higher Price
Even with requests_html, the content is still hidden:
from requests_html import HTMLSession

session = HTMLSession()
r = session.get(u)
You'll need to use Selenium to access the hidden HTML content:
from lxml import html
from selenium import webdriver
from selenium.webdriver.common.by import By
import time

u = 'https://www.cruiseplum.com/search#{%22numPax%22:2,%22geo%22:%22US%22,%22portsMatchAll%22:true,%22numOptionsShown%22:20,%22ppdIncludesTaxTips%22:true,%22uiVersion%22:%22split%22,%22sortTableByField%22:%22dd%22,%22sortTableOrderDesc%22:false,%22filter%22:null}'

driver = webdriver.Chrome()  # Selenium 4+ locates the driver binary automatically
driver.get(u)
time.sleep(2)
driver.find_element(By.ID, 'restoreSettingsYesEncl').click()
time.sleep(10)  # wait until the website downloads its data; without this we can't move on

elem = driver.find_element(By.XPATH, '//*')
source_code = elem.get_attribute('innerHTML')

t = html.fromstring(source_code)
for i in t.xpath('//td[@class="dc-table-column _1"]/text()'):
    print(i.strip())

driver.quit()
This returns the first column (ship names):
Costa Luminosa Navigator Of The Seas Navigator Of The Seas Carnival Ecstasy Carnival Ecstasy Carnival Ecstasy Carnival Victory Carnival Victory Carnival Victory Costa Favolosa Costa Favolosa Costa Favolosa Costa Smeralda Carnival Inspiration Carnival Inspiration Carnival Inspiration Costa Smeralda Costa Smeralda Disney Dream Disney Dream
As you can see, the content of the table can now be accessed using Selenium's get_attribute('innerHTML'). The next step is to scrape each row's fields (ship, itinerary, days, region, ...), store them in a CSV file (or any other format), and then repeat this for all 4,051 pages.
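That row-to-CSV step might look like the sketch below. The sample HTML, the column classes (`_0`, `_1`, `_2`), and the header names are assumptions modeled on the table above; in practice `source_code` would come from Selenium as shown earlier, and the real markup may differ:

```python
import csv
from lxml import html

# Hypothetical stand-in for the `source_code` string obtained via Selenium;
# the real page's markup and column classes may differ.
source_code = """
<table id="listingsTableSplit">
  <tr><td class="dc-table-column _0">2020-01-11</td>
      <td class="dc-table-column _1">Costa Luminosa</td>
      <td class="dc-table-column _2">21</td></tr>
  <tr><td class="dc-table-column _0">2020-02-01</td>
      <td class="dc-table-column _1">Disney Dream</td>
      <td class="dc-table-column _2">3</td></tr>
</table>
"""

t = html.fromstring(source_code)

# Collect every cell's text in document order, one list per table row.
rows = [[td.text_content().strip() for td in tr.xpath('./td')]
        for tr in t.xpath('//tr')]

with open('cruises.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['date', 'ship', 'days'])  # assumed column names
    writer.writerows(rows)
```

Looping over the 4,051 pages would then wrap the Selenium fetch and this extraction in a single loop that drives the site's pagination control.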