tree.xpath returns an empty list

Question · Votes: 0 · Answers: 2

I'm trying to write a program that can scrape a given website. So far I have this:

from lxml import html
import requests

page = requests.get('https://www.cruiseplum.com/search#{"numPax":2,"geo":"US","portsMatchAll":true,"numOptionsShown":20,"ppdIncludesTaxTips":true,"uiVersion":"split","sortTableByField":"dd","sortTableOrderDesc":false,"filter":null}')

tree = html.fromstring(page.content)

date = tree.xpath('//*[@id="listingsTableSplit"]/tr[2]/td[1]/text()')

ship = tree.xpath('//*[@id="listingsTableSplit"]/tr[2]/td[2]/text()')

length = tree.xpath('//*[@id="listingsTableSplit"]/tr[2]/td[4]/text()')

meta = tree.xpath('//*[@id="listingsTableSplit"]/tr[2]/td[6]/text()')

price = tree.xpath('//*[@id="listingsTableSplit"]/tr[2]/td[7]/text()')

print('Date: ', date)
print('Ship: ', ship)
print('Length: ', length)
print('Meta: ', meta)
print('Price: ', price)

When I run this, the lists come back empty.

I'm new to Python and to coding in general, so any help you can offer is much appreciated!

Thanks

python xpath lxml
2 Answers
0 votes
The problem seems to be the URL you're navigating to. Opening that URL in a browser brings up a prompt asking whether you want to restore the bookmarked search.

I don't see a simple way around this. Clicking "Yes" triggers a JavaScript action rather than an actual redirect to a URL with different parameters.

I'd suggest using something like Selenium to do this.
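
To illustrate why the original XPath queries come back empty: the HTML that requests receives contains only the page shell, and the table rows are filled in later by JavaScript. A minimal, self-contained sketch (the markup here is invented for illustration, not the real page source):

```python
from lxml import html

# Invented stand-in for the server-sent HTML: the table container exists,
# but its rows have not been rendered yet (they are added by JavaScript).
served = '<html><body><table id="listingsTableSplit"></table></body></html>'
tree = html.fromstring(served)

# The same kind of XPath as in the question matches nothing, so it
# returns an empty list rather than raising an error.
rows = tree.xpath('//*[@id="listingsTableSplit"]/tr[2]/td[1]/text()')
print(rows)  # -> []
```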


0 votes
First, the link you're using isn't correct; this is the right one (it's what the site redirects to after you click the "Yes" button and it downloads the data):

https://www.cruiseplum.com/search#{%22numPax%22:2,%22geo%22:%22US%22,%22portsMatchAll%22:true,%22numOptionsShown%22:20,%22ppdIncludesTaxTips%22:true,%22uiVersion%22:%22split%22,%22sortTableByField%22:%22dd%22,%22sortTableOrderDesc%22:false,%22filter%22:null}
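
For reference, the %22 sequences are just percent-encoded double quotes. If you'd rather build this URL than paste it, the standard urllib.parse.quote can produce the encoded fragment (a sketch; the safe-character set is chosen to match the link above):

```python
from urllib.parse import quote

# The JSON fragment from the original (unencoded) link
fragment = ('{"numPax":2,"geo":"US","portsMatchAll":true,'
            '"numOptionsShown":20,"ppdIncludesTaxTips":true,'
            '"uiVersion":"split","sortTableByField":"dd",'
            '"sortTableOrderDesc":false,"filter":null}')

# Percent-encode the quotes but keep the JSON punctuation readable,
# which reproduces the corrected link above
encoded = quote(fragment, safe='{}:,')
url = 'https://www.cruiseplum.com/search#' + encoded
print(url)
```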

Second, when you fetch the response object with requests, the content inside the table isn't returned; the data is hidden:

from lxml import html
import requests

u = 'https://www.cruiseplum.com/search#{%22numPax%22:2,%22geo%22:%22US%22,%22portsMatchAll%22:true,%22numOptionsShown%22:20,%22ppdIncludesTaxTips%22:true,%22uiVersion%22:%22split%22,%22sortTableByField%22:%22dd%22,%22sortTableOrderDesc%22:false,%22filter%22:null}'
r = requests.get(u)
t = html.fromstring(r.content)

for i in t.xpath('//tr//text()'):
    print(i)
This returns:

Recent update: new computer-optimized interface and new filters Want to track your favorite cruises? Login or sign up to get started. Login / Sign Up Loading... Email status Unverified My favorites & alerts Log out Want to track your favorite cruises? Login or sign up to get started. Login / Sign Up Loading... Email status Unverified My favorites & alerts Log out Date Colors: (vs. selected) Lowest Price Lower Price Same Price Higher Price

Even with requests_html, the content is still hidden:

from requests_html import HTMLSession

session = HTMLSession()
r = session.get(u)

You will need to use selenium to access the hidden HTML content:

from lxml import html
from selenium import webdriver
import time

u = 'https://www.cruiseplum.com/search#{%22numPax%22:2,%22geo%22:%22US%22,%22portsMatchAll%22:true,%22numOptionsShown%22:20,%22ppdIncludesTaxTips%22:true,%22uiVersion%22:%22split%22,%22sortTableByField%22:%22dd%22,%22sortTableOrderDesc%22:false,%22filter%22:null}'

driver = webdriver.Chrome(executable_path=r"C:\chromedriver.exe")
driver.get(u)
time.sleep(2)

driver.find_element_by_id('restoreSettingsYesEncl').click()
time.sleep(10)  # wait until the website downloads the data; without this we can't move on

elem = driver.find_element_by_xpath("//*")
source_code = elem.get_attribute("innerHTML")
t = html.fromstring(source_code)

for i in t.xpath('//td[@class="dc-table-column _1"]/text()'):
    print(i.strip())

driver.quit()
This returns the first column (the ship names):

Costa Luminosa Navigator Of The Seas Navigator Of The Seas Carnival Ecstasy Carnival Ecstasy Carnival Ecstasy Carnival Victory Carnival Victory Carnival Victory Costa Favolosa Costa Favolosa Costa Favolosa Costa Smeralda Carnival Inspiration Carnival Inspiration Carnival Inspiration Costa Smeralda Costa Smeralda Disney Dream Disney Dream

As you can see, the content inside the table can now be accessed using Selenium's get_attribute("innerHTML").

The next step would be to scrape the rows (ship, itinerary, days, region, ...) and store them in a CSV file (or any other format), then do this for all 4051 pages.
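
The storage step mentioned above can be sketched with the standard csv module. The rows below are hypothetical placeholders; in the real script they would be extracted from the td elements of each table row:

```python
import csv

# Hypothetical sample rows standing in for values scraped from the table
rows = [
    ("2020-03-07", "Costa Luminosa", "7", "$499"),
    ("2020-03-08", "Navigator Of The Seas", "5", "$389"),
]

# Write a header plus one line per cruise; newline="" avoids blank
# lines on Windows when using the csv module
with open("cruises.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["date", "ship", "length", "price"])
    writer.writerows(rows)
```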

© www.soinside.com 2019 - 2024. All rights reserved.