getpath in lxml etree returns a different absolute XPath than expected


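I am trying to get the absolute XPath of an element, but lxml returns a different path from the one I expect. I want the full XPath of the search button on Google. The code is: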

import time
import random
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from webdriver_manager.chrome import ChromeDriverManager
from lxml import etree

options = webdriver.ChromeOptions()
options.add_argument("start-maximized")
options.add_argument("--log-level=3")
options.add_experimental_option("excludeSwitches", ["enable-automation"])
options.add_experimental_option('useAutomationExtension', False)
options.add_argument('--disable-blink-features=AutomationControlled')
s = Service(ChromeDriverManager().install())
driver = webdriver.Chrome(service=s, options=options)
main_link = r"https://www.google.com"
driver.get(main_link)

time.sleep(5)

with open ("dom.xml","w",encoding="utf-8") as domfile:
    domfile.write(driver.page_source)
tree = etree.parse("dom.xml",parser=etree.XMLParser(recover=True))
print(tree)
element = tree.xpath("(//input[@class='gNO89b'])[2]")
print(element)
#trying to print absolute xpath . . 
print (tree.getpath(element[0]))

The output should be:

/html/body/div[1]/div[3]/form/div[1]/div[1]/div[4]/center/input[1]

But it gives me:

/html/head/meta/meta/meta/link/script[6]/br/body/div/div[2]/div[2]/form/div/div/div/div[2]/div[2]/div[7]/center/input

python selenium-webdriver web-scraping lxml
1 Answer

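This happens because you are parsing HTML output with an XML parser. They are two different formats, so the resulting trees differ. For example, HTML void elements such as meta, link and br have no closing tag, so a recovering XML parser nests the elements that follow inside them, which is why your path contains /head/meta/meta/meta/link/.../br/body. The best way to preserve the HTML structure is to parse the page source string with an HTML parser (lxml.html.fromstring).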

import time
import random
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from webdriver_manager.chrome import ChromeDriverManager
import lxml.html

options = webdriver.ChromeOptions()
options.add_argument("start-maximized")
options.add_argument("--log-level=3")
options.add_experimental_option("excludeSwitches", ["enable-automation"])
options.add_experimental_option('useAutomationExtension', False)
options.add_argument('--disable-blink-features=AutomationControlled')
s = Service(ChromeDriverManager().install())
driver = webdriver.Chrome(service=s, options=options)
main_link = r"https://www.google.com"

driver.get(main_link)
time.sleep(5)

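# Parse the page source with lxml's HTML parser instead of the XML parser
tree = lxml.html.fromstring(driver.page_source)
# getpath() is a method of the ElementTree, so get it from the root element
root = tree.getroottree()
element = tree.xpath("(//input[@class='gNO89b'])[2]")
print(root.getpath(element[0]))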

Output:

/html/body/div[1]/div[3]/form/div[1]/div[1]/div[4]/center/input[1]
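
The difference between the two parsers can be reproduced without a browser. The following is only a minimal sketch: SAMPLE_HTML, path_with_xml_parser and path_with_html_parser are hypothetical names made up for illustration, and the markup is a tiny stand-in for a real page, not Google's actual source.

```python
from lxml import etree
import lxml.html

# Hypothetical sample page: <br> and <input> are void elements,
# so the HTML has no closing tag for either of them.
SAMPLE_HTML = (
    "<html><head><title>t</title></head>"
    "<body><div><br><input class='btn'></div></body></html>"
)


def path_with_xml_parser(markup, xpath):
    """Parse with a recovering XML parser, as in the question."""
    root = etree.fromstring(markup, parser=etree.XMLParser(recover=True))
    element = root.xpath(xpath)[0]
    return root.getroottree().getpath(element)


def path_with_html_parser(markup, xpath):
    """Parse with lxml's HTML parser, as in the answer."""
    root = lxml.html.fromstring(markup)
    element = root.xpath(xpath)[0]
    return root.getroottree().getpath(element)


if __name__ == "__main__":
    query = "//input[@class='btn']"
    # The XML parser does not know <br> is a void element,
    # so <input> ends up nested inside it.
    print(path_with_xml_parser(SAMPLE_HTML, query))
    # The HTML parser closes <br> implicitly.
    print(path_with_html_parser(SAMPLE_HTML, query))  # /html/body/div/input
```

Only the HTML parser yields a path that matches what a browser's DOM would report for this sample.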

If your goal is to serialize the HTML document as an XML document after parsing it, you may have to apply some manual preprocessing first.
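
One possible approach, which is not necessarily what this answer had in mind, is to let the HTML parser build the tree first and then serialize that tree as XML with etree.tostring(..., method="xml"). The sketch below assumes a small hypothetical page (SAMPLE_HTML) and a hypothetical helper name (html_to_xml_text). Real pages may still need extra cleanup, which this does not attempt.

```python
from lxml import etree
import lxml.html

# Hypothetical sample page with unclosed void elements.
SAMPLE_HTML = (
    "<html><head><title>t</title></head>"
    "<body><div><br><input class='btn'></div></body></html>"
)


def html_to_xml_text(markup):
    """Parse as HTML, then serialize the tree as XML text."""
    root = lxml.html.fromstring(markup)
    # method="xml" writes void elements as self-closing tags (<br/>)
    return etree.tostring(root, method="xml", encoding="unicode")


def path_after_roundtrip(markup, xpath):
    """Re-parse the XML text with a strict XML parser and return the path."""
    xml_root = etree.fromstring(html_to_xml_text(markup))
    element = xml_root.xpath(xpath)[0]
    return xml_root.getroottree().getpath(element)


if __name__ == "__main__":
    print(html_to_xml_text(SAMPLE_HTML))
    print(path_after_roundtrip(SAMPLE_HTML, "//input[@class='btn']"))
```

Because the XML text is produced from a tree the HTML parser already built, the strict XML parser sees the same element structure, so getpath gives the same result as it does on the HTML tree for this sample.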
