The result of the XPath expression is an [object Attr]; it should be an element

Question (votes: 1, answers: 2)

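I'm new to web scraping and have been trying to scrape Google Images with Python (eventually to build a Discord bot, but that's beside the point). I wrote the following code to store the image src values in a list so that I can pick an index and display an image. I tested the XPath with the XPath Helper Chrome extension, and it returns what I need: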

from selenium import webdriver
from selenium.webdriver.common.by import By
chrome_path =r'C:\Program Files (x86)\Google\Chrome\Application\chrome.exe'
chromedriver_path = r'C:\Users\user\Desktop\chromedriver.exe'
driver = webdriver.Chrome(chromedriver_path)
driver.get("https://www.google.com/search?q=pepega&rlz=1C1GIWA_enGB617GB617&source=lnms&tbm=isch&sa=X&ved=0ahUKEwjH1e6b-YfhAhWRs3EKHeKmAqoQ_AUIDigB&biw=2560&bih=947")

review = driver.find_elements_by_xpath("//div[@jscontroller ='Q7Rsec']/a/img/@src")

print(review)

The error I get is as follows:

Traceback (most recent call last):
  File "C:\Users\user\Desktop\tessst.py", line 8, in <module>
    review = driver.find_elements_by_xpath("//div[@jscontroller ='Q7Rsec']/a/img/@src")
  File "C:\Users\user\AppData\Local\Programs\Python\Python36\lib\site-packages\selenium\webdriver\remote\webdriver.py", line 410, in find_elements_by_xpath
    return self.find_elements(by=By.XPATH, value=xpath)
  File "C:\Users\user\AppData\Local\Programs\Python\Python36\lib\site-packages\selenium\webdriver\remote\webdriver.py", line 1007, in find_elements
    'value': value})['value'] or []
  File "C:\Users\user\AppData\Local\Programs\Python\Python36\lib\site-packages\selenium\webdriver\remote\webdriver.py", line 321, in execute
    self.error_handler.check_response(response)
  File "C:\Users\user\AppData\Local\Programs\Python\Python36\lib\site-packages\selenium\webdriver\remote\errorhandler.py", line 242, in check_response
    raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.InvalidSelectorException: Message: invalid selector: The result of the xpath expression "//div[@jscontroller ='Q7Rsec']/a/img/@src" is: [object Attr]. It should be an element.
  (Session info: chrome=73.0.3683.75)
  (Driver info: chromedriver=73.0.3683.68 (47787ec04b6e38e22703e856e101e840b65afe72),platform=Windows NT 10.0.17134 x86_64)

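I think this is caused by the XPath query, since the message says the result should be an "element", but because I'm new to this I don't know how to get the image src returned as an element. Can someone tell me how to fix my query so that it doesn't raise an error in Python? Thanks.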

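Edit: I think I've got what I wanted, and I'd like to thank everyone for their help. What I did may be primitive by your standards, but it doesn't hurt to share in case it helps someone :)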

Final code:

import nltk
from selenium import webdriver
from selenium.webdriver.common.by import By
chrome_path =r'C:\Program Files (x86)\Google\Chrome\Application\chrome.exe'
chromedriver_path = r'C:\Users\user\Desktop\chromedriver.exe'
driver = webdriver.Chrome(chromedriver_path)
driver.get("https://www.google.com/search?q=pepega&rlz=1C1GIWA_enGB617GB617&source=lnms&tbm=isch&sa=X&ved=0ahUKEwjH1e6b-YfhAhWRs3EKHeKmAqoQ_AUIDigB&biw=2560&bih=947")

review = driver.find_elements_by_xpath("//div[@jscontroller ='Q7Rsec']/a/img")

imglist = []

for x in review:
    if x.get_attribute("src") != "":
        temp = str(x.get_attribute("src"))
        if temp[0:8] == "https://":
            imglist.append(str(x.get_attribute("src")))


print(imglist)

I've just realized that nltk isn't used (I was playing with it at some point and forgot to remove the import).
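The filtering step in the final code can also be written with str.startswith. The following is only a sketch on plain strings, not on live WebElements; the helper name https_only and the sample values are made up for illustration, and the data: entry stands in for an inline image source.

```python
def https_only(sources):
    """Keep only the entries that are non-empty strings starting with https://."""
    # `s and ...` skips None and "" before calling startswith.
    return [s for s in sources if s and s.startswith("https://")]


# Hypothetical sample values, not taken from the real page.
sample = ["https://example.com/a.jpg", "data:image/jpeg;base64,AAAA", "", None]
print(https_only(sample))
```

With Selenium, the input list would be built as [x.get_attribute("src") for x in review], which also avoids calling get_attribute three times per element.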

xml selenium xpath webdriver screen-scraping
2 Answers
0 votes

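You can't put the src attribute step in the XPath, since the locator has to resolve to elements. I also noticed that some images have no src attribute and carry a data-src attribute instead. Here is a solution; hope this helps.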

from selenium import webdriver
from selenium.webdriver.common.by import By
chrome_path =r'C:\Program Files (x86)\Google\Chrome\Application\chrome.exe'
chromedriver_path = r'C:\Users\user\Desktop\chromedriver.exe'
driver = webdriver.Chrome(chromedriver_path)
driver.get("https://www.google.com/search?q=pepega&rlz=1C1GIWA_enGB617GB617&source=lnms&tbm=isch&sa=X&ved=0ahUKEwjH1e6b-YfhAhWRs3EKHeKmAqoQ_AUIDigB&biw=2560&bih=947")


reviews = driver.find_elements_by_xpath("//div[@jscontroller ='Q7Rsec']/a/img")

list_review=[]
for review in reviews:
    if review.get_attribute("src") is not None:
        list_review.append(review.get_attribute("src"))
print(list_review)
print(len(list_review))
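Since some images reportedly carry data-src instead of src, one option is to fall back from one attribute to the other. This is a rough sketch under that assumption; image_url and FakeElement are hypothetical names, and FakeElement only imitates the get_attribute method of a Selenium WebElement so the snippet runs without a browser.

```python
def image_url(element):
    """Return the element's src, falling back to data-src; None if neither is set."""
    # `or` skips both None and the empty string.
    return element.get_attribute("src") or element.get_attribute("data-src")


class FakeElement:
    """Hypothetical stand-in for a Selenium WebElement, for illustration only."""

    def __init__(self, **attrs):
        self._attrs = attrs

    def get_attribute(self, name):
        # Mirrors the documented behaviour of returning None for a missing attribute.
        return self._attrs.get(name)


elements = [
    FakeElement(src="https://example.com/a.jpg"),
    FakeElement(**{"data-src": "https://example.com/b.jpg"}),
    FakeElement(),
]
urls = [url for url in (image_url(e) for e in elements) if url]
print(urls)
```

In the script above, the same helper could be applied to each item of reviews in place of the direct get_attribute("src") call.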

0 votes

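Never mind, I'm being silly and only just noticed that your URL is in the script. I can confirm that your XPath is finding the src attribute of the img elements, which is why Selenium rejects it. The following should find the elements themselves, after which you can read their src attribute.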

review = driver.find_elements_by_xpath("//div[@jscontroller ='Q7Rsec']/a/img")

This returns 100 elements. I'm not sure what you want to do with the sources, but here is some code that simply prints each one:

for x in review:
    src = x.get_attribute("src")
    # Skip elements whose src is missing or empty
    if src:
        print(src)

This should print the src attribute of the 55 elements that have a src specified.
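The fix amounts to dropping the trailing /@src step from the locator and reading the attribute afterwards. As an illustration only, a small hypothetical helper (split_attribute_xpath is not part of Selenium) can do that split for simple expressions ending in /@name; it is a sketch and does not handle forms such as //@href.

```python
import re


def split_attribute_xpath(xpath):
    """Split 'path/@attr' into ('path', 'attr'); return (xpath, None) otherwise."""
    # Only a single trailing '/@name' step is recognised.
    match = re.fullmatch(r"(.+)/@([\w:.-]+)", xpath)
    if match:
        return match.group(1), match.group(2)
    return xpath, None


element_xpath, attr = split_attribute_xpath(
    "//div[@jscontroller ='Q7Rsec']/a/img/@src"
)
print(element_xpath)
print(attr)
```

The first value can then be passed to find_elements_by_xpath, and the second to get_attribute on each returned element.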
