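I'm writing a program that downloads images from a site, and I need to get the "src" attribute of an img tag. I could do this with Selenium, but I had to change the code, and I'm now using BeautifulSoup4 and lxml. At the moment the entire page source is in the variable "mystr". I'd like to supply an XPath and look it up within that variable. Is that possible? (Probably.) I'm asking because I can't seem to parse the variable with lxml and use its .xpath() function.
- Read on for more context - I'm reading some data (a reference and a URL) from an Excel file. I want to open each URL, download the product image, and rename it to the reference. I can already do this when the page has multiple images, but when the URL contains only one image I want to download it using an XPath, and I don't want to go back to Selenium.
Thanks. I think this is the part of the code relevant to the question.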
import urllib.request
from bs4 import BeautifulSoup
from lxml import etree

# Assumed context: links, erros and div_id are defined earlier in the program,
# and this fragment is the body of a loop over i (it uses links[i] and continue).
for i in range(len(links)):
    try:  # Fetch the HTML
        fp = urllib.request.urlopen(links[i])
        mybytes = fp.read()
        mystr = mybytes.decode("utf8")
        fp.close()
    except Exception as ex:  # Fetching the HTML failed
        print("Could not fetch the HTML for this url")
        erros.append(i)
        continue
    try:  # Hand the page over to Beautiful Soup 4
        soup = BeautifulSoup(mystr, "lxml")
        #print(mystr, file = open("teste.txt", "a"))
    except Exception as ex:  # Beautiful Soup 4 failed
        # str + Exception raises TypeError, so format the exception instead
        print(f"Could not convert the HTML to bs4\n\n{ex}")
        erros.append(i)
        continue
    try:  # Navigate to the DIV inside the fetched HTML
        main_div = soup.find_all("div", {"id": div_id})
        if len(main_div) == 0:
            parser = etree.HTMLParser()
            # etree.parse() expects a file name or file object;
            # etree.fromstring() parses the bytes already in memory.
            tree = etree.fromstring(mybytes, parser)
            #print(tree, file=open("tree.txt", "a"))
            #image = tree.xpath('//*[@id="image"]')
            image = tree.xpath("/html/body/div[1]/div/div/div/div[1]/div[1]/div[1]/a/img")
            print(image[0].tag, image[0].get("src"))
            #input("--------------------------------------------------")
    except Exception as ex:  # No div with the given ID, or the XPath matched nothing
        print(f"No DIV with the given id exists\n\n{ex}")
        erros.append(i)
        continue
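For background on XPath and more on how to use XPath expressions, see the XPath wiki page. '//a/@href' selects the href attribute from every link (a tag). For the src attribute of every image, that would be '//img/@src'.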
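
As a minimal sketch of the expressions above, the snippet below parses an HTML string with lxml and runs both attribute queries against it. The sample markup is made up for illustration and only stands in for the real page source in mystr. etree.HTML is used here because it accepts a string directly.

```python
from lxml import etree

# Hypothetical sample markup standing in for the page source held in mystr.
mystr = """
<html><body>
  <div><a href="/product/1"><img id="image" src="/img/ref-001.jpg"></a></div>
  <a href="/about">About</a>
</body></html>
"""

# etree.HTML parses an HTML string (or bytes) and returns the root element.
root = etree.HTML(mystr)

# Expressions that select attributes return the attribute values as strings.
all_srcs = root.xpath("//img/@src")
all_hrefs = root.xpath("//a/@href")

# Expressions that select elements return elements; read attributes with .get().
image = root.xpath('//*[@id="image"]')
first_src = image[0].get("src") if image else None

print(all_srcs)
print(all_hrefs)
print(first_src)
```

If the src turns out to be a relative path, as in this sample, it would still need to be joined with the page URL (for instance with urllib.parse.urljoin) before it can be downloaded.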