Extracting image titles and image URLs with BeautifulSoup

Question

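I'm trying to use BeautifulSoup to extract the image URL and image title from an article. I can isolate the block containing the image URL and title from the HTML before and after it, but I can't work out how to separate those two values from their HTML tags. Here is my code: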

from bs4 import BeautifulSoup
import requests
url = 'http://www.prnewswire.com/news-releases/dutch-philosopher-koert-van-mensvoort-founder-of-the-next-nature-network-writes-a-letter-to-humanity-619925063.html'
r = requests.get(url)
html = r.text
soup = BeautifulSoup(html, 'lxml')
links = soup.find_all('div', {'class': 'image'})

The two parts I'm trying to extract are the src= and title= attributes. Any ideas on how to parse these two out would be greatly appreciated.

Answer 1

from bs4 import BeautifulSoup
import requests
url = 'http://www.prnewswire.com/news-releases/dutch-philosopher-koert-van-mensvoort-founder-of-the-next-nature-network-writes-a-letter-to-humanity-619925063.html'
r = requests.get(url)
html = r.text
soup = BeautifulSoup(html, 'lxml')
links = soup.find_all('div', {'class': 'image'})
# Collect the src and title attributes of the <img> inside each matching div
print([i.find('img')['src'] for i in links])
print([i.find('img')['title'] for i in links])
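
The list comprehensions above fail with a TypeError if a div contains no img tag, and with a KeyError if the tag has no title attribute. A minimal sketch of a more defensive version follows. The sample markup, the example.com URLs and the extract_images helper are hypothetical stand-ins for illustration, not the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for the downloaded page.
SAMPLE_HTML = """
<div class="image"><img src="http://example.com/a.jpg" title="First image"></div>
<div class="image"><img src="http://example.com/b.jpg"></div>
<div class="image"><p>No image here</p></div>
"""


def extract_images(html):
    """Return a list of (src, title) tuples, one per <img> found inside div.image."""
    soup = BeautifulSoup(html, "html.parser")
    results = []
    for div in soup.find_all("div", {"class": "image"}):
        img = div.find("img")
        if img is None:
            continue  # this div holds no <img> tag, so skip it
        # .get() returns None instead of raising KeyError when the attribute is absent
        results.append((img.get("src"), img.get("title")))
    return results


images = extract_images(SAMPLE_HTML)
for src, title in images:
    print(src, title)
```

With the real page, the same function could be called on r.text from the request shown above.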

Answer 2

Try the following to extract all image tags:

imgs = soup.find_all('img')
# find_all returns a list of tags, so loop over it to read each tag's attributes
for img in imgs:
    src = img.get('src')
    title = img.get('title')
    print(src, title)

Answer 3

A late answer, but you can use:

from bs4 import BeautifulSoup
import requests
url = 'http://www.prnewswire.com/news-releases/dutch-philosopher-koert-van-mensvoort-founder-of-the-next-nature-network-writes-a-letter-to-humanity-619925063.html'
r = requests.get(url)
html = r.text
soup = BeautifulSoup(html, "html5lib")
links = soup.find_all('div', {'class': 'image'})
if links:
    print(links[0].find('img')['src'])
    print(links[0].find('img')['title'])

Output:

http://mma.prnewswire.com/media/491859/Koert_van_Mensvoort.jpg?w=950
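Dutch philosopher Koert van Mensvoort – founder of the Next Nature Network and fellow at Eindhoven University of Technology – wrote a 'Letter to Humanity' in support of international Earth Day. (PRNewsfoto/Next Nature Network)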
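
The same lookup can probably also be written with a CSS selector, which folds the "div with class image, then the img inside it" steps into one expression. This is only a sketch: the markup and example.com URLs below are hypothetical, and the real page structure may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration only.
SAMPLE_HTML = """
<div class="image"><img src="http://example.com/photo.jpg" title="A photo"></div>
<div class="other"><img src="http://example.com/logo.png" title="Logo"></div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# select_one returns the first element matching the selector, or None if nothing matches
first_img = soup.select_one("div.image img")
first_src = first_title = None
if first_img is not None:
    first_src = first_img.get("src")
    first_title = first_img.get("title")
    print(first_src)
    print(first_title)
```

soup.select("div.image img") would return every match as a list instead of only the first.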

Answer 4

Thanks to @N cheadle.

Try the following to extract all image tags:

from bs4 import BeautifulSoup
import requests
url = 'http://www.prnewswire.com/news-releases/dutch-philosopher-koert-van-mensvoort-founder-of-the-next-nature-network-writes-a-letter-to-humanity-619925063.html'
r = requests.get(url)
html = r.text
soup = BeautifulSoup(html, 'lxml') # or use BeautifulSoup(html, 'html.parser')
imgs = soup.find_all('img')  # returns a list of all <img> tags
# The page may contain several images, so loop over the results
for x in imgs:
    src = x.get('src')  # e.g. 'test.com/images/1.png'
    title = x.get('title')  # e.g. 'flower image'
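
Looping over every img tag also picks up images that have no title, for which x.get('title') returns None. If only titled images are wanted, one option is to pass title=True to find_all so that untitled tags are filtered out. The markup below is a hypothetical sample, not the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: two titled images and one without a title.
SAMPLE_HTML = """
<img src="/images/1.png" title="flower image">
<img src="/images/spacer.gif">
<img src="/images/2.png" title="tree image">
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# title=True matches only tags that actually carry a title attribute
titled = [
    {"src": img.get("src"), "title": img["title"]}
    for img in soup.find_all("img", title=True)
]

for entry in titled:
    print(entry["src"], entry["title"])
```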
