I need to extract the artists' names from an HTML page. Here is an excerpt of the page:
</td>
<td class="playbuttonCell">
<a class="playbutton preview-track" href="/music/example" data-analytics-redirect="false" >
<img class="transparent_png play_icon" width="13" height="13" alt="Play" src="http://cdn.last.fm/flatness/preview/play_indicator.png" style="" />
</a>
</td>
<td class="subjectCell" title="example, played 3 times">
<div>
<a href="/music/example-artist" >Example artist name</a>
I have tried the following, but it does not do the job.
import urllib
from bs4 import BeautifulSoup
html = urllib.urlopen('http://www.last.fm/user/Jehl/charts?rangetype=overall&subtype=artists').read()
soup = BeautifulSoup(html)
print soup('a')
for link in soup('a'):
    print html
Where am I going wrong?
for link in soup.select('td.subjectCell a'):
    print link.text
This selects (just like CSS) the a elements inside td elements that have the subjectCell class.
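The selector approach above can be tried offline with a minimal sketch like the one below. It assumes Python 3 and the built-in "html.parser" backend, and the html string is a hypothetical stand-in trimmed from the excerpt in the question, not the live page.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the fetched page, trimmed from the question's excerpt.
html = """
<table><tr>
<td class="playbuttonCell">
<a class="playbutton preview-track" href="/music/example"><img alt="Play" src="play_indicator.png" /></a>
</td>
<td class="subjectCell" title="example, played 3 times">
<div><a href="/music/example-artist">Example artist name</a></div>
</td>
</tr></table>
"""

soup = BeautifulSoup(html, "html.parser")

# "td.subjectCell a" matches every <a> nested anywhere inside a
# <td class="subjectCell">, so the play-button link is skipped.
artist_names = [a.get_text(strip=True) for a in soup.select("td.subjectCell a")]
print(artist_names)
```

Because the selector is scoped to the subjectCell cell, only the artist link is returned, not the play-button link in the neighbouring cell.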
In [1]: from bs4 import BeautifulSoup
In [2]: s = # Your string here...
In [3]: soup = BeautifulSoup(s)
In [4]: for anchor in soup.find_all('a'):
   ...:     print anchor.text
   ...:
here lies the text i need
The find_all method returns a list containing all the matching anchor tags; we can then print each tag's text attribute to get the value between the tags.
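One thing worth checking with find_all('a') is that it returns every anchor on the page, including ones that wrap only an image. The sketch below illustrates this under stated assumptions: Python 3, the "html.parser" backend, and a hypothetical html string modelled on the excerpt in the question.

```python
from bs4 import BeautifulSoup

# Hypothetical sample modelled on the question's excerpt.
html = """
<td class="playbuttonCell">
<a class="playbutton preview-track" href="/music/example"><img alt="Play" src="play_indicator.png" /></a>
</td>
<td class="subjectCell" title="example, played 3 times">
<div><a href="/music/example-artist">Example artist name</a></div>
</td>
"""

soup = BeautifulSoup(html, "html.parser")

# find_all returns all matching tags in document order.
anchors = soup.find_all("a")

# get_text(strip=True) yields an empty string for the image-only anchor.
texts = [a.get_text(strip=True) for a in anchors]

# Keep only the anchors that actually contain text.
non_empty = [t for t in texts if t]
print(non_empty)
```

Filtering out the empty strings is one simple way to drop the play-button links; restricting the search to the subjectCell cells is another.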
Working code:
spans = soup.find_all("div", {"class": "overlay tran3s"})
for span in spans:
    links = span.find_all('a')
    for link in links:
        print(link.text)
soup = BeautifulSoup(html)
for link in soup.findAll('a'):
    print(link.attrs['href'])
Output:
/music/example
/music/example-artist
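Indexing attrs['href'] raises a KeyError for any anchor that has no href attribute. A slightly more defensive variant is sketched below, assuming Python 3 and the "html.parser" backend; the html string, including the named anchor without an href, is a made-up sample for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical sample; the first anchor deliberately has no href.
html = """
<a name="top">Top</a>
<a class="playbutton preview-track" href="/music/example"><img alt="Play" src="play_indicator.png" /></a>
<a href="/music/example-artist">Example artist name</a>
"""

soup = BeautifulSoup(html, "html.parser")

# href=True restricts the search to anchors that carry an href attribute,
# so the subscript below cannot fail.
hrefs = [a["href"] for a in soup.find_all("a", href=True)]
print(hrefs)

# Tag.get returns None instead of raising when the attribute is absent.
first_anchor_href = soup.find("a").get("href")
print(first_anchor_href)
```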
You can use the regular expression >([a-zA-Z]*|[0-9]|(\w\s*)*)</a> to capture the text between the anchor tags directly.
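As a rough illustration of the regex route, the sketch below uses only the standard re module. The pattern here is a simplified one chosen for this example rather than the expression quoted above, and the html string is a hypothetical sample. Regular expressions are brittle on real-world HTML, so this is best treated as a quick check rather than a replacement for a parser.

```python
import re

# Hypothetical sample modelled on the question's excerpt.
html = (
    '<td class="playbuttonCell"><a href="/music/example"><img alt="Play" /></a></td>'
    '<td class="subjectCell"><div>'
    '<a href="/music/example-artist" >Example artist name</a>'
    '</div></td>'
)

# Assumed pattern: an opening <a ...> tag, then one or more characters that
# are not "<" (the link text), then the closing </a>. Anchors that contain
# nested tags, such as the image-only play button, do not match.
ANCHOR_TEXT = re.compile(r"<a\b[^>]*>([^<]+)</a>")

names = [text.strip() for text in ANCHOR_TEXT.findall(html)]
print(names)
```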