bs4 only gives me the template variables, not the values I am looking for
My code:
from bs4 import BeautifulSoup
import requests

URL = "https://github.com/torvalds/linux/graphs/contributors"
page = requests.get(URL)

def getContributers():
    soup = BeautifulSoup(page.content, "html.parser")
    results = soup.find_all("span", class_="d-block Box")
    print(results)
    for a in results:
        print(a)
    # Element I am after, as shown in the browser inspector:
    # <a data-hovercard-type="user" data-hovercard-url="/users/torvalds/hovercard"
    #    class="text-normal" href="/torvalds">torvalds</a>

if __name__ == '__main__':
    getContributers()
Terminal output:
/home/myclm/Desktop/SNA/beautifulSoup/venv/bin/python /home/myclm/Desktop/SNA/beautifulSoup/main.py
[<span class="d-block Box">
<h3 class="border-bottom p-2 lh-condensed">
<a class="d-inline-block mr-2 float-left" data-hovercard-type="user" data-hovercard-url="{{ hoverCardUrl }}" href="{{ authorUrl }}">
<img alt="{{ authorLogin }}" class="avatar avatar-user" height="38" src="{{ authorAvatarUrl }}" width="38"/>
</a>
<span class="f5 text-normal color-fg-muted float-right">{{ place }}</span>
<a class="text-normal" data-hovercard-type="user" data-hovercard-url="{{ hoverCardUrl }}" href="{{ authorUrl }}">{{ authorLogin }}</a>
<span class="f6 d-block color-fg-muted">
<span class="cmeta">
<div>
<a class="Link--secondary text-normal" href="{{ contributorUrl }}">{{ contributorUrlText }}</a>
<span class="additions-deletions d-none">
<span class="color-fg-success text-normal">{{ linesAdded }}</span>
<span class="color-fg-danger text-normal">{{ linesDeleted }}</span>
</span>
</div>
</span>
</span>
</h3>
</span>]
<span class="d-block Box">
<h3 class="border-bottom p-2 lh-condensed">
<a class="d-inline-block mr-2 float-left" data-hovercard-type="user" data-hovercard-url="{{ hoverCardUrl }}" href="{{ authorUrl }}">
<img alt="{{ authorLogin }}" class="avatar avatar-user" height="38" src="{{ authorAvatarUrl }}" width="38"/>
</a>
<span class="f5 text-normal color-fg-muted float-right">{{ place }}</span>
<a class="text-normal" data-hovercard-type="user" data-hovercard-url="{{ hoverCardUrl }}" href="{{ authorUrl }}">{{ authorLogin }}</a>
<span class="f6 d-block color-fg-muted">
<span class="cmeta">
<div>
<a class="Link--secondary text-normal" href="{{ contributorUrl }}">{{ contributorUrlText }}</a>
<span class="additions-deletions d-none">
<span class="color-fg-success text-normal">{{ linesAdded }}</span>
<span class="color-fg-danger text-normal">{{ linesDeleted }}</span>
</span>
</div>
</span>
</span>
</h3>
</span>
Process finished with exit code 0
The problem is this line:
<a class="text-normal" data-hovercard-type="user" data-hovercard-url="{{ hoverCardUrl }}" href="{{ authorUrl }}">{{ authorLogin }}</a>
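On the GitHub page, {{ authorLogin }} is replaced with the name of each contributor, and I can see that when I inspect the page. Why does this not show up in the scraped version, and how do I get the names?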
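The content is loaded and rendered dynamically through an API, and requests only retrieves the static response. It does not run the JavaScript that fills in the {{ ... }} template, so the placeholders are all you get.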
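A quick way to confirm that scraped markup is an unrendered template is to search it for placeholders. The following is a minimal sketch using only the standard library; the helper name `find_placeholders` is made up for illustration, and the sample string is the line copied from the terminal output in the question.

```python
import re

# Matches Mustache-style placeholders such as {{ authorLogin }}.
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def find_placeholders(html: str) -> list[str]:
    """Return the distinct placeholder names in html, in order of first appearance."""
    seen = []
    for name in PLACEHOLDER.findall(html):
        if name not in seen:
            seen.append(name)
    return seen


# Fragment copied from the scraped output in the question.
sample = (
    '<a class="text-normal" data-hovercard-type="user" '
    'data-hovercard-url="{{ hoverCardUrl }}" href="{{ authorUrl }}">'
    '{{ authorLogin }}</a>'
)

if __name__ == "__main__":
    print(find_placeholders(sample))
```

If the list is non-empty, the values are being filled in client-side and have to come from somewhere other than the static HTML.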
So fetch the JSON data from the API instead:
import requests

URL = "https://github.com/torvalds/linux/graphs/contributors-data"

def getContributers():
    # Request JSON from the data endpoint rather than the HTML page.
    for item in requests.get(URL, headers={'Accept': 'application/json'}).json():
        author = item.get('author')
        print(author.get('login'), f"https://github.com{author.get('path')}")

if __name__ == '__main__':
    getContributers()
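The loop above would raise an AttributeError if an entry had no author object. Whether the endpoint ever returns such an entry is an assumption on my part, not something I have verified. A slightly more defensive sketch separates the parsing from the HTTP call so it can be tried offline; `extract_contributors` and `sample_payload` are hypothetical names, and the sample contains only the two fields used above.

```python
def extract_contributors(payload):
    """Return (login, profile_url) pairs from a decoded contributors-data payload."""
    contributors = []
    for item in payload:
        # Treat a missing or null author as an empty dict so .get() is safe.
        author = item.get("author") or {}
        login = author.get("login")
        path = author.get("path")
        if login and path:
            contributors.append((login, f"https://github.com{path}"))
    return contributors


# Hypothetical sample with only the two fields the answer relies on.
sample_payload = [
    {"author": {"login": "torvalds", "path": "/torvalds"}},
    {"author": None},  # entry without an author; skipped
]

if __name__ == "__main__":
    for login, url in extract_contributors(sample_payload):
        print(login, url)
```

With live data you would pass in the decoded JSON from the request above in place of `sample_payload`.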
Another option is to use selenium, or any other module that drives a real browser and can therefore handle dynamic content.
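A rough sketch of the selenium route, assuming selenium 4 with a local Chrome installation. The CSS selector is taken from the `<a class="text-normal" ...>` element in the question and may no longer match GitHub's current markup; `fetch_contributor_logins` is a hypothetical helper name. I have not run this against the live page.

```python
def fetch_contributor_logins(url, timeout=20):
    """Render url in headless Chrome and return the contributor link texts."""
    # Imported here so the module still loads where selenium is not installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait

    # Assumed selector, based on the element shown in the question.
    selector = 'a.text-normal[data-hovercard-type="user"]'

    def rendered_names(drv):
        # Collect link texts, ignoring empty ones and unrendered {{ ... }} templates.
        texts = [el.text.strip() for el in drv.find_elements(By.CSS_SELECTOR, selector)]
        names = [t for t in texts if t and "{{" not in t]
        return names or False  # falsy result makes WebDriverWait keep polling

    options = webdriver.ChromeOptions()
    options.add_argument("--headless=new")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        # Poll until at least one rendered name appears, or time out.
        return WebDriverWait(driver, timeout).until(rendered_names)
    finally:
        driver.quit()
```

Calling `fetch_contributor_logins("https://github.com/torvalds/linux/graphs/contributors")` should return the names once the page's JavaScript has filled in the template. This is much slower and heavier than the JSON request, so it is mainly worth it when no data endpoint is available.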