How do I scrape GitHub with BeautifulSoup?

bs4 only gives me the variables, not the values I am looking for.

My code:

from bs4 import BeautifulSoup
import requests

URL = "https://github.com/torvalds/linux/graphs/contributors"
page = requests.get(URL)

def getContributers():
    soup = BeautifulSoup(page.content, "html.parser")

    results = soup.find_all("span", class_="d-block Box")
    print(results)
    for a in results:
        print(a)

# <a data-hovercard-type="user"
#    data-hovercard-url="/users/torvalds/hovercard"
#    class="text-normal" href="/torvalds">torvalds</a>

if __name__ == '__main__':
    getContributers()

Terminal output:

/home/myclm/Desktop/SNA/beautifulSoup/venv/bin/python /home/myclm/Desktop/SNA/beautifulSoup/main.py 
[<span class="d-block Box">
<h3 class="border-bottom p-2 lh-condensed">
<a class="d-inline-block mr-2 float-left" data-hovercard-type="user" data-hovercard-url="{{ hoverCardUrl }}" href="{{ authorUrl }}">
<img alt="{{ authorLogin }}" class="avatar avatar-user" height="38" src="{{ authorAvatarUrl }}" width="38"/>
</a>
<span class="f5 text-normal color-fg-muted float-right">{{ place }}</span>
<a class="text-normal" data-hovercard-type="user" data-hovercard-url="{{ hoverCardUrl }}" href="{{ authorUrl }}">{{ authorLogin }}</a>
<span class="f6 d-block color-fg-muted">
<span class="cmeta">
<div>
<a class="Link--secondary text-normal" href="{{ contributorUrl }}">{{ contributorUrlText }}</a>
<span class="additions-deletions d-none">
                      
                    <span class="color-fg-success text-normal">{{ linesAdded }}</span>
                      
                    <span class="color-fg-danger text-normal">{{ linesDeleted }}</span>
</span>
</div>
</span>
</span>
</h3>
</span>]
<span class="d-block Box">
<h3 class="border-bottom p-2 lh-condensed">
<a class="d-inline-block mr-2 float-left" data-hovercard-type="user" data-hovercard-url="{{ hoverCardUrl }}" href="{{ authorUrl }}">
<img alt="{{ authorLogin }}" class="avatar avatar-user" height="38" src="{{ authorAvatarUrl }}" width="38"/>
</a>
<span class="f5 text-normal color-fg-muted float-right">{{ place }}</span>
<a class="text-normal" data-hovercard-type="user" data-hovercard-url="{{ hoverCardUrl }}" href="{{ authorUrl }}">{{ authorLogin }}</a>
<span class="f6 d-block color-fg-muted">
<span class="cmeta">
<div>
<a class="Link--secondary text-normal" href="{{ contributorUrl }}">{{ contributorUrlText }}</a>
<span class="additions-deletions d-none">
                      
                    <span class="color-fg-success text-normal">{{ linesAdded }}</span>
                      
                    <span class="color-fg-danger text-normal">{{ linesDeleted }}</span>
</span>
</div>
</span>
</span>
</h3>
</span>

Process finished with exit code 0

The problem is this line:

<a class="text-normal" data-hovercard-type="user" data-hovercard-url="{{ hoverCardUrl }}" href="{{ authorUrl }}">{{ authorLogin }}</a>

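On the GitHub page, {{ authorLogin }} is replaced with the names of the different contributors, and I can see those names when I inspect the page. Why don't they show up in the scraped version, and how can I get the names?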

python web-scraping beautifulsoup python-requests
1 answer

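The content is rendered dynamically: the page loads its data from an API and fills it in with JavaScript. requests only fetches the static HTML response and does not run any scripts, so the {{ ... }} placeholders are never replaced.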
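One way to check this for yourself is to look for leftover {{ ... }} placeholders in the HTML that was fetched. The following is a minimal sketch using only the standard library; the helper name find_placeholders is hypothetical, and the pattern assumes the simple {{ name }} form shown in the question's output.

```python
import re

# Matches simple template placeholders such as {{ authorLogin }}.
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def find_placeholders(html):
    """Return the distinct placeholder names in html, in order of first appearance."""
    # dict.fromkeys drops duplicates while keeping insertion order.
    return list(dict.fromkeys(PLACEHOLDER.findall(html)))
```

Calling find_placeholders(page.text) on the response from the question should return a non-empty list if the template was delivered unrendered; an empty list means no placeholders of this form were found.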

So get the JSON data from the API instead:

import requests

URL = "https://github.com/torvalds/linux/graphs/contributors-data"

def getContributers():
    # Ask for JSON explicitly; this endpoint supplies the data the page renders.
    response = requests.get(URL, headers={'Accept': 'application/json'})
    for item in response.json():
        author = item.get('author')
        print(author.get('login'), f"https://github.com{author.get('path')}")

if __name__ == '__main__':
    getContributers()
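The same idea can be made a little more defensive. The sketch below separates the parsing from the download so the parsing can be tried without network access. The function names contributor_links and fetch_contributors are hypothetical, and the field names author, login and path are taken from the code above rather than from any documented schema.

```python
def contributor_links(items, base="https://github.com"):
    """Turn the decoded JSON list into (login, profile_url) pairs."""
    links = []
    for item in items:
        author = item.get("author") or {}  # tolerate a missing or null author
        login, path = author.get("login"), author.get("path")
        if login and path:
            links.append((login, f"{base}{path}"))
    return links


def fetch_contributors(url, timeout=10):
    """Download the JSON from url and return contributor links."""
    import requests  # third-party; imported here so contributor_links runs without it

    response = requests.get(url, headers={"Accept": "application/json"}, timeout=timeout)
    response.raise_for_status()  # fail loudly on 4xx/5xx instead of parsing an error page
    return contributor_links(response.json())
```

With the URL from the answer, fetch_contributors(URL) would return a list of pairs that can be printed or stored, assuming the endpoint still responds with the structure shown above.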

Another option is to use selenium, or any other module that drives a real browser and can therefore handle dynamic content.
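A rough sketch of the browser-based route, under several assumptions: the CSS selector is derived from the anchor quoted in the question and may no longer match GitHub's markup, the function names are hypothetical, and a simple polling loop stands in for selenium's own wait helpers so that the collecting function works with any object offering get and find_elements.

```python
import time

# Selenium's locator strategy string for CSS selectors (the value of By.CSS_SELECTOR).
CSS = "css selector"
# Assumed selector, based on the <a class="text-normal" ...> element in the question.
LOGIN_SELECTOR = 'a.text-normal[data-hovercard-type="user"]'


def collect_logins(driver, url, timeout=15.0, poll=0.5):
    """Load url and return the rendered logins, polling until timeout seconds pass."""
    driver.get(url)
    deadline = time.monotonic() + timeout
    while True:
        texts = [e.text.strip() for e in driver.find_elements(CSS, LOGIN_SELECTOR)]
        # Ignore empty text and anything that is still an unrendered template.
        logins = [t for t in texts if t and "{{" not in t]
        if logins or time.monotonic() >= deadline:
            return logins
        time.sleep(poll)


def scrape_with_firefox(url):
    """Hypothetical entry point; needs selenium and a local Firefox install."""
    from selenium import webdriver  # imported lazily so collect_logins has no hard dependency

    driver = webdriver.Firefox()
    try:
        return collect_logins(driver, url)
    finally:
        driver.quit()  # always close the browser, even on errors
```

Note that collect_logins returns as soon as at least one name has rendered, so a page that fills in gradually may need a longer wait or a stricter condition. The JSON approach above is usually lighter, since it avoids starting a browser at all.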