Python BS4 various HTML tags

Question

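I am trying to reach a particular value while web scraping in Python with BS4. The webpage is https://openaccess.thecvf.com/CVPR2021?day=all. I was able to get the titles through the ptitle class, and the count comes out as 1660, which is accurate. But I am not sure how to go further down the tree. I am using the following to get the titles: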

    import bs4
    import requests

    url = "https://openaccess.thecvf.com/CVPR2021?day=all"
    webpage = requests.get(url)
    soup = bs4.BeautifulSoup(webpage.content, "html.parser")
    titles = soup.select("dt.ptitle")  # CSS selector: <dt> elements with class "ptitle"
    print(len(titles))  # output is 1660

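It looks like the remaining tags to step through are dd > form id. My goal is to extract the value of each form id for every title and store the results in a dictionary.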

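I have done a lot of research and tried to find a solution, but everything I found only produced errors. I don't recall everything I have tried at this point; I set it aside for a while and then came back to it. Many posts seem to cover this kind of task, but none with HTML structured the same way, so no single solution has worked for me. Many of the answers I found to other people's questions say to use soup.select(tag, tag) or something similar, but the tags I need to step through are completely different. I also went through the BS4 documentation and tried some of its examples, but I still got out-of-bounds errors, too-many-arguments errors, or something similar. Any help would be appreciated.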

Answer 1

You can use tag.find_next() to get the next tag named <dd>, then search for all <form> tags inside that <dd>:

    import bs4
    import requests

    url = "https://openaccess.thecvf.com/CVPR2021?day=all"

    soup = bs4.BeautifulSoup(requests.get(url).content, "html.parser")

    for title in soup.select("dt.ptitle"):
        title_name = title.text
        # The first <dd> after each title holds one <form> per author.
        names = [name.text.strip(" ,\n\r") for name in title.find_next("dd").select("form")]
        print(title_name)
        print(", ".join(names))
        print("-" * 80)

Prints:

...

Data-Free Knowledge Distillation for Image Super-Resolution
Yiman Zhang, Hanting Chen, Xinghao Chen, Yiping Deng, Chunjing Xu, Yunhe Wang
--------------------------------------------------------------------------------
PluckerNet: Learn To Register 3D Line Reconstructions
Liu Liu, Hongdong Li, Haodong Yao, Ruyi Zha
--------------------------------------------------------------------------------
Deep Perceptual Preprocessing for Video Coding
Aaron Chadha, Yiannis Andreopoulos
--------------------------------------------------------------------------------
Explaining Classifiers Using Adversarial Perturbations on the Perceptual Ball
Andrew Elliott, Stephen Law, Chris Russell
--------------------------------------------------------------------------------
DARCNN: Domain Adaptive Region-Based Convolutional Neural Network for Unsupervised Instance Segmentation in Biomedical Images
Joy Hsu, Wah Chiu, Serena Yeung
--------------------------------------------------------------------------------
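Since the question asks for the results in a dictionary, here is a minimal sketch of how the same find_next() / select() approach might be adapted to build one. It runs against a small hand-written HTML fragment instead of the live page, so the markup, the form ids (such as form-AaronChadha), and the helper name build_title_index are all assumptions made for illustration; the real page may differ in its details.

```python
import bs4

# Hypothetical fragment imitating the assumed layout of the listing page:
# each <dt class="ptitle"> is followed by a <dd> holding one <form> per author.
# The form ids below are invented for this example.
SAMPLE_HTML = """
<dl>
<dt class="ptitle"><br><a href="#">Deep Perceptual Preprocessing for Video Coding</a></dt>
<dd>
<form id="form-AaronChadha" class="authsearch">
<input type="hidden" name="query_author" value="Aaron Chadha">
<a href="#">Aaron Chadha</a>,
</form>
<form id="form-YiannisAndreopoulos" class="authsearch">
<input type="hidden" name="query_author" value="Yiannis Andreopoulos">
<a href="#">Yiannis Andreopoulos</a>
</form>
</dd>
<dd>[<a href="#">pdf</a>]</dd>
</dl>
"""


def build_title_index(html):
    """Map each paper title to a list of (form id, author name) pairs."""
    soup = bs4.BeautifulSoup(html, "html.parser")
    index = {}
    for title in soup.select("dt.ptitle"):
        first_dd = title.find_next("dd")
        if first_dd is None:  # a title with no following <dd>: skip it
            continue
        index[title.get_text(strip=True)] = [
            (form.get("id"), form.get_text().strip(" ,\n\r"))
            for form in first_dd.select("form")
        ]
    return index


index = build_title_index(SAMPLE_HTML)
for paper, authors in index.items():
    print(paper, authors)
```

To try this on the real page, pass requests.get(url).content to build_title_index instead of SAMPLE_HTML. Note that duplicate titles, if any exist, would overwrite each other in a plain dictionary.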
