Python HTML parsing with BS4

Question

I am having trouble parsing HTML with Python and Beautiful Soup: I want to extract one very specific piece of data. This is the markup I am working with:

<div class="big_div">
   <div class="smaller div">
      <div class="other div">
         <div class="this">A</div>
         <div class="that">2213</div>
      <div class="other div">
         <div class="this">B</div>
         <div class="that">215</div>
      <div class="other div">
         <div class="this">C</div>
         <div class="that">253</div>

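As you can see, the same block of HTML repeats and only the values differ. My problem is locating one specific value: I want to find the 253 in the last div. Any help would be appreciated, since this keeps coming up for me when parsing HTML.

Thank you in advance!

So far I have tried to parse it, but because the class names are identical I do not know how to navigate it. I also tried a for loop, with little to no progress.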

Answer 1

You can pass the string argument to find. See the Beautiful Soup documentation on the string argument.

from bs4 import BeautifulSoup

# Assume `html` holds the HTML source of the page you want to scrape
# and `req_text` is the text you want to find.
soup = BeautifulSoup(html, 'lxml')
req_div = soup.find('div', string=req_text)

req_div will then hold the div element you want, provided the div's text is exactly req_text.
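
Applied to the markup in the question, this could look like the sketch below. It is only an illustration: the closing </div> tags are filled in as an assumption (the question's snippet is truncated), the variable names are made up for the example, and 'html.parser' is used instead of 'lxml' so that nothing beyond bs4 has to be installed. Besides the exact-text match, it shows two other ways to reach the same value that may be more useful when the number itself is not known in advance.

```python
from bs4 import BeautifulSoup

# Sample markup modeled on the question's snippet.
# The closing tags are an assumption; the original snippet is cut off.
HTML = """
<div class="big_div">
   <div class="smaller div">
      <div class="other div">
         <div class="this">A</div>
         <div class="that">2213</div>
      </div>
      <div class="other div">
         <div class="this">B</div>
         <div class="that">215</div>
      </div>
      <div class="other div">
         <div class="this">C</div>
         <div class="that">253</div>
      </div>
   </div>
</div>
"""

soup = BeautifulSoup(HTML, "html.parser")

# 1. Exact text match, as suggested in the answer above.
by_text = soup.find("div", string="253")

# 2. Find the label "C", then step to its sibling with class "that".
label = soup.find("div", class_="this", string="C")
value_for_c = label.find_next_sibling("div", class_="that").get_text(strip=True)

# 3. Collect every "that" div and take the last one.
last_value = soup.find_all("div", class_="that")[-1].get_text(strip=True)

print(by_text.get_text(), value_for_c, last_value)
```

The second form is probably the most robust of the three if the order of the blocks can change, since it depends on the label rather than on position or on knowing the value beforehand.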

Answer 2

import requests
from bs4 import BeautifulSoup

# Read the number of result pages from the pagination block on the first page.
page = requests.get('https://habr.com/ru/search/page1/?q=Ютуб&target_type=posts&order=relevance').text
page_soup = BeautifulSoup(page, 'html.parser')
count_pages = int(page_soup.find_all('div', 'tm-pagination__page-group')[-1].text.split()[0])

# Build a link for every article listed on each result page.
hrefs = []
for i in range(1, count_pages + 1):
    print(i)
    page = requests.get(f'https://habr.com/ru/search/page{i}/?q=Новости&target_type=posts&order=relevance').text
    page_s = BeautifulSoup(page, 'html.parser')
    links = page_s.find_all('article', 'tm-articles-list__item')
    for link in links:
        hrefs.append(f'https://habr.com/ru/news/{link["id"]}/')

# Download each article and keep its body text ('' when no body div is found).
texts = [''] * len(hrefs)
for ind, href in enumerate(hrefs):
    print(ind)
    pagex = requests.get(href).text
    page_su = BeautifulSoup(pagex, 'html.parser')
    try:
        text = page_su.find_all("div", "article-formatted-body article-formatted-body article-formatted-body_version-1")[0].text
        texts[ind] = text
    except IndexError:
        pass
