How do I extract specific data from HTML?

Question (votes: 0, answers: 2)

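I need to extract the following data from the HTML: the date, the time, all four "Jogos" times, and whether the check mark appears at the first, second, third, or fourth time.

However, the code does not return anything.

My Python code is: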


from bs4 import BeautifulSoup

html = """
<div class="body">
   <div class="pull_right date details" title="09.03.2023 01:08:10 UTC-03:00">
      01:08
   </div>
   <div class="from_name">
      🤖🥇 𝑬𝒂𝒔𝒚 𝑩𝒐𝒕 - 𝑶𝒗𝒆𝒓 2.5 
   </div>
   <div class="text">
      Easy Bot - Over 2.5<br><br>🏆 Liga: Sul-Americana<br>🚦 Entrada: Over 2.5 FT<br>⚽ Jogos: ✅ 04:10 04:13 04:16 (04:19)<br><br><strong>Link: </strong><a href="https://www.bet365.com/#/AVR/B146/R%5E1/">https://www.bet365.com/#/AVR/B146/R%5E1/</a><br><br>🍀 24h:100% de acerto nas últimas 24h<br><br>✅✅✅✅✅✅ .
   </div>
</div>
"""

# parse the HTML
soup = BeautifulSoup(html, 'html.parser')

# extract the date and time
date_time_div = soup.find('div', class_='pull_right date details')
date = date_time_div['title'][:10]
time = date_time_div.text.strip()

jogos_div = soup.find('div', string='⚽ Jogos:')
if jogos_div is not None:
    jogos_time = jogos_div.text.split(' ')[2:-1]
    checked_jogos = [jogo for jogo in jogos_time if '✅' in jogo]
    print('Jogos time:', ', '.join(jogos_time))
    print('Checked jogos:', ', '.join(checked_jogos))
else:
    print('No jogos found.')

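Note that the check mark "belongs to" a time, so it can sit on the first, second, third, or fourth Jogos time.

So the expected output is:

Date: XXX Time: XXX Jogos: XXX, XXX, XXX, XXX Checked: 1 (or 2, 3, 4, depending on where the check mark is)

So what is wrong with this code?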

python-3.x beautifulsoup html-parsing
2 Answers
0
votes

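I'm not sure what the expected result should look like or how many items need to be iterated over, so this code is only meant to point in a direction.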


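The problem here is that

string='⚽ Jogos:'
has to match exactly. When combined with a tag name, it is compared against the tag's .string, which is None for a div that contains several text nodes and <br> tags, so nothing is found.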
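To make the exact-match behavior concrete, here is a minimal sketch on a shortened, hypothetical version of the question's markup. It shows that the div's .string is None, that the tag search with string= therefore finds nothing, and that searching the text nodes with a regular expression is one way to reach the Jogos line.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical, shortened markup in the same shape as the question's HTML.
html = '<div class="text">Easy Bot<br>⚽ Jogos: ✅ 04:10 04:13 04:16 (04:19)<br>Link</div>'
soup = BeautifulSoup(html, 'html.parser')

# The div has several children (text nodes and <br> tags), so its .string
# is None and a tag search filtered by string= cannot match it.
div = soup.find('div', class_='text')
print(div.string)
print(soup.find('div', string='⚽ Jogos:'))

# Searching the text nodes with a regular expression finds the line itself.
jogos_line = soup.find(string=re.compile('Jogos:'))
print(jogos_line)
```

The first two prints show None; the last one shows the text node that holds the Jogos times.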

Example

from bs4 import BeautifulSoup

html = '''
<div class="body">
   <div class="pull_right date details" title="09.03.2023 01:08:10 UTC-03:00">
      01:08
   </div>
   <div class="from_name">
      🤖🥇 𝑬𝒂𝒔𝒚 𝑩𝒐𝒕 - 𝑶𝒗𝒆𝒓 2.5 
   </div>
   <div class="text">
      Easy Bot - Over 2.5<br><br>🏆 Liga: Sul-Americana<br>🚦 Entrada: Over 2.5 FT<br>⚽ Jogos: ✅ 04:10 04:13 04:16 (04:19)<br><br><strong>Link: </strong><a href="https://www.bet365.com/#/AVR/B146/R%5E1/">https://www.bet365.com/#/AVR/B146/R%5E1/</a><br><br>🍀 24h:100% de acerto nas últimas 24h<br><br>✅✅✅✅✅✅ .
   </div>
</div>
'''
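# name the parser explicitly to avoid the "no parser was explicitly specified" warning
soup = BeautifulSoup(html, 'html.parser')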

data = []

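for e in soup.select('.text'):
    d = None  # reset per block so a block without a Jogos line cannot raise NameError
    for s in e.stripped_strings:
        if 'Jogos' in s:
            # drop the '⚽' and 'Jogos:' tokens, keep the check mark and the times
            s = s.split()[2:]
            jogo_times = [t for t in s if '✅' not in t]
            # the time right after each check mark (a trailing mark is skipped)
            jogo_check = [s[i+1] for i, t in enumerate(s) if '✅' in t and i + 1 < len(s)]
            date_div = e.find_previous('div', {'class': 'date'})
            d = {
                'date': date_div.get('title')[:10],
                'time': date_div.get_text(strip=True),
                'jogo_times': jogo_times,
                'jogo_time_checked': jogo_check
            }
            break
    if d:
        data.append(d)
    else:
        print('no jogo')
data  # in a notebook or REPL this displays the collected list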

Output

[{'date': '09.03.2023',
  'time': '01:08',
  'jogo_times': ['04:10', '04:13', '04:16', '(04:19)'],
  'jogo_time_checked': ['04:10']}]
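The question asks for the position (1 to 4) of the checked time rather than the time itself. Below is a minimal sketch of that step. It assumes the check mark is always a separate token placed directly before the time it belongs to, as in the sample line. The helper name checked_position is hypothetical, and stripping the parentheses from the last time is an illustrative choice, not something the question requires.

```python
def checked_position(line):
    """Return (times, position) for a '⚽ Jogos: ...' line.

    position is the 1-based index of the time that follows the check mark,
    or None when the line carries no check mark.
    """
    tokens = line.split(':', 1)[1].split()  # everything after 'Jogos:'
    times = []
    position = None
    for token in tokens:
        if token == '✅':
            # The mark belongs to the next time, which will be item len(times) + 1.
            position = len(times) + 1
        else:
            times.append(token.strip('()'))
    return times, position


print(checked_position('⚽ Jogos: ✅ 04:10 04:13 04:16 (04:19)'))
# (['04:10', '04:13', '04:16', '04:19'], 1)
print(checked_position('⚽ Jogos: 04:10 04:13 ✅ 04:16 (04:19)'))
# (['04:10', '04:13', '04:16', '04:19'], 3)
```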

0
votes

I would do it like this:

# Assumes soup was built from the question's HTML, e.g. BeautifulSoup(html, 'html.parser')
d_str = soup.select_one('div.date.details')['title']
calendar = d_str.split(" ")
print("Date: ",calendar[0])
print("Time: ",calendar[1])
for sts in soup.select_one('div.text').stripped_strings:
    if "⚽ Jogos: " in sts:
        jugos = (sts.split('⚽ Jogos: ')[1].split(" "))
        # 1-based position of the time that follows the check mark
        ind = jugos.index('✅')+1
        jugos.remove("✅")
        print(jugos)
        print("Checkmarked: ", ind)

Output:

Date:  09.03.2023
Time:  01:08:10
['04:10', '04:13', '04:16', '(04:19)']
Checkmarked:  1

Of course, you can append the output to a list instead of printing it.
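One way to collect the results in a list might look like the sketch below. It assumes each message sits in its own div.body, as in the question's snippet. The two-message markup, the results list, and the dictionary keys are hypothetical names chosen for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: two messages shaped like the one in the question.
html = '''
<div class="body">
  <div class="pull_right date details" title="09.03.2023 01:08:10 UTC-03:00">01:08</div>
  <div class="text">Easy Bot<br>⚽ Jogos: ✅ 04:10 04:13 04:16 (04:19)<br>Link</div>
</div>
<div class="body">
  <div class="pull_right date details" title="09.03.2023 01:20:45 UTC-03:00">01:20</div>
  <div class="text">Easy Bot<br>⚽ Jogos: 04:22 ✅ 04:25 04:28 (04:31)<br>Link</div>
</div>
'''
soup = BeautifulSoup(html, 'html.parser')

results = []
for body in soup.select('div.body'):
    # The title holds 'date time timezone'; keep the first two parts.
    date, time = body.select_one('div.date.details')['title'].split(' ')[:2]
    for sts in body.select_one('div.text').stripped_strings:
        if '⚽ Jogos: ' in sts:
            jogos = sts.split('⚽ Jogos: ')[1].split(' ')
            # 1-based position of the time right after the mark; None if unmarked.
            checked = jogos.index('✅') + 1 if '✅' in jogos else None
            jogos = [j for j in jogos if j != '✅']
            results.append({'date': date, 'time': time,
                            'jogos': jogos, 'checked': checked})

for row in results:
    print(row)
```

Checking for the mark before calling index() keeps a message without a check mark from raising ValueError, which the printing version would do.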
