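I need to extract the following data from the HTML: the date, the time, all four "Jogos" times, and whether the check mark appears at the first, second, third, or fourth time.
However, the code does not return anything.
My Python code is: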
from bs4 import BeautifulSoup
html = """
<div class="body">
<div class="pull_right date details" title="09.03.2023 01:08:10 UTC-03:00">
01:08
</div>
<div class="from_name">
🤖🥇 𝑬𝒂𝒔𝒚 𝑩𝒐𝒕 - 𝑶𝒗𝒆𝒓 2.5
</div>
<div class="text">
Easy Bot - Over 2.5<br><br>🏆 Liga: Sul-Americana<br>🚦 Entrada: Over 2.5 FT<br>⚽ Jogos: ✅ 04:10 04:13 04:16 (04:19)<br><br><strong>Link: </strong><a href="https://www.bet365.com/#/AVR/B146/R%5E1/">https://www.bet365.com/#/AVR/B146/R%5E1/</a><br><br>🍀 24h:100% de acerto nas últimas 24h<br><br>✅✅✅✅✅✅ .
</div>
</div>
"""
# parse the HTML
soup = BeautifulSoup(html, 'html.parser')
# extract the date and time
date_time_div = soup.find('div', class_='pull_right date details')
date = date_time_div['title'][:10]
time = date_time_div.text.strip()
jogos_div = soup.find('div', string='⚽ Jogos:')
if jogos_div is not None:
    jogos_time = jogos_div.text.split(' ')[2:-1]
    checked_jogos = [jogo for jogo in jogos_time if '✅' in jogo]
    print('Jogos time:', ', '.join(jogos_time))
    print('Checked jogos:', ', '.join(checked_jogos))
else:
    print('No jogos found.')
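Note that the check mark "belongs" to a time, so it can be attached to the first, second, third, or fourth Jogos time.
So the expected output is:
Date: XXX Time: XXX Jogos: XXX, XXX, XXX, XXX Checked: 1 (or 2, 3, 4, depending on where the check mark is)
So, what is wrong with this code?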
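I am not sure what the expected result should look like or how many items need to be iterated over, so treat this code as a pointer in the right direction.
The problem here is that
string='⚽ Jogos:'
requires an exact match against the tag's entire string. The div with class "text" contains several strings separated by <br> tags, so no div matches and find() returns None.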
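To illustrate why the exact match fails, here is a minimal sketch on a shortened, made-up version of the message (the HTML snippet and the variable names are assumptions for illustration only). It shows that a tag with several children has no single `.string`, and that searching the text nodes with a regular expression is one way to find the line anyway.

```python
import re
from bs4 import BeautifulSoup

# Shortened sample, invented for this illustration
html = '<div class="text">Liga: Sul-Americana<br>⚽ Jogos: ✅ 04:10 04:13<br>fim</div>'
soup = BeautifulSoup(html, 'html.parser')
div = soup.find('div', class_='text')

# The div has several children (strings and <br> tags), so .string is None
# and the string= filter has nothing to compare against.
no_match = soup.find('div', string='⚽ Jogos:')

# Searching the text nodes themselves with a pattern does find the line.
jogos_line = soup.find(string=re.compile('Jogos:'))

print(div.string)
print(no_match)
print(jogos_line)
```

The first two prints show `None`; the third shows the text node that contains the Jogos times.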
from bs4 import BeautifulSoup
html = '''
<div class="body">
<div class="pull_right date details" title="09.03.2023 01:08:10 UTC-03:00">
01:08
</div>
<div class="from_name">
🤖🥇 𝑬𝒂𝒔𝒚 𝑩𝒐𝒕 - 𝑶𝒗𝒆𝒓 2.5
</div>
<div class="text">
Easy Bot - Over 2.5<br><br>🏆 Liga: Sul-Americana<br>🚦 Entrada: Over 2.5 FT<br>⚽ Jogos: ✅ 04:10 04:13 04:16 (04:19)<br><br><strong>Link: </strong><a href="https://www.bet365.com/#/AVR/B146/R%5E1/">https://www.bet365.com/#/AVR/B146/R%5E1/</a><br><br>🍀 24h:100% de acerto nas últimas 24h<br><br>✅✅✅✅✅✅ .
</div>
</div>
'''
soup = BeautifulSoup(html, 'html.parser')
data = []
for e in soup.select('.text'):
    d = None  # reset for every message so a missing Jogos line is reported
    for s in e.stripped_strings:
        if 'Jogos' in s:
            # drop the leading '⚽' and 'Jogos:' tokens
            s = s.split()[2:]
            jogo_times = [t for t in s if '✅' not in t]
            # the check mark precedes the time it belongs to
            jogo_check = [s[i+1] for i, t in enumerate(s) if '✅' in t]
            d = {
                'date': e.find_previous('div', {'class': 'date'}).get('title')[:10],
                'time': e.find_previous('div', {'class': 'date'}).get_text(strip=True),
                'jogo_times': jogo_times,
                'jogo_time_checked': jogo_check
            }
            break
    if d:
        data.append(d)
    else:
        print('no jogo')
data
[{'date': '09.03.2023',
'time': '01:08',
'jogo_times': ['04:10', '04:13', '04:16', '(04:19)'],
'jogo_time_checked': ['04:10']}]
I would do it this way:
# the title looks like "09.03.2023 01:08:10 UTC-03:00"
d_str = soup.select_one('div.date.details')['title']
calendar = d_str.split(" ")
print("Date:", calendar[0])
print("Time:", calendar[1])
for sts in soup.select_one('div.text').stripped_strings:
    if "⚽ Jogos: " in sts:
        jugos = sts.split('⚽ Jogos: ')[1].split(" ")
        # 1-based position of the time that follows the check mark
        ind = jugos.index('✅') + 1
        jugos.remove("✅")
        print(jugos)
        print("Checkmarked:", ind)
Output:
Date: 09.03.2023
Time: 01:08:10
['04:10', '04:13', '04:16', '(04:19)']
Checkmarked: 1
Of course, you can append the output to a list instead of printing it.
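As a rough sketch of collecting the results in a list rather than printing them, the Jogos line can be parsed by a small helper that returns the times together with the 1-based position of the checked one. The names `parse_jogos` and `records` are hypothetical, the date and time are copied from the sample message, and the second input line is made up to show a check mark in another position. The helper uses only string methods, so it works on any line already extracted with BeautifulSoup.

```python
def parse_jogos(line):
    """Return (times, checked) for a line like '⚽ Jogos: ✅ 04:10 04:13 04:16 (04:19)'.

    checked is the 1-based position of the time that follows the check mark,
    or None if the line contains no check mark.
    """
    tokens = line.split('Jogos:', 1)[1].split()
    times = []
    checked = None
    for tok in tokens:
        if '✅' in tok:
            # the check mark belongs to the next time in the list
            checked = len(times) + 1
            tok = tok.replace('✅', '')
            if not tok:
                continue
        times.append(tok)
    return times, checked


records = []  # hypothetical container, one dict per message
for line in ('⚽ Jogos: ✅ 04:10 04:13 04:16 (04:19)',
             '⚽ Jogos: 04:10 ✅ 04:13 04:16 (04:19)'):  # second line is invented
    times, checked = parse_jogos(line)
    records.append({'date': '09.03.2023', 'time': '01:08',
                    'jogos': times, 'checked': checked})

for r in records:
    print('Date:', r['date'], 'Time:', r['time'],
          'Jogos:', ', '.join(r['jogos']), 'Checked:', r['checked'])
```

For the two lines above, `checked` comes out as 1 and 2 respectively, and the parentheses around the last time are kept as they appear in the message.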