How do I get the URL from this HTML?


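I want to parse the URL out of the data-bem attribute with Python and BeautifulSoup, but I can't get it to work. How can I fix this? I fetch the HTML with Selenium and parse it with BeautifulSoup, but data-bem behaves as if it were not in the markup at all.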

 <a target="_blank" class="Link Link_theme_normal OrganicTitle-Link organic__url link i-bem  click" data-bem="{&quot;click&quot;:{&quot;action&quot;:&quot;counter&quot;,&quot;arguments&quot;:{&quot;url&quot;:&quot;https://allrival.com/parsing-saitov?yclid=1693735580326690815&quot;}}}" tabindex="0" href="https://yabs.yandex.ru/count/WUmejI_zOoVX2LdK0VKE0DDQPIOQbKgbKga4mGHzFfSxUxRVkVE6Er-_u_M6EzmQHaf2rAHQHf5G4XBrY0IHKT288MeD8QIX8QhE3fYo6K0J6aRIAb1wcmyMD90tnQ3ZwBTGOUqxu1h_-dIs2XmU4afrlr5xqaQddE-4HBUq3-7b1w_tm7r2dT6QHOzF_NpQDNMeXeq_exRJESNnQDIPsxOjhQLudjlvFRLpA9uxfRO5EDWIigwoT4mPSgLm0nC7Ep9Oqc69e1GRBpt60CPM0I99EpcWwxfUesyGtAQrYMtvYAJXUfdOufskvXEL6zUcg0KYXsrIEL7OpjqnnP1Ze7Gi7fEERQEWjDAYGcVp0QQhL2gNOALIsNn2jQ3A61YC1zYcdaVhQtxfSPDBZvlxpHNm3mk95pzOW2zJ_mjZnUnxOQoOVUw4DltlCBvMljQ3OjpFUb2COvf3S9bbPf8TpB457ub9mXgwE67ftJ_5TQ6pdigNBrofElDa6Zx5GJ-cFPY38f91iVyciDXO1UotzTokflLQ6MiJfPzD0of_aWYy3IG0WPCO0HRy-nGDamySHhoyQ0rpLtzxLqiAI5vJ-AbMc-LZTZdS_6CvD5EwCc6EJR835SkZ7n0gWZUtcSuHnL5SKPGfkhQ4jVhDfP00~2?etext=2202.EFrpnwCJ8OLr9ZhqyYcdw2rxonJODkRkUgvE4hK6Ekp2ZHh6YWRoaXVvZ3Jzbnlu.3dd14f5bf036fff28e539bc2f6b9a2ac5c4c2104&amp;from=ya.ru%3Bsearch%26%23x2F%3B%3Bweb%3B%3B0%3B&amp;q=%D0%BF%D0%B0%D1%80%D1%81%D0%B8%D0%BD%D0%B3&amp;baobab_event_id=lsusc3y2k7" data-counter="[&quot;b&quot;]" data-log-node="2_cbmrw0g-00" data-event-required="true"><div class="Favicon Favicon_size_m favicon"><div class="Favicon-Icon favicon__icon Favicon-Page0 Favicon-Page0_pos_18" style="width:16px;height:16px;background-size:16px;background-position-y:-288px"></div></div><h2 class="OrganicTitle-LinkText Typo Typo_text_l Typo_line_m organic__url-text"><span class="OrganicTitleContentSpan organic__title" role="text"><b>Парсинг</b> сайтов конкурентов, Поставщиков, контроль цен</span></h2></a>

python parsing selenium-webdriver beautifulsoup lxml
1 Answer
from bs4 import BeautifulSoup
import json

html = """<a target="_blank" class="Link Link_theme_normal OrganicTitle-Link organic__url link i-bem  click" data-bem="{&quot;click&quot;:{&quot;action&quot;:&quot;counter&quot;,&quot;arguments&quot;:{&quot;url&quot;:&quot;https://allrival.com/parsing-saitov?yclid=1693735580326690815&quot;}}}" tabindex="0" href="https://yabs.yandex.ru/count/WUmejI_zOoVX2LdK0VKE0DDQPIOQbKgbKga4mGHzFfSxUxRVkVE6Er-_u_M6EzmQHaf2rAHQHf5G4XBrY0IHKT288MeD8QIX8QhE3fYo6K0J6aRIAb1wcmyMD90tnQ3ZwBTGOUqxu1h_-dIs2XmU4afrlr5xqaQddE-4HBUq3-7b1w_tm7r2dT6QHOzF_NpQDNMeXeq_exRJESNnQDIPsxOjhQLudjlvFRLpA9uxfRO5EDWIigwoT4mPSgLm0nC7Ep9Oqc69e1GRBpt60CPM0I99EpcWwxfUesyGtAQrYMtvYAJXUfdOufskvXEL6zUcg0KYXsrIEL7OpjqnnP1Ze7Gi7fEERQEWjDAYGcVp0QQhL2gNOALIsNn2jQ3A61YC1zYcdaVhQtxfSPDBZvlxpHNm3mk95pzOW2zJ_mjZnUnxOQoOVUw4DltlCBvMljQ3OjpFUb2COvf3S9bbPf8TpB457ub9mXgwE67ftJ_5TQ6pdigNBrofElDa6Zx5GJ-cFPY38f91iVyciDXO1UotzTokflLQ6MiJfPzD0of_aWYy3IG0WPCO0HRy-nGDamySHhoyQ0rpLtzxLqiAI5vJ-AbMc-LZTZdS_6CvD5EwCc6EJR835SkZ7n0gWZUtcSuHnL5SKPGfkhQ4jVhDfP00~2?etext=2202.EFrpnwCJ8OLr9ZhqyYcdw2rxonJODkRkUgvE4hK6Ekp2ZHh6YWRoaXVvZ3Jzbnlu.3dd14f5bf036fff28e539bc2f6b9a2ac5c4c2104&amp;from=ya.ru%3Bsearch%26%23x2F%3B%3Bweb%3B%3B0%3B&amp;q=%D0%BF%D0%B0%D1%80%D1%81%D0%B8%D0%BD%D0%B3&amp;baobab_event_id=lsusc3y2k7" data-counter="[&quot;b&quot;]" data-log-node="2_cbmrw0g-00" data-event-required="true"><div class="Favicon Favicon_size_m favicon"><div class="Favicon-Icon favicon__icon Favicon-Page0 Favicon-Page0_pos_18" style="width:16px;height:16px;background-size:16px;background-position-y:-288px"></div></div><h2 class="OrganicTitle-LinkText Typo Typo_text_l Typo_line_m organic__url-text"><span class="OrganicTitleContentSpan organic__title" role="text"><b>Парсинг</b> сайтов конкурентов, Поставщиков, контроль цен</span></h2></a>"""

soup = BeautifulSoup(html, "html.parser")

# Find the anchor tag 
anchor_tag = soup.find("a", class_="Link")

# Extract the value of the data-bem attribute
data_bem_value = anchor_tag.get("data-bem")

# Parse data-bem attribute as JSON
data_bem_json = json.loads(data_bem_value)

# Extract the url
url = data_bem_json["click"]["arguments"]["url"]

print(url)

Output:

https://allrival.com/parsing-saitov?yclid=1693735580326690815
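
If the real page contains many result links, or some anchors lack a usable data-bem value, a more defensive variant may help. The following is only a sketch under assumptions: the helper name `extract_bem_urls` and the `sample` markup are made up for illustration, and it assumes every link of interest keeps its URL under `click` → `arguments` → `url`, as in the snippet from the question.

```python
import json

from bs4 import BeautifulSoup


def extract_bem_urls(html):
    """Collect the click URL from every <a> tag that has a data-bem attribute.

    Hypothetical helper: anchors whose data-bem is missing, is not valid
    JSON, or lacks the click/arguments/url path are skipped silently.
    """
    soup = BeautifulSoup(html, "html.parser")
    urls = []
    # attrs={"data-bem": True} matches only anchors that carry the attribute
    for anchor in soup.find_all("a", attrs={"data-bem": True}):
        try:
            url = json.loads(anchor["data-bem"])["click"]["arguments"]["url"]
        except (json.JSONDecodeError, KeyError, TypeError):
            continue  # malformed JSON or an unexpected structure
        urls.append(url)
    return urls


# Made-up sample: one usable anchor, one with broken JSON, one without data-bem
sample = (
    '<a data-bem="{&quot;click&quot;:{&quot;arguments&quot;:'
    '{&quot;url&quot;:&quot;https://example.com/a&quot;}}}" href="#">A</a>'
    '<a data-bem="not json" href="#">B</a>'
    '<a href="#">C</a>'
)

print(extract_bem_urls(sample))
```

With Selenium, the same function can be given `driver.page_source` once the results have finished loading. If the attribute still seems to be missing, it is worth checking whether the page was captured before those elements were rendered.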
