How to extract the URL from <a href="TextWithUrlBehind">Something</a> using BeautifulSoup?

Question

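I am trying to extract some links and text from a web page into a .json file.

I have parsed the HTML down to tbody > tr > td, and each td contains

<a href="TextWithUrlBehind">Something</a>

But in Inspect Element, this

TextWithUrlBehind

is clickable and has a link attached to it. It is not the familiar

<a href=https//...>

So the href I extract is

str: TextWithUrlBehind

and the text that ends up in the .json file is

text(also str):Something

The code looks like this: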

            rows = test_results_table.find_all("tr")
            
            # Iterate over each table row and look for an anchor in its first cell
            for row in rows:
                first_cell = row.find("td")
                if first_cell:
                    anchor_tag = first_cell.find("a", href=True)
                    self._debug_print("Anchor tag content:", anchor_tag)
                    if anchor_tag:
                        href = anchor_tag["href"]
                        text = anchor_tag.get_text(strip=True)
                        links.append({"href": href, "text": text})
                        self._debug_print("Content extracted:", {"href": href, "text": text})
                    else:
                        self._debug_print("No anchor tag found in cell:", first_cell)
                else:
                    self._debug_print("No table cell found in row:", row)

I don't understand how the link is attached in the HTML, and I don't know which BeautifulSoup built-in functions could help me get it.

Answer 1

from bs4 import BeautifulSoup as bs
import requests

# Replace <your url> with the URL you want to scrape
url = '<your url>'

r = requests.get(url)
soup = bs(r.content, "html.parser")
links = soup.find_all("a")

# Create an empty dict
dct = {}
for x in links:

    # Keys are the clickable text, values are the links they point to.
    # get_text() is used instead of .string, which is None when the
    # anchor contains nested tags.
    key = x.get_text(strip=True)
    val = x.get("href")
    dct[key] = val

print(dct)

The output will be a dictionary whose keys are the clickable text and whose values are the links that text points to when clicked.
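
If the href you get back is something like TextWithUrlBehind rather than a full https://... address, it is most likely a relative URL, which the browser resolves against the address of the page itself. That is an assumption about your page, since the real markup is not shown. Under that assumption, a minimal sketch of the row-by-row extraction from the question, with each href turned into an absolute URL via the standard library's urljoin, could look like this. BASE_URL, HTML, and extract_links are made-up names and sample data for illustration only.

```python
import json
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical page address and markup, used only for illustration.
BASE_URL = "https://example.com/reports/index.html"
HTML = """
<table><tbody>
  <tr><td><a href="TextWithUrlBehind">Something</a></td><td>other</td></tr>
  <tr><td>no link here</td></tr>
</tbody></table>
"""


def extract_links(html, base_url):
    """Return href/text pairs from the first cell of each row, with absolute URLs."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for row in soup.find_all("tr"):
        first_cell = row.find("td")
        if first_cell is None:
            continue  # row without any table cell
        anchor = first_cell.find("a", href=True)
        if anchor is None:
            continue  # first cell has no link
        links.append({
            # Resolve the (possibly relative) href against the page address.
            "href": urljoin(base_url, anchor["href"]),
            "text": anchor.get_text(strip=True),
        })
    return links


links = extract_links(HTML, BASE_URL)
print(json.dumps(links, indent=2))
```

In real use you would pass the URL you actually requested as base_url. If the page declares a <base href="..."> tag, that value would need to be used as the base instead; the sketch does not handle that case.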
