Extracting text from unstructured HTML with Python and Beautiful Soup

Question

Given the HTML below, how can I use regular expressions, Beautiful Soup, and the Python Requests library to extract the text that follows each <br> tag, i.e. aaa and bbb?

<html>
<head></head>
<body>
    <table style="max-width: 600px; margin: auto;">
        <tbody>
            <tr>
                <td>Swan</td>
                <td>Flower</td>
            </tr>
            <tr>
                <td colspan="2" style="background: #ffffff;">
                    <h5>Playground</h5>
                </td>
            </tr>
            <tr>
                <td colspan="2">
                    <strong>Animal:</strong>
                    <br>aaa</td>
            </tr>
            <tr>
                <td colspan="2">
                    <strong>Fish:</strong>
                    <br>bbb</td>
            </tr>
            <tr>
                <td colspan="2" style="text-align: center;">
                    <form method="post">
                        <input type="hidden" name="yyy" value="7777">
                        <input type="hidden" name="rrr" value="wssss">
                        <input type="submit" value="djd ddd" style="width: 250px;">
                    </form>
                </td>
            </tr>
        </tbody>
    </table>
</body>

I tried the code below, but it doesn't seem to work:

import requests
from bs4 import BeautifulSoup

params = {
    'api_key': 'APIKEY',
    'custom_cookies': 'PHPSESSID=SESSIONID,domain=DOMAIN.com;',
}

response = requests.get(
    url='www.example.com',
    params=params,
    timeout=120,
)

soup = BeautifulSoup(response.content, 'html.parser')
results = soup.find_all('td', {'colspan': '2', 'strong': True})

# Extract the text content from each matching element
for result in results:
    br_tag = result.find('br')
    if br_tag:
        text_content = br_tag.next_sibling.strip()
        print(text_content)

Expected output

aaa bbb

Actual output

[]

Answer 1

Try:

from bs4 import BeautifulSoup

html_text = """\
<html>
<head></head>
<body>
    <table style="max-width: 600px; margin: auto;">
        <tbody>
            <tr>
                <td>Swan</td>
                <td>Flower</td>
            </tr>
            <tr>
                <td colspan="2" style="background: #ffffff;">
                    <h5>Playground</h5>
                </td>
            </tr>
            <tr>
                <td colspan="2">
                    <strong>Animal:</strong>
                    <br>aaa</td>
            </tr>
            <tr>
                <td colspan="2">
                    <strong>Fish:</strong>
                    <br>bbb</td>
            </tr>
            <tr>
                <td colspan="2" style="text-align: center;">
                    <form method="post">
                        <input type="hidden" name="yyy" value="7777">
                        <input type="hidden" name="rrr" value="wssss">
                        <input type="submit" value="djd ddd" style="width: 250px;">
                    </form>
                </td>
            </tr>
        </tbody>
    </table>
</body>"""


soup = BeautifulSoup(html_text, "html.parser")

for td in soup.select("td:has(strong)"):
    text = list(td.stripped_strings)[-1]
    print(text)

Prints:

aaa
bbb
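
As for why the original attempt matched nothing: in find_all('td', {'colspan': '2', 'strong': True}) the dictionary is an attribute filter, so Beautiful Soup looks for <td> tags that carry an attribute named strong, not for <td> tags that contain a <strong> child. No cell has such an attribute, so the result is empty. If you would rather keep the br.next_sibling idea from the question, here is a minimal sketch; the helper name text_after_br and the trimmed SAMPLE_HTML are made up for illustration and are not part of the question's code:

```python
from bs4 import BeautifulSoup, NavigableString

# Trimmed copy of the table from the question (illustrative only).
SAMPLE_HTML = """
<table><tbody>
  <tr><td>Swan</td><td>Flower</td></tr>
  <tr><td colspan="2"><h5>Playground</h5></td></tr>
  <tr><td colspan="2"><strong>Animal:</strong>
      <br>aaa</td></tr>
  <tr><td colspan="2"><strong>Fish:</strong>
      <br>bbb</td></tr>
  <tr><td colspan="2"><form method="post">
      <input type="hidden" name="yyy" value="7777"></form></td></tr>
</tbody></table>
"""


def text_after_br(html):
    """Return the text directly after <br> in each colspan=2 cell that has a <strong>."""
    soup = BeautifulSoup(html, "html.parser")
    values = []
    # Filter on the real attribute only; check for the child tag separately.
    for td in soup.find_all("td", attrs={"colspan": "2"}):
        if td.find("strong") is None:
            continue  # skip cells without a label, e.g. the <h5> and <form> cells
        br = td.find("br")
        if br is None:
            continue
        sibling = br.next_sibling
        # next_sibling can be None or another tag, so guard before stripping.
        if isinstance(sibling, NavigableString) and sibling.strip():
            values.append(sibling.strip())
    return values


print(text_after_br(SAMPLE_HTML))
```

With a live page you would pass response.text (or response.content) instead of SAMPLE_HTML. Note also that requests needs a URL with a scheme, for example https://www.example.com; a bare www.example.com raises a MissingSchema error before any HTML is fetched.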
