How to scrape a value from <tr><td><font size="2"><strong> tags

Question (votes: 0, answers: 1)

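I'm a beginner who only started web scraping with Python last week. I have a question about how to extract information using HTML tags, particularly when dealing with the nested tags named in the title. On the page "https://nces.ed.gov/ccd/districtsearch/district_detail.asp?Search=1&details=1&State=06&DistrictType=1&DistrictType=2&DistrictType=3&DistrictType=4&DistrictType=5&DistrictType=6&DistrictType=7&DistrictType=8&DistrictType=9&NumOfStudentsRange=more&NumOfSchoolsRange=more&DistrictPageNum=1&ID2=0601620", I want to scrape the "Website" value located in the table at the bottom left.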

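With my limited experience I have tried various approaches but have not found a solution, and I am struggling to understand how to tackle this. I need to perform this extraction dynamically across a total of 2,115 pages. The URL I am trying to extract from the HTML below is "http://www.abcusd.us". I would greatly appreciate any advice. Thank you.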

<tbody><tr>
    <td valign="top" width="220"><b><font size="2">District Name:</font></b><br>
        <font size="3">ABC Unified<br></font><font size="2"><a href="../schoolsearch/school_list.asp?Search=1&amp;DistrictID=0601620">schools for this district</a></font></td>
    <td valign="top" width="220">
        <b><font size="2">NCES District ID:</font></b><br><font size="3">0601620</font></td>
    <td valign="top"><b><font size="2">State District ID:</font></b><br><font size="3">
        CA-1964212</font></td>
    </tr>
    

    <tr>
    <td valign="top" width="220"><img border="0" src="/ccd/commonfiles/images/spacer.gif" width="34" height="1"></td>
    <td valign="top" width="220"><img border="0" src="/ccd/commonfiles/images/spacer.gif" width="34" height="1"></td>
    <td valign="top"><img border="0" src="/ccd/commonfiles/images/spacer.gif" width="34" height="1"></td>
    </tr>
    <tr>    
<td valign="top" width="220"><b><font size="2">Mailing Address:</font></b><br><font size="3">16700 Norwalk BLVD.<br>Cerritos,&nbsp;CA&nbsp;90703-1838</font></td><td valign="top" width="40%"><strong><font size="2">Physical Address:</font> <a href="/ccd/schoolmap/#district_ids=0601620" title="Map latest data in the School &amp; District Navigator" target="_blank"><img style="height:20px;vertical-align:middle;margin-bottom:-20px;margin-top:-32px" src="/ccd/commonfiles/images/mapapp_icon.png" title="Map latest data in the School &amp; District Navigator"></a></strong><br><font size="3"><a href="/ccd/schoolmap/#district_ids=0601620" title="Map latest data in the School &amp; District Navigator" target="_blank">16700 Norwalk BLVD.<br>Cerritos,&nbsp;CA&nbsp;90703-1838</a></font></td>

    <td valign="top"><b><font size="2">Phone:</font></b><br><font size="3">
        (562)926-5566</font>
    </td>
    </tr>
    <tr>
    <td valign="top">
        <p align=""><b><font size="2">Type: </font></b><br>
        <font size="3">Regular local school district</font>
    </p></td>
    <td valign="top">
        <p align=""><b><font size="2">Status:</font></b><br>
        <font size="3">Open</font>
    </p></td>
    <td valign="top">
        <p align=""><b><font size="2">Total Schools:</font></b><br>
        <font size="3">31</font>
    </p></td>
    </tr>

    <tr>
        <td valign="top">
            <b><font size="2">Supervisory Union #: </font></b><br>
            <font size="3">N/A</font>
        </td>
        <td valign="top" colspan="2">
            <b><font size="2">Grade Span: </font></b>
                <font size="2"> (grades KG - 12)</font>
            <br>
            
            <table><tbody><tr><td><table border="0" bordercolor="#134F8A" cellspacing="0" cellpadding="0" bgcolor="#134F8A"><tbody><tr><td width="100%" bordercolor="#134F8A"><table border="0" cellspacing="1" cellpadding="0"><tbody><tr><td width="16" align="center" bgcolor="#23619E"><font size="2">&nbsp;&nbsp;</font></td><td width="16" align="center" bgcolor="#F5F196"><font size="2">KG</font></td><td width="16" align="center" bgcolor="#F5F196"><font size="2">1</font></td><td width="16" align="center" bgcolor="#F5F196"><font size="2">2</font></td><td width="16" align="center" bgcolor="#F5F196"><font size="2">3</font></td><td width="16" align="center" bgcolor="#F5F196"><font size="2">4</font></td><td width="16" align="center" bgcolor="#F5F196"><font size="2">5</font></td><td width="16" align="center" bgcolor="#F5F196"><font size="2">6</font></td><td width="16" align="center" bgcolor="#F5F196"><font size="2">7</font></td><td width="16" align="center" bgcolor="#F5F196"><font size="2">8</font></td><td width="16" align="center" bgcolor="#F5F196"><font size="2">9</font></td><td width="16" align="center" bgcolor="#F5F196"><font size="2">10</font></td><td width="16" align="center" bgcolor="#F5F196"><font size="2">11</font></td><td width="16" align="center" bgcolor="#F5F196"><font size="2">12</font></td></tr></tbody></table></td></tr></tbody></table></td></tr></tbody></table>
        </td>
    </tr>

<tr><td><font size="2"><strong>Website: </strong><br></font><font size="2"><a href="/transfer.asp?location=www.abcusd.us" target="_blank">http://www.abcusd.us</a></font></td><td valign="top" width="40%"><strong><font size="2">District Demographics:</font><a href="/Programs/Edge/ACSDashboard/0601620" title="View data for your district in the School District Demographic Dashboard" target="_blank"><img style="height:16px;vertical-align:bottom;margin-bottom:4px;margin-left:2px" src="/ccd/commonfiles/images/ddg_icon.png" title="View data for your district in the School District Demographic Dashboard"></a></strong><a href="/Programs/Edge/ACSDashboard/0601620" title="View data for your district in the School District Demographic Dashboard" target="_blank"><br><font size="2">School District Demographic Dashboard</font></a></td></tr><tr></tr>

    </tbody>

import requests
from bs4 import BeautifulSoup

c_url = 'https://nces.ed.gov/ccd/districtsearch/district_detail.asp?Search=1&details=1&State=06&DistrictType=1&DistrictType=2&DistrictType=3&DistrictType=4&DistrictType=5&DistrictType=6&DistrictType=7&DistrictType=8&DistrictType=9&NumOfStudentsRange=more&NumOfSchoolsRange=more&DistrictPageNum=1&ID2=0601620'

response = requests.get(c_url)

if response.status_code == 200:
    n_soup = BeautifulSoup(response.text, 'html.parser')
    
    # Find all <strong> tags containing "Website:"
    website_strong_tags = n_soup.find_all('strong', text=lambda text: text and "Website:" in text)
    
    # Check if there are any matching <strong> tags
    if website_strong_tags:
        # Get the first matching <strong> tag
        first_website_strong_tag = website_strong_tags[0]
        
        # Extract the "Website" value
        website_value = first_website_strong_tag.next_sibling.strip()
        print("Website:", website_value)
        
        # Find the <a> tag within the same <td> containing the "Website:" text
        a_tag = first_website_strong_tag.find_next('a')
        
        # Check if the <a> tag exists before extracting the URL
        if a_tag:
            url = a_tag['href']
            print("URL:", url)
        else:
            print("No URL found for the 'Website:' link.")
    else:
        print("No 'Website:' value found on the page.")
else:
    print("Failed to retrieve data from the URL. Status code:", response.status_code)

Error : TypeError                                 Traceback (most recent call last)
Cell In[240], line 20
     17 first_website_strong_tag = website_strong_tags[0]
     19 # Extract the "Website" value
---> 20 website_value = first_website_strong_tag.next_sibling.strip()
     21 print("Website:", website_value)
     23 # Find the <a> tag within the same <td> containing the "Website:" text

Expected: Website: http://www.abcusd.us
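
The traceback above appears consistent with `next_sibling` being the `<br>` element rather than a text node: in the HTML shown earlier, `<strong>Website: </strong>` is followed directly by `<br>`, and BeautifulSoup resolves an unknown attribute such as `.strip` on a tag as a child-tag lookup that returns `None`, so calling it raises `TypeError`. Below is a minimal sketch of a BeautifulSoup-only way around this, run against a trimmed copy of the "Website:" row instead of the live page. The helper name `extract_website` and the constant `SAMPLE_HTML` are hypothetical, and the sketch assumes the live page has the same structure as the posted HTML.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the "Website:" row from the HTML in the question.
SAMPLE_HTML = """
<table><tbody>
<tr><td><font size="2"><strong>Website: </strong><br></font><font size="2"><a href="/transfer.asp?location=www.abcusd.us" target="_blank">http://www.abcusd.us</a></font></td></tr>
</tbody></table>
"""


def extract_website(html):
    """Return the visible text of the link in the 'Website:' cell, or None."""
    soup = BeautifulSoup(html, "html.parser")
    # Locate the <strong> label whose text contains "Website:".
    label = soup.find("strong", string=lambda s: s and "Website:" in s)
    if label is None:
        return None
    # The link sits in the same <td> as the label, so search from that cell.
    cell = label.find_parent("td")
    link = cell.find("a") if cell else None
    return link.get_text(strip=True) if link else None


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
label = soup.find("strong", string=lambda s: s and "Website:" in s)
# The node right after the label is the <br> tag, not a string.
print(label.next_sibling.name)
print(extract_website(SAMPLE_HTML))
```

Reading the link's text rather than its `href` matters here, because the `href` in the posted HTML is the redirect path `/transfer.asp?location=www.abcusd.us`, not the expected `http://www.abcusd.us`.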

python web-scraping beautifulsoup html-lists
1 answer
0
votes

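I thought XPath might be a good approach here, but I quickly found out that BeautifulSoup does not support it natively:

Can we use XPath with BeautifulSoup? Technically, no. But we can use BeautifulSoup4 together with the lxml Python library to achieve this.

Source

So, if you are willing to give it a try, something like this might work. I have tested my XPath against your site, but not the rest of the code: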

import requests
from bs4 import BeautifulSoup
from lxml import etree

c_url = \
    'https://nces.ed.gov/ccd/districtsearch/district_detail.asp?Search=1&details=1&State=06&DistrictType=1&DistrictType=2&DistrictType=3&DistrictType=4&DistrictType=5&DistrictType=6&DistrictType=7&DistrictType=8&DistrictType=9&NumOfStudentsRange=more&NumOfSchoolsRange=more&DistrictPageNum=1&ID2=0601620'

response = requests.get(c_url)

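if response.status_code == 200:
    soup = BeautifulSoup(response.content, 'html.parser')
    body = soup.find('body')

    # Re-parse the <body> markup with lxml so that XPath queries can be run on it
    dom = etree.HTML(str(body))

    # XPath from the "Website: " label to the link in the neighbouring <font> tag
    xpath_str = '//strong[text()="Website: "]/parent::font/following-sibling::font/a'
    links = dom.xpath(xpath_str)
    if links:
        print(links[0].text)
    else:
        print("No 'Website:' link found on the page.")
else:
    print("Failed to retrieve data from the URL. Status code:", response.status_code)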
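
The question also mentions repeating the extraction for 2,115 pages. The following is a minimal sketch of one way to structure that loop, assuming the pages differ only in the `ID2` query parameter (an assumption; the question does not say how the other pages are addressed). The names `build_detail_url`, `scrape_many`, `fetch`, and `extract` are hypothetical, as is the second district ID, and fetching and parsing are passed in as callables so the loop can be tried without network access.

```python
from urllib.parse import urlencode

BASE_URL = "https://nces.ed.gov/ccd/districtsearch/district_detail.asp"


def build_detail_url(district_id):
    """Build a detail-page URL with the same query parameters as the question's URL."""
    params = [("Search", "1"), ("details", "1"), ("State", "06")]
    # DistrictType is repeated once for each value 1..9 in the original URL.
    params += [("DistrictType", str(n)) for n in range(1, 10)]
    params += [
        ("NumOfStudentsRange", "more"),
        ("NumOfSchoolsRange", "more"),
        ("DistrictPageNum", "1"),
        ("ID2", district_id),
    ]
    return BASE_URL + "?" + urlencode(params)


def scrape_many(district_ids, fetch, extract):
    """Map each district ID to extract(fetch(url)); failures are recorded as None."""
    results = {}
    for district_id in district_ids:
        try:
            html = fetch(build_detail_url(district_id))
            results[district_id] = extract(html)
        except Exception:
            # Keep going so that one bad page does not abort the whole run.
            results[district_id] = None
    return results


# Demonstration with stand-in callables instead of real HTTP requests.
def fake_fetch(url):
    return "<html>" + url + "</html>"


def fake_extract(html):
    return "0601620" in html


print(build_detail_url("0601620"))
# "0600001" is a made-up ID used only to exercise the loop.
print(scrape_many(["0601620", "0600001"], fake_fetch, fake_extract))
```

With real I/O, `fetch` could be something like `lambda url: requests.get(url, timeout=30).text`, and `extract` a function that pulls out the "Website" link. Adding a short pause between requests is a common courtesy toward the server.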