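I'm a beginner who started using Python for web scraping only last week. I have a question about extracting information by HTML tag, particularly when working with tags. On the page "https://nces.ed.gov/ccd/districtsearch/district_detail.asp?Search=1&details=1&State=06&DistrictType=1&DistrictType=2&DistrictType=3&DistrictType=4&DistrictType=5&DistrictType=6&DistrictType=7&DistrictType=8&DistrictType=9&NumOfStudentsRange=more&NumOfSchoolsRange=more&DistrictPageNum=1&ID2=0601620", I want to scrape the "Website" value located in the lower-left part of the table.
With my limited experience I have tried various approaches without finding a solution, and I'm trying to understand how to solve this. I need to perform this extraction dynamically for a total of 2115 pages. The URL I'm trying to extract from the HTML below is "http://www.abcusd.us". I would greatly appreciate any advice. Thank you.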
<tbody><tr>
<td valign="top" width="220"><b><font size="2">District Name:</font></b><br>
<font size="3">ABC Unified<br></font><font size="2"><a href="../schoolsearch/school_list.asp?Search=1&DistrictID=0601620">schools for this district</a></font></td>
<td valign="top" width="220">
<b><font size="2">NCES District ID:</font></b><br><font size="3">0601620</font></td>
<td valign="top"><b><font size="2">State District ID:</font></b><br><font size="3">
CA-1964212</font></td>
</tr>
<tr>
<td valign="top" width="220"><img border="0" src="/ccd/commonfiles/images/spacer.gif" width="34" height="1"></td>
<td valign="top" width="220"><img border="0" src="/ccd/commonfiles/images/spacer.gif" width="34" height="1"></td>
<td valign="top"><img border="0" src="/ccd/commonfiles/images/spacer.gif" width="34" height="1"></td>
</tr>
<tr>
<td valign="top" width="220"><b><font size="2">Mailing Address:</font></b><br><font size="3">16700 Norwalk BLVD.<br>Cerritos, CA 90703-1838</font></td><td valign="top" width="40%"><strong><font size="2">Physical Address:</font> <a href="/ccd/schoolmap/#district_ids=0601620" title="Map latest data in the School & District Navigator" target="_blank"><img style="height:20px;vertical-align:middle;margin-bottom:-20px;margin-top:-32px" src="/ccd/commonfiles/images/mapapp_icon.png" title="Map latest data in the School & District Navigator"></a></strong><br><font size="3"><a href="/ccd/schoolmap/#district_ids=0601620" title="Map latest data in the School & District Navigator" target="_blank">16700 Norwalk BLVD.<br>Cerritos, CA 90703-1838</a></font></td>
<td valign="top"><b><font size="2">Phone:</font></b><br><font size="3">
(562)926-5566</font>
</td>
</tr>
<tr>
<td valign="top">
<p align=""><b><font size="2">Type: </font></b><br>
<font size="3">Regular local school district</font>
</p></td>
<td valign="top">
<p align=""><b><font size="2">Status:</font></b><br>
<font size="3">Open</font>
</p></td>
<td valign="top">
<p align=""><b><font size="2">Total Schools:</font></b><br>
<font size="3">31</font>
</p></td>
</tr>
<tr>
<td valign="top">
<b><font size="2">Supervisory Union #: </font></b><br>
<font size="3">N/A</font>
</td>
<td valign="top" colspan="2">
<b><font size="2">Grade Span: </font></b>
<font size="2"> (grades KG - 12)</font>
<br>
<table><tbody><tr><td><table border="0" bordercolor="#134F8A" cellspacing="0" cellpadding="0" bgcolor="#134F8A"><tbody><tr><td width="100%" bordercolor="#134F8A"><table border="0" cellspacing="1" cellpadding="0"><tbody><tr><td width="16" align="center" bgcolor="#23619E"><font size="2"> </font></td><td width="16" align="center" bgcolor="#F5F196"><font size="2">KG</font></td><td width="16" align="center" bgcolor="#F5F196"><font size="2">1</font></td><td width="16" align="center" bgcolor="#F5F196"><font size="2">2</font></td><td width="16" align="center" bgcolor="#F5F196"><font size="2">3</font></td><td width="16" align="center" bgcolor="#F5F196"><font size="2">4</font></td><td width="16" align="center" bgcolor="#F5F196"><font size="2">5</font></td><td width="16" align="center" bgcolor="#F5F196"><font size="2">6</font></td><td width="16" align="center" bgcolor="#F5F196"><font size="2">7</font></td><td width="16" align="center" bgcolor="#F5F196"><font size="2">8</font></td><td width="16" align="center" bgcolor="#F5F196"><font size="2">9</font></td><td width="16" align="center" bgcolor="#F5F196"><font size="2">10</font></td><td width="16" align="center" bgcolor="#F5F196"><font size="2">11</font></td><td width="16" align="center" bgcolor="#F5F196"><font size="2">12</font></td></tr></tbody></table></td></tr></tbody></table></td></tr></tbody></table>
</td>
</tr>
<tr><td><font size="2"><strong>Website: </strong><br></font><font size="2"><a href="/transfer.asp?location=www.abcusd.us" target="_blank">http://www.abcusd.us</a></font></td><td valign="top" width="40%"><strong><font size="2">District Demographics:</font><a href="/Programs/Edge/ACSDashboard/0601620" title="View data for your district in the School District Demographic Dashboard" target="_blank"><img style="height:16px;vertical-align:bottom;margin-bottom:4px;margin-left:2px" src="/ccd/commonfiles/images/ddg_icon.png" title="View data for your district in the School District Demographic Dashboard"></a></strong><a href="/Programs/Edge/ACSDashboard/0601620" title="View data for your district in the School District Demographic Dashboard" target="_blank"><br><font size="2">School District Demographic Dashboard</font></a></td></tr><tr></tr>
</tbody>
import requests
from bs4 import BeautifulSoup

c_url = 'https://nces.ed.gov/ccd/districtsearch/district_detail.asp?Search=1&details=1&State=06&DistrictType=1&DistrictType=2&DistrictType=3&DistrictType=4&DistrictType=5&DistrictType=6&DistrictType=7&DistrictType=8&DistrictType=9&NumOfStudentsRange=more&NumOfSchoolsRange=more&DistrictPageNum=1&ID2=0601620'

response = requests.get(c_url)

if response.status_code == 200:
    n_soup = BeautifulSoup(response.text, 'html.parser')

    # Find all <strong> tags containing "Website:"
    website_strong_tags = n_soup.find_all('strong', text=lambda text: text and "Website:" in text)

    # Check if there are any matching <strong> tags
    if website_strong_tags:
        # Get the first matching <strong> tag
        first_website_strong_tag = website_strong_tags[0]

        # Extract the "Website" value
        website_value = first_website_strong_tag.next_sibling.strip()
        print("Website:", website_value)

        # Find the <a> tag within the same <td> containing the "Website:" text
        a_tag = first_website_strong_tag.find_next('a')

        # Check if the <a> tag exists before extracting the URL
        if a_tag:
            url = a_tag['href']
            print("URL:", url)
        else:
            print("No URL found for the 'Website:' link.")
    else:
        print("No 'Website:' value found on the page.")
else:
    print("Failed to retrieve data from the URL. Status code:", response.status_code)
Error:

TypeError                                 Traceback (most recent call last)
Cell In[240], line 20
     17         first_website_strong_tag = website_strong_tags[0]
     19         # Extract the "Website" value
---> 20         website_value = first_website_strong_tag.next_sibling.strip()
     21         print("Website:", website_value)
     23         # Find the <a> tag within the same <td> containing the "Website:" text
Expected: Website: http://www.abcusd.us
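I thought XPath might be a good approach here, but I quickly found that BeautifulSoup does not support it natively:
"Can we use XPath with BeautifulSoup? Technically, no. But we can use BeautifulSoup4 together with the lxml Python library to achieve that."
So, if you're willing to give it a try, something like this might work. I tested my XPath against your site, but I did not test the rest of the code: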
import requests
from bs4 import BeautifulSoup
from lxml import etree

c_url = \
    'https://nces.ed.gov/ccd/districtsearch/district_detail.asp?Search=1&details=1&State=06&DistrictType=1&DistrictType=2&DistrictType=3&DistrictType=4&DistrictType=5&DistrictType=6&DistrictType=7&DistrictType=8&DistrictType=9&NumOfStudentsRange=more&NumOfSchoolsRange=more&DistrictPageNum=1&ID2=0601620'
response = requests.get(c_url)

if response.status_code == 200:
    soup = BeautifulSoup(response.content, 'html.parser')
    body = soup.find('body')
    dom = etree.HTML(str(body))  # Re-parse the page body with lxml so XPath can be used

    # XPath that leads from the "Website: " label to the link holding the website URL
    xpath_str = \
        '//strong[text()="Website: "]/parent::font/following-sibling::font/a'
    matches = dom.xpath(xpath_str)
    if matches:  # Guard against pages where the XPath matches nothing
        print(matches[0].text)  # print() is a function in Python 3
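
As for the TypeError in the question, it appears to come from `next_sibling`: in the posted HTML, the node right after `<strong>Website: </strong>` is a `<br>` tag rather than a string, so it has no usable `.strip()` method. If you would rather stay with BeautifulSoup alone, one option is to go from the label up to its enclosing `<td>` and take the first link in that cell. The following is only a minimal sketch under those assumptions: `SAMPLE_HTML` is a trimmed copy of the cell from the question, `extract_website` is a hypothetical helper name (not part of any library), and it has not been run against the live site.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the "Website" cell from the HTML posted in the question.
SAMPLE_HTML = """
<table><tbody><tr>
<td><font size="2"><strong>Website: </strong><br></font>
<font size="2"><a href="/transfer.asp?location=www.abcusd.us" target="_blank">http://www.abcusd.us</a></font></td>
</tr></tbody></table>
"""


def extract_website(html):
    """Hypothetical helper: return the link text next to 'Website:', or None."""
    soup = BeautifulSoup(html, "html.parser")

    # Find the <strong> label; the callable tolerates trailing whitespace.
    label = soup.find("strong", string=lambda s: s is not None and "Website:" in s)
    if label is None:
        return None

    # Stay inside the label's own table cell, so that a page without a
    # website does not pick up a link from a neighbouring cell.
    cell = label.find_parent("td")
    if cell is None:
        return None

    link = cell.find("a")
    if link is None:
        return None

    # The href is a redirect path (/transfer.asp?location=...), so read the
    # visible link text, which holds the full URL.
    return link.get_text(strip=True)


website = extract_website(SAMPLE_HTML)
print("Website:", website)  # prints: Website: http://www.abcusd.us
```

For the 2115 pages, the same helper could be called on `response.text` inside a loop over the page URLs, ideally reusing one `requests.Session` and pausing briefly between requests. Pages where the helper returns `None` are worth logging so you can check them by hand.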