Extracting the text at href anchor targets

Question

*Update: I now have the href links. I only need to search for all the text between each pair of items.

Here is my code. 1. Get the start and end points.

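import re
import urllib.request  # a bare "import urllib" does not reliably expose urllib.request

from bs4 import BeautifulSoup

# Download the filing and parse it
url = "https://www.sec.gov/Archives/edgar/data/1294017/000119312505142547/0001193125-05-142547.txt"
new_text = urllib.request.urlopen(url)
soup = BeautifulSoup(new_text, 'lxml')
# Collect every <a> tag that carries a name attribute (the anchor targets)
results = soup.find_all("a", {"name": True})
print(results)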

So I get these:

<a name="toc"></a>, <a name="toc51579_1"></a>, <a name="toc51579_2"></a>,

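2. Get the text between each start and end point. Here I want a loop that takes the first and second items from the list above, inserts them into re.search, and returns all the text between them. I am stuck on this and cannot get the loop to work. I think I am making a mistake when I insert the first and second data points into re.search as text.

for en in enumerate(results):
    new_text = re.search(r'' + re.escape(results[i]) + re.escape('.*?') + re.escape(results(i + 1)), soup, re.DOTALL).group()
    print(new_text)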
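
A few things in that loop are likely to trip it up: `i` is never defined (enumerate yields index/item pairs), `results(i + 1)` calls the list instead of indexing it, `re.escape('.*?')` turns the wildcard into literal characters, and `re.search` expects a string rather than a BeautifulSoup object. A minimal sketch of the pairwise idea, working on the raw HTML string with the standard library only, might look like the following. The `raw_html` sample, `strip_tags`, and `sections` are made up for illustration and are not taken from the filing.

```python
import re
from html import unescape

# Hypothetical stand-in for the filing's raw HTML (not the real document).
raw_html = '''<P><A NAME="toc51579_1"></A><B>Summary</B></P>
<P>Summary body text.</P>
<P><A NAME="toc51579_2"></A><B>Risk Factors</B></P>
<P>Risk body text.</P>'''

# Locate every empty named anchor in the raw string, case-insensitively.
anchor_re = re.compile(r'<a\s+name="([^"]+)"\s*>\s*</a>', re.IGNORECASE)
anchors = list(anchor_re.finditer(raw_html))


def strip_tags(fragment):
    """Crudely drop tags, decode entities, and collapse whitespace."""
    text = re.sub(r'<[^>]+>', ' ', fragment)
    return ' '.join(unescape(text).split())


sections = {}
# Pair each anchor with the next one; the last anchor runs to the end of the string.
for i, start in enumerate(anchors):
    end_pos = anchors[i + 1].start() if i + 1 < len(anchors) else len(raw_html)
    sections[start.group(1)] = strip_tags(raw_html[start.end():end_pos])

for name, text in sections.items():
    print(name, '->', text)
```

Slicing between match positions avoids building one large regex per pair, and matching the raw string sidesteps the fact that the parser lowercases tags, so `str(tag)` would not match the uppercase source.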

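Original question:

Assuming I can get the anchor href links, how do I extract the text that lies between the points in the document those hrefs target?

Basically, I have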

<A HREF="#toc51579_1">Summary</A>

and

<A HREF="#toc51579_2">Risk Factors</A>

I want to follow the anchor href to the Summary section and pull all the text up to the Risk Factors section.

For example, starting from

<A NAME="toc51579_1"></A>Summary </B></FONT></P>

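and ending at Risk Factors.

This is my first post, so please bear with me. :)

Thank you very much.

This is the table of contents page. I don't need the text here; it only shows where the anchor hrefs are.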

    <TR>
<TD WIDTH="88%"></TD>
<TD VALIGN="bottom" WIDTH="8%"></TD>
<TD></TD></TR>
<TR>
<TD VALIGN="bottom"><FONT SIZE="1">&nbsp;</FONT></TD>
<TD VALIGN="bottom"><FONT SIZE="1">&nbsp;&nbsp;</FONT></TD>
<TD VALIGN="bottom" ALIGN="center" STYLE="border-bottom:1px solid #000000"><FONT STYLE="font-family:Times New Roman" SIZE="1"><B>Page</B></FONT></TD></TR>
<TR>
<TD VALIGN="top"> <P STYLE="margin-left:1.00em; text-indent:-1.00em"><FONT STYLE="font-family:Times New Roman" SIZE="2"><A HREF="#toc51579_1">Summary</A></FONT></P></TD>
<TD VALIGN="bottom"><FONT SIZE="1">&nbsp;&nbsp;</FONT></TD>
<TD VALIGN="bottom" ALIGN="right"><FONT STYLE="font-family:Times New Roman" SIZE="2">1</FONT></TD></TR>
<TR>
<TD VALIGN="top"> <P STYLE="margin-left:1.00em; text-indent:-1.00em"><FONT STYLE="font-family:Times New Roman" SIZE="2"><A HREF="#toc51579_2">Risk Factors</A></FONT></P></TD>
<TD VALIGN="bottom"><FONT SIZE="1">&nbsp;&nbsp;</FONT></TD>
<TD VALIGN="bottom" ALIGN="right"><FONT STYLE="font-family:Times New Roman" SIZE="2">15</FONT></TD></TR>

Answer 1

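You want the text rather than the actual href value, is that correct? The text is inside the <a> tag. So call .find_all('a'), iterate over the returned elements, and read each one's .text.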

html = '''    <TR>
<TD WIDTH="88%"></TD>
<TD VALIGN="bottom" WIDTH="8%"></TD>
<TD></TD></TR>
<TR>
<TD VALIGN="bottom"><FONT SIZE="1">&nbsp;</FONT></TD>
<TD VALIGN="bottom"><FONT SIZE="1">&nbsp;&nbsp;</FONT></TD>
<TD VALIGN="bottom" ALIGN="center" STYLE="border-bottom:1px solid #000000"><FONT STYLE="font-family:Times New Roman" SIZE="1"><B>Page</B></FONT></TD></TR>
<TR>
<TD VALIGN="top"> <P STYLE="margin-left:1.00em; text-indent:-1.00em"><FONT STYLE="font-family:Times New Roman" SIZE="2"><A HREF="#toc51579_1">Summary</A></FONT></P></TD>
<TD VALIGN="bottom"><FONT SIZE="1">&nbsp;&nbsp;</FONT></TD>
<TD VALIGN="bottom" ALIGN="right"><FONT STYLE="font-family:Times New Roman" SIZE="2">1</FONT></TD></TR>
<TR>
<TD VALIGN="top"> <P STYLE="margin-left:1.00em; text-indent:-1.00em"><FONT STYLE="font-family:Times New Roman" SIZE="2"><A HREF="#toc51579_2">Risk Factors</A></FONT></P></TD>
<TD VALIGN="bottom"><FONT SIZE="1">&nbsp;&nbsp;</FONT></TD>
<TD VALIGN="bottom" ALIGN="right"><FONT STYLE="font-family:Times New Roman" SIZE="2">15</FONT></TD></TR>'''

import bs4

soup = bs4.BeautifulSoup(html, 'html.parser')

alpha = soup.find_all('a')

for ele in alpha:
    print(ele.text)

Output:

Summary
Risk Factors

If there happen to be other <a> tags without an href and you only want the ones that have one, just add that to your find_all():

soup.find_all('a', href=True)
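
If the goal is the text of each section rather than the link labels, one possible approach is to follow each `href="#..."` to the `<a name="...">` it points at and collect the strings that come after it until the next named anchor. The sketch below assumes that layout; the sample `html`, the helper `section_text`, and the `sections` dict are hypothetical names chosen for illustration, and the real filing may need extra cleanup.

```python
from bs4 import BeautifulSoup, NavigableString, Tag

# Hypothetical miniature of the filing: two TOC links plus their targets.
html = '''<P><A HREF="#toc51579_1">Summary</A></P>
<P><A HREF="#toc51579_2">Risk Factors</A></P>
<P><A NAME="toc51579_1"></A><B>Summary</B></P>
<P>Summary body text.</P>
<P><A NAME="toc51579_2"></A><B>Risk Factors</B></P>
<P>Risk body text.</P>'''

soup = BeautifulSoup(html, 'html.parser')


def section_text(soup, target_name):
    """Collect the text after <a name=target_name> up to the next named anchor."""
    # attrs= is needed because find() already uses "name" for the tag name.
    start = soup.find('a', attrs={'name': target_name})
    if start is None:
        return None
    parts = []
    for node in start.next_elements:
        if isinstance(node, Tag):
            # Stop as soon as the next section's target anchor appears.
            if node.name == 'a' and node.has_attr('name'):
                break
        elif isinstance(node, NavigableString):
            text = node.strip()
            if text:
                parts.append(text)
    return ' '.join(parts)


sections = {}
for link in soup.find_all('a', href=True):
    href = link['href']
    if href.startswith('#'):  # only in-page links have a target in this document
        sections[link.get_text(strip=True)] = section_text(soup, href[1:])

for title, text in sections.items():
    print(title, '->', text)
```

Walking `next_elements` keeps the logic independent of how deeply the anchors are nested in tables or font tags, at the cost of flattening all formatting into plain text.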
