Problem iterating over a BeautifulSoup list and parsing it into HTML tags and data

Question

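Python 3 programmer, new to BeautifulSoup and HTMLParser. I am using BeautifulSoup to get all the definition-list data from an HTML file, and I am trying to store the dt data and the dd data in a Python dictionary as key-value pairs. My HTML file (List_page.html) is: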

<!DOCTYPE html>
<html lang="en">
<head>STH here</head>
<body>
    <!--some irrelevant things here-->
    <dl class="key_value">
        <dt>Sine</dt>
        <dd>The ratio of the length of the opposite side to the length of the hypotenuse.</dd>
        <dt>Cosine</dt>
        <dd>The ratio of the length of the adjacent side to the length of the hypotenuse.</dd>
    </dl>
    <!--some irrelevant things here-->
</body>
</html>

And my Python code is:

from bs4 import BeautifulSoup
from html.parser import HTMLParser

dt = []
dd = []
dl = {}

class DTParser(HTMLParser):
    def handle_data(self, data):
        dt.append(data)

class DDParser(HTMLParser):
    def handle_data(self, data):
        dd.append(data)

html_page = open("List_page.html")
soup = BeautifulSoup(html_page, features="lxml")

dts = soup.select("dt")
parser = DTParser()

# Start of part 1:
parser.feed(str(dts[0]).replace('\n', ''))
parser.feed(str(dts[1]).replace('\n', ''))
# end of part 1

dds = soup.select("dd")
parser = DDParser()

# Start of part 2
parser.feed(str(dds[0]).replace('\n', ''))
parser.feed(str(dds[1]).replace('\n', ''))
# End of part 2

dl = dict(zip(dt, dd))
print(dl)

This prints the contents correctly, as expected. However, when I replace part 1 (or part 2) with a for loop, it starts to go wrong.

For example, this code:

# Similar change for part 2
for dt in dts:
    parser.feed(str(dts[0]).replace('\n', ''))

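only gives me the definition of Cosine in this case, not Sine. With 2 items I can get by without a loop, but what if I have more items? So I would like to know the correct way to do this. Thanks.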

Answer 1

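You are using dts[0], which fetches the first element of dts on every iteration of the for loop, instead of letting the loop advance the index. Change it to: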

for i in range(len(dts)):
    parser.feed(str(dts[i]).replace('\n', ''))

and

for i in range(len(dds)):
    parser.feed(str(dds[i]).replace('\n', ''))
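
Two side notes, offered as a sketch rather than as the only correct approach. First, in the question's loop, `for dt in dts` rebinds the global list `dt` to a tag object, so a different loop variable name is safer. Second, BeautifulSoup can return the text of each tag directly through `get_text()`, so the two HTMLParser subclasses may not be needed at all. The snippet below assumes the `<dt>` and `<dd>` tags alternate one-to-one. It embeds the `<dl>` from the question as a string so it runs without List_page.html, and it uses the standard-library `"html.parser"` backend instead of `"lxml"`. The names `HTML`, `definitions_from_html`, `terms` and `meanings` are made up for this illustration.

```python
from bs4 import BeautifulSoup

# Inline copy of the <dl> from the question, so the sketch runs without List_page.html.
HTML = """
<dl class="key_value">
    <dt>Sine</dt>
    <dd>The ratio of the length of the opposite side to the length of the hypotenuse.</dd>
    <dt>Cosine</dt>
    <dd>The ratio of the length of the adjacent side to the length of the hypotenuse.</dd>
</dl>
"""


def definitions_from_html(html):
    """Return {term: definition} for the <dt>/<dd> tags found in html."""
    # Assumption: "html.parser" (stdlib backend) is used here; the question uses "lxml".
    soup = BeautifulSoup(html, "html.parser")
    # Loop variable is named "tag", so no module-level list gets overwritten.
    terms = [tag.get_text(strip=True) for tag in soup.select("dt")]
    meanings = [tag.get_text(strip=True) for tag in soup.select("dd")]
    # zip pairs the n-th <dt> with the n-th <dd>; assumes they match one-to-one.
    return dict(zip(terms, meanings))


definitions = definitions_from_html(HTML)
for term, meaning in definitions.items():
    print(term, "->", meaning)
```

To read from the file instead, something like `with open("List_page.html") as f: definitions_from_html(f.read())` should work, and the file is closed automatically when the block ends.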
