getchildren() 到 LXML

问题描述 投票:0回答:1

我必须转换一些漂亮的汤代码。 基本上我想要的只是获取主体节点的所有子节点并选择其中包含文本并存储它们。 这是 bs4 的代码:

def get_children(self, tag, dorecursive=False):

    children = []
    if not tag :
        return children

    for t in tag.findChildren(recursive=dorecursive):
        if t.name in self.text_containers \
                and len(t.text) > self.min_text_length \
                and self.is_valid_tag(t):
            children.append(t)
    return children

这个效果很好 当我用 lxml lib 尝试这个时,孩子是空的:

def get_children(self, tag, dorecursive=False):

    children = []
    if not tag :
        return children

    tags = tag.getchildren()
    for t in tags:
        #print(t.tag)
        if t.tag in self.text_containers \
                and len(t.tail) > self.min_text_length \
                and self.is_valid_tag(t):
            children.append(t)
    return children

有什么想法吗?

beautifulsoup lxml
1个回答
1
投票

代码:

import lxml.html
import requests


class TextTagManager:
    TEXT_CONTAINERS = {
        'li',
        'p',
        'span',
        *[f'h{i}' for i in range(1, 6)]
    }
    MIN_TEXT_LENGTH = 60

    def is_valid_tag(self, tag):
        # put some logic here
        return True

    def get_children(self, tag, recursive=False):
        children = []

        tags = tag.findall('.//*' if recursive else '*')
        for t in tags:
            if (t.tag in self.TEXT_CONTAINERS and
                    t.text and
                    len(t.text) > self.MIN_TEXT_LENGTH and
                    self.is_valid_tag(t)):
                children.append(t)
        return children


manager = TextTagManager()

url = 'https://en.wikipedia.org/wiki/Comparison_of_HTML_parsers'
html = requests.get(url).text
doc = lxml.html.fromstring(html)

for child in manager.get_children(doc, recursive=True):
    print(child.tag, ' -> ', child.text)

输出:

li  ->  HTML traversal: offer an interface for programmers to easily access and modify of the "HTML string code". Canonical example: 
li  ->  HTML clean: to fix invalid HTML and to improve the layout and indent style of the resulting markup. Canonical example: 

.getchildren()
返回所有 direct 子级。如果你想有一个递归选项,你可以使用
.findall()
:

tags = tag.findall('.//*' if recursive else '*')

这个答案应该可以帮助您理解

.//tag
tag
之间的区别。

© www.soinside.com 2019 - 2024. All rights reserved.