lxml: get all items, but also test the next one - Python

Question

I'm having trouble parsing this XML with lxml. I'm using Python 3.6.9.

It looks like this:

<download date="22/05/2020 08:34">
    <link url="http://xpto" document="y"/>
    <link url="http://xpto" document="y"/>
    <subjects number="2"><subject>Text explaining the previous link</subject><subject>Another text explaining the previous link</subject></subjects>
    <link url="http://xpto" document="z"/>
    <subjects number="1"><subject>Text explaining the previous link</subject></subjects>
    <link url="http://xpto" document="y"/>
    <link url="http://xpto" document="z"/>
</download>

Currently, I can get all the links with the following (this part is easy):

import requests
from lxml import html 
response = html.fromstring(requests.post(url_post, data=data).content)
links = response.xpath('//link')

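As the XML shows, a subjects element, when present, explains the link right before it. Sometimes it contains more than one subject (as in the example above, where one subjects element has number="2", meaning it holds two subject items, while the other holds only one). This is a large XML file, so this pattern (a run of links, then one link followed by an explanation) occurs often.

How can I build a query that gets all of these links and, when a subjects element sits next to a link (more precisely, right after it), groups the two together or inserts the subjects into the link?

My dream result would be something like this: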

<link url="http://xpto" document="y" subjects="Text explaining the previous link| Another text explaining the thing"/>

A list containing both the links and the subjects would also help a lot.

[
[<link url="http://xpto" document="y"/>], 
[<link url="http://xpto" document="y"/>, <subjects number="2"><subject>Text explaining the previous link</subject><subject>Another text explaining the previous link</subject></subjects>],
[<link url="http://xpto" document="y"/>], 
]

Different suggestions are of course welcome.

Thanks, guys!

Answer 1

This does what you need:

from lxml import html

example = """
<link url="some_url" document="a"/>
<link url="some_url" document="b"/>
<subjects><subject>some text</subject></subjects>
<link url="some_url" document="c"/>
<link url="some_url" document="d"/>
<subjects><subject>some text</subject><subject>some more</subject></subjects>
"""

response = html.fromstring(example)
links = response.xpath('//link')
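result = []
for link in links:
    # Start a new group for every link.
    result.append([link])
    # getnext() returns the next sibling element, or None if there is none.
    next_element = link.getnext()
    if next_element is not None and next_element.tag == 'subjects':
        # The link is immediately followed by <subjects>: keep them together.
        result[-1].append(next_element)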

print(result)

Result:

[[<Element link at 0x1a0891e0d60>], [<Element link at 0x1a0891e0db0>, <Element subjects at 0x1a089096360>], [<Element link at 0x1a0891e0e00>], [<Element link at 0x1a0891e0e50>, <Element subjects at 0x1a0891e0d10>]]

Note that the list still contains lxml Element objects; you can of course convert them to strings.
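If you would rather get the "dream" form from the question, with the subjects folded into an attribute on the link, something along these lines might work. This is only a sketch under a few assumptions: it parses the sample as XML with etree rather than with the HTML parser, it walks the children of the root in document order instead of calling getnext(), and the names merge_subjects and separator are made up for illustration. It sticks to the ElementTree-compatible API, so it falls back to the standard library when lxml is not installed.

```python
# Prefer lxml; fall back to the standard library, since only the
# ElementTree-compatible API shared by both is used below.
try:
    from lxml import etree
except ImportError:
    import xml.etree.ElementTree as etree

# Sample document copied from the question.
XML = """
<download date="22/05/2020 08:34">
    <link url="http://xpto" document="y"/>
    <link url="http://xpto" document="y"/>
    <subjects number="2"><subject>Text explaining the previous link</subject><subject>Another text explaining the previous link</subject></subjects>
    <link url="http://xpto" document="z"/>
    <subjects number="1"><subject>Text explaining the previous link</subject></subjects>
    <link url="http://xpto" document="y"/>
    <link url="http://xpto" document="z"/>
</download>
"""


def merge_subjects(root, separator="|"):
    """Hypothetical helper: copy each <subjects> block into a 'subjects'
    attribute on the <link> directly before it, and return all links."""
    links = []
    previous = None  # the element seen just before the current one
    for child in root:
        if not isinstance(child.tag, str):
            continue  # lxml also yields comments here; ignore them
        if child.tag == "link":
            links.append(child)
        elif (child.tag == "subjects"
              and previous is not None
              and previous.tag == "link"):
            # Join the text of every <subject> into one attribute value.
            texts = [s.text.strip() for s in child.findall("subject") if s.text]
            previous.set("subjects", separator.join(texts))
        previous = child
    return links


root = etree.fromstring(XML.strip())
links = merge_subjects(root)

# Serialize each link back to a string; strip() drops the trailing whitespace
# that tostring() includes after each element.
for link in links:
    print(etree.tostring(link, encoding="unicode").strip())
```

Links that are not followed by a subjects element are returned unchanged, so links is still the full list of links, and link.get("subjects") returns None for those.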
