How can I extract data from a wikilink?


I want to extract data from the wikilinks returned by the mwparserfromhell library. For example, I want to parse the following string:

[[File:Warszawa, ul. Freta 16 20170516 002.jpg|thumb|upright=1.18|[[Maria Skłodowska-Curie Museum|Birthplace]] of Marie Curie, at 16 Freta Street, in [[Warsaw]], [[Poland]].]]

If I split the string on the | character, it does not work, because the image description also contains a link that uses |:
[[Maria Skłodowska-Curie Museum|Birthplace]]

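My current approach uses a regular expression to replace every link in the string with its label first, and only then splits it. It works (for this example), but it does not feel clean (see the code below). Is there a better way to extract information from a string like this?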

import re

wiki_code = "[[File:Warszawa, ul. Freta 16 20170516 002.jpg|thumb|upright=1.18|[[Maria Skłodowska-Curie Museum|Birthplace]] of Marie Curie, at 16 Freta Street, in [[Warsaw]], [[Poland]].]]"

# Remove [[File: at the beginning of the string
prefix = "[[File:"
if wiki_code.startswith(prefix):
    wiki_code = wiki_code[len(prefix):]

# Remove ]] at the end of the string
suffix = "]]"
if wiki_code.endswith(suffix):
    wiki_code = wiki_code[:-len(suffix)]

# Replace each inner link with its label (the part after the last pipe)
link_pattern = re.compile(r'\[\[.*?\]\]')
matches = link_pattern.findall(wiki_code)
for match in matches:
    content = match[2:-2]
    arr = content.split("|")
    label = arr[-1]
    wiki_code = wiki_code.replace(match, label)

print(wiki_code.split("|"))
python wikipedia
1 Answer

The links returned by .filter_wikilinks() are instances of the Wikilink class, which has title and text attributes.

  • title returns the title of the link:
    File:Warszawa, ul. Freta 16 20170516 002.jpg
  • text returns the rest of the link:
    thumb|upright=1.18|[[Maria Skłodowska-Curie Museum|Birthplace]] of Marie Curie, at 16 Freta Street, in [[Warsaw]], [[Poland]].

Both are returned as Wikicode objects.

In the common case where the actual caption is the last fragment, first find the other fragments with the following regular expression:

([^\[\]|]*\|)+

  • ( ... ): a group
    • [^\[\]|]*: 0 or more characters that are not square brackets or pipes
    • \|: a literal pipe
  • +: 1 or more repetitions of the group

Everything from the end index of that match to the end of the string is the last fragment.

>>> import mwparserfromhell
>>> import re
>>> wikitext = mwparserfromhell.parse('[[File:Warszawa, ul. Freta 16 20170516 002.jpg|thumb|upright=1.18|[[Maria Skłodowska-Curie Museum|Birthplace]] of Marie Curie, at 16 Freta Street, in [[Warsaw]], [[Poland]].]]')
>>> image_link = wikitext.filter_wikilinks()[0]
>>> image_link
'[[File:Warszawa, ul. Freta 16 20170516 002.jpg|thumb|upright=1.18|[[Maria Skłodowska-Curie Museum|Birthplace]] of Marie Curie, at 16 Freta Street, in [[Warsaw]], [[Poland]].]]'
>>> image_link.title
'File:Warszawa, ul. Freta 16 20170516 002.jpg'
>>> text = str(image_link.text)
>>> text
'thumb|upright=1.18|[[Maria Skłodowska-Curie Museum|Birthplace]] of Marie Curie, at 16 Freta Street, in [[Warsaw]], [[Poland]].'
>>> other_fragments = re.match(r'([^\[\]|]*\|)+', text)
>>> other_fragments
<re.Match object; span=(0, 19), match='thumb|upright=1.18|'>
>>> other_fragments.span(0)[1]
19
>>> text[19:]
'[[Maria Skłodowska-Curie Museum|Birthplace]] of Marie Curie, at 16 Freta Street, in [[Warsaw]], [[Poland]].'
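The session above can be folded into a small helper. This is only a sketch using the standard re module: the names OPTIONS_PATTERN and split_options_and_caption are made up for illustration and are not part of mwparserfromhell. It assumes the caption is the last fragment, and that no template containing a pipe appears before the caption starts.

```python
import re

# Leading run of "option|" pieces that contain no square brackets or pipes.
# (Hypothetical name, same regex as in the answer.)
OPTIONS_PATTERN = re.compile(r'([^\[\]|]*\|)+')


def split_options_and_caption(text):
    """Split the text part of a file link into (options, caption).

    Assumes the caption is the last fragment of the link text.
    """
    match = OPTIONS_PATTERN.match(text)
    if match is None:
        # No leading options at all: everything is the caption.
        return [], text
    end = match.end()
    # The matched run always ends with a pipe; drop it before splitting.
    options = text[:end - 1].split('|')
    return options, text[end:]


if __name__ == '__main__':
    sample = ('thumb|upright=1.18|[[Maria Skłodowska-Curie Museum|Birthplace]] '
              'of Marie Curie, at 16 Freta Street, in [[Warsaw]], [[Poland]].')
    print(split_options_and_caption(sample))
```

With a Wikilink object, the argument would be str(image_link.text), as in the session above.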

When the caption is not the last fragment

For this edge case, we can go over the text attribute again with the help of functions from itertools:

>>> import mwparserfromhell
>>> import re
>>> from itertools import chain, groupby
>>> wikitext = mwparserfromhell.parse('[[File:Marie Curie - Mobile X-Ray-Unit.jpg|thumb|Curie in a mobile X-ray vehicle, {{circa|1915}}|alt=]]')
>>> image_link = wikitext.filter_wikilinks()[0]
>>> image_link.text
'thumb|Curie in a mobile X-ray vehicle, {{circa|1915}}|alt='
>>> child_nodes = image_link.text.filter(recursive = False)
>>> child_nodes
['thumb|Curie in a mobile X-ray vehicle, ', '{{circa|1915}}', '|alt=']
>>> isinstance(child_nodes[0], mwparserfromhell.nodes.Text)
True
>>> isinstance(child_nodes[1], mwparserfromhell.nodes.Template)
True
>>> tokens = list(chain.from_iterable(re.split(r'(\|)', str(node)) if isinstance(node, mwparserfromhell.nodes.Text) else [node] for node in child_nodes))
>>> tokens
['thumb', '|', 'Curie in a mobile X-ray vehicle, ', '{{circa|1915}}', '', '|', 'alt=']
>>> fragments = []
>>> for is_not_pipe, group in groupby(tokens, key = lambda token: token != '|'):
...   if is_not_pipe:
...     fragments.append(''.join(map(str, group)))
...
>>> fragments
['thumb', 'Curie in a mobile X-ray vehicle, {{circa|1915}}', 'alt=']
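If you only have the text as a plain string, a different technique gives the same fragments for both examples in this thread: scan the string and split only on pipes that sit outside [[...]] and {{...}}. This is not the node-based method shown above, just a naive depth counter. The function name split_top_level_pipes is made up for illustration, and the sketch assumes well-formed, balanced markup (it does nothing special for triple braces, nowiki sections, or tables).

```python
def split_top_level_pipes(text):
    """Split text on '|' characters that are outside [[...]] and {{...}}."""
    fragments = []
    current = []   # characters of the fragment being built
    depth = 0      # how many [[ or {{ are currently open
    i = 0
    while i < len(text):
        pair = text[i:i + 2]
        if pair in ('[[', '{{'):
            # Entering a nested link or template.
            depth += 1
            current.append(pair)
            i += 2
        elif pair in (']]', '}}') and depth > 0:
            # Leaving a nested link or template.
            depth -= 1
            current.append(pair)
            i += 2
        elif text[i] == '|' and depth == 0:
            # A top-level pipe ends the current fragment.
            fragments.append(''.join(current))
            current = []
            i += 1
        else:
            current.append(text[i])
            i += 1
    fragments.append(''.join(current))
    return fragments


if __name__ == '__main__':
    print(split_top_level_pipes(
        'thumb|Curie in a mobile X-ray vehicle, {{circa|1915}}|alt='))
```

The node-based version above lets mwparserfromhell decide what counts as a template or link, which is safer on unusual markup. The scanner is just convenient when no parser is at hand.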