Wikipedia infobox content

Question    Votes: 0    Answers: 5

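I need to get the content of the infobox for any film. I know the film's name. One way is to fetch the full content of the Wikipedia page and parse it until I find {{Infobox, then extract the infobox content.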
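The manual approach described above can be sketched with the standard library alone. This is a minimal illustration, not a full wikitext parser: the function name `extract_template` and the sample text are hypothetical, the prefix match is case-sensitive, and triple-brace parameters such as `{{{1}}}` are not handled.

```python
def extract_template(wikitext, prefix="{{Infobox"):
    """Return the first template starting with `prefix`, or None.

    Counts '{{' and '}}' pairs so that nested templates such as
    {{Film date|...}} inside the infobox do not end the match early.
    """
    start = wikitext.find(prefix)
    if start == -1:
        return None
    depth = 0
    i = start
    while i < len(wikitext) - 1:
        pair = wikitext[i:i + 2]
        if pair == "{{":
            depth += 1
            i += 2
        elif pair == "}}":
            depth -= 1
            i += 2
            if depth == 0:
                return wikitext[start:i]
        else:
            i += 1
    return None  # braces never balanced


# Hypothetical, abbreviated page text used only for illustration.
sample = (
    "{{Use mdy dates|date=September 2012}}\n"
    "{{Infobox film\n"
    "| name     = Waking Life\n"
    "| released = {{Film date|2001|01|23}}\n"
    "}}\n"
    "'''Waking Life''' is a 2001 film."
)
infobox = extract_template(sample)
print(infobox)
```

Counting brace depth is what keeps the nested {{Film date}} call from terminating the infobox early, which is the main pitfall of a naive search for the first "}}".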

Is there another way to achieve the same thing using an API or a parser?

I am using Python and the pywikipediabot API.

I am also familiar with the wikitools API, so if anyone has a solution that uses wikitools instead of pywikipedia, please mention that as well.

python mediawiki wikipedia pywikibot
5 Answers
11
votes

Another great MediaWiki parser is mwparserfromhell.

In [1]: import mwparserfromhell

In [2]: import pywikibot

In [3]: enwp = pywikibot.Site('en','wikipedia')

In [4]: page = pywikibot.Page(enwp, 'Waking Life')            

In [5]: wikitext = page.get()               

In [6]: wikicode = mwparserfromhell.parse(wikitext)

In [7]: templates = wikicode.filter_templates()

In [8]: templates?
Type:       list
String Form:[u'{{Use mdy dates|date=September 2012}}', u"{{Infobox film\n| name           = Waking Life\n| im <...> critic film|waking-life|Waking Life}}', u'{{Richard Linklater}}', u'{{DEFAULTSORT:Waking Life}}']
Length:     31
Docstring:
list() -> new empty list
list(iterable) -> new list initialized from iterable's items

In [10]: templates[:2]
Out[10]: 
[u'{{Use mdy dates|date=September 2012}}',
 u"{{Infobox film\n| name           = Waking Life\n| image          = Waking-Life-Poster.jpg\n| image_size     = 220px\n| alt            =\n| caption        = Theatrical release poster\n| director       = [[Richard Linklater]]\n| producer       = [[Tommy Pallotta]]<br />[[Jonah Smith]]<br />Anne Walker-McBay<br />Palmer West\n| writer         = Richard Linklater\n| starring       = [[Wiley Wiggins]]\n| music          = Glover Gill\n| cinematography = Richard Linklater<br />[[Tommy Pallotta]]\n| editing        = Sandra Adair\n| studio         = [[Thousand Words]]\n| distributor    = [[Fox Searchlight Pictures]]\n| released       = {{Film date|2001|01|23|[[Sundance Film Festival|Sundance]]|2001|10|19|United States}}\n| runtime        = 101 minutes<!--Theatrical runtime: 100:40--><ref>{{cite web |title=''WAKING LIFE'' (15) |url=http://www.bbfc.co.uk/releases/waking-life-2002-3|work=[[British Board of Film Classification]]|date=September 19, 2001|accessdate=May 6, 2013}}</ref>\n| country        = United States\n| language       = English\n| budget         =\n| gross          = $3,176,880<ref>{{cite web|title=''Waking Life'' (2001)|work=[[Box Office Mojo]] |url=http://www.boxofficemojo.com/movies/?id=wakinglife.htm|accessdate=March 20, 2010}}</ref>\n}}"]

In [11]: infobox_film = templates[1]

In [12]: for param in infobox_film.params:
             print param.name, param.value

 name             Waking Life

 image            Waking-Life-Poster.jpg

 image_size       220px

 alt             

 caption          Theatrical release poster

 director         [[Richard Linklater]]

 producer         [[Tommy Pallotta]]<br />[[Jonah Smith]]<br />Anne Walker-McBay<br />Palmer West

 writer           Richard Linklater

 starring         [[Wiley Wiggins]]

 music            Glover Gill

 cinematography   Richard Linklater<br />[[Tommy Pallotta]]

 editing          Sandra Adair

 studio           [[Thousand Words]]

 distributor      [[Fox Searchlight Pictures]]

 released         {{Film date|2001|01|23|[[Sundance Film Festival|Sundance]]|2001|10|19|United States}}

 runtime          101 minutes<!--Theatrical runtime: 100:40--><ref>{{cite web |title=''WAKING LIFE'' (15) |url=http://www.bbfc.co.uk/releases/waking-life-2002-3|work=[[British Board of Film Classification]]|date=September 19, 2001|accessdate=May 6, 2013}}</ref>

 country          United States

 language         English

 budget          

 gross            $3,176,880<ref>{{cite web|title=''Waking Life'' (2001)|work=[[Box Office Mojo]] |url=http://www.boxofficemojo.com/movies/?id=wakinglife.htm|accessdate=March 20, 2010}}</ref>

Don't forget that the params are mwparserfromhell objects too!


6
votes

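Instead of reinventing the wheel, have a look at DBpedia, which has already extracted all Wikipedia infoboxes into an easily parsable database format.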
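As a rough sketch of what a DBpedia lookup could look like, the snippet below builds a query URL for the public SPARQL endpoint. The helper name `dbpedia_infobox_url` is hypothetical, and the sketch assumes that the article title maps directly to a resource under http://dbpedia.org/resource/ and that the raw infobox fields sit under the http://dbpedia.org/property/ namespace. Titles containing special characters would need extra escaping.

```python
from urllib.parse import urlencode

DBPEDIA_SPARQL = "https://dbpedia.org/sparql"


def dbpedia_infobox_url(title):
    """Build a query URL for the raw infobox properties of one article."""
    resource = "http://dbpedia.org/resource/" + title.replace(" ", "_")
    query = (
        "SELECT ?property ?value WHERE { "
        f"<{resource}> ?property ?value . "
        'FILTER(STRSTARTS(STR(?property), "http://dbpedia.org/property/")) '
        "}"
    )
    params = {"query": query, "format": "application/sparql-results+json"}
    return DBPEDIA_SPARQL + "?" + urlencode(params)


url = dbpedia_infobox_url("Waking Life")
print(url)

# Fetching is left commented out so the sketch runs without network access:
# import json, urllib.request
# with urllib.request.urlopen(url) as response:
#     rows = json.load(response)["results"]["bindings"]
# for row in rows:
#     print(row["property"]["value"], row["value"]["value"])
```

DBpedia data is extracted from periodic dumps, so it may lag behind the live article.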


2
votes

Every infobox is a template embedded with double curly braces. Let's look at a template and how it is embedded in the wikitext:

Infobox film

{{Infobox film
| name           = Actresses
| image          = Actrius film poster.jpg
| alt            = 
| caption        = Catalan language film poster
| native_name      = ([[Catalan language|Catalan]]: '''''Actrius''''')
| director       = [[Ventura Pons]]
| producer       = Ventura Pons
| writer         = [[Josep Maria Benet i Jornet]]
| screenplay     = Ventura Pons
| story          = 
| based_on       = {{based on|(stage play) ''E.R.''|Josep Maria Benet i Jornet}}
| starring       = {{ubl|[[Núria Espert]]|[[Rosa Maria Sardà]]|[[Anna Lizaran]]|[[Mercè Pons]]}}
| narrator       = <!-- or: |narrators = -->
| music          = Carles Cases
| cinematography = Tomàs Pladevall
| editing        = Pere Abadal
| production_companies = {{ubl|[[Canal+|Canal+ España]]|Els Films de la Rambla S.A.|[[Generalitat de Catalunya|Generalitat de Catalunya - Departament de Cultura]]|[[Televisión Española]]}}
| distributor    = [[Buena Vista International]]
| released       = {{film date|df=yes|1997|1|17|[[Spain]]}}
| runtime        = 100 minutes
| country        = Spain
| language       = Catalan
| budget         = 
| gross          = <!--(please use condensed and rounded values, e.g. "£11.6 million" not "£11,586,221")-->
}}

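Pywikibot's Page class offers two high-level ways to parse the content of any template in the wikitext. Both use mwparserfromhell if it is installed; otherwise they fall back to a regular expression, which may fail for templates nested more than 3 levels deep: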

raw_extracted_templates

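raw_extracted_templates is a Page property that returns a list of tuples with two items each. The first item is the template identifier as a str, for example 'Infobox film'. The second item is an OrderedDict with the template parameter identifiers as keys and their assigned values as values. For example, the template fields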

| name = FILM TITLE
| image = FILM TITLE poster.jpg
| caption = Theatrical release poster

result in the OrderedDict

OrderedDict([('name', 'FILM TITLE'), ('image', 'FILM TITLE poster.jpg'), ('caption', 'Theatrical release poster')])

Now, how do you get it with Pywikibot?

from pprint import pprint
import pywikibot
site = pywikibot.Site('wikipedia:en')  # or pywikibot.Site('en', 'wikipedia') for older releases
page = pywikibot.Page(site, 'Actrius')
# raw_extracted_templates is a property, so it is accessed without parentheses
all_templates = page.raw_extracted_templates
for tmpl, params in all_templates:
    # tmpl is the template name as a str, params is an OrderedDict
    if tmpl == 'Infobox film':
        pprint(params)

This prints

 OrderedDict([('name', 'Actresses'),
              ('image', 'Actrius film poster.jpg'),
              ('alt', ''),
              ('caption', 'Catalan language film poster'),
              ('native_name',
               "([[Catalan language|Catalan]]: '''''Actrius''''')"),
              ('director', '[[Ventura Pons]]'),
              ('producer', 'Ventura Pons'),
              ('writer', '[[Josep Maria Benet i Jornet]]'),
              ('screenplay', 'Ventura Pons'),
              ('story', ''),
              ('based_on',
               "{{based on|(stage play) ''E.R.''|Josep Maria Benet i Jornet}}"),
              ('starring',
               '{{ubl|[[Núria Espert]]|[[Rosa Maria Sardà]]|[[Anna '
               'Lizaran]]|[[Mercè Pons]]}}'),
              ('narrator', ''),
              ('music', 'Carles Cases'),
              ('cinematography', 'Tomàs Pladevall'),
              ('editing', 'Pere Abadal'),
              ('production_companies',
               '{{ubl|[[Canal+|Canal+ España]]|Els Films de la Rambla '
               'S.A.|[[Generalitat de Catalunya|Generalitat de Catalunya - '
               'Departament de Cultura]]|[[Televisión Española]]}}'),
              ('distributor', '[[Buena Vista International]]'),
              ('released', '{{film date|df=yes|1997|1|17|[[Spain]]}}'),
              ('runtime', '100 minutes'),
              ('country', 'Spain'),
              ('language', 'Catalan'),
              ('budget', ''),
              ('gross', '')])

templatesWithParams()

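This is similar to the raw_extracted_templates property, but the method returns a list of tuples that again contain two items. The first item is the template as a Page object. The second item is a list of the template parameters. Have a look at the example:

Sample code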

from pprint import pprint
import pywikibot
site = pywikibot.Site('wikipedia:en')  # or pywikibot.Site('en', 'wikipedia') for older releases
page = pywikibot.Page(site, 'Actrius')
all_templates = page.templatesWithParams()
for tmpl, params in all_templates:
    # tmpl is a Page object here, so compare its title without the namespace
    if tmpl.title(with_ns=False) == 'Infobox film':
        pprint(params)

This prints the list:

['alt=',
 "based_on={{based on|(stage play) ''E.R.''|Josep Maria Benet i Jornet}}",
 'budget=',
 'caption=Catalan language film poster',
 'cinematography=Tomàs Pladevall',
 'country=Spain',
 'director=[[Ventura Pons]]',
 'distributor=[[Buena Vista International]]',
 'editing=Pere Abadal',
 'gross=',
 'image=Actrius film poster.jpg',
 'language=Catalan',
 'music=Carles Cases',
 'name=Actresses',
 'narrator=',
 "native_name=([[Catalan language|Catalan]]: '''''Actrius''''')",
 'producer=Ventura Pons',
 'production_companies={{ubl|[[Canal+|Canal+ España]]|Els Films de la Rambla '
 'S.A.|[[Generalitat de Catalunya|Generalitat de Catalunya - Departament de '
 'Cultura]]|[[Televisión Española]]}}',
 'released={{film date|df=yes|1997|1|17|[[Spain]]}}',
 'runtime=100 minutes',
 'screenplay=Ventura Pons',
 'starring={{ubl|[[Núria Espert]]|[[Rosa Maria Sardà]]|[[Anna '
 'Lizaran]]|[[Mercè Pons]]}}',
 'story=',
 'writer=[[Josep Maria Benet i Jornet]]']
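Since this list holds plain 'key=value' strings, a small helper can turn it into a dictionary comparable to the one from raw_extracted_templates. The sketch below is an assumption-laden illustration: `params_to_dict` is a hypothetical helper, items without an '=' are treated as positional parameters and numbered from 1, and a positional value that itself contains '=' would be split incorrectly.

```python
def params_to_dict(params):
    """Turn ['name=Actresses', 'alt=', ...] into {'name': 'Actresses', 'alt': '', ...}."""
    result = {}
    positional = 0
    for item in params:
        # Split on the first '=' only, so nested 'df=yes' stays in the value.
        key, sep, value = item.partition("=")
        if sep:
            result[key.strip()] = value.strip()
        else:
            # No '=' at all: treat it as a positional (unnamed) parameter.
            positional += 1
            result[str(positional)] = item.strip()
    return result


# A few entries copied from the list printed above.
params = [
    "alt=",
    "caption=Catalan language film poster",
    "released={{film date|df=yes|1997|1|17|[[Spain]]}}",
    "runtime=100 minutes",
]
infobox = params_to_dict(params)
print(infobox["released"])
```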

0
votes

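You can fetch the wiki page content with pywikipediabot and then search for the infobox with a regular expression or a parser such as mwlib [0], or stay with pywikipediabot and use one of its template tools. For example, textlib has some functions for handling templates (hint: search for "# Functions dealing with templates"). [1]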

[0] - http://pypi.python.org/pypi/mwlib

[1] - http://svn.wikimedia.org/viewvc/pywikipedia/trunk/pywikipedia/pywikibot/textlib.py?view=markup


0
votes

Wikimedia Enterprise has a new beta that exposes all infobox content through a REST API: https://enterprise.wikimedia.com/news/structed-contents-wikipedia-infobox/
