How to get a JSON of all titles from Wikipedia [closed]

Question

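I would like to know how to get a complete list of the titles of all Wikipedia pages. I found similar questions, but all of them suggest using the "dump" files, which I don't know how to handle.

I only need the titles.

Thanks in advance for your support.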

Answer 1

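As suggested in the comments, you should use the Wikipedia API, specifically Allpages. To get "all" Wikipedia titles from a to z (I'm not sure this is feasible; also check the apnamespace API argument), here is a quick threaded script for the problem. Each request returns only one batch of titles starting at its letter, so treat it as a starting point rather than a complete listing: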

import string
import threading
from pprint import pprint

import requests

API_URL = "https://en.wikipedia.org/w/api.php"
all_titles = {}  # pageid -> title; will hold the final results


def parse_letter(letter):
    """Fetch one batch of titles starting at `letter` and store them."""
    params = {
        "action": "query",
        "list": "allpages",
        "aplimit": 500,  # 500 is the maximum for regular (non-bot) clients
        "apfrom": letter,
        "format": "json",
    }
    try:
        j_obj = requests.get(API_URL, params=params, timeout=30).json()
        for p in j_obj["query"]["allpages"]:
            all_titles[p["pageid"]] = p["title"]  # add to the final dictionary
            print(p["pageid"], p["title"])
    except Exception as e:
        print(f"Error letter {letter}", e)


# start one thread per letter from a to z (abcdefghijklmnopqrstuvwxyz)
threads = [threading.Thread(target=parse_letter, args=(l,)) for l in string.ascii_lowercase]
for t in threads:
    t.start()

# wait for the threads to finish
for t in threads:
    t.join()

pprint(all_titles)

'''
To export a json file, use:
import json
with open("all_titles.json", "w") as f:
    json.dump(all_titles, f)
'''

Output (pageid: title):

{290: 'A',
 4666: 'B*-algebra',
 27084: "B'Elanna Torres",
 76365: 'B-17',
 77818: "B'nai Noach",
 92281: "B'alam Quitzé",
 92282: "B'alam Quitze",
 92283: "B'alam Agab",
...
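
Since the question asks only for the titles, the dictionary above can be reduced to a plain JSON array. The following is a minimal sketch, not part of the original script: the helper name `titles_to_json` and the output path are made up for illustration, and it assumes `all_titles` has the pageid-to-title shape shown in the output.

```python
import json


def titles_to_json(all_titles, path):
    """Write only the titles (no pageids) to `path` as a sorted JSON array."""
    titles = sorted(all_titles.values())
    with open(path, "w", encoding="utf-8") as f:
        # ensure_ascii=False keeps characters such as "é" readable in the file
        json.dump(titles, f, ensure_ascii=False, indent=2)
    return titles
```

Calling `titles_to_json(all_titles, "titles_only.json")` after the threads have finished would produce a file containing just the titles.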

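Notes:

The API caps aplimit at 500 for regular clients (5000 for clients with higher limits, such as bots), so larger values such as 1000 are not honored. To go beyond a single batch, follow the apcontinue value returned in the response.
To filter out all redirect pages, use apfilterredir=nonredirects (the gapfilterredir form applies only when Allpages is used as a generator).
Read the Allpages documentation of the Wikipedia API.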
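
The notes on batch size and redirects can be combined into a loop that keeps requesting batches until the API stops returning a continuation token. This is a rough sketch under a few assumptions: it uses the standard library's `urllib` instead of `requests` so it has no third-party dependency, the function names `fetch_json` and `iter_titles` are invented for illustration, and the User-Agent string is a placeholder you should replace with one that identifies your script. Walking the whole English Wikipedia this way takes a very large number of requests, so the `max_batches` argument is there to stop early.

```python
import json
import urllib.parse
import urllib.request

API_URL = "https://en.wikipedia.org/w/api.php"


def fetch_json(params):
    """Send one GET request to the API and return the decoded JSON."""
    url = API_URL + "?" + urllib.parse.urlencode(params)
    # Placeholder User-Agent (an assumption); use one describing your own script.
    req = urllib.request.Request(url, headers={"User-Agent": "title-lister-example/0.1"})
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.load(resp)


def iter_titles(fetch=fetch_json, max_batches=None):
    """Yield (pageid, title) pairs, following the API's continuation values."""
    params = {
        "action": "query",
        "list": "allpages",
        "aplimit": "max",  # let the API pick the largest allowed batch size
        "apfilterredir": "nonredirects",  # skip redirect pages
        "format": "json",
    }
    batches = 0
    while max_batches is None or batches < max_batches:
        data = fetch(params)
        for page in data["query"]["allpages"]:
            yield page["pageid"], page["title"]
        batches += 1
        if "continue" not in data:
            break  # the API reports no further batches
        # copy the continuation values (such as apcontinue) into the next request
        params = {**params, **data["continue"]}
```

For a quick trial, `dict(iter_titles(max_batches=2))` fetches only the first two batches. Passing a different `fetch` callable makes the loop easy to try out without touching the network.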
