How to get Wikipedia data from WikiProjects?

Question  Votes: 1  Answers: 2

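I recently found that Wikipedia has WikiProjects that are categorised by discipline (https://en.wikipedia.org/wiki/Category:WikiProjects_by_discipline). As shown at the link, there are 34 disciplines.

I would like to know whether it is possible to get all the Wikipedia articles related to each of these disciplines.

For example, consider WikiProject Computer science. Is it possible to get all the computer-science-related Wikipedia articles using the WikiProject Computer science category? If so, are there any data dumps for it, or is there another way to obtain this data?

I am currently using Python (i.e. pywikibot and pymediawiki). However, I am happy to receive answers in other languages as well.

I am happy to provide more details if needed.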

python mediawiki wikipedia wikipedia-api mediawiki-api
2 Answers
3
votes

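As I suggested, and adding to @arash's answer, you can use the Wikipedia API to get the data. The documentation page API:Categorymembers#GET_request describes how to do this.

Since you commented that you need to fetch the data programmatically, below is sample code in JavaScript. It fetches the first 500 titles from Category:WikiProject_Computer_science_articles and prints them. You can port this example to the language of your choice: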

// Importing the module (require works with node-fetch v2; v3 is ESM-only)
const fetch = require('node-fetch');

// URL with resources to fetch
const url = "https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category%3AWikiProject_Computer_science_articles&cmprop=title&cmlimit=500";

// Fetching using 'node-fetch'
fetch(url).then(res => res.json()).then(t => {
    // Getting the length of the returned array
    let len = t.query.categorymembers.length;
    // Iterating over all the response data
    for(let i=0;i<len;i++) {
        // Printing the names
        console.log(t.query.categorymembers[i].title);
    }
});

To write the data to a file, you can do the following:

//Importing the modules
const fetch = require('node-fetch');
const fs = require('fs');

//URL with resources to fetch
const url = "https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category%3AWikiProject_Computer_science_articles&cmprop=title&cmlimit=500";

//Fetching using 'node-fetch'
fetch(url).then(res => res.json()).then(t => {
    // Getting the length of the returned array
    let len = t.query.categorymembers.length;
    // Initializing an empty array
    let titles = [];
    // Iterating over all the response data
    for(let i=0;i<len;i++) {
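        // Printing the name and storing it in the array
        let title = t.query.categorymembers[i].title;
        console.log(title);
        titles[i] = title;
    }
    // Recent Node.js versions reject a plain array here, so join it explicitly
    fs.writeFileSync('pathtotitles\\titles.txt', titles.join(','));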
});

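The code above stores the titles in the file separated by commas, because they are held in a JavaScript array that is written out as one comma-joined string. To store each title on its own line without commas, do this instead: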

//Importing the modules
const fetch = require('node-fetch');
const fs = require('fs');

//URL with resources to fetch
const url = "https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category%3AWikiProject_Computer_science_articles&cmprop=title&cmlimit=500";

//Fetching using 'node-fetch'
fetch(url).then(res => res.json()).then(t => {
    // Getting the length of the returned array
    let len = t.query.categorymembers.length;
    // Initializing an empty string
    let titles = '';
    // Iterating over all the response data
    for(let i=0;i<len;i++) {
        // Printing the name and appending it as its own line
        let title = t.query.categorymembers[i].title;
        console.log(title);
        titles += title + "\n";
    }
    fs.writeFileSync('pathtotitles\\titles.txt', titles);
});

With cmlimit we cannot fetch more than 500 titles per request, so we need to use cmcontinue to check for and fetch the next page.
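To make the continuation step concrete, here is a minimal sketch of how the next request URL can be derived from a response. The sample response is hand-written (its page IDs, titles and token are invented for illustration), and `nextPageUrl` is a hypothetical helper name, not part of any library:

```javascript
// Hand-written sample shaped like a list=categorymembers response.
// The page IDs, titles and token below are invented for illustration.
const sampleResponse = {
    continue: { cmcontinue: "page|EXAMPLE|12345", continue: "-||" },
    query: {
        categorymembers: [
            { pageid: 1, ns: 0, title: "Example A" },
            { pageid: 2, ns: 0, title: "Example B" },
        ],
    },
};

// Hypothetical helper: returns the URL of the next page, or null when the
// response carries no continuation token (this was the last page).
function nextPageUrl(baseUrl, data) {
    if (!data.continue || !data.continue.cmcontinue) {
        return null;
    }
    // The token can contain characters such as "|", so it is URL-encoded
    return baseUrl + "&cmcontinue=" + encodeURIComponent(data.continue.cmcontinue);
}

const baseUrl = "https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category%3AWikiProject_Computer_science_articles&cmlimit=500";

// Prints the base URL followed by "&cmcontinue=page%7CEXAMPLE%7C12345"
console.log(nextPageUrl(baseUrl, sampleResponse));
// Prints null, because this response has no "continue" object
console.log(nextPageUrl(baseUrl, { query: { categorymembers: [] } }));
```

Returning null for the last page lets the caller loop until the token runs out instead of guessing the number of pages in advance.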

Try the code below, which fetches all the titles of a particular category, prints them, and appends them to a file:

//Importing the modules
const fetch = require('node-fetch');
const fs = require('fs');
//URL with resources to fetch
var url = "https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category%3AWikiProject_Computer_science_articles&cmlimit=500";

// Fetches one page of results, appends the titles to a file and
// returns the continuation token (undefined once the last page is reached)
var fetchTheData = async (url) => {
    const res = await fetch(url);
    const data = await res.json();
    // Initializing an empty string
    let titles = '';
    // Iterating over all the response data
    for (const member of data.query.categorymembers) {
        // Printing the name and collecting it, one title per line
        console.log(member.title);
        titles += member.title + "\n";
    }
    // Appending to the file
    fs.appendFileSync('pathtotitles\\titles.txt', titles);
    // The response has no "continue" object once there are no more pages
    return data.continue ? data.continue.cmcontinue : undefined;
}

// Method which constructs the next page URL and keeps fetching until
// no continuation token is returned
var constructNextPageURL = async (url) => {
    // Getting the first page and its next page token
    let nextPage = await fetchTheData(url);
    while (nextPage) {
        // The token may contain characters such as "|", so it is URL-encoded
        const nextUrl = url + '&cmcontinue=' + encodeURIComponent(nextPage);
        console.log("=> The next page URL is : " + nextUrl);
        // Sending the fetch request for the next page
        nextPage = await fetchTheData(nextUrl);
    }
    console.log("===>>> Finished Fetching...");
}

// Calling to begin extraction
constructNextPageURL(url);
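One caveat worth checking against the real output: WikiProject categories are usually populated through banner templates placed on talk pages, so the titles returned may carry a `Talk:` prefix instead of being article titles themselves. A minimal sketch, assuming that is the case here; `toArticleTitle` is a hypothetical helper, the sample titles are hand-written, and only the plain `Talk:` namespace is handled:

```javascript
const TALK_PREFIX = "Talk:";

// Hypothetical helper: strips a leading "Talk:" so that the talk page
// title maps back to the title of its article. Other titles pass through.
function toArticleTitle(title) {
    return title.startsWith(TALK_PREFIX) ? title.slice(TALK_PREFIX.length) : title;
}

// Hand-written sample titles for illustration
const sampleTitles = ["Talk:Algorithm", "Talk:Compiler", "Category talk:Algorithms"];

// Keeping only article talk pages and converting them to article titles
const articleTitles = sampleTitles
    .filter((title) => title.startsWith(TALK_PREFIX))
    .map(toArticleTitle);

console.log(articleTitles); // [ 'Algorithm', 'Compiler' ]
```

If the category turns out to list articles directly, this step is unnecessary and the titles can be used as returned.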

I hope this helps.


2
votes

You can use API:Categorymembers to get the list of subcategories and pages. Set the "cmtype" parameter to "subcat" to get subcategories, and set "cmnamespace" to "0" to get articles.
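The two parameters above can be sketched as follows. This is only an illustration: `buildCategoryMembersUrl` and `listTitles` are hypothetical helper names, "Category:Computer science" is used purely as an example category, and `listTitles` assumes Node.js 18 or later, where `fetch` is available globally:

```javascript
const API_ENDPOINT = "https://en.wikipedia.org/w/api.php";

// Hypothetical helper: builds a list=categorymembers request URL,
// letting the caller add filters such as cmtype or cmnamespace.
function buildCategoryMembersUrl(category, extraParams = {}) {
    const params = new URLSearchParams({
        action: "query",
        format: "json",
        list: "categorymembers",
        cmtitle: category,
        cmlimit: "500",
        ...extraParams,
    });
    return API_ENDPOINT + "?" + params.toString();
}

// Subcategories only (cmtype=subcat)
const subcatUrl = buildCategoryMembersUrl("Category:WikiProjects by discipline", { cmtype: "subcat" });

// Main-namespace pages only (cmnamespace=0); the category is an example
const articleUrl = buildCategoryMembersUrl("Category:Computer science", { cmnamespace: "0" });

// Hypothetical helper: fetches one page of results and returns the titles.
// Assumes a global fetch (Node.js 18+); it is not called here.
async function listTitles(url) {
    const res = await fetch(url);
    const data = await res.json();
    return data.query.categorymembers.map((member) => member.title);
}

console.log(subcatUrl);
console.log(articleUrl);
```

Calling `listTitles(subcatUrl).then(console.log)` would print the first page of subcategory titles. Larger categories still need the cmcontinue handling described in the other answer.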

You can also get the list from the database (category hierarchy information from the categorylinks table and article information from the page table).
