Using multiprocessing to speed up Wikipedia scraping with BeautifulSoup

Question

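I'm using BeautifulSoup to scrape some basic information from a bunch of Wikipedia pages. The program runs, but it's slow (about 20 minutes for 650 pages). I'm trying to use multiprocessing to speed this up, but it isn't working as expected. It seems to either stall and do nothing, or scrape only the first letter of each page's name.

The scraping code I'm using is: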

import requests
import wikipedia
from bs4 import BeautifulSoup

#dict where key is person's name and value is proper wikipedia url formatting
all_wikis = { 'Adam Ferrara': 'Adam_Ferrara',
              'Adam Hartle': 'Adam_Hartle',
              'Adam Ray': 'Adam_Ray_(comedian)',
              'Adam Sandler': 'Adam_Sandler',
              'Adele Givens': 'Adele_Givens'}
bios = {}
def scrape(dictionary):
    for key in dictionary:
        #search each page
        page = requests.get(("https://en.wikipedia.org/wiki/" + str(key)))
        data = page.text
        soup = BeautifulSoup(data, "html.parser")
        #get data
        try:
            bday = soup.find('span', attrs={'class' : 'bday'}).text
        except:
            bday = 'Birthday Unknown'
        try:
            birthplace = soup.find('div', attrs={'class' : 'birthplace'}).text
        except:
            birthplace = 'Birthplace Unknown'
        try:
            death_date = (soup.find('span', attrs={'style' : "display:none"}).text
                                                                            .replace("(", "")
                                                                            .replace(")", ""))
            living_status = 'Deceased'
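        except:
            #no hidden date span found; define death_date so the assignment below cannot raise NameError
            death_date = None
            living_status = 'Alive'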
        try:
            summary = wikipedia.summary(dictionary[key].replace("_", " "))
        except:
            summary = "No Summary"
        bios[key] = {}
        bios[key]['birthday'] = bday
        bios[key]['home_town'] = birthplace
        bios[key]['summary'] = summary
        bios[key]['living_status'] = living_status
        bios[key]['passed_away'] = death_date

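I tried adding multiprocessing at the end of the script with the code below, but it either does nothing or scrapes only the first letter of each page name (for example, if the page I'm searching for is Bruce Lee, it instead pulls up the Wikipedia page for the letter B and then throws a bunch of errors).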

from multiprocessing import Pool, cpu_count

if __name__ == '__main__':
    pool = Pool(cpu_count())
    results = pool.map(func=scrape, iterable=all_wikis)
    pool.close()
    pool.join()

Is there a better way to structure my script for multiprocessing? Thanks.


Answer 1


There are a few problems here:
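
Two things stand out in the code as posted. `pool.map` calls the function once per element of the iterable, and iterating over a dict yields its keys, so `scrape` receives a single name string such as 'Bruce Lee'. Its `for key in dictionary` loop then walks that string one character at a time, which would explain the request for the page "B". Separately, each worker process has its own copy of the module-level `bios`, so anything written to it inside a worker never reaches the parent process. Below is a minimal sketch of one possible restructuring, assuming a worker that handles exactly one page and returns its result. The names `page_url`, `work_items`, `scrape_one` and `scrape_all` are hypothetical, the 10-second timeout is an arbitrary choice, and only the birthday and birthplace fields are shown.

```python
from multiprocessing import Pool, cpu_count

WIKI_BASE = "https://en.wikipedia.org/wiki/"


def page_url(slug):
    """Build the article URL from the URL-formatted title (the dict value)."""
    return WIKI_BASE + slug


def work_items(pages):
    """Turn {name: slug} into a list of (name, slug) pairs, one per task."""
    return list(pages.items())


def scrape_one(item):
    """Scrape a single page and return (name, bio) to the parent process."""
    # Third-party imports are kept local to the worker here; placing them
    # at the top of the module works just as well.
    import requests
    from bs4 import BeautifulSoup

    name, slug = item
    page = requests.get(page_url(slug), timeout=10)
    soup = BeautifulSoup(page.text, "html.parser")
    # find() returns None when the element is missing, so test for that
    # instead of wrapping each lookup in try/except.
    bday = soup.find("span", attrs={"class": "bday"})
    birthplace = soup.find("div", attrs={"class": "birthplace"})
    bio = {
        "birthday": bday.text if bday is not None else "Birthday Unknown",
        "home_town": birthplace.text if birthplace is not None else "Birthplace Unknown",
    }
    return name, bio


def scrape_all(pages, processes=None):
    """Run scrape_one over every page in a process pool and merge the results."""
    with Pool(processes or cpu_count()) as pool:
        # map() blocks until every task has finished, then the parent
        # assembles the returned (name, bio) pairs into one dict.
        return dict(pool.map(scrape_one, work_items(pages)))
```

Calling it would then be `bios = scrape_all(all_wikis)`, kept under the `if __name__ == '__main__':` guard as in the question. Because the work is mostly waiting on the network, `multiprocessing.pool.ThreadPool` may also be worth trying; it offers the same `map` interface using threads instead of processes.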
