The presence of input() causes multiprocessing to fail. What is the reason behind this?

Question (votes: 1, answers: 1)

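I have a simple web scraping script that uses multiprocessing. I want the user to choose which Excel file to scrape, so I call input() at the start.

Without the multiprocessing code, the script runs fine (although it processes one link at a time). With the multiprocessing code, the script hangs indefinitely. This happens even if I never use the string collected by input(), so it seems that the mere presence of input() makes the script hang when multiprocessing is involved.

I have no idea why this happens. Any insight is much appreciated.

Code: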

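# Imports were omitted from the post; get() is assumed to be requests.get
import os
import openpyxl
from bs4 import BeautifulSoup
from multiprocessing import Pool
from requests import get

os.chdir(os.path.curdir)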

# excel_file_name_b is not used in the script at all, but because
# it exists, the script hangs. Ideally I want to keep input() in the script
excel_file_name_b = input()
excel_file_name = "URLs.xlsx"

excel_file = openpyxl.load_workbook(excel_file_name)
active_sheet = excel_file.active
rows = active_sheet.max_row

for i in range(2,rows+1,1):
    list.append(active_sheet.cell(row=i,column=1).value)

headers = {"User-Agent":'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36',"Accept-Language": 'en-GB'}

def scrape(url):
    try:
        res = get(url, headers = headers)
        html_soup = BeautifulSoup(res.text, 'lxml')
        html_element = html_soup.select('._3pvwV0k')
        return res.url, html_element[0].getText()
    except:
        return res.url, "Not found or error"
        pass

if __name__ == '__main__':
    p = Pool(10)
    scrape_return = p.map(scrape, list)
    for k in range(len(scrape_return)):
        try:
            active_sheet.cell(row=k+2, column=2).value = scrape_return[k][0]
            active_sheet.cell(row=k+2, column=3).value = scrape_return[k][1]
        except:
            continue

excel_file.save(excel_file_name)

Tags: python, multiprocessing

1 Answer (2 votes)

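Because your input() call is at module level, every worker process runs it too. On platforms that spawn workers rather than fork them (Windows, for example), each worker starts by importing your script, so all module-level code executes again in that worker.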
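
To see the re-execution without starting any processes, here is a minimal sketch that imitates it with runpy. It assumes the spawn start method, which, as far as I know, re-runs the main script under the name "__mp_main__" inside each worker. The script text and the run_as helper are made up for illustration; the events list stands in for the side effect of calling input().

```python
import os
import runpy
import tempfile

# A tiny stand-in for the question's script (hypothetical content).
SCRIPT = '''
events = []
events.append("module level")   # stands in for the module-level input()

if __name__ == "__main__":
    events.append("guarded block")
'''


def run_as(run_name):
    """Execute SCRIPT with __name__ set to run_name and return what it recorded."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "demo_script.py")
        with open(path, "w") as f:
            f.write(SCRIPT)
        namespace = runpy.run_path(path, run_name=run_name)
    return namespace["events"]


parent_events = run_as("__main__")     # how the script runs when you launch it
worker_events = run_as("__mp_main__")  # how a spawned worker re-imports it

print(parent_events)  # ['module level', 'guarded block']
print(worker_events)  # ['module level']
```

The module-level line runs in both cases, while the guarded block runs only in the parent. That is why anything interactive belongs under the guard.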

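multiprocessing closes stdin in the child processes, so input() cannot read anything there, and that is what causes the error in each worker. [docs]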
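
As a rough single-process illustration of what a worker sees, the sketch below points sys.stdin at os.devnull before calling input(). This rests on my reading of the multiprocessing docs, which describe a child's stdin being closed and replaced with os.devnull; the helper name is made up, and a real worker may behave differently depending on platform and Python version.

```python
import os
import sys


def input_with_devnull_stdin():
    """Call input() while sys.stdin reads from os.devnull, then restore stdin."""
    original = sys.stdin
    sys.stdin = open(os.devnull)
    try:
        return input()
    except EOFError:
        # Nothing can ever be read from devnull, so input() gives up immediately.
        return "EOFError"
    finally:
        sys.stdin.close()
        sys.stdin = original


result = input_with_devnull_stdin()
print(result)  # EOFError
```

With nothing to read, input() raises EOFError instead of waiting for the user, so a worker that reaches a module-level input() fails during start-up.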

If you move it under if __name__ == '__main__': you should not have the problem anymore.

Edit: restructuring your code along the lines below may clear up other issues as well, since it is not executing as intended.


import openpyxl
from bs4 import BeautifulSoup
from multiprocessing import Pool
from requests import get   # assumed: the question's get() is requests.get


def scrape(url):
    headers = {"User-Agent":'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36',"Accept-Language": 'en-GB'}
    try:
        res = get(url, headers=headers)
        html_soup = BeautifulSoup(res.text, 'lxml')
        html_element = html_soup.select('._3pvwV0k')
        return res.url, html_element[0].getText()
    except Exception:
        # res does not exist if get() itself failed, so report the requested url
        return url, "Not found or error"


def main():
    excel_file_name_b = input()   # now runs only in the parent process
    excel_file_name = "URLs.xlsx"
    excel_file = openpyxl.load_workbook(excel_file_name)
    active_sheet = excel_file.active
    rows = active_sheet.max_row

    # renamed from list, which shadows the built-in type (it is not a keyword)
    urls = []
    for i in range(2, rows + 1):
        urls.append(active_sheet.cell(row=i, column=1).value)

    # the with-block cleans up the worker processes once map() has returned
    with Pool(10) as p:
        scrape_return = p.map(scrape, urls)

    # write the final url and the scraped text next to each input url
    for k in range(len(scrape_return)):
        try:
            active_sheet.cell(row=k+2, column=2).value = scrape_return[k][0]
            active_sheet.cell(row=k+2, column=3).value = scrape_return[k][1]
        except Exception:
            continue

    excel_file.save(excel_file_name)

if __name__ == '__main__':
    main()
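
Since the goal in the question is to let the user pick the workbook, here is one possible way to turn the typed text into a file name once input() lives inside main(). The helper resolve_workbook_name and its fallback rules are hypothetical, not something from the question; only the default name "URLs.xlsx" comes from the posted code.

```python
DEFAULT_WORKBOOK = "URLs.xlsx"  # the file name hard-coded in the question


def resolve_workbook_name(answer, default=DEFAULT_WORKBOOK):
    """Turn the raw text typed at the prompt into a workbook file name."""
    name = answer.strip()
    if not name:
        # The user just pressed Enter: fall back to the default workbook.
        return default
    if not name.lower().endswith(".xlsx"):
        # Assumption for this sketch: a bare name means an .xlsx file.
        name += ".xlsx"
    return name


print(resolve_workbook_name("  prices "))  # prices.xlsx
print(resolve_workbook_name(""))           # URLs.xlsx
```

Inside main() this would be called as excel_file_name = resolve_workbook_name(input()), so the prompt still runs only in the parent process.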