Unable to execute my script the right way using threads

Question · votes: 2 · answers: 1

I'm trying to create a scraper using Python and Thread to speed up the execution time. The scraper should parse all the shop names along with their phone numbers across multiple pages.

The script runs without any problem. However, since I'm new to working with Thread, I can hardly tell whether I'm doing it the right way.

This is what I have tried so far:

import requests
from lxml import html
import threading

link = "https://www.yellowpages.com/search?search_terms=coffee&geo_location_terms=Los%20Angeles%2C%20CA&page={}"

def get_information(url):
    for pagelink in [url.format(page) for page in range(20)]:
        response = requests.get(pagelink).text
        tree = html.fromstring(response)
        for title in tree.cssselect("div.info"):
            name = title.cssselect("a.business-name span[itemprop=name]")[0].text
            try:
                phone = title.cssselect("div[itemprop=telephone]")[0].text
            except Exception:
                phone = ""
            print(f'{name} {phone}')

thread = threading.Thread(target=get_information, args=(link,))

thread.start()
thread.join()

thread = threading.Thread(target=get_information, args=(link,))

thread.start()
thread.join()

The thing is, I can't find any difference in time or performance whether I run the above script with Thread or without it. If I'm going wrong, how can I execute the above script using Thread?

EDIT: I've tried to change the logic to use multiple links. Is it okay now? Thanks in advance.

python python-3.x web-scraping lxml python-multithreading
1 Answer

2 votes

You can use threading to scrape several pages in parallel, as below:

import requests
from lxml import html
import threading

link = "https://www.yellowpages.com/search?search_terms=coffee&geo_location_terms=Los%20Angeles%2C%20CA&page={}"

def get_information(url):
    response = requests.get(url).text
    tree = html.fromstring(response)
    for title in tree.cssselect("div.info"):
        name = title.cssselect("a.business-name span[itemprop=name]")[0].text
        try:
            phone = title.cssselect("div[itemprop=telephone]")[0].text
        except Exception:
            phone = ""
        print(f'{name} {phone}')

threads = []
for url in [link.format(page) for page in range(20)]:
    thread = threading.Thread(target=get_information, args=(url,))
    threads.append(thread)
    thread.start()
for thread in threads:
    thread.join()

Note that the sequence of the data won't be preserved. This means that if you were to scrape page by page, the sequence of the extracted data would be:

page_1_name_1
page_1_name_2
page_1_name_3
page_2_name_1
page_2_name_2
page_2_name_3
page_3_name_1
page_3_name_2
page_3_name_3

while with threads the data will be mixed:

page_1_name_1
page_2_name_1
page_1_name_2
page_2_name_2
page_3_name_1
page_2_name_3
page_1_name_3
page_3_name_2
page_3_name_3
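As a side note, if you need the output to stay in page order, the standard library's `concurrent.futures.ThreadPoolExecutor` can both cap the number of concurrent threads and return results in input order via `executor.map`. A minimal sketch, with a simulated fetch standing in for the real requests/lxml logic (the `scrape_page` function is a placeholder, not part of the original script):

```python
import random
import time
from concurrent.futures import ThreadPoolExecutor

def scrape_page(page):
    # Placeholder for the real requests/lxml fetch-and-parse logic;
    # the random sleep simulates variable network latency.
    time.sleep(random.uniform(0.01, 0.05))
    return f"page_{page}_results"

# map() submits all pages to the pool but yields results
# in input order, even though pages finish out of order.
with ThreadPoolExecutor(max_workers=5) as executor:
    results = list(executor.map(scrape_page, range(20)))

print(results[0])   # always the result for page 0
print(results[19])  # always the result for page 19
```

Unlike bare `threading.Thread` objects, the pool also limits how many requests run at once (here 5), which is gentler on the target site.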