I am trying to use the code below to get the anchor tag href values, including nested anchor tags, from the URL https://www.tradeindia.com/, but it does not produce the expected output. The code below only gets the URLs from a single page - can anyone suggest a fix?
import concurrent.futures
import requests
from bs4 import BeautifulSoup
from concurrent.futures import ThreadPoolExecutor

def get_page(url):
    response = requests.get(url)
    return response.content

def extract_links(html_content):
    soup = BeautifulSoup(html_content, 'html.parser')
    links = [a['href'] for a in soup.find_all('a', href=True)]
    return links

def process_page(url):
    html_content = get_page(url)
    links = extract_links(html_content)
    return links

def main():
    start_url = 'https://www.tradeindia.com/'

    # Fetch the initial page
    start_page_content = get_page(start_url)

    # Extract links from the initial page
    start_page_links = extract_links(start_page_content)
    all_links = set(start_page_links)

    # Use ThreadPoolExecutor to parallelize the process
    with ThreadPoolExecutor(max_workers=5) as executor:
        # Submit tasks for processing each link concurrently
        future_to_url = {executor.submit(process_page, url): url for url in start_page_links}

        # Iterate through completed tasks and update the set of all links
        for future in concurrent.futures.as_completed(future_to_url):
            url = future_to_url[future]
            try:
                links_on_page = future.result()
                all_links.update(links_on_page)
            except Exception as e:
                print(f"Error processing {url}: {e}")

    # Print all the extracted links
    print("All Links:")
    print(len(all_links))
    for link in all_links:
        print(link)

if __name__ == "__main__":
    main()
You get URLs from the start page and add them to all_links, but this never runs executor.submit() with the newly found URLs, so the crawl stops after one level. That would need more complex code - see the sketch below.
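If you want to keep the ThreadPoolExecutor approach, the crawl has to keep submitting the URLs it discovers back to the executor until no new pages remain. A minimal sketch, reusing the process_page() helper from the question; the seen set, the max_pages limit, the urljoin() normalization and the domain check are assumptions added for illustration, not part of the original code:

import concurrent.futures
from urllib.parse import urljoin, urlparse

def crawl(start_url, max_pages=50):
    # `seen` and `max_pages` are illustrative additions to bound the crawl
    seen = {start_url}      # URLs already submitted, so nothing is fetched twice
    all_links = set()
    with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
        pending = {executor.submit(process_page, start_url): start_url}
        while pending and len(seen) < max_pages:
            # Wait until at least one in-flight page has finished
            done, _ = concurrent.futures.wait(
                pending, return_when=concurrent.futures.FIRST_COMPLETED)
            for future in done:
                url = pending.pop(future)
                try:
                    links = future.result()
                except Exception as e:
                    print(f"Error processing {url}: {e}")
                    continue
                for link in links:
                    absolute = urljoin(url, link)  # resolve relative hrefs
                    all_links.add(absolute)
                    # Submit newly discovered URLs - this is the step the
                    # original code is missing
                    if (absolute not in seen
                            and urlparse(absolute).netloc.endswith('tradeindia.com')
                            and len(seen) < max_pages):
                        seen.add(absolute)
                        pending[executor.submit(process_page, absolute)] = absolute
    return all_links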
Frankly, I would rather use scrapy for this, because it already handles threading and the code is much simpler.
Usually scrapy requires generating a project with many files and folders, but you can run the code below without creating a project. You can put everything in a single file and run it like any other script - python script.py. The script also automatically writes the results to a .csv file.
import scrapy

class MySpider(scrapy.Spider):
    name = 'myspider'

    allowed_domains = ['tradeindia.com']
    start_urls = ['https://www.tradeindia.com/']

    def parse(self, response):
        print('\n>>> url:', response.url, '\n')

        links = response.css('a::attr(href)').extract()

        # create items which scrapy will save in the CSV file
        for url in links:
            yield {'url': url}

        # create requests for the found URLs so scrapy processes the next pages
        for url in links:
            yield response.follow(url)

# --- run without project and save in `output.csv` ---

from scrapy.crawler import CrawlerProcess

c = CrawlerProcess({
    'USER_AGENT': 'Mozilla/5.0',
    # save in a CSV, JSON or XML file
    'FEEDS': {'output.csv': {'format': 'csv'}},  # new in Scrapy 2.1
})
c.crawl(MySpider)
c.start()
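One caveat: tradeindia.com is a large site, so the spider above will keep following links for a very long time. If you only want a bounded crawl, scrapy has built-in settings for that; a sketch of the same CrawlerProcess with limits added (the values 2 and 100 are example choices, not required settings):

c = CrawlerProcess({
    'USER_AGENT': 'Mozilla/5.0',
    'FEEDS': {'output.csv': {'format': 'csv'}},
    # both limits below are example values chosen for illustration
    'DEPTH_LIMIT': 2,              # follow links at most 2 levels deep
    'CLOSESPIDER_PAGECOUNT': 100,  # stop after about 100 crawled responses
})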