Web Scraping AliExpress - Lazy Loading

Problem description · Votes: 0 · Answers: 2

I'm trying to scrape AliExpress using Selenium and Python. I followed a YouTube tutorial step by step, but I can't seem to get it to work.

I also tried using requests and BeautifulSoup, but AliExpress appears to use a lazy loader on its product listings. I tried a window-scroll script, but that didn't work; the content only seems to load when I scroll the page myself.
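A single jump to the bottom of the page usually fires before the lazy loader reacts; scrolling in small increments with a pause between steps (the approach the first answer below takes) gives each batch of products time to render. A minimal sketch, assuming a Selenium driver is already on the search results page:

import time

def scroll_in_steps(driver, step=500, pause=0.5):
    """Scroll down a little at a time so lazily loaded items get a chance to render."""
    height = driver.execute_script("return document.body.scrollHeight")
    pos = 0
    while pos < height:
        pos += step
        driver.execute_script(f"window.scrollTo(0, {pos});")
        time.sleep(pause)  # give the lazy loader time to fetch the next batch
        # the page grows as new products load, so re-read the height each pass
        height = driver.execute_script("return document.body.scrollHeight")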

Here is the URL of the page I'm trying to scrape: https://www.aliexpress.com/wholesale?trafficChannel=main&d=y&CatId=0&SearchText=dog+supplies&ltype=wholesale&SortType=default&g=n

Here is the code I have so far. It doesn't return anything in the output. I think that's because it tries to go through all the product listings but can't find any, since they haven't loaded yet...

Any advice/help would be greatly appreciated; apologies in advance for the poor formatting and bad code.

Thanks!

"""
To do
HOT PRODUCT FINDER Enter: Keyword, to generate a url

Product Name
Product Image
Product Link
Sales Number
Price
Create an excel file that contains these data
Sort the list by top selling orders
Develop an algorithm for the velocity of the product (total sales increased / time?)
Scrape site every day """

import csv
from bs4 import BeautifulSoup
from selenium import webdriver
import time
import requests
import lxml

# Start up the web driver
driver = webdriver.Chrome()

# grab Keywords
search_term = input('Keywords: ')

# url generator

def get_url(search_term):
    """Generate a url link using search term provided"""
    url_template = 'https://www.aliexpress.com/wholesale?trafficChannel=main&d=y&CatId=0&SearchText={}&ltype=wholesale&SortType=default&g=n'
    search_term = search_term.replace(" ", "+")
    return url_template.format(search_term)

url = get_url(search_term)
driver.get(url)

#scrolling down to the end of the page
time.sleep(2)
driver.execute_script('window.scrollTo(0, document.body.scrollHeight);')

#Extracting the Collection
r = requests.get(url)
soup = BeautifulSoup(r.content,'lxml')
productlist = soup.find_all('div', class_='list product-card')
print(productlist)
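Note that requests.get(url) downloads a fresh copy of the page without executing any JavaScript, so it can never see the lazily loaded products, no matter how far the Selenium browser has scrolled. Parsing the DOM the browser actually rendered, via driver.page_source, avoids this. A minimal sketch (the 'list product-card' class is copied from the code above and may not match the live site):

from bs4 import BeautifulSoup

# Parse the page as the browser currently sees it, after scrolling.
soup = BeautifulSoup(driver.page_source, 'lxml')
productlist = soup.find_all('div', class_='list product-card')
print(productlist)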

Tags: python, selenium, beautifulsoup
2 Answers
0 votes
import time

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys

chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument("start-maximized")
chrome_options.add_argument("disable-infobars")
chrome_options.add_argument("--disable-extensions")
chrome_options.add_argument("--disable-gpu")
chrome_options.add_argument("--disable-dev-shm-usage")
chrome_options.add_argument("--no-sandbox")
chrome_options.add_argument('--disable-blink-features=AutomationControlled')

driver = webdriver.Chrome(service=Service('chromedriver.exe'), options=chrome_options)

# grab keywords
search_term = input('Keywords: ')

# open the site and run the search through the page's own search box
driver.get('https://www.aliexpress.com')
driver.implicitly_wait(10)

search_box = driver.find_element(By.NAME, 'SearchText')
search_box.send_keys(search_term)
search_box.send_keys(Keys.ENTER)

productlist = []
product = driver.find_element(By.XPATH, '//*[@id="root"]/div/div/div[2]/div[2]/div/div[2]/ul')

# scroll down in small steps so the lazy loader has time to render each batch
height = driver.execute_script("return document.body.scrollHeight")
for scrol in range(100, height - 1800, 100):
    driver.execute_script(f"window.scrollTo(0,{scrol})")
    time.sleep(0.5)

# walk the first 15 result rows and collect the product titles
div = []
list_i = []
item_title = []
a = []
for z in range(1, 16):
    div.append(product.find_element(By.XPATH, f'//*[@id="root"]/div/div/div[2]/div[2]/div/div[2]/ul/div[{z}]'))
for pr in div:
    list_i.append(pr.find_elements(By.CLASS_NAME, 'list-item'))
for pc in list_i:
    for item in pc:
        item_title.append(item.find_element(By.CLASS_NAME, 'item-title-wrap'))
for pt in item_title:
    a.append(pt.find_element(By.TAG_NAME, 'a'))
for prt in a:
    productlist.append(prt.text)
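The collected titles are only kept in memory; to match the question's goal of exporting the data to a spreadsheet, a minimal follow-up sketch writing productlist to a CSV file (the filename products.csv is just an example):

import csv

# Write the scraped product titles to a CSV file, one row per product.
with open('products.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(['product_name'])
    for title in productlist:
        writer.writerow([title])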

0 votes

AliExpress uses lazy loading for its product listings, but you can send a POST request to https://www.aliexpress.com/fn/search-pc/index, which returns the complete product list as JSON, including prices, image links, and so on. I've scraped most of AliExpress's catalog this way.
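The answer does not show the request payload or headers; in practice you would copy them from the browser's developer-tools Network tab while a search loads. A hypothetical sketch of the general shape, with placeholder field names that are not the real schema:

import requests

url = 'https://www.aliexpress.com/fn/search-pc/index'
# Placeholder payload: the real field names and values must be copied from
# the browser's Network tab; these keys are illustrative only.
payload = {'SearchText': 'dog supplies', 'page': 1}
headers = {'User-Agent': 'Mozilla/5.0'}  # the endpoint may also require cookies

resp = requests.post(url, json=payload, headers=headers)
resp.raise_for_status()
data = resp.json()  # per the answer: full product list, with prices and image links
print(data)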
