Why does part of the content magically disappear?


I am trying to scrape the site https://www.pik.ru/search/vangarden/storehouse. I can fetch the HTML from the site and write it to a file, but when I try to work with that HTML, a lot of the information is missing.

Example: what the page actually shows in the browser (screenshot 1; not the full content)

What I get when I try to work with the fetched HTML (screenshot 2)

Please help me figure out what I am doing wrong (thank you!). My code:

import requests
from bs4 import BeautifulSoup
import os

url = 'https://www.pik.ru/search/storehouse'

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/121.0.0.0 Safari/537.36 Edg/121.0.0.0'
}

proxies = {
    'https': 'http://146.247.105.71:4827'
}


def download_pages_objects(url):
    # Start each run with a fresh links file.
    if os.path.isfile(r'C:\Users\kraz1\OneDrive\Рабочий стол\Антон\python\парсинг\кладовочная\pik_links.txt'):
        os.remove(
            r'C:\Users\kraz1\OneDrive\Рабочий стол\Антон\python\парсинг\кладовочная\pik_links.txt')

    list_links = []
    # Fetch the listing page; this is the server-rendered HTML only.
    req = requests.get(url, headers=headers, proxies=proxies)
    soup = BeautifulSoup(req.text, "lxml")

    # Collect every project card link found in the raw HTML; the generated
    # class name comes from the site's build and can change at any time.
    for i in soup.find_all("a", class_="styles__ProjectCard-uyo9w7-0 friPgx"):
        list_links.append('https://www.pik.ru' + i.get('href') + '\n')

    with open(r'C:\Users\kraz1\OneDrive\Рабочий стол\Антон\python\парсинг\кладовочная\pik_links.txt', 'a') as file:
        for link in list_links:
            file.write(link)


def get_list_objects_links(url):
    download_pages_objects(url)

    list_of_links = []
    with open(r'C:\Users\kraz1\OneDrive\Рабочий стол\Антон\python\парсинг\кладовочная\pik_links.txt', 'r') as file:
        for item in file:
            list_of_links.append(item)

    return list_of_links


list_links = get_list_objects_links(url)

# Inspect only the first object page while debugging.
count = 0
for link in list_links:
    req = requests.get(link.replace('\n', ''),
                       headers=headers, proxies=proxies)

    # Write with explicit UTF-8 so Cyrillic markup does not trip over
    # Windows' default locale encoding.
    with open('1.html', 'w', encoding='utf-8') as file:
        file.write(req.text)

    # Debugging: dump the raw response and whatever <div> tags the parser sees.
    soup = BeautifulSoup(req.text, 'lxml')
    print(req.text, '\n\n\n')
    print(soup.find_all('div', ''), '\n\n\n')

    with open('1.html', 'r', encoding='utf-8') as file:
        scr = file.read()

    print(scr, '\n\n\n')
    soup = BeautifulSoup(scr, 'lxml')
    print(soup)

    count += 1
    if count == 1:
        break

I also tried working with the response without writing it to a file, and I switched between lxml, xml and html.parser; it did not help (or I am doing something wrong).

python parsing web-scraping html-parsing
1 Answer

As JonSG pointed out, what you see in the browser is the result of the browser engine executing JavaScript and modifying the page dynamically. In Python, requests fetches the page contents as-is, before any script has run, so BeautifulSoup can only parse what is actually present in that raw HTML. Switching parsers cannot help, because the missing data was never in the document to begin with.
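You can verify this yourself: pick any string that is visible in the browser but absent from your parsed output and search for it in the raw response. A minimal sketch; the needle value below is a hypothetical placeholder, substitute text you can actually see on the page:

import requests

url = 'https://www.pik.ru/search/storehouse'
headers = {'User-Agent': 'Mozilla/5.0'}
html = requests.get(url, headers=headers).text

# 'needle' is a placeholder; use a string from your screenshot 1.
needle = 'text visible in the browser'
print(needle in html)  # False means JavaScript added that content after load
print(len(html))       # the raw document is usually far smaller than the DOM in DevTools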

What you are looking for is a web driver such as Selenium: Download webpage content with Selenium Webdriver and HtmlUnit
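A rough sketch of that approach: load the page in a real (headless) Chrome, wait until the cards have been rendered, then hand the resulting DOM to BeautifulSoup. The CSS selector below is an assumption based on the class name in your code and will break whenever the site is redeployed; the 15-second timeout is arbitrary:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup

options = webdriver.ChromeOptions()
options.add_argument('--headless=new')

driver = webdriver.Chrome(options=options)
try:
    driver.get('https://www.pik.ru/search/storehouse')
    # Wait until the page's JavaScript has rendered at least one project card.
    WebDriverWait(driver, 15).until(
        EC.presence_of_element_located(
            (By.CSS_SELECTOR, 'a[class*="ProjectCard"]')))
    # page_source is the DOM *after* the scripts ran, unlike requests' response.
    soup = BeautifulSoup(driver.page_source, 'lxml')
    for a in soup.select('a[class*="ProjectCard"]'):
        print('https://www.pik.ru' + a.get('href'))
finally:
    driver.quit()

Recent Selenium releases download a matching chromedriver on their own (Selenium Manager), so no separate driver setup is needed.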
