I am trying to scrape https://www.pik.ru/search/vangarden/storehouse. I successfully fetch the HTML from the site and write it to a file, but a lot of the information I see in the browser is missing from the HTML I get.
Please help me figure out what I am doing wrong (thank you!). My code:
import requests
from bs4 import BeautifulSoup
import undetected_chromedriver
import time
import os

url = 'https://www.pik.ru/search/storehouse'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/121.0.0.0 Safari/537.36 Edg/121.0.0.0'
}
proxies = {
    'https': 'http://146.247.105.71:4827'
}

LINKS_FILE = r'C:\Users\kraz1\OneDrive\Рабочий стол\Антон\python\парсинг\кладовочная\pik_links.txt'


def download_pages_objects(url):
    # Remove the old links file so each run starts fresh.
    if os.path.isfile(LINKS_FILE):
        os.remove(LINKS_FILE)
    list_links = []
    req = requests.get(url, headers=headers, proxies=proxies)
    soup = BeautifulSoup(req.text, 'lxml')
    # Collect the href of every project card on the search page.
    for i in soup.find_all('a', class_='styles__ProjectCard-uyo9w7-0 friPgx'):
        list_links.append('https://www.pik.ru' + i.get('href') + '\n')
    with open(LINKS_FILE, 'a', encoding='utf-8') as file:
        for link in list_links:
            file.write(link)


def get_list_objects_links(url):
    download_pages_objects(url)
    list_of_links = []
    with open(LINKS_FILE, 'r', encoding='utf-8') as file:
        for item in file:
            list_of_links.append(item)
    return list_of_links


list_links = get_list_objects_links(url)
count = 0
for link in list_links:
    req = requests.get(link.strip(), headers=headers, proxies=proxies)
    # Write the raw response to disk, then read it back and compare.
    with open('1.html', 'w', encoding='utf-8') as file:
        file.write(req.text)
    soup = BeautifulSoup(req.text, 'lxml')
    print(req.text, '\n\n\n')
    print(soup.find_all('div', ''), '\n\n\n')
    with open('1.html', 'r', encoding='utf-8') as file:
        scr = file.read()
    print(scr, '\n\n\n')
    soup = BeautifulSoup(scr, 'lxml')
    print(soup)
    count += 1
    if count == 1:
        break
I tried working with the response without writing it to a file, and I also switched between the lxml, xml, and html.parser parsers, but it did not help (or I did something wrong).
As JonSG points out, what you see in the browser is the result of the browser engine executing JavaScript and modifying the page dynamically. requests only downloads the page contents as served by the server, and BeautifulSoup parses that raw HTML as-is; neither of them runs any JavaScript.
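A quick way to confirm this diagnosis is to fetch the raw HTML and count the cards before any JavaScript has run. This is a minimal sketch; the class name is taken from your own code, and such generated class names can change whenever the site is rebuilt:

import requests
from bs4 import BeautifulSoup

raw = requests.get('https://www.pik.ru/search/storehouse',
                   headers={'User-Agent': 'Mozilla/5.0'}).text
soup = BeautifulSoup(raw, 'lxml')
cards = soup.find_all('a', class_='styles__ProjectCard-uyo9w7-0 friPgx')
print(len(cards))  # expected to be 0 if the cards are rendered client-side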
What you are looking for is a web driver such as Selenium; see: Download web page content with Selenium Webdriver and HtmlUnit.
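Below is a minimal sketch of that approach with Selenium, assuming Chrome is installed; the CSS selector used in the wait is a guess based on the class name from your code and may need adjusting:

# Load the page in a real browser engine so the JavaScript runs,
# then hand the rendered DOM to BeautifulSoup. Requires: pip install selenium
# (Selenium 4 downloads a matching chromedriver automatically).
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

options = Options()
options.add_argument('--headless=new')  # no visible browser window
driver = webdriver.Chrome(options=options)
try:
    driver.get('https://www.pik.ru/search/storehouse')
    # Wait until at least one project card has been rendered by JavaScript.
    # The selector is an assumption based on the class name in the question.
    WebDriverWait(driver, 15).until(
        EC.presence_of_element_located(
            (By.CSS_SELECTOR, 'a[class*="ProjectCard"]')))
    html = driver.page_source  # the DOM after JavaScript has run
finally:
    driver.quit()

soup = BeautifulSoup(html, 'lxml')
for a in soup.find_all('a', class_='styles__ProjectCard-uyo9w7-0 friPgx'):
    print('https://www.pik.ru' + a.get('href'))

Since you already import undetected_chromedriver, its undetected_chromedriver.Chrome() can serve as a drop-in replacement for webdriver.Chrome() if the site blocks plain Selenium.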