Data missing when scraping with beautifulsoup4

Question

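I'm new to parsing with Python and BeautifulSoup4. I'm scraping this website, and I need the current price per million shown on the home page.

I have already spent 3 hours on this, searching the internet for a solution. I know there is a library, PyQT4, that can mimic a web browser and load the content, so that once loading has finished you can extract the data you need. But I could not get anywhere with it.

I'm using the method below to collect the data as raw text. I have tried other approaches as well.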

import requests
from bs4 import BeautifulSoup

def parseMe(url):
    # requests returns the HTML as served; it does not run the page's JavaScript.
    source_code = requests.get(url)
    plain_text = source_code.text
    soup = BeautifulSoup(plain_text, 'html.parser')
    osrs_text = soup.find('div', class_='col-md-12 text-center')
    print(osrs_text.encode('utf-8'))

Please have a look at this image. I think the problem is the ::before and ::after pseudo-elements, which only appear once the page has loaded. Any help would be highly appreciated.

Answer 1

You should use selenium instead of requests:

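from selenium import webdriver
from bs4 import BeautifulSoup

def parse(url):
    # Raw string so the backslashes in the Windows path are not read as escapes.
    # Selenium 3 style call; Selenium 4 passes the driver path through a Service object instead.
    driver = webdriver.Chrome(r'D:\Programming\utilities\chromedriver.exe')
    driver.get(url)
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    driver.quit()
    return soup.find('h4', {'id': 'curr-price-per-mil-text'}).text

parse('https://boglagold.com/buy-runescape-gold/')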

Output:

'Current Price Per Mil: 0.80USD'

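The reason is that the element's value is filled in by JavaScript, which requests cannot execute. This particular snippet uses the Chrome driver; you can use Firefox or another equivalent browser if you prefer (you will need to install the selenium library and obtain the Chrome driver yourself).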
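
The string returned by parse() still mixes a label, a number, and a currency code. Below is a minimal sketch of pulling the number out, assuming the text always has the shape shown in the output above. `extract_price` is a hypothetical helper and not part of the original answer.

```python
import re

def extract_price(text):
    """Return the first decimal number found in text as a float, or None."""
    match = re.search(r'\d+(?:\.\d+)?', text)
    return float(match.group()) if match else None

# The sample string is the output shown in this answer.
price = extract_price('Current Price Per Mil: 0.80USD')
print(price)  # 0.8
```

Returning None when no number is present lets the caller notice that the page was read before the value had been filled in.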

Answer 2

The page makes an XHR request to fetch a JSON file that contains the prices:

import requests

r = requests.get('https://api.boglagold.com/api/product/?id=osrs-gold&couponCode=null')
j = r.json()
# print(j)
print('sellPrice', j['sellPrice'])
print('buyPrice', j['buyPrice'])

Output:

sellPrice 0.8
buyPrice 0.62
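
The field names are only known from inspecting the response, so they could change without notice. Here is a small sketch of a guard around the lookup, assuming the response is a JSON object with the two keys printed above. `extract_prices` is a hypothetical helper, and the sample payload contains only the fields shown in this answer.

```python
def extract_prices(payload):
    """Return (sellPrice, buyPrice) from the decoded JSON, failing loudly if a field is missing."""
    missing = [key for key in ('sellPrice', 'buyPrice') if key not in payload]
    if missing:
        raise KeyError(f'response is missing expected fields: {missing}')
    return float(payload['sellPrice']), float(payload['buyPrice'])

# Stand-in for r.json(); the real response presumably carries more keys.
sample = {'sellPrice': 0.8, 'buyPrice': 0.62}
sell, buy = extract_prices(sample)
print(sell, buy)  # 0.8 0.62
```

With the code in this answer it would be called as `extract_prices(r.json())`.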

Answer 3

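As the other answers state, the page source contains only the text Current Price Per Mil: and 0USD. The value in between, 0.8, is obtained dynamically from the URL below (which can be found using a process described, for example, here and in many other places). The site checks for bots, so you have to use a method described, for example, here.

So, putting it together: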

import requests

url = 'https://api.boglagold.com/api/product/?id=osrs-gold&couponCode=null'
response = requests.get(url, headers={'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36'})

response.json()['sellPrice']

Output:

0.8
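
For readers who would rather avoid a third-party dependency, the same idea can be sketched with the standard library. By default urllib identifies itself as Python-urllib, which bot checks tend to reject, so the header is set explicitly. The shortened User-Agent string and the helper names are assumptions made for illustration, and whether the site accepts this particular string has not been verified.

```python
import json
import urllib.request

API_URL = 'https://api.boglagold.com/api/product/?id=osrs-gold&couponCode=null'
# Shortened browser-like value, chosen for illustration only.
USER_AGENT = 'Mozilla/5.0 (X11; Linux x86_64)'

def build_request(url, user_agent=USER_AGENT):
    """Create a GET request that carries a browser-like User-Agent header."""
    return urllib.request.Request(url, headers={'User-Agent': user_agent})

def fetch_sell_price(url=API_URL):
    """Send the request and return the sellPrice field of the JSON reply."""
    with urllib.request.urlopen(build_request(url), timeout=10) as response:
        return json.load(response)['sellPrice']

# Nothing is sent here; this only shows the header that would go out.
# urllib stores header names capitalized, hence 'User-agent'.
request = build_request(API_URL)
print(request.get_header('User-agent'))
```

Calling `fetch_sell_price()` performs the actual network request.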

Answer 4

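The problem is that the data you want to scrape is added to the page dynamically by JavaScript. You can try running the JS on the client side, waiting for the data to be fetched, and then reading the DOM content; if you want to do that, see @gmds's answer to this question. Another approach is to inspect the requests made by the JavaScript code and find which one contains the information you need. Then you can make that request from Python and get the data without PyQT4 or even BS4.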
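
One way to do that inspection offline is to export the browser's network log as a HAR file and list the requests that returned JSON. The sketch below assumes the standard HAR layout (log, entries, request.url, response.content.mimeType). `json_request_urls` is a hypothetical helper, and the sample data, including its MIME types, is hand-written rather than captured from the site.

```python
def json_request_urls(har):
    """Return the URLs of recorded requests whose response was served as JSON."""
    urls = []
    for entry in har.get('log', {}).get('entries', []):
        mime = entry.get('response', {}).get('content', {}).get('mimeType', '')
        if 'json' in mime.lower():
            urls.append(entry['request']['url'])
    return urls

# Hand-written stand-in for a HAR export; a real one would come from
# json.load() on the file saved from the browser's network panel.
sample_har = {'log': {'entries': [
    {'request': {'url': 'https://boglagold.com/buy-runescape-gold/'},
     'response': {'content': {'mimeType': 'text/html'}}},
    {'request': {'url': 'https://api.boglagold.com/api/product/?id=osrs-gold&couponCode=null'},
     'response': {'content': {'mimeType': 'application/json'}}},
]}}
print(json_request_urls(sample_har))
```

Each URL it returns is a candidate to request directly from Python.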
