Spacing and duplication problems after scraping data from websites with Nokogiri


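I'm scraping data from two sites. The scrape of the first site picks up extra content and repeats the price. The scrape of the second site gets the correct data but returns it with an odd spacing problem that I'm not sure how to fix. I need help with both issues.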

require "nokogiri"
require "open-uri"

class DailyDealz::Deal
  attr_accessor :name, :price, :availability, :url

  def self.today
    # Scrape woot and meh, then return deals based on that data
    self.scrape_deals
  end

  def self.scrape_deals
    deals = []

    deals << self.scrape_woot
    deals << self.scrape_meh
    # deals << self.scrape_steepandcheap

    deals
  end

  def self.scrape_woot
    # URI.open comes from open-uri; a bare open(url) no longer works on Ruby 3+
    doc = Nokogiri::HTML(URI.open("https://www.woot.com/"))

    deal = self.new
    deal.name = doc.search("h2.main-title").text.strip
    deal.price = doc.search("#todays-deal span.price").text.strip
    deal.url = doc.search("a.wantone").first.attr("href").strip
    deal.availability = true
    # deal.website  (no such accessor is defined above, so this call is left out)

    deal
  end

  def self.scrape_meh
    doc = Nokogiri::HTML(URI.open("https://meh.com/"))

    deal = self.new
    deal.name = doc.search("section.features h2").text.strip
    deal.price = doc.search("#button.buy-button").text.gsub("Buy it.", "").strip
    deal.url = "https://meh.com/"
    deal.availability = true

    deal
  end
end

The output is:

// ♥  ./bin/daily-dealz
Todays Daily Deals
1. Apple Watch Blowout! - $129.99–$279.99$129.99$279.99 - true - 
2. 12-For-Tuesday: Fun Putty 1.8oz Tins

                                - 12 for $19 -  - true - 
Enter the number of the deal you'd like more info on or type list to see deals again or exit to exit 
program.

How do I remove the duplicated pricing from woot? How do I remove the awkward spacing from meh?

ruby web-scraping nokogiri
1 Answer

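There are two issues:

  1. #todays-deal span.price: three elements match this selector. Make it more specific with the child combinator (>):

    #todays-deal .price-holder > span.price

    This selects only the span.price that is a direct child of the price-holder div.

  2. The text contains newlines. Add gsub(/\s+/, ' ') after strip to collapse them into single spaces.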
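
To illustrate point 2, here is a minimal sketch of the whitespace cleanup in plain Ruby. The helper name normalize_whitespace and the sample string are made up for illustration; the string only imitates the kind of text that .text can return when the markup contains line breaks and indentation.

```ruby
# Collapse every run of whitespace (spaces, tabs, newlines) into one space.
# `normalize_whitespace` is a hypothetical helper name, not a Nokogiri method.
def normalize_whitespace(text)
  text.strip.gsub(/\s+/, " ")
end

# Hypothetical input resembling text pulled from nested markup.
raw = "  12-For-Tuesday: Fun Putty 1.8oz Tins\n\n                    12 for $19  "

# strip alone only trims the ends; the inner newlines and padding remain.
puts raw.strip.inspect

# strip followed by gsub leaves single spaces between words.
puts normalize_whitespace(raw).inspect
```

In the scraper this would amount to something like doc.search(...).text.strip.gsub(/\s+/, " ") on whichever field shows the stray spacing.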

Another note: #button.buy-button looks for an element whose ID is button, not for a button element. Change it to button.buy-button.
