How to access text nodes when scraping a website with Nokogiri

Question (votes: 0, answers: 2)

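I'm scraping data from two sites. The first scrape is otherwise fine, but it returns the price more than once. The second site returns the correct data, but with a spacing problem that I'm not sure how to fix.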

class DailyDealz::Deal
attr_accessor :name, :price, :availability, :url

def self.today
 # Scrape woot and meh and then return deals based on that data
 self.scrape_deals
end

def self.scrape_deals
    deals = []

    deals << self.scrape_woot
    deals << self.scrape_meh
    # deals << self.scrape_steepandcheap

    deals
end

def self.scrape_woot
    doc = Nokogiri::HTML(open("https://www.woot.com/"))

    deal = self.new
    deal.name = doc.search("h2.main-title").text.strip
    deal.price = doc.search("#todays-deal span.price").text.strip
    deal.url = doc.search("a.wantone").first.attr("href").strip
    deal.availability = true
    deal.website 

    deal
end

def self.scrape_meh
    doc = Nokogiri::HTML(open("https://meh.com/"))

    deal = self.new
    deal.name = doc.search("section.features h2").text.strip
    deal.price = doc.search("#button.buy-button").text.gsub("Buy it.", "").strip
    deal.url = "https://meh.com/"
    deal.availability = true

    deal
end

This returns:

// ♥  ./bin/daily-dealz
Todays Daily Deals
1. Apple Watch Blowout! - $129.99–$279.99$129.99$279.99 - true - 
2. 12-For-Tuesday: Fun Putty 1.8oz Tins

                                - 12 for $19 -  - true - 
Enter the number of the deal you'd like more info on or type list to see deals again or exit to exit 
program.

How can I remove the duplicated pricing and the stray whitespace?

ruby web-scraping nokogiri
2 Answers

1 vote

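There are two problems:

  1. #todays-deal span.price: three elements match this selector. Make it more specific by changing it to:

    #todays-deal .price-holder > span.price

    This selects only the span.price that is a direct child of the price-holder div.

  2. The text contains newlines. Add gsub(/\s+/, ' ') after strip.

  3. See this example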

One more note: #button.buy-button looks for an element whose id is "button", not for an element of type button. Change it to button.buy-button.


0 votes

Don't use Kernel's open as overridden by open-uri; that usage is deprecated:
