Trouble assigning Nokogiri elements to hash keys


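I've been trying to learn Ruby for a few months now, and I'm practicing scraping with Nokogiri. I'm scraping Techcrunch.com and grabbing each article's title, URL, and preview text. So far I have: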

require 'nokogiri'
require 'open-uri'

class TestScraper::Scraper
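  # Fetch and parse the page once; instance methods read it from @doc.
  # URI.open is needed on Ruby 3.0+, where Kernel#open no longer opens URLs.
  def initialize(url)
    @doc = Nokogiri::HTML(URI.open(url))
  end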

  def scrape_tech_crunch
    articles = @doc.css("h2.post-block__title").css("a")
    top_stories = articles.each do |story|
      stories = {
        :title => story.children.text.strip,
        :url => story.attribute("href").value,
        :preview => @doc.css("div.post-block__content").children.first.text
      }
      TestScraper::Article.new(stories)
    end
  end
end

TestScraper::Article.new(stories) takes the hash as its argument and uses it to initialize the Article class like this:

class TestScraper::Article
  attr_accessor :title, :url, :preview 

  @@all = []

  def initialize(hash)
    hash.each do |k, v|
      self.send "#{k}=", v
    end
    @@all << self
  end

  def self.all
    @@all
  end
end

When I run TestScraper::Scraper.new("https://techcrunch.com").scrape_tech_crunch

I get:

[#<TestScraper::Article:0x00000000015f69e0
  @preview=
   "\n\t\tSecurity researchers have found dozens of Android apps in the Google Play store serving ads to unsuspecting victims as part of a money-making scheme. ESET researchers found 42 apps containing adware, \t",
  @title=
   "Millions downloaded dozens of Android apps on Google Play infected with adware",
  @url=
   "https://techcrunch.com/2019/10/24/millions-dozens-android-apps-adware/">,
 #<TestScraper::Article:0x00000000015f5658
  @preview=
   "\n\t\tSecurity researchers have found dozens of Android apps in the Google Play store serving ads to unsuspecting victims as part of a money-making scheme. ESET researchers found 42 apps containing adware, \t",
  @title="Netflix launches $4 mobile-only monthly plan in Malaysia",
  @url=
   "https://techcrunch.com/2019/10/24/netflix-malaysia-mobile-only-cheap-plan/">

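As you can see, it creates an object with the proper title and URL for each instance of the Article class, but it always assigns the same preview text to every article instance. There should be 20 articles, each with its own "preview", meaning the small sample of the article you get before clicking the link to read the full piece.

Sorry for the lengthy post. I'm new to this and can't seem to work this one out. Thanks in advance for your help.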

-ruby n00b

ruby iteration html-parsing nokogiri
1 Answer

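The problem you are having is caused by the fact that

@doc.css("div.post-block__content").children.first.text

selects the same node for every story, because you call it on the whole document, @doc, rather than on the current story.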

Instead, try to find the common parent node of each story and work your way down from there:

@doc.css('.post-block').map do |story|
  # navigate down from the selected node
  title   = story.at_css('h2.post-block__title a')
  preview = story.at_css('div.post-block__content')

  TestScraper::Article.new(
    title:   title.content.strip,
    url:     title['href'],   # key must match the :url accessor on Article
    preview: preview.content.strip
  )
end

If any of the methods used here raise questions, have a look at the Nokogiri cheat sheet. If you still have questions after that, feel free to ask in the comments.
