I'm scraping a bunch of tables with HTTParty and parsing the responses with Nokogiri. Everything works, but then a phantom row shows up at the top:
```ruby
require 'nokogiri'
require 'httparty'
require 'byebug'

def scraper
  url = "https://github.com/public-apis/public-apis"
  parsed_page = Nokogiri::HTML(HTTParty.get(url))

  # Get categories from the ul at the top
  categories = parsed_page.xpath('/html/body/div[4]/div/main/div[2]/div/div/div/article/ul/li/a')

  # Get all tables from the page
  tables = parsed_page.xpath('/html/body/div[4]/div/main/div[2]/div/div/div/article/table')

  rows = []

  # Acting on one first for testing before making it dynamic
  tables[0].search('tr').each do |tr|
    cells = tr.search('td')
    link = ''
    values = []
    row = {
      'name' => '',
      'description' => '',
      'auth' => '',
      'https' => '',
      'cors' => '',
      'category' => '',
      'url' => ''
    }
    cells.css('a').each do |a|
      link += a['href']
    end
    cells.each do |cell|
      values << cell.text
    end
    values << categories[0].text
    values << link
    rows << row.keys.zip(values).to_h
  end
  puts rows
end

scraper
```
Console output:

```
{"name"=>"Animals", "description"=>"", "auth"=>nil, "https"=>nil, "cors"=>nil, "category"=>nil, "url"=>nil}
{"name"=>"Cat Facts", "description"=>"Daily cat facts", "auth"=>"No", "https"=>"Yes",
...
```

Where is that first row coming from?
The first row you're seeing is most likely the header row. Header rows use `<th>` instead of `<td>`, which means `cells = tr.search('td')` will be an empty collection for the header row.
In most cases the header row is placed inside a `<thead>` and the data rows inside a `<tbody>`. So instead of `tables[0].search('tr')` you can do `tables[0].search('tbody tr')`, which selects only the rows inside the `<tbody>` tag.
Your code could be simpler and more flexible. Meditate on this:
```ruby
require 'nokogiri'
require 'httparty'

URL = 'https://github.com/public-apis/public-apis'
FIELDS = %w[name description auth https cors category url]

doc = Nokogiri::HTML(HTTParty.get(URL))
category = doc.at('article li a').text

rows = doc.at('article table').search('tr')[1..-1].map { |tr|
  values = tr.search('td').map(&:text)
  link = tr.at('a')['href']
  Hash[
    FIELDS.zip(values + [category, link])
  ]
}
```
Which results in:

```ruby
puts rows
# >> {"name"=>"Cat Facts", "description"=>"Daily cat facts", "auth"=>"No", "https"=>"Yes", "cors"=>"No", "category"=>"Animals", "url"=>"https://alexwohlbruck.github.io/cat-facts/"}
# >> {"name"=>"Cats", "description"=>"Pictures of cats from Tumblr", "auth"=>"apiKey", "https"=>"Yes", "cors"=>"Unknown", "category"=>"Animals", "url"=>"https://docs.thecatapi.com/"}
# >> {"name"=>"Dogs", "description"=>"Based on the Stanford Dogs Dataset", "auth"=>"No", "https"=>"Yes", "cors"=>"Yes", "category"=>"Animals", "url"=>"https://dog.ceo/dog-api/"}
# >> {"name"=>"HTTPCat", "description"=>"Cat for every HTTP Status", "auth"=>"No", "https"=>"Yes", "cors"=>"Unknown", "category"=>"Animals", "url"=>"https://http.cat/"}
# >> {"name"=>"IUCN", "description"=>"IUCN Red List of Threatened Species", "auth"=>"apiKey", "https"=>"No", "cors"=>"Unknown", "category"=>"Animals", "url"=>"http://apiv3.iucnredlist.org/api/v3/docs"}
# >> {"name"=>"Movebank", "description"=>"Movement and Migration data of animals", "auth"=>"No", "https"=>"Yes", "cors"=>"Unknown", "category"=>"Animals", "url"=>"https://github.com/movebank/movebank-api-doc"}
# >> {"name"=>"Petfinder", "description"=>"Adoption", "auth"=>"OAuth", "https"=>"Yes", "cors"=>"Yes", "category"=>"Animals", "url"=>"https://www.petfinder.com/developers/v2/docs/"}
# >> {"name"=>"PlaceGOAT", "description"=>"Placeholder goat images", "auth"=>"No", "https"=>"Yes", "cors"=>"Unknown", "category"=>"Animals", "url"=>"https://placegoat.com/"}
# >> {"name"=>"RandomCat", "description"=>"Random pictures of cats", "auth"=>"No", "https"=>"Yes", "cors"=>"Yes", "category"=>"Animals", "url"=>"https://aws.random.cat/meow"}
# >> {"name"=>"RandomDog", "description"=>"Random pictures of dogs", "auth"=>"No", "https"=>"Yes", "cors"=>"Yes", "category"=>"Animals", "url"=>"https://random.dog/woof.json"}
# >> {"name"=>"RandomFox", "description"=>"Random pictures of foxes", "auth"=>"No", "https"=>"Yes", "cors"=>"No", "category"=>"Animals", "url"=>"https://randomfox.ca/floof/"}
# >> {"name"=>"RescueGroups", "description"=>"Adoption", "auth"=>"No", "https"=>"Yes", "cors"=>"Unknown", "category"=>"Animals", "url"=>"https://userguide.rescuegroups.org/display/APIDG/API+Developers+Guide+Home"}
# >> {"name"=>"Shibe.Online", "description"=>"Random pictures of Shibu Inu, cats or birds", "auth"=>"No", "https"=>"Yes", "cors"=>"Yes", "category"=>"Animals", "url"=>"http://shibe.online/"}
```
The problems with your code:

Using `search('some selector')[0]` is the same as using `at('some selector')`, only the second is cleaner, which reduces visual noise.
Relying on absolute XPath selectors: absolute selectors are extremely fragile. Any change made to the HTML is very likely to break them. Instead, find useful nodes, check whether they are unique, and let the parser locate them for you.
The CSS selector `'article li a'` skips over every node until it finds the `article` node, looks inside it for `li` children, and follows them to their `a` tags.
Similarly, `at('article table')` finds the first table under the `article` node, and then `search('tr')` finds only the rows embedded in that table.
Because we want to skip the table header, `[1..-1]` slices the NodeSet and skips the first row.
`map` makes it easier to build the structure:

```ruby
rows = doc.at('article table').search('tr')[1..-1].map { |tr|
```

assigning the fields to `rows` in a single pass through the loop over the rows.
`values` is assigned the text of each `td` node in the row.
You can easily build a hash by using Hash's `[]` constructor and passing it an array of key/value pairs.
`FIELDS.zip(values + [category, link])` takes the values from the cells and appends a second array containing the category and the link for that row.
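A pure-Ruby sketch of that `zip` + `Hash[]` step, with hypothetical sample values standing in for the scraped cells:

```ruby
FIELDS = %w[name description category url]

values = ['Cat Facts', 'Daily cat facts']  # what the <td> cells would yield
category = 'Animals'
link = 'https://example.com'               # hypothetical link

pairs = FIELDS.zip(values + [category, link])
# => [["name", "Cat Facts"], ["description", "Daily cat facts"],
#     ["category", "Animals"], ["url", "https://example.com"]]

record = Hash[pairs]
# => {"name"=>"Cat Facts", "description"=>"Daily cat facts",
#     "category"=>"Animals", "url"=>"https://example.com"}
```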
My example code is basically the same template I use every time I scrape a page with tables. There are minor differences, but it's a loop over the tables, extracting the cells and converting them into hashes. On a cleanly written table it's even possible to grab the hash keys automatically from the cell text of the table's first row.