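I'm trying to learn how to scrape data from a website.
This is what I put together after a few days of research, but Nokogiri's output isn't as "clean" as I expected. When I print the array, the output contains many newline characters ("\n").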
require 'httparty'
require 'nokogiri'
require 'open-uri'
require 'pry'
require 'csv'
# Assigning the page to scrape
page = HTTParty.get('http://www.realtor.com/realestateandhomes-search/Atlanta_GA/type-single-family-home/price-na-500000')
# Transform the http response into a Nokogiri in order to parse it
parse_page = Nokogiri::HTML(page)
# Create an empty array for property details
details_array = []
parse_page.css('div.srp-item-body').map do |d|
property_details = d.text
details_array.push(property_details)
end
Pry.start(binding)
In Pry, if I display details_array or address_array, the output looks like this:
[2] pry(main)> details_array
=> ["\n \n \n \n 2265 Tanglewood Cir NE,\n Atlanta,\n GA\n 30345\n \n \n\n \n Dresden East\n \n \n\n $289,900\n \n \n \n 3 bd\n 2 ba\n 1,566 sq ft\n
0.3 acres lot\n \n \n \n \n Single Family Home\n \n \n \n \n
Brokered by Re/Max Town And Country\n \n \n
\n \n \n Brokered by \n Re/Max
Town And Country\n \n \n \n ", "\n \n
\n \n 2141 Dunwoody Gln,\n
Atlanta,\n GA\n 30338\n \n \n\n
\n \n $469,900\n \n \n
\n 4 bd\n 3 ba\n 2,850 sq
ft\n 0.3 acres lot\n 2 car\n
\n \n \n \n Single Family Home\n
\n \n \n \n Brokered by
Buckhead Home Realty Llc\n \n \n \n
\n \n Brokered by \n Buckhead Home
Realty Llc\n \n \n \n ", "\n \n
\n \n 1048 Martin St SE,\n
Atlanta,\n GA\n 30315\n \n \n\n
\n Intown South\n Peoplestown\n \n \n
\n $164,900\n \n \n \n
5 bd\n 3 ba\n 2,376 sq ft\n
7,405 sq ft lot\n \n \n \n \n
Single Family Home\n \n \n \n \n
Brokered by Greenlet Llc\n \n \n \n
\n \n Brokered by \n Greenlet Llc\n
\n \n \n ", "\n \n \n \n
1048 Martin St SE,\n Atlanta,\n GA\n
30315\n \n \n\n \n Intown South\n
Peoplestown\n \n \n \n $164,900\n
\n \n \n 5 bd\n 3
ba\n 2,055 sq ft\n 7,584 sq ft lot\n
\n \n \n \n Single Family Home\n
\n \n \n \n Brokered by
Greenlet, Llc\n \n \n \n \n
\n Brokered by \n Greenlet, Llc\n \n
\n \n ", "\n \n \n \n
1991 Woodbine Ter NE,\n Atlanta,\n GA\n
30329\n \n \n\n \n Sagamore Hills\n
\n \n \n $299,900\n \n \n
\n 3 bd\n 1+ ba\n 1,449
sq ft\n 0.8 acres lot\n \n \n
\n \n Single Family Home\n \n \n
\n :
It looks like you aren't using your selectors to drill down into the document. Consider this:
require 'nokogiri'
doc = Nokogiri::HTML(<<EOT)
<html>
<body>
<div>
<p>foo</p>
<p>bar</p>
</div>
</body>
</html>
EOT
doc.search('div').map(&:text) # => ["\n foo\n bar\n "]
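When you ask for the text of a parent tag, you get the whitespace text nodes that were used to format the HTML, along with the text of the <p> nodes you actually want.
If you drill down to the actual nodes you want and then get their text, the formatting between the tags disappears: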
doc.search('div p').map(&:text) # => ["foo", "bar"]
See also "How to avoid joining all text from Nodes when scraping".
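1. You get these extra newlines because Nokogiri's text method returns the full text content of a node, including all the whitespace between its child elements.
2. You can use the strip method to remove the leading and trailing whitespace from each string. Note that strip does not touch whitespace in the middle of the string.
3. You can use the modified code below for cleaner output: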
require 'httparty'
require 'nokogiri'
require 'open-uri'
require 'pry'
require 'csv'
# Assigning the page to scrape
page = HTTParty.get('http://www.realtor.com/realestateandhomes-search/Atlanta_GA/type-single-family-home/price-na-500000')
# Transform the http response into a Nokogiri in order to parse it
parse_page = Nokogiri::HTML(page)
# Create an empty array for property details
details_array = []
parse_page.css('div.srp-item-body').map do |d|
# strip removes leading and trailing whitespace, including \n, \r, and \r\n
# chomp removes one trailing line ending; after strip it is a harmless no-op
property_details = d.text.strip.chomp
details_array.push(property_details)
end
Pry.start(binding)
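Because strip only trims the two ends of a string, the newlines between the fields of a listing would still be there. One possible way to deal with those, sketched here in plain Ruby, is to collapse every run of whitespace into a single space. The helper name `squish` is a local, hypothetical function defined for this example, and the sample string is a shortened version of the output shown in the question.

```ruby
# Collapse every run of whitespace (spaces, tabs, newlines) into one space.
# String#split with no arguments splits on whitespace and ignores
# leading and trailing whitespace.
def squish(text)
  text.split.join(' ')
end

raw = "\n \n 2265 Tanglewood Cir NE,\n Atlanta,\n GA\n 30345\n \n $289,900\n "

stripped = raw.strip   # only the ends are trimmed
cleaned  = squish(raw) # interior whitespace is collapsed too

p stripped
p cleaned
```

In the scraper this would be used as `squish(d.text)` in place of `d.text.strip.chomp`. It flattens each listing into one line, so it suits display better than field-by-field extraction.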