How to parse the text inside a span tag using Nokogiri

Question (votes: 1, answers: 2)

I want to build an app that shows the artists playing at popular venues, and I only want to extract the artists' names.

Here is my code:

data.css('.headliner').each do |artist|
  puts artist
end

It currently returns:

<span class="headliner"><span class="prepend"><i>Rescheduled Date</i></span><br>London Grammar</span>
<span class="headliner">Hozier</span>
<span class="headliner"><span class="prepend"><i>KFOG presents</i></span><br>Ben Howard<br><span class="append"><i>with special guest</i><br></span></span>
<span class="headliner">Dr. Dog</span>

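Some elements contain several nested span tags, and I can't get at the data I need. All I want returned is the artist names, e.g. "London Grammar", "Hozier", "Ben Howard" and "Dr. Dog".

Currently, when I run artist.text, it returns "Rescheduled DateLondon Grammar" and so on.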


<table class="concert_calendar" cellspacing="0" width="720" style="margin-top:35px;">
    <tbody><tr><td class="noborder"><img src="images/title_date2.gif" alt="Date"></td>
    	<td class="noborder" colspan="2"><img src="images/title_show2.gif" alt="Show"></td>
        <td class="noborder"><img src="images/title_time2.gif" alt="Time"></td>
        <td class="noborder"><img src="images/title_tickets2.gif" alt="Tickets"></td></tr>
    <tr><td colspan="5" class="noborder"><hr size="1" color="#550818" noshade="" style="margin:0px; padding:0px;"></td></tr>
		<tr><td style="width:100px;" class="">Saturday,<br>February 7</td>
    	<td style="width:115px;" valign="top" class=""><a href="popartist.php?cID=4600&amp;KeepThis=true&amp;TB_iframe=true&amp;height=600&amp;width=475" class="con_img thickbox"><img src="http://www.apeconcerts.com/concertimages/LondonGrammar_100.jpg" alt="London Grammar"></a></td>
        <td valign="top" style="width:345px; padding-right:10px;" class="">
        	<a href="popartist.php?cID=4600&amp;KeepThis=true&amp;TB_iframe=true&amp;height=600&amp;width=475" style="text-decoration:none;" class="thickbox">
            	<span class="headliner"><span class="prepend"><i>Rescheduled Date</i></span><br>London Grammar</span></a>
        	<div><span class="warmup">Until The Ribbon Breaks</span><br>
            <span class="warmup"></span></div></td>
        <td style="width:80px;">show<br>8:00PM</td>
        <td style="width:80px;">
        <img src="images/cal_soldout.gif" alt="SOLD OUT - Thank you!">        </td></tr>
		<tr><td style="width:100px;">Tuesday,<br>February 10</td>
    	<td style="width:115px;" valign="top"><a href="popartist.php?cID=4733&amp;KeepThis=true&amp;TB_iframe=true&amp;height=600&amp;width=475" class="con_img thickbox"><img src="http://www.apeconcerts.com/concertimages/Hozier_1001.jpg" alt="Hozier"></a></td>
        <td valign="top" style="width:345px; padding-right:10px;" class="">
        	<a href="popartist.php?cID=4733&amp;KeepThis=true&amp;TB_iframe=true&amp;height=600&amp;width=475" style="text-decoration:none;" class="thickbox">
            	<span class="headliner">Hozier</span></a>
        	<div class=""><span class="warmup">Ásgeir</span><br>
            <span class="warmup"></span></div></td>
        <td style="width:80px;">show<br>8:00PM</td>
        <td style="width:80px;">
        <img src="images/cal_soldout.gif" alt="SOLD OUT - Thank you!">        </td></tr>
ruby nokogiri
2 Answers
2
votes

"I only want to return the artist names, e.g. 'London Grammar', 'Hozier', 'Ben Howard' and 'Dr. Dog'"

Here is one way:

require 'nokogiri'

html = %q{
<span class="headliner"><span class="prepend"><i>Rescheduled Date</i></span><br>London Grammar</span>
<span class="headliner">Hozier</span>
<span class="headliner"><span class="prepend"><i>KFOG presents</i></span><br>Ben Howard<br><span class="append"><i>with special guest</i><br></span></span>
<span class="headliner">Dr. Dog</span>
}

html_doc = Nokogiri::HTML(html)
headliners = html_doc.css('.headliner')

headliners.each do |headliner|
  headliner.css('i').each do |i|
    i.content = ''
  end

  puts headliner.text
end

--output:--
London Grammar
Hozier
Ben Howard
Dr. Dog

-1
votes

If all you want to do is get rid of the contents of the <i> tags, simply remove the tags entirely:

require 'nokogiri'

doc = Nokogiri::HTML(<<EOT)
<span class="headliner"><span class="prepend"><i>Rescheduled Date</i></span><br>London Grammar</span>
<span class="headliner">Hozier</span>
<span class="headliner"><span class="prepend"><i>KFOG presents</i></span><br>Ben Howard<br><span class="append"><i>with special guest</i><br></span></span>
<span class="headliner">Dr. Dog</span>
EOT

doc.search('.headliner i').each(&:remove)
puts doc.to_html
# >> <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
# >> <html><body>
# >> <span class="headliner"><span class="prepend"></span><br>London Grammar</span>
# >> <span class="headliner">Hozier</span>
# >> <span class="headliner"><span class="prepend"></span><br>Ben Howard<br><span class="append"><br></span></span>
# >> <span class="headliner">Dr. Dog</span>
# >> </body></html>

At that point it's easy to iterate over the .headliner tags and output their content:

puts doc.search('.headliner').map(&:text)

# >> London Grammar
# >> Hozier
# >> Ben Howard
# >> Dr. Dog

For a large page containing many tags that match .headliner I might do it differently, but this is good enough for an ordinary page.

See also "How to avoid joining all text from Nodes when scraping".
