How to search and replace only some text, leaving the text inside certain elements untouched

Question (votes: 2, answers: 1)

I have this HTML snippet:

<p>Yes.  No.  Both. Maybe a <a href="/plane">plane</a>?</p><h2 id="2-is-it-a-plane">2. Is it a plane?</h2><p>Yes.  No.  Both.</p><h2 id="3-what-is-superman-anyway">3. What is Superman, anyway?</h2><p>Is it a bird?  Is it a plane?  No, it’s Superman.</p>

I need to replace plane with

<a href="/some/url">plane</a>

but only when it appears outside an <a></a> anchor tag and outside a heading tag (<h1> through <h6>).

Here is what I tried:

require 'nokogiri'
h = '<p>Yes.  No.  Both. Maybe a <a href="/plane">plane</a>?</p><h2 id="2-is-it-a-plane">2. Is it a plane?</h2><p>Yes.  No.  Both.</p><h2 id="3-what-is-superman-anyway">3. What is Superman, anyway?</h2><p>Is it a bird?  Is it a plane?  No, it’s Superman.</p>'
doc = Nokogiri::HTML::DocumentFragment.parse(h)

# Try 1: This outputs all content, but I need to avoid <a>/<h#>
doc.content 

# Try 2: The below line removes headings permanently - I need them to remain 
# doc.search(".//h2").remove

# Try 3: This just comes out empty - why?
# doc.xpath('text()')    
# doc.xpath('//text()')

# then,
# code to replace `plane` is here ...
# this part is not needed
# then,
doc.to_html

I have tried various other XPath variations, but to no avail. What am I doing wrong?

Tags: ruby, nokogiri
1 Answer
0 votes

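After some testing, it looks like you need the XPath selector p/text(). Things then get more complicated, because you are trying to replace plain text with a link element.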

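When I simply tried gsub, Nokogiri escaped the new link, so I needed to split the text node into multiple sibling nodes. That way, the siblings that match can be added as link elements instead of text nodes.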

doc.xpath('p/text()').grep(/plane/) do |node|
  node_content, *remaining_texts = node.content.split(/(plane)/)

  node.content = node_content
  remaining_texts.each do |text|
    if text == 'plane' 
      node = node.add_next_sibling('<a href="/some/url">plane</a>').last
    else
      node = node.add_next_sibling(text).last
    end
  end
end

puts doc
# <p>Yes.  No.  Both. Maybe a <a href="/plane">plane</a>?</p>
# <h2 id="2-is-it-a-plane">2. Is it a plane?</h2>
# <p>Yes.  No.  Both.</p>
# <h2 id="3-what-is-superman-anyway">3. What is Superman, anyway?</h2>
# <p>Is it a bird?  Is it a <a href="/some/url">plane</a>?  No, it’s Superman.</p>
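
The snippet above depends on a detail of String#split: when the pattern contains a capture group, the matched text is kept in the result instead of being dropped. A small plain-Ruby sketch of that behaviour (the sample string is shortened from the question, and the variable names are only illustrative):

```ruby
text = 'Is it a plane?  No.'

# With a capture group, the delimiter itself is kept in the result.
with_group = text.split(/(plane)/)

# Without the group, the delimiter is discarded.
without_group = text.split(/plane/)

# The first piece stays in the original text node; the rest become new siblings.
first, *rest = with_group

p with_group    # ["Is it a ", "plane", "?  No."]
p without_group # ["Is it a ", "?  No."]
p first         # "Is it a "
p rest          # ["plane", "?  No."]
```

Because the pieces alternate between plain text and the matched word, the loop can decide for each piece whether to add a text node or a link.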

A generic XPath selector for all elements except headings and links might be:

*[not(name()='a')][not(name()='h1')][not(name()='h2')][not(name()='h3')][not(name()='h4')][not(name()='h5')][not(name()='h6')]/text()

You may need to make some adjustments, since I am not an XML or Nokogiri expert, but it seems to work at least for the example provided, so it should help you.
