How to filter the HTML nodes that contain text from an HTML page

Question

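I am new to web scraping and have run into a problem.

I am using BeautifulSoup to scrape a web page, and I want to get the nodes that contain text.

I tried using the get_text() method like this: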

  from bs4 import BeautifulSoup, NavigableString

  with open('FAQ3.html') as f:
      soup = BeautifulSoup(f, "html.parser")
  body = soup.find('body')
  for i in body:
      # skip bare strings and comments (Comment subclasses NavigableString)
      if not isinstance(i, NavigableString):
          if i.get_text():
              print(i)

But get_text() also matches a node when the text belongs only to its children.

Sample HTML:

<div>
  <div id="header">
        <script src="./FAQ3_files/header-home.js"></script>
  </div>
  <div>
   <div>
      this node contain text
    </div>
 </div>
</div>

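Checking the topmost div on its own returns the whole node, because the innermost text is nested inside it.

How can I iterate over all nodes and keep only the ones that actually contain text themselves?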

Answer 1

I used a depth-first search, and it solved my use case:

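def get_text_bs4(soup, leaf):
    # Depth-first walk; leaf maps each text to the first tag found holding it.
    if soup.name is not None:  # strings and comments have no name, so only tags pass
        # .string is None when a tag has more than one child
        if soup.string is not None and soup.name != 'script':
            if soup.text not in leaf:
                leaf[soup.text] = soup
        for child in soup.children:
            get_text_bs4(child, leaf)
    return leaf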
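
For comparison, here is a minimal sketch of a different approach, not the one used in the answer above: instead of relying on .string, it checks whether a tag has a non-blank string among its direct children. The names nodes_with_own_text and SKIPPED_TAGS are made up for illustration, and skipping script and style tags is an assumption about what should count as page text.

```python
from bs4 import BeautifulSoup, Comment, NavigableString

# Assumption: the contents of these tags are not treated as page text.
SKIPPED_TAGS = {"script", "style"}


def nodes_with_own_text(soup):
    """Return the tags that have a non-blank text node as a direct child."""
    matches = []
    for tag in soup.find_all(True):  # True matches every tag in the tree
        if tag.name in SKIPPED_TAGS:
            continue
        for child in tag.children:
            # Comment subclasses NavigableString, so exclude it explicitly.
            if (isinstance(child, NavigableString)
                    and not isinstance(child, Comment)
                    and child.strip()):
                matches.append(tag)
                break
    return matches


html = """
<div>
  <div id="header">
    <script src="./FAQ3_files/header-home.js"></script>
  </div>
  <div>
    <div>
      this node contain text
    </div>
  </div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
for tag in nodes_with_own_text(soup):
    print(tag.name, repr(tag.get_text(strip=True)))
```

With the sample HTML from the question, only the innermost div should be reported, since the outer divs hold nothing but whitespace and other tags as direct children. Unlike the .string check, this should also pick up tags that mix text with child tags.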
