DOM traversal with Cheerio - how to get all elements and their corresponding text

Question

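I'm using Cheerio, a jQuery-like library for server-side Node that lets you parse HTML text and traverse it just as you would with jQuery. I need to get the plain text of the HTML body, but not only that: I also need the element each piece of text belongs to and that element's number. For example, if plain text is found in the third paragraph element, I'd get something like: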

{
    text: <element plaintext>,
    element: "p-3"
}

I currently have the following function that attempts to do this:

var plaintext_elements = traverse_tree($('body'));    

function traverse_tree(root, found_elements = {}, return_array = []) {
    if (root.children().length) {
        //root has children, call traverse_tree on that subtree
        traverse_tree(root.children().first(), found_elements, return_array);
    }
    root.nextAll().each(function(i, elem) {
        if ($(elem).children().length) {
            //if the element has children call traverse_tree on the element's first child
            traverse_tree($(elem).children().first(), found_elements, return_array)
        }
        else {
            if (!found_elements[$(elem)[0].name]) {
                found_elements[$(elem)[0].name] = 1;
            }
            else {
                found_elements[$(elem)[0].name]++
            }
            if ($(elem).text() && $(elem).text != '') {
                return_array.push({
                    text: $(elem).text(),
                    element: $(elem)[0].name + '-' + found_elements[$(elem)[0].name]
                })
            }
        }
    })


    if (root[0].name == 'body') {
        return return_array;
    }

}

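Am I on the right track? Should I try something else? Any help with this would be greatly appreciated. Again, this is not jQuery but server-side Cheerio (though the two are very similar).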

Answer 1

I think a lot of the traversal is unnecessary if you use CSS selectors:

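const cheerio = require('cheerio')

function textElements($){
  const found = {} // tag name -> number of text-bearing leaf elements seen so far
  // 'body *' selects every descendant of <body> in document order
  return $('body *').map(function(){
    // skip elements that have child elements or no text; returning undefined drops them
    if ( $(this).children().length || $(this).text() === '' ) return
    found[this.name] = found[this.name] ? 1 + found[this.name] : 1
    return {
      text: $(this).text(),
      element: `${this.name}-${found[this.name]}`,
    }
  }).get()
}

// html is the markup string to parse
textElements(cheerio.load(html))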

Answer 2

How about something like this:

const cheerio = require("cheerio"); // 1.0.0-rc.12

const html = `<!DOCTYPE html>
<html><body>
<div>
  <p>
    foo
    <b>bar</b>
  </p>
  <p>
    baz
    <b>quux</b>
    garply
  </p>
  corge
</div>
</body>
</html>`;

const $ = cheerio.load(html);
const indices = {};
const els = [...$("*")]
  .flatMap(e =>
    [...$(e).contents()].filter(
      e => e.type === "text" && $(e).text().trim()
    )
  )
  .map(e => {
    const text = $(e).text().trim();
    const {name: element} = $(e).parent()[0];
    indices[element] = indices[element] || 0;
    return {text, element, nth: indices[element]++};
  });
console.log(els);

Output:

[
  { text: 'corge', element: 'div', nth: 0 },
  { text: 'foo', element: 'p', nth: 0 },
  { text: 'bar', element: 'b', nth: 0 },
  { text: 'baz', element: 'p', nth: 1 },
  { text: 'garply', element: 'p', nth: 2 },
  { text: 'quux', element: 'b', nth: 1 }
]

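This uses .contents() and filters out any non-text nodes and whitespace-only text nodes. .parent() gives access to the tag that corresponds to each text node. An object tracks the index for each tag, which may be a bit hacky, but I'm not entirely sure what the expected output is for that part of the requirement.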
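
If the number is meant to identify the element rather than the text node (the question's "third paragraph element" gives "p-3"), one possible reading is to number each element once, when it is first visited, and let all of its text nodes share that label. The sketch below is only an illustration under that assumption. To stay dependency-free it walks a hand-built tree whose nodes carry the type, name, children and data properties that Cheerio's underlying nodes expose; el, txt and labelTextNodes are hypothetical helper names, and with Cheerio the root passed in would be something like $('body')[0].

```javascript
// Hypothetical helpers that build nodes shaped like the ones Cheerio exposes.
const el = (name, ...children) => ({ type: "tag", name, children });
const txt = (data) => ({ type: "text", data });

// Stand-in for the parsed <body>; mirrors the HTML sample used above.
const body = el("body",
  el("div",
    el("p", txt("\n    foo\n    "), el("b", txt("bar"))),
    el("p", txt("baz"), el("b", txt("quux")), txt("garply")),
    txt("corge")
  )
);

function labelTextNodes(root) {
  const counts = {};  // tag name -> number of elements of that tag visited so far
  const results = [];

  const visit = (node) => {
    // Number the element once, on entry, so every text node inside it gets the same label.
    counts[node.name] = (counts[node.name] || 0) + 1;
    const label = `${node.name}-${counts[node.name]}`;

    for (const child of node.children) {
      if (child.type === "text") {
        const text = child.data.trim();
        if (text) results.push({ text, element: label }); // skip whitespace-only text
      } else if (child.type === "tag") {
        visit(child); // depth-first, so results follow document order
      }
    }
  };

  visit(root);
  return results;
}

console.log(labelTextNodes(body));
```

With this numbering, "garply" is labelled p-2 because it sits in the second p element, and the indices start at 1 as in the question's example.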
