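When using rvest to inspect the tree structure of some HTML scraped from https://scrapeme.live/, I noticed that all the newlines and whitespace end up as text nodes, which I find a bit annoying when querying the DOM. Is there a way to ignore these (rather than cleaning the data afterwards)? In this example, I only want to capture the "actual" text.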
library(rvest)
library(xml2)
html <- read_html("https://scrapeme.live/")
html |>
  html_element("div.page-content") |>
  html_structure(indent = 4)
#> <div.page-content>
#> {text}
#> <p>
#> {text}
#> {text}
#> <form.search-form [role, method, action]>
#> {text}
#> <label [for]>
#> {text}
#> <span.screen-reader-text>
#> {text}
#> {text}
#> {text}
#> <input#search-form-65d0cce0a5dd2 .search-field [type, placeholder, value, name]>
#> <button.search-submit [type]>
#> <svg.icon.icon-search [aria-hidden, role]>
#> <use [href, xlink:href]>
#> <span.screen-reader-text>
#> {text}
#> {text}
#> {text}
html |>
  html_element("div.page-content") |>
  html_elements(xpath = ".//text()")
#> {xml_nodeset (11)}
#> [1] \n\t\t\n\t\t\t
#> [2] It seems we can’t find what you’re looking for. Perhaps searching can help.
#> [3] \n\t\t\t\n\n
#> [4] \n\t
#> [5] \n\t\t
#> [6] Search for:
#> [7] \n\t
#> [8] \n\t
#> [9] Search
#> [10] \n
#> [11] \n\t
Created on 2024-02-17 with reprex v2.1.0
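Are you looking for html_text2()? It collapses whitespace roughly the way a browser would when rendering the page, so the indentation-only text does not show up in the result. Note that it returns the text of the element as a single string rather than a set of nodes.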
library(rvest)
"https://scrapeme.live" %>%
  read_html() %>%
  html_element("div.page-content") %>%
  html_text2()
[1] "It seems we can’t find what you’re looking for. Perhaps searching can help.\n\nSearch for: Search"
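If you need the individual text nodes rather than one combined string, one possible alternative is to filter out whitespace-only nodes in the XPath itself with the predicate `[normalize-space()]`. The sketch below is not from the original answer; it uses a small inline snippet (a hypothetical stand-in for the scraped page) so it runs without network access, and the name `doc` and the snippet content are assumptions for illustration.

```r
library(rvest)

# Hypothetical inline HTML standing in for the scraped page;
# the indentation produces whitespace-only text nodes, as in the question.
doc <- minimal_html('
  <div class="page-content">
    <p>Not found.</p>
    <label>
      <span>Search for:</span>
    </label>
  </div>')

texts <- doc |>
  html_element("div.page-content") |>
  # normalize-space() is an empty string (false) for whitespace-only nodes,
  # so the predicate keeps only text nodes with visible content
  html_elements(xpath = ".//text()[normalize-space()]") |>
  # trim = TRUE strips leading/trailing whitespace from each kept node
  html_text(trim = TRUE)

texts
```

The same predicate should work on the real page by replacing `doc` with the result of `read_html("https://scrapeme.live/")`. The whitespace nodes are still present in the parsed tree; they are only skipped at query time.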