R: Alternative to read_html() + html_text() that also works on strings without HTML/XML tags


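In this solution for removing HTML tags from a string, the string is passed to rvest::read_html() to create an html_document object, and that object is then passed to rvest::html_text() to return the text without HTML.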

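However, if the string contains no HTML tags, read_html() throws an error because the string is treated as a file/connection path, as shown below. This is a problem when stripping HTML from many strings, some of which may not contain any tags.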

library(rvest)

# Example data
dat <- c(
  "<B>Positives:</B> Rangy, athletic build with room for additional growth. ...",
  "Positives: Better football player than his measureables would indicate. ..."
)


# Success: produces html_document object
rvest::read_html(dat[1])
#> {html_document}
#> <html>
#> [1] <body>\n<b>Positives:</b> Rangy, athletic build with room for additional  ...


# Error
rvest::read_html(dat[2])
#> Error in `path_to_connection()`:
#> ! 'Positives: Better football player than his measureables would
#>   indicate. ...' does not exist in current working directory
#>   ('C:/LONG_PATH_HERE').

Is there a quick way to make read_html() treat every string as XML even when it contains no tags, or to strip the HTML with the same result as read_html() |> html_text()?

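One idea is to simply append a dummy HTML tag to the end of each string. However, I imagine there is a more efficient approach: either one that returns the string without any computation when it contains no HTML, or one that uses a function argument to do this. Other alternatives include removing the tags with a regular expression, although that violates the "don't use regex on HTML" principle.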
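The "return the string unchanged when it has no HTML" idea could be sketched roughly as follows. The helper name strip_html_if_tagged is hypothetical and not part of rvest. It assumes that any string containing a literal "<" is handed to read_html() and that everything else passes through untouched, which also means entities such as &amp; in tag-free strings are not decoded.

```r
library(rvest)

# Hypothetical helper (not from the original post or from rvest):
# parse only the strings that contain a "<"; return the rest unchanged.
strip_html_if_tagged <- function(x) {
  # Fixed-string search, used only to detect a possible tag, not to parse HTML
  has_tag <- !is.na(x) & grepl("<", x, fixed = TRUE)
  x[has_tag] <- vapply(
    x[has_tag],
    function(s) html_text(read_html(s)),
    character(1),
    USE.NAMES = FALSE
  )
  x
}

dat <- c(
  "<B>Positives:</B> Rangy, athletic build with room for additional growth. ...",
  "Positives: Better football player than his measureables would indicate. ..."
)
strip_html_if_tagged(dat)
```

This skips the parser entirely for tag-free strings, at the cost of treating the two kinds of input slightly differently.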

html r xml rvest data-wrangling
1 answer

You can try this:

### Packages
library(rvest)
library(purrr)

### Data
dat <- c(
  "<B>Positives:</B> Rangy, athletic build with room for additional growth. ...",
  "Positives: Better football player than his measureables would indicate. ..."
)

### Convert each string to raw bytes, parse it with read_html(), then extract the text
clean <- function(x) {
  read_html(charToRaw(x)) %>% html_text()
}

### Map the function over the character vector
map_chr(dat, clean, .progress = TRUE)

Output:

[1] "Positives: Rangy, athletic build with room for additional growth. ..."      
[2] "Positives: Better football player than his measureables would indicate. ..."
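Passing raw bytes appears to work because read_html() then parses the input as content instead of trying to resolve it as a path. A slightly more defensive variant of the same idea is sketched below. The name clean_safe is hypothetical, and the guards for NA and blank strings are assumptions added here because the behaviour of the parser on such input was not checked in the original answer.

```r
library(rvest)

# Hypothetical wrapper around the answer's approach (not part of rvest).
clean_safe <- function(x) {
  # Assumption: NA and blank strings are returned as-is rather than parsed
  if (is.na(x) || !nzchar(trimws(x))) {
    return(x)
  }
  # Convert to UTF-8 bytes so read_html() parses the content directly
  html_text(read_html(charToRaw(enc2utf8(x))))
}

dat <- c(
  "<B>Positives:</B> Rangy, athletic build with room for additional growth. ...",
  "Positives: Better football player than his measureables would indicate. ...",
  NA,
  ""
)
vapply(dat, clean_safe, character(1), USE.NAMES = FALSE)
```

Unlike a version that skips tag-free strings, this sends every non-blank string through the parser, so HTML entities are handled the same way whether or not tags are present.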