How to extract the values of apparently non-standard HTML tags in R

html r web-scraping rvest
1 answer

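Going by your example code, and assuming you only want to extract the values at the end of each <p>, one workaround is to use the xpath argument to exclude everything that sits directly inside the <svg> tags, and then purrr::discard all empty strings: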

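library(rvest)
library(purrr)

html |> 
  read_html() |>                       # parse the HTML string defined below
  html_elements("p") |>
  # text nodes whose parent element is anything other than <svg>
  html_elements(xpath = '//*[not(name()="svg")]/text()') |> 
  html_text(trim = TRUE) |> 
  purrr::discard(\(x) x == "")         # drop whitespace-only text nodes
#> [1] "94 - 100 m²" "3"           "3"           "2"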
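
As a possible alternative (not part of the original answer), the text nodes that are direct children of each <p> can be selected with the relative XPath ./text(), which never descends into <span> or <svg>. The sketch below assumes a shortened, hypothetical stand-in for the OP's markup (sample_html) and labels each value with its itemprop attribute; both are illustrative choices.

```r
library(rvest)

# Hypothetical, shortened stand-in for the OP's markup.
sample_html <- '<section>
<p itemprop="floorSize"><span><svg>...</svg></span> 94 - 100 m² </p>
<p itemprop="numberOfRooms"><span><svg>...</svg></span> 3 </p>
</section>'

amenities <- read_html(sample_html) |>
  html_elements("p")

# For each <p>, keep only the text nodes that are its direct children,
# trim them, drop empty ones, and join whatever is left.
values <- vapply(amenities, function(p) {
  txt <- html_text(html_elements(p, xpath = "./text()"), trim = TRUE)
  paste(txt[txt != ""], collapse = " ")
}, character(1))

# Label each value with the itemprop of its <p> (NA where the attribute is missing).
names(values) <- html_attr(amenities, "itemprop")

print(values)
```

Working one <p> at a time keeps each value aligned with its own element, which helps when some paragraphs have no text or no itemprop.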

Data from the OP:

html <- '<section class="card__amenities ">
<p class="l-text l-u-color-neutral-28 l-text--variant-body-small l-text--weight-regular card__amenity" itemprop="floorSize"><span data-testid="l-icon" role="document" aria-label="Tamanho do imóvel" class="l-icon l-u-color-undefined"><svg viewBox="0 0 24 24" fill="none" xmlns="http://www.w3.org/2000/svg">...</svg></span> 94 - 100 m² </p>
<p class="l-text l-u-color-neutral-28 l-text--variant-body-small l-text--weight-regular card__amenity" itemprop="numberOfRooms"><span data-testid="l-icon" role="document" aria-label="Quantidade de quartos" class="l-icon l-u-color-undefined"><svg viewBox="0 0 24 24" fill="none" xmlns="http://www.w3.org/2000/svg">...</svg></span> 3 </p>
<p class="l-text l-u-color-neutral-28 l-text--variant-body-small l-text--weight-regular card__amenity" itemprop="numberOfBathroomsTotal"><span data-testid="l-icon" role="document" aria-label="Quantidade de banheiros" class="l-icon l-u-color-undefined"><svg viewBox="0 0 24 24" fill="none" xmlns="http://www.w3.org/2000/svg">...</svg></span>3</p>
<p class="l-text l-u-color-neutral-28 l-text--variant-body-small l-text--weight-regular card__amenity"><span data-testid="l-icon" role="document" aria-label="Quantidade de vagas de garagem" class="l-icon l-u-color-undefined"><svg viewBox="0 0 24 24" fill="none" xmlns="http://www.w3.org/2000/svg"><...</svg></span>2</p>
</section>'

Created on 2023-09-15 with reprex v2.0.2
