Scraping with rvest: error "no applicable method for 'xml_find_first' applied to an object of class "character""


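I am trying to scrape a page on booking.com with rvest. The problem is that when a hotel has no rating, I need the code to return NA, so that the data frame has exactly the same number of rows for every parameter I am trying to scrape.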

The code I am using works perfectly when it does not have to return NA:

```r
# Necessary packages
library(rvest)
library(dplyr)
library(httr)

# Base URL of the search results page
base_url <- "https://www.booking.com/searchresults.it.html"

# Parameters added to the search to get the specific results
params <- list(
  ss = "Firenze%2C+Toscana%2C+Italia",
  efdco = 1,
  label = "booking-name-L*Xf2U1sq4*GEkIwcLOALQS267777916051%3Apl%3Ata%3Ap1%3Ap22%2C563%2C000%3Aac%3Aap%3Aneg%3Afi%3Atikwd-65526620%3Alp9069992%3Ali%3Adec%3Adm%3Appccp",
  aid = 376363,
  lang = "it",
  sb = 1,
  src_elem = "sb",
  src = "index",
  dest_id = -117543,
  dest_type = "city",
  ac_position = 0,
  ac_click_type = "b",
  ac_langcode = "it",
  ac_suggestion_list_length = 5,
  search_selected = "true",
  search_pageview_id = "2e375b14ad810329",
  ac_meta = "GhAyZTM3NWIxNGFkODEwMzI5IAAoATICaXQ6BGZpcmVAAEoAUAA%3D",
  checkin = "2023-06-11",
  checkout = "2023-06-18",
  group_adults = 2,
  no_rooms = 1,
  group_children = 0,
  sb_travel_purpose = "leisure"
)

# Create empty vectors to store the titles, ratings and prices
titles <- c()
ratings <- c()
prices <- c()

# Loop through each page of the search results
for (page_num in 1:35) {

  # Build the URL for the current page
  url <- modify_url(base_url, query = c(params, page = page_num))

  # Read the HTML of the specified page
  page <- read_html(url)

  # Extract the titles, ratings and prices from the current page
  # (selectors taken from the browser's Inspect view of the page)
  titles_page <- page %>% html_elements("div[data-testid='title']") %>% html_text()

  prices_page <- titles_page %>% html_element("span[data-testid='price-and-discounted-price']") %>% html_text()
  ratings_page <- titles_page %>% html_element("div[aria-label^='Punteggio di']") %>% html_text()

  # Append the titles, ratings and prices from the current page to the vectors
  titles <- c(titles, titles_page)
  prices <- c(prices, prices_page)
  ratings <- c(ratings, ratings_page)
}

hotel <- data.frame(titles, prices, ratings)

print(hotel)
```

I have seen it suggested to select a parent node and then its child nodes. I tried this, but it does not work:

```r
titles_page <- page %>% html_elements("div[data-testid='title']") %>% html_text()

prices_page <- titles_page %>% html_element("span[data-testid='price-and-discounted-price']") %>% html_text()
ratings_page <- titles_page %>% html_element("div[aria-label^='Punteggio di']") %>% html_text()
```
r web-scraping dplyr rvest httr
1 Answer

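`titles_page <- page %>% html_elements("div[data-testid='title']") %>% html_text()` creates a vector of character strings, because `html_text()` has already reduced the nodes to plain text. You cannot parse `titles_page` with `html_element()` in the following lines of code; that is what raises the `xml_find_first` error.

You are skipping the step of creating a vector of parent nodes. See your previous question, "How to report NA when scraping a web page with R and it has no value?", and look at this line in its answer: `properties <- html_elements(page, xpath=".//div[@data-testid='property-card']")`. It returns a vector of xml nodes. Now parse this vector of nodes to obtain the desired information.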
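To see why the error appears, here is a minimal sketch that reproduces it offline. The inline HTML is a made-up stand-in for the Booking.com markup (only the `data-testid` and `aria-label` attributes are taken from the question), and `minimal_html()` is used so that nothing is fetched from the network.

```r
library(rvest)

# Hypothetical stand-in for a results page: two cards, the second has no rating
page <- minimal_html('
  <div data-testid="property-card">
    <div data-testid="title">Hotel A</div>
    <div aria-label="Punteggio di 8,5">8,5</div>
  </div>
  <div data-testid="property-card">
    <div data-testid="title">Hotel B</div>
  </div>')

# html_text() turns the node set into a plain character vector
titles_page <- page %>% html_elements("div[data-testid='title']") %>% html_text()
class(titles_page)

# Calling html_element() on that character vector fails;
# tryCatch() captures the error instead of stopping the script
err <- tryCatch(
  html_element(titles_page, "div[aria-label^='Punteggio di']"),
  error = function(e) e
)
conditionMessage(err)
```

The captured message should be the one in the question title, since there are no nodes left to search once the text has been extracted.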

The error comes from the extraction lines, which are incorrect. They should read:

```r
# Find the parents
properties <- html_elements(page, xpath=".//div[@data-testid='property-card']")

# Get the information from each parent
titles_page <- properties %>% html_element("div[data-testid='title']") %>% html_text()
prices_page <- properties %>% html_element("span[data-testid='price-and-discounted-price']") %>% html_text()
ratings_page <- properties %>% html_element("div[aria-label^='Punteggio di']") %>% html_text()
```
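The singular `html_element()` matters here. As a rough illustration on made-up HTML (three hypothetical cards, the middle one without a rating), `html_elements()` drops the missing match, while `html_element()` returns one result per parent and `html_text()` turns the missing one into `NA`:

```r
library(rvest)

# Hypothetical markup; only the attribute names come from the question
page <- minimal_html('
  <div data-testid="property-card">
    <div aria-label="Punteggio di 8,5">8,5</div>
  </div>
  <div data-testid="property-card"></div>
  <div data-testid="property-card">
    <div aria-label="Punteggio di 9,1">9,1</div>
  </div>')

properties <- html_elements(page, xpath = ".//div[@data-testid='property-card']")

# Plural form: only the matches that exist, so the vector is too short
ratings_all <- properties %>%
  html_elements("div[aria-label^='Punteggio di']") %>%
  html_text()

# Singular form: exactly one entry per parent, NA where nothing matched
ratings_each <- properties %>%
  html_element("div[aria-label^='Punteggio di']") %>%
  html_text()

length(ratings_all)
length(ratings_each)
ratings_each
```

Because every vector built with the singular form has the same length as `properties`, the columns line up when they are combined into a data frame.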

The complete corrected loop is now:

```r
for (page_num in 1:35) {
   # Build the URL for the current page
   url <- modify_url(base_url, query = c(params, page = page_num))

   # Read the HTML of the specified page
   page <- read_html(url)

   # Parse out the parent node of each property card
   properties <- html_elements(page, xpath=".//div[@data-testid='property-card']")

   # Now find the information within each parent
   titles_page <- properties %>% html_element("div[data-testid='title']") %>% html_text()
   prices_page <- properties %>% html_element("span[data-testid='price-and-discounted-price']") %>% html_text()
   ratings_page <- properties %>% html_element("div[aria-label^='Punteggio di']") %>% html_text()

   # Append the titles, ratings and prices from the current page to the vectors
   titles <- c(titles, titles_page)
   prices <- c(prices, prices_page)
   ratings <- c(ratings, ratings_page)
}
```
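As an optional variation, not something the loop requires, the per-page extraction could be wrapped in a helper that returns one data frame per page, with the pieces bound together at the end. In this sketch `parse_page` is a hypothetical name, and the two inline pages are invented stand-ins for `read_html(url)` so the example runs without a network connection.

```r
library(rvest)

# Hypothetical helper: one row per property card, NA where a field is absent
parse_page <- function(page) {
  properties <- html_elements(page, xpath = ".//div[@data-testid='property-card']")
  data.frame(
    titles  = properties %>% html_element("div[data-testid='title']") %>% html_text(),
    prices  = properties %>% html_element("span[data-testid='price-and-discounted-price']") %>% html_text(),
    ratings = properties %>% html_element("div[aria-label^='Punteggio di']") %>% html_text()
  )
}

# Made-up pages standing in for the result of read_html(url)
pages <- list(
  minimal_html('
    <div data-testid="property-card">
      <div data-testid="title">Hotel A</div>
      <span data-testid="price-and-discounted-price">EUR 120</span>
      <div aria-label="Punteggio di 8,5">8,5</div>
    </div>
    <div data-testid="property-card">
      <div data-testid="title">Hotel B</div>
      <span data-testid="price-and-discounted-price">EUR 90</span>
    </div>'),
  minimal_html('
    <div data-testid="property-card">
      <div data-testid="title">Hotel C</div>
      <span data-testid="price-and-discounted-price">EUR 150</span>
      <div aria-label="Punteggio di 9,1">9,1</div>
    </div>')
)

# One data frame per page, then a single bind at the end
results <- lapply(pages, parse_page)
hotel <- do.call(rbind, results)
print(hotel)
```

In the real loop, `pages` would be replaced by the documents read from each URL. The point of the sketch is only that each page contributes rows whose columns already have equal length.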