How to web scrape table element using rvest?


I want to scrape data from this carrier link. I am using the rvest package in R, and I scraped some of the top-level information on the page with the code below:

library(rvest)

url <- "https://www.aaacooper.com/pwb/Transit/ProTrackResults.aspx?ProNum=241939875&AllAccounts=true"
page <- read_html(url)

# Extract the second table on the page
# (named `tbl` to avoid masking base::table)
tbl <- page %>% html_elements("table") %>% .[[2]] %>% html_table()

# Inspect the result
View(tbl)

This produces the following information:

However, I would like to retrieve the information from the Tracking Information table in tabular form:

html css r web-scraping rvest
1 Answer

Here is a simple approach:

library(rvest)
sess <- session("https://www.aaacooper.com/pwb/Transit/ProTrackResults.aspx?ProNum=241939875&AllAccounts=true")
html_table(sess)[[9]]
# # A tibble: 10 × 3
#    Date       Time  Description                                               
#    <chr>      <chr> <chr>                                                     
#  1 2022-06-24 13:02 Delivered To Consignee In BRADENTON, FL                   
#  2 2022-06-24 04:22 Shipment arrived at destination Service Center   TAMPA, FL
#  3 2022-06-24 03:02 Shipment departed ORLANDO Service Center                  
#  4 2022-06-23 06:34 Shipment arrived at ORLANDO Service Center                
#  5 2022-06-22 22:54 Shipment departed DOTHAN Service Center                   
#  6 2022-06-21 22:52 Shipment arrived at DOTHAN Service Center                 
#  7 2022-06-21 10:36 Shipment departed HOUSTON Service Center                  
#  8 2022-06-21 03:15 Shipment arrived at HOUSTON Service Center                
#  9 2022-06-20 19:59 Shipment departed WESLACO Service Center                  
# 10 2022-06-20 12:21 Shipment Picked Up From Shipper In WESLACO, TX            

The use of [[9]] is based on looking through all of the tables returned by html_table(); nothing guarantees that this index will stay the same.
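One way to avoid depending on a positional index is to pick the table by its column headers. The sketch below is only an illustration: the inline HTML is a hypothetical stand-in for the real page (built with minimal_html() so it runs offline), and the names `wanted`, `is_match`, and `tracking` are made up for this example. It assumes the table of interest is the only one whose header is exactly Date, Time, Description.

```r
library(rvest)

# Hypothetical stand-in for the real page: two tables, only one of which
# has the Date / Time / Description header we are after.
page <- minimal_html('
  <table>
    <tr><th>Pro Number</th><th>Status</th></tr>
    <tr><td>241939875</td><td>Delivered</td></tr>
  </table>
  <table>
    <tr><th>Date</th><th>Time</th><th>Description</th></tr>
    <tr><td>2022-06-24</td><td>13:02</td><td>Delivered To Consignee In BRADENTON, FL</td></tr>
    <tr><td>2022-06-20</td><td>12:21</td><td>Shipment Picked Up From Shipper In WESLACO, TX</td></tr>
  </table>
')

# html_table() on a whole document returns a list with one tibble per <table>
tables <- html_table(page)

# Flag the tables whose column names match the expected header exactly
wanted <- c("Date", "Time", "Description")
is_match <- vapply(tables, function(tbl) identical(names(tbl), wanted), logical(1))

# Keep the first matching table
tracking <- tables[is_match][[1]]
tracking
```

If the site adds or removes other tables, this lookup should still find the right one as long as the header text itself does not change.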

A better way to find the table is to look for a specific attribute, title, name, or id, ideally one found with SelectorGadget.

Looking at the page in a little more detail, the table's parent node has class="tracingInformation", which means we can do this:

html_element(sess, ".tracingInformation") %>%
  html_children() %>%
  html_table()
# [[1]]
# # A tibble: 10 × 3
#    Date       Time  Description                                               
#    <chr>      <chr> <chr>                                                     
#  1 2022-06-24 13:02 Delivered To Consignee In BRADENTON, FL                   
#  2 2022-06-24 04:22 Shipment arrived at destination Service Center   TAMPA, FL
#  3 2022-06-24 03:02 Shipment departed ORLANDO Service Center                  
#  4 2022-06-23 06:34 Shipment arrived at ORLANDO Service Center                
#  5 2022-06-22 22:54 Shipment departed DOTHAN Service Center                   
#  6 2022-06-21 22:52 Shipment arrived at DOTHAN Service Center                 
#  7 2022-06-21 10:36 Shipment departed HOUSTON Service Center                  
#  8 2022-06-21 03:15 Shipment arrived at HOUSTON Service Center                
#  9 2022-06-20 19:59 Shipment departed WESLACO Service Center                  
# 10 2022-06-20 12:21 Shipment Picked Up From Shipper In WESLACO, TX            
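Note that html_children() returns a node set, so html_table() gives back a list (hence the [[1]] in the output). If a single tibble is preferred, a CSS descendant selector can target the table directly. This is a minimal sketch on a hypothetical inline fragment rather than the live page; the enclosing <div> is an assumption, since only the class name is known from the page.

```r
library(rvest)

# Hypothetical fragment mimicking the relevant part of the page; the real
# markup may use a different element than <div> for the parent node.
page <- minimal_html('
  <div class="tracingInformation">
    <table>
      <tr><th>Date</th><th>Time</th><th>Description</th></tr>
      <tr><td>2022-06-24</td><td>13:02</td><td>Delivered To Consignee In BRADENTON, FL</td></tr>
    </table>
  </div>
')

# html_element() returns the first <table> inside .tracingInformation,
# so html_table() yields one tibble instead of a list of tibbles
tracking <- page %>%
  html_element(".tracingInformation table") %>%
  html_table()

tracking
```

Against the real site, the same selector could be applied to `sess` in place of `page`, assuming the markup is as described above.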