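I want to scrape data from this carrier link. I am using the rvest package in R, and I used the code below to scrape some of the top-level information from the page: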
library(rvest)
url <- "https://www.aaacooper.com/pwb/Transit/ProTrackResults.aspx?ProNum=241939875&AllAccounts=true"
page <- read_html(url)
# Extract the table on the page
table <- page %>% html_nodes("table") %>% .[[2]] %>% html_table()
# Print the table
View(table)
Here is a simple way:
library(rvest)
sess <- session("https://www.aaacooper.com/pwb/Transit/ProTrackResults.aspx?ProNum=241939875&AllAccounts=true")
html_table(sess)[[9]]
# # A tibble: 10 × 3
# Date Time Description
# <chr> <chr> <chr>
# 1 2022-06-24 13:02 Delivered To Consignee In BRADENTON, FL
# 2 2022-06-24 04:22 Shipment arrived at destination Service Center TAMPA, FL
# 3 2022-06-24 03:02 Shipment departed ORLANDO Service Center
# 4 2022-06-23 06:34 Shipment arrived at ORLANDO Service Center
# 5 2022-06-22 22:54 Shipment departed DOTHAN Service Center
# 6 2022-06-21 22:52 Shipment arrived at DOTHAN Service Center
# 7 2022-06-21 10:36 Shipment departed HOUSTON Service Center
# 8 2022-06-21 03:15 Shipment arrived at HOUSTON Service Center
# 9 2022-06-20 19:59 Shipment departed WESLACO Service Center
# 10 2022-06-20 12:21 Shipment Picked Up From Shipper In WESLACO, TX
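The use of [[9]] is based on looking through all the tables returned by html_table(); nothing guarantees that this index will stay the same if the page changes.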
A better way to locate the table is to look for a specific attribute/header/name/id, ideally found with SelectorGadget.
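As one illustration of matching on a header rather than a position, the sketch below keeps only the table whose column names are Date, Time and Description. It runs against a small made-up page built with minimal_html(), so the markup and the first dummy table are assumptions; only the three column names come from the output above.

```r
library(rvest)

# Hypothetical stand-in for the real page: two tables, only one of which
# has the tracking headers seen in the output above.
page <- minimal_html('
  <table>
    <tr><th>Pro</th><th>Status</th></tr>
    <tr><td>1</td><td>x</td></tr>
  </table>
  <table>
    <tr><th>Date</th><th>Time</th><th>Description</th></tr>
    <tr><td>2022-06-20</td><td>12:21</td><td>Shipment Picked Up</td></tr>
  </table>')

# Parse every table on the page into a list of tibbles
tables <- html_table(html_elements(page, "table"))

# Flag the tables whose header row is exactly the one we expect
wanted <- c("Date", "Time", "Description")
is_match <- vapply(tables, function(tb) identical(names(tb), wanted), logical(1))

# Fail loudly if the page layout changed and nothing matches
stopifnot(any(is_match))
tracking <- tables[is_match][[1]]
```

This still parses every table, but the result no longer depends on how many tables come before the one you want.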
Looking at the page in a bit more detail, the table's parent node has class="tracingInformation", which suggests we can do this:
html_element(sess, ".tracingInformation") %>%
html_children() %>%
html_table()
# [[1]]
# # A tibble: 10 × 3
# Date Time Description
# <chr> <chr> <chr>
# 1 2022-06-24 13:02 Delivered To Consignee In BRADENTON, FL
# 2 2022-06-24 04:22 Shipment arrived at destination Service Center TAMPA, FL
# 3 2022-06-24 03:02 Shipment departed ORLANDO Service Center
# 4 2022-06-23 06:34 Shipment arrived at ORLANDO Service Center
# 5 2022-06-22 22:54 Shipment departed DOTHAN Service Center
# 6 2022-06-21 22:52 Shipment arrived at DOTHAN Service Center
# 7 2022-06-21 10:36 Shipment departed HOUSTON Service Center
# 8 2022-06-21 03:15 Shipment arrived at HOUSTON Service Center
# 9 2022-06-20 19:59 Shipment departed WESLACO Service Center
# 10 2022-06-20 12:21 Shipment Picked Up From Shipper In WESLACO, TX
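Note that the call above returns a list with one element. If you would rather get the tibble directly, a possible variation is to put the table into the CSS selector itself. The sketch below assumes the table is a descendant of the element with class tracingInformation, and it uses a small made-up page built with minimal_html() in place of the live site, so the exact markup is hypothetical.

```r
library(rvest)

# Hypothetical stand-in for the real page; two rows are copied from the
# output above, the surrounding markup is assumed.
page <- minimal_html('
  <div class="tracingInformation">
    <table>
      <tr><th>Date</th><th>Time</th><th>Description</th></tr>
      <tr><td>2022-06-24</td><td>13:02</td><td>Delivered To Consignee In BRADENTON, FL</td></tr>
      <tr><td>2022-06-20</td><td>12:21</td><td>Shipment Picked Up From Shipper In WESLACO, TX</td></tr>
    </table>
  </div>')

# html_element() returns a single node, so html_table() gives one tibble
tracking <- page %>%
  html_element(".tracingInformation table") %>%
  html_table()
```

On the live page, sess would take the place of page here.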