How do I get html_attr("href") to return the full link?


OK, first, here is my code.

library(rvest)
library(httr)
library(RSelenium)
library(tidyverse)
library(httr2)

url <- "https://www.congress.gov/congressional-record/50th-congress/browse-by-date"
target_page <- read_html(url)

x <- GET(url, add_headers('user-agent' = 'Gov employment data scraper ([email protected])'))
# Parse the response itself; passing url as a second argument would be read as the encoding
target_page <- read_html(x)
target_page

driver <- rsDriver(browser = "firefox",
               chromever = NULL)
remote_driver <- driver[["client"]]
remote_driver$open()
remote_driver$navigate(url)
remote_driver$getTitle()

html_source <- remote_driver$getPageSource()
target_page <- read_html(html_source[[1]])

target_page

tab = html_nodes(target_page, xpath='//*[@id="innerbox_20-3"]')
daylinks <- tab %>% html_nodes("a")
links <- daylinks %>% html_attr("href")

links

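I'm trying to figure out how to get this line:

links <- daylinks %>% html_attr("href")

to produce a list of full links, but it keeps returning relative paths such as "/bound-congressional-record/1889/03/02/senate-section" instead of "https://www.congress.gov/bound-congressional-record/1889/03/02/senate-section".

How can I fix this? Also, if possible, is there a simple way to open the links so that I can start scraping those pages as well?

Thanks!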

r web-scraping rvest
1 Answer

Perhaps this can be worked around with a simple "trick": prepend the site root to each relative href.

paste0("https://www.congress.gov", daylinks %>% html_attr("href"))
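
As a possible alternative to pasting strings, xml2 provides url_absolute(), which resolves relative hrefs against a base URL. The sketch below is only an illustration under stated assumptions: the HTML fragment stands in for the scraped table, the second href is made up, and read_pages() is a hypothetical helper (not part of rvest or xml2) for the follow-up question about opening each link.

```r
library(rvest)
library(xml2)

base_url <- "https://www.congress.gov"

# Stand-in for the scraped table; the second href is a made-up example.
page <- read_html('<div id="innerbox_20-3">
  <a href="/bound-congressional-record/1889/03/02/senate-section">Senate</a>
  <a href="/bound-congressional-record/1889/03/02/house-section">House</a>
</div>')

links <- page %>%
  html_nodes("a") %>%
  html_attr("href")

# Resolve each relative href against the site root.
full_links <- url_absolute(links, base_url)

# Hypothetical helper: fetch each URL in turn, pausing between requests.
# `fetch` defaults to read_html; a failed request yields NULL for that URL.
read_pages <- function(urls, fetch = read_html, delay = 1) {
  lapply(urls, function(u) {
    Sys.sleep(delay)
    tryCatch(fetch(u), error = function(e) NULL)
  })
}

# pages <- read_pages(full_links)  # performs real HTTP requests
```

If plain requests are blocked, which the use of RSelenium in the question suggests may be the case, the same loop could instead call remote_driver$navigate(u) followed by remote_driver$getPageSource() for each URL.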