I have a list of URLs collected from Twitter. They are all shortened (bit.ly, buff.ly, etc.), and I need the full links to the destination domains for some analysis. Since there are thousands of URLs, I want to automate the process and tried writing a for loop for it.
I started with a simple for loop using httr::HEAD, which works fine as long as the links are valid.
library(tidyverse)
library(httr)

# Creating an example with the first few links in my data
shortened_urls <- c("buff.ly/2uDuJw9", "buff.ly/39zvF3Y", "buff.ly/39zvF3Y",
                    "buff.ly/39zvF3Y", "buff.ly/39VDwcd", "buff.ly/2U8oyZW",
                    "buff.ly/2IRrrJ7", "buff.ly/3a00mQb", "buff.ly/3a83WHM")

out_df <- data.frame()  # empty df to store the results
for (i in 1:length(shortened_urls)) {
  x <- HEAD(shortened_urls[i])
  out_df <- bind_rows(out_df, data.frame(x$url))
}
This gives the expected result: a data frame containing the full links. [screenshot: data frame with the full links]
However, many of the links I want to collect are several years old, and some of them are no longer valid. This throws an error that stops the loop at the 8th URL. I would like the loop to simply ignore those links and carry on.
I tried adding a check to the loop to test whether a link is still valid and, if not, skip to the next URL.
for (i in 1:length(shortened_urls)) {
  valid_url <- TRUE
  while (valid_url) {
    x <- HEAD(shortened_urls[i])
    out_df <- bind_rows(out_df, data.frame(x$url))
    valid_url <- httr::GET(shortened_urls[i])$status_code == "200"
    if (!valid_url) {
      next
    }
  }
}
However, this gets the loop stuck on the fifth link: it writes that same link over and over, never moving on to the sixth link or exiting the loop. [screenshot: data frame with the repeated link]
The problem persists even when I simply remove the offending URL; it just recurs at the next one.
Is there a way to do something like this?
tryCatch will catch the error, and the result can then be tested afterwards. (The while loop in your attempt is the culprit: whenever a GET returns status 200, valid_url stays TRUE and the loop appends the same URL forever; next only restarts the while loop and never advances the outer for.)
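To see the idea in isolation before the full loop: `tryCatch(expr, error = function(e) e)` returns either the value of `expr` or the condition object, and `inherits(x, "error")` tells you which one you got. A minimal sketch with no network calls, where `stop()` stands in for a failing `HEAD()` request:

```r
# stop() stands in for a HEAD() call that fails (e.g. a dead link)
bad <- tryCatch(stop("could not resolve host"), error = function(e) e)
inherits(bad, "error")    # TRUE -> the loop can record it and move on
conditionMessage(bad)     # "could not resolve host"

# a successful expression passes its value through unchanged
good <- tryCatch("https://example.com/full-url", error = function(e) e)
inherits(good, "error")   # FALSE
```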
library(httr)

# Creating an example with the first few links in my data
shortened_urls <- c("buff.ly/2uDuJw9", "buff.ly/39zvF3Y", "buff.ly/39zvF3Y",
                    "buff.ly/39zvF3Y", "buff.ly/39VDwcd", "buff.ly/2U8oyZW",
                    "buff.ly/2IRrrJ7", "buff.ly/3a00mQb", "buff.ly/3a83WHM")

# Create an empty list to store the results
out_urls <- character(length(shortened_urls)) |> as.list()

for (i in seq_along(shortened_urls)) {
  x <- tryCatch(
    HEAD(shortened_urls[i]),
    error = function(e) e
  )
  if (inherits(x, "error")) {
    out_urls[[i]] <- x           # keep the condition object for later inspection
  } else if (status_code(x) %/% 100 == 2) {
    out_urls[[i]] <- x[["url"]]  # 2xx: store the expanded URL
  } else {
    out_urls[[i]] <- sprintf("error: %d", status_code(x))
  }
}

# Drop the entries that errored and bind the rest into a data frame
ok <- !sapply(out_urls, inherits, "error")
out_df <- data.frame(url = unlist(out_urls[ok]))
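One detail worth noting in the loop above: `status_code(x) %/% 100 == 2` uses integer division so that the whole 2xx family counts as success, not just 200. For example:

```r
# %/% is integer division, so the check accepts every 2xx status
200 %/% 100 == 2   # TRUE
204 %/% 100 == 2   # TRUE  (No Content)
301 %/% 100 == 2   # FALSE (redirects are not counted as success)
404 %/% 100 == 2   # FALSE
```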