所以我编写了这个脚本来自动从维基百科的一般信息框中下载一个物种的图片。我有这个数据框,其中包含该物种的所有(拉丁)名称,然后我想自动下载维基百科物种图片并将它们放在地图上。
维基百科链接示例: https://en.wikipedia.org/wiki/Eurasian_eagle-owl
但是,我的脚本下载了图片的低质量版本。我该如何修改它才能以最佳质量下载原始文件?
数据框示例:
> bird_names
[1] "Prunella modularis" "Myiopsitta monachus"
[3] "Pyrrhura perlata" "Tyto alba"
[5] "Panurus biarmicus" "Merops apiaster"
脚本:
# Function to download and save an image from Wikipedia
download_wikipedia_image <- function(bird_name) {
# Construct the Wikipedia URL for the bird species
wikipedia_url <- paste0("https://en.wikipedia.org/wiki/", gsub(" ", "_", bird_name))
# Read the HTML content of the Wikipedia page
page <- read_html(wikipedia_url)
# Extract all image URLs from the page
image_urls <- page %>%
html_nodes("table.infobox img") %>%
html_attr("src")
# Download and save the first image (if available)
if (length(image_urls) > 0) {
download.file(paste0("https:", image_urls[1]), paste0("BIRDPHOTO/", gsub(" ", "_", bird_name), ".jpg"))
cat("Downloaded photo for", bird_name, "\n")
} else {
cat("No photo found for", bird_name, "\n")
}
}
# Create BIRDPHOTO directory if it doesn't exist
dir.create("BIRDPHOTO", showWarnings = FALSE)
# Loop through each bird name and download the corresponding image
for (bird_name in bird_names) {
download_wikipedia_image(bird_name)
}
# Optional: Print a message when all downloads are complete
cat("All downloads completed.\n")
那是因为你必须按照低质量照片访问 wiki 页面(即 https://en.wikipedia.org/wiki/File:Baardman_-_Panurus_biarmicus_(15147085070).jpg)并搜索
Original file
链接,例如:
bird_name <- "Panurus biarmicus"
# Construct the Wikipedia URL for the bird species
wikipedia_url <- paste0("https://en.wikipedia.org/wiki/", gsub(" ", "_", bird_name))
# Read the HTML content of the Wikipedia page
page <- xml2::read_html(wikipedia_url)
# Extract all image URLs from the page
urls <- page |>
rvest::html_nodes("table.infobox") |>
rvest::html_elements(css = "a.mw-file-description") |>
rvest::html_attr("href")
urls
#> [1] "/wiki/File:Baardman_-_Panurus_biarmicus_(15147085070).jpg"
#> [2] "/wiki/File:PanurusBiarmicusIUCN2019-3.png"
image_url <- xml2::read_html(paste0("https://en.wikipedia.org/", urls[[1]])) |>
rvest::html_nodes("div.fullMedia") |>
rvest::html_element(css = "a") |>
rvest::html_attr("href")
image_url
#> [1] "//upload.wikimedia.org/wikipedia/commons/f/fb/Baardman_-_Panurus_biarmicus_%2815147085070%29.jpg"
于 2023 年 12 月 11 日使用 reprex v2.0.2 创建
使用这个:
# Download and save the first image (if available)
if (length(image_urls) > 0) {
#extract the file name
file_name <- strsplit(image_urls[1],"/")[[1]][9]
#query API
req <- httr::GET(glue::glue("https://en.wikipedia.org/w/api.php?action=query&titles=File:{file_name}&prop=imageinfo&iiprop=url&format=json"))
cont <- httr::content(req)
#extract url from query
full_image_url <- cont$query$pages$`-1`$imageinfo[[1]]$url
download.file(full_image_url, paste0("BIRDPHOTO/", gsub(" ", "_", bird_name), ".jpg"))
cat("Downloaded photo for", bird_name, "\n")
} else {
cat("No photo found for", bird_name, "\n")
}