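I am unable to download or read a zip file from an API request using the httr package. Is there another package I can try that would let me download/read a binary zip file stored in the response of a GET request in R?
I tried two approaches:
1. Using GET to obtain a response object of type application/json (successful), then using fromJSON on content(my_response, 'text') to extract the content. The output includes a column named "zip", which is the data I want to download; the documentation states that it is a base64-encoded binary file. The column is currently a very long string of seemingly random letters, and I am not sure how to convert it into the actual dataset.
2. Bypassing fromJSON, because I noticed that the response object itself has a field of class 'raw'. That object is a list of seemingly random numbers, which I suspect is a binary representation of the dataset. I tried rawToChar(my_response$content) to convert the raw data to character, but this produces the same long string as in #1.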
getzip1 <- GET(getzip1_link)
getzip1 # successful response, status 200
df <- fromJSON(content(getzip1, "text"))
df$status # "OK"
df$dataset$zip # <- this is the very long string of letters (eg. "I1NC5qc29uUEsBAhQDFA...")
# Method 1: try to convert from the 'zip' object in the output of fromJSON
try1 <- base64_dec(df$dataset$zip)
#looks similar to getzip1$content (i.e. this produces the list of numbers/letters 50 4b 03 04 14 00, etc, perhaps binary representation)
# Method 2: try to get data directly from raw object
class(getzip1$content) # <- 'raw' class object directly from GET request
try2 <- rawToChar(getzip1$content) # returns same output as df$dataset$zip
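I should be able to use either the raw 'content' object in the response or the long string in the 'zip' object from the fromJSON output to view the dataset or download it somehow, but I don't know how. Please help!
Welcome!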
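According to the API documentation, the getDataset endpoint returns a dataset archive including meta information; the dataset itself is base64-encoded to allow binary ZIP transfer. The response has the following schema: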
{
"status": "OK",
"dataset": {
"state_id": 5,
"session_id": 1624,
"session_name": "2019-2020 Regular Session",
"dataset_hash": "1c7d77fe298a4d30ad763733ab2f8c84",
"dataset_date": "2018-12-23",
"dataset_size": 317775,
"mime": "application\/zip",
"zip": "MIME 64 Encoded Document"
}
}
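To see why the "zip" field looks like random letters, it may help to run the decoding step on a tiny made-up payload. This is only a sketch, assuming jsonlite is installed: fake_body is a hypothetical stand-in for the API response, and its payload is just the four-byte signature that ZIP archives start with (the 50 4b 03 04 seen in the question), not real LegiScan data.

```r
library(jsonlite)

# Hypothetical stand-in for the API response: the ZIP signature bytes,
# base64-encoded the same way the real "zip" field is.
zip_signature <- as.raw(c(0x50, 0x4b, 0x03, 0x04))
fake_body <- toJSON(
  list(status = "OK", dataset = list(zip = base64_enc(zip_signature))),
  auto_unbox = TRUE
)

# Parse the JSON, then decode the base64 text back into bytes
parsed  <- fromJSON(fake_body)
decoded <- base64_dec(parsed$dataset$zip)

class(decoded)                     # a raw vector: bytes, not text
identical(decoded, zip_signature)  # decoding recovers the original bytes
```

The decoded raw vector is the file itself, so the remaining step is to write it to disk with writeBin rather than to convert it to character.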
We can fetch the data in R with the following code:
library(httr)
library(jsonlite)
library(stringr)
library(maditr)
token <- "" # Your API key
session_id <- 1253L # Obtained from the getDatasetList endpoint
access_key <- "2qAtLbkQiJed9Z0FxyRblu" # Obtained from the getDatasetList endpoint
destfile <- file.path("path", "to", "file.zip") # Modify
response <- str_c("https://api.legiscan.com/?key=",
                  token,
                  "&op=getDataset&id=",
                  session_id,
                  "&access_key=",
                  access_key) %>%
  GET()
status_code(x = response) == 200 # Good
body <- content(x = response,
                as = "text",
                encoding = "UTF-8") %>%
  fromJSON() # This contains some extra metadata
# Decode the base64 "zip" field and write the bytes to disk
body %>%
  getElement(name = "dataset") %>%
  getElement(name = "zip") %>%
  base64_dec() %>%
  writeBin(con = destfile)
unzip(zipfile = destfile)
unzip will extract the files, which in this case look like:
hash.md5 # Can be checked against the metadata
AL/2016-2016_1st_Special_Session/bill/*.json
AL/2016-2016_1st_Special_Session/people/*.json
AL/2016-2016_1st_Special_Session/vote/*.json
As always, wrap the code in a function and profit.
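One possible way to do that wrapping is sketched below. The function names write_dataset_zip and download_dataset are hypothetical, and the check on the "status" field assumes the schema shown above. The decoding step is kept separate from the request so it can be tried offline with a made-up payload; download_dataset is defined but not called here.

```r
# Hypothetical helper: decode the "zip" field of a getDataset JSON body
# and write the archive to destfile. Returns destfile invisibly.
write_dataset_zip <- function(json_text, destfile) {
  body <- jsonlite::fromJSON(json_text)
  if (!identical(body$status, "OK")) {
    stop("API returned status: ", body$status)
  }
  writeBin(jsonlite::base64_dec(body$dataset$zip), con = destfile)
  invisible(destfile)
}

# Hypothetical wrapper around the request itself (needs httr and an API key).
download_dataset <- function(token, session_id, access_key, destfile) {
  response <- httr::GET(
    "https://api.legiscan.com/",
    query = list(key = token, op = "getDataset",
                 id = session_id, access_key = access_key)
  )
  httr::stop_for_status(response)
  write_dataset_zip(
    httr::content(response, as = "text", encoding = "UTF-8"),
    destfile
  )
}

# Offline check with a made-up payload (the four ZIP signature bytes)
payload <- as.raw(c(0x50, 0x4b, 0x03, 0x04))
fake_json <- jsonlite::toJSON(
  list(status = "OK", dataset = list(zip = jsonlite::base64_enc(payload))),
  auto_unbox = TRUE
)
out <- write_dataset_zip(fake_json, tempfile(fileext = ".zip"))
readBin(out, what = "raw", n = 4) # the bytes that were written to disk
```

With a real key, download_dataset(token, session_id, access_key, destfile) followed by unzip(zipfile = destfile) would reproduce the steps above.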
PS: For comparison, here is how the same code looks in Julia.
using Base64, HTTP, JSON3, CodecZlib
token = "" # Your API key
session_id = 1253 # Obtained from the getDatasetList endpoint
access_key = "2qAtLbkQiJed9Z0FxyRblu" # Obtained from the getDatasetList endpoint
destfile = joinpath("path", "to", "file.zip") # Modify
response = string("https://api.legiscan.com/?",
                  join(["key=$token",
                        "op=getDataset",
                        "id=$session_id",
                        "access_key=$access_key"],
                       "&")) |>
    HTTP.get
@assert response.status == 200
JSON3.read(response.body) |>
    (content -> content.dataset.zip) |>
    base64decode |>
    (data -> write(destfile, data))
run(`unzip $destfile`) # Pass the archive as an argument; pipeline(`unzip`, destfile) would redirect stdout into the file