How do I download and/or extract data stored in a "raw" binary zip object inside a response object in R?

Question

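I cannot download or read a zip file from an API request using the httr package. Is there another package I could try that would let me download/read a binary zip file stored in the response of a GET request in R?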

I tried two approaches:

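1. Using GET to obtain a response object of type application/json (successful), then using fromJSON to extract the content via content(my_response, 'text'). The output includes a column named "zip", which is the data I want to download; the documentation states it is a base64-encoded binary file. This column is currently a very long string of random letters, and I am not sure how to convert it into the actual dataset.
2. Bypassing fromJSON, because I noticed the response object itself has a field of class 'raw'. That object is a list of random numbers, which I suspect is the binary representation of the dataset. I tried rawToChar(my_response$content) to convert the raw data to character, but that produces the same long string as in #1.

I also noticed that with approach #1, if I use base64_dec() on the long string, I get the same kind of output as the 'raw' field in the response object itself.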

library(httr)
library(jsonlite)

getzip1 <- GET(getzip1_link)
getzip1 # successful response, status 200
df <- fromJSON(content(getzip1, "text"))

df$status # "OK"
df$dataset$zip # <- this is the very long string of letters (e.g. "I1NC5qc29uUEsBAhQDFA...")

# Method 1: try to convert from the 'zip' object in the output of fromJSON
try1 <- base64_dec(df$dataset$zip)
# looks similar to getzip1$content (i.e. this produces a sequence of hex bytes: 50 4b 03 04 14 00, etc., perhaps a binary representation)

# Method 2: try to get data directly from raw object
class(getzip1$content) # <- 'raw' class object directly from GET request
try2 <- rawToChar(getzip1$content) # returns the same long string as df$dataset$zip

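I should be able to use either the raw 'content' object in the response or the long string in the 'zip' object from the fromJSON output to view the dataset or download it somehow. I do not know how to do that. Please help!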

Answer 1

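Welcome!

Based on the API documentation, the response of the getDataset endpoint has the schema

Dataset archive including meta information; the dataset itself is base64 encoded to allow for binary ZIP transfers.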

{
 "status": "OK",
 "dataset": {
 "state_id": 5,
 "session_id": 1624,
 "session_name": "2019-2020 Regular Session",
 "dataset_hash": "1c7d77fe298a4d30ad763733ab2f8c84",
 "dataset_date": "2018-12-23",
 "dataset_size": 317775,
 "mime": "application\/zip",
 "zip": "MIME 64 Encoded Document"
 }
}
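To see why decoding the `zip` field gives usable data, here is a small offline sketch. It assumes only that the field holds base64 text as the schema says. The six bytes used as a stand-in payload are made up for illustration (they mirror the `50 4b 03 04 14 00` prefix seen in the question), and `is_zip` is just an illustrative variable name.

```r
library(jsonlite)

# Hypothetical stand-in for the API response: the first bytes of a ZIP
# archive, base64-encoded and wrapped in JSON shaped like the schema above.
zip_bytes <- as.raw(c(0x50, 0x4b, 0x03, 0x04, 0x14, 0x00))
payload <- toJSON(
  list(status = "OK",
       dataset = list(mime = "application/zip",
                      zip = base64_enc(zip_bytes))),
  auto_unbox = TRUE
)

# Same steps as with a real response body: parse the JSON, then decode.
parsed <- fromJSON(payload)
decoded <- base64_dec(parsed$dataset$zip)

# A ZIP archive starts with the local file header signature "PK\003\004",
# i.e. the bytes 50 4b 03 04.
is_zip <- identical(decoded[1:4], as.raw(c(0x50, 0x4b, 0x03, 0x04)))
```

If the decoded vector starts with those four bytes, the data is most likely a zip archive and only needs to be written to disk with `writeBin` before it can be unzipped.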

We can use R to fetch the data with the following code:

library(httr)
library(jsonlite)
library(stringr)
library(maditr)
token <- "" # Your API key
session_id <- 1253L # Obtained from the getDatasetList endpoint
access_key <- "2qAtLbkQiJed9Z0FxyRblu" # Obtained from the getDatasetList endpoint
destfile <- file.path("path", "to", "file.zip") # Modify
response <- str_c("https://api.legiscan.com/?key=",
                  token,
                  "&op=getDataset&id=",
                  session_id,
                  "&access_key=",
                  access_key) %>%
  GET()
status_code(x = response) == 200 # Should be TRUE
# Parse the JSON body once; besides the archive it carries some extra metadata
body <- content(x = response,
                as = "text",
                encoding = "UTF-8") %>%
  fromJSON()
# Decode the base64 `zip` field to raw bytes and write them to disk
body %>%
  getElement(name = "dataset") %>%
  getElement(name = "zip") %>%
  base64_dec() %>%
  writeBin(con = destfile)
unzip(zipfile = destfile) # Extracts into the current working directory

unzip extracts the files, which in this case look like:

hash.md5 # Can be checked against the metadata
AL/2016-2016_1st_Special_Session/bill/*.json
AL/2016-2016_1st_Special_Session/people/*.json
AL/2016-2016_1st_Special_Session/vote/*.json

As always, wrap the code in a function and profit.
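One way such a wrapper might look is sketched below. The name `save_dataset_zip` and its arguments are hypothetical and not part of any package. The function takes the JSON body as text, so it can be tried without network access, and it assumes the body follows the schema shown earlier.

```r
library(jsonlite)

# Hypothetical helper: decode the base64 `zip` field of a getDataset-style
# JSON body and write the resulting bytes to `destfile`.
# Returns the path invisibly so the call can be chained with unzip().
save_dataset_zip <- function(json_text, destfile) {
  body <- fromJSON(json_text)
  if (!identical(body$status, "OK")) {
    stop("API returned status: ", body$status)
  }
  writeBin(base64_dec(body$dataset$zip), con = destfile)
  invisible(destfile)
}
```

With an httr response this could be called as `save_dataset_zip(content(response, as = "text", encoding = "UTF-8"), destfile)`, followed by `unzip(destfile)`.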

PS: Here is how the code compares in Julia.

using Base64, HTTP, JSON3, CodecZlib
token = "" # Your API key
session_id = 1253 # Obtained from the getDatasetList endpoint
access_key = "2qAtLbkQiJed9Z0FxyRblu" # Obtained from the getDatasetList endpoint
destfile = joinpath("path", "to", "file.zip") # Modify
response = string("https://api.legiscan.com/?",
                  join(["key=$token",
                        "op=getDataset",
                        "id=$session_id",
                        "access_key=$access_key"],
                        "&")) |>
    HTTP.get
@assert response.status == 200
JSON3.read(response.body) |>
    (content -> content.dataset.zip) |>
    base64decode |>
    (data -> write(destfile, data))
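# Pass the archive as an argument; pipeline(`unzip`, destfile) would instead
# redirect unzip's stdout into destfile and overwrite the archive
run(`unzip $destfile`)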

Answer 2

See the answers on how to open a zip file downloaded from a URL:

Getting a zip file with httr
