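I am interested in retrieving machine-readable meta-information about R packages.
For example, when I visit CRAN, I can see a short description of a package before downloading it: https://cran.r-project.org/web/packages/MASS/
I could not find any way to get output other than HTML from the CRAN server. I would like to avoid parsing HTML and instead retrieve the package meta-information in a more convenient format such as JSON.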
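As far as I know, every R package ships a YAML-like(?) descriptive text file called DESCRIPTION inside its source bundle. So far, however, I have only been able to find this file inside the tar archive, which means I would have to download the package before I can read its description.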
Here is an example of the DESCRIPTION file from the MASS package:
Package: MASS
Priority: recommended
Version: 7.3-55
Date: 2022-01-12
Revision: $Rev: 3559 $
Depends: R (>= 3.3.0), grDevices, graphics, stats, utils
Imports: methods
Suggests: lattice, nlme, nnet, survival
Authors@R: c(person("Brian", "Ripley", role = c("aut", "cre", "cph"),
                    email = "[email protected]"),
             person("Bill", "Venables", role = "ctb"),
             person(c("Douglas", "M."), "Bates", role = "ctb"),
             person("Kurt", "Hornik", role = "trl",
                    comment = "partial port ca 1998"),
             person("Albrecht", "Gebhardt", role = "trl",
                    comment = "partial port ca 1998"),
             person("David", "Firth", role = "ctb"))
Description: Functions and datasets to support Venables and Ripley,
  "Modern Applied Statistics with S" (4th edition, 2002).
Title: Support Functions and Datasets for Venables and Ripley's MASS
LazyData: yes
ByteCompile: yes
License: GPL-2 | GPL-3
URL: http://www.stats.ox.ac.uk/pub/MASS4/
Contact: <[email protected]>
NeedsCompilation: yes
Packaged: 2022-01-13 05:06:37 UTC; ripley
Author: Brian Ripley [aut, cre, cph],
  Bill Venables [ctb],
  Douglas M. Bates [ctb],
  Kurt Hornik [trl] (partial port ca 1998),
  Albrecht Gebhardt [trl] (partial port ca 1998),
  David Firth [ctb]
Maintainer: Brian Ripley <[email protected]>
Repository: CRAN
Date/Publication: 2022-01-13 08:05:04 UTC
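Any suggestions on how to get this directly in a convenient, machine-readable form?
I tried searching for it, but search engines have not turned up anything useful so far.
Edit/clarification: I am looking for a solution that does not depend on R, i.e. a web API that is independent of the framework or language used to retrieve the metadata.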
Does tools::CRAN_package_db() have all of the information you want? (See here for some discussion.)
> dd <- tools::CRAN_package_db()
> names(dd)
[1] "Package" "Version"
[3] "Priority" "Depends"
[5] "Imports" "LinkingTo"
[7] "Suggests" "Enhances"
[9] "License" "License_is_FOSS"
[11] "License_restricts_use" "OS_type"
[13] "Archs" "MD5sum"
[15] "NeedsCompilation" "Additional_repositories"
[17] "Author" "Authors@R"
[19] "Biarch" "BugReports"
[21] "BuildKeepEmpty" "BuildManual"
[23] "BuildResaveData" "BuildVignettes"
[25] "Built" "ByteCompile"
[27] "Classification/ACM" "Classification/ACM-2012"
[29] "Classification/JEL" "Classification/MSC"
[31] "Classification/MSC-2010" "Collate"
[33] "Collate.unix" "Collate.windows"
[35] "Contact" "Copyright"
[37] "Date" "Description"
[39] "Encoding" "KeepSource"
[41] "Language" "LazyData"
[43] "LazyDataCompression" "LazyLoad"
[45] "MailingList" "Maintainer"
[47] "Note" "Packaged"
[49] "RdMacros" "StagedInstall"
[51] "SysDataCompression" "SystemRequirements"
[53] "Title" "Type"
[55] "URL" "UseLTO"
[57] "VignetteBuilder" "ZipData"
[59] "Published" "Path"
[61] "X-CRAN-Comment" "Reverse depends"
[63] "Reverse imports" "Reverse linking to"
[65] "Reverse suggests" "Reverse enhances"
I would add that although the first step does require R, you can easily generate a JSON file and store it locally for use by other tools:
library(jsonlite)
(tools::CRAN_package_db()
|> jsonlite::toJSON()
|> writeLines("R_packages.json")
)
(This produces a file of about 30 MB with no line breaks, but I think it should still be usable...)
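Once the file exists, it can be read without R. The following is a minimal Python sketch, assuming that jsonlite::toJSON() wrote the data frame in its default row-wise layout (a JSON array with one object per package, with missing fields left out). The helper names index_by_package and load_package_index are made up for this illustration.

```python
import json


def index_by_package(records):
    """Map package name -> metadata dict, keeping the first row seen per name."""
    index = {}
    for record in records:
        name = record.get("Package")
        if name is not None:
            # setdefault keeps the first occurrence if a name appears twice
            index.setdefault(name, record)
    return index


def load_package_index(path="R_packages.json"):
    """Load the JSON file written by the R snippet and index it by package."""
    with open(path, encoding="utf-8") as handle:
        return index_by_package(json.load(handle))


# Example use (requires R_packages.json to exist in the working directory):
#   index = load_package_index()
#   print(index["MASS"].get("Version"))
```

Because missing values may be absent from a row, looking fields up with .get() is safer than indexing directly.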
An acceptable solution is the METACRAN API, available here: https://crandb.r-pkg.org/
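For what it is worth, here is a rough Python sketch of calling that service with only the standard library. It assumes the URL pattern <base>/<package> for the latest version and <base>/<package>/<version> for a specific one, which is how I understand the service to work and is not stated above; check the service's own documentation before relying on it. The function names crandb_url and fetch_metadata are hypothetical.

```python
import json
import urllib.parse
import urllib.request

CRANDB_BASE = "https://crandb.r-pkg.org"


def crandb_url(package, version=None):
    """Build the request URL (assumed pattern: /<package>[/<version>])."""
    path = urllib.parse.quote(package, safe="")
    if version is not None:
        path += "/" + urllib.parse.quote(version, safe="")
    return f"{CRANDB_BASE}/{path}"


def fetch_metadata(package, version=None, timeout=10):
    """Download and decode the JSON document for one package."""
    url = crandb_url(package, version)
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.load(response)


# Example use (needs network access):
#   meta = fetch_metadata("MASS")
#   print(meta.get("Version"), meta.get("Title"))
```

The response should be a JSON object whose keys mirror the DESCRIPTION fields, so no HTML parsing is involved.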
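You can download https://cloud.r-project.org/src/contrib/PACKAGES.gz (or even the uncompressed form, https://cloud.r-project.org/src/contrib/PACKAGES). It contains information about all currently available packages in DCF format, using some of the fields from the DESCRIPTION file plus a few others.
You do not need to use cloud.r-project.org; any CRAN mirror will do.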
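DCF is simple enough to parse by hand: records are separated by blank lines, each field is a "Key: value" line, and a line starting with whitespace continues the previous field. Below is a minimal Python sketch under those assumptions. It is not a full implementation of the format, and parse_dcf and fetch_packages_index are names invented for this example. The same parser should also work on a single DESCRIPTION file.

```python
import urllib.request


def parse_dcf(text):
    """Parse DCF text into a list of dicts, one per blank-line-separated record."""
    records, current, last_key = [], {}, None
    for line in text.splitlines():
        if not line.strip():
            # A blank line ends the current record
            if current:
                records.append(current)
            current, last_key = {}, None
        elif line[0] in " \t":
            # Indented line: continuation of the previous field
            if last_key is not None:
                joined = current[last_key] + " " + line.strip()
                current[last_key] = joined.strip()
        elif ":" in line:
            key, _, value = line.partition(":")
            last_key = key.strip()
            current[last_key] = value.strip()
    if current:
        records.append(current)
    return records


def fetch_packages_index(mirror="https://cloud.r-project.org", timeout=30):
    """Download the uncompressed PACKAGES file from a mirror and parse it."""
    url = f"{mirror}/src/contrib/PACKAGES"
    with urllib.request.urlopen(url, timeout=timeout) as response:
        text = response.read().decode("utf-8", errors="replace")
    return parse_dcf(text)


# Example use (needs network access):
#   packages = {rec["Package"]: rec for rec in fetch_packages_index()}
#   print(packages["MASS"]["Version"])
```

Continuation lines are joined with a single space here, which is fine for fields such as Depends or Imports but discards the original line breaks.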