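I'm trying to fetch some data from a public API and need help working out the correct query syntax for the request URL.
My script is below. (Never mind fixing or improving the function; it has worked fine so far.)
What I need is the correct query URL.
I want to get a list of clinical studies from ClinicalTrials.gov for the search term "EGFR", but narrow the search so that it returns only results with "Recruiting" or "Active, not recruiting" in the "OverallStatus" field. Here are the possible values for the "OverallStatus" field.
I'm having a hard time making sense of the API documentation. There is a page covering Search Expressions and Syntax, but it doesn't explain how to search for multiple values. How do I build a query string that searches for several possible values in one field?
I'd appreciate any insight!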
library(tidyverse)
library(httr)
library(jsonlite)
library(glue)
get_studies_df <- function(query_url){
  # get clinical studies data
  res <- httr::GET(query_url)
  if(!httr::status_code(res) == 200){
    # if request failed return empty data frame
    empty_df <- stats::setNames(data.frame(matrix(ncol = 5, nrow = 0)), c("Rank", "NCTId", "Condition", "BriefTitle", "OverallStatus"))
    return(empty_df)
  }
  # get data from response obj
  data <- httr::content(res, as="text", encoding = "UTF-8") %>%
    jsonlite::fromJSON()
  # prepare clinical studies data frame
  studies_df <- data$StudyFieldsResponse$StudyFields %>%
    # combine conditions if there is more than one
    dplyr::rowwise() %>%
    mutate(Condition = paste(Condition, collapse = ", ")) %>%
    dplyr::ungroup()
  # unlist data frame columns to show full length text
  for (i in c(1:ncol(studies_df))){
    studies_df[,i] <- unlist(studies_df[,i])
  }
  return(studies_df)
}
### here are all the query strings I tried ###
# get all studies for EGFR (WORKING, but finds 5000+ studies, way too many)
query_url <- "https://ClinicalTrials.gov/api/query/study_fields?expr=EGFR&fields=NCTId,Condition,BriefTitle,OverallStatus&fmt=json"
# get "Recruiting" studies only (WORKING)
query_url <- "https://ClinicalTrials.gov/api/query/study_fields?expr=EGFR+AREA[OverallStatus]+Recruiting&fields=NCTId,Condition,BriefTitle,OverallStatus&fmt=json"
# get "Active" studies only (WORKING)
query_url <- "https://ClinicalTrials.gov/api/query/study_fields?expr=EGFR+AREA[OverallStatus]+Active&fields=NCTId,Condition,BriefTitle,OverallStatus&fmt=json"
### I'm trying to get "Recruiting" OR "Active" studies. These are NOT WORKING ###
# returns only "Active"
query_url <- "https://ClinicalTrials.gov/api/query/study_fields?expr=EGFR+AREA[OverallStatus]+Recruiting+Active&fields=NCTId,Condition,BriefTitle,OverallStatus&fmt=json"
# returns nothing
query_url <- "https://ClinicalTrials.gov/api/query/study_fields?expr=EGFR+AREA[OverallStatus]+RANGE[Recruiting,Active]&fields=NCTId,Condition,BriefTitle,OverallStatus&fmt=json"
# returns only "Active"
query_url <- "https://ClinicalTrials.gov/api/query/study_fields?expr=EGFR+AREA[OverallStatus]+Recruiting+AREA[OverallStatus]+Active&fields=NCTId,Condition,BriefTitle,OverallStatus&fmt=json"
# returns everything ("Recruiting", "Completed", "Unknown status", "Active, not recruiting") ??
query_url <- "https://ClinicalTrials.gov/api/query/study_fields?expr=EGFR+AREA[OverallStatus]+Recruiting+OR+Active&fields=NCTId,Condition,BriefTitle,OverallStatus&fmt=json"
df <- get_studies_df(query_url)
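Output table:
I ran into the same problem when trying to work out how to query the classic API, and I'm not a fan of their documentation. That said, I found their demo page useful for building a correct URL through some trial and error on the search expression.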
https://classic.clinicaltrials.gov/api/gui/demo/simple_study_fields
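A URL like the ones the demo page generates can also be assembled in R, which avoids hand-encoding the brackets and spaces in the expression. The following is only a sketch: `build_query_url` is a hypothetical helper, not part of httr or of the API, and it assumes the classic `study_fields` endpoint takes `min_rnk` and `max_rnk` for the rank range. The expression used is the "Recruiting only" one that already works in the question.

```r
# Hypothetical helper: assemble a classic study_fields URL from the
# values that would otherwise be typed into the demo form.
build_query_url <- function(expr, fields, max_rnk = 50,
                            base_url = "https://classic.clinicaltrials.gov/api/query/study_fields") {
  # reserved = TRUE percent-encodes spaces, brackets, commas and quotes in expr
  encoded_expr <- utils::URLencode(expr, reserved = TRUE)
  paste0(
    base_url,
    "?expr=", encoded_expr,
    "&fields=", paste(fields, collapse = ","),
    "&min_rnk=1",
    "&max_rnk=", max_rnk,  # assumed parameter name for the maximum rank
    "&fmt=json"
  )
}

query_url <- build_query_url(
  expr   = "EGFR AREA[OverallStatus] Recruiting",
  fields = c("NCTId", "BriefTitle", "Condition", "OverallStatus")
)
print(query_url)
```

The resulting string can be passed straight to `get_studies_df()` from the question; only the `expr` argument needs to change while experimenting with different search expressions.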
To include both Recruiting and Active, I filled in the boxes like this:
expr = EGFR AND AREA[OverallStatus][Recruiting, Active]
fields = NCTId, BriefTitle, Condition, OverallStatus
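With max_rank = 50 and JSON format, I got this:
I didn't see any "Recruiting" studies among the 50 I pulled. I'm not sure whether my search was too short or whether those studies simply aren't openly recruiting right now. You could also use code to request up to the maximum rank and then filter on the OverallStatus value in the JSON.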
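The client-side filtering mentioned above could look roughly like this. It is a minimal sketch that assumes `OverallStatus` is a plain character column, as it is after the unlisting loop in the question's `get_studies_df()`. `filter_by_status` is a hypothetical name, and the example rows are made up for illustration rather than taken from real API output.

```r
# Hypothetical helper: keep only the rows whose OverallStatus is one of `wanted`.
filter_by_status <- function(studies_df,
                             wanted = c("Recruiting", "Active, not recruiting")) {
  keep <- studies_df$OverallStatus %in% wanted
  # drop = FALSE keeps the result a data frame even if one column remains
  studies_df[keep, , drop = FALSE]
}

# Made-up rows, for illustration only (not real API output).
example_df <- data.frame(
  NCTId = c("NCT-A", "NCT-B", "NCT-C", "NCT-D"),
  OverallStatus = c("Recruiting", "Completed",
                    "Active, not recruiting", "Unknown status"),
  stringsAsFactors = FALSE
)

filtered_df <- filter_by_status(example_df)
print(filtered_df)
```

Matching on the exact status strings sidesteps the search-expression grammar entirely, at the cost of downloading studies that are then thrown away.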