R 3.4.1 - Intelligent use of a while loop for RSiteCatalyst enqueued reports

Question (votes: 4, answers: 1)

Current workflow

I have been using the RSiteCatalyst package for a while now. For those who don't know it, it makes getting data from Adobe Analytics through the API much easier.

Until now, the workflow was as follows:

  1. Make a request, for example:
    key_metrics <- QueueOvertime(clientId, dateFrom4, dateTo,
                   metrics = c("pageviews"), date.granularity = "month",
                   max.attempts = 500, interval.seconds = 20)
  2. Wait for the response, which is saved as a data.frame (example structure):

    > View(head(key_metrics,1))
        datetime      name         year   month   day    pageviews
      1 2015-07-01    July 2015    2015   7       1      45825

  3. Do some data transformation, for example:

    key_metrics$datetime <- as.Date(key_metrics$datetime)

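The problem with this workflow is that sometimes, depending on the complexity of the request, we can wait a long time before the response finally arrives. If the R script contains 40-50 equally complex API requests, we wait 40-50 times, once per request, because each new request can only start after the previous one has returned its data. This slows down my ETL process.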

Goal

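However, most of the package's functions have an argument, enqueueOnly, which tells Adobe to start processing the request and return only a report ID as the response: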

key_metrics <- QueueOvertime(clientId, dateFrom4, dateTo,
               metrics = c("pageviews"), date.granularity = "month",
               max.attempts = 500, interval.seconds = 20,
               enqueueOnly = TRUE)

> key_metrics
[1] 1154642436 

I can get the "real" response (with the data) at any time by using the following function:

key_metrics <- GetReport(key_metrics)

In every request I add the argument enqueueOnly = TRUE, while building up a vector of report IDs and a vector of report names:

queueFromIds <- c(queueFromIds, key_metrics)
queueFromNames <- c(queueFromNames, "key_metrics")

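The most important difference with this approach is that Adobe processes all my requests at the same time, so the waiting time drops considerably.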
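The bookkeeping described above can be sketched without touching the API. In this minimal illustration, `enqueue_stub` is a hypothetical stand-in for any `Queue*` call made with `enqueueOnly = TRUE`, and the request names other than `key_metrics` are made up. The point is only that both vectors grow in step, so position i always holds the ID and the name of the same request.

```r
# Hypothetical stand-in for a Queue* call made with enqueueOnly = TRUE.
# It returns a made-up report ID instead of contacting Adobe.
enqueue_stub <- function(seed) 1154642436 + seed

queueFromIds   <- c()
queueFromNames <- c()

# Request names are illustrative only
requests <- c("key_metrics", "top_pages", "entry_pages")

for (i in seq_along(requests)) {
  report_id      <- enqueue_stub(i)
  queueFromIds   <- c(queueFromIds, report_id)
  queueFromNames <- c(queueFromNames, requests[i])
}

# Position i in both vectors refers to the same request
queue <- data.frame(name = queueFromNames, id = queueFromIds)
print(queue)
```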

Problem

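However, I run into trouble when actually fetching the data. I am trying to use a while loop that removes the report ID and the report name from the vectors above once the data has been obtained: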

while (length(queueFromNames)>0)
{
  assign(queueFromNames[1], GetReport(queueFromIds[1],
                                      max.attempts = 3,
                                      interval.seconds = 5))
  queueFromNames <- queueFromNames[-1]
  queueFromIds <- queueFromIds[-1]
}

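However, this only works when the requests are simple enough to be processed within a few seconds. When a request is too complex to be ready within 3 attempts at 5-second intervals, the loop stops with the following error: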

Error in ApiRequest(body = toJSON(request.body), func.name = "Report.Get", : ERROR: max attempts exceeded for https://api3.omniture.com/admin/1.4/rest/?method=Report.Get

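Which functions could help me make sure that all API requests are processed correctly and, ideally, that requests needing extra time (the ones that raise the error) are skipped and requested again once the loop has gone through the rest?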

r error-handling while-loop adobe-analytics
1 Answer (1 vote)

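I use a couple of functions to generate and retrieve report IDs independently. That way it doesn't matter how long the reports take to process. I usually come back for them 12 hours after the report IDs were generated. I think they expire after about 48 hours. These functions of course depend on RSiteCatalyst. Here they are: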

#' Generate report IDs to be retrieved later
#'
#' @description This function works in tandem with other functions to programmatically extract big datasets from Adobe Analytics.
#' @param suite Report suite ID.
#' @param dateBegin Start date in the following format: YYYY-MM-DD.
#' @param dateFinish End date in the following format: YYYY-MM-DD.
#' @param metrics Vector containing up to 30 required metrics IDs.
#' @param elements Vector containing element IDs.
#' @param classification Vector containing classification IDs.
#' @param valueStart Integer value pointing to row to start report with.
#' @return A data frame containing all the report IDs per day. They are required to obtain all trended reports during the specified time frame.
#' @examples
#' \dontrun{
#' ReportsIDs <- reportsGenerator(suite, dateBegin, dateFinish, metrics, elements, classification, valueStart)
#' }
#' @export
    reportsGenerator <- function(suite,
                                 dateBegin,
                                 dateFinish,
                                 metrics,
                                 elements,
                                 classification,
                                 valueStart) {

      #Convert dates to date format.
      #Deduct one from dateBegin to
      #neutralize the initial +1 in the loop.

      dateBegin <- as.Date(dateBegin, "%Y-%m-%d") - 1
      dateFinish <- as.Date(dateFinish, "%Y-%m-%d")
      #Number of days as a plain integer rather than a difftime
      timeRange <- as.integer(dateFinish - dateBegin)

      #Create data frame to store dates and report IDs
      VisitorActivityReports <-
        data.frame(matrix(NA, nrow = timeRange, ncol = 2))
      names(VisitorActivityReports) <- c("Date", "ReportID")

      #Run a loop to retrieve one ReportID for each day in the time period.
      for (i in seq_len(timeRange)) {
        dailyDate <- as.character(dateBegin + i)
        print(i) #Visibility to end user
        print(dailyDate) #Visibility to end user
        VisitorActivityReports[i, 1] <- dailyDate


        VisitorActivityReports[i, 2] <-
          RSiteCatalyst::QueueTrended(
            reportsuite.id = suite,
            date.from = dailyDate,
            date.to = dailyDate,
            metrics = metrics,
            elements = elements,
            classification = classification,
            top = 50000,
            max.attempts = 500,
            start = valueStart,
            enqueueOnly = TRUE
          )
      }
      return(VisitorActivityReports)
    }

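Assign the output of the previous function to a variable, then use that variable as the input to the following function. Also assign the result of reportsRetriever to a variable. The output will be a data frame. The function rbinds all the reports together, as long as they share the same structure. Do not try to combine reports with different structures.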

#' Retrieve all reports stored as output of reportsGenerator function and consolidate them.
#'
#' @param dataFrameReports This is the output from reportsGenerator function. It MUST contain a column titled: ReportID
#' @details It is recommended to break the input data frame in chunks of 50 rows in order to prevent memory issues if the reports are too large. Otherwise the server or local computer might run out of memory.
#' @return A data frame containing all the consolidated reports defined by the reportsGenerator function.
#' @examples
#' \dontrun{
#' visitorActivity <- reportsRetriever(dataFrameReports)
#'}
#'
#' @export    

reportsRetriever <- function(dataFrameReports) {

  #Fetch each report. A report that fails (e.g. not ready yet) becomes NULL
  #instead of aborting the whole run; rbind() ignores NULL entries.
  visitor.activity.list <- lapply(dataFrameReports$ReportID, function(id) {
    tryCatch(RSiteCatalyst::GetReport(id), error = function(e) NULL)
  })
  visitor.activity.df <- as.data.frame(do.call(rbind, visitor.activity.list))

  #Validate report integrity
  retrievedDates <- as.character(unique(visitor.activity.df$datetime))

  if (identical(retrievedDates, as.character(dataFrameReports$Date))) {
    print("Ok. All reports available")
  } else {
    print("Some reports may have been missed.")
    #Dates that were requested but are absent from the retrieved data
    print(setdiff(as.character(dataFrameReports$Date), retrievedDates))
  }

  return(visitor.activity.df)
}
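
For the while loop in the question, one option is to wrap each GetReport call in tryCatch and send any report that is not ready yet to the back of the queue, so the loop keeps going and comes back to it later. The following is a minimal sketch under stated assumptions, not a tested RSiteCatalyst recipe: `retrieveQueued`, `max.rounds` and `wait.seconds` are hypothetical names introduced here, and the `fetch` argument exists only so the logic can be tried with a stand-in function instead of the real API.

```r
# Sketch: retrieve enqueued reports, re-queueing the ones that fail.
# Assumes report.names are unique. `fetch` defaults to GetReport with the
# same arguments used in the question; it is only evaluated when called.
retrieveQueued <- function(ids, report.names,
                           fetch = function(id) RSiteCatalyst::GetReport(
                             id, max.attempts = 3, interval.seconds = 5),
                           max.rounds = 20, wait.seconds = 30) {
  results  <- list()
  failures <- setNames(integer(length(ids)), report.names)

  while (length(ids) > 0) {
    id   <- ids[1]
    name <- report.names[1]

    # An error (e.g. "max attempts exceeded") becomes NULL instead of stopping
    report <- tryCatch(fetch(id), error = function(e) NULL)

    # Drop the head of the queue in either case
    ids          <- ids[-1]
    report.names <- report.names[-1]

    if (!is.null(report)) {
      results[[name]] <- report
    } else {
      failures[[name]] <- failures[[name]] + 1L
      if (failures[[name]] < max.rounds) {
        # Not ready yet: put it at the end and try again later
        ids          <- c(ids, id)
        report.names <- c(report.names, name)
        Sys.sleep(wait.seconds)
      } else {
        warning("Giving up on report ", name, " (id ", id, ")")
      }
    }
  }
  results
}
```

Called as `reports <- retrieveQueued(queueFromIds, queueFromNames)`, this returns a named list. If separate variables are preferred, as with `assign()` in the question, `list2env(reports, envir = .GlobalEnv)` creates one object per report name.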