Continued - GLM fit (logistic regression) to SQL

Question (votes: 1, answers: 2)

As originally asked by Tomas Greif: GLM fit (logistic regression) to SQL

Original Question:

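We often score data directly in the database for simple models such as linear or logistic regression. Transferring all the coefficients from R to SQL correctly is always a bit tricky. I thought I could do some R-to-SQL translation for glm results. For numeric variables, this is pretty straightforward: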

library(rpart)

fit <- glm(Kyphosis ~ ., data = kyphosis, family = binomial())

coefs <- fit$coef[2:length(fit$coef)]
expr <- paste0('1/(1 + exp(-(',fit$coef[1], '+', paste0('(', 
           coefs, '*', names(coefs), ')', collapse = '+'),')))')

print(expr)

a <- with(kyphosis, eval(parse(text = expr)))
b <- predict(fit, kyphosis, type = 'response')
names(b) <- NULL
all.equal(a, b)

The resulting expr is:

1/(1 + exp(-(-2.03693352129613+(0.0109304821420485*Age)+(0.410601186932733*Number)+(-0.206510049753697*Start))))

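Is there any way to make this work for factor variables? I would like factors to be included as CASE ... WHEN ... THEN ... END clauses. Let's assume we have the following model: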

kyphosis$factor_variable <- rep(LETTERS[1:5],20)[1:81]
fit <- glm(Kyphosis ~ ., data = kyphosis, family = binomial())

I was going through the structure of fit but could not see anything helpful. Is parsing names(fit$coef) the only option?
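One place worth looking is the `xlevels` component of the fitted object, which stores the levels of each factor predictor. The sketch below assumes R's default treatment contrasts, under which each dummy coefficient is named `<variable><level>` and the first level is the baseline; `dummy_names` is just an illustrative variable name.

```r
library(rpart)  # provides the kyphosis data set

kyphosis$factor_variable <- rep(LETTERS[1:5], 20)[1:81]
fit <- glm(Kyphosis ~ ., data = kyphosis, family = binomial())

# Levels of every factor (or character) predictor, keyed by variable name
fit$xlevels

# Rebuild the expected dummy coefficient names from variable + level,
# skipping the first (baseline) level of each factor
dummy_names <- unlist(lapply(names(fit$xlevels), function(v) {
  paste0(v, fit$xlevels[[v]][-1])
}))
dummy_names

# Every reconstructed name should appear among the coefficient names
all(dummy_names %in% names(coef(fit)))
```

Because the variable and the level are known separately here, no string parsing of the coefficient names is needed.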

Here is a reference to the best answer so far: https://stackoverflow.com/a/33659431/6497137

Potential Solution

glm_to_sql <- function(glmmodel) {
  xlev <- data.frame(unlist(glmmodel$xlevels))
  xlev$xlevrowname <- rownames(xlev)
  rownames(xlev) <- NULL
  colnames(xlev)[1] <- "xlevel"
  if (nrow(xlev)==0){xlev <- data.frame(xlevrowname=character(0), xlevel=character(0), stringsAsFactors=F)}

  modcoeffs <- data.frame(unlist(glmmodel$coefficients))
  modcoeffs$coeffname <- rownames(modcoeffs)
  rownames(modcoeffs) <- NULL
  colnames(modcoeffs)[1] <- "coeffvalue"

  coeffmatrix <- sqldf("select a.*,b.*,'' as sqlstr, 
                       substr(coeffname,1,instr(coeffname, xlevel)-1) as varname 
                       from modcoeffs a left join xlev b on coeffname like '%' || xlevel and xlevrowname like substr(coeffname,1,instr(coeffname, xlevel)-1) || '%'")

  for (i in 1:nrow(coeffmatrix)) {
    if(coeffmatrix$coeffname[i] == "(Intercept)") 
    {
      coeffmatrix$sqlstr[i] <- coeffmatrix$coeffvalue[i]
    } else if (is.na(coeffmatrix$xlevel[i]) ) {    
      coeffmatrix$sqlstr[i] <- paste("(",coeffmatrix$coeffvalue[i],"*",coeffmatrix$coeffname[i],")")
    } else {
      coeffmatrix$sqlstr[i] <- paste("(case when ",coeffmatrix$varname[i],"='",coeffmatrix$xlevel[i], "' THEN ",coeffmatrix$coeffvalue[i]," ELSE 0 END)",sep="")
    }

    if (i==1){x.sql0 <- coeffmatrix$sqlstr[i]} else {x.sql0 <- paste(x.sql0,"+",coeffmatrix$sqlstr[i])}
  }

  if (glmmodel$family$link == "logit") {
    x.sql <- paste("1/(1 + exp(-(",x.sql0,")))")  
  } else if (glmmodel$family$link == "identity") {
    x.sql <- x.sql0
  }

  return(x.sql)
}

Problem

The sqldf join is not perfect, even with an additional filter such as:

where varname is null or length(varname) >0 ## additional filter  

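This does not get rid of all the corner cases. If one variable name ends in "n" (e.g. person) and another variable (e.g. survivor) is y/n, the join strips the "n" from person and pairs it with all the other y/n variables.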
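To see why such names collide, note that sqldf runs on SQLite by default, and SQLite's LIKE is case-insensitive for ASCII letters. The sketch below mimics the first join condition (`coeffname like '%' || xlevel`) in base R; `ends_with_level` is a hypothetical helper for illustration, not part of sqldf.

```r
# Case-insensitive suffix test, standing in for: coeffname LIKE '%' || xlevel
ends_with_level <- function(coeffname, xlevel) {
  endsWith(tolower(coeffname), tolower(xlevel))
}

coeffnames <- c("(Intercept)", "cabin", "Pre_registerY")
levels_yn  <- c("N", "Y")

# Rows are coefficient names, columns are factor levels
m <- outer(coeffnames, levels_yn, ends_with_level)
dimnames(m) <- list(coeffnames, levels_yn)
m
```

The numeric variable `cabin` passes the suffix test for level "N", so a join built on suffix matching alone cannot tell it apart from a genuine dummy coefficient.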

Does anyone have a potential workaround for this?

EDIT: Example

library(sqldf)
ID <- seq(1,50,  1)

cabin <- as.numeric(as.character((seq(1,25.5,  .5))))

str(cabin)

Defect <-     c(1,0,1,0,0,1,0,1,0,1,0,1,0,0,0,0,1,0,0,1,0,1,0,1,0,1,1,0,0,0,0,0,0,1,0,1,0,1,1,0,0,0,1,0,1,0,0,0,0,0)

Pre_register <- c("Y", "N", "Y", "N", "N", "Y", "N", "N", "Y", "N", "N", "Y", "N", "N", "Y",
             "N", "Y", "N", "N", "Y", "N", "N", "Y", "N", "N", "Y", "N", "N", "Y",
             "Y", "N", "N", "Y", "N", "N", "Y", "N", "N", "Y", "N", "N", "Y", 
             "Y", "N", "N", "Y", "N", "N", "Y", "N")

length(Pre_register)
length(cabin)
length(ID)

x <- data.frame(cbind(ID, cabin, Pre_register, Defect))

x$cabin <- as.numeric(as.character(x$cabin))

str(x)

glm_ex <- glm(Defect ~ cabin + Pre_register ,
           data=x,
           family=binomial(link="logit"))

summary(glm_ex)

Here is the output:

> glm_to_sql(glm_ex)

[1] "1/(1 + exp(-( 0.97216 + (case when FLT_REV_Jan_Sep_2015='Y' THEN round(-1.95327,3) ELSE 0 END) + (case when ='N' THEN round(-1.93112,3) ELSE 0 END) )))"

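Notice the case statement in which a blank is compared to 'N'. That is wrong, and it is a problem with the glm_to_sql logic.

The join mixes up cabin, which ends in "n", with the Y/N levels. Here is a much smaller example.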

EDIT2:

Walking through glm_to_sql:

xlev <- data.frame(unlist(glm_ex$xlevels))

xlev$xlevrowname <- rownames(xlev)

rownames(xlev) <- NULL

colnames(xlev)[1] <- "xlevel"

if (nrow(xlev)==0){xlev <- data.frame(xlevrowname=character(0), xlevel=character(0), stringsAsFactors=F)}

xlev

modcoeffs <- data.frame(unlist(glm_ex$coefficients))

modcoeffs$coeffname <- rownames(modcoeffs)

rownames(modcoeffs) <- NULL

colnames(modcoeffs)[1] <- "coeffvalue"

modcoeffs

This is where the problem lies:

coeffmatrix <- sqldf("select a.*,b.*,'' as sqlstr, 
                   substr(coeffname,1,instr(coeffname, xlevel)-1) as varname 
                 from modcoeffs a left join xlev b on coeffname like '%' || xlevel and xlevrowname like substr(coeffname,1,instr(coeffname, xlevel)-1) || '%'")

Output:

   coeffvalue     coeffname xlevel   xlevrowname sqlstr      varname
1 -0.51243845   (Intercept)   <NA>          <NA>                <NA>
2 -0.04240967         cabin      N Pre_register1                    
3  1.17625756 Pre_registerY      Y Pre_register2        Pre_register

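Row 2 of the output is the problem: cabin is joined to the N level of Pre_register, because the name cabin ends in the letter n and that trailing n is treated as a level.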
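One possible way around this is to drop the LIKE join and instead look up each coefficient name against the exact `<variable><level>` names built from `xlevels`. The following is a minimal sketch, not a drop-in replacement: `glm_to_sql2` is a hypothetical name, the data set is synthetic, and the code assumes default treatment contrasts, no interaction terms, and column names that are valid SQL identifiers.

```r
glm_to_sql2 <- function(glmmodel) {
  coefs <- coef(glmmodel)
  coefs <- coefs[!is.na(coefs)]   # drop aliased (NA) coefficients
  xlev  <- glmmodel$xlevels

  # One row per possible dummy name, e.g. "Pre_registerY" -> Pre_register / Y
  lookup <- do.call(rbind, lapply(names(xlev), function(v) {
    data.frame(coeffname = paste0(v, xlev[[v]]), varname = v,
               xlevel = xlev[[v]], stringsAsFactors = FALSE)
  }))

  one_term <- function(nm) {
    value <- format(coefs[[nm]], digits = 15)
    if (nm == "(Intercept)") return(value)
    hit <- which(lookup$coeffname == nm)   # exact match only
    if (length(hit) > 1) stop("ambiguous coefficient name: ", nm)
    if (length(hit) == 1) {
      level <- gsub("'", "''", lookup$xlevel[hit])  # escape single quotes
      paste0("(CASE WHEN ", lookup$varname[hit], " = '", level,
             "' THEN ", value, " ELSE 0 END)")
    } else {
      paste0("(", value, " * ", nm, ")")   # plain numeric predictor
    }
  }

  linear <- paste(vapply(names(coefs), one_term, character(1)),
                  collapse = " + ")
  switch(glmmodel$family$link,
         logit    = paste0("1/(1 + EXP(-(", linear, ")))"),
         identity = linear,
         stop("unsupported link: ", glmmodel$family$link))
}

# Synthetic data with the same column names as the example (values made up)
x_demo <- data.frame(cabin = seq(1, 25.5, 0.5),
                     Pre_register = rep(c("Y", "N", "N"), length.out = 50),
                     Defect = rep(c(1, 0, 1, 0, 0), 10))
fit_demo <- glm(Defect ~ cabin + Pre_register, data = x_demo,
                family = binomial(link = "logit"))
sql_demo <- glm_to_sql2(fit_demo)
sql_demo
```

With exact matching, `cabin` is treated as a numeric term and only `Pre_registerY` becomes a CASE expression. Interaction terms and non-default contrasts would need extra handling that this sketch does not attempt.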

sql r teradata glm sqldf
2 Answers

2 votes

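Since you mention that you are using Teradata, there is a simple way to do this, although it may not be available to you: just run the R scoring code directly on the server.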

# fit the logistic regression model (or any other model)
modLR <- glm(Kyphosis ~ Age + Number + Start, data=kyphosis,
             family=binomial)

connStr <- "insert_ODBC_connection_string_here"

# input and output tables
inTbl <- RxTeradata("input_table_name", connectionString=connStr)
outTbl <- RxTeradata("output_table_name", connectionString=connStr)

# set the compute context to in-DB
ccTD <- RxInTeradata(connectionString=connStr)
rxSetComputeContext(ccTD)

# do the scoring
rxDataStep(inTbl, outTbl,
           transforms=list(
               pred=predict(.modLR, data.frame(Age, Number, Start))
           ),
           transformObjects=list(.modLR=modLR),
           transformPackages="stats")  # or rpart, randomForest, gbm, etc

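This fits the model on your local desktop/laptop and then sends it to the R processes running on the server. Scoring takes place entirely on the server, and no data moves back to the desktop.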

If the model involves factors, this can be handled (relatively) easily by constructing the factor as part of the predict call:

rxDataStep(inTbl, outTbl,
           transforms=list(
               pred=predict(.modLR,
                   data.frame(Age, Number=factor(Number, levels=2:10), Start))
           ),
           transformObjects=list(.modLR=modLR))

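Setting things up so that metadata such as factor levels is handled correctly is a bit tedious; I have omitted the details, but hopefully you can see how it is done.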
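As a rough illustration of the factor-level bookkeeping, here is a sketch in plain R (without RevoScaleR) that rebuilds a factor in the scoring data from the levels stored in the fitted model. Treating Number as a factor and the names `train` and `newdata` are assumptions made only for this example.

```r
library(rpart)  # provides the kyphosis data set

# Illustrative model in which Number is a factor predictor
train <- transform(kyphosis, Number = factor(Number))
modLR <- glm(Kyphosis ~ Age + Number + Start, data = train, family = binomial)

# Scoring data arrives with Number as a plain integer column
newdata <- kyphosis[1:5, c("Age", "Number", "Start")]

# Rebuild the factor with exactly the levels recorded at fit time
newdata$Number <- factor(newdata$Number, levels = modLR$xlevels$Number)

pred <- predict(modLR, newdata, type = "response")
pred
```

Taking the levels from `modLR$xlevels` rather than typing them by hand keeps the scoring code in step with the model if it is refitted on different data.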

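This requires Revolution/Microsoft R Server to be installed on the Teradata box. Since you are asking this question, I suspect you do not have MRS installed (or you would already be using it). Still, I am putting it here because it may help other Teradata users who come across this question.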

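The same solution also works with Microsoft SQL Server. We supported Teradata when Revo was an independent company, and that support has not gone away since the acquisition.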

Disclosure: I work for Microsoft.


0 votes

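Hi, another option is to build the GLM in R and use the glm.deploy package to generate equivalent source code: https://cran.r-project.org/web/packages/glm.deploy/index.html. You can generate the GLM code in C or Java, which makes it easier to translate into SQL or to build a user-defined function for a specific DBMS.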
