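As asked by Tomas Greif: GLM fit (logistic regression) to SQL
We often score data directly in the database for simple models such as linear or logistic regression. Transferring all the coefficients from R to SQL correctly is always a bit tricky, so I thought I could write a small R-to-SQL translation for glm results. For numeric variables this is quite simple: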
library(rpart)
fit <- glm(Kyphosis ~ ., data = kyphosis, family = binomial())
coefs <- fit$coef[2:length(fit$coef)]
expr <- paste0('1/(1 + exp(-(',fit$coef[1], '+', paste0('(',
coefs, '*', names(coefs), ')', collapse = '+'),')))')
print(expr)
a <- with(kyphosis, eval(parse(text = expr)))
b <- predict(fit, kyphosis, type = 'response')
names(b) <- NULL
all.equal(a, b)
The resulting expr is:
1/(1 + exp(-(-2.03693352129613+ (0.0109304821420485*Age)+ (0.410601186932733*Number)+(-0.206510049753697*Start)))).
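To score in the database, the expression still has to be wrapped in a query. The following is a minimal sketch of that step, assuming the target table has columns named Age, Number and Start and that the SQL dialect provides exp(). The table name kyphosis_tbl and the column alias score are made up for illustration.

```r
library(rpart)  # supplies the kyphosis data

fit <- glm(Kyphosis ~ Age + Number + Start, data = kyphosis, family = binomial())

# Linear predictor: one "(coefficient * column)" term per numeric predictor
coefs <- coef(fit)[-1]
lp <- paste0(sprintf("(%.15g * %s)", coefs, names(coefs)), collapse = " + ")

# Inverse logit around intercept + linear predictor
expr <- sprintf("1/(1 + exp(-(%.15g + %s)))", coef(fit)[1], lp)

# Hypothetical table name and alias; adjust to the real schema
sql <- sprintf("SELECT t.*, %s AS score FROM %s t", expr, "kyphosis_tbl")
cat(sql, "\n")
```

Formatting the coefficients with "%.15g" keeps them as plain numeric literals with enough digits to reproduce predict() to within floating-point tolerance.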
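Is there a way to make this work with factor variables? I would like to include factors as CASE ... WHEN ... THEN ... END clauses. Suppose we have the following model: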
kyphosis$factor_variable <- rep(LETTERS[1:5],20)[1:81]
fit <- glm(Kyphosis ~ ., data = kyphosis, family = binomial())
I went through the structure of fit but could not see anything useful. Is parsing names(fit$coef) the only option?
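For what it is worth, a fitted glm object does carry some of this metadata, so string parsing is not the only route. The sketch below only inspects standard components (xlevels, the terms object, and the "assign" attribute of the model matrix); the data frame name coef_map is made up for illustration.

```r
library(rpart)  # supplies the kyphosis data

kyphosis$factor_variable <- rep(LETTERS[1:5], 20)[1:81]
fit <- glm(Kyphosis ~ ., data = kyphosis, family = binomial())

# Factor (or character) predictors and their levels
fit$xlevels

# Class of every variable in the model frame
attr(terms(fit), "dataClasses")

# Which model term each coefficient belongs to (0 = intercept)
mm          <- model.matrix(fit)
assign_idx  <- attr(mm, "assign")
term_labels <- attr(terms(fit), "term.labels")
coef_map <- data.frame(
  coefficient = colnames(mm),
  term        = c("(Intercept)", term_labels)[assign_idx + 1]
)
print(coef_map)
```

With the default treatment contrasts, each factor coefficient is named as the variable name followed by the level, and the first level is the reference and has no coefficient.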
For reference, this is the best answer so far: https://stackoverflow.com/a/33659431/6497137
library(sqldf)  # provides sqldf(), used for the join below
glm_to_sql <- function(glmmodel) {
xlev <- data.frame(unlist(glmmodel$xlevels))
xlev$xlevrowname <- rownames(xlev)
rownames(xlev) <- NULL
colnames(xlev)[1] <- "xlevel"
if (nrow(xlev)==0){xlev <- data.frame(xlevrowname=character(0), xlevel=character(0), stringsAsFactors=F)}
modcoeffs <- data.frame(unlist(glmmodel$coefficients))
modcoeffs$coeffname <- rownames(modcoeffs)
rownames(modcoeffs) <- NULL
colnames(modcoeffs)[1] <- "coeffvalue"
coeffmatrix <- sqldf("select a.*,b.*,'' as sqlstr,
substr(coeffname,1,instr(coeffname, xlevel)-1) as varname
from modcoeffs a left join xlev b on coeffname like '%' || xlevel and xlevrowname like substr(coeffname,1,instr(coeffname, xlevel)-1) || '%'")
for (i in 1:nrow(coeffmatrix)) {
if(coeffmatrix$coeffname[i] == "(Intercept)")
{
coeffmatrix$sqlstr[i] <- coeffmatrix$coeffvalue[i]
} else if (is.na(coeffmatrix$xlevel[i]) ) {
coeffmatrix$sqlstr[i] <- paste("(",coeffmatrix$coeffvalue[i],"*",coeffmatrix$coeffname[i],")")
} else {
coeffmatrix$sqlstr[i] <- paste("(case when ",coeffmatrix$varname[i],"='",coeffmatrix$xlevel[i], "' THEN ",coeffmatrix$coeffvalue[i]," ELSE 0 END)",sep="")
}
if (i==1){x.sql0 <- coeffmatrix$sqlstr[i]} else {x.sql0 <- paste(x.sql0,"+",coeffmatrix$sqlstr[i])}
}
if (glmmodel$family$link == "logit") {
x.sql <- paste("1/(1 + exp(-(",x.sql0,")))")
} else if (glmmodel$family$link == "identity") {
x.sql <- x.sql0
}
return(x.sql)
}
The sqldf join is not perfect:
where varname is null or length(varname) >0 ## additional filter
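This does not get rid of all the corner cases. If a variable name ends in "n" (e.g. person) and another variable (e.g. survivor) is y/n, the join strips the "n" from person and pairs it with every y/n variable.
Does anyone have a potential workaround?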
library(sqldf)
ID <- seq(1,50, 1)
cabin <- as.numeric(as.character((seq(1,25.5, .5))))
str(cabin)
Defect <- c(1,0,1,0,0,1,0,1,0,1,0,1,0,0,0,0,1,0,0,1,0,1,0,1,0,1,1,0,0,0,0,0,0,1,0,1,0,1,1,0,0,0,1,0,1,0,0,0,0,0)
Pre_register <- c("Y", "N", "Y", "N", "N", "Y", "N", "N", "Y", "N", "N", "Y", "N", "N", "Y",
"N", "Y", "N", "N", "Y", "N", "N", "Y", "N", "N", "Y", "N", "N", "Y",
"Y", "N", "N", "Y", "N", "N", "Y", "N", "N", "Y", "N", "N", "Y",
"Y", "N", "N", "Y", "N", "N", "Y", "N")
length(Pre_register)
length(cabin)
length(ID)
# build the data frame directly; cbind() would coerce every column to character
x <- data.frame(ID, cabin, Pre_register = factor(Pre_register), Defect)
str(x)
glm_ex <- glm(Defect ~ cabin + Pre_register ,
data=x,
family=binomial(link="logit"))
summary(glm_ex)
Here is the output:
> glm_to_sql(glm_ex)
[1] "1/(1 + exp(-( 0.97216 + (case when FLT_REV_Jan_Sep_2015='Y' THEN round(-1.95327,3) ELSE 0 END) + (case when ='N' THEN round(-1.93112,3) ELSE 0 END) )))"
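Note the case statement in which a blank is compared to 'N'. That is wrong, and it is a problem with the glm_to_sql logic.
cabin, which ends in "n", gets mixed up with the Y/N levels in the join. This is a much smaller example.
Walking through glm_to_sql: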
xlev <- data.frame(unlist(glm_ex$xlevels))
xlev$xlevrowname <- rownames(xlev)
rownames(xlev) <- NULL
colnames(xlev)[1] <- "xlevel"
if (nrow(xlev)==0){xlev <- data.frame(xlevrowname=character(0), xlevel=character(0), stringsAsFactors=F)}
xlev
modcoeffs <- data.frame(unlist(glm_ex$coefficients))
modcoeffs$coeffname <- rownames(modcoeffs)
rownames(modcoeffs) <- NULL
colnames(modcoeffs)[1] <- "coeffvalue"
modcoeffs
This is where the problem is:
coeffmatrix <- sqldf("select a.*,b.*,'' as sqlstr,
substr(coeffname,1,instr(coeffname, xlevel)-1) as varname
from modcoeffs a left join xlev b on coeffname like '%' || xlevel and xlevrowname like substr(coeffname,1,instr(coeffname, xlevel)-1) || '%'")
Output:
   coeffvalue     coeffname xlevel   xlevrowname sqlstr      varname
1 -0.51243845   (Intercept)   <NA>          <NA>                <NA>
2 -0.04240967         cabin      N Pre_register1
3  1.17625756 Pre_registerY      Y Pre_register2        Pre_register
Row 2 of the output is the problem: cabin is joined to the Y/N levels of Pre_register, and the trailing "n" in cabin is treated as a level.
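One possible way around the fuzzy join, offered as a sketch rather than something taken from the answers here: SQLite's LIKE is case-insensitive for ASCII by default, so 'cabin' matches '%N' even though instr() finds no "N" in it. The code below avoids pattern matching altogether. It builds every candidate coefficient name as variable plus level from the model's xlevels and matches it exactly. The function name glm_to_sql_exact is made up for illustration, and the sketch assumes a main-effects formula with the default treatment contrasts (no interactions, no in-formula transformations such as log(x)).

```r
library(rpart)  # only for the kyphosis example data

# Hypothetical helper; main effects and treatment contrasts only
glm_to_sql_exact <- function(fit) {
  coefs <- coef(fit)
  coefs <- coefs[!is.na(coefs)]          # drop aliased (NA) coefficients
  num   <- sprintf("%.15g", coefs)       # plain numeric literals for SQL
  names(num) <- names(coefs)

  # Exact lookup table: "<variable><level>" -> variable and level
  xlev <- fit$xlevels
  lookup <- do.call(rbind, lapply(names(xlev), function(v) {
    data.frame(coefname = paste0(v, xlev[[v]]), variable = v,
               level = xlev[[v]], stringsAsFactors = FALSE)
  }))

  pieces <- vapply(names(num), function(nm) {
    if (nm == "(Intercept)") return(num[[nm]])
    hit <- match(nm, lookup$coefname)    # exact, case-sensitive match
    if (is.na(hit)) {
      sprintf("(%s * %s)", num[[nm]], nm)  # numeric predictor
    } else {
      sprintf("(CASE WHEN %s = '%s' THEN %s ELSE 0 END)",
              lookup$variable[hit],
              gsub("'", "''", lookup$level[hit], fixed = TRUE),  # escape quotes
              num[[nm]])
    }
  }, character(1))

  lp <- paste(pieces, collapse = " + ")
  switch(fit$family$link,
         logit    = paste0("1/(1 + exp(-(", lp, ")))"),
         identity = lp,
         stop("link not handled in this sketch: ", fit$family$link))
}

# Usage with the model from the question
kyphosis$factor_variable <- rep(LETTERS[1:5], 20)[1:81]
fit <- glm(Kyphosis ~ ., data = kyphosis, family = binomial())
cat(glm_to_sql_exact(fit), "\n")
```

Because the match is on the full coefficient name, a numeric column such as cabin can never be paired with a level of another variable, and the reference level gets no CASE term since it has no coefficient.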
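Since you mention that you are using Teradata, there is a simple way to do this, although it may not be available to you: just run the R scoring code directly on the server.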
# fit the logistic regression model (or any other model)
modLR <- glm(Kyphosis ~ Age + Number + Start, data=kyphosis,
family=binomial)
connStr <- "insert_ODBC_connection_string_here"
# input and output tables
inTbl <- RxTeradata("input_table_name", connectionString=connStr)
outTbl <- RxTeradata("output_table_name", connectionString=connStr)
# set the compute context to in-DB
ccTD <- RxInTeradata(connectionString=connStr)
rxSetComputeContext(ccTD)
# do the scoring
rxDataStep(inTbl, outTbl,
transforms=list(
pred=predict(.modLR, data.frame(Age, Number, Start))
),
transformObjects=list(.modLR=modLR),
transformPackages="stats") # or rpart, randomForest, gbm, etc
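This takes a model fitted on your local desktop or laptop and sends it to an R process running on the server. Scoring takes place entirely on the server, and no data is moved back to the desktop.
If the model involves factors, this can be handled (relatively) easily by constructing the factor as part of the predict call: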
rxDataStep(inTbl, outTbl,
transforms=list(
pred=predict(.modLR,
data.frame(Age, Number=factor(Number, levels=2:10), Start))
),
transformObjects=list(.modLR=modLR))
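Setting things up so that metadata such as factor levels is handled properly is a bit tedious; I have omitted the details, but hopefully you can see how it is done.
This requires Revolution/Microsoft R Server to be installed on the Teradata box. Since you are asking this question, I suspect MRS is not installed (or you would already be using it). Nevertheless, I am putting it here because it may help other Teradata users who come across this question.
The same solution also works for Microsoft SQL Server. We supported Teradata when Revo was an independent company, and that support is not going away after the acquisition.
Disclosure: I work for Microsoft.
Hi, another option is to build the GLM in R and use the glm.deploy package to generate equivalent source code (https://cran.r-project.org/web/packages/glm.deploy/index.html). You can generate the GLM code in C or Java, which is easier to translate into SQL or to wrap in a user-defined function for a specific DBMS.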