contrasts can be applied only to factors with 2 or more levels


I want to predict sales with linear regression. This is the data table I use for modelling.

> store
     Store Sales CompetitionDistance CompetitionOpenSinceMonth CompetitionOpenSinceYear Promo2 Promo2SinceWeek Promo2SinceYear Assortment_a
  1:     3  8314               14130                        12                     2006      1              14            2011            1
  2:     3  8977               14130                        12                     2006      1              14            2011            1
  3:     3  7610               14130                        12                     2006      1              14            2011            1
  4:     3  8864               14130                        12                     2006      1              14            2011            1
  5:     3  8107               14130                        12                     2006      1              14            2011            1
 ---                                                                                                                                       
775:     3 12247               14130                        12                     2006      1              14            2011            1
776:     3  4523               14130                        12                     2006      1              14            2011            1
777:     3  6069               14130                        12                     2006      1              14            2011            1
778:     3  5902               14130                        12                     2006      1              14            2011            1
779:     3  6823               14130                        12                     2006      1              14            2011            1
     Assortment_b Assortment_c StoreType_a StoreType_b StoreType_c StoreType_d DayOfWeek Open Promo SchoolHoliday DateYear DateMonth
  1:            0            0           1           0           0           0         5    1     1             1     2015         7
  2:            0            0           1           0           0           0         4    1     1             1     2015         7
  3:            0            0           1           0           0           0         3    1     1             1     2015         7
  4:            0            0           1           0           0           0         2    1     1             1     2015         7
  5:            0            0           1           0           0           0         1    1     1             1     2015         7
 ---                                                                                                                                
775:            0            0           1           0           0           0         1    1     1             0     2013         1
776:            0            0           1           0           0           0         6    1     0             0     2013         1
777:            0            0           1           0           0           0         5    1     0             1     2013         1
778:            0            0           1           0           0           0         4    1     0             1     2013         1
779:            0            0           1           0           0           0         3    1     0             1     2013         1
     DateDay DateWeek StateHoliday_0 StateHoliday_a StateHoliday_b StateHoliday_c CompetitionOpen PromoOpen IspromoinSales Prediction
  1:      31       30              1              0              0              0             103     52.00              1          0
  2:      30       30              1              0              0              0             103     52.00              1          0
  3:      29       30              1              0              0              0             103     52.00              1          0
  4:      28       30              1              0              0              0             103     52.00              1          0
  5:      27       30              1              0              0              0             103     52.00              1          0
 ---                                                                                                                                 
775:       7        1              1              0              0              0              73     20.75              1          0
776:       5        0              1              0              0              0              73     20.50              1          0
777:       4        0              1              0              0              0              73     20.50              1          0
778:       3        0              1              0              0              0              73     20.50              1          0
779:       2        0              1              0              0              0              73     20.50              1          0
> 

I am asking because I get the error:

contrasts can be applied only to factors with 2 or more levels

I applied what @Scott said here, since I don't have any NA values.

I need to know which columns should be converted to factor variables in the model.

  > lapply(store, function(x) ifelse(is.factor(x) | is.integer(x), levels(factor(x)), "numeric"))
$Store
[1] "3"

$Sales
[1] "numeric"

$CompetitionDistance
[1] "14130"

$CompetitionOpenSinceMonth
[1] "12"

$CompetitionOpenSinceYear
[1] "2006"

$Promo2
[1] "1"

$Promo2SinceWeek
[1] "14"

$Promo2SinceYear
[1] "2011"

$Assortment_a
[1] "1"

$Assortment_b
[1] "0"

$Assortment_c
[1] "0"

$StoreType_a
[1] "1"

$StoreType_b
[1] "0"

$StoreType_c
[1] "0"

$StoreType_d
[1] "0"

$DayOfWeek
[1] "1"

$Open
[1] "1"

$Promo
[1] "0"

$SchoolHoliday
[1] "0"

$DateYear
[1] "numeric"

$DateMonth
[1] "numeric"

$DateDay
[1] "numeric"

$DateWeek
[1] "numeric"

$StateHoliday_0
[1] "1"

$StateHoliday_a
[1] "0"

$StateHoliday_b
[1] "0"

$StateHoliday_c
[1] "0"

$CompetitionOpen
[1] "numeric"

$PromoOpen
[1] "numeric"

$IspromoinSales
[1] "numeric"

$Prediction
[1] "numeric"

My model currently looks like this. Please look at the lm call: how should I write it?

M<-matrix(0,nrow=10,ncol = 1)
store <- data[Store == 3,]  # select one store by its unique number
shuffledIndices <- sample(nrow(store))  # shuffle the row order
setDT(store)[,Prediction:=0]
z <- nrow(store)
for (i in 1:10) 
{    # 10-fold cross-validation
  sampleIndex <- floor(1+0.1*(i-1)*z):(0.1*i*z)  # 10% of the rows are selected
  test <- store[shuffledIndices[sampleIndex],]  # they serve as the test set
  train <- store[shuffledIndices[-sampleIndex],]  # the rest serve as the training set
  modell <- lm(Sales ~ as.factor(CompetitionDistance) + as.factor(CompetitionOpenSinceMonth) + as.factor(CompetitionOpenSinceYear) + 
                 as.factor(Promo2)+as.factor(Promo2SinceWeek)+as.factor(Promo2SinceYear)+as.factor(Assortment_a)+as.factor(Assortment_b)+as.factor(Assortment_c)+
                 as.factor(StoreType_a)+as.factor(StoreType_b)+as.factor(StoreType_c)+as.factor(StoreType_d)+as.factor(DayOfWeek)+as.factor(Open)+SchoolHoliday+
                 as.factor(Promo)+as.factor(StateHoliday_0)+as.factor(StateHoliday_a)+as.factor(StateHoliday_b)+as.factor(StateHoliday_c)+
                 as.factor(DateYear)+as.factor(DateMonth)+as.factor(DateDay)+as.factor(DateWeek)+as.factor(CompetitionOpen)+as.factor(PromoOpen)+as.factor(IspromoinSales),train)  # a linear model is fitted to the training set
  store[shuffledIndices[sampleIndex],Prediction:=predict(modell,test)] # predictions are generated for the test set based on the model
  M[i,1]<-(round(sqrt(mean((store[shuffledIndices[sampleIndex],Prediction]-test$Sales)^2))/mean(test$Sales),4))  # relative RMSE, using only the predictions for this test fold
}

plot(1:10,M[,1],type='b',xlab="i",ylab="rmse%")

But I always get the error. It's really strange. How do you explain this? Thanks in advance.

r
1 Answer

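The problem is that the model contains constant variables. Wrapping a constant column in as.factor() yields a factor with a single level, and lm cannot build contrasts for such a factor, hence the error. These variables add no information and should therefore be excluded from the modelling process. Why? You want to model Sales given all the other variables. Because some variables are constant, they tell you nothing about how Sales changes, since they themselves never change.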
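As a rough illustration, the error can be reproduced on a small made-up data frame, and the constant columns can be found by counting distinct values per column. The `toy` data and the variable names `n_distinct` and `constant_cols` are assumptions for this sketch, not part of the original data.

```r
# Toy data (hypothetical): `Store` is constant, `Promo` varies.
toy <- data.frame(
  Sales = c(10, 12, 9, 15, 11, 14),
  Store = rep(3, 6),
  Promo = c(0, 1, 0, 1, 0, 1)
)

# A one-level factor in the formula makes lm() fail; capture the message.
res <- tryCatch(
  lm(Sales ~ as.factor(Store) + as.factor(Promo), data = toy),
  error = function(e) conditionMessage(e)
)
print(res)

# Count distinct values per column; a column with one value is constant.
n_distinct <- sapply(toy, function(x) length(unique(x)))
constant_cols <- names(n_distinct)[n_distinct < 2]
print(constant_cols)
```

Running the same `sapply` call on `store` should list every column that has to be left out of the formula. Note that the `ifelse(...)` check in the question returns only the first level of each column, so it cannot distinguish constant columns from varying ones.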

If you modify the model as follows, your code should work:

modell <- lm(Sales ~ as.factor(DayOfWeek) + SchoolHoliday + as.factor(Promo) + 
               as.factor(DateYear) + as.factor(DateMonth) + as.factor(DateDay) + 
               as.factor(DateWeek) + as.factor(CompetitionOpen) + as.factor(PromoOpen), 
             data = train)
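Instead of listing the varying columns by hand, the formula could also be assembled programmatically. The following is only a sketch: `varying_formula` is a hypothetical helper, the `toy` data is made up, and on the real table you would additionally want to drop bookkeeping columns such as Prediction before calling it.

```r
# Hypothetical helper: build `response ~ ...` from the predictors that
# take more than one distinct value in `df`.
varying_formula <- function(df, response) {
  predictors <- setdiff(names(df), response)
  varies <- vapply(
    predictors,
    function(p) length(unique(df[[p]])) > 1,  # df[[p]] works for data.frame and data.table
    logical(1)
  )
  stopifnot(any(varies))  # at least one usable predictor is required
  reformulate(predictors[varies], response = response)
}

# Toy data (hypothetical): `Store` is constant and gets dropped.
toy <- data.frame(
  Sales     = c(10, 12, 9, 15, 11, 14),
  Store     = rep(3, 6),
  Promo     = c(0, 1, 0, 1, 0, 1),
  DayOfWeek = c(1, 2, 3, 1, 2, 3)
)

f <- varying_formula(toy, "Sales")
print(f)
fit <- lm(f, data = toy)
print(coef(fit))
```

This version treats every remaining predictor as numeric. Whether a column should instead be wrapped in as.factor() is a separate modelling decision.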

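One more remark: you are converting all variables to factors. For example, PromoOpen appears to be a numeric variable, and it may be better to keep it numeric. This of course depends on your data and on the interpretation you want from the model.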
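To make the difference concrete, here is a small sketch on simulated data (the values and the linear relationship are assumptions, not taken from the question). Treating PromoOpen as a factor estimates one coefficient per distinct value and cannot predict for a value absent from the training data, which can also happen inside a cross-validation fold.

```r
set.seed(1)

# Simulated data (hypothetical): 9 distinct PromoOpen values, 4 rows each.
toy <- data.frame(PromoOpen = rep(seq(20, 22, by = 0.25), each = 4))
toy$Sales <- 5000 + 100 * toy$PromoOpen + rnorm(nrow(toy), sd = 50)

fit_factor  <- lm(Sales ~ as.factor(PromoOpen), data = toy)
fit_numeric <- lm(Sales ~ PromoOpen, data = toy)

# Factor: intercept plus one dummy per extra level. Numeric: intercept plus slope.
n_coef_factor  <- length(coef(fit_factor))
n_coef_numeric <- length(coef(fit_numeric))
print(c(factor = n_coef_factor, numeric = n_coef_numeric))

# A value that never occurred in the training data.
new_row <- data.frame(PromoOpen = 21.1)

# The numeric model interpolates; the factor model stops with an error.
pred_numeric <- predict(fit_numeric, newdata = new_row)
pred_factor  <- tryCatch(predict(fit_factor, newdata = new_row),
                         error = function(e) e)
print(pred_numeric)
print(inherits(pred_factor, "error"))
```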
