R novice (learning on the job) gets an error when trying to adapt a previous programmer's regression


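I'm a researcher running a binomial regression (and coding, and doing statistics) for the first time at work, and it has been quite an experience! I took over this project midway, so I didn't write the original code myself. I had never coded before, so I have been learning R as I go. However, I've hit an error that I can't figure out (though I suspect it may be very simple), and I'd be very grateful for any help. I've described it in detail below and can attach screenshots if that would help.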

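The initial dataset had 1,276 people (rows), each of whom answered a selection of 188 questions (columns). I've since been asked to add the answers to another 8 questions to this initial dataset, which means the final dataset has 196 questions (columns). The descriptive fields for each question number only 9, and this has stayed the same. However, I'm having trouble adapting the code to account for the new columns.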

Any ideas about what might be causing the row mismatch are welcome!

For example, my first version of the code runs:

Ans_Data = read_xlsx("DSM Data 15.2.23 IB v4.xlsx",
  sheet = "CHANGED Tab 2 - AR weighted",
  range = "A12:GG1290", col_names = F, col_types = c("text",rep("numeric",188)))
Question_Data = t(read_xlsx("DSM Data 15.2.23 IB v4.xlsx",
  sheet = "CHANGED Tab 2 - AR weighted",
  range = "A1:GG10", col_names = T))

colnames(Question_Data) = Question_Data[1,] 
Question_Data = Question_Data[-1,] 
Question_Data = data.table(Question_Data)

Ans_Data_2 = Ans_Data %>%
  pivot_longer(cols = colnames(Ans_Data)[2:189])

for (i in 1:1278) {
  if (i==1) {
    Question_Data_2 = rbind(Question_Data,Question_Data)
  } else {
    Question_Data_2 = rbind(Question_Data_2,Question_Data)
  }
}

Ans_Data_3 = cbind(Ans_Data_2, Question_Data_2)

However, my updated code:

Ans_Data = read_xlsx("DSM Data 15.2.23 DP v5.xlsx",
  sheet = "CHANGED Tab 2 - AR weighted",
  range = "A12:GO1287", col_names = F,col_types = c("text",rep("numeric",196)))
Question_Data = t(read_xlsx("DSM Data 15.2.23 DP v5.xlsx", 
  sheet = "CHANGED Tab 2 - AR weighted",
  range = "A1:GO10", col_names = T))

colnames(Question_Data) = Question_Data[1,] 
Question_Data = Question_Data[-1,] 
Question_Data = data.table(Question_Data)

Ans_Data_2 = Ans_Data %>%
  pivot_longer(cols = colnames(Ans_Data)[2:197])

for (i in 1:1278) {
  if (i==1) {
    Question_Data_2 = rbind(Question_Data,Question_Data)
  } else {
    Question_Data_2 = rbind(Question_Data_2,Question_Data)
  }
}

Ans_Data_3 = cbind(Ans_Data_2, Question_Data_2)

produces the following error:

Error in data.frame(..., check.names = FALSE) : arguments imply differing number of rows: 250096, 250684

Edited to add the output of the following commands (run before the for loop): dput(Question_Data[1:10,1:10]) and dput(Ans_Data[1:10, 1:10]).

library(data.table)
Question_Data <- structure(list(Question_ID = c("sawd4_batch2", "sawd3_batch3", 
"sand4_batch", "samd3", "samd32", "bwpx_batch", "bwd3", "bwd32", 
"bmd3_batch5", "bm3_batch2"), `Media Item Subtype` = c("Image", 
"Image", "Image", "Image", "Image", "Image", "Image", "Image", 
"Image", "Image"), `Contains Synthetic Media?` = c("Yes (Fully Synthetic)", 
"Yes (Fully Synthetic)", "Yes (Fully Synthetic)", "Yes (Fully Synthetic)", 
"Yes (Fully Synthetic)", "Yes (Fully Synthetic)", "Yes (Fully Synthetic)", 
"Yes (Fully Synthetic)", "Yes (Fully Synthetic)", "Yes (Fully Synthetic)"
), `Real/Fake Image` = c("Fake", "Fake", "Fake", "Fake", "Fake", 
"Fake", "Fake", "Fake", "Fake", "Fake"), `Real/Fake Audio` = c(NA_character_, 
NA_character_, NA_character_, NA_character_, NA_character_, NA_character_, 
NA_character_, NA_character_, NA_character_, NA_character_), 
    `Real/Fake Video` = c(NA_character_, NA_character_, NA_character_, 
    NA_character_, NA_character_, NA_character_, NA_character_, 
    NA_character_, NA_character_, NA_character_), `Type of Image` = c("Human", 
    "Human", "Human", "Human", "Human", "Human", "Human", "Human", 
    "Human", "Human"), `Human or Non Human` = c("Human", "Human", 
    "Human", "Human", "Human", "Human", "Human", "Human", "Human", 
    "Human"), `Language Type` = c(NA_character_, NA_character_, 
    NA_character_, NA_character_, NA_character_, NA_character_, 
    NA_character_, NA_character_, NA_character_, NA_character_
    )), row.names = c(NA, -10L), class = "data.frame")
Ans_Data <- structure(list(...1 = c("53987712fdf99b68e3a45021", "545cee6dfdf99b7f9e3254ce", 
"5484739ffdf99b0379939c95", "5588ee6ffdf99b304dd48297", "558943fafdf99b5ccd435cb3", 
"5589c7cefdf99b18bd86cf31", "558a035bfdf99b2d75651378", "558a327cfdf99b2d75651681", 
"558bbd56fdf99b2127e1f359", "5591827dfdf99b4fccbdfb21"), ...2 = c(NA, 
NA, NA, 1, NA, NA, 1, NA, NA, NA), ...3 = c(NA, NA, NA, 0, NA, 
NA, 0, NA, NA, NA), ...4 = c(NA, NA, NA, 1, NA, NA, 0, NA, NA, 
NA), ...5 = c(NA, NA, NA, 1, NA, NA, 0, NA, NA, NA), ...6 = c(NA, 
NA, NA, 1, NA, NA, 1, NA, NA, NA), ...7 = c(NA, NA, NA, 0, NA, 
NA, 0, NA, NA, NA), ...8 = c(NA, NA, NA, 1, NA, NA, 0, NA, NA, 
NA), ...9 = c(NA, NA, NA, 0, NA, NA, 0, NA, NA, NA), ...10 = c(NA, 
NA, NA, 0, NA, NA, 0, NA, NA, NA)), row.names = c(NA, -10L), class = c("tbl_df", 
"tbl", "data.frame"))
r logistic-regression
1 Answer

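I'm writing this as an answer because I can't comment yet. I stumbled on your code and it caught my attention. Anyway, your error is quite simple: you are trying to column-bind (cbind) two data frames that have different numbers of rows. Why the row counts differ is a separate question.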
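The same kind of error can be reproduced on toy data. This is only a minimal sketch; the data frames `a` and `b` are made up for illustration and are not from the question.

```r
# Two toy data frames with different numbers of rows (hypothetical data)
a <- data.frame(x = 1:3)
b <- data.frame(y = 1:4)

# cbind() on data frames needs matching row counts; capture the error message
msg <- tryCatch(
  {
    cbind(a, b)
    "no error"
  },
  error = function(e) conditionMessage(e)
)
print(msg)
```

Comparing nrow() of both objects before calling cbind() is usually the quickest way to see which side is off.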

Reading through the code, you import two datasets:

Ans_Data = read_xlsx("DSM Data 15.2.23 DP v5.xlsx", sheet = "CHANGED Tab 2 - AR weighted", range = "A12:GO1287", col_names = F,col_types = c("text",rep("numeric",196)))

Question_Data = t(read_xlsx("DSM Data 15.2.23 DP v5.xlsx", sheet = "CHANGED Tab 2 - AR weighted", range = "A1:GO10", col_names = T))

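Judging by the names, I assume Ans_Data holds the responses; it is a dataset with 197 columns (A to GO) and 1276 rows (12 to 1287). You later pivot that data frame to long format, which in your case produces a data frame with 250096 rows: 196 columns (2:197) times 1276 rows.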
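These dimensions can be checked directly from the range string. The sketch below uses a hypothetical helper, `col_to_num`, written here only for illustration (it is not part of readxl), and assumes pivot_longer keeps every row, which is its default behaviour.

```r
# Hypothetical helper: convert an Excel column label such as "GO" to its index
col_to_num <- function(label) {
  letters_in_label <- strsplit(toupper(label), "")[[1]]
  digits <- match(letters_in_label, LETTERS)
  # Treat the label as a base-26 number with digits A = 1 ... Z = 26
  Reduce(function(acc, d) acc * 26 + d, digits, 0)
}

n_cols <- col_to_num("GO") - col_to_num("A") + 1  # columns in A12:GO1287
n_rows <- 1287 - 12 + 1                           # rows in A12:GO1287
n_long <- (n_cols - 1) * n_rows                   # column 1 is the ID, the rest are pivoted

print(c(columns = n_cols, rows = n_rows, long_rows = n_long))
```

On the real data, nrow(Ans_Data_2) should give the same long-format count.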

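The second dataset (Question_Data) is read from the 10-row, 197-column range A1:GO10 and then transposed (t), so it has 197 rows, one per spreadsheet column. You use the first row as the column names and drop it, leaving 196 rows. You then run a loop that, for i = 1, row-binds those 196 rows onto the end of Question_Data, giving 392 rows. For i > 1 you repeat the process 1277 times. The resulting data frame, Question_Data_2, therefore has 392 + 196 * 1277 = 250684 rows.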
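The copy count of that loop can be confirmed on a small stand-in. This is a sketch under assumptions: `question_toy` and `stack_copies` are hypothetical names, and the toy table has 2 rows instead of 196 so the loop runs quickly.

```r
# Toy stand-in for Question_Data with 2 rows (hypothetical content)
question_toy <- data.frame(Question_ID = c("q1", "q2"))

# Same loop shape as in the question, wrapped in a function of the upper bound n
stack_copies <- function(df, n) {
  for (i in 1:n) {
    if (i == 1) {
      out <- rbind(df, df)   # the first pass already produces two copies
    } else {
      out <- rbind(out, df)  # every later pass adds one more copy
    }
  }
  out
}

stacked <- stack_copies(question_toy, 1278)
copies  <- nrow(stacked) / nrow(question_toy)  # number of copies, i.e. n + 1
print(copies)
print(copies * 196)  # rows the real 196-row table would end up with
```

This would also explain why the same bound worked in the first script: the range A12:GG1290 spans 1290 - 12 + 1 = 1279 rows, which matches the 1278 + 1 copies the loop produces, whereas A12:GO1287 spans only 1276 rows.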

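Your two datasets have 250096 and 250684 rows, so, as mentioned, cbind throws an error. Assuming Question_Data provides the design matrix and Ans_Data the responses, the code was presumably built to match the design matrix to the responses. Since you need 196 responses from each of 1276 people, that should be 250096 rows (196 times 1276). So I suggest that your loop sequence is too long. Because the i == 1 step already binds two copies, a loop over 1:n yields n + 1 copies, so to get 1276 copies it should be 1:1275.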
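An alternative that avoids hard-coding the loop bound is to repeat the question table once per person and check the row counts before binding. The sketch below uses made-up toy data (3 people, 4 questions) and hypothetical names; on the real data the number of people would presumably come from nrow(Ans_Data).

```r
# Toy stand-ins (hypothetical data): 3 people, 4 questions
n_people    <- 3
n_questions <- 4
answers_long <- data.frame(
  person = rep(paste0("p", 1:n_people), each = n_questions),
  value  = c(1, 0, NA, 1, 0, 0, 1, NA, 1, 1, 0, 1)
)
questions <- data.frame(
  Question_ID = paste0("q", 1:n_questions),
  Type        = c("Image", "Image", "Audio", "Video")
)

# Repeat the whole question table once per person, without a loop
idx <- rep(seq_len(nrow(questions)), times = n_people)
questions_rep <- questions[idx, , drop = FALSE]
rownames(questions_rep) <- NULL

# Fail early if the two pieces do not line up
stopifnot(nrow(answers_long) == nrow(questions_rep))

combined <- cbind(answers_long, questions_rep)
print(head(combined, 5))
```

This relies on the long-format answers being ordered person by person with the questions in sheet order within each person, which is the order pivot_longer produces.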
