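I'm trying to analyze a large survey created with SurveyMonkey. The CSV export has hundreds of columns, and the output format is hard to work with because the headers span two rows.
Thanks!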
You can export it from SurveyMonkey in a convenient form for R; see "Advanced Spreadsheet Format" under Download Responses.
What I did in the end was use LibreOffice to print out the headers labelled V1, V2, etc., and then I read the file in as
m1 <- read.csv('Sheet1.csv', header=FALSE, skip=1)
and then ran the analysis against m1$V10, m1$V23, etc.
To get around the mess of multiple columns I used the following small function
# function to merge columns into one with a space separator and then
# remove multiple spaces
mcols <- function(df, cols) {
  # e.g. mcols(df, c(14:18))
  exp <- paste('df[,', cols, ']', sep='', collapse=',')
  # this creates something like...
  # "df[,14],df[,15],df[,16],df[,17],df[,18]"
  # now we just want to do a paste of this expression...
  nexp <- paste(" paste(", exp, ", sep=' ')")
  # so now nexp looks something like...
  # " paste( df[,14],df[,15],df[,16],df[,17],df[,18] , sep=' ')"
  # now we just need to parse this text... and eval() it...
  newcol <- eval(parse(text=nexp))
  newcol <- gsub(' +', ' ', newcol) # replace runs of spaces by a single one
  newcol <- gsub('^ *', '', newcol) # remove leading spaces
  gsub(' *$', '', newcol) # remove trailing spaces
}
# mcols(df, c(14:18))
No doubt someone will be able to clean this up!
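One possible cleanup, offered only as a sketch: the same merge can be done without building and parsing a string, by handing the selected columns to paste() through do.call(). The name merge_cols and the demo data frame are hypothetical and not part of the original answer.

```r
# Sketch of a simpler column merge; merge_cols is a hypothetical name.
merge_cols <- function(df, cols) {
  # do.call() passes each selected column to paste() as a separate argument;
  # unname() stops column names from being mistaken for paste() arguments
  parts <- unname(as.list(df[, cols, drop = FALSE]))
  merged <- do.call(paste, c(parts, sep = " "))
  merged <- gsub(" +", " ", merged)  # collapse runs of spaces
  trimws(merged)                     # strip leading and trailing whitespace
}

# Made-up demo data: each row has answers scattered over three columns
demo <- data.frame(a = c("x", ""), b = c("", "y"), c = c("z", ""),
                   stringsAsFactors = FALSE)
merge_cols(demo, 1:3)
```

Avoiding eval(parse()) keeps the function easier to debug, and it should behave the same way for character or factor columns.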
To tidy up Likert-style scales I used:
# function to tidy c('Strongly Agree', 'Agree', 'Disagree', 'Strongly Disagree')
tidylik4 <- function(x) {
  xlevels <- c('Strongly Disagree', 'Disagree', 'Agree', 'Strongly Agree')
  x <- as.character(x) # ifelse() on a factor would return integer codes
  y <- ifelse(x == '', NA, x)
  ordered(y, levels=xlevels)
}
for (i in 44:52) {
  m2[,i] <- tidylik4(m2[,i])
}
Feel free to comment, since this will no doubt come up again!
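As a sketch of the same idea applied to several columns at once, lapply() can replace the index loop. The function name tidy_likert, the survey data frame, and the column names q1 and q2 are made up for illustration.

```r
likert_levels <- c("Strongly Disagree", "Disagree", "Agree", "Strongly Agree")

# Hypothetical helper: blank answers become NA, the rest an ordered factor
tidy_likert <- function(x, levels = likert_levels) {
  x <- as.character(x)   # also copes with factor input
  x[!nzchar(x)] <- NA    # empty strings mean "no answer"
  factor(x, levels = levels, ordered = TRUE)
}

# Made-up demo data
survey <- data.frame(q1 = c("Agree", "", "Strongly Disagree"),
                     q2 = c("Disagree", "Strongly Agree", ""),
                     stringsAsFactors = FALSE)

likert_cols <- c("q1", "q2")
survey[likert_cols] <- lapply(survey[likert_cols], tidy_likert)
str(survey)
```

Selecting columns by name rather than by position (44:52 above) tends to survive changes to the survey layout better.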
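I have to deal with this pretty frequently, and having the headers spread over two rows is a bit of a pain. This function fixes that so that you only have a one-row header to deal with. It also joins the multi-punch questions, so you get "top - bottom" style names.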
#' @param x The path to a surveymonkey csv file
#' @return A character vector with one combined name per column
fix_names <- function(x) {
  rs <- read.csv(
    x,
    nrows = 2,
    stringsAsFactors = FALSE,
    header = FALSE,
    check.names = FALSE,
    na.strings = "",
    encoding = "UTF-8"
  )
  rs[rs == ""] <- NA
  rs[rs == "NA"] <- "Not applicable"
  rs[rs == "Response"] <- NA
  rs[rs == "Open-Ended Response"] <- NA
  nms <- rep(NA_character_, ncol(rs))
  pre <- NULL  # top-row label carried across a multi-column question
  for (i in seq_len(ncol(rs))) {
    current_top <- rs[1, i]
    current_bottom <- rs[2, i]
    # look ahead one column; the last column has nothing after it
    coming_top <- NA
    coming_bottom <- NA
    if (i < ncol(rs)) {
      coming_top <- rs[1, i + 1]
      coming_bottom <- rs[2, i + 1]
    }
    if (is.na(coming_top) && !is.na(current_top) &&
        (!is.na(current_bottom) || grepl("^Other", coming_bottom)))
      pre <- current_top
    # any column with a bottom label gets "top - bottom"
    if (!is.na(current_bottom))
      nms[i] <- paste0(c(pre, current_bottom), collapse = " - ")
    if (!is.na(current_top) && is.na(current_bottom))
      nms[i] <- current_top
  }
  nms
}
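Note that it only returns the names. I usually just read the .csv with ..., skip = 2, header = FALSE, save the result to a variable, and overwrite that variable's names. It also helps to set na.strings and stringsAsFactors = FALSE.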
nms = fix_names("path/to/csv")
d = read.csv("path/to/csv", skip = 2, header = FALSE)
names(d) = nms
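As of November 2013, the web page layout seems to have changed. Choose
Analyze results > Export All > All Responses Data > Original View > XLS+ (Open in advanced statistical and analytical software)
. Then go to Exports and download the file. You'll get the raw data, with first row = question headers / each following row = 1 response, possibly split across multiple files if you have many responses/questions.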
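The problem with the headers is that "select all that apply" columns have a blank top row, and the actual column heading is in the row below. This only affects that type of question.
With this in mind, I wrote a loop that goes through all the columns and, if the column name is blank (character length of 1), replaces it with the value from the second row of the file, which is the first row of the data frame.
You can then delete that row and you are left with a tidy data frame.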
for(i in 1:ncol(df)){
  newname <- colnames(df)[i]
  if(nchar(newname) < 2){
    colnames(df)[i] <- as.character(df[[i]][1])
  }
}
# drop the sub-header row once every column has been renamed
df <- df[-1,]
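The same renaming can be sketched without an explicit loop. This version reuses the "fewer than 2 characters" test from the loop above; the demo data frame is made up, with blank and single-space names standing in for the blank headers.

```r
# Made-up demo: one named question and two "select all that apply" columns
df <- data.frame(c("Response", "Yes"), c("Option A", "A"), c("Option B", NA),
                 stringsAsFactors = FALSE)
names(df) <- c("Q1", "", " ")

# Same test as the loop: names shorter than 2 characters count as blank
blank <- nchar(names(df)) < 2

# First value of every column, as character
first_row <- vapply(df, function(col) as.character(col[1]), character(1))

names(df)[blank] <- first_row[blank]
df <- df[-1, , drop = FALSE]   # drop the sub-header row
names(df)
```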
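Late to the party, but this is still an issue, and the best workaround I've found is a function that pastes the column names and sub-column names together based on a repeating value.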
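For example, if you export to .csv, the repeated (blank) column names are automatically replaced with X when the file is read in RStudio. If you export to .xlsx, the repeating value will be "...".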
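To see where the X comes from in the .csv case, here is a small sketch with made-up inline data: base R's read.csv() runs header fields through make.names(), which turns an empty name into X, X.1, and so on. It does not cover the .xlsx case.

```r
# Made-up two-row header in the SurveyMonkey style: the question text sits
# over the first sub-column only, the remaining header cells are blank
csv_text <- paste(
  "Respondent ID,Contact information:,,",
  ",Name,Company,City",
  "1,Ada,ACME,London",
  sep = "\n"
)

raw <- read.csv(text = csv_text, check.names = TRUE, stringsAsFactors = FALSE)

names(raw)   # the blank header cells come back as names starting with "X"
raw[1, ]     # the sub-column names end up in the first data row
```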
Here is a base R solution:
sm_header_function <- function(x, rep_val){
  orig <- x
  # the first data row holds the sub-column names;
  # keep only the columns where it is filled in
  sv <- x[1,]
  sv <- sv[, sapply(sv, Negate(anyNA)), drop = FALSE]
  sv <- t(sv)
  sv <- cbind(rownames(sv), data.frame(sv, row.names = NULL))
  names(sv)[1] <- "name"
  names(sv)[2] <- "value"
  # start a new group at every column name that does not begin with rep_val
  sv$grp <- with(sv, ave(name, FUN = function(x) cumsum(!startsWith(name, rep_val))))
  # the first name in each group is the question text
  sv$new_value <- with(sv, ave(name, grp, FUN = function(x) head(x, 1)))
  sv$new_value <- paste0(sv$new_value, " ", sv$value)
  colnames(orig)[which(colnames(orig) %in% sv$name)] <- sv$new_value
  # drop the sub-header row
  orig <- orig[-c(1),]
  return(orig)
}
sm_header_function(df, "X")    # .csv export
sm_header_function(df, "...")  # .xlsx export
For some sample data, the change in column names looks like this:
Original export from SurveyMonkey:
> colnames(sample)
[1] "Respondent ID" "Please provide your contact information:" "...11"
[4] "...12" "...13" "...14"
[7] "...15" "...16" "...17"
[10] "...18" "...19" "I wish it would have snowed more this winter."
Cleaned export from SurveyMonkey:
> colnames(sample_clean)
[1] "Respondent ID" "Please provide your contact information: Name"
[3] "Please provide your contact information: Company" "Please provide your contact information: Address"
[5] "Please provide your contact information: Address 2" "Please provide your contact information: City/Town"
[7] "Please provide your contact information: State/Province" "Please provide your contact information: ZIP/Postal Code"
[9] "Please provide your contact information: Country" "Please provide your contact information: Email Address"
[11] "Please provide your contact information: Phone Number" "I wish it would have snowed more this winter. Response"
Sample data:
structure(list(`Respondent ID` = c(NA, 11385284375, 11385273621,
11385258069, 11385253194, 11385240121, 11385226951, 11385212508
), `Please provide your contact information:` = c("Name", "Benjamin Franklin",
"Mae Jemison", "Carl Sagan", "W. E. B. Du Bois", "Florence Nightingale",
"Galileo Galilei", "Albert Einstein"), ...11 = c("Company", "Poor Richard's",
"NASA", "Smithsonian", "NAACP", "Public Health Co", "NASA", "ThinkTank"
), ...12 = c("Address", NA, NA, NA, NA, NA, NA, NA), ...13 = c("Address 2",
NA, NA, NA, NA, NA, NA, NA), ...14 = c("City/Town", "Philadelphia",
"Decatur", "Washington", "Great Barrington", "Florence", "Pisa",
"Princeton"), ...15 = c("State/Province", "PA", "Alabama", "D.C.",
"MA", "IT", "IT", "NJ"), ...16 = c("ZIP/Postal Code", "19104",
"20104", "33321", "1230", "33225", "12345", "8540"), ...17 = c("Country",
NA, NA, NA, NA, NA, NA, NA), ...18 = c("Email Address", "[email protected]",
"[email protected]", "[email protected]", "[email protected]",
"[email protected]", "[email protected]", "[email protected]"
), ...19 = c("Phone Number", "215-555-4444", "221-134-4646",
"999-999-4422", "999-000-1234", "123-456-7899", "111-888-9944",
"215-999-8877"), `I wish it would have snowed more this winter.` = c("Response",
"Strongly disagree", "Strongly agree", "Neither agree nor disagree",
"Strongly disagree", "Disagree", "Agree", "Strongly agree")), row.names = c(NA,
-8L), class = c("tbl_df", "tbl", "data.frame"))
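How about the following: use read.csv() with header=FALSE. Make two arrays, one with the two header rows and the other with the survey answers. Then paste() the two header rows together. Finally, assign the result with colnames().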
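A minimal sketch of that approach, using made-up inline data in place of a real export. The object names hdr, answers, top and bottom are illustrative, and the blank top cell of a multi-column question is simply left blank rather than filled in from the left.

```r
# Made-up export with a two-row header
csv_text <- paste(
  "Respondent ID,Which fruits do you like?,,Comments",
  ",Apples,Pears,Open-Ended Response",
  "1,Apples,,Tasty",
  "2,,Pears,",
  sep = "\n"
)

# The two header rows, read as plain text
hdr <- read.csv(text = csv_text, header = FALSE, nrows = 2,
                colClasses = "character")

# The survey answers, skipping both header rows
answers <- read.csv(text = csv_text, header = FALSE, skip = 2,
                    stringsAsFactors = FALSE)

# Paste the two header rows together column by column
top <- unlist(hdr[1, ], use.names = FALSE)
bottom <- unlist(hdr[2, ], use.names = FALSE)
colnames(answers) <- trimws(paste(top, bottom))

colnames(answers)
```

With a real file, the text = csv_text argument would be replaced by the path to the exported CSV.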
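Two solutions that work in 2023:
The haven package
Haven is an R package that can import and export Stata, SPSS and SAS files. If you choose the "SPSS" option when exporting from SurveyMonkey, the export will be a .sav file, which you can read with haven::read_sav():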
library(haven)
read_sav("your-file-here.sav")
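The resulting object is a tibble (a kind of data frame). The columns are labelled using a "q000N" syntax and may contain types specific to the haven package, such as numeric vectors with the haven_labelled class, which are similar to factors.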
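I created a single-function package that uses tidyverse functions to read and clean the oddly formatted SM results you get from the default export.
So if you like, you can do: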
devtools::install_github("church-army/monkeyreadr")
library(monkeyreadr)
read_sm("your-survey-monkey-data.csv")
For posterity, I'll include the body of the function below:
The read_sm() function
default_cols <- c("Respondent ID", "Collector ID", "IP Address",
                  "Email Address", "First Name", "Last Name",
                  "Custom Data 1")
read_sm <- function(x, clean_names = TRUE, drop_surplus_cols = TRUE,
                    ...){
  ## determine cleaning function from clean_names -------------------
  stopifnot(length(clean_names) == 1)
  if(!is.function(clean_names)){
    name_cleaner <- ifelse(clean_names, janitor::make_clean_names, identity)
  } else name_cleaner <- clean_names
  ## read sm_data ---------------------------------------------------
  suppressMessages({
    sm_data <- vroom::vroom(x, show_col_types = FALSE, ...)
  })
  missing_names <- stringr::str_detect(names(sm_data), "^\\.\\.\\.\\d+$")
  sm_data <- dplyr::rename_with(sm_data, name_cleaner, everything())
  ## Assign correct types (where known) ----------------------------------------
  default_cols <- name_cleaner(default_cols)
  sm_data <-
    dplyr::mutate(
      sm_data,
      dplyr::across(
        dplyr::any_of(default_cols), as.character)
    )
  sm_data <-
    dplyr::mutate(
      sm_data,
      dplyr::across(any_of(name_cleaner(c("Start Date", "End Date"))),
                    lubridate::mdy_hms
      )
    )
  ## Replace missing names w/ values from first row ----------------------------
  first_row <- unlist(sm_data[1, ])
  sm_data <- sm_data[-1, ]
  repaired_names <-
    name_cleaner(paste(first_row[missing_names], which(missing_names)))
  old_names <- names(sm_data)[missing_names]
  names(sm_data)[missing_names] <- repaired_names
  if(length(repaired_names) > 0){
    repaired_names_to_print <-
      paste(old_names, "->", repaired_names, sep = " ")
    rlang::inform(message = "Repaired names:",
                  class = "sm_name_repair",
                  body = repaired_names_to_print
    )
  }
  ## Drop surplus columns ------------------------
  if(drop_surplus_cols){
    all_na <- \(x) all(is.na(x))
    sm_data <- dplyr::select(sm_data, -(any_of(default_cols) & where(all_na)))
  }
  sm_data
}