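I'm writing a function in R, using some test data, that loops over a data.table and checks whether each column meets specific conditions. The function should group columns by known column names, then loop over those columns, and any value it finds that doesn't meet the condition for that column should be added to "invalid_rows_list". I currently have some extra print statements in my code to help with debugging: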
cat(sprintf("Invalid data found in column '%s':\n", col))
print(data[invalid_rows, .(Row = .I, Value = column_data[invalid_rows])])
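With those print statements and the test data, I know what should be output, and it isn't flagging things correctly. Based on my data, it should flag "programyear" in row 10 because it has 5 digits (20201) instead of 4; it should flag the "asian" column in row 9 because it has an A instead of 0, 1, or 9; and it should flag the "hawaiian" column in row 2 because it has a 2 instead of 0, 1, or 9. The print statements show that it is flagging these, but it is also flagging the columns with "fundsexpended" in the column name as invalid, even though they contain only numeric values, which is the only requirement.
The function is supposed to add the rows with invalid data to a list and print that list as a table at the end, but even though the print statements show invalid data being found, the list prints NULL.
Here is the function I have so far: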
detox_validation <- function(data) {
setDT(data)
setnames(data, tolower(names(data)))
# Define columns by type and requirements
colnames <- names(data)
agencystaff_cols <- colnames[grepl("agencystaff", colnames)]
agencypurchase_cols <- colnames[grepl("agencypurchase", colnames)]
compserviceprovider_cols <- colnames[grepl("compserviceprovider(?!type)", colnames, perl = TRUE)]
purchaseprovider_cols <- colnames[grepl("purchaseprovider", colnames)]
numeric_cols <- colnames[grepl("expend|wage|hoursworked|age_", colnames)]
date_cols <- grep("date|eligibilityext|compdisenrollmsg", names(data), value = TRUE)
programyear_cols <- grep("programyear", colnames, value = TRUE)
sex_cols <- grep("sex", colnames, value = TRUE, ignore.case = TRUE)
demographic_cols <- grep("^(amerindian|asian|black|hawaiian|white|
hispanic|veteran|disability|adult|adulted|
dislocatedworker|jobcorps|wpempservice|youth|
longtermunemp|exhausttanf|fostercareyouth|
homelessorrunaway|exoffenderstatus|lowincomestatus|
englishlearner|basicskillsdeficient|culturalbarriers|
singleparent|dishomemaker)$",
colnames, value = TRUE, ignore.case = TRUE)
# Exclude columns explicitly related to funds expended from certain checks
funds_expended_cols <- colnames[grepl("fundsexpended", colnames)]
# Ensure we do not validate these columns with incorrect conditions
numeric_cols <- setdiff(numeric_cols, funds_expended_cols)
# Initialize a list to store indices of invalid rows
invalid_rows_list <- list()
# Function to check and record invalid entries
validate_and_list <- function(col_names, condition, error_message) {
for (col in col_names) {
if (!col %in% names(data)) {
cat(sprintf("Warning: Column '%s' does not exist in the data.\n", col))
next
}
column_data <- data[[col]]
invalid_rows <- which(!condition(column_data) & !is.na(column_data))
if (length(invalid_rows) > 0) {
cat(sprintf("Invalid data found in column '%s':\n", col))
print(data[invalid_rows, .(Row = .I, Value = column_data[invalid_rows])])
invalid_rows_list[[length(invalid_rows_list) + 1]] <- data.table(
Column = col,
Row_Index = invalid_rows,
Value = column_data[invalid_rows],
Message = error_message
)
}
}
}
# Apply checks
validate_and_list(numeric_cols, function(x) is.numeric(x),
"Must be numeric and not NA")
validate_and_list(date_cols, function(x) !is.na(ymd(x, quiet = TRUE)),
"Invalid date format or value. Expected YYYY-MM-DD or NA.")
validate_and_list(programyear_cols, function(x) nchar(as.character(x)) ==
4 & grepl("^\\d{4}$", x),
"Program year must be exactly four digits.")
validate_and_list(demographic_cols, function(x) x %in% c(0, 1, 9),
"Demographic values must be numeric and either 0, 1, or 9")
validate_and_list(sex_cols, function(x) x %in% c(1, 2),
"Sex values must be numeric and either 1 or 2")
# Combine all invalid entries into a single data.table
if (length(invalid_rows_list) > 0) {
all_invalid_rows <- rbindlist(invalid_rows_list)
return(all_invalid_rows)
} else {
return(NULL)
}
}
Here is the data I'm using to check the function:
> dput(check_val)
structure(list(v1 = 1:10, programyear = c(2020, 2020, 2020, 2020,
2020, 2020, 2020, 2020, 2020, 20201), agencycode = c(6L, 35L,
52L, 42L, 48L, 48L, 48L, 91L, 91L, 91L), applicationdate = structure(c(18444,
18449, 16743, 18548, 14551, 17403, 12241, 14886, 15216, 15805
), class = "Date"), sex = c(1, 2, 1, 1, 1, 1, 1, 2, 2, 1), amerindian = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 0), asian = c("0", "0", "0", "0", "0",
"0", "0", "0", "A", "0"), black = c(1, 0, 1, 1, 1, 0, 0, 0, 0,
0), hawaiian = c(0, 2, 0, 0, 0, 0, 0, 0, 0, 0), white = c(0,
1, 0, 0, 1, 1, 1, 1, 1, 1), hispanic = c(0, 0, 0, 0, 0, 1, 0,
0, 0, 0), veteran = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0), referral = c(19L,
29L, 1L, 9L, 16L, 29L, 19L, 16L, 16L, 19L), student = c(3L, 0L,
0L, 0L, 0L, 0L, 0L, 2L, 2L, 2L), eligibilitydate = structure(c(18467,
18507, NA, NA, 14573, 17423, 12262, 14886, 15217, 15805), class = "Date"),
eligibilityext = structure(c(NA_real_, NA_real_, NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_,
NA_real_), class = "Date"), oosplacementdate = structure(c(NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_,
NA_real_, NA_real_, NA_real_), class = "Date"), oosexitdate = structure(c(NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_,
NA_real_, NA_real_, NA_real_), class = "Date"), disability = c(1,
1, 1, 1, 1, 1, 1, 1, 1, 1), primdisability = c(17L, 19L,
18L, 0L, 2L, 1L, 2L, 1L, 1L, 1L), primdisabilitycause = c(34L,
2L, 13L, 0L, 30L, 30L, 30L, 10L, 0L, 13L), seconddisability = c(0L,
19L, NA, 0L, 8L, 0L, 13L, 0L, 0L, 0L), seconddisabilitycause = c(0L,
18L, NA, 0L, 0L, 0L, 36L, 0L, 0L, 0L), disabilitysigcode = c(1L,
1L, 2L, 0L, 2L, 2L, 2L, 1L, 1L, 1L), twestartdate = structure(c(NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_,
NA_real_, NA_real_, NA_real_), class = "Date"), tweenddate = structure(c(NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_,
NA_real_, NA_real_, NA_real_), class = "Date"), ipesupportedempgoal = c(0L,
0L, 1L, NA, 0L, 0L, 1L, 0L, 0L, 0L), ipeempstatus = c(8L,
NA, 10L, NA, 7L, 8L, 10L, 8L, 8L, 8L), ipeprimaryocc = c(NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA), ipehourlywage = c(0,
NA, 0, 0, 0, 0, 0, 0, 0, 0), ipeweeklyhoursworked = c(0,
NA, 0, NA, 0, 0, 0, 0, 0, 0), adult = c(0, 9, 0, 9, 0, 0,
0, 0, 0, 0), adulted = c(0, 9, 0, 9, 9, 0, 9, 0, 0, 0), dislocatedworker = c(0,
9, 0, 9, 0, 0, 0, 0, 0, 0), jobcorps = c(0, 9, 0, 9, 9, 0,
0, 0, 0, 0), vocrehab = c(1L, NA, 0L, NA, 1L, 1L, 1L, 0L,
1L, 0L), wpempservice = c(0, 9, 0, 9, 9, 9, 9, 0, 0, 0),
youth = c(0, 9, 0, 9, 0, 0, 0, 0, 0, 0), youthbuild = c("",
"NULL", "", "", "", "", "", "", "", ""), longtermunemp = c(0,
9, 0, 9, 0, 0, 0, 1, 1, 0), exhausttanf = c(9, 9, 0, 9, 0,
9, 0, 0, 0, 0), fostercareyouth = c(0, 9, 0, 9, 0, 0, 0,
0, 0, 0), homelessorrunaway = c(0, 9, 0, 9, 0, 0, 0, 0, 0,
0), exoffenderstatus = c(0, 9, 0, 9, 9, 0, 9, 0, 0, 0), lowincomestatus = c(1,
9, 0, 9, 0, 1, 0, 0, 0, 1), englishlearner = c(0, 9, 0, 9,
0, 0, 0, 0, 0, 0), basicskillsdeficient = c(0, 9, 0, 9, 0,
0, 0, 0, 0, 0), culturalbarriers = c(0, 9, 0, 9, 9, 9, 9,
0, 0, 9), singleparent = c(9, 9, 0, 9, 9, 1, 9, 0, 0, 0),
dishomemaker = c(0, 9, 0, 9, 0, 0, 0, 0, 0, 0), migrantfarmworker = c(0L,
NA, 0L, NA, 0L, 0L, 0L, 0L, 0L, 0L), statedisstudentagerange = c("16;22",
"14;21", "14;21", "", "14;22", "14;22", "14;22", "14;21",
"14;21", "14;21"), schoolgradecompleted = c(12L, NA, 12L,
NA, 0L, 0L, 12L, 12L, 9L, 9L), insecondaryed = c(0, 0, 0,
0, 0, 0, 0, 0, 1, 1), specialedcertcompdate = structure(c(NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_,
NA_real_, NA_real_, NA_real_), class = "Date"), secschooldiplomadate = structure(c(NA,
NA, 11855, NA, 17682, NA, 15095, 17344, 18068, NA), class = "Date"),
geddate = structure(c(NA_real_, NA_real_, NA_real_, NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_
), class = "Date"), enrolledinpostseced = c(NA, NA, 0L, NA,
0L, 0L, 1L, 1L, 1L, 0L), credprogramenrolldate = structure(c(NA,
NA, NA, NA, NA, NA, NA, 17416, 18506, NA), class = "Date"),
completedsomepostseced = c(9, 9, 0, 9, 0, 0, 0, 1, 0, 0),
associatedegreedate = structure(c(NA_real_, NA_real_, NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_,
NA_real_), class = "Date"), bachelordegreedate = structure(c(NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_,
NA_real_, NA_real_, NA_real_), class = "Date"), mastersdegreedate = structure(c(NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_,
NA_real_, NA_real_, NA_real_), class = "Date"), degreeabovemastersdate = structure(c(NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_,
NA_real_, NA_real_, NA_real_), class = "Date"), vtlicensedate = structure(c(NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_,
NA_real_, NA_real_, NA_real_), class = "Date"), vtcertificatedate = structure(c(NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_,
NA_real_, NA_real_, NA_real_), class = "Date"), otherlicorcertdate = structure(c(NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_,
NA_real_, NA_real_, NA_real_), class = "Date"), petsstartdate = structure(c(18428,
NA, NA, NA, 17707, NA, NA, 16025, NA, 15895), class = "Date"),
jecvragencystaff = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0), jecvragencypurchase = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 0), jecpurchaseprovidertype = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 0), jecvrservicepurchaseexpenditure = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 0), wblevragencystaff = c(0, 0, 0,
0, 0, 0, 0, 0, 0, 0), wblevragencypurchase = c(0, 0, 0, 0,
0, 0, 0, 0, 0, 0), wblepurchaseprovidertype = c(0, 0, 0,
0, 0, 0, 0, 0, 0, 0), wblevrservicepurchaseexpenditure = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 0), ceovragencystaff = c(0, 0, 0,
0, 0, 0, 0, 0, 0, 0), ceovragencypurchase = c(0, 0, 0, 0,
0, 0, 0, 0, 0, 0), ceopurchaseprovidertype = c(0, 0, 0, 0,
0, 0, 0, 0, 0, 0), ceovrservicepurchaseexpenditure = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 0), wrtvragencystaff = c(0, 0, 0,
0, 0, 0, 0, 0, 0, 0), wrtvragencypurchase = c(0, 0, 0, 0,
0, 0, 0, 0, 0, 0), wrtpurchaseprovidertype = c(0, 0, 0, 0,
0, 0, 0, 0, 0, 0), wrtvrservicepurchaseexpenditure = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 0), isavragencystaff = c(0, 0, 0,
0, 0, 0, 0, 0, 0, 0), isavragencypurchase = c(0, 0, 0, 0,
0, 0, 0, 0, 0, 0), isapurchaseprovidertype = c(0, 0, 0, 0,
0, 0, 0, 0, 0, 0), isavrservicepurchaseexpenditure = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 0), vrservicestartdate = structure(c(NA,
NA, 16835, NA, 14670, 17458, 12262, 16414, 16463, 15895), class = "Date"),
careerservicedate = structure(c(18715, 18489, NA, NA, 18271,
18411, 17774, 18660, NA, NA), class = "Date"), gcutvragencypurchase = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 0), gcutpurchaseprovidertype = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 0), gcutvrtitleifundsexpended = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 0), gcutcompserviceprovider = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 0), gcutcompserviceprovidertype = c("",
"NULL", "", "", "", "", "", "", "", ""), fycutvragencypurchase = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 0), fycutpurchaseprovidertype = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 0), fycutvrtitleifundsexpended = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 0), fycutcompserviceprovider = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 0), fycutcompserviceprovidertype = c("",
"NULL", "", "", "", "", "", "", "", ""), jcctvragencypurchase = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 0), jcctpurchaseprovidertype = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 0), jcctvrtitleifundsexpended = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 0), jcctcompserviceprovider = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 0), jcctcompserviceprovidertype = c("",
"NULL", "", "", "", "", "", "", "", ""), ovtvragencystaff = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 0), ovtvragencypurchase = c(0, 0,
0, 0, 0, 0, 0, 0, 0, 0), ovtpurchaseprovidertype = c(0, 0,
0, 0, 0, 0, 0, 0, 0, 0), ovtvrtitleifundsexpended = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 0), ovtcompserviceprovider = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 0), ovtcompserviceprovidertype = c("",
"NULL", "", "", "", "", "", "", "", ""), ojtvragencystaff = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 0), ojtvragencypurchase = c(0, 0,
0, 0, 0, 0, 0, 0, 0, 0), ojtpurchaseprovidertype = c(0, 0,
0, 0, 0, 0, 0, 0, 0, 0), ojtvrtitleifundsexpended = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 0), ojtcompserviceprovider = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 0), ojtcompserviceprovidertype = c("",
"NULL", "", "", "", "", "", "", "", ""), ratvragencypurchase = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 0), ratpurchaseprovidertype = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 0), ratvrtitleifundsexpended = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 0), ratcompserviceprovider = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 0), ratcompserviceprovidertype = c("",
"NULL", "", "", "", "", "", "", "", ""), barltvragencystaff = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 0), barltvragencypurchase = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 0), barltpurchaseprovidertype = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 0), barltvrtitleifundsexpended = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 0), barltcompserviceprovider = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 0), barltcompserviceprovidertype = c("",
"NULL", "", "", "", "", "", "", "", ""), jrtvragencystaff = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 0), jrtvragencypurchase = c(0, 0,
0, 0, 0, 0, 0, 0, 0, 0), jrtpurchaseprovidertype = c(0, 0,
0, 0, 0, 0, 0, 0, 0, 0), jrtvrtitleifundsexpended = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 0), jrtcompserviceprovider = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 0), jrtcompserviceprovidertype = c("",
"NULL", "", "", "", "", "", "", "", ""), drstvragencystaff = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 0), drstvragencypurchase = c(0, 0,
0, 0, 0, 0, 0, 0, 0, 0), drstpurchaseprovidertype = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 0), drstvrtitleifundsexpended = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 0), drstcompserviceprovider = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 0), drstcompserviceprovidertype = c("",
"NULL", "", "", "", "", "", "", "", ""), mtvragencystaff = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 0), mtvragencypurchase = c(0, 0,
0, 0, 0, 0, 0, 0, 0, 0), mtpurchaseprovidertype = c(0, 0,
0, 0, 0, 0, 0, 0, 0, 0), mtvrtitleifundsexpended = c(0, 0,
0, 0, 0, 0, 0, 0, 0, 0), mtcompserviceprovider = c(0, 0,
0, 0, 0, 0, 0, 0, 0, 0), mtcompserviceprovidertype = c("",
"NULL", "", "", "", "", "", "", "", ""), rsetvragencystaff = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 0), rsetvragencypurchase = c(0, 0,
0, 0, 0, 0, 0, 0, 0, 0), rsetpurchaseprovidertype = c(FALSE,
FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE
), rsetvrtitleifundsexpended = c(0, 0, 0, 0, 0, 0, 0, 0,
0, 0), rsetcompserviceprovider = c(FALSE, FALSE, FALSE, FALSE,
FALSE, FALSE, FALSE, FALSE, FALSE, FALSE), rsetcompserviceprovidertype = c("NULL",
"NULL", "NULL", "NULL", "NULL", "NULL", "NULL", "NULL", "NULL",
"NULL"), ctvragencystaff = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0
), ctvragencypurchase = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0),
ctpurchaseprovidertype = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0),
ctvrtitleifundsexpended = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0),
ctcompserviceprovider = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0),
ctcompserviceprovidertype = c("NULL", "NULL", "NULL", "NULL",
"NULL", "NULL", "NULL", "NULL", "NULL", "NULL"), assvragencystaff = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 0), assvragencypurchase = c(0, 0,
0, 0, 0, 0, 0, 0, 0, 0), asspurchaseprovidertype = c(0, 0,
0, 0, 0, 0, 0, 0, 0, 0), assvrtitleifundsexpended = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 0), asscompserviceprovider = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 0), asscompserviceprovidertype = c("",
"NULL", "", "", "", "", "", "", "", ""), dtivragencystaff = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 0), dtivragencypurchase = c(0, 0,
0, 0, 0, 0, 0, 0, 0, 0), dtipurchaseprovidertype = c(0, 0,
0, 0, 0, 0, 0, 0, 0, 0), dtivrtitleifundsexpended = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 0), dticompserviceprovider = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 0), dticompserviceprovidertype = c("",
"NULL", "", "", "", "", "", "", "", ""), vrcgvragencystaff = c(0,
0, 1, 0, 0, 0, 0, 0, 0, 1), vrcgvragencypurchase = c(0, 0,
0, 0, 0, 0, 0, 0, 0, 0), vrcgpurchaseprovidertype = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 0), vrcgvrtitleifundsexpended = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 0), vrcgcompserviceprovider = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 0), vrcgcompserviceprovidertype = c("",
"NULL", "", "", "", "", "", "", "", ""), jsavragencystaff = c(0,
0, 1, 0, 0, 0, 0, 0, 0, 0), jsavragencypurchase = c(0, 0,
0, 0, 0, 0, 0, 0, 0, 0), jsapurchaseprovidertype = c(0, 0,
0, 0, 0, 0, 0, 0, 0, 0), jsavrtitleifundsexpended = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 0), jsacompserviceprovider = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 0), jsacompserviceprovidertype = c("",
"NULL", "", "", "", "", "", "", "", ""), jpavragencystaff = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 0), jpavragencypurchase = c(0, 0,
0, 0, 0, 0, 0, 0, 0, 0), jpapurchaseprovidertype = c(0, 0,
0, 0, 0, 0, 0, 0, 0, 0), jpavrtitleifundsexpended = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 0), jpacompserviceprovider = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 0), jpacompserviceprovidertype = c("",
"NULL", "", "", "", "", "", "", "", ""), stjsvragencystaff = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 0), stjsvragencypurchase = c(0, 0,
0, 0, 0, 0, 0, 0, 0, 0), stjspurchaseprovidertype = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 0), stjsvrtitleifundsexpended = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 0), stjscompserviceprovider = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 0), stjscompserviceprovidertype = c("",
"NULL", "", "", "", "", "", "", "", ""), sesvragencystaff = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 0), sesvragencypurchase = c(0, 0,
0, 0, 0, 0, 0, 0, 0, 0), sespurchaseprovidertype = c(0, 0,
0, 0, 0, 0, 0, 0, 0, 0), sesvrtitleifundsexpended = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 0), sessetitlevifundsexpended = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 0), sescompserviceprovider = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 0), sescompserviceprovidertype = c("",
"NULL", "", "", "", "", "", "", "", ""), irsvragencystaff = c(0,
0, 1, 0, 0, 0, 0, 0, 0, 0), irsvragencypurchase = c(0, 0,
0, 0, 0, 0, 0, 0, 0, 0), irspurchaseprovidertype = c(0, 0,
0, 0, 0, 0, 0, 0, 0, 0), irsvrtitleifundsexpended = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 0), irscompserviceprovider = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 0), irscompserviceprovidertype = c("",
"NULL", "", "", "", "", "", "", "", ""), bcvragencystaff = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 0), bcvragencypurchase = c(0, 0,
0, 0, 0, 0, 0, 0, 0, 0), bcpurchaseprovidertype = c(0, 0,
0, 0, 0, 0, 0, 0, 0, 0), bcvrtitleifundsexpended = c(0, 0,
0, 0, 0, 0, 0, 0, 0, 0), bccompserviceprovider = c(0, 0,
0, 0, 0, 0, 0, 0, 0, 0), bccompserviceprovidertype = c("",
"NULL", "", "", "", "", "", "", "", ""), cesvragencystaff = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 0), cesvragencypurchase = c(0, 0,
0, 0, 0, 0, 0, 0, 0, 0), cespurchaseprovidertype = c(0, 0,
0, 0, 0, 0, 0, 0, 0, 0), cesvrtitleifundsexpended = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 0), cessetitlevifundsexpended = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 0), cescompserviceprovider = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 0), cescompserviceprovidertype = c("",
"NULL", "", "", "", "", "", "", "", ""), esvragencystaff = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 0), esvragencypurchase = c(0, 0,
0, 0, 0, 0, 0, 0, 0, 0), espurchaseprovidertype = c(0, 0,
0, 0, 0, 0, 0, 0, 0, 0), esvrtitleifundsexpended = c(0, 0,
0, 0, 0, 0, 0, 0, 0, 0), essetitlevifundsexpended = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 0), tranvragencystaff = c(0, 0, 0,
0, 0, 0, 0, 0, 0, 0), tranvragencypurchase = c(0, 0, 0, 0,
0, 0, 0, 0, 0, 0), tranpurchaseprovidertype = c(0, 0, 0,
0, 0, 0, 0, 0, 0, 0), tranvrtitleifundsexpended = c(0, 0,
0, 0, 0, 0, 0, 0, 0, 0), trancompserviceprovider = c(0, 0,
0, 0, 0, 0, 0, 0, 0, 0), trancompserviceprovidertype = c("",
"NULL", "", "", "", "", "", "", "", ""), mntvragencystaff = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 0), mntvragencypurchase = c(0, 0,
0, 0, 0, 0, 0, 0, 0, 0), mntpurchaseprovidertype = c(0, 0,
0, 0, 0, 0, 0, 0, 0, 0), mntvrtitleifundsexpended = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 0), mntcompserviceprovider = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 0), mntcompserviceprovidertype = c("",
"NULL", "", "", "", "", "", "", "", ""), rtvragencystaff = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 0), rtvragencypurchase = c(0, 0,
0, 0, 0, 0, 0, 0, 0, 0), rtpurchaseprovidertype = c(0, 0,
0, 0, 0, 0, 0, 0, 0, 0), rtvrtitleifundsexpended = c(0, 0,
0, 0, 0, 0, 0, 0, 0, 0), rtcompserviceprovider = c(0, 0,
0, 0, 0, 0, 0, 0, 0, 0), rtcompserviceprovidertype = c("",
"NULL", "", "", "", "", "", "", "", ""), pasvragencystaff = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 0), pasvragencypurchase = c(0, 0,
0, 0, 0, 0, 0, 0, 0, 0), paspurchaseprovidertype = c(0, 0,
0, 0, 0, 0, 0, 0, 0, 0), pasvrtitleifundsexpended = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 0), pascompserviceprovider = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 0), pascompserviceprovidertype = c("",
"NULL", "", "", "", "", "", "", "", ""), tasvragencystaff = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 0), tasvragencypurchase = c(0, 0,
0, 0, 0, 0, 0, 0, 0, 0), taspurchaseprovidertype = c(0, 0,
0, 0, 0, 0, 0, 0, 0, 0), tasvrtitleifundsexpended = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 0), tascompserviceprovider = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 0), tascompserviceprovidertype = c("",
"NULL", "", "", "", "", "", "", "", ""), rsvragencystaff = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 0), rsvragencypurchase = c(0, 0,
0, 0, 0, 0, 0, 0, 0, 0), rspurchaseprovidertype = c(0, 0,
0, 0, 0, 0, 0, 0, 0, 0), rsvrtitleifundsexpended = c(0, 0,
0, 0, 0, 0, 0, 0, 0, 0), rscompserviceprovider = c(0, 0,
0, 0, 0, 0, 0, 0, 0, 0), rscompserviceprovidertype = c("",
"NULL", "", "", "", "", "", "", "", ""), isvragencystaff = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 0), isvragencypurchase = c(0, 0,
0, 0, 0, 0, 0, 0, 0, 0), ispurchaseprovidertype = c(0, 0,
0, 0, 0, 0, 0, 0, 0, 0), isvrtitleifundsexpended = c(0, 0,
0, 0, 0, 0, 0, 0, 0, 0), iscompserviceprovider = c(0, 0,
0, 0, 0, 0, 0, 0, 0, 0), iscompserviceprovidertype = c("",
"NULL", "", "", "", "", "", "", "", ""), osvragencystaff = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 0), osvragencypurchase = c(0, 0,
0, 0, 0, 0, 0, 0, 0, 0), ospurchaseprovidertype = c(0, 0,
0, 0, 0, 0, 0, 0, 0, 0), osvrtitleifundsexpended = c(0, 0,
0, 0, 0, 0, 0, 0, 0, 0), oscompserviceprovider = c(0, 0,
0, 0, 0, 0, 0, 0, 0, 0), oscompserviceprovidertype = c("",
"NULL", "", "", "", "", "", "", "", ""), edfuncleveldate = structure(c(NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_,
NA_real_, NA_real_, NA_real_), class = "Date"), secondarydate = structure(c(NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_,
NA_real_, NA_real_, NA_real_), class = "Date"), postsectransreportcarddate = structure(c(NA,
NA, NA, NA, NA, NA, NA, 18295, NA, NA), class = "Date"),
trainingmilestonedate = structure(c(NA_real_, NA_real_, NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_,
NA_real_), class = "Date"), skillgainskillsprogdate = structure(c(NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_,
NA_real_, NA_real_, NA_real_), class = "Date"), eoprimoccupationstartdate = structure(c(NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_,
NA_real_, NA_real_, NA_real_), class = "Date"), exitdate = structure(c(18738,
18535, 18501, 18563, 18704, 18681, 18505, 18590, 18648, 18528
), class = "Date"), exittype = c(4L, 3L, 4L, 0L, 4L, 4L,
4L, 4L, 4L, 4L), exitreason = c(18L, 19L, 18L, 19L, 17L,
17L, 17L, 2L, 18L, 17L), exitempoutcome = c(NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA), exitprimoccupation = c("", "NULL",
"", "", "", "", "", "", "", ""), exithourlywage = c(0, NA,
0, 0, 0, 0, 0, 0, 0, 0), exitweeklyhoursworked = c(0, NA,
0, 0, 0, 0, 0, 0, 0, 0), pecredprogramenrolldate = structure(c(NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_,
NA_real_, NA_real_, NA_real_), class = "Date"), pecredattainmentdate = structure(c(NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_,
NA_real_, NA_real_, NA_real_), class = "Date"), pecredentialtype = c(NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA), appmonthlypubsup = c("0",
"0", "_0", "", "0", "2", "0", "0", "0", "2"), appmedinscov = c("7",
"1", "_0", "", "0", "1", "0", "7", "7", "1"), exitmonthlypubsup = c("0",
"0", "_0", "", "0", "2", "2", "0", "0", "2"), exitmedinscov = c("7",
"1", "_0", "0", "0", "1", "0", "7", "7", "1"), ipeinitialdate = structure(c(18473,
NA, 16835, NA, 14656, 17458, 12262, 16414, 16463, 15895), class = "Date"),
ipeextensiondate = structure(c(NA, 18515, NA, NA, NA, NA,
NA, NA, NA, NA), class = "Date"), enrolledinsecequiv = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 0), compdisenrollmsg = structure(c(NA,
NA, NA, NA, NA, NA, NA, 18586, NA, NA), class = "Date"),
wblevvragencystaff = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0), wblevvragencypurchase = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 0), wblevpurchaseprovidertype = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 0), wblevvrtitleifundsexpended = c(0,
0, 0, 0, 0, 111111, 0, 0, 0, -111111), wblevcompserviceprovider = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 0), wblevcompserviceprovidertype = c(NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA), age_app = c(18, 0, 0,
6, 9, 9, 9, 10, 10, 10), age_ipe = c(18, NA, 0, NA, 10, 10,
9, 15, 14, 10), age_preets = c(18, NA, NA, NA, 18, NA, NA,
14, NA, 10), age_exit = c(19, 0, 4, 6, 21, 13, 26, 21, 20,
17), age_vrservice = c(NA, NA, 0, NA, 10, 10, 9, 15, 14,
10)), row.names = c(NA, -10L), class = c("data.table", "data.frame"
), .internal.selfref = <pointer: 0x0000021fcf475940>)
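I suspect the problem is to do with variables existing in the wrong environment. Try rewriting the validate_and_list function so that it explicitly returns something, rather than relying on it silently changing the value of invalid_rows_list, which is only defined outside that function.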
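As a rough sketch of what that suggestion could look like: the helper below returns a data.table of invalid entries instead of modifying a variable outside itself, and the caller combines the results. The name find_invalid, the toy data, and the shortened error message are assumptions made for illustration; they are not part of the original function.

```r
library(data.table)

# Hypothetical helper (not from the original post): returns the invalid
# entries for a group of columns instead of changing an outside variable.
find_invalid <- function(data, col_names, condition, error_message) {
  results <- lapply(col_names, function(col) {
    column_data <- data[[col]]
    invalid_rows <- which(!condition(column_data) & !is.na(column_data))
    if (length(invalid_rows) == 0) return(NULL)
    data.table(
      Column = col,
      Row_Index = invalid_rows,
      # Store values as character so numeric and character columns can be stacked
      Value = as.character(column_data[invalid_rows]),
      Message = error_message
    )
  })
  rbindlist(results)  # NULL entries are skipped
}

# Toy data for illustration only
toy <- data.table(
  asian = c("0", "A", "1"),
  hawaiian = c(0, 2, 9)
)

is_demographic <- function(x) x %in% c(0, 1, 9)

invalid <- rbindlist(list(
  find_invalid(toy, "asian", is_demographic, "Must be 0, 1, or 9"),
  find_invalid(toy, "hawaiian", is_demographic, "Must be 0, 1, or 9")
))
print(invalid)
```

Because each call hands back its own result, nothing depends on which environment invalid_rows_list lives in.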
@AndrewGustar is right. You need the <<- operator:
# # Function to check and record invalid entries
# validate_and_list <- function(col_names, condition, error_message) {
# for (col in col_names) {
# if (!col %in% names(data)) {
# cat(sprintf("Warning: Column '%s' does not exist in the data.\n", col))
# next
# }
# column_data <- data[[col]]
# invalid_rows <- which(!condition(column_data) & !is.na(column_data))
# if (length(invalid_rows) > 0) {
# cat(sprintf("Invalid data found in column '%s':\n", col))
# print(data[invalid_rows, .(Row = .I, Value = column_data[invalid_rows])])
invalid_rows_list[[length(invalid_rows_list) + 1]] <<- data.table( # <--- HERE
# Column = col,
# Row_Index = invalid_rows,
# Value = column_data[invalid_rows],
# Message = error_message
# )
# }
# }
# }
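You see, invalid_rows_list is defined one environment above the for loop: it lives in detox_validation, not inside validate_and_list. A plain <- inside validate_and_list therefore only modifies a local copy, which is discarded when that function returns.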
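The difference between the two assignment operators can be seen in a stripped-down example. The function and variable names here (outer_local, outer_super, add_item, found) are made up for illustration and only mimic the structure of detox_validation and validate_and_list.

```r
# Plain assignment: the inner function modifies its own local copy.
outer_local <- function() {
  found <- list()
  add_item <- function(x) {
    found[[length(found) + 1]] <- x   # local copy, lost when add_item returns
  }
  add_item("a")
  add_item("b")
  length(found)
}

# Superassignment: the inner function modifies the enclosing function's variable.
outer_super <- function() {
  found <- list()
  add_item <- function(x) {
    found[[length(found) + 1]] <<- x  # updates found in outer_super
  }
  add_item("a")
  add_item("b")
  length(found)
}

outer_local()  # 0
outer_super()  # 2
```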
The output of detox_validation with the <<- fix:
output <- detox_validation(check_val)
> output
Column Row_Index Value Message
<char> <int> <char> <char>
1: programyear 10 20201 Program year must be exactly four digits.
2: asian 9 A Demographic values must be numeric and either 0, 1, or 9
3: hawaiian 2 2 Demographic values must be numeric and either 0, 1, or 9
4: gcutvrtitleifundsexpended 1 0 Sex values must be numeric and either 1 or 2
5: gcutvrtitleifundsexpended 2 0 Sex values must be numeric and either 1 or 2
---
349: wblevvrtitleifundsexpended 6 111111 Sex values must be numeric and either 1 or 2
350: wblevvrtitleifundsexpended 7 0 Sex values must be numeric and either 1 or 2
351: wblevvrtitleifundsexpended 8 0 Sex values must be numeric and either 1 or 2
352: wblevvrtitleifundsexpended 9 0 Sex values must be numeric and either 1 or 2
353: wblevvrtitleifundsexpended 10 -111111 Sex values must be numeric and either 1 or 2
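Note that rows 4 onward in this output are "fundsexpended" columns carrying the sex error message. One plausible cause, offered here as an observation rather than something stated in the answer, is that the unanchored pattern "sex" also matches inside "fundsexpended". A small check on a few of the column names:

```r
cols <- c("sex", "gcutvrtitleifundsexpended", "programyear")

# Unanchored: matches any name containing the letters "sex"
unanchored <- grep("sex", cols, value = TRUE)
print(unanchored)

# Anchored: matches only the column named exactly "sex"
anchored <- grep("^sex$", cols, value = TRUE)
print(anchored)
```

Whether "^sex$" is the right pattern depends on the real column names; if other sex-related columns exist, the pattern would need to be adjusted accordingly.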