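Context: I have read the PISA 2022 data using the haven package, and now I want to create an auxiliary df made up of three columns:
The problem is that if I type attributes(pisa_df$EFFORT1), I can access the labels and values, but if I type attributes(pisa_df[,i]), I cannot. Why is that, and is there a way around it? I have more than 1000 variables, so I cannot type them one by one. I tried something like pisa_df$get(colnames(pisa_df)[i]), but of course it doesn't work.
This seems like a very novice question, but I don't even know how to search for possible answers. Thanks in advance!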
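Up front, the reason attributes(pisa_df$EFFORT1) works but attributes(pisa_df[,1]) does not is the one discussed in "Why does subsetting a column from a data frame vs. a tibble give different results". That is, in base R, [.data.frame drops to a vector when the result is reduced to a single column, but tbl_df does not. Base [ can optionally be told not to simplify to a vector by adding the drop=FALSE argument.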
library(tibble)  # needed for tibble() below
mt <- mtcars[1:3,]
mt[,1]
# [1] 21.0 21.0 22.8
mt[,1, drop=FALSE]
#                mpg
# Mazda RX4     21.0
# Mazda RX4 Wag 21.0
# Datsun 710    22.8
tibble(mt)[,1]
# # A tibble: 3 × 1
#     mpg
#   <dbl>
# 1  21
# 2  21
# 3  22.8
The workaround is to use $ with the column name or [[ with the column index:
mt[[1]]
# [1] 21.0 21.0 22.8
tibble(mt)[[1]]
# [1] 21.0 21.0 22.8
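Applied to the original problem, the same idea lets you loop over every column by index. The following is only a minimal sketch: toy is a made-up stand-in for pisa_df, and get_label is a hypothetical helper (not part of haven or tibble) that returns NA when a column carries no label.

```r
library(tibble)

# Made-up stand-in for pisa_df: two labelled columns and one without a label
toy <- tibble(
  EFFORT1 = structure(c(10, 9, 10), label = "Effort put into this test"),
  OCOD1   = structure(c("243", "8189", "9999"), label = "Occupation code - Mother"),
  ID      = 1:3
)

# [, i] on a tibble returns a one-column tibble, so the column's label is not there
is.null(attr(toy[, 1], "label"))

# [[i]] returns the column vector itself, with its attributes
attr(toy[[1]], "label")

# Hypothetical helper: label of column i, or NA if the column has none
get_label <- function(df, i) {
  lab <- attr(df[[i]], "label", exact = TRUE)
  if (is.null(lab)) NA_character_ else lab
}

# Visit every column by index, no names typed by hand
labels <- vapply(seq_along(toy), function(i) get_label(toy, i), character(1))
names(labels) <- names(toy)
labels
```

With more than 1000 variables, seq_along(pisa_df) would take the place of seq_along(toy).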
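In your case, working with the SAS file takes very little effort to get what you need. Using the "School questionnaire file" (which was easy for me to get), we can do something like the following.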
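Up front, I'm demonstrating how to get the label and the unique values for a few columns. Some columns are entirely unique (for example, SCH has 21,629 rows and the CNTSCHID column has 21,629 distinct values), so I'm not sure whether that is interesting to you. Regardless, while I select only a few, you can use this on all of them without any problem.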
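Also, some values are character and some are numeric, so we must either convert all numbers to strings or keep two separate columns. I'll go with the latter for the demonstration, since I think converting everything to strings is the easier change for you to make on your own.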
SCH <- haven::read_sas(unz("SCH_QQQ_SAS.zip", "cy08msp_sch_qqq.sas7bdat"))
library(dplyr)
columns <- c(1, 3, 4, 5)
quux <- lapply(columns, function(ind) {
  # column name and its label attribute
  out <- tibble(column = names(SCH)[ind], label = attributes(SCH[[ind]])$label)
  # unique values go into a type-specific column
  if (is.character(SCH[[ind]])) {
    cbind(out, tibble(values_chr = unique(SCH[[ind]])))
  } else cbind(out, tibble(values_num = unique(SCH[[ind]])))
}) |>
  bind_rows() |>
  tibble()
quux
# # A tibble: 21,794 × 4
#    column label                    values_chr values_num
#    <chr>  <chr>                    <chr>           <dbl>
#  1 CNT    Country code 3-character ALB                NA
#  2 CNT    Country code 3-character QAZ                NA
#  3 CNT    Country code 3-character ARG                NA
#  4 CNT    Country code 3-character AUS                NA
#  5 CNT    Country code 3-character AUT                NA
#  6 CNT    Country code 3-character BEL                NA
#  7 CNT    Country code 3-character BRA                NA
#  8 CNT    Country code 3-character BRN                NA
#  9 CNT    Country code 3-character BGR                NA
# 10 CNT    Country code 3-character KHM                NA
# # ℹ 21,784 more rows
# # ℹ Use `print(n = ...)` to see more rows
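The idea is that if a particular column is character, then you use values_chr (for whatever work you are doing), and values_num otherwise. If you select only character columns, you can drop the if/else and output a single values column holding the distinct strings.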
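A minimal sketch of that character-only variant follows. The data frame toy and its labels are made up for illustration; with the real file, SCH would take its place.

```r
library(dplyr)

# Made-up stand-in for SCH: two character columns and one numeric column
toy <- tibble(
  CNT      = structure(c("ALB", "ALB", "ARG"), label = "Country code 3-character"),
  STRATUM  = structure(c("ALB01", "ALB01", "ARG02"), label = "Stratum ID"),
  SCHLTYPE = structure(c(1, 2, 2), label = "School type")
)

# Keep only the indices of character columns, so no if/else is needed
columns <- unname(which(vapply(toy, is.character, logical(1))))

quux_chr <- lapply(columns, function(ind) {
  # column and label are length 1 and get recycled against the unique values
  tibble(
    column = names(toy)[ind],
    label  = attr(toy[[ind]], "label"),
    values = unique(toy[[ind]])
  )
}) |>
  bind_rows()

quux_chr
```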
If needed, this can be done without dplyr with only a little more effort.
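One possible base-R version is sketched below, under the assumption that a plain data.frame result is acceptable. The data frame toy is made up for illustration, and both value columns are always created so that rbind() can stack the pieces.

```r
# Made-up stand-in for SCH, built with base R only
toy <- data.frame(
  CNT      = c("ALB", "ALB", "ARG"),
  SCHLTYPE = c(1, 2, 2),
  stringsAsFactors = FALSE
)
attr(toy$CNT, "label")      <- "Country code 3-character"
attr(toy$SCHLTYPE, "label") <- "School type"

pieces <- lapply(seq_along(toy), function(ind) {
  col  <- toy[[ind]]
  # as.vector() strips the label attribute before taking unique values
  vals <- unique(as.vector(col))
  data.frame(
    column     = names(toy)[ind],
    label      = attr(col, "label"),
    values_chr = if (is.character(col)) vals else NA_character_,
    values_num = if (is.numeric(col)) vals else NA_real_,
    stringsAsFactors = FALSE
  )
})

# Stack the per-column pieces, the base equivalent of bind_rows()
result <- do.call(rbind, pieces)
result
```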
Consider the sample data (dput at the end):
> aux_df
# A tibble: 3 × 6
  EFFORT1 EFFORT2 OCOD1 OCOD2 OCOD3 PROGN
    <dbl>   <dbl> <chr> <chr> <chr> <chr>
1      10      10 243   9412  9999  00080002
2       9       8 8189  9999  9999  00080001
3      10      10 9999  9999  9999  00080001
Let's use purrr's map_dfr, pluck, and attr_getter to build a tibble with the names and labels:
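library(dplyr)  # tibble(), filter(), select(), mutate(), cross_join()
library(purrr)  # map_dfr(), pluck(), attr_getter()

aux_labels <- map_dfr(
  colnames(aux_df),
  \(x) tibble(
    column = x,
    label = pluck(aux_df, x, attr_getter("label"))))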
Names and labels output:
> aux_labels
# A tibble: 6 × 2
  column  label
  <chr>   <chr>
1 EFFORT1 How much effort did you put into this test? (after cognitive assessment)
2 EFFORT2 How much effort would you have invested? (after cognitive assessment)
3 OCOD1   ISCO-08 Occupation code - Mother
4 OCOD2   ISCO-08 Occupation code - Father
5 OCOD3   ISCO-08 Occupation code - Self
6 PROGN   Unique national study programme code
Now, pick a column and combine it with its values:
aux_output <- cross_join(
  filter(aux_labels, column == "EFFORT1"),
  select(aux_df, value = "EFFORT1"))
Output:
> aux_output
# A tibble: 3 × 3
  column  label                                                                    value
  <chr>   <chr>                                                                    <dbl>
1 EFFORT1 How much effort did you put into this test? (after cognitive assessment)    10
2 EFFORT1 How much effort did you put into this test? (after cognitive assessment)     9
3 EFFORT1 How much effort did you put into this test? (after cognitive assessment)    10
If you want to process some (or all) columns at once, allowing for columns of different classes, you could try the following:
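# Choose some columns (or all: colnames(aux_df))
aux_columns <- c("EFFORT1", "EFFORT2", "OCOD1")

aux_output <- map_dfr(
  aux_columns,
  \(x) cross_join(
    filter(aux_labels, column == x),
    aux_df %>%
      # all_of() makes it explicit that x holds a column name
      select(value = all_of(x)) %>%
      # record the original class, then convert to character so columns can be stacked
      mutate(value_class = class(value), value = as.character(value))))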
Output:
> aux_output
# A tibble: 9 × 4
  column  label                                                                    value value_class
  <chr>   <chr>                                                                    <chr> <chr>
1 EFFORT1 How much effort did you put into this test? (after cognitive assessment) 10    numeric
2 EFFORT1 How much effort did you put into this test? (after cognitive assessment) 9     numeric
3 EFFORT1 How much effort did you put into this test? (after cognitive assessment) 10    numeric
4 EFFORT2 How much effort would you have invested? (after cognitive assessment)    10    numeric
5 EFFORT2 How much effort would you have invested? (after cognitive assessment)    8     numeric
6 EFFORT2 How much effort would you have invested? (after cognitive assessment)    10    numeric
7 OCOD1   ISCO-08 Occupation code - Mother                                         243   character
8 OCOD1   ISCO-08 Occupation code - Mother                                         8189  character
9 OCOD1   ISCO-08 Occupation code - Mother                                         9999  character
That's it. Here is the dput of the sample data:
aux_df <- structure(list(
  EFFORT1 = structure(
    c(10, 9, 10),
    label = "How much effort did you put into this test? (after cognitive assessment)"),
  EFFORT2 = structure(
    c(10, 8, 10),
    label = "How much effort would you have invested? (after cognitive assessment)"),
  OCOD1 = structure(
    c("243", "8189", "9999"),
    label = "ISCO-08 Occupation code - Mother"),
  OCOD2 = structure(
    c("9412", "9999", "9999"),
    label = "ISCO-08 Occupation code - Father"),
  OCOD3 = structure(
    c("9999", "9999", "9999"),
    label = "ISCO-08 Occupation code - Self"),
  PROGN = structure(
    c("00080002", "00080001", "00080001"),
    label = "Unique national study programme code")),
  row.names = c(NA, -3L),
  class = c("tbl_df", "tbl", "data.frame"))