Why is df$VARNAME different from df[,i] when reading data with the R haven package?

Question (votes: 0, answers: 2)

Context: I have read the PISA 2022 data with the haven package, and now I want to create an auxiliary df made up of three columns:

  • Variable name (e.g., EFFORT1)
  • Variable label (e.g., "How much effort did you put into this test?")
  • Variable values (e.g., 1, 2, 3, ...)

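The problem is that if I type attributes(pisa_df$EFFORT1), I can access the labels and values, but if I type attributes(pisa_df[,i]), I cannot. Why is that, and is there a way around it? I have more than 1000 variables, so I can't type them out one by one. I tried something like pisa_df$get(colnames(pisa_df)[i]), but of course that doesn't work.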

This seems like a very basic question, but I don't even know how to search for possible answers. Thanks in advance!

r vector atomic r-haven
2 Answers

2 votes

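Up front: the reason attributes(pisa_df$EFFORT1) works but attributes(pisa_df[,1]) does not is covered in "Why does subsetting a column from a data frame vs. a tibble give different results". In base R, [.data.frame drops to a vector when the result is a single column, whereas a tbl_df does not and returns a one-column tibble. The column's attributes live on the vector, not on the enclosing tibble, so they are not visible from pisa_df[,1]. Base [ can optionally be told not to simplify to a vector by adding the drop=FALSE argument.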

library(tibble)  # provides tibble()
mt <- mtcars[1:3,]
mt[,1]
# [1] 21.0 21.0 22.8
mt[,1, drop=FALSE]
#                mpg
# Mazda RX4     21.0
# Mazda RX4 Wag 21.0
# Datsun 710    22.8
tibble(mt)[,1]
# # A tibble: 3 × 1
#     mpg
#   <dbl>
# 1  21  
# 2  21  
# 3  22.8

The fix is to use $ with a column name, or [[ with a column index:

mt[[1]]
# [1] 21.0 21.0 22.8
tibble(mt)[[1]]
# [1] 21.0 21.0 22.8
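To tie this back to the attributes in the question, here is a minimal sketch. The tibble demo and its label text are hypothetical stand-ins for a haven-imported table, built by hand so the example runs without the PISA files. It shows that the label attribute is visible through [[ and $ but not through single-bracket subsetting of a tibble.

```r
library(tibble)

# Hypothetical stand-in for one haven-imported column: a numeric vector
# carrying a "label" attribute.
demo <- tibble(
  EFFORT1 = structure(c(10, 9, 10),
                      label = "How much effort did you put into this test?")
)

i <- 1

# Single-bracket subsetting returns a one-column tibble, so attributes()
# describes the tibble itself (names, row.names, class), not the column.
names(attributes(demo[, i]))

# Double-bracket subsetting returns the column vector, label included.
attr(demo[[i]], "label")
```

Because [[ accepts either an index or a name, the same call can be used inside a loop over seq_along(demo) or names(demo).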

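In your case, working with the SAS files, it takes little effort to get what you need. Using the "School questionnaire data file" (which was easy for me to obtain), we can do something like the following.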

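Below I demonstrate how to get the labels and unique values for a few columns. Some columns are entirely unique (e.g., SCH has 21,629 rows and the CNTSCHID column has 21,629 distinct values), so I'm not sure those will be interesting to you. In any case, although I picked only a few columns, you can apply this to all of them without any problem.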

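Also, some values are character and some are numeric, so we either have to convert all numbers to strings or keep two separate columns. I'll go with the latter for the demonstration, since I think converting everything to strings is the easier change for you to make yourself.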

SCH <- haven::read_sas(unz("SCH_QQQ_SAS.zip", "cy08msp_sch_qqq.sas7bdat"))
library(dplyr)
columns <- c(1, 3, 4, 5)  # indices of the columns to summarize
quux <- lapply(columns, function(ind) {
  # name and label of this column (a single row, recycled by cbind below)
  out <- tibble(column = names(SCH)[ind], label = attributes(SCH[[ind]])$label)
  # unique values go into values_chr or values_num depending on the column type
  if (is.character(SCH[[ind]])) {
    cbind(out, tibble(values_chr = unique(SCH[[ind]])))
  } else cbind(out, tibble(values_num = unique(SCH[[ind]])))
}) |>
  bind_rows() |>
  tibble()
quux
# # A tibble: 21,794 × 4
#    column label                    values_chr values_num
#    <chr>  <chr>                    <chr>           <dbl>
#  1 CNT    Country code 3-character ALB                NA
#  2 CNT    Country code 3-character QAZ                NA
#  3 CNT    Country code 3-character ARG                NA
#  4 CNT    Country code 3-character AUS                NA
#  5 CNT    Country code 3-character AUT                NA
#  6 CNT    Country code 3-character BEL                NA
#  7 CNT    Country code 3-character BRA                NA
#  8 CNT    Country code 3-character BRN                NA
#  9 CNT    Country code 3-character BGR                NA
# 10 CNT    Country code 3-character KHM                NA
# # ℹ 21,784 more rows
# # ℹ Use `print(n = ...)` to see more rows

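The idea is that if a particular column is character, you use values_chr (for whatever work you are doing). If you select only character columns, you can drop the if/else and output a single values column of distinct strings.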

If needed, this can be done without dplyr with just a little more effort.
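As a rough illustration of what a dplyr-free version might look like, here is a base-R sketch. The data frame SCH_demo, its SIZE column, and both label strings are hypothetical stand-ins so that the example runs without the PISA download; the shape of the result mirrors the quux table above.

```r
# Hypothetical stand-in for the imported file: two columns carrying
# "label" attributes, one character and one numeric.
SCH_demo <- data.frame(
  CNT  = c("ALB", "ALB", "ARG"),
  SIZE = c(1, 2, 2),
  stringsAsFactors = FALSE
)
attr(SCH_demo$CNT, "label")  <- "Country code 3-character"
attr(SCH_demo$SIZE, "label") <- "Hypothetical size variable"

pieces <- lapply(names(SCH_demo), function(nm) {
  col  <- SCH_demo[[nm]]                   # [[ keeps the column's attributes
  lab  <- attr(col, "label", exact = TRUE)
  vals <- as.vector(unique(col))           # plain vector of distinct values
  data.frame(
    column     = nm,
    label      = if (is.null(lab)) NA_character_ else lab,
    values_chr = if (is.character(col)) vals else NA_character_,
    values_num = if (is.numeric(col)) vals else NA_real_,
    stringsAsFactors = FALSE
  )
})

# Stack the per-column pieces into one long data frame.
result <- do.call(rbind, pieces)
result
```

To run it over every column of a real import, replace SCH_demo with the imported data frame; names() already covers all columns.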


0 votes

Consider this sample data (the dput is at the end):

> aux_df
# A tibble: 3 × 6
  EFFORT1 EFFORT2 OCOD1 OCOD2 OCOD3 PROGN   
    <dbl>   <dbl> <chr> <chr> <chr> <chr>   
1      10      10 243   9412  9999  00080002
2       9       8 8189  9999  9999  00080001
3      10      10 9999  9999  9999  00080001

Let's use purrr's map, pluck, and attr_getter to create a tibble with the names and labels:

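library(dplyr)  # tibble(), filter(), select(), mutate(), cross_join()
library(purrr)  # map_dfr(), pluck(), attr_getter()
aux_labels <- map_dfr(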
  colnames(aux_df), 
  \(x) tibble(
    column = x, 
    label = pluck(aux_df, x, attr_getter("label"))))

Names and labels output:

> aux_labels
# A tibble: 6 × 2
  column  label                                                                   
  <chr>   <chr>                                                                   
1 EFFORT1 How much effort did you put into this test? (after cognitive assessment)
2 EFFORT2 How much effort would you have invested? (after cognitive assessment)   
3 OCOD1   ISCO-08 Occupation code - Mother                                        
4 OCOD2   ISCO-08 Occupation code - Father                                        
5 OCOD3   ISCO-08 Occupation code - Self                                          
6 PROGN   Unique national study programme code  
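For comparison, the same name/label table can be sketched in base R with vapply. The data frame demo_df, its NOLABEL column, and the name aux_labels_base are hypothetical and only illustrate the idea; columns without a label attribute get NA here.

```r
# Hypothetical stand-in data: two labelled columns and one without a label.
demo_df <- data.frame(
  EFFORT1 = c(10, 9, 10),
  OCOD1   = c("243", "8189", "9999"),
  NOLABEL = 1:3,
  stringsAsFactors = FALSE
)
attr(demo_df$EFFORT1, "label") <- "How much effort did you put into this test?"
attr(demo_df$OCOD1, "label")   <- "ISCO-08 Occupation code - Mother"

# Iterating over a data frame visits each column as a vector, so the
# column-level attributes are available inside the function.
labels <- vapply(demo_df, function(col) {
  lab <- attr(col, "label", exact = TRUE)
  if (is.null(lab)) NA_character_ else lab
}, character(1))

aux_labels_base <- data.frame(
  column = names(labels),
  label  = unname(labels),
  stringsAsFactors = FALSE
)
aux_labels_base
```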

Now pick one column and combine it with its corresponding values:

aux_output <- cross_join(
  filter(aux_labels, column == "EFFORT1"), 
  select(aux_df, value = "EFFORT1"))

Output:

> aux_output
# A tibble: 3 × 3
  column  label                                                                    value
  <chr>   <chr>                                                                    <dbl>
1 EFFORT1 How much effort did you put into this test? (after cognitive assessment)    10
2 EFFORT1 How much effort did you put into this test? (after cognitive assessment)     9
3 EFFORT1 How much effort did you put into this test? (after cognitive assessment)    10

If you want to process several (or all) columns at once, allowing for columns of different classes, you can try the following:

# Choose some columns (or all: colnames(aux_df))
aux_columns <- c("EFFORT1", "EFFORT2", "OCOD1")

# For each column, pair its label row with its values (as strings) and its class
aux_output <- map_dfr(
  aux_columns,
  \(x) cross_join(
    filter(aux_labels, column == x), 
    aux_df %>% 
      select(value = all_of(x)) %>% 
      mutate(value_class = class(value)[1], value = as.character(value))))

Output:

> aux_output
# A tibble: 9 × 4
  column  label                                                                    value value_class
  <chr>   <chr>                                                                    <chr> <chr>      
1 EFFORT1 How much effort did you put into this test? (after cognitive assessment) 10    numeric    
2 EFFORT1 How much effort did you put into this test? (after cognitive assessment) 9     numeric    
3 EFFORT1 How much effort did you put into this test? (after cognitive assessment) 10    numeric    
4 EFFORT2 How much effort would you have invested? (after cognitive assessment)    10    numeric    
5 EFFORT2 How much effort would you have invested? (after cognitive assessment)    8     numeric    
6 EFFORT2 How much effort would you have invested? (after cognitive assessment)    10    numeric    
7 OCOD1   ISCO-08 Occupation code - Mother                                         243   character  
8 OCOD1   ISCO-08 Occupation code - Mother                                         8189  character  
9 OCOD1   ISCO-08 Occupation code - Mother                                         9999  character 

That's it. Here is the dput of the sample data:

aux_df <- structure(list(
  EFFORT1 = structure(
    c(10, 9, 10), 
    label = "How much effort did you put into this test? (after cognitive assessment)"), 
  EFFORT2 = structure(
    c(10, 8, 10), 
    label = "How much effort would you have invested? (after cognitive assessment)"), 
  OCOD1 = structure(
    c("243", "8189", "9999"), 
    label = "ISCO-08 Occupation code - Mother"), 
  OCOD2 = structure(
    c("9412", "9999", "9999"), 
    label = "ISCO-08 Occupation code - Father"), 
  OCOD3 = structure(
    c("9999", "9999", "9999"), 
    label = "ISCO-08 Occupation code - Self"), 
  PROGN = structure(
    c("00080002", "00080001", "00080001"), 
    label = "Unique national study programme code")), 
  row.names = c(NA, -3L), 
  class = c("tbl_df", "tbl", "data.frame"))