Efficiently decoding multiple one-hot encoded columns in R


I have the following data frame:

id = c(1,2,3)

where_home = c(1, 0, NA)
where_work = c(0, 1, NA)

with_alone = c(0,0,0)
with_parents = c(0,1,1)
with_colleagues = c(1,1,0)

gender_male = c(1,0,1)
gender_female = c(0,1,0)

p_affect = c(10,14,20)
n_affect = c(20,30,10)


df = data.frame(id, where_home, where_work,
                with_alone, with_parents, with_colleagues,
                gender_male, gender_female, p_affect, n_affect)

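There are 3 IDs, several groups of one-hot encoded columns (where, with, gender), and columns that are not one-hot encoded (p_affect, n_affect).

What I want is to convert the one-hot encoded columns while leaving the other columns unchanged.

I did the following: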

library(dplyr)

df_transformed <- df %>%
  rowwise() %>%
  mutate(Gender = case_when(
    gender_male == 1 ~ "Male",
    gender_female == 1 ~ "Female",
    TRUE ~ NA_character_
  ),
  Context = paste(
    ifelse(with_alone == 1, "Alone", ""),
    ifelse(with_parents == 1, "Parents", ""),
    ifelse(with_colleagues == 1, "Colleagues", ""),
    collapse = " and "
  ),
  Location = trimws(ifelse(
    where_home == 1 & where_work == 1, 
    'Home and Work', 
    paste(
      ifelse(where_home == 1, 'Home', ''),
      ifelse(where_work == 1, 'Work', '')
    )
  ))) %>%
  select(-starts_with("gender_"), -starts_with("with_"))

df_transformed <- df_transformed %>%
  select(id, Gender, Context, Location, p_affect, n_affect)

Result:

     id Gender Context               Location p_affect n_affect
  <dbl> <chr>  <chr>                 <chr>       <dbl>    <dbl>
1     1 Male   "  Colleagues"        Home           10       20
2     2 Female " Parents Colleagues" Work           14       30
3     3 Male   " Parents "           NA             20       10

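This seems to work, but there are some problems:

  • Some of the spacing in the Context column looks odd. I would prefer a cleaner format with no stray spaces and with the values separated by "and" (e.g. "Parents and Colleagues" rather than "Parents Colleagues").
  • With this approach I have to define every column and every case separately, which is tedious because the original data frame is large, with many columns and many possible values. I would like something like this: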
pseudocode:

vector_of_columns_that_are_hot_encoded = c('where', 'with', 'gender')
for column in vector_of_columns:
 # modify the hot-encoded columns and make a new data frame while keeping the columns that are not in the vector_of_columns_that_are_hot_encoded as they are
# mind that some hot-encoded columns are binary (gender), while others have multiple values. If multiple values are present, put them in the data frame using "Value 1 and Value 2 and ..."

I think there must be a simple way to do this. Since I am a beginner with dplyr, please explain the code and keep it simple if possible.

r dataframe dplyr data-manipulation one-hot-encoding
1 Answer

With your existing code, you can apply some post-processing to tidy up the format:

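library(stringr)  # provides str_squish() and str_replace_all()

df_transformed |>
  mutate(
    # trim both ends and collapse runs of spaces left by the empty strings
    Context = str_squish(Context),
    # the remaining single spaces separate labels, so turn them into " and "
    Context = str_replace_all(Context, " ", " and ")
  )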
#> # A tibble: 3 × 6
#> # Rowwise: 
#>      id Gender Context                Location p_affect n_affect
#>   <dbl> <chr>  <chr>                  <chr>       <dbl>    <dbl>
#> 1     1 Male   Colleagues             Home           10       20
#> 2     2 Female Parents and Colleagues Work           14       30
#> 3     3 Male   Parents                <NA>           20       10
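
The stray spaces in the question come from `paste()`: inside `rowwise()` each argument is a single value, so the pieces are joined with the default `sep = " "` and `collapse` has nothing left to join. For the more general request (decode every group from a vector of prefixes), here is a minimal base R sketch. It is one possible approach, not the only one. The helper name `decode_one_hot()` is hypothetical, and the sketch assumes that every encoded column is named `prefix_value`, that a row with no `1` in a group (or only `NA`) should become `NA`, and that capitalizing the first letter is the desired label format.

```r
# Example data from the question
df <- data.frame(
  id = c(1, 2, 3),
  where_home = c(1, 0, NA), where_work = c(0, 1, NA),
  with_alone = c(0, 0, 0), with_parents = c(0, 1, 1),
  with_colleagues = c(1, 1, 0),
  gender_male = c(1, 0, 1), gender_female = c(0, 1, 0),
  p_affect = c(10, 14, 20), n_affect = c(20, 30, 10)
)

# Hypothetical helper: decode one group of one-hot columns sharing a prefix.
# Assumes columns are named "<prefix>_<value>".
decode_one_hot <- function(data, prefix, sep = " and ") {
  pattern <- paste0("^", prefix, "_")
  cols <- grep(pattern, names(data), value = TRUE)
  # Turn "with_parents" into the label "Parents"
  labels <- sub(pattern, "", cols)
  labels <- paste0(toupper(substring(labels, 1, 1)), substring(labels, 2))
  m <- as.matrix(data[cols])
  vapply(seq_len(nrow(m)), function(i) {
    row <- m[i, ]
    hits <- labels[!is.na(row) & row == 1]
    # Assumption: no active column (or all NA) is reported as NA
    if (length(hits) == 0) NA_character_ else paste(hits, collapse = sep)
  }, character(1))
}

prefixes <- c("where", "with", "gender")

# One decoded character column per prefix
decoded <- lapply(prefixes, function(p) decode_one_hot(df, p))
names(decoded) <- prefixes

# Keep every column that does not belong to an encoded group
is_encoded <- grepl(paste0("^(", paste(prefixes, collapse = "|"), ")_"), names(df))
result <- cbind(df[!is_encoded], as.data.frame(decoded, stringsAsFactors = FALSE))
result
```

Adding another encoded group then only means adding its prefix to `prefixes`. Base R is used here to keep the sketch free of extra dependencies; the same idea could also be expressed with tidyr by pivoting the encoded columns to long form.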