I have a fairly complex data frame structure:
ID = c(1,2,3)
Sessions = c("2023-11-14 19:01:39+01:00", "2023-11-14 20:01:39+01:00", "2023-11-14 21:01:39+01:00")
P_affect = c(10,20,30)
N_affect = c(15,30,40)
NMeals = c(0,1,2)
Meal1_Where_Home = c(NA, 1, 0)
Meal1_Where_Restaurant = c(NA, 0, 1)
Meal1_Who_Alone = c(NA, 1, 0)
Meal1_Who_Friends = c(NA, 0 , 1 )
Meal1_Type_Big_Meal = c(NA, 1, 1)
Meal1_Type_Small_Meal = c(NA, 0, 0)
Meal2_Where_Home = c(NA, NA, 1)
Meal2_Where_Restaurant = c(NA, NA, 0)
Meal2_Who_Alone = c(NA, NA, 1)
Meal2_Who_Friends = c(NA, NA , 0 )
Meal2_Type_Big_Meal = c(NA, NA, 1)
Meal2_Type_Small_Meal = c(NA, NA, 0)
Meal3_Where_Home = c(NA, NA, NA)
Meal3_Where_Restaurant = c(NA, NA, NA)
Meal3_Who_Alone = c(NA, NA, NA)
Meal3_Who_Friends = c(NA, NA , NA )
Meal3_Type_Big_Meal = c(NA, NA, NA)
Meal3_Type_Small_Meal = c(NA, NA, NA)
# Create a data frame
df1 <- data.frame(ID, Sessions, P_affect, N_affect, NMeals, Meal1_Where_Home, Meal1_Where_Restaurant,
Meal1_Who_Alone, Meal1_Who_Friends, Meal1_Type_Big_Meal, Meal1_Type_Small_Meal,
Meal2_Where_Home, Meal2_Where_Restaurant, Meal2_Who_Alone, Meal2_Who_Friends,
Meal2_Type_Big_Meal, Meal2_Type_Small_Meal, Meal3_Where_Home, Meal3_Where_Restaurant,
Meal3_Who_Alone, Meal3_Who_Friends, Meal3_Type_Big_Meal, Meal3_Type_Small_Meal)
df2 <- data.frame(
`ID` = c(1,2,3),
`Context_Family` = c(0,1,0),
`Context_Friends` = c(1,1,0),
`Context_Spouse` = c(0,1,0),
`Context_Alone` = c(0,0,1),
`Disposition_Stress` = c(0,1,0),
`Disposition_Melancholic` = c(1,1,0),
Stress = c(20,24,35)
)
df = merge(df1,df2, by = 'ID')
What I want is basically two steps:
Desired output:
ID | Sessions | P_affect | N_affect | NMeals | MealNumber | MealObs | MealValue | Context | Disposition
1 | 2023-11-14 19:01:39 | 10 | 15 | 0 | Meal1 | Where | NA | Friends | Melancholic
1 | 2023-11-14 19:01:39 | 10 | 15 | 0 | Meal1 | Who | NA | Friends | Melancholic
I tried step 1:
df_modified = df %>%
  pivot_longer(cols = starts_with("Context"), names_to = "Context", names_prefix = "Context_") %>%
  filter(value == 1) %>%
  select(-value)
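But this does not work very well, and I would also like a method that only asks for the column names and converts all of the one-hot encoded columns at once, rather than one by one. For the long format: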
data_long <- df %>%
  pivot_longer(cols = starts_with("Meal"),
               names_to = c("Meal Number", "Value"),
               names_sep = "_",
               values_to = "value")
This works, but only on a dataset without one-hot encoded values. I added a larger data frame just to check that the code works for all cases.
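For illustration, here is a minimal sketch of how the one-hot Meal columns might be reshaped into the MealNumber / MealObs / MealValue layout of the desired output. The toy data frame meals, the helper columns level and flag, and the regular expression are assumptions made for this sketch. It also assumes that at most one level is flagged per ID, meal and observation type; if several are flagged, only the first is kept.

```r
library(dplyr)
library(tidyr)

# Toy subset of the Meal columns from df1 (only Meal1, Where and Type)
meals <- data.frame(
  ID = c(1, 2, 3),
  Meal1_Where_Home = c(NA, 1, 0),
  Meal1_Where_Restaurant = c(NA, 0, 1),
  Meal1_Type_Big_Meal = c(NA, 1, 1),
  Meal1_Type_Small_Meal = c(NA, 0, 0)
)

meals_long <- meals %>%
  # Split e.g. "Meal1_Type_Big_Meal" into "Meal1", "Type", "Big_Meal";
  # names_pattern is used because the level itself may contain "_"
  pivot_longer(starts_with("Meal"),
               names_to = c("MealNumber", "MealObs", "level"),
               names_pattern = "^(Meal\\d+)_([^_]+)_(.*)$",
               values_to = "flag") %>%
  group_by(ID, MealNumber, MealObs) %>%
  # Keep the flagged level, or NA when nothing is flagged (e.g. no meal recorded)
  summarise(
    MealValue = if (any(flag == 1, na.rm = TRUE)) {
      level[which(flag == 1)][1]
    } else {
      NA_character_
    },
    .groups = "drop"
  )

meals_long
```

This gives one row per ID, meal and observation type, with MealValue left as NA where every flag is 0 or missing.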
Your df dataset is not tidy, because there are multiple values per variable/observation. For ID == 2, Context is "Family", "Friends" and "Spouse". You will end up with list-columns:
library(dplyr)
library(tidyr)    # pivot_longer(), pivot_wider()
library(stringr)  # str_extract()
# --------
df_modified <- select(df, ID, starts_with("Context"))
df_modified
ID Context_Family Context_Friends Context_Spouse Context_Alone
1 1 0 1 0 0
2 2 1 1 1 0
3 3 0 0 0 1
Pivot to long format:
df_modified <- pivot_longer(df_modified, -ID, names_to = "column", values_to = "value")
# A tibble: 12 × 3
ID column value
<dbl> <chr> <dbl>
1 1 Context_Family 0
2 1 Context_Friends 1
3 1 Context_Spouse 0
4 1 Context_Alone 0
5 2 Context_Family 1
6 2 Context_Friends 1
7 2 Context_Spouse 1
8 2 Context_Alone 0
9 3 Context_Family 0
10 3 Context_Friends 0
11 3 Context_Spouse 0
12 3 Context_Alone 1
#
df_modified <- mutate(
df_modified,
value = if_else(value == 1, str_extract(column, "(?<=_).*$"), NA_character_),
column = str_extract(column, "^.*(?=_)"))
# A tibble: 12 × 3
ID column value
<dbl> <chr> <chr>
1 1 Context NA
2 1 Context Friends
3 1 Context NA
4 1 Context NA
5 2 Context Family
6 2 Context Friends
7 2 Context Spouse
8 2 Context NA
9 3 Context NA
10 3 Context NA
11 3 Context NA
12 3 Context Alone
#
df_modified <- filter(df_modified, !is.na(value))
# A tibble: 5 × 3
ID column value
<dbl> <chr> <chr>
1 1 Context Friends
2 2 Context Family
3 2 Context Friends
4 2 Context Spouse
5 3 Context Alone
#
df_modified <- pivot_wider(df_modified, names_from = column, values_from = value)
Warning message:
Values from `value` are not uniquely identified; output will contain list-cols.
• Use `values_fn = list` to suppress this warning.
• Use `values_fn = {summary_fun}` to summarise duplicates.
• Use the following dplyr code to identify duplicates.
{data} %>%
dplyr::group_by(ID, column) %>%
dplyr::summarise(n = dplyr::n(), .groups = "drop") %>%
dplyr::filter(n > 1L)
> df_modified
# A tibble: 3 × 2
ID Context
<dbl> <list>
1 1 <chr [1]>
2 2 <chr [3]>
3 3 <chr [1]>
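If list-columns are not wanted, one possible workaround is to collapse the multiple values into a single string per ID, and to handle every one-hot group in one call by passing only the prefixes. The sketch below is an assumption-laden illustration rather than a definitive solution: collapse_onehot() is a hypothetical helper, onehot is toy data mirroring df2, and the ", " separator is an arbitrary choice.

```r
library(dplyr)
library(tidyr)

# Toy data mirroring the Context_ and Disposition_ columns of df2
onehot <- data.frame(
  ID = c(1, 2, 3),
  Context_Family = c(0, 1, 0),
  Context_Friends = c(1, 1, 0),
  Context_Spouse = c(0, 1, 0),
  Context_Alone = c(0, 0, 1),
  Disposition_Stress = c(0, 1, 0),
  Disposition_Melancholic = c(1, 1, 0)
)

# Hypothetical helper: collapse every one-hot group named in `prefixes`
# into one character column per group
collapse_onehot <- function(data, prefixes, id = "ID", sep = ", ") {
  data %>%
    select(all_of(id), starts_with(paste0(prefixes, "_"))) %>%
    # Split "Context_Family" into group = "Context", level = "Family"
    pivot_longer(-all_of(id),
                 names_to = c("group", "level"),
                 names_pattern = "^([^_]+)_(.*)$",
                 values_to = "flag") %>%
    filter(flag == 1) %>%
    # values_fn pastes duplicates together instead of creating list-columns
    pivot_wider(id_cols = all_of(id),
                names_from = group,
                values_from = level,
                values_fn = function(x) paste(x, collapse = sep))
}

collapsed <- collapse_onehot(onehot, c("Context", "Disposition"))
collapsed
```

Here ID 2 ends up with all three of its Context values in one string, and ID 3 gets NA for Disposition because none of its Disposition flags is set. Note that an ID whose flags are all 0 in every group would be dropped by the filter step. If one row per value is preferred over a pasted string, the filtered long table from before the pivot_wider() step already has that shape.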