Cleaning long-format data with discrepancies


I have the following data:

# Creating the dataframe
df <- data.frame(
  patient_id = c(1, 1, 1, 2, 2, 3, 3, 4, 5, 5, 6, 6, 6),
  dob = c("6/16/1926", "6/16/1926", "6/16/1926", "12/6/1935", "12/6/1935", "5/18/1938", "5/18/1938", "7/18/1944", "2/3/1949", "2/3/1949", "11/27/1960", "11/27/1960", "11/27/1960"),
  sex = c("Female", "Female", "Female", "Female", "Female", "Male", "Male", "Male", "Female", "Female", "Male", "Male", "Male"),
  race = c("Black or African American", NA, "White", NA, "White", "Asian", "White", "White", "Other", "White", NA, "White", "White")
)

# Displaying the dataframe
print(df)

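Some patients have discrepancies in the race column. If a patient has 2 or more entries and one of them is NA, I need to replace the NA entry with the first non-NA value. If a patient has 2 or more non-NA entries that are not identical, I need to replace all of that patient's entries with "Mixed Race". How can I do this in R with the tidyverse?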

I have tried:

# Replace NA race values with the other race value if available
df<- df%>%
  group_by(patient_id) %>%
  mutate(
    race = ifelse(
      any(!is.na(race) & race != ""), 
      ifelse(all(is.na(race) | race == ""), NA, first(na.omit(race))), 
      race)
  )

# Update the race column to "Mixed Race" only if multiple races are found for the same patient
df<- df%>%
  group_by(patient_id) %>%
  mutate(
    race = ifelse(
      n_distinct(race) > 1, 
      "Mixed Race", 
      race)
  )

The first replaces all values with "White", and the second replaces all values with "Mixed Race".

I have also tried:

# Update the table to replace NA values in the race column
patients_updated <- df%>%
  group_by(patient_id) %>%
  mutate(race = ifelse(any(!is.na(race)), first(na.omit(race)), race))

# Replace NA values in race column with corresponding non-NA race value for each patient
df<- df%>%
  group_by(patient_id) %>%
  mutate(race = ifelse(any(!is.na(race)), na.omit(race), race))

But I got the same result.

tidyverse
1 Answer

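Your approach to flagging Mixed Race will work as long as you deal with the NA values first. Otherwise a patient with one NA and one real value is counted as having two races. Fill the missing entries within each patient, then count: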

library(tidyverse)

df |>
  group_by(patient_id) |>
  fill(race, .direction = 'downup') |>
  mutate(
    race = ifelse(
      n_distinct(race) > 1, 
      "Mixed Race", 
      race)
  )
#> # A tibble: 13 × 4
#> # Groups:   patient_id [6]
#>    patient_id dob        sex    race      
#>         <dbl> <chr>      <chr>  <chr>     
#>  1          1 6/16/1926  Female Mixed Race
#>  2          1 6/16/1926  Female Mixed Race
#>  3          1 6/16/1926  Female Mixed Race
#>  4          2 12/6/1935  Female White     
#>  5          2 12/6/1935  Female White     
#>  6          3 5/18/1938  Male   Mixed Race
#>  7          3 5/18/1938  Male   Mixed Race
#>  8          4 7/18/1944  Male   White     
#>  9          5 2/3/1949   Female Mixed Race
#> 10          5 2/3/1949   Female Mixed Race
#> 11          6 11/27/1960 Male   White     
#> 12          6 11/27/1960 Male   White     
#> 13          6 11/27/1960 Male   White
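
The reason the fill has to come first can be checked directly. A minimal sketch, using a made-up vector rather than the data above, of how n_distinct() treats NA:

```r
library(dplyr)

# Hypothetical values for a single patient: one missing entry, two identical ones
race <- c(NA, "White", "White")

# By default NA is counted as a value of its own
n_distinct(race)
#> [1] 2

# With na.rm = TRUE only the non-missing values are counted
n_distinct(race, na.rm = TRUE)
#> [1] 1
```

So an alternative would be to test n_distinct(race, na.rm = TRUE) > 1 for the Mixed Race flag, although the remaining NA entries would still need to be filled for patients with a single race.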

If you are looking for one row per patient_id, follow it up with distinct():

df |>
  group_by(patient_id) |>
  fill(race, .direction = 'downup') |>
  mutate(
    race = ifelse(
      n_distinct(race) > 1, 
      "Mixed Race", 
      race)
  ) |>
  distinct()
#> # A tibble: 6 × 4
#> # Groups:   patient_id [6]
#>   patient_id dob        sex    race      
#>        <dbl> <chr>      <chr>  <chr>     
#> 1          1 6/16/1926  Female Mixed Race
#> 2          2 12/6/1935  Female White     
#> 3          3 5/18/1938  Male   Mixed Race
#> 4          4 7/18/1944  Male   White     
#> 5          5 2/3/1949   Female Mixed Race
#> 6          6 11/27/1960 Male   White
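
The attempts in the question also check for race == "", which suggests the real data may contain empty strings as well as NA. That is an assumption, since the sample data has none. If it holds, a possible variant is to convert "" to NA with na_if() before filling, and to call ungroup() at the end so that later steps are not silently run per patient. A sketch on a small hypothetical data frame (toy and cleaned are illustrative names):

```r
library(dplyr)
library(tidyr)

# Hypothetical data: patient 1 has an empty string instead of NA,
# patient 2 has two different non-missing races
toy <- data.frame(
  patient_id = c(1, 1, 2, 2),
  race = c("", "White", "Asian", "White")
)

cleaned <- toy |>
  mutate(race = na_if(race, "")) |>      # treat "" as missing
  group_by(patient_id) |>
  fill(race, .direction = "downup") |>   # fill NA within each patient
  mutate(race = if (n_distinct(race) > 1) "Mixed Race" else race) |>
  ungroup() |>                           # drop the grouping before further work
  distinct()                             # one row per patient

cleaned
```

Here patient 1 ends up as "White" and patient 2 as "Mixed Race". The plain if/else is used because the condition is a single TRUE or FALSE per patient; the ifelse() version above gives the same result on this data.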

Created on 2024-05-08 with reprex v2.0.2
