Using R's dplyr to group sizes written in multiple formats into small, medium, and large

Question (votes: 2, answers: 2)

Here is a sample dataset:

library(tidyverse)

df <- tibble(
  size = c("l", "L/Black", "medium", "small", "large", "L/White", "s",
           "L/White", "M", "S/Blue", "M/White", "L/Navy", "M/Navy", "S"),
  shirt = c("blue", "black", "black", "black", "white", "white", "purple",
            "white", "purple", "blue", "white", "navy", "navy", "navy")
)

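The dataset above has a size column whose base values are small, medium, and large. But it also contains other representations of those sizes, such as M, S/Blue, or s.

I want the most efficient way to reduce everything to small, medium, and large, and to get rid of the colors in the size column. For example, L/Black should become large.

I could do this with repeated gsub calls, but I wonder whether there is a more efficient approach than my first idea. My dataset is thousands of rows long, and code like the example below is tedious to write: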

df$size <- df$size %>%
 gsub("M", "medium", .) %>%
 gsub("mediumedium", "medium", .) %>%
 gsub("S", "small", .) %>%
 gsub("smallmall", "small", .) %>%
 gsub("L", "large", .) %>%
 gsub("S/Blue", "small", .) %>%
 gsub("L/Navy", "large", .) 

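This approach does not work well, because the first two gsub calls above introduce strings like smallmall and mediumedium. What is the best way to normalize everything to the three main sizes?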

r string dplyr
2 Answers

1 vote

A solution using the tidyverse.

library(tidyverse)

df2 <- df %>%
  # Remove color
  mutate(size = map2_chr(size, shirt, ~str_replace(.x, fixed(.y, ignore_case = TRUE), ""))) %>%
  # Remove /
  mutate(size = str_replace(size, fixed("/"), "")) %>%
  # Replacement
  mutate(size = case_when(
    size %in% c("l", "L")            ~ "large",
    size %in% c("m", "M")            ~ "medium",
    size %in% c("s", "S")            ~ "small",
    TRUE                             ~ size
  ))
df2
# # A tibble: 14 x 2
#    size   shirt 
#    <chr>  <chr> 
#  1 large  blue  
#  2 large  black 
#  3 medium black 
#  4 small  black 
#  5 large  white 
#  6 large  white 
#  7 small  purple
#  8 large  white 
#  9 medium purple
# 10 small  blue  
# 11 medium white 
# 12 large  navy  
# 13 medium navy  
# 14 small  navy 
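
The color-removal step above assumes the color embedded in size always matches the shirt column. If that might not hold in the real data, one possible variant is to drop everything from the slash onward instead. The following is only a sketch: the helper name normalize_size is hypothetical, and returning NA for unrecognized sizes is an assumption, not something the question asks for.

```r
library(dplyr)
library(stringr)

# Hypothetical helper: map any size spelling to "small", "medium", or "large".
normalize_size <- function(x) {
  # Drop a trailing "/Color" part (if any), then lowercase what is left
  base <- str_to_lower(str_remove(x, "/.*$"))
  case_when(
    base %in% c("l", "large")  ~ "large",
    base %in% c("m", "medium") ~ "medium",
    base %in% c("s", "small")  ~ "small",
    # Assumption: anything unrecognized becomes NA so it is easy to spot
    TRUE                       ~ NA_character_
  )
}

normalize_size(c("l", "L/Black", "medium", "S/Blue", "XL"))
```

It could then be applied with df %>% mutate(size = normalize_size(size)), without referring to the shirt column at all.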

1 vote

library("tidyverse")

df %>%
  # Extract the alphanum substring at the start of "size"
  extract(size, "size2", regex = "^(\\w*)", remove = FALSE) %>%
  # All lowercase in case there are sizes like "Small"
  # And then recode as required.
  # Here "l" = "large" means take all occurrences of "l" and
  # recode them as "large", etc.
  mutate(size3 = recode(tolower(size2),
                        "l" = "large",
                        "m" = "medium",
                        "s" = "small"))
# # A tibble: 14 x 4
#   size    size2  shirt  size3
#   <chr>   <chr>  <chr>  <chr>
# 1 l       l      blue   large
# 2 L/Black L      black  large
# 3 medium  medium black  medium
# 4 small   small  black  small
# 5 large   large  white  large
# # (remaining 9 rows omitted)

Of course, you don't need three size columns. I used different column names so that it is clear what each transformation does.
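
As a rough sketch of the single-column version, the same two steps can be folded into one mutate that overwrites size in place. This swaps tidyr's extract for stringr's str_extract, and the name df_clean is only illustrative; it assumes every size value starts with its size token, as in the sample data.

```r
library(dplyr)
library(stringr)

# Sample data from the question
df <- tibble(
  size = c("l", "L/Black", "medium", "small", "large", "L/White", "s",
           "L/White", "M", "S/Blue", "M/White", "L/Navy", "M/Navy", "S"),
  shirt = c("blue", "black", "black", "black", "white", "white", "purple",
            "white", "purple", "blue", "white", "navy", "navy", "navy")
)

df_clean <- df %>%
  # Take the leading run of word characters, lowercase it, then recode
  # the one-letter abbreviations; values with no match are left unchanged.
  mutate(size = recode(tolower(str_extract(size, "^\\w+")),
                       "l" = "large",
                       "m" = "medium",
                       "s" = "small"))

df_clean
```

The result keeps only the original two columns, with size holding the normalized value.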
